Making all nbtree entries unique by having heap TIDs participate in comparisons

Started by Peter Geoghegan over 7 years ago, 124 messages
1 attachment(s)

I've been thinking about using heap TID as a tie-breaker when
comparing B-Tree index tuples for a while now [1]. I'd like to make
all tuples at the leaf level unique, as assumed by L&Y. This can
enable "retail index tuple deletion", which I think we'll probably end
up implementing in some form or another, possibly as part of the zheap
project. It's also possible that this work will facilitate GIN-style
deduplication based on run length encoding of TIDs, or storing
versioned heap TIDs in an out-of-line nbtree-versioning structure
(unique indexes only). I can see many possibilities, but we have to
start somewhere.

I attach an unfinished prototype of suffix truncation, that also
sometimes *adds* a new attribute in pivot tuples. It adds an extra
heap TID from the leaf level when truncating away non-distinguishing
attributes during a leaf page split, though only when it must. The
patch also has nbtree treat heap TID as a first class part of the key
space of the index. Claudio wrote a patch that did something similar,
though without the suffix truncation part [2] (I haven't studied his
patch, to be honest). My patch is actually a very indirect spin-off of
Anastasia's covering index patch, and I want to show what I have in
mind now, while it's still swapped into my head. I won't do any
serious work on this project unless and until I see a way to implement
retail index tuple deletion, which seems like a multi-year project
that requires the buy-in of multiple senior community members. On its
own, my patch regresses performance unacceptably in some workloads,
probably due to interactions with kill_prior_tuple()/LP_DEAD hint
setting, and interactions with page space management when there are
many "duplicates" (it can still help performance in some pgbench
workloads with non-unique indexes, though).
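
To make the tie-breaker concrete, here is a toy standalone comparison
written from scratch for illustration (HypoTid, HypoTuple, and hypo_cmp
are made-up names, not patch code). The user key is compared first, and
heap TID acts as an implicit last attribute, with a truncated TID in a
pivot tuple sorting as minus infinity:

#include <stdio.h>
#include <stdint.h>

/* Made-up stand-ins for the block/offset pair in an ItemPointer */
typedef struct HypoTid
{
	uint32_t	block;
	uint16_t	offset;
} HypoTid;

typedef struct HypoTuple
{
	int			key;		/* single integer key attribute */
	int			has_tid;	/* 0 if TID was truncated away (pivot) */
	HypoTid		tid;		/* heap TID: implicit last key attribute */
} HypoTuple;

static int
hypo_tid_cmp(HypoTid a, HypoTid b)
{
	if (a.block != b.block)
		return a.block < b.block ? -1 : 1;
	if (a.offset != b.offset)
		return a.offset < b.offset ? -1 : 1;
	return 0;
}

/*
 * Compare a new/scan tuple (assumed to always carry a heap TID here)
 * against an existing index tuple: user key first, then heap TID as
 * tie-breaker.  A truncated TID behaves as minus infinity, so anything
 * with a real TID compares as greater than such a pivot.
 */
static int
hypo_cmp(HypoTuple scan, HypoTuple indextup)
{
	if (scan.key != indextup.key)
		return scan.key < indextup.key ? -1 : 1;
	if (!indextup.has_tid)
		return 1;
	return hypo_tid_cmp(scan.tid, indextup.tid);
}

int
main(void)
{
	HypoTuple	pivot = {42, 0, {0, 0}};	/* high key, TID truncated */
	HypoTuple	leaf = {42, 1, {7, 3}};		/* leaf tuple, equal key */

	printf("%d\n", hypo_cmp(leaf, pivot));	/* prints 1 */
	return 0;
}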

Note that the approach to suffix truncation that I've taken isn't even
my preferred approach [3] -- it's a medium-term solution that enables
making a heap TID attribute part of the key space, which enables
everything else. Cheap incremental/retail tuple deletion is the real
prize here; don't lose sight of that when looking through my patch. If
we're going to teach nbtree to truncate this new implicit heap TID
attribute, which seems essential, then we might as well teach nbtree
to do suffix truncation of other (user-visible) attributes while we're
at it. This patch isn't a particularly effective implementation of
suffix truncation, because that's not what I'm truly interested in
improving here (plus I haven't even bothered to optimize the logic for
picking a split point in light of suffix truncation).

amcheck
=======

This patch adds amcheck coverage, which seems like essential
infrastructure for developing a feature such as this. Extensive
amcheck coverage gave me confidence in my general approach. The basic
idea, invariant-wise, is to treat truncated attributes (often
including a truncated heap TID attribute in internal pages) as "minus
infinity" attributes, which participate in comparisons if and only if
we reach such attributes before the end of the scan key (a smaller
keysz for the index scan could prevent this). I've generalized the
minus infinity concept that _bt_compare() has always considered as a
special case, extending it to individual attributes. It's actually
possible to remove that old hard-coded _bt_compare() logic with this
patch applied without breaking anything, since we can rely on the
comparison of an explicitly 0-attribute tuple working the same way
(pg_upgrade'd databases will break if we do this, however, so I didn't
go that far).
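
A rough standalone sketch of that comparison rule (illustrative only;
integer arrays stand in for real index attributes, and sketch_compare is
not a PostgreSQL function):

#include <stdio.h>

/*
 * Compare a scan key (keysz attributes) against a possibly
 * suffix-truncated index tuple (ntupatts attributes), following the
 * _bt_compare() convention of returning <0, 0, or >0.  Truncated tuple
 * attributes act as minus infinity; scan key attributes beyond ntupatts
 * simply don't participate.
 */
static int
sketch_compare(const int *scankey, int keysz,
			   const int *tupatts, int ntupatts)
{
	int			ncmp = (keysz < ntupatts) ? keysz : ntupatts;

	for (int i = 0; i < ncmp; i++)
	{
		if (scankey[i] != tupatts[i])
			return (scankey[i] < tupatts[i]) ? -1 : 1;
	}

	/*
	 * Every attribute present on both sides was equal.  If the tuple is
	 * the shorter one, its missing attributes are minus infinity and the
	 * scan key is greater; a shorter scan key just stops participating,
	 * so the result is "equal".
	 */
	if (keysz > ntupatts)
		return 1;
	return 0;
}

int
main(void)
{
	int			scankey[] = {10, 20};
	int			pivot[] = {10};		/* pivot truncated to one attribute */

	printf("%d\n", sketch_compare(scankey, 2, pivot, 1));	/* prints 1 */
	printf("%d\n", sketch_compare(scankey, 1, pivot, 1));	/* prints 0 */
	return 0;
}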

Note that I didn't change the logic that has _bt_binsrch() treat
internal pages in a special way when tuples compare as equal. We still
need that logic for cases where keysz is less than the number of
indexed columns. It's only possible to avoid this _bt_binsrch() thing
for internal pages when all attributes, including heap TID, were
specified and compared (an insertion scan key has to have an entry for
every indexed column, including even heap TID). Doing better there
doesn't seem worth the trouble of teaching _bt_compare() to tell the
_bt_binsrch() caller about this as a special case. That means that we
still move left on equality in some cases where it isn't strictly
necessary, contrary to L&Y. However, amcheck verifies that the classic
"Ki < v <= Ki+1" invariant holds (as opposed to "Ki <= v <= Ki+1")
when verifying parent/child relationships, which demonstrates that I
have restored the classic invariant (I just don't find it worthwhile
to take advantage of it within _bt_binsrch() just yet).

Most of this work was done while I was an employee of VMware, though I
joined Crunchy Data on Monday and cleaned it up a bit more since then.
I'm excited about joining Crunchy, but I should also acknowledge
VMware's strong support of my work.

[1]: https://wiki.postgresql.org/wiki/Key_normalization#Making_all_items_in_the_index_unique_by_treating_heap_TID_as_an_implicit_last_attribute
[2]: /messages/by-id/CAGTBQpZ-kTRQiAa13xG1GNe461YOwrA-s-ycCQPtyFrpKTaDBQ@mail.gmail.com
[3]: https://wiki.postgresql.org/wiki/Key_normalization#Suffix_truncation_of_normalized_keys
--
Peter Geoghegan

Attachments:

0001-Ensure-nbtree-leaf-tuple-keys-are-always-unique.patch (application/octet-stream)
From 0a121d7348c3834b2f20b250f764342f36249b7a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH] Ensure nbtree leaf tuple keys are always unique.

Make comparisons of nbtree index tuples consider heap TID as a
tie-breaker attribute.  Add a separate heap TID attribute to pivot
tuples to make heap TID a first class part of the key space on all
levels of the tree.

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so this patch also adds suffix
truncation of pivot tuples.  This will usually truncate away the "extra"
heap TID attribute from pivot tuples during a leaf page split, and may
also truncate away additional user attributes.  This can increase
fan-out when there are multiple indexed attributes, though that is only
a secondary goal.

This is a proof of concept patch, which is probably only useful as part
of some much larger effort to add cheap retail index tuple deletion.  It
has several significant issues, which include:

* It fails to deal with on-disk compatibility/pg_upgrade.  It also
slightly reduces the maximum amount of space usable for an index tuple,
in order to reserve room for a possible heap TID in a pivot tuple.
(This reduction in the maximum tuple size may ultimately be deemed
acceptable, and in any case seems impossible to avoid.)

* It regresses performance with some workloads to an extent that's
clearly not acceptable.

* It creates a lot of changes in the regression tests that involve
dependency management of objects/CASCADE.  This is arguably a
preexisting problem, since similar effects are visible when the
regression tests are run with ignore_system_indexes=off [1].

[1] https://postgr.es/m/CAH2-Wz=wAKwhv0PqEBFuK2_s8E60kZRMzDdyLi=-MvcM_pHN_w@mail.gmail.com
---
 contrib/amcheck/verify_nbtree.c                    | 219 +++++++++++++++------
 contrib/earthdistance/expected/earthdistance.out   |   2 +-
 contrib/file_fdw/output/file_fdw.source            |  10 +-
 src/backend/access/nbtree/README                   |  70 ++++---
 src/backend/access/nbtree/nbtinsert.c              | 133 ++++++++-----
 src/backend/access/nbtree/nbtpage.c                |   6 +-
 src/backend/access/nbtree/nbtsearch.c              |  96 +++++++--
 src/backend/access/nbtree/nbtsort.c                |  53 +++--
 src/backend/access/nbtree/nbtutils.c               | 155 ++++++++++++---
 src/backend/access/nbtree/nbtxlog.c                |   3 +
 src/backend/utils/sort/tuplesort.c                 |   7 +-
 src/include/access/nbtree.h                        |  71 +++++--
 .../test_extensions/expected/test_extensions.out   |   6 +-
 src/test/regress/expected/aggregates.out           |   4 +-
 src/test/regress/expected/alter_table.out          |  20 +-
 src/test/regress/expected/collate.out              |   2 +-
 src/test/regress/expected/create_type.out          |   8 +-
 src/test/regress/expected/create_view.out          |   2 +-
 src/test/regress/expected/dependency.out           |   4 +-
 src/test/regress/expected/domain.out               |   4 +-
 src/test/regress/expected/event_trigger.out        |  75 ++++---
 src/test/regress/expected/foreign_data.out         |   8 +-
 src/test/regress/expected/foreign_key.out          |   2 +-
 src/test/regress/expected/indexing.out             |  12 +-
 src/test/regress/expected/inherit.out              |  16 +-
 src/test/regress/expected/matview.out              |  18 +-
 src/test/regress/expected/rowsecurity.out          |   4 +-
 src/test/regress/expected/select_into.out          |   4 +-
 src/test/regress/expected/triggers.out             |  16 +-
 src/test/regress/expected/truncate.out             |   4 +-
 src/test/regress/expected/typed_table.out          |  12 +-
 src/test/regress/expected/updatable_views.out      |  28 +--
 src/test/regress/output/tablespace.source          |   8 +-
 src/test/regress/sql/updatable_views.sql           |   2 +
 34 files changed, 731 insertions(+), 353 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index a1438a2855..2358bfa94d 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -45,6 +45,13 @@ PG_MODULE_MAGIC;
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
 
+/*
+ * Convenience macro to get number of key attributes in tuple in low-context
+ * fashion
+ */
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
+
 /*
  * State associated with verifying a B-Tree index
  *
@@ -125,25 +132,27 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
+static IndexTuple bt_right_page_check_tuple(BtreeCheckState *state);
 static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+				  ScanKey targetkey, ItemPointer scantid, int tupnkeyatts);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
 						  bool tupleIsAlive, void *checkstate);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
+static inline bool invariant_l_offset(BtreeCheckState *state,
+					 int tupnkeyatts, ScanKey key, ItemPointer scantid,
+					 OffsetNumber upperbound);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 int tupnkeyatts, ScanKey key, ItemPointer scantid,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
+static inline bool invariant_g_offset(BtreeCheckState *state,
+					 int tupnkeyatts, ScanKey key, ItemPointer scantid,
 					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
-							   OffsetNumber upperbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							   Page other, int tupnkeyatts, ScanKey key,
+							   ItemPointer scantid, OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
 
 /*
@@ -834,8 +843,10 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		int			tupnkeyatts;
+		ScanKey		skey;
+		ItemPointer scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -902,8 +913,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
-		/* Build insertion scankey for current page offset */
+		/*
+		 * Build insertion scankey for current page offset/tuple.
+		 *
+		 * As required by _bt_mkscankey(), track number of key attributes,
+		 * which is needed so that _bt_compare() calls handle truncated
+		 * attributes correctly.  Never count non-key attributes in
+		 * non-truncated tuples as key attributes, though.
+		 */
+		tupnkeyatts = BTreeTupleGetNKeyAtts(itup, state->rel);
 		skey = _bt_mkscankey(state->rel, itup);
+		scantid = BTreeTupleGetHeapTID(itup);
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -930,7 +950,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 * and probably not markedly more effective in practice.
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!invariant_leq_offset(state, tupnkeyatts, skey, scantid, P_HIKEY))
 		{
 			char	   *itid,
 					   *htid;
@@ -956,11 +976,11 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, tupnkeyatts, skey, scantid,
+								OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1017,16 +1037,28 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
+			IndexTuple	righttup;
 			ScanKey		rightkey;
+			int			righttupnkeyatts;
+			ItemPointer rightscantid;
 
 			/* Get item in next/right page */
-			rightkey = bt_right_page_check_scankey(state);
+			righttup = bt_right_page_check_tuple(state);
 
-			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+			/* Set up right item scankey */
+			if (righttup)
+			{
+				righttupnkeyatts = BTreeTupleGetNKeyAtts(righttup, state->rel);
+				rightkey = _bt_mkscankey(state->rel, righttup);
+				rightscantid = BTreeTupleGetHeapTID(righttup);
+			}
+
+			if (righttup &&
+				!invariant_g_offset(state, righttupnkeyatts, rightkey,
+									rightscantid, max))
 			{
 				/*
-				 * As explained at length in bt_right_page_check_scankey(),
+				 * As explained at length in bt_right_page_check_tuple(),
 				 * there is a known !readonly race that could account for
 				 * apparent violation of invariant, which we must check for
 				 * before actually proceeding with raising error.  Our canary
@@ -1069,7 +1101,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, childblock, skey, scantid, tupnkeyatts);
 		}
 	}
 
@@ -1083,9 +1115,9 @@ bt_target_page_check(BtreeCheckState *state)
 }
 
 /*
- * Return a scankey for an item on page to right of current target (or the
+ * Return an index tuple for an item on page to right of current target (or the
  * first non-ignorable page), sufficient to check ordering invariant on last
- * item in current target page.  Returned scankey relies on local memory
+ * item in current target page.  Returned tuple relies on local memory
  * allocated for the child page, which caller cannot pfree().  Caller's memory
  * context should be reset between calls here.
  *
@@ -1098,8 +1130,8 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
-bt_right_page_check_scankey(BtreeCheckState *state)
+static IndexTuple
+bt_right_page_check_tuple(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
@@ -1287,11 +1319,10 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	}
 
 	/*
-	 * Return first real item scankey.  Note that this relies on right page
-	 * memory remaining allocated.
+	 * Return first real item.  Note that this relies on right page memory
+	 * remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	return (IndexTuple) PageGetItem(rightpage, rightitem);
 }
 
 /*
@@ -1305,7 +1336,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  */
 static void
 bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+				  ScanKey targetkey, ItemPointer scantid, int tupnkeyatts)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1354,7 +1385,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1404,14 +1435,14 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		/*
 		 * Skip comparison of target page key against "negative infinity"
 		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * bound, but that's only because of the hard-coding for negative
+		 * infinity items within _bt_compare().
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_l_nontarget_offset(state, child, tupnkeyatts, targetkey,
+										  scantid, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1751,6 +1782,51 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+				   ItemPointer scantid, OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  upperbound);
+
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as meaning
+	 * that they're not participating in a search, not as negative infinity
+	 * (only tuples within the index are treated as negative infinity).
+	 * Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+
+		/* Get heap TID for item to the right */
+		rheaptid = BTreeTupleGetHeapTID(ritup);
+
+		if (uppnkeyatts == tupnkeyatts)
+			return scantid == NULL && rheaptid != NULL;
+
+		return tupnkeyatts < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1759,57 +1835,90 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
-					 OffsetNumber upperbound)
+invariant_leq_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+					 ItemPointer scantid, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  upperbound);
 
 	return cmp <= 0;
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+				   ItemPointer scantid, OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	/*
+	 * No need to consider possibility that scankey has attributes that we need
+	 * to force to be interpreted as negative infinity, since scan key has to
+	 * be strictly greater than lower bound offset.
+	 */
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  lowerbound);
 
-	return cmp >= 0;
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e.  the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, Page nontarget,
+							 int tupnkeyatts, ScanKey key, ItemPointer scantid,
+							 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, nontarget,
+					  upperbound);
 
-	return cmp <= 0;
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as meaning
+	 * that they're not participating in a search, not as negative infinity
+	 * (only tuples within the index are treated as negative infinity).
+	 * Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+
+		/* Get heap TID for item from child/non-target */
+		childheaptid = BTreeTupleGetHeapTID(child);
+
+		if (uppnkeyatts == tupnkeyatts)
+			return scantid == NULL && childheaptid != NULL;
+
+		return tupnkeyatts < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
diff --git a/contrib/earthdistance/expected/earthdistance.out b/contrib/earthdistance/expected/earthdistance.out
index 36a5a7bf6b..0ce34da728 100644
--- a/contrib/earthdistance/expected/earthdistance.out
+++ b/contrib/earthdistance/expected/earthdistance.out
@@ -968,7 +968,7 @@ SELECT abs(cube_distance(ll_to_earth(-30,-90), '(0)'::cube) / earth() - 1) <
 
 drop extension cube;  -- fail, earthdistance requires it
 ERROR:  cannot drop extension cube because other objects depend on it
-DETAIL:  extension earthdistance depends on extension cube
+DETAIL:  extension earthdistance depends on function cube_out(cube)
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 drop extension earthdistance;
 drop type cube;  -- fail, extension cube requires it
diff --git a/contrib/file_fdw/output/file_fdw.source b/contrib/file_fdw/output/file_fdw.source
index 853c9f9b28..42bf16ba70 100644
--- a/contrib/file_fdw/output/file_fdw.source
+++ b/contrib/file_fdw/output/file_fdw.source
@@ -432,10 +432,10 @@ RESET ROLE;
 DROP EXTENSION file_fdw CASCADE;
 NOTICE:  drop cascades to 7 other objects
 DETAIL:  drop cascades to server file_server
-drop cascades to user mapping for regress_file_fdw_superuser on server file_server
-drop cascades to user mapping for regress_no_priv_user on server file_server
-drop cascades to foreign table agg_text
-drop cascades to foreign table agg_csv
-drop cascades to foreign table agg_bad
 drop cascades to foreign table text_csv
+drop cascades to foreign table agg_bad
+drop cascades to foreign table agg_csv
+drop cascades to foreign table agg_text
+drop cascades to user mapping for regress_no_priv_user on server file_server
+drop cascades to user mapping for regress_file_fdw_superuser on server file_server
 DROP ROLE regress_file_fdw_superuser, regress_file_fdw_user, regress_no_priv_user;
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89..0782f0129c 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -46,18 +46,15 @@ the real "key" at all, just at the link field.)  We can distinguish
 items at the leaf level in the same way, by examining their links to
 heap tuples; we'd never have two items for the same heap tuple.
 
-Lehman and Yao assume that the key range for a subtree S is described
+Lehman and Yao require that the key range for a subtree S is described
 by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
+page.  A search that finds exact equality to a bounding key in an upper
+tree level must descend to the left of that key to ensure it finds any
+equal keys.  An insertion that sees the high key of its target page is
+equal to the key to be inserted cannot move right, since the downlink
+for the right sibling in the parent must always be strictly less than
+right sibling keys (this is always possible because the leftmost
+downlink on any non-leaf level is always a negative infinity downlink).
 
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
@@ -610,21 +607,25 @@ scanned to decide whether to return the entry and whether the scan can
 stop (see _bt_checkkeys()).
 
 We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+to heap tuples, that are used only for tree navigation.  Pivot tuples
+include all tuples on non-leaf pages and high keys on leaf pages.  Note
+that pivot index tuples are only used to represent which part of the key
+space belongs on each page, and can have attribute values copied from
+non-pivot tuples that were deleted and killed by VACUUM some time ago.
+
+We truncate away attributes that are not needed for a page high key during
+a leaf page split, provided that the remaining attributes distinguish the
+last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page.  A truncated tuple logically retains the truncated
+suffix key attributes, which implicitly have "negative infinity" as their
+value.  This optimization is called suffix truncation.  Since the high key
+is subsequently reused as the downlink in the parent page for the new
+right page, suffix truncation can increase index fan-out considerably by
+keeping pivot tuples short.  INCLUDE indexes are guaranteed to have
+non-key attributes truncated at the time of a leaf page split, but may
+also have some key attributes truncated away, based on the usual criteria
+for key attributes.
 
 Notes About Data Representation
 -------------------------------
@@ -658,4 +659,19 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
+
+Non-leaf pages only truly need to truncate their first item to zero
+attributes at the leftmost level, since that truly is negative infinity.
+All other negative infinity items are only really negative infinity
+within the subtree that the page is at the root of (or is a leftmost
+page within).  We truncate away all attributes of the first item on
+non-leaf pages just the same, to save a little space.  If we ever
+avoided zero-truncating items on pages where that doesn't accurately
+represent the absolute separation of the keyspace, we'd be left with
+"low key" items on internal pages -- a key value that can be used as a
+lower bound on items on the page, much like the high key is an upper
+bound. (Actually, that would even be true of "true" negative infinity
+items.  One can think of rightmost pages as implicitly containing
+"positive infinity" high keys.)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 907cce0724..4c4f7d8835 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -180,7 +180,7 @@ top:
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
+				_bt_compare(rel, indnkeyatts, itup_scankey, &itup->t_tid, page,
 							P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
@@ -216,9 +216,12 @@ top:
 
 	if (!fastpath)
 	{
+		ItemPointer scantid =
+			(checkUnique != UNIQUE_CHECK_NO ? NULL : &itup->t_tid);
+
 		/* find the first page containing this key */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, indnkeyatts, itup_scankey, scantid, false,
+						   &buf, BT_WRITE, NULL);
 
 		/* trade in our read lock for a write lock */
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
@@ -231,8 +234,8 @@ top:
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		buf = _bt_moveright(rel, buf, indnkeyatts, itup_scankey, false,
-							true, stack, BT_WRITE, NULL);
+		buf = _bt_moveright(rel, buf, indnkeyatts, itup_scankey, scantid,
+							false, true, stack, BT_WRITE, NULL);
 	}
 
 	/*
@@ -261,7 +264,8 @@ top:
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
+		/* Find position while excluding heap TID attribute */
+		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, NULL, false);
 		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
 								 checkUnique, &is_unique, &speculativeToken);
 
@@ -285,6 +289,25 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/*
+		 * Be careful to not get confused about user attribute position and
+		 * insertion position.
+		 *
+		 * XXX: This is ugly as sin, and clearly needs a lot more work.  While
+		 * not having this code does not seem to affect regression tests, we
+		 * almost certainly need to do something here for the case where
+		 * _bt_check_unique() traverses many pages, each filled with logical
+		 * duplicates.
+		 */
+		buf = _bt_moveright(rel, buf, indnkeyatts, itup_scankey, &itup->t_tid,
+							false, true, stack, BT_WRITE, NULL);
+		/*
+		 * Always invalidate hint
+		 *
+		 * FIXME: This is unacceptable.
+		 */
+		offset = InvalidOffsetNumber;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
@@ -564,11 +587,11 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
-			/* If scankey == hikey we gotta check the next page too */
+			/* If scankey <= hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			/* _bt_isequal()'s special NULL semantics not required here */
+			if (_bt_compare(rel, indnkeyatts, itup_scankey, NULL, page, P_HIKEY) > 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -700,6 +723,18 @@ _bt_findinsertloc(Relation rel,
 	 * pages).  Currently the probability of moving right is set at 0.99,
 	 * which may seem too high to change the behavior much, but it does an
 	 * excellent job of preventing O(N^2) behavior with many equal keys.
+	 *
+	 * TODO: Support this old approach for pre-pg_upgrade indexes.
+	 *
+	 * None of this applies when all items in the tree are unique, since the
+	 * new item cannot go on either page if it's equal to the high key.  The
+	 * original L&Y invariant that we now follow is that high keys must be
+	 * less than or equal to all items on the page, and strictly less than
+	 * the right sibling items (since the high key also becomes the downlink
+	 * to the right sibling in parent after a page split).  It's very
+	 * unlikely that it will be equal anyway, since there will be explicit
+	 * heap TIDs in pivot tuples in the event of many duplicates, but it can
+	 * happen when heap TID recycling takes place.
 	 *----------
 	 */
 	movedright = false;
@@ -731,8 +766,7 @@ _bt_findinsertloc(Relation rel,
 		 * nope, so check conditions (b) and (c) enumerated above
 		 */
 		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+			_bt_compare(rel, keysz, scankey, &newtup->t_tid, page, P_HIKEY) <= 0)
 			break;
 
 		/*
@@ -792,7 +826,7 @@ _bt_findinsertloc(Relation rel,
 	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
 		newitemoff = firstlegaloff;
 	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, &newtup->t_tid, false);
 
 	*bufptr = buf;
 	*offsetptr = newitemoff;
@@ -851,11 +885,12 @@ _bt_insertonpg(Relation rel,
 	/* child buffer must be given iff inserting on an internal page */
 	Assert(P_ISLEAF(lpageop) == !BufferIsValid(cbuf));
 	/* tuple must have appropriate number of attributes */
+	Assert(BTreeTupleGetNAtts(itup, rel) > 0);
 	Assert(!P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) ==
 		   IndexRelationGetNumberOfAttributes(rel));
 	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
+		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/* The caller should've finished any incomplete splits already. */
@@ -1143,8 +1178,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	OffsetNumber i;
 	bool		isleaf;
 	IndexTuple	lefthikey;
-	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/* Acquire a new page to split into */
 	rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1214,7 +1247,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <=
+			   IndexRelationGetNumberOfKeyAttributes(rel));
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1247,25 +1282,35 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
-	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * Truncate attributes of the high key item before inserting it on the left
+	 * page.  This can only happen at the leaf level, since in general all
+	 * pivot tuple values originate from leaf level high keys.  This isn't just
+	 * about avoiding unnecessary work, though; truncating unneeded key suffix
+	 * attributes can only be performed at the leaf level anyway.  This is
+	 * because a pivot tuple in a grandparent page must guide a search not only
+	 * to the correct parent page, but also to the correct leaf page.
+	 *
+	 * Note that non-key (INCLUDE) attributes are always truncated away here.
+	 * Additional key attributes are truncated away when they're not required
+	 * to correctly separate the key space.
+	 *
+	 * TODO: Give a little weight to how large the final downlink will be when
+	 * deciding on a split point.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (isleaf)
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		OffsetNumber	lastleftoffnum = OffsetNumberPrev(firstright);
+
+		lefthikey = _bt_suffix_truncate(rel, origpage, lastleftoffnum, item);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <=
+		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1487,22 +1532,11 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log left page.  We must also log the left page's high key. */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
+		loglhikey = true;
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -2210,7 +2244,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2322,8 +2357,8 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
@@ -2337,12 +2372,6 @@ _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
 	for (i = 1; i <= keysz; i++)
 	{
 		AttrNumber	attno;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index a24e64156a..25b24b1d66 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1415,8 +1415,10 @@ _bt_pagedel(Relation rel, Buffer buf)
 				itup_scankey = _bt_mkscankey(rel, targetkey);
 				/* find the leftmost leaf page containing this key */
 				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
+								   BTreeTupleGetNAtts(targetkey, rel),
+								   itup_scankey,
+								   BTreeTupleGetHeapTID(targetkey), false,
+								   &lbuf, BT_READ, NULL);
 				/* don't need a pin on the page */
 				_bt_relbuf(rel, lbuf);
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 0bcfa10b86..8b70f34f73 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -94,8 +94,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * any incomplete splits encountered during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, int keysz, ScanKey scankey, ItemPointer scantid,
+		   bool nextkey, Buffer *bufP, int access, Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 
@@ -130,7 +130,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, scantid, nextkey,
 							  (access == BT_WRITE), stack_in,
 							  BT_READ, snapshot);
 
@@ -144,7 +144,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, scantid, nextkey);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -215,6 +215,7 @@ _bt_moveright(Relation rel,
 			  Buffer buf,
 			  int keysz,
 			  ScanKey scankey,
+			  ItemPointer scantid,
 			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
@@ -275,7 +276,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, scantid, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -324,6 +325,7 @@ _bt_binsrch(Relation rel,
 			Buffer buf,
 			int keysz,
 			ScanKey scankey,
+			ItemPointer scantid,
 			bool nextkey)
 {
 	Page		page;
@@ -371,7 +373,7 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, keysz, scankey, scantid, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -428,24 +430,36 @@ int32
 _bt_compare(Relation rel,
 			int keysz,
 			ScanKey scankey,
+			ItemPointer scantid,
 			Page page,
 			OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	ItemPointer  heapTid;
 	IndexTuple	itup;
+	int			ntupatts;
+	int			ncmpkey;
 	int			i;
 
+	Assert(keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 	Assert(_bt_check_natts(rel, page, offnum));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
 	 * --- see NOTE above.
+	 *
+	 * A minus infinity key has all attributes truncated away, so this test is
+	 * redundant with the minus infinity attribute tie-breaker.  However, the
+	 * number of attributes in minus infinity tuples was not explicitly
+	 * represented as 0 until PostgreSQL v11, so an explicit offnum test is
+	 * still required.
 	 */
 	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -459,7 +473,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	ncmpkey = Min(ntupatts, keysz);
+	for (i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -510,8 +525,67 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * Use the number of attributes as a tie-breaker, in order to treat
+	 * truncated attributes in index as minus infinity.
+	 */
+	if (keysz > ntupatts)
+		return 1;
+
+	/* If caller provided no heap TID tie-breaker for scan, they're equal */
+	if (!scantid)
+		return 0;
+
+	/*
+	 * Although it isn't counted as an attribute by BTreeTupleGetNAtts(), heap
+	 * TID is an implicit final key attribute that ensures that all index
+	 * tuples have a distinct set of key attribute values.
+	 *
+	 * This is often truncated away in pivot tuples, which makes the attribute
+	 * value implicitly negative infinity.
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (!heapTid)
+		return 1;
+	return ItemPointerCompare(scantid, heapTid);
+}
+
+/*
+ * Return how many attributes to leave when truncating.
+ *
+ * This only considers key attributes, since non-key attributes should always
+ * be truncated away.  We only need attributes up to and including the first
+ * distinguishing attribute.
+ *
+ * This can return a number of attributes that is one greater than the number
+ * of key attributes actually found in the first right tuple.  This indicates
+ * that the caller must use the leftmost heap TID as a unique-ifier in its new
+ * high key tuple.
+ */
+int
+_bt_leave_natts(Relation rel, Page leftpage, OffsetNumber lastleftoffnum,
+				IndexTuple firstright)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			leavenatts;
+	ScanKey		skey;
+
+	skey = _bt_mkscankey(rel, firstright);
+
+	/*
+	 * Even test nkeyatts (untruncated) case, since caller cares about whether
+	 * or not it can avoid appending a heap TID as a unique-ifier
+	 */
+	for (leavenatts = 1; leavenatts <= nkeyatts; leavenatts++)
+	{
+		if (_bt_compare(rel, leavenatts, skey, NULL, leftpage, lastleftoffnum) > 0)
+			break;
+	}
+
+	/* Can't leak memory here */
+	_bt_freeskey(skey);
+
+	return leavenatts;
 }
 
 /*
@@ -1027,7 +1101,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
+	stack = _bt_search(rel, keysCount, scankeys, NULL, nextkey, &buf, BT_READ,
 					   scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
@@ -1057,7 +1131,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, NULL, nextkey);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index e012df596e..07324d471d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -805,8 +805,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -889,17 +887,17 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
-			IndexTuple	truncated;
-			Size		truncsz;
+			OffsetNumber	lastleftoffnum = OffsetNumberPrev(last_off);
+			IndexTuple		truncated;
+			Size			truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
-			 * in internal pages are either negative infinity items, or get
-			 * their contents from copying from one level down.  See also:
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks in
+			 * internal pages are either negative infinity items, or get their
+			 * contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
 			 * Since the truncated tuple is probably smaller than the
@@ -913,8 +911,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * only shift the line pointer array back and forth, and overwrite
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
+			 *
+			 * TODO: Give a little weight to how large the final downlink will
+			 * be when deciding on a split point.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			truncated = _bt_suffix_truncate(wstate->index, opage,
+											lastleftoffnum, oitup);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -933,8 +935,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -979,7 +982,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1038,8 +1041,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1136,6 +1140,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1143,7 +1149,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1160,6 +1165,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all keys
+				 * in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index acb944357a..b9f9883bdd 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -56,27 +56,34 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().
+ *		The result is intended for use with _bt_compare().  If itup has
+ *		undergone suffix truncation of key attributes, caller had better
+ *		pass BTreeTupleGetNAtts(itup, rel) as keysz to routines like
+ *		_bt_search() and _bt_compare() when using returned scan key.  This
+ *		allows truncated attributes to participate in comparisons (truncated
+ *		attributes have implicit negative infinity values).  Note that
+ *		_bt_compare() never treats a scan key as containing negative
+ *		infinity attributes.
  */
 ScanKey
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
 	ScanKey		skey;
 	TupleDesc	itupdesc;
+	int			tupnatts;
 	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts > 0);
+	Assert(tupnatts <= indnatts);
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
@@ -96,7 +103,21 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Truncated key attributes may not be represented in the index tuple
+		 * due to suffix truncation.  Keys built from truncated attributes
+		 * are defensively represented as NULL values, though they should
+		 * still not be allowed to participate in comparisons (caller must
+		 * be sure to pass a sane keysz to _bt_compare()).
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -2083,38 +2104,116 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_suffix_truncate() -- create tuple without unneeded suffix attributes.
  *
  * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * attributes copied from caller's itup argument.  If rel is an INCLUDE index,
+ * non-key attributes are always truncated away, since they're not part of the
+ * key space, and are not used in pivot tuples.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  This is possible when there are
+ * attributes after an already distinct pair of attributes.
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Truncated tuple is guaranteed to be no larger than the original plus space
+ * for an extra heap TID tie-breaker attribute, which is important for staying
+ * under the 1/3 of a page restriction on tuple size.
  *
  * Note that returned tuple's t_tid offset will hold the number of attributes
  * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * should only change truncated tuple's downlink.  Note also that truncated key
+ * attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_suffix_truncate(Relation rel, Page leftpage, OffsetNumber lastleftoffnum,
+					IndexTuple firstright)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc		itupdesc = RelationGetDescr(rel);
+	int16			natts = IndexRelationGetNumberOfAttributes(rel);
+	int16			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int				leavenatts;
+	IndexTuple		pivot;
+	ItemId			lastleftitem;
+	IndexTuple		lastlefttuple;
+	Size			newsize;
+
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples, which must have non-key
+	 * attributes in the case of INCLUDE indexes.  It's never okay to truncate
+	 * a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
+
+	/* Determine how many attributes must be left behind */
+	leavenatts = _bt_leave_natts(rel, leftpage, lastleftoffnum, firstright);
+
+	if (leavenatts <= natts)
+	{
+		IndexTuple		tidpivot;
+
+		/*
+		 * Truncate away non-key attributes and/or key attributes.  Do a
+		 * straight copy in the case where the only attribute to be "truncated
+		 * away" is the implicit heap TID key attribute (i.e. the case where we
+		 * can at least avoid adding an explicit heap TID attribute to new
+		 * pivot).
+		 */
+		if (leavenatts < natts)
+			pivot = index_truncate_tuple(itupdesc, firstright, leavenatts);
+		else
+			pivot = CopyIndexTuple(firstright);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+		/*
+		 * If there is a distinguishing key attribute within leavenatts, there
+		 * is no need to add an explicit heap TID attribute to new pivot.
+		 */
+		if (leavenatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, leavenatts);
+			return pivot;
+		}
 
-	return truncated;
+		/*
+		 * Only non-key attributes could be truncated away.  They are not
+		 * considered part of the key space, so it's still necessary to add a
+		 * heap TID attribute to the new pivot tuple.  Create an enlarged copy
+		 * of the truncated right tuple, to fit the heap TID.
+		 */
+		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible.  Create an enlarged copy of the first
+		 * right tuple, to fit the heap TID.
+		 */
+		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * We must use the heap TID as a unique-ifier in the new pivot tuple,
+	 * since no user key attributes could be truncated away.  The heap TID
+	 * must come from the last tuple on the left page, since the new downlink
+	 * must be a strict lower bound on the new right page.
+	 */
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+	/* Copy last left item's heap TID into new pivot tuple */
+	lastleftitem = PageGetItemId(leftpage, lastleftoffnum);
+	lastlefttuple = (IndexTuple) PageGetItem(leftpage, lastleftitem);
+	memcpy((char *) pivot + newsize - MAXALIGN(sizeof(ItemPointerData)),
+		   &lastlefttuple->t_tid, sizeof(ItemPointerData));
+	/* Tuple has all key attributes */
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetHeapTID(pivot);
+	return pivot;
 }
 
 /*
@@ -2137,6 +2236,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2156,6 +2256,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
@@ -2165,7 +2266,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2176,7 +2277,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			Assert(!P_RIGHTMOST(opaque));
 
 			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2209,7 +2310,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Tuple contains only key attributes despite on is it page high
 			 * key or not
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 
 	}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 67a94cb80a..f1d286c1ba 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -252,6 +252,9 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	 * When the high key isn't present is the wal record, then we assume it to
 	 * be equal to the first key on the right page.  It must be from the leaf
 	 * level.
+	 *
+	 * FIXME:  We currently always log the high key.  Is it worth trying to
+	 * salvage the case where logging isn't strictly necessary, and can be
+	 * avoided?
 	 */
 	if (!lhighkey)
 	{
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 9fb33b9035..6e8a216718 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4057,9 +4057,10 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required
+	 * for btree indexes, since heap TID is treated as an implicit last
+	 * key attribute in order to ensure that all keys in the index are
+	 * physically unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 04ecb4cbc0..f6208132b3 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -122,11 +122,21 @@ typedef struct BTMetaPageData
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_suffix_truncate() will need to enlarge
+ * a heap index tuple to make space for a tie-breaker heap TID
+ * attribute, which we account for here.
  */
-#define BTMaxItemSize(page) \
+#define BTMaxItemSizeOld(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*MAXALIGN(sizeof(ItemPointerData))) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
 
 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
@@ -204,12 +214,10 @@ typedef struct BTMetaPageData
  * real offset (downlinks only need to store a block number).  The offset
  * field only stores the number of attributes when the INDEX_ALT_TID_MASK
  * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * number of attributes).  INDEX_ALT_TID_MASK is only used for pivot tuples
+ * at present, though it's possible that it will be used within non-pivot
+ * tuples in the future.  Do not assume that a tuple with INDEX_ALT_TID_MASK
+ * set must be a pivot tuple.
  *
  * The 12 least significant offset bits are used to represent the number of
  * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
@@ -219,6 +227,8 @@ typedef struct BTMetaPageData
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+/* Reserved to indicate if heap TID is represented in pivot tuple */
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +251,15 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +268,32 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get/set implicit tie-breaker heap-TID attribute, if any.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has t_tid that
+ * points to heap.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   MAXALIGN(sizeof(ItemPointerData))) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &(itup)->t_tid \
+	)
+#define BTreeTupleSetHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -560,15 +593,17 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
  * prototypes for functions in nbtsearch.c
  */
 extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
+		   int keysz, ScanKey scankey, ItemPointer scantid, bool nextkey,
 		   Buffer *bufP, int access, Snapshot snapshot);
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
+			  ScanKey scankey, ItemPointer scantid, bool nextkey,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
 extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
+			ScanKey scankey, ItemPointer scantid, bool nextkey);
 extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+			ItemPointer scantid, Page page, OffsetNumber offnum);
+extern int _bt_leave_natts(Relation rel, Page leftpage,
+						   OffsetNumber lastleftoffnum, IndexTuple firstright);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -601,7 +636,9 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
+extern IndexTuple _bt_suffix_truncate(Relation rel, Page leftpage,
+									  OffsetNumber lastleftoffnum,
+									  IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
 
 /*
diff --git a/src/test/modules/test_extensions/expected/test_extensions.out b/src/test/modules/test_extensions/expected/test_extensions.out
index 28d86c4b87..29b4ec95c1 100644
--- a/src/test/modules/test_extensions/expected/test_extensions.out
+++ b/src/test/modules/test_extensions/expected/test_extensions.out
@@ -30,10 +30,10 @@ NOTICE:  installing required extension "test_ext_cyclic2"
 ERROR:  cyclic dependency detected between extensions "test_ext_cyclic1" and "test_ext_cyclic2"
 DROP SCHEMA test_ext CASCADE;
 NOTICE:  drop cascades to 5 other objects
-DETAIL:  drop cascades to extension test_ext3
-drop cascades to extension test_ext5
-drop cascades to extension test_ext2
+DETAIL:  drop cascades to extension test_ext5
 drop cascades to extension test_ext4
+drop cascades to extension test_ext3
+drop cascades to extension test_ext2
 drop cascades to extension test_ext1
 CREATE EXTENSION test_ext6;
 DROP EXTENSION test_ext6;
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index f85e913850..4f0089a7cf 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -942,9 +942,9 @@ select distinct min(f1), max(f1) from minmaxtest;
 
 drop table minmaxtest cascade;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table minmaxtest1
+DETAIL:  drop cascades to table minmaxtest3
 drop cascades to table minmaxtest2
-drop cascades to table minmaxtest3
+drop cascades to table minmaxtest1
 -- check for correct detection of nested-aggregate errors
 select max(min(unique1)) from tenk1;
 ERROR:  aggregate function calls cannot be nested
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 702bf9fe98..c23826f294 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -2672,19 +2672,19 @@ select alter2.plus1(41);
 -- clean up
 drop schema alter2 cascade;
 NOTICE:  drop cascades to 13 other objects
-DETAIL:  drop cascades to table alter2.t1
-drop cascades to view alter2.v1
-drop cascades to function alter2.plus1(integer)
-drop cascades to type alter2.posint
-drop cascades to operator family alter2.ctype_hash_ops for access method hash
+DETAIL:  drop cascades to text search template alter2.tmpl
+drop cascades to text search dictionary alter2.dict
+drop cascades to text search parser alter2.prs
+drop cascades to text search configuration alter2.cfg
+drop cascades to conversion alter2.ascii_to_utf8
 drop cascades to type alter2.ctype
 drop cascades to function alter2.same(alter2.ctype,alter2.ctype)
 drop cascades to operator alter2.=(alter2.ctype,alter2.ctype)
-drop cascades to conversion alter2.ascii_to_utf8
-drop cascades to text search parser alter2.prs
-drop cascades to text search configuration alter2.cfg
-drop cascades to text search template alter2.tmpl
-drop cascades to text search dictionary alter2.dict
+drop cascades to operator family alter2.ctype_hash_ops for access method hash
+drop cascades to type alter2.posint
+drop cascades to function alter2.plus1(integer)
+drop cascades to table alter2.t1
+drop cascades to view alter2.v1
 --
 -- composite types
 --
diff --git a/src/test/regress/expected/collate.out b/src/test/regress/expected/collate.out
index fcbe3a5cc8..95c73357a4 100644
--- a/src/test/regress/expected/collate.out
+++ b/src/test/regress/expected/collate.out
@@ -668,4 +668,4 @@ SELECT collation for ((SELECT b FROM collate_test1 LIMIT 1));
 --
 \set VERBOSITY terse
 DROP SCHEMA collate_tests CASCADE;
-NOTICE:  drop cascades to 17 other objects
+NOTICE:  drop cascades to 20 other objects
diff --git a/src/test/regress/expected/create_type.out b/src/test/regress/expected/create_type.out
index 2f7d5f94d7..8309756030 100644
--- a/src/test/regress/expected/create_type.out
+++ b/src/test/regress/expected/create_type.out
@@ -161,13 +161,13 @@ DROP FUNCTION base_fn_out(opaque); -- error
 ERROR:  function base_fn_out(opaque) does not exist
 DROP TYPE base_type; -- error
 ERROR:  cannot drop type base_type because other objects depend on it
-DETAIL:  function base_fn_out(base_type) depends on type base_type
-function base_fn_in(cstring) depends on type base_type
+DETAIL:  function base_fn_in(cstring) depends on type base_type
+function base_fn_out(base_type) depends on type base_type
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 DROP TYPE base_type CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to function base_fn_out(base_type)
-drop cascades to function base_fn_in(cstring)
+DETAIL:  drop cascades to function base_fn_in(cstring)
+drop cascades to function base_fn_out(base_type)
 -- Check usage of typmod with a user-defined type
 -- (we have borrowed numeric's typmod functions)
 CREATE TEMP TABLE mytab (foo widget(42,13,7));     -- should fail
diff --git a/src/test/regress/expected/create_view.out b/src/test/regress/expected/create_view.out
index 141fc6da62..8abcd7b3d9 100644
--- a/src/test/regress/expected/create_view.out
+++ b/src/test/regress/expected/create_view.out
@@ -1711,4 +1711,4 @@ select pg_get_ruledef(oid, true) from pg_rewrite
 DROP SCHEMA temp_view_test CASCADE;
 NOTICE:  drop cascades to 27 other objects
 DROP SCHEMA testviewschm2 CASCADE;
-NOTICE:  drop cascades to 62 other objects
+NOTICE:  drop cascades to 63 other objects
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8e50f8ffbb..8d31110b87 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 -- doesn't work: grant still exists
 DROP USER regress_dep_user1;
 ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
-DETAIL:  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
+DETAIL:  privileges for table deptest1
 privileges for database regression
-privileges for table deptest1
+owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
 DROP OWNED BY regress_dep_user1;
 DROP USER regress_dep_user1;
 \set VERBOSITY terse
diff --git a/src/test/regress/expected/domain.out b/src/test/regress/expected/domain.out
index 0b5a9041b0..49fb047d99 100644
--- a/src/test/regress/expected/domain.out
+++ b/src/test/regress/expected/domain.out
@@ -645,8 +645,8 @@ alter domain dnotnulltest drop not null;
 update domnotnull set col1 = null;
 drop domain dnotnulltest cascade;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to column col1 of table domnotnull
-drop cascades to column col2 of table domnotnull
+DETAIL:  drop cascades to column col2 of table domnotnull
+drop cascades to column col1 of table domnotnull
 -- Test ALTER DOMAIN .. DEFAULT ..
 create table domdeftest (col1 ddef1);
 insert into domdeftest default values;
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index 008e859d4c..17514e6198 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -152,9 +152,9 @@ ERROR:  event trigger "regress_event_trigger" does not exist
 -- should fail, regress_evt_user owns some objects
 drop role regress_evt_user;
 ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
-DETAIL:  owner of event trigger regress_event_trigger3
+DETAIL:  owner of user mapping for regress_evt_user on server useless_server
 owner of default privileges on new relations belonging to role regress_evt_user
-owner of user mapping for regress_evt_user on server useless_server
+owner of event trigger regress_event_trigger3
 -- cleanup before next test
 -- these are all OK; the second one should emit a NOTICE
 drop event trigger if exists regress_event_trigger2;
@@ -241,14 +241,13 @@ CREATE EVENT TRIGGER regress_event_trigger_drop_objects ON sql_drop
 ALTER TABLE schema_one.table_one DROP COLUMN a;
 DROP SCHEMA schema_one, schema_two CASCADE;
 NOTICE:  drop cascades to 7 other objects
-DETAIL:  drop cascades to table schema_two.table_two
-drop cascades to table schema_two.table_three
-drop cascades to function schema_two.add(integer,integer)
+DETAIL:  drop cascades to function schema_two.add(integer,integer)
 drop cascades to function schema_two.newton(integer)
-drop cascades to table schema_one.table_one
-drop cascades to table schema_one."table two"
+drop cascades to table schema_two.table_three
+drop cascades to table schema_two.table_two
 drop cascades to table schema_one.table_three
-NOTICE:  table "schema_two_table_two" does not exist, skipping
+drop cascades to table schema_one."table two"
+drop cascades to table schema_one.table_one
 NOTICE:  table "audit_tbls_schema_two_table_three" does not exist, skipping
 ERROR:  object audit_tbls.schema_two_table_three of type table cannot be dropped
 CONTEXT:  PL/pgSQL function undroppable() line 14 at RAISE
@@ -257,61 +256,61 @@ PL/pgSQL function test_evtrig_dropped_objects() line 8 at EXECUTE
 DELETE FROM undroppable_objs WHERE object_identity = 'audit_tbls.schema_two_table_three';
 DROP SCHEMA schema_one, schema_two CASCADE;
 NOTICE:  drop cascades to 7 other objects
-DETAIL:  drop cascades to table schema_two.table_two
-drop cascades to table schema_two.table_three
-drop cascades to function schema_two.add(integer,integer)
+DETAIL:  drop cascades to function schema_two.add(integer,integer)
 drop cascades to function schema_two.newton(integer)
-drop cascades to table schema_one.table_one
-drop cascades to table schema_one."table two"
+drop cascades to table schema_two.table_three
+drop cascades to table schema_two.table_two
 drop cascades to table schema_one.table_three
-NOTICE:  table "schema_two_table_two" does not exist, skipping
+drop cascades to table schema_one."table two"
+drop cascades to table schema_one.table_one
 NOTICE:  table "audit_tbls_schema_two_table_three" does not exist, skipping
-NOTICE:  table "schema_one_table_one" does not exist, skipping
-NOTICE:  table "schema_one_table two" does not exist, skipping
+NOTICE:  table "schema_two_table_two" does not exist, skipping
 NOTICE:  table "schema_one_table_three" does not exist, skipping
+NOTICE:  table "schema_one_table two" does not exist, skipping
+NOTICE:  table "schema_one_table_one" does not exist, skipping
 ERROR:  object schema_one.table_three of type table cannot be dropped
 CONTEXT:  PL/pgSQL function undroppable() line 14 at RAISE
 DELETE FROM undroppable_objs WHERE object_identity = 'schema_one.table_three';
 DROP SCHEMA schema_one, schema_two CASCADE;
 NOTICE:  drop cascades to 7 other objects
-DETAIL:  drop cascades to table schema_two.table_two
-drop cascades to table schema_two.table_three
-drop cascades to function schema_two.add(integer,integer)
+DETAIL:  drop cascades to function schema_two.add(integer,integer)
 drop cascades to function schema_two.newton(integer)
-drop cascades to table schema_one.table_one
-drop cascades to table schema_one."table two"
+drop cascades to table schema_two.table_three
+drop cascades to table schema_two.table_two
 drop cascades to table schema_one.table_three
-NOTICE:  table "schema_two_table_two" does not exist, skipping
+drop cascades to table schema_one."table two"
+drop cascades to table schema_one.table_one
 NOTICE:  table "audit_tbls_schema_two_table_three" does not exist, skipping
-NOTICE:  table "schema_one_table_one" does not exist, skipping
-NOTICE:  table "schema_one_table two" does not exist, skipping
+NOTICE:  table "schema_two_table_two" does not exist, skipping
 NOTICE:  table "schema_one_table_three" does not exist, skipping
+NOTICE:  table "schema_one_table two" does not exist, skipping
+NOTICE:  table "schema_one_table_one" does not exist, skipping
 SELECT * FROM dropped_objects WHERE schema IS NULL OR schema <> 'pg_toast';
      type     |   schema   |               object                
 --------------+------------+-------------------------------------
  table column | schema_one | schema_one.table_one.a
  schema       |            | schema_two
- table        | schema_two | schema_two.table_two
- type         | schema_two | schema_two.table_two
- type         | schema_two | schema_two.table_two[]
+ function     | schema_two | schema_two.add(integer,integer)
+ aggregate    | schema_two | schema_two.newton(integer)
  table        | audit_tbls | audit_tbls.schema_two_table_three
  type         | audit_tbls | audit_tbls.schema_two_table_three
  type         | audit_tbls | audit_tbls.schema_two_table_three[]
  table        | schema_two | schema_two.table_three
  type         | schema_two | schema_two.table_three
  type         | schema_two | schema_two.table_three[]
- function     | schema_two | schema_two.add(integer,integer)
- aggregate    | schema_two | schema_two.newton(integer)
+ table        | schema_two | schema_two.table_two
+ type         | schema_two | schema_two.table_two
+ type         | schema_two | schema_two.table_two[]
  schema       |            | schema_one
- table        | schema_one | schema_one.table_one
- type         | schema_one | schema_one.table_one
- type         | schema_one | schema_one.table_one[]
- table        | schema_one | schema_one."table two"
- type         | schema_one | schema_one."table two"
- type         | schema_one | schema_one."table two"[]
  table        | schema_one | schema_one.table_three
  type         | schema_one | schema_one.table_three
  type         | schema_one | schema_one.table_three[]
+ table        | schema_one | schema_one."table two"
+ type         | schema_one | schema_one."table two"
+ type         | schema_one | schema_one."table two"[]
+ table        | schema_one | schema_one.table_one
+ type         | schema_one | schema_one.table_one
+ type         | schema_one | schema_one.table_one[]
 (23 rows)
 
 DROP OWNED BY regress_evt_user;
@@ -360,13 +359,13 @@ DROP INDEX evttrig.one_idx;
 NOTICE:  NORMAL: orig=t normal=f istemp=f type=index identity=evttrig.one_idx name={evttrig,one_idx} args={}
 DROP SCHEMA evttrig CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to table evttrig.one
-drop cascades to table evttrig.two
+DETAIL:  drop cascades to table evttrig.two
+drop cascades to table evttrig.one
 NOTICE:  NORMAL: orig=t normal=f istemp=f type=schema identity=evttrig name={evttrig} args={}
+NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.two name={evttrig,two} args={}
 NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.one name={evttrig,one} args={}
 NOTICE:  NORMAL: orig=f normal=t istemp=f type=sequence identity=evttrig.one_col_a_seq name={evttrig,one_col_a_seq} args={}
 NOTICE:  NORMAL: orig=f normal=t istemp=f type=default value identity=for evttrig.one.col_a name={evttrig,one,col_a} args={}
-NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.two name={evttrig,two} args={}
 DROP TABLE a_temp_tbl;
 NOTICE:  NORMAL: orig=t normal=f istemp=t type=table identity=pg_temp.a_temp_tbl name={pg_temp,a_temp_tbl} args={}
 DROP EVENT TRIGGER regress_event_trigger_report_dropped;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 339a43ff9e..b5c78016dc 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
 RESET ROLE;
 DROP ROLE regress_test_indirect;                            -- ERROR
 ERROR:  role "regress_test_indirect" cannot be dropped because some objects depend on it
-DETAIL:  owner of server s1
-privileges for foreign-data wrapper foo
+DETAIL:  privileges for foreign-data wrapper foo
+owner of server s1
 \des+
                                                                                  List of foreign servers
  Name |           Owner           | Foreign-data wrapper |                   Access privileges                   |  Type  | Version |             FDW options              | Description 
@@ -2049,9 +2049,9 @@ DROP TABLE fd_pt2;
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 ERROR:  role "regress_test_role" cannot be dropped because some objects depend on it
-DETAIL:  privileges for server s4
+DETAIL:  owner of user mapping for regress_test_role on server s6
 privileges for foreign-data wrapper foo
-owner of user mapping for regress_test_role on server s6
+privileges for server s4
 DROP SERVER t1 CASCADE;
 NOTICE:  drop cascades to user mapping for public on server t1
 DROP USER MAPPING FOR regress_test_role SERVER s6;
diff --git a/src/test/regress/expected/foreign_key.out b/src/test/regress/expected/foreign_key.out
index b90c4926e2..368cc451fc 100644
--- a/src/test/regress/expected/foreign_key.out
+++ b/src/test/regress/expected/foreign_key.out
@@ -255,7 +255,7 @@ SELECT * FROM FKTABLE;
 -- this should fail for lack of CASCADE
 DROP TABLE PKTABLE;
 ERROR:  cannot drop table pktable because other objects depend on it
-DETAIL:  constraint constrname2 on table fktable depends on table pktable
+DETAIL:  constraint constrname2 on table fktable depends on index pktable_pkey
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 DROP TABLE PKTABLE CASCADE;
 NOTICE:  drop cascades to constraint constrname2 on table fktable
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 2c2bf44aa8..b3bddc089d 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -101,8 +101,8 @@ create table idxpart (a int) partition by range (a);
 create index on idxpart (a);
 create table idxpart1 partition of idxpart for values from (0) to (10);
 drop index idxpart1_a_idx;	-- no way
-ERROR:  cannot drop index idxpart1_a_idx because index idxpart_a_idx requires it
-HINT:  You can drop index idxpart_a_idx instead.
+ERROR:  cannot drop index idxpart1_a_idx because column a of table idxpart1 requires it
+HINT:  You can drop column a of table idxpart1 instead.
 drop index idxpart_a_idx;	-- both indexes go away
 select relname, relkind from pg_class
   where relname like 'idxpart%' order by relname;
@@ -931,11 +931,11 @@ select indrelid::regclass, indexrelid::regclass, inhparent::regclass, indisvalid
 (3 rows)
 
 drop index idxpart0_pkey;								-- fail
-ERROR:  cannot drop index idxpart0_pkey because index idxpart_pkey requires it
-HINT:  You can drop index idxpart_pkey instead.
+ERROR:  cannot drop index idxpart0_pkey because constraint idxpart0_pkey on table idxpart0 requires it
+HINT:  You can drop constraint idxpart0_pkey on table idxpart0 instead.
 drop index idxpart1_pkey;								-- fail
-ERROR:  cannot drop index idxpart1_pkey because index idxpart_pkey requires it
-HINT:  You can drop index idxpart_pkey instead.
+ERROR:  cannot drop index idxpart1_pkey because constraint idxpart1_pkey on table idxpart1 requires it
+HINT:  You can drop constraint idxpart1_pkey on table idxpart1 instead.
 alter table idxpart0 drop constraint idxpart0_pkey;		-- fail
 ERROR:  cannot drop inherited constraint "idxpart0_pkey" of relation "idxpart0"
 alter table idxpart1 drop constraint idxpart1_pkey;		-- fail
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index b2b912ed5c..392272325e 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -952,8 +952,8 @@ Inherits: c1,
 
 drop table p1 cascade;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table c1
-drop cascades to table c2
+DETAIL:  drop cascades to table c2
+drop cascades to table c1
 drop cascades to table c3
 drop table p2 cascade;
 create table pp1 (f1 int);
@@ -1098,9 +1098,9 @@ SELECT a.attrelid::regclass, a.attname, a.attinhcount, e.expected
 
 DROP TABLE inht1, inhs1 CASCADE;
 NOTICE:  drop cascades to 4 other objects
-DETAIL:  drop cascades to table inht2
+DETAIL:  drop cascades to table inht3
+drop cascades to table inht2
 drop cascades to table inhts
-drop cascades to table inht3
 drop cascades to table inht4
 -- Test non-inheritable indices [UNIQUE, EXCLUDE] constraints
 CREATE TABLE test_constraints (id int, val1 varchar, val2 int, UNIQUE(val1, val2));
@@ -1352,8 +1352,8 @@ select * from patest0 join (select f1 from int4_tbl limit 1) ss on id = f1;
 
 drop table patest0 cascade;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to table patest1
-drop cascades to table patest2
+DETAIL:  drop cascades to table patest2
+drop cascades to table patest1
 --
 -- Test merge-append plans for inheritance trees
 --
@@ -1499,9 +1499,9 @@ reset enable_seqscan;
 reset enable_parallel_append;
 drop table matest0 cascade;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table matest1
+DETAIL:  drop cascades to table matest3
 drop cascades to table matest2
-drop cascades to table matest3
+drop cascades to table matest1
 --
 -- Check that use of an index with an extraneous column doesn't produce
 -- a plan with extraneous sorting
diff --git a/src/test/regress/expected/matview.out b/src/test/regress/expected/matview.out
index 08cd4bea48..dc7878454d 100644
--- a/src/test/regress/expected/matview.out
+++ b/src/test/regress/expected/matview.out
@@ -310,15 +310,15 @@ SELECT type, m.totamt AS mtot, v.totamt AS vtot FROM mvtest_tm m LEFT JOIN mvtes
 -- make sure that dependencies are reported properly when they block the drop
 DROP TABLE mvtest_t;
 ERROR:  cannot drop table mvtest_t because other objects depend on it
-DETAIL:  view mvtest_tv depends on table mvtest_t
+DETAIL:  materialized view mvtest_tm depends on table mvtest_t
+materialized view mvtest_tmm depends on materialized view mvtest_tm
+view mvtest_tv depends on table mvtest_t
 view mvtest_tvv depends on view mvtest_tv
 materialized view mvtest_tvvm depends on view mvtest_tvv
 view mvtest_tvvmv depends on materialized view mvtest_tvvm
 materialized view mvtest_bb depends on view mvtest_tvvmv
 materialized view mvtest_mvschema.mvtest_tvm depends on view mvtest_tv
 materialized view mvtest_tvmm depends on materialized view mvtest_mvschema.mvtest_tvm
-materialized view mvtest_tm depends on table mvtest_t
-materialized view mvtest_tmm depends on materialized view mvtest_tm
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 -- make sure dependencies are dropped and reported
 -- and make sure that transactional behavior is correct on rollback
@@ -326,15 +326,15 @@ HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 BEGIN;
 DROP TABLE mvtest_t CASCADE;
 NOTICE:  drop cascades to 9 other objects
-DETAIL:  drop cascades to view mvtest_tv
+DETAIL:  drop cascades to materialized view mvtest_tm
+drop cascades to materialized view mvtest_tmm
+drop cascades to view mvtest_tv
 drop cascades to view mvtest_tvv
 drop cascades to materialized view mvtest_tvvm
 drop cascades to view mvtest_tvvmv
 drop cascades to materialized view mvtest_bb
 drop cascades to materialized view mvtest_mvschema.mvtest_tvm
 drop cascades to materialized view mvtest_tvmm
-drop cascades to materialized view mvtest_tm
-drop cascades to materialized view mvtest_tmm
 ROLLBACK;
 -- some additional tests not using base tables
 CREATE VIEW mvtest_vt1 AS SELECT 1 moo;
@@ -484,10 +484,10 @@ SELECT * FROM mvtest_mv_v_4;
 
 DROP TABLE mvtest_v CASCADE;
 NOTICE:  drop cascades to 4 other objects
-DETAIL:  drop cascades to materialized view mvtest_mv_v
-drop cascades to materialized view mvtest_mv_v_2
+DETAIL:  drop cascades to materialized view mvtest_mv_v_4
 drop cascades to materialized view mvtest_mv_v_3
-drop cascades to materialized view mvtest_mv_v_4
+drop cascades to materialized view mvtest_mv_v_2
+drop cascades to materialized view mvtest_mv_v
 -- Check that unknown literals are converted to "text" in CREATE MATVIEW,
 -- so that we don't end up with unknown-type columns.
 CREATE MATERIALIZED VIEW mv_unspecified_types AS
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index f1ae40df61..8d99213705 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3502,8 +3502,8 @@ SELECT refclassid::regclass, deptype
 SAVEPOINT q;
 DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
 ERROR:  role "regress_rls_eve" cannot be dropped because some objects depend on it
-DETAIL:  target of policy p on table tbl1
-privileges for table tbl1
+DETAIL:  privileges for table tbl1
+target of policy p on table tbl1
 ROLLBACK TO q;
 ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
 SAVEPOINT q;
diff --git a/src/test/regress/expected/select_into.out b/src/test/regress/expected/select_into.out
index 942f975e95..30e10b12d2 100644
--- a/src/test/regress/expected/select_into.out
+++ b/src/test/regress/expected/select_into.out
@@ -46,9 +46,9 @@ CREATE TABLE selinto_schema.tmp3 (a,b,c)
 RESET SESSION AUTHORIZATION;
 DROP SCHEMA selinto_schema CASCADE;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table selinto_schema.tmp1
+DETAIL:  drop cascades to table selinto_schema.tmp3
 drop cascades to table selinto_schema.tmp2
-drop cascades to table selinto_schema.tmp3
+drop cascades to table selinto_schema.tmp1
 DROP USER regress_selinto_user;
 -- Tests for WITH NO DATA and column name consistency
 CREATE TABLE ctas_base (i int, j int);
diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
index bf271d536e..ac55ac5be9 100644
--- a/src/test/regress/expected/triggers.out
+++ b/src/test/regress/expected/triggers.out
@@ -558,9 +558,9 @@ LINE 2: FOR EACH STATEMENT WHEN (OLD.* IS DISTINCT FROM NEW.*)
 -- check dependency restrictions
 ALTER TABLE main_table DROP COLUMN b;
 ERROR:  cannot drop column b of table main_table because other objects depend on it
-DETAIL:  trigger after_upd_b_row_trig on table main_table depends on column b of table main_table
+DETAIL:  trigger after_upd_b_stmt_trig on table main_table depends on column b of table main_table
 trigger after_upd_a_b_row_trig on table main_table depends on column b of table main_table
-trigger after_upd_b_stmt_trig on table main_table depends on column b of table main_table
+trigger after_upd_b_row_trig on table main_table depends on column b of table main_table
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 -- this should succeed, but we'll roll it back to keep the triggers around
 begin;
@@ -1886,14 +1886,14 @@ select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
 (4 rows)
 
 drop trigger trg1 on trigpart1;	-- fail
-ERROR:  cannot drop trigger trg1 on table trigpart1 because trigger trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart1 because table trigpart1 requires it
+HINT:  You can drop table trigpart1 instead.
 drop trigger trg1 on trigpart2;	-- fail
-ERROR:  cannot drop trigger trg1 on table trigpart2 because trigger trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart2 because table trigpart2 requires it
+HINT:  You can drop table trigpart2 instead.
 drop trigger trg1 on trigpart3;	-- fail
-ERROR:  cannot drop trigger trg1 on table trigpart3 because trigger trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart3 because table trigpart3 requires it
+HINT:  You can drop table trigpart3 instead.
 drop table trigpart2;			-- ok, trigger should be gone in that partition
 select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
   where tgrelid::regclass::text like 'trigpart%' order by tgrelid::regclass::text;
diff --git a/src/test/regress/expected/truncate.out b/src/test/regress/expected/truncate.out
index 735d0e862d..e81fa8e0c6 100644
--- a/src/test/regress/expected/truncate.out
+++ b/src/test/regress/expected/truncate.out
@@ -278,9 +278,9 @@ SELECT * FROM trunc_faa;
 ROLLBACK;
 DROP TABLE trunc_f CASCADE;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table trunc_fa
+DETAIL:  drop cascades to table trunc_fb
+drop cascades to table trunc_fa
 drop cascades to table trunc_faa
-drop cascades to table trunc_fb
 -- Test ON TRUNCATE triggers
 CREATE TABLE trunc_trigger_test (f1 int, f2 text, f3 text);
 CREATE TABLE trunc_trigger_log (tgop text, tglevel text, tgwhen text,
diff --git a/src/test/regress/expected/typed_table.out b/src/test/regress/expected/typed_table.out
index 2e47ecbcf5..5834becb07 100644
--- a/src/test/regress/expected/typed_table.out
+++ b/src/test/regress/expected/typed_table.out
@@ -77,17 +77,17 @@ CREATE TABLE persons4 OF person_type (
 ERROR:  column "name" specified more than once
 DROP TYPE person_type RESTRICT;
 ERROR:  cannot drop type person_type because other objects depend on it
-DETAIL:  table persons depends on type person_type
-function get_all_persons() depends on type person_type
+DETAIL:  table persons3 depends on type person_type
 table persons2 depends on type person_type
-table persons3 depends on type person_type
+function get_all_persons() depends on type person_type
+table persons depends on type person_type
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 DROP TYPE person_type CASCADE;
 NOTICE:  drop cascades to 4 other objects
-DETAIL:  drop cascades to table persons
-drop cascades to function get_all_persons()
+DETAIL:  drop cascades to table persons3
 drop cascades to table persons2
-drop cascades to table persons3
+drop cascades to function get_all_persons()
+drop cascades to table persons
 CREATE TABLE persons5 OF stuff; -- only CREATE TYPE AS types may be used
 ERROR:  type stuff is not a composite type
 DROP TABLE stuff;
diff --git a/src/test/regress/expected/updatable_views.out b/src/test/regress/expected/updatable_views.out
index b34bab4b29..1a7b3b4f14 100644
--- a/src/test/regress/expected/updatable_views.out
+++ b/src/test/regress/expected/updatable_views.out
@@ -328,24 +328,10 @@ UPDATE ro_view20 SET b=upper(b);
 ERROR:  cannot update view "ro_view20"
 DETAIL:  Views that return set-returning functions are not automatically updatable.
 HINT:  To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 16 other objects
-DETAIL:  drop cascades to view ro_view1
-drop cascades to view ro_view17
-drop cascades to view ro_view2
-drop cascades to view ro_view3
-drop cascades to view ro_view5
-drop cascades to view ro_view6
-drop cascades to view ro_view7
-drop cascades to view ro_view8
-drop cascades to view ro_view9
-drop cascades to view ro_view11
-drop cascades to view ro_view13
-drop cascades to view rw_view15
-drop cascades to view rw_view16
-drop cascades to view ro_view20
-drop cascades to view ro_view4
-drop cascades to view rw_view14
+\set VERBOSITY default
 DROP VIEW ro_view10, ro_view12, ro_view18;
 DROP SEQUENCE uv_seq CASCADE;
 NOTICE:  drop cascades to view ro_view19
@@ -1056,8 +1042,8 @@ SELECT * FROM base_tbl;
 RESET SESSION AUTHORIZATION;
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+DETAIL:  drop cascades to view rw_view2
+drop cascades to view rw_view1
 -- nested-view permissions
 CREATE TABLE base_tbl(a int, b text, c float);
 INSERT INTO base_tbl VALUES (1, 'Row 1', 1.0);
@@ -1442,8 +1428,8 @@ SELECT events & 4 != 0 AS upd,
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 3 other objects
 DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
 drop cascades to view rw_view3
+drop cascades to view rw_view2
 -- inheritance tests
 CREATE TABLE base_tbl_parent (a int);
 CREATE TABLE base_tbl_child (CHECK (a > 0)) INHERITS (base_tbl_parent);
@@ -1542,8 +1528,8 @@ SELECT * FROM base_tbl_child ORDER BY a;
 
 DROP TABLE base_tbl_parent, base_tbl_child CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+DETAIL:  drop cascades to view rw_view2
+drop cascades to view rw_view1
 -- simple WITH CHECK OPTION
 CREATE TABLE base_tbl (a int, b int DEFAULT 10);
 INSERT INTO base_tbl VALUES (1,2), (2,3), (1,-1);
diff --git a/src/test/regress/output/tablespace.source b/src/test/regress/output/tablespace.source
index 24435118bc..ddeaa72205 100644
--- a/src/test/regress/output/tablespace.source
+++ b/src/test/regress/output/tablespace.source
@@ -242,10 +242,10 @@ NOTICE:  no matching relations in tablespace "regress_tblspace_renamed" found
 DROP TABLESPACE regress_tblspace_renamed;
 DROP SCHEMA testschema CASCADE;
 NOTICE:  drop cascades to 5 other objects
-DETAIL:  drop cascades to table testschema.foo
-drop cascades to table testschema.asselect
-drop cascades to table testschema.asexecute
+DETAIL:  drop cascades to table testschema.tablespace_acl
 drop cascades to table testschema.atable
-drop cascades to table testschema.tablespace_acl
+drop cascades to table testschema.asexecute
+drop cascades to table testschema.asselect
+drop cascades to table testschema.foo
 DROP ROLE regress_tablespace_user1;
 DROP ROLE regress_tablespace_user2;
diff --git a/src/test/regress/sql/updatable_views.sql b/src/test/regress/sql/updatable_views.sql
index a7786b26e9..deb76546aa 100644
--- a/src/test/regress/sql/updatable_views.sql
+++ b/src/test/regress/sql/updatable_views.sql
@@ -98,7 +98,9 @@ DELETE FROM ro_view18;
 UPDATE ro_view19 SET last_value=1000;
 UPDATE ro_view20 SET b=upper(b);
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 DROP VIEW ro_view10, ro_view12, ro_view18;
 DROP SEQUENCE uv_seq CASCADE;
 
-- 
2.14.1

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#1)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Thu, Jun 14, 2018 at 2:44 PM, Peter Geoghegan <pg@bowt.ie> wrote:

I've been thinking about using heap TID as a tie-breaker when
comparing B-Tree index tuples for a while now [1]. I'd like to make
all tuples at the leaf level unique, as assumed by L&Y. This can
enable "retail index tuple deletion", which I think we'll probably end
up implementing in some form or another, possibly as part of the zheap
project. It's also possible that this work will facilitate GIN-style
deduplication based on run length encoding of TIDs, or storing
versioned heap TIDs in an out-of-line nbtree-versioning structure
(unique indexes only). I can see many possibilities, but we have to
start somewhere.

Yes, retail index deletion is essential for the delete-marking
approach that is proposed for zheap.

It could also be extremely useful in some workloads with the regular
heap. If the indexes are large -- say, 100GB -- and the number of
tuples that vacuum needs to kill is small -- say, 5 -- scanning them
all to remove the references to those tuples is really inefficient.
If we had retail index deletion, then we could make a cost-based
decision about which approach to use in a particular case.
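
To make that concrete, here is a rough standalone sketch of such a
cost-based choice (all names and the per-tuple cost constant are invented
for illustration; this is not proposed code):

#include <stdbool.h>
#include <stdint.h>

/*
 * Toy decision rule: prefer retail index tuple deletion when the number of
 * dead tuples is small relative to the index size, and fall back to the
 * traditional full index scan otherwise.  The per-deletion cost of 4 page
 * reads (roughly one index descent) is an invented placeholder, not a
 * measured value.
 */
static bool
use_retail_index_deletion(uint64_t index_pages, uint64_t ndead_tuples)
{
	uint64_t	retail_cost = ndead_tuples * 4;	/* ~one descent per tuple */
	uint64_t	bulk_cost = index_pages;		/* full scan reads every page */

	return retail_cost < bulk_cost;
}

With 100GB of indexes and only 5 dead tuples, a rule of this shape would
always pick retail deletion; with millions of dead tuples it would fall
back to the existing bulk scan.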

mind now, while it's still swapped into my head. I won't do any
serious work on this project unless and until I see a way to implement
retail index tuple deletion, which seems like a multi-year project
that requires the buy-in of multiple senior community members.

Can you enumerate some of the technical obstacles that you see?

On its
own, my patch regresses performance unacceptably in some workloads,
probably due to interactions with kill_prior_tuple()/LP_DEAD hint
setting, and interactions with page space management when there are
many "duplicates" (it can still help performance in some pgbench
workloads with non-unique indexes, though).

I think it would be helpful if you could talk more about these
regressions (and the wins).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In reply to: Robert Haas (#2)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Fri, Jun 15, 2018 at 2:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Yes, retail index deletion is essential for the delete-marking
approach that is proposed for zheap.

Makes sense.

I don't know that much about zheap. I'm sure that retail index tuple
deletion is really important in general, though. The Gray & Reuter
book treats unique keys as a basic assumption, as do other
authoritative reference works and papers. Other database systems
probably make unique indexes simply use the user-visible attributes as
unique values, but appending heap TID as a unique-ifier is probably a
reasonably common design for secondary indexes (it would also be nice
if we could simply not store duplicates for unique indexes, rather
than using heap TID). I generally have a very high opinion of the
nbtree code, but this seems like a problem that ought to be fixed.
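
For anyone skimming, the tie-breaker rule boils down to something like the
following standalone sketch (the types and names here are simplified
stand-ins, not the actual IndexTuple/ItemPointer machinery the patch
touches):

#include <stdint.h>

/* Simplified stand-ins, invented for illustration only */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

typedef struct DemoIndexKey
{
	int32_t		keyval;		/* the single user-visible key attribute */
	DemoTid		htid;		/* heap TID: the implicit last attribute */
} DemoIndexKey;

static int
demo_compare(const DemoIndexKey *a, const DemoIndexKey *b)
{
	if (a->keyval != b->keyval)
		return (a->keyval < b->keyval) ? -1 : 1;

	/* User-visible attributes are equal: heap TID breaks the tie */
	if (a->htid.block != b->htid.block)
		return (a->htid.block < b->htid.block) ? -1 : 1;
	if (a->htid.offset != b->htid.offset)
		return (a->htid.offset < b->htid.offset) ? -1 : 1;

	return 0;	/* unreachable for two distinct heap tuples */
}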

I've convinced myself that I basically have the right idea with this
patch, because the classic L&Y invariants have all been tested with an
enhanced amcheck run against all indexes in a regression test
database. There was other stress-testing, too. The remaining problems
are fixable, but I need some guidance.

It could also be extremely useful in some workloads with the regular
heap. If the indexes are large -- say, 100GB -- and the number of
tuples that vacuum needs to kill is small -- say, 5 -- scanning them
all to remove the references to those tuples is really inefficient.
If we had retail index deletion, then we could make a cost-based
decision about which approach to use in a particular case.

I remember talking to Andres about this in a bar 3 years ago. I can
imagine variations of pruning that do some amount of this when there
are lots of duplicates. Perhaps something like InnoDB's purge threads,
which do things like in-place deletes of secondary indexes after an
updating (or deleting) xact commits. I believe that that mechanism
targets secondary indexes specifically, and that it operates quite
eagerly.

Can you enumerate some of the technical obstacles that you see?

The #1 technical obstacle is that I simply don't know where I should
try to take this patch, given that it probably needs to be tied to
some much bigger project, such as zheap. I have an open mind, though,
and intend to help if I can. I'm not really sure what the #2 and #3
problems are, because I'd need to be able to see a few steps ahead to
be sure. Maybe #2 is that I'm doing something wonky to avoid breaking
duplicate checking for unique indexes. (The way that unique duplicate
checking has always worked [1] is perhaps questionable, though.)

I think it would be helpful if you could talk more about these
regressions (and the wins).

I think that the performance regressions are due to the fact that when
you have a huge number of duplicates today, it's useful to be able to
claim space to fit further duplicates from almost any of the multiple
leaf pages that contain or have contained duplicates. I'd hoped that
the increased temporal locality that the patch gets would more than
make up for that. As far as I can tell, the problem is that temporal
locality doesn't help enough. I saw that performance was somewhat
improved with extreme Zipf distribution contention, but it went the
other way with less extreme contention. The details are not that fresh
in my mind, since I shelved this patch for a while following limited
performance testing.

The code could certainly use more performance testing, and more
general polishing. I'm not strongly motivated to do that right now,
because I don't quite see a clear path to making this patch useful.
But, as I said, I have an open mind about what the next step should
be.

[1]: https://wiki.postgresql.org/wiki/Key_normalization#Avoiding_unnecessary_unique_index_enforcement
--
Peter Geoghegan

#4 Claudio Freire
klaussfreire@gmail.com
In reply to: Peter Geoghegan (#3)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Fri, Jun 15, 2018 at 8:47 PM Peter Geoghegan <pg@bowt.ie> wrote:

I think it would be helpful if you could talk more about these
regressions (and the wins).

I think that the performance regressions are due to the fact that when
you have a huge number of duplicates today, it's useful to be able to
claim space to fit further duplicates from almost any of the multiple
leaf pages that contain or have contained duplicates. I'd hoped that
the increased temporal locality that the patch gets would more than
make up for that. As far as I can tell, the problem is that temporal
locality doesn't help enough. I saw that performance was somewhat
improved with extreme Zipf distribution contention, but it went the
other way with less extreme contention. The details are not that fresh
in my mind, since I shelved this patch for a while following limited
performance testing.

The code could certainly use more performance testing, and more
general polishing. I'm not strongly motivated to do that right now,
because I don't quite see a clear path to making this patch useful.
But, as I said, I have an open mind about what the next step should
be.

Way back when I was dabbling in this kind of endeavor, my main idea to
counteract that, and possibly improve performance overall, was a
microvacuum kind of thing that would do some on-demand cleanup to
remove duplicates or make room before page splits. Since nbtree
uniqueification enables efficient retail deletions, that could end up
as a net win.

I never got around to implementing it though, and it does get tricky
if you don't want to allow unbounded latency spikes.

In reply to: Claudio Freire (#4)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire <klaussfreire@gmail.com> wrote:

Way back when I was dabbling in this kind of endeavor, my main idea to
counteract that, and possibly improve performance overall, was a
microvacuum kind of thing that would do some on-demand cleanup to
remove duplicates or make room before page splits. Since nbtree
uniqueification enables efficient retail deletions, that could end up
as a net win.

That sounds like a mechanism that works a bit like
_bt_vacuum_one_page(), which we run at the last second before a page
split. We do this to see if a page split that looks necessary can
actually be avoided.

I imagine that retail index tuple deletion (the whole point of this
project) would be run by a VACUUM-like process that kills tuples that
are dead to everyone. Even with something like zheap, you cannot just
delete index tuples until you establish that they're truly dead. I
guess that the delete marking stuff that Robert mentioned marks tuples
as dead when the deleting transaction commits. Maybe we could justify
having _bt_vacuum_one_page() do cleanup of those tuples (i.e. check if
they're visible to anyone, and if not recycle), because we at least
know that the deleting transaction committed there. That is, they
could be recently dead or dead, and it may be worth going to the extra
trouble of checking which when we know that it's one of the two
possibilities.
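
To sketch what that might look like (purely illustrative:
ItemIdIsDeleteMarked() and HeapTupleIsDeadToAll() are hypothetical
stand-ins for delete-marking infrastructure that doesn't exist today,
while the page accessors and _bt_delitems_delete() are the existing
routines):

/*
 * Sketch only: opportunistically recycle delete-marked items on a leaf
 * page that looks like it needs to split.  ItemIdIsDeleteMarked() and
 * HeapTupleIsDeadToAll() are hypothetical helpers.
 */
static void
_bt_cleanup_delete_marked(Relation rel, Relation heapRel, Buffer buf)
{
	Page		page = BufferGetPage(buf);
	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	OffsetNumber deletable[MaxOffsetNumber];
	int			ndeletable = 0;
	OffsetNumber offnum,
				maxoff = PageGetMaxOffsetNumber(page);

	for (offnum = P_FIRSTDATAKEY(opaque);
		 offnum <= maxoff;
		 offnum = OffsetNumberNext(offnum))
	{
		ItemId		itemid = PageGetItemId(page, offnum);
		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);

		/* Only items whose deleting transaction is known to have committed */
		if (!ItemIdIsDeleteMarked(itemid))	/* hypothetical */
			continue;

		/* Recently dead, or dead to everyone?  Check which of the two. */
		if (HeapTupleIsDeadToAll(heapRel, &itup->t_tid))	/* hypothetical */
			deletable[ndeletable++] = offnum;
	}

	if (ndeletable > 0)
		_bt_delitems_delete(rel, buf, deletable, ndeletable, heapRel);
}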

--
Peter Geoghegan

#6Claudio Freire
klaussfreire@gmail.com
In reply to: Peter Geoghegan (#5)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Mon, Jun 18, 2018 at 2:03 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire <klaussfreire@gmail.com> wrote:

Way back when I was dabbling in this kind of endeavor, my main idea to
counteract that, and possibly improve performance overall, was a
microvacuum kind of thing that would do some on-demand cleanup to
remove duplicates or make room before page splits. Since nbtree
uniqueification enables efficient retail deletions, that could end up
as a net win.

That sounds like a mechanism that works a bit like
_bt_vacuum_one_page(), which we run at the last second before a page
split. We do this to see if a page split that looks necessary can
actually be avoided.

I imagine that retail index tuple deletion (the whole point of this
project) would be run by a VACUUM-like process that kills tuples that
are dead to everyone. Even with something like zheap, you cannot just
delete index tuples until you establish that they're truly dead. I
guess that the delete marking stuff that Robert mentioned marks tuples
as dead when the deleting transaction commits. Maybe we could justify
having _bt_vacuum_one_page() do cleanup of those tuples (i.e. check if
they're visible to anyone, and if not recycle), because we at least
know that the deleting transaction committed there. That is, they
could be recently dead or dead, and it may be worth going to the extra
trouble of checking which when we know that it's one of the two
possibilities.

Yes, but currently _bt_vacuum_one_page() does local work on the pinned
page. Doing dead tuple deletion however involves reading the heap to
check visibility at the very least, and doing it on the whole page
might involve several heap fetches, so it's an order of magnitude
heavier if done naively.

But the idea is to do just that, only not naively.
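
As a sketch of "not naively" (the visibility helper here,
heap_tuple_is_dead_to_all(), is hypothetical): batch the candidate TIDs
collected from the page and visit the heap in block order, so that each
heap block is fetched at most once per cleanup pass.

/* Sort TIDs by (block, offset) so heap accesses are sequential */
static int
tid_cmp(const void *a, const void *b)
{
	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
}

/*
 * Sketch: given candidate heap TIDs from one leaf page, keep only those
 * that are dead to all transactions.  heap_tuple_is_dead_to_all() is a
 * hypothetical helper; the compaction keeps just the dead TIDs.
 */
static int
filter_dead_candidates(Relation heapRel, ItemPointerData *tids, int ntids)
{
	int			i;
	int			ndead = 0;

	qsort(tids, ntids, sizeof(ItemPointerData), tid_cmp);

	for (i = 0; i < ntids; i++)
	{
		if (heap_tuple_is_dead_to_all(heapRel, &tids[i]))	/* hypothetical */
			tids[ndead++] = tids[i];
	}

	return ndead;
}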

#7Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Geoghegan (#5)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Mon, Jun 18, 2018 at 10:33 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire <klaussfreire@gmail.com> wrote:

Way back when I was dabbling in this kind of endeavor, my main idea to
counteract that, and possibly improve performance overall, was a
microvacuum kind of thing that would do some on-demand cleanup to
remove duplicates or make room before page splits. Since nbtree
uniqueification enables efficient retail deletions, that could end up
as a net win.

That sounds like a mechanism that works a bit like
_bt_vacuum_one_page(), which we run at the last second before a page
split. We do this to see if a page split that looks necessary can
actually be avoided.

I imagine that retail index tuple deletion (the whole point of this
project) would be run by a VACUUM-like process that kills tuples that
are dead to everyone. Even with something like zheap, you cannot just
delete index tuples until you establish that they're truly dead. I
guess that the delete marking stuff that Robert mentioned marks tuples
as dead when the deleting transaction commits.

No, I don't think that is the case, because we want to perform in-place
updates for indexed-column updates. If we don't delete-mark the index
tuple before performing the in-place update, then we will have two tuples
in the index that point to the same heap TID.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In reply to: Amit Kapila (#7)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I imagine that retail index tuple deletion (the whole point of this
project) would be run by a VACUUM-like process that kills tuples that
are dead to everyone. Even with something like zheap, you cannot just
delete index tuples until you establish that they're truly dead. I
guess that the delete marking stuff that Robert mentioned marks tuples
as dead when the deleting transaction commits.

No, I don't think that is the case, because we want to perform in-place
updates for indexed-column updates. If we don't delete-mark the index
tuple before performing the in-place update, then we will have two tuples
in the index that point to the same heap TID.

How can an old MVCC snapshot that needs to find the heap tuple using
some now-obsolete key values get to the heap tuple via an index scan
if there are no index tuples that stick around until "recently dead"
heap tuples become "fully dead"? How can you avoid keeping around both
old and new index tuples at the same time?

--
Peter Geoghegan

#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Geoghegan (#8)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Jun 19, 2018 at 11:13 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I imagine that retail index tuple deletion (the whole point of this
project) would be run by a VACUUM-like process that kills tuples that
are dead to everyone. Even with something like zheap, you cannot just
delete index tuples until you establish that they're truly dead. I
guess that the delete marking stuff that Robert mentioned marks tuples
as dead when the deleting transaction commits.

No, I don't think that is the case, because we want to perform in-place
updates for indexed-column updates. If we don't delete-mark the index
tuple before performing the in-place update, then we will have two tuples
in the index that point to the same heap TID.

How can an old MVCC snapshot that needs to find the heap tuple using
some now-obsolete key values get to the heap tuple via an index scan
if there are no index tuples that stick around until "recently dead"
heap tuples become "fully dead"? How can you avoid keeping around both
old and new index tuples at the same time?

Both values will be present in the index, but the old value will be
delete-marked. It is correct that we can't remove the value (index
tuple) from the index until it is truly dead (not visible to anyone),
but during a delete or index-update operation, we need to traverse the
index to mark the entries as delete-marked. At this stage, I don't want
to go into too much detail about how delete-marking will happen in zheap,
and I am also not sure this thread is the right place to discuss the
details of that technology.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In reply to: Amit Kapila (#9)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Jun 19, 2018 at 8:52 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Both values will be present in the index, but the old value will be
delete-marked. It is correct that we can't remove the value (index
tuple) from the index until it is truly dead (not visible to anyone),
but during a delete or index-update operation, we need to traverse the
index to mark the entries as delete-marked. At this stage, I don't want
to go into too much detail about how delete-marking will happen in zheap,
and I am also not sure this thread is the right place to discuss the
details of that technology.

I don't understand, but okay. I can provide feedback once a design for
delete marking is available.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#1)
1 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Thu, Jun 14, 2018 at 11:44 AM, Peter Geoghegan <pg@bowt.ie> wrote:

I attach an unfinished prototype of suffix truncation, that also
sometimes *adds* a new attribute in pivot tuples. It adds an extra
heap TID from the leaf level when truncating away non-distinguishing
attributes during a leaf page split, though only when it must. The
patch also has nbtree treat heap TID as a first class part of the key
space of the index. Claudio wrote a patch that did something similar,
though without the suffix truncation part [2] (I haven't studied his
patch, to be honest). My patch is actually a very indirect spin-off of
Anastasia's covering index patch, and I want to show what I have in
mind now, while it's still swapped into my head. I won't do any
serious work on this project unless and until I see a way to implement
retail index tuple deletion, which seems like a multi-year project
that requires the buy-in of multiple senior community members. On its
own, my patch regresses performance unacceptably in some workloads,
probably due to interactions with kill_prior_tuple()/LP_DEAD hint
setting, and interactions with page space management when there are
many "duplicates" (it can still help performance in some pgbench
workloads with non-unique indexes, though).

I attach a revised version, which is still very much of prototype
quality, but manages to solve a few of the problems that v1 had.
Andrey Lepikhov (CC'd) asked me to post any improved version I might
have for use with his retail index tuple deletion patch, so I thought
I'd post what I have.

The main development for v2 is that the sort order of the implicit
heap TID attribute is flipped. In v1, it was in "ascending" order. In
v2, comparisons of heap TIDs are inverted to make the attribute order
"descending". This has a number of advantages:

* It's almost consistent with the current behavior when there are
repeated insertions of duplicates. Currently, this tends to result in
page splits of the leftmost leaf page among pages that mostly consist
of the same duplicated value. This means that the destabilizing impact
on DROP SCHEMA ... CASCADE regression test output noted before [1] is
totally eliminated. There is now only a single trivial change to
regression test "expected" files, whereas in v1 dozens of "expected"
files had to be changed, often resulting in less useful reports for
the user.

* The performance regression I observed with various pgbench workloads
seems to have gone away, or is now within the noise range. A patch
like this one requires a lot of validation and testing, so this should
be taken with a grain of salt.
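
For anyone skimming the patch, the tie-breaker itself is tiny. Restated as
a standalone function for illustration (heap_tid_tiebreak() is just an
illustrative name; the real logic is at the end of _bt_compare() in the
attached patch):

/*
 * Returns <0, 0, or >0 with the same sign conventions as _bt_compare():
 * <0 means the scan key sorts before the index tuple.  A truncated
 * (absent) heap TID in a pivot tuple acts as "minus infinity".
 */
static inline int32
heap_tid_tiebreak(ItemPointer scantid, ItemPointer tupletid)
{
	if (scantid == NULL)
		return 0;		/* caller isn't searching on heap TID */
	if (tupletid == NULL)
		return 1;		/* truncated TID: treat as minus infinity */

	/* Arguments deliberately inverted, so heap TIDs "sort DESC" */
	return ItemPointerCompare(tupletid, scantid);
}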

I may have been too quick to give up on my original ambition of
writing a stand-alone patch that can be justified entirely on its own
merits, without being tied to some much more ambitious project like
retail index tuple deletion by VACUUM, or zheap's deletion marking. I
still haven't tried to replace the kludgey handling of unique index
enforcement, even though that would probably have a measurable
additional performance benefit. I think that this patch could become
an unambiguous win.

[1]: /messages/by-id/CAH2-Wz=wAKwhv0PqEBFuK2_s8E60kZRMzDdyLi=-MvcM_pHN_w@mail.gmail.com
--
Peter Geoghegan

Attachments:

v2-0001-Ensure-nbtree-leaf-tuple-keys-are-always-unique.patch (application/octet-stream)
From fa6d50a9fea6be46bd69865760d5949ab4bf1f2f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v2] Ensure nbtree leaf tuple keys are always unique.

Make comparisons of nbtree index tuples consider heap TID as a
tie-breaker attribute.  Add a separate heap TID attribute to pivot
tuples to make heap TID a first class part of the key space on all
levels of the tree.  The heap TID attribute is sorted in DESC order,
which makes space usage across leaf pages of duplicates have
approximately the same space usage characteristics as before.

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so this patch also adds suffix
truncation of pivot tuples.  This will usually truncate away the "extra"
heap TID attribute from pivot tuples during a leaf page split, and may
also truncate away additional user attributes.  This can increase
fan-out when there are multiple indexed attributes, though that is only
a secondary goal.

This is a proof of concept patch, which is probably only useful as part
of some much larger effort to add cheap retail index tuple deletion.  It
has several significant unresolved issues, including:

* It fails to deal with on-disk compatibility/pg_upgrade.  It also
slightly reduces the maximum amount of space usable for an index tuple,
in order to reserve room for a possible heap TID in a pivot tuple.
(This reduction in the maximum tuple size may ultimately be deemed
acceptable, and in any case seems impossible to avoid.)

* It regresses performance with some workloads to an extent that's not
acceptable.  This may be improved in a future version.
---
 contrib/amcheck/verify_nbtree.c       | 219 +++++++++++++++++++++++++---------
 src/backend/access/nbtree/README      |  70 ++++++-----
 src/backend/access/nbtree/nbtinsert.c | 133 +++++++++++++--------
 src/backend/access/nbtree/nbtpage.c   |   6 +-
 src/backend/access/nbtree/nbtsearch.c |  98 +++++++++++++--
 src/backend/access/nbtree/nbtsort.c   |  55 ++++++---
 src/backend/access/nbtree/nbtutils.c  | 155 +++++++++++++++++++-----
 src/backend/access/nbtree/nbtxlog.c   |   3 +
 src/backend/utils/sort/tuplesort.c    |  13 +-
 src/include/access/nbtree.h           |  71 ++++++++---
 src/test/regress/expected/join.out    |   2 +-
 11 files changed, 611 insertions(+), 214 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index a1438a2855..2358bfa94d 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -45,6 +45,13 @@ PG_MODULE_MAGIC;
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
 
+/*
+ * Convenience macro to get number of key attributes in tuple in low-context
+ * fashion
+ */
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
+
 /*
  * State associated with verifying a B-Tree index
  *
@@ -125,25 +132,27 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
+static IndexTuple bt_right_page_check_tuple(BtreeCheckState *state);
 static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+				  ScanKey targetkey, ItemPointer scantid, int tupnkeyatts);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
 						  bool tupleIsAlive, void *checkstate);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
+static inline bool invariant_l_offset(BtreeCheckState *state,
+					 int tupnkeyatts, ScanKey key, ItemPointer scantid,
+					 OffsetNumber upperbound);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 int tupnkeyatts, ScanKey key, ItemPointer scantid,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
+static inline bool invariant_g_offset(BtreeCheckState *state,
+					 int tupnkeyatts, ScanKey key, ItemPointer scantid,
 					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
-							   OffsetNumber upperbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							   Page other, int tupnkeyatts, ScanKey key,
+							   ItemPointer scantid, OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
 
 /*
@@ -834,8 +843,10 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		int			tupnkeyatts;
+		ScanKey		skey;
+		ItemPointer scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -902,8 +913,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
-		/* Build insertion scankey for current page offset */
+		/*
+		 * Build insertion scankey for current page offset/tuple.
+		 *
+		 * As required by _bt_mkscankey(), track number of key attributes,
+		 * which is needed so that _bt_compare() calls handle truncated
+		 * attributes correctly.  Never count non-key attributes in
+		 * non-truncated tuples as key attributes, though.
+		 */
+		tupnkeyatts = BTreeTupleGetNKeyAtts(itup, state->rel);
 		skey = _bt_mkscankey(state->rel, itup);
+		scantid = BTreeTupleGetHeapTID(itup);
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -930,7 +950,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 * and probably not markedly more effective in practice.
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!invariant_leq_offset(state, tupnkeyatts, skey, scantid, P_HIKEY))
 		{
 			char	   *itid,
 					   *htid;
@@ -956,11 +976,11 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, tupnkeyatts, skey, scantid,
+								OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1017,16 +1037,28 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
+			IndexTuple	righttup;
 			ScanKey		rightkey;
+			int			righttupnkeyatts;
+			ItemPointer rightscantid;
 
 			/* Get item in next/right page */
-			rightkey = bt_right_page_check_scankey(state);
+			righttup = bt_right_page_check_tuple(state);
 
-			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+			/* Set up right item scankey */
+			if (righttup)
+			{
+				righttupnkeyatts = BTreeTupleGetNKeyAtts(righttup, state->rel);
+				rightkey = _bt_mkscankey(state->rel, righttup);
+				rightscantid = BTreeTupleGetHeapTID(righttup);
+			}
+
+			if (righttup &&
+				!invariant_g_offset(state, righttupnkeyatts, rightkey,
+									rightscantid, max))
 			{
 				/*
-				 * As explained at length in bt_right_page_check_scankey(),
+				 * As explained at length in bt_right_page_check_tuple(),
 				 * there is a known !readonly race that could account for
 				 * apparent violation of invariant, which we must check for
 				 * before actually proceeding with raising error.  Our canary
@@ -1069,7 +1101,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, childblock, skey, scantid, tupnkeyatts);
 		}
 	}
 
@@ -1083,9 +1115,9 @@ bt_target_page_check(BtreeCheckState *state)
 }
 
 /*
- * Return a scankey for an item on page to right of current target (or the
+ * Return an index tuple for an item on page to right of current target (or the
  * first non-ignorable page), sufficient to check ordering invariant on last
- * item in current target page.  Returned scankey relies on local memory
+ * item in current target page.  Returned tuple relies on local memory
  * allocated for the child page, which caller cannot pfree().  Caller's memory
  * context should be reset between calls here.
  *
@@ -1098,8 +1130,8 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
-bt_right_page_check_scankey(BtreeCheckState *state)
+static IndexTuple
+bt_right_page_check_tuple(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
@@ -1287,11 +1319,10 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	}
 
 	/*
-	 * Return first real item scankey.  Note that this relies on right page
-	 * memory remaining allocated.
+	 * Return first real item.  Note that this relies on right page memory
+	 * remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	return (IndexTuple) PageGetItem(rightpage, rightitem);
 }
 
 /*
@@ -1305,7 +1336,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  */
 static void
 bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+				  ScanKey targetkey, ItemPointer scantid, int tupnkeyatts)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1354,7 +1385,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1404,14 +1435,14 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		/*
 		 * Skip comparison of target page key against "negative infinity"
 		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * bound, but that's only because of the hard-coding for negative
+		 * infinity items within _bt_compare().
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_l_nontarget_offset(state, child, tupnkeyatts, targetkey,
+										  scantid, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1751,6 +1782,51 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+				   ItemPointer scantid, OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  upperbound);
+
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as meaning
+	 * that they're not participating in a search, not as negative infinity
+	 * (only tuples within the index are treated as negative infinity).
+	 * Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+
+		/* Get heap TID for item to the right */
+		rheaptid = BTreeTupleGetHeapTID(ritup);
+
+		if (uppnkeyatts == tupnkeyatts)
+			return scantid == NULL && rheaptid != NULL;
+
+		return tupnkeyatts < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1759,57 +1835,90 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
-					 OffsetNumber upperbound)
+invariant_leq_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+					 ItemPointer scantid, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  upperbound);
 
 	return cmp <= 0;
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+				   ItemPointer scantid, OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	/*
+	 * No need to consider possibility that scankey has attributes that we need
+	 * to force to be interpreted as negative infinity, since scan key has to
+	 * be strictly greater than lower bound offset.
+	 */
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  lowerbound);
 
-	return cmp >= 0;
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e.  the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, Page nontarget,
+							 int tupnkeyatts, ScanKey key, ItemPointer scantid,
+							 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, nontarget,
+					  upperbound);
 
-	return cmp <= 0;
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as meaning
+	 * that they're not participating in a search, not as negative infinity
+	 * (only tuples within the index are treated as negative infinity).
+	 * Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+
+		/* Get heap TID for item from child/non-target */
+		childheaptid = BTreeTupleGetHeapTID(child);
+
+		if (uppnkeyatts == tupnkeyatts)
+			return scantid == NULL && childheaptid != NULL;
+
+		return tupnkeyatts < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89..0782f0129c 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -46,18 +46,15 @@ the real "key" at all, just at the link field.)  We can distinguish
 items at the leaf level in the same way, by examining their links to
 heap tuples; we'd never have two items for the same heap tuple.
 
-Lehman and Yao assume that the key range for a subtree S is described
+Lehman and Yao require that the key range for a subtree S is described
 by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
+page.  A search that finds exact equality to a bounding key in an upper
+tree level must descend to the left of that key to ensure it finds any
+equal keys.  An insertion that sees the high key of its target page is
+equal to the key to be inserted cannot move right, since the downlink
+for the right sibling in the parent must always be strictly less than
+right sibling keys (this is always possible because the leftmost
+downlink on any non-leaf level is always a negative infinity downlink).
 
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
@@ -610,21 +607,25 @@ scanned to decide whether to return the entry and whether the scan can
 stop (see _bt_checkkeys()).
 
 We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+to heap tuples, but are used only for tree navigation.  Pivot tuples
+include all tuples on non-leaf pages and high keys on leaf pages.  Note
+that pivot index tuples are only used to represent which part of the key
+space belongs on each page, and can have attribute values copied from
+non-pivot tuples that were deleted and killed by VACUUM some time ago.
+
+We truncate away attributes that are not needed for a page high key during
+a leaf page split, provided that the remaining attributes distinguish the
+last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page.  A truncated tuple logically retains the truncated
+suffix key attributes, which implicitly have "negative infinity" as their
+value.  This optimization is called suffix truncation.  Since the high key
+is subsequently reused as the downlink in the parent page for the new
+right page, suffix truncation can increase index fan-out considerably by
+keeping pivot tuples short.  INCLUDE indexes are guaranteed to have
+non-key attributes truncated at the time of a leaf page split, but may
+also have some key attributes truncated away, based on the usual criteria
+for key attributes.
 
 Notes About Data Representation
 -------------------------------
@@ -658,4 +659,19 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
+
+Non-leaf pages only truly need to truncate their first item to zero
+attributes at the leftmost level, since that truly is negative infinity.
+All other negative infinity items are only really negative infinity
+within the subtree that the page is at the root of (or is a leftmost
+page within).  We truncate away all attributes of the first item on
+non-leaf pages just the same, to save a little space.  If we ever
+avoided zero-truncating items on pages where that doesn't accurately
+represent the absolute separation of the keyspace, we'd be left with
+"low key" items on internal pages -- a key value that can be used as a
+lower bound on items on the page, much like the high key is an upper
+bound. (Actually, that would even be true of "true" negative infinity
+items.  One can think of rightmost pages as implicitly containing
+"positive infinity" high keys.)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 907cce0724..4c4f7d8835 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -180,7 +180,7 @@ top:
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
+				_bt_compare(rel, indnkeyatts, itup_scankey, &itup->t_tid, page,
 							P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
@@ -216,9 +216,12 @@ top:
 
 	if (!fastpath)
 	{
+		ItemPointer scantid =
+			(checkUnique != UNIQUE_CHECK_NO ? NULL : &itup->t_tid);
+
 		/* find the first page containing this key */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, indnkeyatts, itup_scankey, scantid, false,
+						   &buf, BT_WRITE, NULL);
 
 		/* trade in our read lock for a write lock */
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
@@ -231,8 +234,8 @@ top:
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		buf = _bt_moveright(rel, buf, indnkeyatts, itup_scankey, false,
-							true, stack, BT_WRITE, NULL);
+		buf = _bt_moveright(rel, buf, indnkeyatts, itup_scankey, scantid,
+							false, true, stack, BT_WRITE, NULL);
 	}
 
 	/*
@@ -261,7 +264,8 @@ top:
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
+		/* Find position while excluding heap TID attribute */
+		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, NULL, false);
 		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
 								 checkUnique, &is_unique, &speculativeToken);
 
@@ -285,6 +289,25 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/*
+		 * Be careful to not get confused about user attribute position and
+		 * insertion position.
+		 *
+		 * XXX: This is ugly as sin, and clearly needs a lot more work.  While
+		 * not having this code does not seem to affect regression tests, we
+		 * almost certainly need to do something here for the case where
+		 * _bt_check_unique() traverses many pages, each filled with logical
+		 * duplicates.
+		 */
+		buf = _bt_moveright(rel, buf, indnkeyatts, itup_scankey, &itup->t_tid,
+							false, true, stack, BT_WRITE, NULL);
+		/*
+		 * Always invalidate hint
+		 *
+		 * FIXME: This is unacceptable.
+		 */
+		offset = InvalidOffsetNumber;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
@@ -564,11 +587,11 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
-			/* If scankey == hikey we gotta check the next page too */
+			/* If scankey <= hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			/* _bt_isequal()'s special NULL semantics not required here */
+			if (_bt_compare(rel, indnkeyatts, itup_scankey, NULL, page, P_HIKEY) > 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -700,6 +723,18 @@ _bt_findinsertloc(Relation rel,
 	 * pages).  Currently the probability of moving right is set at 0.99,
 	 * which may seem too high to change the behavior much, but it does an
 	 * excellent job of preventing O(N^2) behavior with many equal keys.
+	 *
+	 * TODO: Support this old approach for pre-pg_upgrade indexes.
+	 *
+	 * None of this applies when all items in the tree are unique, since the
+	 * new item cannot go on either page if it's equal to the high key.  The
+	 * original L&Y invariant that we now follow is that high keys must be
+	 * less than or equal to all items on the page, and strictly less than
+	 * the right sibling items (since the high key also becomes the downlink
+	 * to the right sibling in parent after a page split).  It's very
+	 * unlikely that it will be equal anyway, since there will be explicit
+	 * heap TIDs in pivot tuples in the event of many duplicates, but it can
+	 * happen when heap TID recycling takes place.
 	 *----------
 	 */
 	movedright = false;
@@ -731,8 +766,7 @@ _bt_findinsertloc(Relation rel,
 		 * nope, so check conditions (b) and (c) enumerated above
 		 */
 		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+			_bt_compare(rel, keysz, scankey, &newtup->t_tid, page, P_HIKEY) <= 0)
 			break;
 
 		/*
@@ -792,7 +826,7 @@ _bt_findinsertloc(Relation rel,
 	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
 		newitemoff = firstlegaloff;
 	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, &newtup->t_tid, false);
 
 	*bufptr = buf;
 	*offsetptr = newitemoff;
@@ -851,11 +885,12 @@ _bt_insertonpg(Relation rel,
 	/* child buffer must be given iff inserting on an internal page */
 	Assert(P_ISLEAF(lpageop) == !BufferIsValid(cbuf));
 	/* tuple must have appropriate number of attributes */
+	Assert(BTreeTupleGetNAtts(itup, rel) > 0);
 	Assert(!P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) ==
 		   IndexRelationGetNumberOfAttributes(rel));
 	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
+		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/* The caller should've finished any incomplete splits already. */
@@ -1143,8 +1178,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	OffsetNumber i;
 	bool		isleaf;
 	IndexTuple	lefthikey;
-	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/* Acquire a new page to split into */
 	rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1214,7 +1247,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <=
+			   IndexRelationGetNumberOfKeyAttributes(rel));
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1247,25 +1282,35 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
-	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * Truncate attributes of the high key item before inserting it on the left
+	 * page.  This can only happen at the leaf level, since in general all
+	 * pivot tuple values originate from leaf level high keys.  This isn't just
+	 * about avoiding unnecessary work, though; truncating unneeded key suffix
+	 * attributes can only be performed at the leaf level anyway.  This is
+	 * because a pivot tuple in a grandparent page must guide a search not only
+	 * to the correct parent page, but also to the correct leaf page.
+	 *
+	 * Note that non-key (INCLUDE) attributes are always truncated away here.
+	 * Additional key attributes are truncated away when they're not required
+	 * to correctly separate the key space.
+	 *
+	 * TODO: Give a little weight to how large the final downlink will be when
+	 * deciding on a split point.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (isleaf)
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		OffsetNumber	lastleftoffnum = OffsetNumberPrev(firstright);
+
+		lefthikey = _bt_suffix_truncate(rel, origpage, lastleftoffnum, item);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <=
+		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1487,22 +1532,11 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log left page.  We must also log the left page's high key. */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
+		loglhikey = true;
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -2210,7 +2244,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2322,8 +2357,8 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
@@ -2337,12 +2372,6 @@ _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
 	for (i = 1; i <= keysz; i++)
 	{
 		AttrNumber	attno;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index a24e64156a..25b24b1d66 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1415,8 +1415,10 @@ _bt_pagedel(Relation rel, Buffer buf)
 				itup_scankey = _bt_mkscankey(rel, targetkey);
 				/* find the leftmost leaf page containing this key */
 				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
+								   BTreeTupleGetNAtts(targetkey, rel),
+								   itup_scankey,
+								   BTreeTupleGetHeapTID(targetkey), false,
+								   &lbuf, BT_READ, NULL);
 				/* don't need a pin on the page */
 				_bt_relbuf(rel, lbuf);
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 0bcfa10b86..1e4a82bf77 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -94,8 +94,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * any incomplete splits encountered during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, int keysz, ScanKey scankey, ItemPointer scantid,
+		   bool nextkey, Buffer *bufP, int access, Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 
@@ -130,7 +130,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, scantid, nextkey,
 							  (access == BT_WRITE), stack_in,
 							  BT_READ, snapshot);
 
@@ -144,7 +144,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, scantid, nextkey);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -215,6 +215,7 @@ _bt_moveright(Relation rel,
 			  Buffer buf,
 			  int keysz,
 			  ScanKey scankey,
+			  ItemPointer scantid,
 			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
@@ -275,7 +276,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, scantid, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -324,6 +325,7 @@ _bt_binsrch(Relation rel,
 			Buffer buf,
 			int keysz,
 			ScanKey scankey,
+			ItemPointer scantid,
 			bool nextkey)
 {
 	Page		page;
@@ -371,7 +373,7 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, keysz, scankey, scantid, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -428,24 +430,36 @@ int32
 _bt_compare(Relation rel,
 			int keysz,
 			ScanKey scankey,
+			ItemPointer scantid,
 			Page page,
 			OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	ItemPointer  heapTid;
 	IndexTuple	itup;
+	int			ntupatts;
+	int			ncmpkey;
 	int			i;
 
+	Assert(keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 	Assert(_bt_check_natts(rel, page, offnum));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
 	 * --- see NOTE above.
+	 *
+	 * A minus infinity key has all attributes truncated away, so this test is
+	 * redundant with the minus infinity attribute tie-breaker.  However, the
+	 * number of attributes in minus infinity tuples was not explicitly
+	 * represented as 0 until PostgreSQL v11, so an explicit offnum test is
+	 * still required.
 	 */
 	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -459,7 +473,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	ncmpkey = Min(ntupatts, keysz);
+	for (i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -510,8 +525,69 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * Use the number of attributes as a tie-breaker, in order to treat
+	 * truncated attributes in index as minus infinity.
+	 */
+	if (keysz > ntupatts)
+		return 1;
+
+	/* If caller provided no heap TID tie-breaker for scan, they're equal */
+	if (!scantid)
+		return 0;
+
+	/*
+	 * Although it isn't counted as an attribute by BTreeTupleGetNAtts(), heap
+	 * TID is an implicit final key attribute that ensures that all index
+	 * tuples have a distinct set of key attribute values.
+	 *
+	 * This is often truncated away in pivot tuples, which makes the attribute
+	 * value implicitly negative infinity.
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (!heapTid)
+		return 1;
+
+	/* Deliberately invert the order, since TIDs "sort DESC" */
+	return ItemPointerCompare(heapTid, scantid);
+}
+
+/*
+ * Return how many attributes to leave when truncating.
+ *
+ * This only considers key attributes, since non-key attributes should always
+ * be truncated away.  We only need attributes up to and including the first
+ * distinguishing attribute.
+ *
+ * This can return a number of attributes that is one greater than the number
+ * of key attributes actually found in the first right tuple.  This indicates
+ * that the caller must use the leftmost heap TID as a unique-ifier in its new
+ * high key tuple.
+ */
+int
+_bt_leave_natts(Relation rel, Page leftpage, OffsetNumber lastleftoffnum,
+				IndexTuple firstright)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			leavenatts;
+	ScanKey		skey;
+
+	skey = _bt_mkscankey(rel, firstright);
+
+	/*
+	 * Even test nkeyatts (untruncated) case, since caller cares about whether
+	 * or not it can avoid appending a heap TID as a unique-ifier
+	 */
+	for (leavenatts = 1; leavenatts <= nkeyatts; leavenatts++)
+	{
+		if (_bt_compare(rel, leavenatts, skey, NULL, leftpage, lastleftoffnum) > 0)
+			break;
+	}
+
+	/* Can't leak memory here */
+	_bt_freeskey(skey);
+
+	return leavenatts;
 }
 
 /*
@@ -1027,7 +1103,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
+	stack = _bt_search(rel, keysCount, scankeys, NULL, nextkey, &buf, BT_READ,
 					   scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
@@ -1057,7 +1133,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, NULL, nextkey);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..6579021a04 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -796,8 +796,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -880,17 +878,17 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
-			IndexTuple	truncated;
-			Size		truncsz;
+			OffsetNumber	lastleftoffnum = OffsetNumberPrev(last_off);
+			IndexTuple		truncated;
+			Size			truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
-			 * in internal pages are either negative infinity items, or get
-			 * their contents from copying from one level down.  See also:
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks in
+			 * internal pages are either negative infinity items, or get their
+			 * contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
 			 * Since the truncated tuple is probably smaller than the
@@ -904,8 +902,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * only shift the line pointer array back and forth, and overwrite
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
+			 *
+			 * TODO: Give a little weight to how large the final downlink will
+			 * be when deciding on a split point.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			truncated = _bt_suffix_truncate(wstate->index, opage,
+											lastleftoffnum, oitup);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -924,8 +926,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -970,7 +973,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1029,8 +1032,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1127,6 +1131,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1134,7 +1140,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1151,6 +1156,22 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all keys
+				 * in the index are physically unique.
+				 *
+				 * Deliberately invert the order, since TIDs "sort DESC".
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup2->t_tid, &itup->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index acb944357a..b9f9883bdd 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -56,27 +56,34 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().
+ *		The result is intended for use with _bt_compare().  If itup has
+ *		undergone suffix truncation of key attributes, caller had better
+ *		pass BTreeTupleGetNAtts(itup, rel) as keysz to routines like
+ *		_bt_search() and _bt_compare() when using returned scan key.  This
+ *		allows truncated attributes to participate in comparisons (truncated
+ *		attributes have implicit negative infinity values).  Note that
+ *		_bt_compare() never treats a scan key as containing negative
+ *		infinity attributes.
  */
 ScanKey
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
 	ScanKey		skey;
 	TupleDesc	itupdesc;
+	int			tupnatts;
 	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts > 0);
+	Assert(tupnatts <= indnatts);
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
@@ -96,7 +103,21 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Truncated key attributes may not be represented in index tuple
+		 * due to suffix truncation.  Keys built from truncated attributes
+		 * are defensively represented as NULL values, though they should
+		 * still not be allowed to participate in comparisons (caller must
+		 * be sure to pass a sane keysz to _bt_compare()).
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -2083,38 +2104,116 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_suffix_truncate() -- create tuple without unneeded suffix attributes.
  *
  * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * attributes copied from caller's itup argument.  If rel is an INCLUDE index,
+ * non-key attributes are always truncated away, since they're not part of the
+ * key space, and are not used in pivot tuples.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  This is possible when there are
+ * attributes after an already distinct pair of attributes.
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Truncated tuple is guaranteed to be no larger than the original plus space
+ * for an extra heap TID tie-breaker attribute, which is important for staying
+ * under the 1/3 of a page restriction on tuple size.
  *
  * Note that returned tuple's t_tid offset will hold the number of attributes
  * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * should only change truncated tuple's downlink.  Note also that truncated key
+ * attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_suffix_truncate(Relation rel, Page leftpage, OffsetNumber lastleftoffnum,
+					IndexTuple firstright)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc		itupdesc = RelationGetDescr(rel);
+	int16			natts = IndexRelationGetNumberOfAttributes(rel);
+	int16			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int				leavenatts;
+	IndexTuple		pivot;
+	ItemId			lastleftitem;
+	IndexTuple		lastlefttuple;
+	Size			newsize;
+
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples, which must have non-key
+	 * attributes in the case of INCLUDE indexes.  It's never okay to truncate
+	 * a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
+
+	/* Determine how many attributes must be left behind */
+	leavenatts = _bt_leave_natts(rel, leftpage, lastleftoffnum, firstright);
+
+	if (leavenatts <= natts)
+	{
+		IndexTuple		tidpivot;
+
+		/*
+		 * Truncate away non-key attributes and/or key attributes.  Do a
+		 * straight copy in the case where the only attribute to be "truncated
+		 * away" is the implicit heap TID key attribute (i.e. the case where we
+		 * can at least avoid adding an explicit heap TID attribute to new
+		 * pivot).
+		 */
+		if (leavenatts < natts)
+			pivot = index_truncate_tuple(itupdesc, firstright, leavenatts);
+		else
+			pivot = CopyIndexTuple(firstright);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+		/*
+		 * If there is a distinguishing key attribute within leavenatts, there
+		 * is no need to add an explicit heap TID attribute to new pivot.
+		 */
+		if (leavenatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, leavenatts);
+			return pivot;
+		}
 
-	return truncated;
+		/*
+		 * Only non-key attributes could be truncated away.  They are not
+		 * considered part of the key space, so it's still necessary to add a
+		 * heap TID attribute to the new pivot tuple.  Create an enlarged copy
+		 * of the truncated right tuple, to fit the heap TID.
+		 */
+		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible.  Create an enlarged copy of the first
+		 * right tuple, to fit the heap TID.
+		 */
+		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * We must use heap TID as a unique-ifier in new pivot tuple, since no user
+	 * key attributes could be truncated away.  The heap TID must come from
+	 * the last tuple on the left page, since new downlinks must be a strict
+	 * lower bound on the new right page.
+	 */
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+	/* Copy last left item's heap TID into new pivot tuple */
+	lastleftitem = PageGetItemId(leftpage, lastleftoffnum);
+	lastlefttuple = (IndexTuple) PageGetItem(leftpage, lastleftitem);
+	memcpy((char *) pivot + newsize - MAXALIGN(sizeof(ItemPointerData)),
+		   &lastlefttuple->t_tid, sizeof(ItemPointerData));
+	/* Tuple has all key attributes */
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetHeapTID(pivot);
+	return pivot;
 }
 
 /*
@@ -2137,6 +2236,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2156,6 +2256,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
@@ -2165,7 +2266,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2176,7 +2277,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			Assert(!P_RIGHTMOST(opaque));
 
 			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2209,7 +2310,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Tuple contains only key attributes despite on is it page high
 			 * key or not
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 
 	}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 67a94cb80a..f1d286c1ba 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -252,6 +252,9 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	 * When the high key isn't present is the wal record, then we assume it to
 	 * be equal to the first key on the right page.  It must be from the leaf
 	 * level.
+	 *
+	 * FIXME:  We currently always log the high key.  Is it worth trying to
+	 * salvage the case where logging isn't strictly necessary?
 	 */
 	if (!lhighkey)
 	{
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 9fb33b9035..2a0b64ad47 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4057,23 +4057,26 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required
+	 * for btree indexes, since heap TID is treated as an implicit last
+	 * key attribute in order to ensure that all keys in the index are
+	 * physically unique.
+	 *
+	 * Deliberately invert the order, since TIDs "sort DESC".
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
 		BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
 
 		if (blk1 != blk2)
-			return (blk1 < blk2) ? -1 : 1;
+			return (blk1 < blk2) ? 1 : -1;
 	}
 	{
 		OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
 		OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
 
 		if (pos1 != pos2)
-			return (pos1 < pos2) ? -1 : 1;
+			return (pos1 < pos2) ? 1 : -1;
 	}
 
 	return 0;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 04ecb4cbc0..f6208132b3 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -122,11 +122,21 @@ typedef struct BTMetaPageData
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_suffix_truncate() will need to enlarge
+ * a heap index tuple to make space for a tie-breaker heap TID
+ * attribute, which we account for here.
  */
-#define BTMaxItemSize(page) \
+#define BTMaxItemSizeOld(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*MAXALIGN(sizeof(ItemPointerData))) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
 
 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
@@ -204,12 +214,10 @@ typedef struct BTMetaPageData
  * real offset (downlinks only need to store a block number).  The offset
  * field only stores the number of attributes when the INDEX_ALT_TID_MASK
  * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * number of attributes).  INDEX_ALT_TID_MASK is only used for pivot tuples
+ * at present, though it's possible that it will be used within non-pivot
+ * tuples in the future.  Do not assume that a tuple with INDEX_ALT_TID_MASK
+ * set must be a pivot tuple.
  *
  * The 12 least significant offset bits are used to represent the number of
  * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
@@ -219,6 +227,8 @@ typedef struct BTMetaPageData
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+/* Reserved to indicate if heap TID is represented in pivot tuple */
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +251,15 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +268,32 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get/set implicit tie-breaker heap-TID attribute, if any.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has t_tid that
+ * points to heap.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   MAXALIGN(sizeof(ItemPointerData))) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &(itup)->t_tid \
+	)
+#define BTreeTupleSetHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -560,15 +593,17 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
  * prototypes for functions in nbtsearch.c
  */
 extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
+		   int keysz, ScanKey scankey, ItemPointer scantid, bool nextkey,
 		   Buffer *bufP, int access, Snapshot snapshot);
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
+			  ScanKey scankey, ItemPointer scantid, bool nextkey,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
 extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
+			ScanKey scankey, ItemPointer scantid, bool nextkey);
 extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+			ItemPointer scantid, Page page, OffsetNumber offnum);
+extern int _bt_leave_natts(Relation rel, Page leftpage,
+						   OffsetNumber lastleftoffnum, IndexTuple firstright);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -601,7 +636,9 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
+extern IndexTuple _bt_suffix_truncate(Relation rel, Page leftpage,
+									  OffsetNumber lastleftoffnum,
+									  IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
 
 /*
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index dc6262be43..2c20cea4b9 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5896,8 +5896,8 @@ inner join j1 j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
 where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1;
  id1 | id2 | id1 | id2 
 -----+-----+-----+-----
-   1 |   1 |   1 |   1
    1 |   2 |   1 |   2
+   1 |   1 |   1 |   1
 (2 rows)
 
 reset enable_nestloop;
-- 
2.14.1

In reply to: Peter Geoghegan (#11)
3 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Attached is my v3, which has some significant improvements:

* The hinting for unique index inserters within _bt_findinsertloc()
has been restored, more or less.

* Bug fix for the case where the left side of the split comes from the
tuple being inserted. We need to pass that tuple to
_bt_suffix_truncate() as the left side of the split, which we
previously failed to do. The amcheck
coverage I've added allowed me to catch this issue during a benchmark.
(I use amcheck during benchmarks to get some amount of stress-testing
in.)

* New performance optimization that allows us to descend a downlink
when its user-visible attributes have scankey-equal values. We avoid
an unnecessary move left by using a sentinel scan TID that's less than
any possible real heap TID, but still greater than minus infinity as
far as _bt_compare() is concerned.
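
To make that concrete, here's a minimal, hypothetical sketch (not code
taken from the patch) of descending the tree with such a sentinel,
using the extended _bt_search() signature from the patch. The variables
are the ones already in scope in _bt_first(); the sentinel encoding
shown (a physically maximal TID, which sorts below every real heap TID
given this patch's descending heap TID order) is only an assumption for
illustration:

    /*
     * Hypothetical sketch: descend on the user-visible scan key
     * attributes plus a sentinel scan TID.  Assumption: since heap TIDs
     * "sort DESC" in this patch, the largest possible physical TID
     * compares lower than any real heap TID in the key space, while
     * truncated ("minus infinity") pivot attributes still compare lower
     * than the sentinel itself.
     */
    ItemPointerData sentinel;

    ItemPointerSet(&sentinel, MaxBlockNumber, MaxOffsetNumber);

    /* same call as in _bt_first(), with the sentinel as scantid */
    stack = _bt_search(rel, keysCount, scankeys, &sentinel, nextkey,
                       &buf, BT_READ, scan->xs_snapshot);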

I am now considering pursuing this as a project in its own right,
which can be justified without being part of some larger effort to add
retail index tuple deletion (e.g. by VACUUM). I think that I can get
it to the point of being a totally unambiguous win, if I haven't
already. So, this patch is no longer just an interesting prototype of
a new architectural direction we should take. In any case, it has far
fewer problems than v2.

Testing the performance characteristics of this patch has proven
difficult. My home server seems to show a nice win with a pgbench
workload that uses a Gaussian distribution for the pgbench_accounts
queries (script attached). That seems consistent and reproducible. My
home server has 32GB of RAM and a 250GB Samsung 850 EVO SSD. With
shared_buffers set to 12GB, 80-minute runs at
scale 4800 look like this:

Master:

25 clients:
tps = 15134.223357 (excluding connections establishing)

50 clients:
tps = 13708.419887 (excluding connections establishing)

75 clients:
tps = 12951.286926 (excluding connections establishing)

90 clients:
tps = 12057.852088 (excluding connections establishing)

Patch:

25 clients:
tps = 17857.863353 (excluding connections establishing)

50 clients:
tps = 14319.514825 (excluding connections establishing)

75 clients:
tps = 14015.794005 (excluding connections establishing)

90 clients:
tps = 12495.683053 (excluding connections establishing)

I ran this twice, and got pretty consistent results each time (there
were many other benchmarks on my home server -- this was the only one
that tested this exact patch, though). Note that there was only one
pgbench initialization for each set of runs. It looks like a pretty
strong result for the patch - note that the accounts table is about
twice the size of available main memory. The server is pretty well
overloaded in every individual run.

Unfortunately, I have a hard time showing much of any improvement on a
storage-optimized AWS instance with EBS storage, with a scaled-up pgbench
scale factor and main memory. I'm using an i3.4xlarge, which has 16
vCPUs, 122 GiB RAM, and 2 SSDs in a software RAID0 configuration. It
appears to more or less make no overall difference there, for reasons
that I have yet to get to the bottom of. I conceived this AWS
benchmark as something that would have far longer run times with a
scaled-up database size. My expectation was that it would confirm the
preliminary result, but it hasn't.

Maybe the issue is that it's far harder to fill the I/O queue on this
AWS instance? Or perhaps its related to the higher latency of EBS,
compared to the local SSD on my home server? I would welcome any ideas
about how to benchmark the patch. It doesn't necessarily have to be a
huge win for a very generic workload like the one I've tested, since
it would probably still be enough of a win for things like free space
management in secondary indexes [1]. Plus, of course, it seems likely
that we're going to eventually add retail index tuple deletion in some
form or another, which this is prerequisite to.

For a project like this, I expect an unambiguous, across-the-board win
from the committed patch, even if it isn't a huge win. I'm encouraged
by the fact that this is starting to look credible as a stand-alone
patch, but I have to admit that there are probably still significant
gaps in my understanding of how it affects real-world
performance. I don't have a lot of recent experience with benchmarking
workloads like this one.

[1]: /messages/by-id/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan

Attachments:

dell-server.txt (text/plain)
tpcb.sql (application/sql)
v3-0001-Make-all-nbtree-index-tuples-have-unique-keys.patch (text/x-patch)
From 481cb91b203f24abb2ff13d10d5e30143eb8974f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v3] Make all nbtree index tuples have unique keys.

Make nbtree treat all index tuples as having a heap TID trailing
attribute.  Heap TID becomes a first class part of the key space on all
levels of the tree.  Index searches can distinguish duplicates by heap
TID, though for now this is only used by insertions that need to find a
leaf page to insert a tuple on.  This general approach has numerous
benefits for performance, and may enable a later enhancement that has
nbtree vacuuming perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is also introduced.  This will usually truncate away the "extra"
heap TID attribute from pivot tuples during a leaf page split, and may
also truncate away additional user attributes.  This can increase
fan-out when there are multiple indexed attributes, though this is of
secondary importance.  Truncation can only occur at the attribute
granularity, which isn't particularly effective, but works well enough
for now.

We completely remove the logic that allows a search for free space among
multiple pages full of duplicates to "get tired".  This has significant
benefits for free space management in secondary indexes on low
cardinality attributes.  Unique checking still has to start with the
first page that its heap-TID-free insertion scan key leads it to, though
insertion can then quickly find the leaf page and offset that its new
tuple unambiguously belongs at (in the unique case there will rarely be
multiple pages full of duplicates, so being unable to descend the tree
to directly find the insertion target leaf page will seldom be much of a
problem).

Note that this version of the patch doesn't yet deal with on-disk
compatibility issues.  That will follow in a later revision.
---
 contrib/amcheck/verify_nbtree.c              | 256 +++++++++---
 contrib/pageinspect/expected/btree.out       |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out |  10 +-
 src/backend/access/nbtree/README             | 114 +++--
 src/backend/access/nbtree/nbtinsert.c        | 418 ++++++++++++-------
 src/backend/access/nbtree/nbtpage.c          |   8 +-
 src/backend/access/nbtree/nbtsearch.c        | 168 ++++++--
 src/backend/access/nbtree/nbtsort.c          |  63 ++-
 src/backend/access/nbtree/nbtutils.c         | 267 ++++++++++--
 src/backend/access/nbtree/nbtxlog.c          |  41 +-
 src/backend/access/rmgrdesc/nbtdesc.c        |   8 -
 src/backend/storage/page/bufpage.c           |   4 +-
 src/backend/utils/sort/tuplesort.c           |  13 +-
 src/include/access/nbtree.h                  |  85 +++-
 src/include/access/nbtxlog.h                 |  20 +-
 src/test/regress/expected/join.out           |   2 +-
 16 files changed, 1063 insertions(+), 416 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index a1438a2855..eb34b22c30 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -45,6 +45,13 @@ PG_MODULE_MAGIC;
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
 
+/*
+ * Convenience macro to get number of key attributes in tuple in low-context
+ * fashion
+ */
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
+
 /*
  * State associated with verifying a B-Tree index
  *
@@ -125,26 +132,30 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
+static IndexTuple bt_right_page_check_tuple(BtreeCheckState *state);
 static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+				  ScanKey targetkey, ItemPointer scantid, int tupnkeyatts);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
 						  bool tupleIsAlive, void *checkstate);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
+static inline bool invariant_l_offset(BtreeCheckState *state,
+					 int tupnkeyatts, ScanKey key, ItemPointer scantid,
+					 OffsetNumber upperbound);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 int tupnkeyatts, ScanKey key, ItemPointer scantid,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
+static inline bool invariant_g_offset(BtreeCheckState *state,
+					 int tupnkeyatts, ScanKey key, ItemPointer scantid,
 					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
-							   OffsetNumber upperbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							   Page other, int tupnkeyatts, ScanKey key,
+							   ItemPointer scantid, OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+								  IndexTuple itup, bool isleaf);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -834,8 +845,10 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		int			tupnkeyatts;
+		ScanKey		skey;
+		ItemPointer scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -902,8 +915,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
-		/* Build insertion scankey for current page offset */
+		/*
+		 * Build insertion scankey for current page offset/tuple.
+		 *
+		 * As required by _bt_mkscankey(), track number of key attributes,
+		 * which is needed so that _bt_compare() calls handle truncated
+		 * attributes correctly.  Never count non-key attributes in
+		 * non-truncated tuples as key attributes, though.
+		 */
+		tupnkeyatts = BTreeTupleGetNKeyAtts(itup, state->rel);
 		skey = _bt_mkscankey(state->rel, itup);
+		scantid = BTreeTupleGetHeapTIDCareful(state, itup, P_ISLEAF(topaque));
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -930,7 +952,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 * and probably not markedly more effective in practice.
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!invariant_leq_offset(state, tupnkeyatts, skey, scantid, P_HIKEY))
 		{
 			char	   *itid,
 					   *htid;
@@ -956,11 +978,11 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, tupnkeyatts, skey, scantid,
+								OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1017,16 +1039,29 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
+			IndexTuple	righttup;
 			ScanKey		rightkey;
+			int			righttupnkeyatts;
+			ItemPointer rightscantid;
 
 			/* Get item in next/right page */
-			rightkey = bt_right_page_check_scankey(state);
+			righttup = bt_right_page_check_tuple(state);
 
-			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+			/* Set up right item scankey */
+			if (righttup)
+			{
+				righttupnkeyatts = BTreeTupleGetNKeyAtts(righttup, state->rel);
+				rightkey = _bt_mkscankey(state->rel, righttup);
+				rightscantid = BTreeTupleGetHeapTIDCareful(state, righttup,
+														   P_ISLEAF(topaque));
+			}
+
+			if (righttup &&
+				!invariant_g_offset(state, righttupnkeyatts, rightkey,
+									rightscantid, max))
 			{
 				/*
-				 * As explained at length in bt_right_page_check_scankey(),
+				 * As explained at length in bt_right_page_check_tuple(),
 				 * there is a known !readonly race that could account for
 				 * apparent violation of invariant, which we must check for
 				 * before actually proceeding with raising error.  Our canary
@@ -1069,7 +1104,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, childblock, skey, scantid, tupnkeyatts);
 		}
 	}
 
@@ -1083,9 +1118,9 @@ bt_target_page_check(BtreeCheckState *state)
 }
 
 /*
- * Return a scankey for an item on page to right of current target (or the
+ * Return an index tuple for an item on page to right of current target (or the
  * first non-ignorable page), sufficient to check ordering invariant on last
- * item in current target page.  Returned scankey relies on local memory
+ * item in current target page.  Returned tuple relies on local memory
  * allocated for the child page, which caller cannot pfree().  Caller's memory
  * context should be reset between calls here.
  *
@@ -1098,8 +1133,8 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
-bt_right_page_check_scankey(BtreeCheckState *state)
+static IndexTuple
+bt_right_page_check_tuple(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
@@ -1287,11 +1322,10 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	}
 
 	/*
-	 * Return first real item scankey.  Note that this relies on right page
-	 * memory remaining allocated.
+	 * Return first real item.  Note that this relies on right page memory
+	 * remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	return (IndexTuple) PageGetItem(rightpage, rightitem);
 }
 
 /*
@@ -1305,7 +1339,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  */
 static void
 bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+				  ScanKey targetkey, ItemPointer scantid, int tupnkeyatts)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1354,7 +1388,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1404,14 +1438,14 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		/*
 		 * Skip comparison of target page key against "negative infinity"
 		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * bound, but that's only because of the hard-coding for negative
+		 * infinity items within _bt_compare().
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_l_nontarget_offset(state, child, tupnkeyatts, targetkey,
+										  scantid, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1751,6 +1785,54 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+				   ItemPointer scantid, OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  upperbound);
+
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as meaning
+	 * that they're not participating in a search, not as negative infinity
+	 * (only tuples within the index are treated as negative infinity).
+	 * Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+
+		/* Get heap TID for item to the right */
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup,
+											   P_ISLEAF(topaque));
+
+		if (uppnkeyatts == tupnkeyatts)
+			return scantid == NULL && rheaptid != NULL;
+
+		return tupnkeyatts < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1759,57 +1841,93 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
-					 OffsetNumber upperbound)
+invariant_leq_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+					 ItemPointer scantid, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  upperbound);
 
 	return cmp <= 0;
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+				   ItemPointer scantid, OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	/*
+	 * No need to consider possibility that scankey has attributes that we need
+	 * to force to be interpreted as negative infinity, since scan key has to
+	 * be strictly greater than lower bound offset.
+	 */
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  lowerbound);
 
-	return cmp >= 0;
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e.  the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, Page nontarget,
+							 int tupnkeyatts, ScanKey key, ItemPointer scantid,
+							 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, nontarget,
+					  upperbound);
 
-	return cmp <= 0;
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as meaning
+	 * that they're not participating in a search, not as negative infinity
+	 * (only tuples within the index are treated as negative infinity).
+	 * Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+		BTPageOpaque copaque;
+
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+
+		/* Get heap TID for item from child/non-target */
+		childheaptid = BTreeTupleGetHeapTIDCareful(state, child,
+												   P_ISLEAF(copaque));
+
+		if (uppnkeyatts == tupnkeyatts)
+			return scantid == NULL && childheaptid != NULL;
+
+		return tupnkeyatts < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -1965,3 +2083,31 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
+ * be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ *
+ * Note that it is incorrect to specify the tuple as a non-pivot when passing a
+ * leaf tuple that came from the high key offset, since that is actually a
+ * pivot tuple.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						state->targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b..07c2dcd771 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d4..9920dbfd40 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89..dc6c65d201 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -34,30 +34,47 @@ Differences to the Lehman & Yao algorithm
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
+The requirement that all btree keys be unique is satisfied by treating
+heap TID as a tie-breaker attribute.  Logical duplicates are sorted in
+descending item pointer order.  We don't use btree keys to
+disambiguate downlinks from the internal pages during a page split,
+though: only one entry in the parent level will be pointing at the
+page we just split, so the link fields can be used to re-find
+downlinks in the parent via a linear search.
 
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
+Lehman and Yao require that the key range for a subtree S is described
+by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the
+parent page, but do not account for the need to search the tree based
+only on leading index attributes in a composite index.  Since heap TID
+is always used to make btree keys unique (even in unique indexes),
+every btree index is treated as a composite index internally.  A
+search that finds exact equality to a pivot tuple in an upper tree
+level must descend to the left of that key to ensure it finds any
+equal keys, even when scan values were provided for all attributes.
+An insertion that sees that the high key of its target page is equal
+to the key to be inserted cannot move right, since the downlink for
+the right sibling in the parent must always be strictly less than
+right sibling keys (this is always possible because the leftmost
+downlink on any non-leaf level is always a negative infinity
+downlink).
+
+We might be able to avoid moving left in the event of a full match on
+all attributes up to and including the heap TID attribute, but that
+would be a very narrow win, since it's rather unlikely that heap TID
+will be an exact match.  We can avoid moving left unnecessarily when
+all user-visible keys are equal by avoiding exact equality;  a
+sentinel value that's less than any possible heap TID is used by most
+index scans.  This is effective because of suffix truncation.  An
+"extra" heap TID attribute in pivot tuples is almost always avoided.
+All truncated attributes compare as minus infinity, even against a
+sentinel value, and the sentinel value is less than any real TID
+value, so an unnecessary move to the left is avoided regardless of
+whether or not a heap TID is present in the otherwise-equal pivot
+tuple.  Consistently moving left on full equality is also needed by
+page deletion, which re-finds a leaf page by descending the tree while
+searching on the leaf page's high key.  If we wanted to avoid moving
+left without breaking page deletion, we'd have to avoid suffix
+truncation, which could never be worth it.
 
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
@@ -610,21 +627,25 @@ scanned to decide whether to return the entry and whether the scan can
 stop (see _bt_checkkeys()).
 
 We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+to heap tuples, but are used only for tree navigation.  Pivot tuples
+include all tuples on non-leaf pages and high keys on leaf pages.  Note
+that pivot index tuples are only used to represent which part of the key
+space belongs on each page, and can have attribute values copied from
+non-pivot tuples that were deleted and killed by VACUUM some time ago.
+
+We truncate away attributes that are not needed for a page high key during
+a leaf page split, provided that the remaining attributes distinguish the
+last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page.  A truncated tuple logically retains the truncated
+suffix key attributes, which implicitly have "negative infinity" as their
+value.  This optimization is called suffix truncation.  Since the high key
+is subsequently reused as the downlink in the parent page for the new
+right page, suffix truncation can increase index fan-out considerably by
+keeping pivot tuples short.  INCLUDE indexes are guaranteed to have
+non-key attributes truncated at the time of a leaf page split, but may
+also have some key attributes truncated away, based on the usual criteria
+for key attributes.
 
 Notes About Data Representation
 -------------------------------
@@ -658,4 +679,19 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
+
+Non-leaf pages only truly need to truncate their first item to zero
+attributes when they are the leftmost page on their level, since only
+then is that item truly negative infinity.
+All other negative infinity items are only really negative infinity
+within the subtree that the page is at the root of (or is a leftmost
+page within).  We truncate away all attributes of the first item on
+non-leaf pages just the same, to save a little space.  If we ever
+avoided zero-truncating items on pages where that doesn't accurately
+represent the absolute separation of the keyspace, we'd be left with
+"low key" items on internal pages -- a key value that can be used as a
+lower bound on items on the page, much like the high key is an upper
+bound. (Actually, that would even be true of "true" negative infinity
+items.  One can think of rightmost pages as implicitly containing
+"positive infinity" high keys.)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 4b2b4746f7..7d0556b91d 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -76,9 +76,11 @@ static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
+				 bool disjunctpass,
 				 OffsetNumber newitemoff,
 				 Size newitemsz,
 				 bool *newitemonleft);
+static bool _bt_split_isdisjunct(Page page, OffsetNumber item);
 static void _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright, bool newitemonleft,
 				  int dataitemstoleft, Size firstoldonrightsz);
@@ -113,9 +115,12 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	bool		is_unique = false;
 	int			indnkeyatts;
 	ScanKey		itup_scankey;
+	ItemPointer itup_scantid;
 	BTStack		stack = NULL;
 	Buffer		buf;
 	OffsetNumber offset;
+	Page		page;
+	BTPageOpaque lpageop;
 	bool		fastpath;
 
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -123,6 +128,8 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
+	/* use the heap TID with the scan key, except when checking uniqueness */
+	itup_scantid = (checkUnique == UNIQUE_CHECK_NO ? &itup->t_tid : NULL);
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -149,8 +156,6 @@ top:
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
 		Size		itemsz;
-		Page		page;
-		BTPageOpaque lpageop;
 
 		/*
 		 * Conditionally acquire exclusive lock on the buffer before doing any
@@ -180,7 +185,7 @@ top:
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
+				_bt_compare(rel, indnkeyatts, itup_scankey, itup_scantid, page,
 							P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
@@ -217,8 +222,8 @@ top:
 	if (!fastpath)
 	{
 		/* find the first page containing this key */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, indnkeyatts, itup_scankey, itup_scantid, false,
+						   &buf, BT_WRITE, NULL);
 
 		/* trade in our read lock for a write lock */
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
@@ -231,8 +236,8 @@ top:
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		buf = _bt_moveright(rel, buf, indnkeyatts, itup_scankey, false,
-							true, stack, BT_WRITE, NULL);
+		buf = _bt_moveright(rel, buf, indnkeyatts, itup_scankey, itup_scantid,
+							false, true, stack, BT_WRITE, NULL);
 	}
 
 	/*
@@ -242,12 +247,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tie-breaker attribute.  Any other would-be inserter of
+	 * the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -261,7 +267,11 @@ top:
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
+		page = BufferGetPage(buf);
+		lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+		Assert(itup_scantid == NULL);
+		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, NULL,
+							 P_FIRSTDATAKEY(lpageop), false);
 		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
 								 checkUnique, &is_unique, &speculativeToken);
 
@@ -299,7 +309,7 @@ top:
 		 * attributes are not considered part of the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
+		/* do the insertion, possibly on a page to the right in unique case */
 		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
 						  stack, heapRel);
 		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
@@ -564,11 +574,11 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
-			/* If scankey == hikey we gotta check the next page too */
+			/* If scankey <= hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			/* _bt_isequal()'s special NULL semantics not required here */
+			if (_bt_compare(rel, indnkeyatts, itup_scankey, NULL, page, P_HIKEY) > 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -612,31 +622,22 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 /*
  *	_bt_findinsertloc() -- Finds an insert location for a tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
- *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
- *
  *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
- *		page returned to the caller is exclusively locked instead.
+ *		where the new tuple could be inserted if we were to treat it as having
+ *		no implicit heap TID; only callers that just called _bt_check_unique()
+ *		provide this hint (all other callers should set *offsetptr to
+ *		InvalidOffsetNumber).  The caller should hold an exclusive lock on
+ *		*bufptr in all cases.  On exit, they both point to the chosen insert
+ *		location in all cases.  If _bt_findinsertloc decides to move right, the
+ *		lock and pin on the original page will be released, and the new page
+ *		returned to the caller is exclusively locked instead.
+ *
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.
  *
  *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		type scan key for it.  We take a "scantid" heap TID attribute value
+ *		from newtup directly.
  */
 static void
 _bt_findinsertloc(Relation rel,
@@ -652,9 +653,9 @@ _bt_findinsertloc(Relation rel,
 	Page		page = BufferGetPage(buf);
 	Size		itemsz;
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
+	bool		hintinvalidated;
 	OffsetNumber newitemoff;
+	OffsetNumber lowitemoff;
 	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -684,59 +685,30 @@ _bt_findinsertloc(Relation rel,
 				 errtableconstraint(heapRel,
 									RelationGetRelationName(rel))));
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
+	/* firstlegaloff/offsetptr hint (if any) assumed valid initially */
+	hintinvalidated = false;
+
+	/*
+	 * TODO: Restore the logic for finding a page to insert on in the event of
+	 * many duplicates within pg_upgrade'd indexes.  The whole search through
+	 * pages of logical duplicates to determine where to insert seems to have
+	 * little upside, but that doesn't make it okay to ignore the performance
+	 * characteristics of such indexes after pg_upgrade has run and before a
+	 * REINDEX can bump BTREE_VERSION.
 	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	while (true)
 	{
 		Buffer		rbuf;
 		BlockNumber rblkno;
 
-		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
-		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
-		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
-
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
-		}
-
-		/*
-		 * nope, so check conditions (b) and (c) enumerated above
-		 */
 		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+			_bt_compare(rel, keysz, scankey, &newtup->t_tid, page, P_HIKEY) <= 0)
 			break;
 
 		/*
-		 * step right to next non-dead page
+		 * step right to next non-dead page.  this is only needed for unique
+		 * indexes, and pg_upgrade'd indexes that still use BTREE_VERSION 2 or
+		 * 3, where heap TID isn't considered to be a part of the keyspace.
 		 *
 		 * must write-lock that page before releasing write lock on current
 		 * page; else someone else's _bt_check_unique scan could fail to see
@@ -775,24 +747,40 @@ _bt_findinsertloc(Relation rel,
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		hintinvalidated = true;
+	}
+
+	Assert(P_ISLEAF(lpageop));
+
+	/*
+	 * Perform micro-vacuuming of the page we're about to insert the tuple
+	 * on, if it looks like it has LP_DEAD items.
+	 */
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < itemsz)
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		hintinvalidated = true;
 	}
 
 	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
+	 * Consider using caller's hint to avoid repeated binary search effort.
+	 *
+	 * Note that the hint is only provided by callers that checked uniqueness.
+	 * The hint is used as a lower bound for a new binary search, since
+	 * caller's original binary search won't have specified a scan tid.
 	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
+	if (firstlegaloff == InvalidOffsetNumber || hintinvalidated)
+		lowitemoff = P_FIRSTDATAKEY(lpageop);
 	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	{
+		Assert(firstlegaloff == _bt_binsrch(rel, buf, keysz, scankey, NULL,
+											P_FIRSTDATAKEY(lpageop), false));
+		lowitemoff = firstlegaloff;
+	}
+
+	newitemoff = _bt_binsrch(rel, buf, keysz, scankey, &newtup->t_tid,
+							 lowitemoff, false);
 
 	*bufptr = buf;
 	*offsetptr = newitemoff;
@@ -851,11 +839,12 @@ _bt_insertonpg(Relation rel,
 	/* child buffer must be given iff inserting on an internal page */
 	Assert(P_ISLEAF(lpageop) == !BufferIsValid(cbuf));
 	/* tuple must have appropriate number of attributes */
+	Assert(BTreeTupleGetNAtts(itup, rel) > 0);
 	Assert(!P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) ==
 		   IndexRelationGetNumberOfAttributes(rel));
 	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
+		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/* The caller should've finished any incomplete splits already. */
@@ -900,7 +889,7 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* Choose the split point */
-		firstright = _bt_findsplitloc(rel, page,
+		firstright = _bt_findsplitloc(rel, page, true,
 									  newitemoff, itemsz,
 									  &newitemonleft);
 
@@ -1143,8 +1132,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	OffsetNumber i;
 	bool		isleaf;
 	IndexTuple	lefthikey;
-	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/* Acquire a new page to split into */
 	rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1214,7 +1201,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <=
+			   IndexRelationGetNumberOfKeyAttributes(rel));
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1228,8 +1217,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 
 	/*
 	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
-	 * item at position firstright, or the incoming tuple.
+	 * to go into the new right page, or possibly a truncated version if this
+	 * is a leaf page split.  This might be either the existing data item at
+	 * position firstright, or the incoming tuple.
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1247,25 +1237,77 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
-	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * Truncate attributes of the high key item before inserting it on the left
+	 * page.  This only needs to happen at the leaf level, since in general all
+	 * pivot tuple values originate from leaf level high keys.  This isn't just
+	 * about avoiding unnecessary work, though; truncating unneeded key suffix
+	 * attributes can only be performed at the leaf level anyway.  This is
+	 * because a pivot tuple in a grandparent page must guide a search not only
+	 * to the correct parent page, but also to the correct leaf page.
+	 *
+	 * Note that non-key (INCLUDE) attributes are always truncated away here.
+	 * Additional key attributes are truncated away when they're not required
+	 * to correctly separate the key space.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (isleaf)
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		OffsetNumber	lastleftoff;
+		IndexTuple		lastleft;
+
+		/*
+		 * Determine which tuple is on the left side of the split point, and
+		 * generate truncated copy of the right tuple.  Truncate as
+		 * aggressively as possible without generating a high key for the left
+		 * side of the split (and later downlink for the right side) that fails
+		 * to distinguish each side.  The new high key needs to be strictly
+		 * less than all tuples on the right side of the split, but can be
+		 * equal to items on the left side of the split.  We almost always find
+		 * a way to make it strictly greater than lastleft, though.
+		 *
+		 * Handle the case where the incoming tuple is about to become the last
+		 * item on the left side of the split.
+		 */
+		if (newitemonleft && newitemoff == firstright)
+			lastleft = newitem;
+		else
+		{
+			lastleftoff = OffsetNumberPrev(firstright);
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		Assert(lastleft != item);
+		lefthikey = _bt_suffix_truncate(rel, lastleft, item);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
+#ifdef DEBUG_SPLITS
+		{
+			TupleDesc	itupdesc = RelationGetDescr(rel);
+			Datum		values[INDEX_MAX_KEYS];
+			bool		isnull[INDEX_MAX_KEYS];
+			char	   *lastleftstr;
+			char	   *firstrightstr;
+
+			index_deform_tuple(lastleft, itupdesc, values, isnull);
+			lastleftstr = BuildIndexValueDescription(rel, values, isnull);
+			index_deform_tuple(item, itupdesc, values, isnull);
+			firstrightstr = BuildIndexValueDescription(rel, values, isnull);
+
+			elog(LOG, "split of block %u "
+				 "last left %s first right %s "
+				 "attributes in new high key for left page %u%s",
+				 BufferGetBlockNumber(buf), lastleftstr, firstrightstr,
+				 BTreeTupleGetNAtts(lefthikey, rel),
+				 BTreeTupleGetHeapTID(lefthikey) != NULL ? " (plus heap TID)":"");
+		}
+#endif
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <=
+		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1458,7 +1500,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1487,22 +1528,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log left page.  We must also log the left page's high key. */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1520,9 +1549,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -1575,10 +1602,31 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * righthand page, plus a boolean indicating whether the new tuple goes on
  * the left or right page.  The bool is necessary to disambiguate the case
  * where firstright == newitemoff.
+ *
+ * The top-level caller should specify disjunctpass as true.  This makes the
+ * split point search refuse to consider candidate split points that
+ * will result in a non-discriminating downlink for the right half of the
+ * split in the parent page.  If we fail to find a split point this way, we
+ * call ourselves recursively without insisting on a disjunct split point.
+ * This avoids cases where suffix truncation must add an extra heap TID to
+ * the new pivot during a leaf page split.  It also makes it far less likely
+ * that any pivot tuples that have a heap TID attribute added despite our
+ * best efforts will get used as a downlink a second time, in the
+ * grandparent.  There should only be downlinks with an extra heap TID
+ * attribute in pages at higher levels of the tree when a very large
+ * proportion of leaf tuples happen to be logical duplicates.
+ *
+ * The disjunctpass strategy may seem very aggressive, as it could lead to
+ * each half of the split having very different amounts of free space.
+ * Note, however, that logical duplicate downlinks in internal pages can
+ * only help _bt_search() callers that pass a scantid argument (i.e.
+ * UNIQUE_CHECK_NO inserters).  Even there, the effect on performance ought
+ * to be self-limiting.
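+ *
+ * For example, when a leaf page consists entirely of duplicates of a
+ * single key value, no user key attribute can separate the two halves of
+ * any candidate split point, so _bt_suffix_truncate() must add a heap TID
+ * to whichever new pivot ends up being chosen.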
  */
 static OffsetNumber
 _bt_findsplitloc(Relation rel,
 				 Page page,
+				 bool disjunctpass,
 				 OffsetNumber newitemoff,
 				 Size newitemsz,
 				 bool *newitemonleft)
@@ -1661,6 +1709,14 @@ _bt_findsplitloc(Relation rel,
 		itemid = PageGetItemId(page, offnum);
 		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
 
+		/*
+		 * FIXME:  This doesn't consider the case where the tuple to be
+		 * inserted is placed immediately before or after item.
+		 */
+		if (disjunctpass && offnum > P_FIRSTDATAKEY(opaque) &&
+			!_bt_split_isdisjunct(page, offnum))
+			goto skipnondisjunct;
+
 		/*
 		 * Will the new item go to left or right of split?
 		 */
@@ -1688,6 +1744,8 @@ _bt_findsplitloc(Relation rel,
 			break;
 		}
 
+skipnondisjunct:
+
 		olddataitemstoleft += itemsz;
 	}
 
@@ -1699,18 +1757,80 @@ _bt_findsplitloc(Relation rel,
 	if (newitemoff > maxoff && !goodenoughfound)
 		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
 
-	/*
-	 * I believe it is not possible to fail to find a feasible split, but just
-	 * in case ...
-	 */
 	if (!state.have_split)
+	{
+		/*
+		 * We couldn't find a split point using the disjunctpass strategy.  Take
+		 * a second pass at it, this time only weighing space utilization on
+		 * the page that we're about to split.
+		 */
+		if (disjunctpass)
+			return _bt_findsplitloc(rel, page, false, newitemoff, newitemsz,
+									newitemonleft);
+		/*
+		 * I believe it is not possible to fail to find a feasible split, but
+		 * just in case ...
+		 */
 		elog(ERROR, "could not find a feasible split point for index \"%s\"",
 			 RelationGetRelationName(rel));
+	}
 
 	*newitemonleft = state.newitemonleft;
 	return state.firstright;
 }
 
+/*
+ * _bt_split_isdisjunct - does split point separate key space well?
+ *
+ * firstoldonright is a candidate split point.
+ *
+ * Checks whether the would-be last left and first right tuples of the
+ * hypothetical post-split halves are distinct, which makes the split point
+ * disjunct.
+ */
+static bool
+_bt_split_isdisjunct(Page page, OffsetNumber firstoldonright)
+{
+	BTPageOpaque	opaque;
+	ItemId			itemid;
+	IndexTuple		itup;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	itemid = PageGetItemId(page, firstoldonright);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+
+	if (P_ISLEAF(opaque))
+	{
+		IndexTuple	lastitup;
+		char	   *curTup;
+		char	   *lastTup;
+		Size		datasz;
+
+		/*
+		 * Use memcmp() to cheaply approximate real comparisons.
+		 *
+		 * XXX: This clearly needs more work, especially to make things like
+		 * TPC-C's order_lines pkey make better use of suffix truncation.
+		 *
+		 * FIXME: What about covering index pivot tuples?  Clearly using
+		 * memcmp() there does not approximate comparing key attributes very
+		 * well at all.
+		 */
+		itemid = PageGetItemId(page, OffsetNumberPrev(firstoldonright));
+		lastitup = (IndexTuple) PageGetItem(page, itemid);
+		lastTup = (char *) lastitup + IndexInfoFindDataOffset(lastitup->t_info);
+		curTup = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
+		datasz = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+		if (IndexTupleSize(lastitup) == IndexTupleSize(itup) &&
+			memcmp(lastTup, curTup, datasz) == 0)
+			return false;
+	}
+	else if (BTreeTupleGetHeapTID(itup) != NULL)
+		return false;
+
+	return true;
+}
+
 /*
  * Subroutine to analyze a particular possible split choice (ie, firstright
  * and newitemonleft settings), and record the best split so far in *state.
@@ -1761,6 +1881,9 @@ _bt_checksplitloc(FindSplitData *state,
 	 */
 	leftfree -= firstrightitemsz;
 
+	/* Charge for the case where suffix truncation must add a heap TID */
+	leftfree -= MAXALIGN(sizeof(ItemPointerData));
+
 	/* account for the new item */
 	if (newitemonleft)
 		leftfree -= (int) state->newitemsz;
@@ -2210,7 +2333,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2322,8 +2446,8 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
@@ -2337,12 +2461,6 @@ _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
 	for (i = 1; i <= keysz; i++)
 	{
 		AttrNumber	attno;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 2e959da5f8..a3f25850cc 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1413,10 +1413,12 @@ _bt_pagedel(Relation rel, Buffer buf)
 
 				/* we need an insertion scan key for the search, so build one */
 				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
+				/* get stack to leaf page by searching index */
 				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
+								   BTreeTupleGetNAtts(targetkey, rel),
+								   itup_scankey,
+								   BTreeTupleGetHeapTID(targetkey), false,
+								   &lbuf, BT_READ, NULL);
 				/* don't need a pin on the page */
 				_bt_relbuf(rel, lbuf);
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 0bcfa10b86..b5e22bcb09 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -69,11 +69,13 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
 
 
 /*
- *	_bt_search() -- Search the tree for a particular scankey,
+ *	_bt_search() -- Search the tree for a particular scankey + scantid,
  *		or more precisely for the first leaf page it could be on.
  *
  * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * but it can omit the rightmost column(s) of the index.  The scantid
+ * argument may also be omitted (caller passes NULL), since it's logically
+ * the "real" rightmost attribute.
  *
  * When nextkey is false (the usual case), we are looking for the first
  * item >= scankey.  When nextkey is true, we are looking for the first
@@ -94,8 +96,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * any incomplete splits encountered during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, int keysz, ScanKey scankey, ItemPointer scantid,
+		   bool nextkey, Buffer *bufP, int access, Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 
@@ -130,7 +132,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, scantid, nextkey,
 							  (access == BT_WRITE), stack_in,
 							  BT_READ, snapshot);
 
@@ -144,7 +146,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, scantid,
+							 P_FIRSTDATAKEY(opaque), nextkey);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -157,8 +160,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link to disambiguate duplicate keys in the index, which is
+		 * faster than comparing the keys themselves.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -215,6 +218,7 @@ _bt_moveright(Relation rel,
 			  Buffer buf,
 			  int keysz,
 			  ScanKey scankey,
+			  ItemPointer scantid,
 			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
@@ -275,7 +279,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, scantid, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -307,6 +311,12 @@ _bt_moveright(Relation rel,
  * particular, this means it is possible to return a value 1 greater than the
  * number of keys on the page, if the scankey is > all keys on the page.)
  *
+ * The caller passes its own low value for the binary search.  This allows a
+ * partial binary search to be resumed without repeating earlier comparisons,
+ * which is how _bt_check_unique callers avoid repeated work.  Reusing a
+ * result this way is only safe when a buffer lock is held throughout, the
+ * same leaf page is passed both times, and nextkey is false.
+ *
  * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
  * of the last key < given scankey, or last key <= given scankey if nextkey
  * is true.  (Since _bt_compare treats the first data key of such a page as
@@ -324,19 +334,19 @@ _bt_binsrch(Relation rel,
 			Buffer buf,
 			int keysz,
 			ScanKey scankey,
+			ItemPointer scantid,
+			OffsetNumber low,
 			bool nextkey)
 {
 	Page		page;
 	BTPageOpaque opaque;
-	OffsetNumber low,
-				high;
+	OffsetNumber high;
 	int32		result,
 				cmpval;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	low = P_FIRSTDATAKEY(opaque);
 	high = PageGetMaxOffsetNumber(page);
 
 	/*
@@ -371,7 +381,7 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, keysz, scankey, scantid, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -401,20 +411,8 @@ _bt_binsrch(Relation rel,
 /*----------
  *	_bt_compare() -- Compare scankey to a particular tuple on the page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- *	keysz: number of key conditions to be checked (might be less than the
- *		number of index columns!)
- *	page/offnum: location of btree item to be compared to.
- *
- *		This routine returns:
- *			<0 if scankey < tuple at offnum;
- *			 0 if scankey == tuple at offnum;
- *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ * Convenience wrapper for _bt_tuple_compare() callers that want to compare
+ * against the tuple at a given offset on a particular page.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -428,24 +426,66 @@ int32
 _bt_compare(Relation rel,
 			int keysz,
 			ScanKey scankey,
+			ItemPointer scantid,
 			Page page,
 			OffsetNumber offnum)
 {
-	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
-	int			i;
 
 	Assert(_bt_check_natts(rel, page, offnum));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
 	 * --- see NOTE above.
+	 *
+	 * A minus infinity key has all attributes truncated away, so this test is
+	 * redundant with the minus infinity attribute tie-breaker.  However, the
+	 * number of attributes in minus infinity tuples was not explicitly
+	 * represented as 0 until PostgreSQL v11, so an explicit offnum test is
+	 * still required.
 	 */
 	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	return _bt_tuple_compare(rel, keysz, scankey, scantid, itup);
+}
+
+/*----------
+ *	_bt_tuple_compare() -- Compare scankey to a particular tuple.
+ *
+ * The passed scankey must be an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ *	keysz: number of key conditions to be checked (might be less than the
+ *		number of index columns!)
+ *	itup: index tuple to be compared to.
+ *
+ *		This routine returns:
+ *			<0 if scankey < itup;
+ *			 0 if scankey == itup;
+ *			>0 if scankey > itup.
+ *		NULLs in the keys are treated as sortable values.  Therefore
+ *		"equality" does not necessarily mean that the item should be
+ *		returned to the caller as a matching key!
+ *----------
+ */
+int32
+_bt_tuple_compare(Relation rel,
+				  int keysz,
+				  ScanKey scankey,
+				  ItemPointer scantid,
+				  IndexTuple itup)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	ItemPointer	heapTid;
+	int			ntupatts;
+	int			ncmpkey;
+	int			i;
+
+	Assert(keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -459,7 +499,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	ncmpkey = Min(ntupatts, keysz);
+	for (i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -510,8 +551,31 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * Use the number of attributes as a tie-breaker, in order to treat
+	 * truncated attributes in index as minus infinity.
+	 */
+	if (keysz > ntupatts)
+		return 1;
+
+	/* If caller provided no heap TID tie-breaker for scan, they're equal */
+	if (!scantid)
+		return 0;
+
+	/*
+	 * Although it isn't counted as an attribute by BTreeTupleGetNAtts(), heap
+	 * TID is an implicit final key attribute that ensures that all index
+	 * tuples have a distinct set of key attribute values.
+	 *
+	 * This is often truncated away in pivot tuples, which makes the attribute
+	 * value implicitly negative infinity.
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (!heapTid)
+		return 1;
+
+	/* Deliberately invert the order, since TIDs "sort DESC" */
+	return ItemPointerCompare(heapTid, scantid);
 }
 
 /*
@@ -540,6 +604,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	Relation	rel = scan->indexRelation;
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	Buffer		buf;
+	BTPageOpaque opaque;
 	BTStack		stack;
 	OffsetNumber offnum;
 	StrategyNumber strat;
@@ -547,6 +612,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	bool		goback;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
 	ScanKeyData scankeys[INDEX_MAX_KEYS];
+	ItemPointer scantid;
+	ItemPointerData minscantid;
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -796,6 +863,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * scankeys[] array, using the keys identified by startKeys[].
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
+	scantid = NULL;
 	for (i = 0; i < keysCount; i++)
 	{
 		ScanKey		cur = startKeys[i];
@@ -932,6 +1000,30 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		}
 	}
 
+	/*
+	 * When all key attributes will be in insertion scankey, manufacture
+	 * sentinel scan tid that's less than any possible heap TID in the index.
+	 * This is still greater than minus infinity to _bt_compare, allowing
+	 * _bt_search to follow a downlink with scankey-equal attributes, but a
+	 * truncated-away heap TID.
+	 *
+	 * If we didn't do this then affected index scans would have to
+	 * unnecessarily visit an extra page before moving right to the page they
+	 * should have landed on from the parent in the first place.
+	 *
+	 * (Note that hard-coding this behavior into _bt_compare is
+	 * unworkable, since some _bt_search callers need to re-find a leaf page
+	 * using the page's high key.)
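+	 *
+	 * As an illustration (hypothetical two column index on (a, b)): an
+	 * equality scan with a = 1 AND b = 2 arrives here with keysCount == 2,
+	 * and so searches with the sentinel scan tid.  A pivot tuple whose user
+	 * attributes also equal (1, 2) but whose heap TID was truncated away
+	 * compares as less than the scan key, so the descent follows that
+	 * pivot's downlink directly, rather than landing on the page to its
+	 * left and then moving right at the leaf level.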
+	 */
+	if (keysCount >= IndexRelationGetNumberOfKeyAttributes(rel))
+	{
+		scantid = &minscantid;
+
+		/* Heap TID attribute uses DESC ordering */
+		ItemPointerSetBlockNumber(scantid, InvalidBlockNumber);
+		ItemPointerSetOffsetNumber(scantid, InvalidOffsetNumber);
+	}
+
 	/*----------
 	 * Examine the selected initial-positioning strategy to determine exactly
 	 * where we need to start the scan, and set flag variables to control the
@@ -1024,11 +1116,11 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 
 	/*
-	 * Use the manufactured insertion scan key to descend the tree and
-	 * position ourselves on the target leaf page.
+	 * Use the manufactured insertion scan key (and possibly a scantid) to
+	 * descend the tree and position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, keysCount, scankeys, scantid, nextkey, &buf,
+					   BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
@@ -1057,7 +1149,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(BufferGetPage(buf));
+	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, scantid,
+						 P_FIRSTDATAKEY(opaque), nextkey);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..24bae8a454 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -796,8 +796,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -880,19 +878,28 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
-			IndexTuple	truncated;
-			Size		truncsz;
+			IndexTuple		lastleft;
+			IndexTuple		truncated;
+			Size			truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
-			 * in internal pages are either negative infinity items, or get
-			 * their contents from copying from one level down.  See also:
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks in
+			 * internal pages are either negative infinity items, or get their
+			 * contents by copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it more
+			 * likely that _bt_suffix_truncate() can truncate away more
+			 * attributes, whereas the split point passed to _bt_split() is
+			 * chosen much more delicately.  Suffix truncation is mostly useful
+			 * because it can greatly improve space utilization for workloads
+			 * with random insertions.  It doesn't seem worthwhile to add
+			 * complex logic for choosing a split point here for a benefit that
+			 * is bound to be much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
@@ -905,7 +912,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_suffix_truncate(wstate->index, lastleft, oitup);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -924,8 +934,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -970,7 +981,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1029,8 +1040,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1127,6 +1139,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1134,7 +1148,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1151,6 +1164,22 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all keys
+				 * in the index are physically unique.
+				 *
+				 * Deliberately invert the order, since TIDs "sort DESC".
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup2->t_tid, &itup->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 4528e87c83..4c46decb1d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,6 +49,10 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_leave_natts(Relation rel, IndexTuple lastleft,
+						   IndexTuple firstright);
+static void _bt_set_median_tid(ItemPointer lastleft, ItemPointer firstright,
+							   ItemPointer pivotheaptid);
 
 
 /*
@@ -56,27 +60,34 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().
+ *		The result is intended for use with _bt_compare().  If itup has
+ *		undergone suffix truncation of key attributes, caller had better
+ *		pass BTreeTupleGetNAtts(itup, rel) as keysz to routines like
+ *		_bt_search() and _bt_compare() when using returned scan key.  This
+ *		allows truncated attributes to participate in comparisons (truncated
+ *		attributes have implicit negative infinity values).  Note that
+ *		_bt_compare() never treats a scan key as containing negative
+ *		infinity attributes.
  */
 ScanKey
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
 	ScanKey		skey;
 	TupleDesc	itupdesc;
+	int			tupnatts;
 	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts > 0);
+	Assert(tupnatts <= indnatts);
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
@@ -96,7 +107,21 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Key attributes that were truncated away by suffix truncation are
+		 * not represented in the index tuple.  Scan key entries built for
+		 * truncated attributes are defensively represented as NULL values,
+		 * though they should
+		 * still not be allowed to participate in comparisons (caller must
+		 * be sure to pass a sane keysz to _bt_compare()).
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -2083,38 +2108,218 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_suffix_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  That's possible whenever some
+ * attribute in firstright is unequal to the corresponding attribute in
+ * lastleft (unequal according to an insertion scan key comparison); every
+ * attribute that follows it can be truncated away.
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that's larger than
+ * firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * Note that returned tuple's t_tid offset will hold the number of
+ * attributes present, so the original item pointer offset is not
+ * represented.  Caller should only change truncated tuple's downlink.  Note
+ * also that truncated key attributes are treated as containing "minus
+ * infinity" values by _bt_compare()/_bt_tuple_compare().  Returned tuple is
+ * guaranteed to be no larger than the original plus some extra space for a
+ * possible extra heap TID tie-breaker attribute, which is important for
+ * staying under the 1/3 of a page restriction on tuple size.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_suffix_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc		itupdesc = RelationGetDescr(rel);
+	int16			natts = IndexRelationGetNumberOfAttributes(rel);
+	int16			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int				leavenatts;
+	IndexTuple		pivot;
+	ItemPointer		pivotheaptid;
+	Size			newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples, which must have non-key
+	 * attributes in the case of INCLUDE indexes.  It's never okay to truncate
+	 * a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be left behind */
+	leavenatts = _bt_leave_natts(rel, lastleft, firstright);
 
-	return truncated;
+	if (leavenatts <= natts)
+	{
+		IndexTuple		tidpivot;
+
+		/*
+		 * Truncate away non-key attributes and/or key attributes.  Do a
+		 * straight copy in the case where the only attribute to be "truncated
+		 * away" is the implicit heap TID key attribute (i.e. the case where we
+		 * can at least avoid adding an explicit heap TID attribute to new
+		 * pivot).  We should only call index_truncate_tuple() when user
+		 * attributes need to be truncated.
+		 */
+		if (leavenatts < natts)
+			pivot = index_truncate_tuple(itupdesc, firstright, leavenatts);
+		else
+			pivot = CopyIndexTuple(firstright);
+
+		/*
+		 * If there is a distinguishing key attribute within leavenatts, there
+		 * is no need to add an explicit heap TID attribute to new pivot.
+		 */
+		if (leavenatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, leavenatts);
+			return pivot;
+		}
+
+		/*
+		 * Only non-key attributes could be truncated away from an INCLUDE
+		 * index's pivot tuple.  They are not considered part of the key space,
+		 * so it's still necessary to add a heap TID attribute to the new pivot
+		 * tuple.  Create an enlarged copy of the truncated firstright tuple,
+		 * with room for the heap TID.
+		 */
+		Assert(natts > nkeyatts);
+		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since attributes are all equal.  It's
+		 * necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * By here, pivot has been enlarged with room for a heap TID at the end;
+	 * update its stored size accordingly.  We must
+	 * use heap TID as a unique-ifier in new pivot tuple, since no user key
+	 * attribute distinguishes which values belong on each side of the split
+	 * point.
+	 */
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/*
+	 * Generate an artificial heap TID value for the new pivot tuple.  This
+	 * will be the median of the left and right heap TIDs, or a close
+	 * approximation.
+	 *
+	 * Note that we deliberately pass the firstright heap TID as low and the
+	 * lastleft heap TID as high, since the implicit heap TID attribute has
+	 * DESC sort order.
+	 *
+	 * Lehman and Yao require that the downlink to the right page, which is
+	 * to be inserted into the parent page in the second phase of a page
+	 * split, be a strict lower bound on all current and future items on the
+	 * right page (this will be copied from the new high key for the left
+	 * side of the split).  New pivot's heap TID attribute may occasionally
+	 * be equal to the lastleft heap TID, but it must never be equal to
+	 * firstright's heap TID.
+	 */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  MAXALIGN(sizeof(ItemPointerData)));
+	_bt_set_median_tid(&firstright->t_tid, &lastleft->t_tid, pivotheaptid);
+	Assert(ItemPointerCompare(&lastleft->t_tid, pivotheaptid) >= 0);
+	Assert(ItemPointerCompare(&firstright->t_tid, pivotheaptid) < 0);
+
+	/* Mark tuple as containing all key attributes, plus TID attribute */
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
+
+/*
+ * _bt_leave_natts - how many key attributes to leave when truncating.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			leavenatts;
+	ScanKey		skey;
+
+	skey = _bt_mkscankey(rel, firstright);
+
+	/*
+	 * Even test nkeyatts (no truncated user attributes) case, since caller
+	 * cares about whether or not it can avoid appending a heap TID as a
+	 * unique-ifier
+	 */
+	leavenatts = 1;
+	for(;;)
+	{
+		if (leavenatts > nkeyatts)
+			break;
+		if (_bt_tuple_compare(rel, leavenatts, skey, NULL, lastleft) > 0)
+			break;
+		leavenatts++;
+	}
+
+	/* Can't leak memory here */
+	_bt_freeskey(skey);
+
+	return leavenatts;
+}
+
+/*
+ * _bt_set_median_tid - sets item pointer to median TID value.
+ */
+static void
+_bt_set_median_tid(ItemPointer low, ItemPointer high,
+				   ItemPointer pivotheaptid)
+{
+	uint64		lowblock, highblock, medianblock;
+	uint32		lowoffset, highoffset, medianoffset;
+
+	Assert(ItemPointerCompare(low, high) < 0);
+
+	lowblock = ItemPointerGetBlockNumber(low);
+	highblock = ItemPointerGetBlockNumber(high);
+
+	lowoffset = ItemPointerGetOffsetNumber(low);
+	highoffset = ItemPointerGetOffsetNumber(high);
+
+	medianblock = (lowblock + highblock) / 2;
+	if (medianblock >= highblock)
+	{
+		/*
+		 * Both TIDs are in the same block.  Use high's offset, so that the
+		 * result cannot equal low.
+		 */
+		medianblock = highblock;
+		medianoffset = highoffset;
+	}
+	else
+		medianoffset = Max(lowoffset, highoffset) + 1;
+
+	ItemPointerSetBlockNumber(pivotheaptid, medianblock);
+	ItemPointerSetOffsetNumber(pivotheaptid, medianoffset);
 }
 
 /*
@@ -2137,6 +2342,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2156,6 +2362,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
@@ -2165,7 +2372,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2176,7 +2383,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			Assert(!P_RIGHTMOST(opaque));
 
 			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2209,7 +2416,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Tuple contains only key attributes despite on is it page high
 			 * key or not
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 
 	}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 67a94cb80a..7c061e96d2 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -284,6 +268,8 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		OffsetNumber off;
 		IndexTuple	newitem = NULL;
 		Size		newitemsz = 0;
+		IndexTuple	left_hikey = NULL;
+		Size		left_hikeysz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 5c4457179d..667c906b2e 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index dfbda5458f..ffeb0624fe 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -854,10 +854,8 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
 	 * PageIndexTupleDelete is the best way.  Delete the items in reverse
 	 * order so we don't have to think about adjusting item numbers for
 	 * previous deletions.
-	 *
-	 * TODO: tune the magic number here
 	 */
-	if (nitems <= 2)
+	if (nitems <= 7)
 	{
 		while (--nitems >= 0)
 			PageIndexTupleDelete(page, itemnos[nitems]);
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 9fb33b9035..2a0b64ad47 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4057,23 +4057,26 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required
+	 * for btree indexes, since heap TID is treated as an implicit last
+	 * key attribute in order to ensure that all keys in the index are
+	 * physically unique.
+	 *
+	 * Deliberately invert the order, since TIDs "sort DESC".
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
 		BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
 
 		if (blk1 != blk2)
-			return (blk1 < blk2) ? -1 : 1;
+			return (blk1 < blk2) ? 1 : -1;
 	}
 	{
 		OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
 		OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
 
 		if (pos1 != pos2)
-			return (pos1 < pos2) ? -1 : 1;
+			return (pos1 < pos2) ? 1 : -1;
 	}
 
 	return 0;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 04ecb4cbc0..b5f46661bd 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -114,16 +114,27 @@ typedef struct BTMetaPageData
 
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
-#define BTREE_MIN_VERSION	2	/* minimal supported version number */
+/* FIXME: Support versions 2 and 3 for the benefit of pg_upgrade users */
+#define BTREE_VERSION	4		/* current version number */
+#define BTREE_MIN_VERSION	4	/* minimal supported version number */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_suffix_truncate() will need to enlarge
+ * a heap index tuple to make space for a tie-breaker heap TID
+ * attribute, which we account for here.
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*MAXALIGN(sizeof(ItemPointerData))) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeOld(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
@@ -204,12 +215,11 @@ typedef struct BTMetaPageData
  * real offset (downlinks only need to store a block number).  The offset
  * field only stores the number of attributes when the INDEX_ALT_TID_MASK
  * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * number of attributes).  INDEX_ALT_TID_MASK is only used for pivot tuples
+ * at present, though it's possible that it will be used within non-pivot
+ * tuples in the future.  Do not assume that a tuple with INDEX_ALT_TID_MASK
+ * set must be a pivot tuple.  A pivot tuple must have INDEX_ALT_TID_MASK set
+ * as of BTREE_VERSION 4, however.
  *
  * The 12 least significant offset bits are used to represent the number of
  * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
@@ -219,6 +229,8 @@ typedef struct BTMetaPageData
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+/* Reserved to indicate if heap TID is represented at end of tuple */
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +253,15 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +270,42 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
+ * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
+ * (since all pivot tuples must as of BTREE_VERSION 4).  When non-pivot
+ * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
+ * probably also contain a heap TID at the end of the tuple.  We avoid
+ * assuming that a tuple with INDEX_ALT_TID_MASK set is necessarily a pivot
+ * tuple.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   MAXALIGN(sizeof(ItemPointerData))) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -560,15 +605,18 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
  * prototypes for functions in nbtsearch.c
  */
 extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
+		   int keysz, ScanKey scankey, ItemPointer scantid, bool nextkey,
 		   Buffer *bufP, int access, Snapshot snapshot);
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
+			  ScanKey scankey, ItemPointer scantid, bool nextkey,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
 extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
+			ScanKey scankey, ItemPointer scantid, OffsetNumber low,
+			bool nextkey);
 extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+			ItemPointer scantid, Page page, OffsetNumber offnum);
+extern int32 _bt_tuple_compare(Relation rel, int keysz, ScanKey scankey,
+							   ItemPointer scantid, IndexTuple itup);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -601,7 +649,8 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
+extern IndexTuple _bt_suffix_truncate(Relation rel, IndexTuple lastleft,
+									  IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
 
 /*
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 819373031c..d0fd4229d2 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50: unused */
+/* 0x60: unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -82,20 +82,16 @@ typedef struct xl_btree_insert
  *
  * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
  * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always explicitly log the left page high key value.
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * In the _R variants, the new item is one of the right page's tuples.  An
+ * IndexTuple representing the HIKEY of the left page always follows.  (We
+ * can no longer assume that it matches the leftmost key on the new right
+ * page, since suffix truncation may have made the two differ.)
  *
  * Backup Blk 1: new right page
  *
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index dc6262be43..2c20cea4b9 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5896,8 +5896,8 @@ inner join j1 j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
 where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1;
  id1 | id2 | id1 | id2 
 -----+-----+-----+-----
-   1 |   1 |   1 |   1
    1 |   2 |   1 |   2
+   1 |   1 |   1 |   1
 (2 rows)
 
 reset enable_nestloop;
-- 
2.17.1

#13 Andrey Lepikhov
a.lepikhov@postgrespro.ru
In reply to: Peter Geoghegan (#12)
2 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

I use the v3 version of the patch for Retail Indextuple Deletion, and
from time to time I catch a regression test error (see attachment).
As I see in regression.diff, the problem is an unstable order of DROP
... CASCADE deletions.
Most frequently I get the error on a test called 'updatable views'.
I check the nbtree invariants during all tests, but the index relations
are in a consistent state the whole time.
My hypothesis is: an unstable order of logical duplicates in the indexes
on the pg_depend relation.
But the 'updatable views' test doesn't contain any sources of
instability: concurrent insertions, updates, vacuum and so on. This fact
discourages me.
Maybe you have some ideas on this problem?

18.07.2018 00:21, Peter Geoghegan wrote:

Attached is my v3, which has some significant improvements:

* The hinting for unique index inserters within _bt_findinsertloc()
has been restored, more or less.

* Bug fix for case where left side of split comes from tuple being
inserted. We need to pass this to _bt_suffix_truncate() as the left
side of the split, which we previously failed to do. The amcheck
coverage I've added allowed me to catch this issue during a benchmark.
(I use amcheck during benchmarks to get some amount of stress-testing
in.)

* New performance optimization that allows us to descend a downlink
when its user-visible attributes have scankey-equal values. We avoid
an unnecessary move left by using a sentinel scan tid that's less than
any possible real heap TID, but still greater than minus infinity to
_bt_compare().
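
To make the comparison rule concrete, here is a minimal standalone sketch of
the ordering being described. It is not code from the patch; TidKey and
tidkey_compare() are invented for illustration, and the DESC treatment of
real heap TIDs is deliberately ignored. The point is only that a truncated
pivot attribute behaves as minus infinity, while the sentinel scan TID sorts
above that but below any real heap TID:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the TID component of an insertion scan key */
typedef struct TidKey
{
	bool		minus_infinity;	/* attribute truncated away in a pivot */
	bool		sentinel;		/* "less than any real TID" scan value */
	uint32_t	block;
	uint16_t	offset;
} TidKey;

/* Three-way comparison of a scan-side TID key against a pivot's TID key */
static int
tidkey_compare(const TidKey *scan, const TidKey *pivot)
{
	/* A truncated pivot attribute sorts below everything, even the sentinel */
	if (pivot->minus_infinity)
		return 1;
	/* The sentinel sorts below any real heap TID that a pivot might carry */
	if (scan->sentinel)
		return -1;
	/* Otherwise compare raw TIDs (ignoring the patch's DESC TID order here) */
	if (scan->block != pivot->block)
		return (scan->block < pivot->block) ? -1 : 1;
	if (scan->offset != pivot->offset)
		return (scan->offset < pivot->offset) ? -1 : 1;
	return 0;
}

int
main(void)
{
	TidKey		sentinel = {.sentinel = true};
	TidKey		truncated = {.minus_infinity = true};
	TidKey		real = {.block = 42, .offset = 7};

	/* > 0: descend the downlink without an unnecessary move left */
	printf("sentinel vs truncated pivot TID: %d\n",
		   tidkey_compare(&sentinel, &truncated));
	/* < 0: still stay to the left of a pivot that kept a real heap TID */
	printf("sentinel vs real pivot TID:      %d\n",
		   tidkey_compare(&sentinel, &real));
	return 0;
}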

I am now considering pursuing this as a project in its own right,
which can be justified without being part of some larger effort to add
retail index tuple deletion (e.g. by VACUUM). I think that I can get
it to the point of being a totally unambiguous win, if I haven't
already. So, this patch is no longer just an interesting prototype of
a new architectural direction we should take. In any case, it has far
fewer problems than v2.

Testing the performance characteristics of this patch has proven
difficult. My home server seems to show a nice win with a pgbench
workload that uses a Gaussian distribution for the pgbench_accounts
queries (script attached). That seems consistent and reproducible. My
home server has 32GB of RAM, and has a Samsung SSD 850 EVO SSD, with a
250GB capacity. With shared_buffers set to 12GB, 80 minute runs at
scale 4800 look like this:

Master:

25 clients:
tps = 15134.223357 (excluding connections establishing)

50 clients:
tps = 13708.419887 (excluding connections establishing)

75 clients:
tps = 12951.286926 (excluding connections establishing)

90 clients:
tps = 12057.852088 (excluding connections establishing)

Patch:

25 clients:
tps = 17857.863353 (excluding connections establishing)

50 clients:
tps = 14319.514825 (excluding connections establishing)

75 clients:
tps = 14015.794005 (excluding connections establishing)

90 clients:
tps = 12495.683053 (excluding connections establishing)

I ran this twice, and got pretty consistent results each time (there
were many other benchmarks on my home server -- this was the only one
that tested this exact patch, though). Note that there was only one
pgbench initialization for each set of runs. It looks like a pretty
strong result for the patch - note that the accounts table is about
twice the size of available main memory. The server is pretty well
overloaded in every individual run.

Unfortunately, I have a hard time showing much of any improvement on a
storage-optimized AWS instance with EBS storage, with scaled up
pgbench scale and main memory. I'm using an i3.4xlarge, which has 16
vCPUs, 122 GiB RAM, and 2 SSDs in a software RAID0 configuration. It
appears to more or less make no overall difference there, for reasons
that I have yet to get to the bottom of. I conceived this AWS
benchmark as something that would have far longer run times with a
scaled-up database size. My expectation was that it would confirm the
preliminary result, but it hasn't.

Maybe the issue is that it's far harder to fill the I/O queue on this
AWS instance? Or perhaps it's related to the higher latency of EBS,
compared to the local SSD on my home server? I would welcome any ideas
about how to benchmark the patch. It doesn't necessarily have to be a
huge win for a very generic workload like the one I've tested, since
it would probably still be enough of a win for things like free space
management in secondary indexes [1]. Plus, of course, it seems likely
that we're going to eventually add retail index tuple deletion in some
form or another, which this is prerequisite to.

For a project like this, I expect an unambiguous, across the board win
from the committed patch, even if it isn't a huge win. I'm encouraged
by the fact that this is starting to look credible as a
stand-alone patch, but I have to admit that there's probably still
significant gaps in my understanding of how it affects real-world
performance. I don't have a lot of recent experience with benchmarking
workloads like this one.

[1] /messages/by-id/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

Attachments:

collate_bug.tar.gz (application/gzip)
updatable_views.tar.gz (application/gzip)
In reply to: Andrey Lepikhov (#13)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Wed, Aug 1, 2018 at 9:48 PM, Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

I use the v3 version of the patch for Retail Indextuple Deletion, and from time
to time I catch a regression test error (see attachment).
As I see in regression.diff, the problem is an unstable order of DROP ...
CASCADE deletions.
Most frequently I get the error on a test called 'updatable views'.
I check the nbtree invariants during all tests, but the index relations are in
a consistent state the whole time.
My hypothesis is: an unstable order of logical duplicates in the indexes on
the pg_depend relation.
But the 'updatable views' test doesn't contain any sources of instability:
concurrent insertions, updates, vacuum and so on. This fact discourages me.
Maybe you have some ideas on this problem?

It's caused by an implicit dependency on the order of items in an
index. See /messages/by-id/20180504022601.fflymidf7eoencb2@alvherre.pgsql.

I've been making "\set VERBOSITY terse" changes like this whenever it
happens in a new place. It seems to have finally stopped happening.
Note that this is a preexisting issue; there are already places in the
regression tests where we paper over the problem in a similar way. I
notice that it tends to happen when the machine running the regression
tests is heavily loaded.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#12)
1 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Attached is v4. I have two goals in mind for this revision, goals that
are of great significance to the project as a whole:

* Making better choices around leaf page split points, in order to
maximize suffix truncation and thereby maximize fan-out. This is
important when there are mostly-distinct index tuples on each leaf
page (i.e. most of the time). Maximizing the effectiveness of suffix
truncation needs to be weighed against the existing/main
consideration: evenly distributing space among each half of a page
split. This is tricky.

* Not regressing the logic that lets us pack leaf pages full when
there are a great many logical duplicates. That is, I still want to
get the behavior I described on the '"Write amplification" is made
worse by "getting tired" while inserting into nbtree secondary
indexes' thread [1].  This is not something that happens as a
consequence of thinking about suffix truncation specifically, and
seems like a fairly distinct thing to me. It's actually a bit similar
to the rightmost 90/10 page split case.

v4 adds significant new logic to make us do better on the first goal,
without hurting the second goal. It's easy to regress one while
focussing on the other, so I've leaned on a custom test suite
throughout development. Previous versions mostly got the first goal
wrong, but got the second goal right. For the time being, I'm
focussing on index size, on the assumption that I'll be able to
demonstrate a nice improvement in throughput or latency later. I can
get the main TPC-C order_line pkey about 7% smaller after an initial
bulk load with the new logic added to get the first goal (note that
the benefits with a fresh CREATE INDEX are close to zero). The index
is significantly smaller, even though the internal page index tuples
can themselves never be any smaller due to alignment -- this is all
about not restricting what can go on each leaf page by too much. 7% is
not as dramatic as the "get tired" case, which saw something like a
50% decrease in bloat for one pathological case, but it's still
clearly well worth having. The order_line primary key is the largest
TPC-C index, and I'm merely doing a standard bulk load to get this 7%
shrinkage. The TPC-C order_line primary key happens to be kind of
adversarial or pathological to B-Tree space management in general, but
it's still fairly realistic.

For the first goal, page splits now weigh what I've called the
"distance" between tuples, with a view to getting the most
discriminating split point -- the leaf split point that maximizes the
effectiveness of suffix truncation, within a range of acceptable split
points (acceptable from the point of view of not implying a lopsided
page split). This is based on probing IndexTuple contents naively when
deciding on a split point, without regard for the underlying
opclass/types. We mostly just use char integer comparisons to probe,
on the assumption that that's a good enough proxy for using real
insertion scankey comparisons (only actual truncation goes to those
lengths, since that's a strict matter of correctness). This distance
business might be considered a bit iffy by some, so I want to get
early feedback. This new "distance" code clearly needs more work, but
I felt that I'd gone too long without posting a new version.
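
As a rough illustration of that kind of naive probing (this is not the
patch's split-point code; first_difference() and the sample keys are
invented for the example), a byte-wise scorer could look something like the
sketch below. An earlier first difference between the would-be lastleft and
firstright tuples suggests a more discriminating split point, and therefore
more scope for suffix truncation:

#include <stddef.h>
#include <stdio.h>

/*
 * Score a candidate split point by finding the first byte at which the
 * would-be lastleft and firstright key bytes differ.  Attribute boundaries,
 * NULLs, and opclass semantics are all ignored, just as the "naive" probing
 * described above ignores them.
 */
static size_t
first_difference(const unsigned char *lastleft, size_t leftlen,
				 const unsigned char *firstright, size_t rightlen)
{
	size_t		minlen = (leftlen < rightlen) ? leftlen : rightlen;
	size_t		i;

	for (i = 0; i < minlen; i++)
	{
		/* plain "char integer" comparison, as a proxy for real comparisons */
		if (lastleft[i] != firstright[i])
			return i;
	}
	return minlen;				/* one key is a prefix of the other */
}

int
main(void)
{
	/* hypothetical key bytes surrounding two candidate split points */
	const unsigned char a1[] = "aaaa0001", a2[] = "aaaa0002";	/* differ late */
	const unsigned char b1[] = "aaaa0002", b2[] = "bbbb0001";	/* differ early */

	printf("candidate A: first difference at byte %zu\n",
		   first_difference(a1, sizeof(a1) - 1, a2, sizeof(a2) - 1));
	printf("candidate B: first difference at byte %zu\n",
		   first_difference(b1, sizeof(b1) - 1, b2, sizeof(b2) - 1));

	/*
	 * Candidate B differs earlier, so it scores as more "distant"; other
	 * things being equal, it is the more attractive split point, because a
	 * shorter prefix is enough to distinguish the two halves.
	 */
	return 0;
}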

For the second goal, I've added a new macro that can be enabled for
debugging purposes. This has the implementation sort heap TIDs in ASC
order, rather than DESC order. This nicely demonstrates how my two
goals for v4 are fairly independent; uncommenting "#define
BTREE_ASC_HEAP_TID" will cause a huge regression with cases where many
duplicates are inserted, but won't regress things like the TPC-C
indexes. (Note that BTREE_ASC_HEAP_TID will break the regression
tests, though in a benign way that can safely be ignored.)
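
The effect of such a compile-time switch can be modeled with a trivial
comparator whose direction flips when the macro is defined. This is only a
sketch under that assumption -- BTREE_ASC_HEAP_TID is the patch's macro
name, but SimpleTid and tid_tiebreak() below are invented:

#include <stdint.h>
#include <stdio.h>

/* #define BTREE_ASC_HEAP_TID */	/* uncomment to get the debug ordering */

/* Invented, simplified stand-in for an ItemPointer */
typedef struct SimpleTid
{
	uint32_t	block;
	uint16_t	offset;
} SimpleTid;

static int
tid_tiebreak(const SimpleTid *a, const SimpleTid *b)
{
	int			asc = 0;

	if (a->block != b->block)
		asc = (a->block < b->block) ? -1 : 1;
	else if (a->offset != b->offset)
		asc = (a->offset < b->offset) ? -1 : 1;

#ifdef BTREE_ASC_HEAP_TID
	return asc;					/* debug ordering: later TIDs sort higher */
#else
	return -asc;				/* default: later TIDs sort lower (DESC) */
#endif
}

int
main(void)
{
	SimpleTid	older = {10, 3};
	SimpleTid	newer = {99, 1};

	/* -1 with DESC ordering: the newer TID sorts before the older one */
	printf("tie-break(newer, older) = %d\n", tid_tiebreak(&newer, &older));
	return 0;
}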

Open items:

* Do more traditional benchmarking.

* Add pg_upgrade support.

* Simplify _bt_findsplitloc() logic.

[1]: /messages/by-id/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
--
Peter Geoghegan

Attachments:

v4-0001-Make-all-nbtree-index-tuples-have-unique-keys.patch (application/octet-stream)
From 040e66e6de8fbd1416eff46f1946597fab44a265 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v4 1/2] Make all nbtree index tuples have unique keys.

Make nbtree treat all index tuples as having a heap TID trailing
attribute.  Heap TID becomes a first class part of the key space on all
levels of the tree.  Index searches can distinguish duplicates by heap
TID, though for now this is only used by insertions that need to find a
leaf page to insert a tuple on.  This general approach has numerous
benefits for performance, and may enable a later enhancement that has
nbtree vacuuming perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is also introduced.  This will usually truncate away the "extra"
heap TID attribute from pivot tuples during a leaf page split, and may
also truncate away additional user attributes.  This can increase
fan-out when there are multiple indexed attributes, though this is of
secondary importance.  Truncation can only occur at the attribute
granularity, which isn't particularly effective, but works well enough
for now.

We completely remove the logic that allows a search for free space among
multiple pages full of duplicates to "get tired".  This has significant
benefits for free space management in secondary indexes on low
cardinality attributes.  Unique checking still has to start with the
first page that its heap-TID-free insertion scan key leads it to, though
insertion can then quickly find the leaf page and offset its new tuple
unambiguously belongs at (in the unique case there will rarely be
multiple pages full of duplicates, so being unable to descend the tree
to directly find the insertion target leaf page will seldom be much of a
problem).

Note that this version of the patch doesn't yet deal with on-disk
compatibility issues.  That will follow in a later revision.
---
 contrib/amcheck/verify_nbtree.c               | 258 ++++--
 contrib/pageinspect/expected/btree.out        |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out  |  10 +-
 src/backend/access/nbtree/README              | 114 ++-
 src/backend/access/nbtree/nbtinsert.c         | 815 ++++++++++++++----
 src/backend/access/nbtree/nbtpage.c           |   8 +-
 src/backend/access/nbtree/nbtsearch.c         | 197 ++++-
 src/backend/access/nbtree/nbtsort.c           |  66 +-
 src/backend/access/nbtree/nbtutils.c          | 233 ++++-
 src/backend/access/nbtree/nbtxlog.c           |  41 +-
 src/backend/access/rmgrdesc/nbtdesc.c         |   8 -
 src/backend/storage/page/bufpage.c            |   4 +-
 src/backend/utils/sort/tuplesort.c            |  20 +-
 src/include/access/nbtree.h                   | 101 ++-
 src/include/access/nbtxlog.h                  |  19 +-
 src/test/regress/expected/domain.out          |   4 +-
 src/test/regress/expected/foreign_key.out     |   4 +-
 src/test/regress/expected/join.out            |   2 +-
 src/test/regress/expected/truncate.out        |   5 +-
 src/test/regress/expected/typed_table.out     |  11 +-
 src/test/regress/expected/updatable_views.out |  18 +-
 src/test/regress/sql/domain.sql               |   2 +
 src/test/regress/sql/foreign_key.sql          |   2 +
 src/test/regress/sql/truncate.sql             |   2 +
 src/test/regress/sql/typed_table.sql          |   2 +
 src/test/regress/sql/updatable_views.sql      |   2 +
 26 files changed, 1465 insertions(+), 485 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index a1438a2855..15b527c7e3 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -45,6 +45,13 @@ PG_MODULE_MAGIC;
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
 
+/*
+ * Convenience macro to get number of key attributes in tuple in low-context
+ * fashion
+ */
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
+
 /*
  * State associated with verifying a B-Tree index
  *
@@ -125,26 +132,30 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
+static IndexTuple bt_right_page_check_tuple(BtreeCheckState *state);
 static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+				  ScanKey targetkey, ItemPointer scantid, int tupnkeyatts);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
 						  bool tupleIsAlive, void *checkstate);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
-static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+static inline bool invariant_l_offset(BtreeCheckState *state,
+					 int tupnkeyatts, ScanKey key, ItemPointer scantid,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
+static inline bool invariant_leq_offset(BtreeCheckState *state,
+					 int tupnkeyatts, ScanKey key, ItemPointer scantid,
+					 OffsetNumber upperbound);
+static inline bool invariant_g_offset(BtreeCheckState *state,
+					 int tupnkeyatts, ScanKey key, ItemPointer scantid,
 					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
-							   OffsetNumber upperbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							   Page other, int tupnkeyatts, ScanKey key,
+							   ItemPointer scantid, OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+								  IndexTuple itup, bool isleaf);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -834,8 +845,10 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		int			tupnkeyatts;
+		ScanKey		skey;
+		ItemPointer scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -902,8 +915,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
-		/* Build insertion scankey for current page offset */
+		/*
+		 * Build insertion scankey for current page offset/tuple.
+		 *
+		 * As required by _bt_mkscankey(), track number of key attributes,
+		 * which is needed so that _bt_compare() calls handle truncated
+		 * attributes correctly.  Never count non-key attributes in
+		 * non-truncated tuples as key attributes, though.
+		 */
+		tupnkeyatts = BTreeTupleGetNKeyAtts(itup, state->rel);
 		skey = _bt_mkscankey(state->rel, itup);
+		scantid = BTreeTupleGetHeapTIDCareful(state, itup, P_ISLEAF(topaque));
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -930,7 +952,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 * and probably not markedly more effective in practice.
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!invariant_leq_offset(state, tupnkeyatts, skey, scantid, P_HIKEY))
 		{
 			char	   *itid,
 					   *htid;
@@ -956,11 +978,11 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, tupnkeyatts, skey, scantid,
+								OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1017,16 +1039,28 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
+			IndexTuple  righttup;
 			ScanKey		rightkey;
+			int			righttupnkeyatts;
+			ItemPointer rightscantid;
 
 			/* Get item in next/right page */
-			rightkey = bt_right_page_check_scankey(state);
+			righttup = bt_right_page_check_tuple(state);
 
-			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+			/* Set up right item scankey */
+			if (righttup)
+			{
+				righttupnkeyatts = BTreeTupleGetNKeyAtts(righttup, state->rel);
+				rightkey = _bt_mkscankey(state->rel, righttup);
+				rightscantid = BTreeTupleGetHeapTIDCareful(state, righttup,
+														   P_ISLEAF(topaque));
+			}
+
+			if (righttup && !invariant_g_offset(state, righttupnkeyatts,
+												rightkey, rightscantid, max))
 			{
 				/*
-				 * As explained at length in bt_right_page_check_scankey(),
+				 * As explained at length in bt_right_page_check_tuple(),
 				 * there is a known !readonly race that could account for
 				 * apparent violation of invariant, which we must check for
 				 * before actually proceeding with raising error.  Our canary
@@ -1069,7 +1103,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, childblock, skey, scantid, tupnkeyatts);
 		}
 	}
 
@@ -1083,9 +1117,9 @@ bt_target_page_check(BtreeCheckState *state)
 }
 
 /*
- * Return a scankey for an item on page to right of current target (or the
+ * Return an index tuple for an item on page to right of current target (or the
  * first non-ignorable page), sufficient to check ordering invariant on last
- * item in current target page.  Returned scankey relies on local memory
+ * item in current target page.  Returned tuple relies on local memory
  * allocated for the child page, which caller cannot pfree().  Caller's memory
  * context should be reset between calls here.
  *
@@ -1098,8 +1132,8 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
-bt_right_page_check_scankey(BtreeCheckState *state)
+static IndexTuple
+bt_right_page_check_tuple(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
@@ -1287,11 +1321,10 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	}
 
 	/*
-	 * Return first real item scankey.  Note that this relies on right page
-	 * memory remaining allocated.
+	 * Return first real item.  Note that this relies on right page memory
+	 * remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	return (IndexTuple) PageGetItem(rightpage, rightitem);
 }
 
 /*
@@ -1305,7 +1338,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  */
 static void
 bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+				  ScanKey targetkey, ItemPointer scantid, int tupnkeyatts)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1354,7 +1387,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1404,14 +1437,14 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		/*
 		 * Skip comparison of target page key against "negative infinity"
 		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * bound, but that's only because of the hard-coding for negative
+		 * infinity items within _bt_compare().
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_l_nontarget_offset(state, child, tupnkeyatts,
+										  targetkey, scantid, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1751,6 +1784,54 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+				   ItemPointer scantid, OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  upperbound);
+
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as meaning
+	 * that they're not participating in a search, not as negative infinity
+	 * (only tuples within the index are treated as negative infinity).
+	 * Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		 itemid;
+		IndexTuple   ritup;
+		int			 uppnkeyatts;
+		ItemPointer  rheaptid;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+
+		/* Get heap TID for item to the right */
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup,
+											   P_ISLEAF(topaque));
+
+		if (uppnkeyatts == tupnkeyatts)
+			return scantid == NULL && rheaptid != NULL;
+
+		return tupnkeyatts < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1759,57 +1840,93 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
-					 OffsetNumber upperbound)
+invariant_leq_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+					 ItemPointer scantid, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  upperbound);
 
 	return cmp <= 0;
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+				   ItemPointer scantid, OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	/*
+	 * No need to consider possibility that scankey has attributes that we need
+	 * to force to be interpreted as negative infinity, since scan key has to
+	 * be strictly greater than lower bound offset.
+	 */
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  lowerbound);
 
-	return cmp >= 0;
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e.  the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, Page nontarget,
+							 int tupnkeyatts, ScanKey key,
+							 ItemPointer scantid, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, nontarget,
+					  upperbound);
 
-	return cmp <= 0;
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as meaning
+	 * that they're not participating in a search, not as negative infinity
+	 * (only tuples within the index are treated as negative infinity).
+	 * Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		ItemId		 itemid;
+		IndexTuple   child;
+		int			 uppnkeyatts;
+		ItemPointer  childheaptid;
+		BTPageOpaque copaque;
+
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+
+		/* Get heap TID for item from child/non-target */
+		childheaptid =
+				BTreeTupleGetHeapTIDCareful(state, child, P_ISLEAF(copaque));
+
+		if (uppnkeyatts == tupnkeyatts)
+			return scantid == NULL && childheaptid != NULL;
+
+		return tupnkeyatts < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -1965,3 +2082,32 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
+ * be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ *
+ * Note that it is incorrect to specify the tuple as a non-pivot when passing a
+ * leaf tuple that came from the high key offset, since that is actually a
+ * pivot tuple.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	BlockNumber targetblock = state->targetblock;
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b..07c2dcd771 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d4..9920dbfd40 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89..dc6c65d201 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -34,30 +34,47 @@ Differences to the Lehman & Yao algorithm
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
+The requirement that all btree keys be unique is satisfied by treating
+heap TID as a tie-breaker attribute.  Logical duplicates are sorted in
+descending item pointer order.  We don't use btree keys to
+disambiguate downlinks from the internal pages during a page split,
+though: only one entry in the parent level will be pointing at the
+page we just split, so the link fields can be used to re-find
+downlinks in the parent via a linear search.
 
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
+Lehman and Yao require that the key range for a subtree S is described
+by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the
+parent page, but do not account for the need to search the tree based
+only on leading index attributes in a composite index.  Since heap TID
+is always used to make btree keys unique (even in unique indexes),
+every btree index is treated as a composite index internally.  A
+search that finds exact equality to a pivot tuple in an upper tree
+level must descend to the left of that key to ensure it finds any
+equal keys, even when scan values were provided for all attributes.
+An insertion that sees that the high key of its target page is equal
+to the key to be inserted cannot move right, since the downlink for
+the right sibling in the parent must always be strictly less than
+right sibling keys (this is always possible because the leftmost
+downlink on any non-leaf level is always a negative infinity
+downlink).
+
+We might be able to avoid moving left in the event of a full match on
+all attributes up to and including the heap TID attribute, but that
+would be a very narrow win, since it's rather unlikely that heap TID
+will be an exact match.  We can avoid moving left unnecessarily when
+all user-visible keys are equal by avoiding exact equality; a
+sentinel value that's less than any possible heap TID is used by most
+index scans.  This is effective because of suffix truncation.  An
+"extra" heap TID attribute in pivot tuples is almost always avoided.
+All truncated attributes compare as minus infinity, even against a
+sentinel value, and the sentinel value is less than any real TID
+value, so an unnecessary move to the left is avoided regardless of
+whether or not a heap TID is present in the otherwise-equal pivot
+tuple.  Consistently moving left on full equality is also needed by
+page deletion, which re-finds a leaf page by descending the tree while
+searching on the leaf page's high key.  If we wanted to avoid moving
+left without breaking page deletion, we'd have to avoid suffix
+truncation, which could never be worth it.
 
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
@@ -610,21 +627,25 @@ scanned to decide whether to return the entry and whether the scan can
 stop (see _bt_checkkeys()).
 
 We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+to heap tuples, but are used only for tree navigation.  Pivot tuples
+include all tuples on non-leaf pages and high keys on leaf pages.  Note
+that pivot index tuples are only used to represent which part of the key
+space belongs on each page, and can have attribute values copied from
+non-pivot tuples that were deleted and killed by VACUUM some time ago.
+
+We truncate away attributes that are not needed for a page high key during
+a leaf page split, provided that the remaining attributes distinguish the
+last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page.  A truncated tuple logically retains the truncated
+suffix key attributes, which implicitly have "negative infinity" as their
+value.  This optimization is called suffix truncation.  Since the high key
+is subsequently reused as the downlink in the parent page for the new
+right page, suffix truncation can increase index fan-out considerably by
+keeping pivot tuples short.  INCLUDE indexes are guaranteed to have
+non-key attributes truncated at the time of a leaf page split, but may
+also have some key attributes truncated away, based on the usual criteria
+for key attributes.
 
 Notes About Data Representation
 -------------------------------
@@ -658,4 +679,19 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
+
+Non-leaf pages only truly need to truncate their first item to zero
+attributes at the leftmost level, since only there is the item genuinely
+negative infinity.  All other negative infinity items are negative
+infinity only within the subtree that the page is the root of (or a
+leftmost page within).  We truncate away all attributes of the first item
+on non-leaf pages just the same, to save a little space.  If we ever
+avoided zero-truncating items on pages where that doesn't accurately
+represent the absolute separation of the keyspace, we'd be left with
+"low key" items on internal pages -- a key value that can be used as a
+lower bound on items on the page, much like the high key is an upper
+bound. (Actually, that would even be true of "true" negative infinity
+items.  One can think of rightmost pages as implicitly containing
+"positive infinity" high keys.)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 582e5b0652..2655f0dd84 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,12 +28,28 @@
 
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
+#define MAX_SPLIT_POINTS			50
+
+//#define DEBUG_SPLITS
+#ifdef DEBUG_SPLITS
+#include "catalog/catalog.h"
+#endif
+
+typedef struct
+{
+	/* FindSplitData candidate split */
+	int				delta;			/* size delta */
+	bool			newitemonleft;	/* new item on left or right of split */
+	OffsetNumber	firstright;		/* split point */
+} SplitPoint;
 
 typedef struct
 {
 	/* context data for _bt_checksplitloc */
 	Size		newitemsz;		/* size of new item to be inserted */
 	int			fillfactor;		/* needed when splitting rightmost page */
+	Page		page;			/* target page for split */
+	bool		is_duplicates;	/* T if called in many duplicates mode */
 	bool		is_leaf;		/* T if splitting a leaf page */
 	bool		is_rightmost;	/* T if splitting a rightmost page */
 	OffsetNumber newitemoff;	/* where the new item is to be inserted */
@@ -41,12 +57,9 @@ typedef struct
 	int			rightspace;		/* space available for items on right page */
 	int			olddataitemstotal;	/* space taken by old items */
 
-	bool		have_split;		/* found a valid split? */
-
-	/* these fields valid only if have_split is true */
-	bool		newitemonleft;	/* new item on left or right of best split */
-	OffsetNumber firstright;	/* best split point */
-	int			best_delta;		/* best size delta so far */
+	int			maxsplit;		/* Maximum number of splits */
+	int			nvalidsplits;	/* Current number of splits */
+	SplitPoint	splits[MAX_SPLIT_POINTS];	/* Sorted by delta */
 } FindSplitData;
 
 
@@ -76,12 +89,19 @@ static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft);
+				 bool manyduplicates, OffsetNumber newitemoff,
+				 Size newitemsz, IndexTuple newitem, bool *newitemonleft);
 static void _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright, bool newitemonleft,
 				  int dataitemstoleft, Size firstoldonrightsz);
+static int _bt_best_possible_distance(Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, int nvalidsplits,
+				  SplitPoint *splits, bool *needsecondpass);
+static int _bt_distance_for_split(Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split);
+static int _bt_tuple_distance(IndexTuple lastlefttup,
+				  IndexTuple firstrighttup, bool *needsecondpass);
+static bool _bt_split_between_dups(Page page, OffsetNumber firstoldonright);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
@@ -113,9 +133,12 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	bool		is_unique = false;
 	int			indnkeyatts;
 	ScanKey		itup_scankey;
+	ItemPointer itup_scantid;
 	BTStack		stack = NULL;
 	Buffer		buf;
 	OffsetNumber offset;
+	Page		page;
+	BTPageOpaque lpageop;
 	bool		fastpath;
 
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -123,6 +146,8 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
+	/* use a heap TID with the scan key, unless this is the unique case */
+	itup_scantid = (checkUnique == UNIQUE_CHECK_NO ? &itup->t_tid : NULL);
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -149,8 +174,6 @@ top:
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
 		Size		itemsz;
-		Page		page;
-		BTPageOpaque lpageop;
 
 		/*
 		 * Conditionally acquire exclusive lock on the buffer before doing any
@@ -180,8 +203,8 @@ top:
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, indnkeyatts, itup_scankey, itup_scantid,
+							page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -220,8 +243,8 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, indnkeyatts, itup_scankey, itup_scantid, false,
+						   &buf, BT_WRITE, NULL);
 	}
 
 	/*
@@ -231,12 +254,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tie-breaker attribute.  Any other would-be inserter of
+	 * the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -250,7 +274,11 @@ top:
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
+		page = BufferGetPage(buf);
+		lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+		Assert(itup_scantid == NULL);
+		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, NULL,
+							 P_FIRSTDATAKEY(lpageop), false);
 		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
 								 checkUnique, &is_unique, &speculativeToken);
 
@@ -288,7 +316,7 @@ top:
 		 * attributes are not considered part of the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
+		/* do the insertion, possibly on a page to the right in unique case */
 		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
 						  stack, heapRel);
 		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
@@ -553,11 +581,11 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
-			/* If scankey == hikey we gotta check the next page too */
+			/* If scankey <= hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			/* _bt_isequal()'s special NULL semantics not required here */
+			if (_bt_compare(rel, indnkeyatts, itup_scankey, NULL, page, P_HIKEY) > 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -601,31 +629,22 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 /*
  *	_bt_findinsertloc() -- Finds an insert location for a tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
- *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
- *
  *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
- *		page returned to the caller is exclusively locked instead.
+ *		where the new tuple could be inserted if we were to treat it as having
+ *		no implicit heap TID; only callers that just called _bt_check_unique()
+ *		provide this hint (all other callers should set *offsetptr to
+ *		InvalidOffsetNumber).  The caller should hold an exclusive lock on
+ *		*bufptr in all cases.  On exit, they both point to the chosen insert
+ *		location in all cases.  If _bt_findinsertloc decides to move right, the
+ *		lock and pin on the original page will be released, and the new page
+ *		returned to the caller is exclusively locked instead.
+ *
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.
  *
  *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		type scan key for it.  We take a "scantid" heap TID attribute value
+ *		from newtup directly.
  */
 static void
 _bt_findinsertloc(Relation rel,
@@ -641,9 +660,9 @@ _bt_findinsertloc(Relation rel,
 	Page		page = BufferGetPage(buf);
 	Size		itemsz;
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
+	bool		hintinvalidated;
 	OffsetNumber newitemoff;
+	OffsetNumber lowitemoff;
 	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -673,59 +692,30 @@ _bt_findinsertloc(Relation rel,
 				 errtableconstraint(heapRel,
 									RelationGetRelationName(rel))));
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
+	/* firstlegaloff/offsetptr hint (if any) assumed valid initially */
+	hintinvalidated = false;
+
+	/*
+	 * TODO: Restore the logic for finding a page to insert on in the event of
+	 * many duplicates for pre-pg_upgrade indexes.  The whole search through
+	 * pages of logical duplicates to determine where to insert seems to
+	 * have little upside, but that doesn't make it okay to ignore the
+	 * performance characteristics after pg_upgrade is run and before a
+	 * REINDEX can run to bump BTREE_VERSION.
 	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	while (true)
 	{
 		Buffer		rbuf;
 		BlockNumber rblkno;
 
-		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
-		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
-		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
-
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
-		}
-
-		/*
-		 * nope, so check conditions (b) and (c) enumerated above
-		 */
 		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+			_bt_compare(rel, keysz, scankey, &newtup->t_tid, page, P_HIKEY) <= 0)
 			break;
 
 		/*
-		 * step right to next non-dead page
+		 * step right to next non-dead page.  This is only needed for unique
+		 * indexes, and for pg_upgrade'd indexes that still use BTREE_VERSION
+		 * 2 or 3, where heap TID isn't considered to be part of the keyspace.
 		 *
 		 * must write-lock that page before releasing write lock on current
 		 * page; else someone else's _bt_check_unique scan could fail to see
@@ -764,24 +754,40 @@ _bt_findinsertloc(Relation rel,
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		hintinvalidated = true;
+	}
+
+	Assert(P_ISLEAF(lpageop));
+
+	/*
+	 * Perform micro-vacuuming of the page we're about to insert the tuple
+	 * onto, if it looks like it has LP_DEAD items.
+	 */
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < itemsz)
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		hintinvalidated = true;
 	}
 
 	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
+	 * Consider using caller's hint to avoid repeated binary search effort.
+	 *
+	 * Note that the hint is only provided by callers that checked uniqueness.
+	 * The hint is used as a lower bound for a new binary search, since
+	 * the caller's original binary search won't have specified a scantid.
 	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
+	if (firstlegaloff == InvalidOffsetNumber || hintinvalidated)
+		lowitemoff = P_FIRSTDATAKEY(lpageop);
 	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	{
+		Assert(firstlegaloff == _bt_binsrch(rel, buf, keysz, scankey, NULL,
+											P_FIRSTDATAKEY(lpageop), false));
+		lowitemoff = firstlegaloff;
+	}
+
+	newitemoff = _bt_binsrch(rel, buf, keysz, scankey, &newtup->t_tid,
+							 lowitemoff, false);
 
 	*bufptr = buf;
 	*offsetptr = newitemoff;
@@ -840,11 +846,12 @@ _bt_insertonpg(Relation rel,
 	/* child buffer must be given iff inserting on an internal page */
 	Assert(P_ISLEAF(lpageop) == !BufferIsValid(cbuf));
 	/* tuple must have appropriate number of attributes */
+	Assert(BTreeTupleGetNAtts(itup, rel) > 0);
 	Assert(!P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) ==
 		   IndexRelationGetNumberOfAttributes(rel));
 	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
+		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/* The caller should've finished any incomplete splits already. */
@@ -889,8 +896,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* Choose the split point */
-		firstright = _bt_findsplitloc(rel, page,
-									  newitemoff, itemsz,
+		firstright = _bt_findsplitloc(rel, page, false,
+									  newitemoff, itemsz, itup,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
@@ -1132,8 +1139,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	OffsetNumber i;
 	bool		isleaf;
 	IndexTuple	lefthikey;
-	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/* Acquire a new page to split into */
 	rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1203,7 +1208,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <=
+			   IndexRelationGetNumberOfKeyAttributes(rel));
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1217,8 +1224,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 
 	/*
 	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
-	 * item at position firstright, or the incoming tuple.
+	 * to go into the new right page, or possibly a truncated version if this
+	 * is a leaf page split.  This might be either the existing data item at
+	 * position firstright, or the incoming tuple.
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1236,25 +1244,110 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
-	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * Truncate attributes of the high key item before inserting it on the left
+	 * page.  This can only happen at the leaf level, since in general all
+	 * pivot tuple values originate from leaf level high keys.  This isn't just
+	 * about avoiding unnecessary work, though; truncating unneeded key suffix
+	 * attributes can only be performed at the leaf level anyway.  This is
+	 * because a pivot tuple in a grandparent page must guide a search not only
+	 * to the correct parent page, but also to the correct leaf page.
+	 *
+	 * Note that non-key (INCLUDE) attributes are always truncated away here.
+	 * Additional key attributes are truncated away when they're not required
+	 * to correctly separate the key space.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (isleaf)
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		OffsetNumber	lastleftoff;
+		IndexTuple		lastleft;
+
+		/*
+		 * Determine which tuple is on the left side of the split point, and
+		 * generate truncated copy of the right tuple.  Truncate as
+		 * aggressively as possible without generating a high key for the left
+		 * side of the split (and later downlink for the right side) that fails
+		 * to distinguish each side.  The new high key needs to be strictly
+		 * less than all tuples on the right side of the split, but can be
+		 * equal to items on the left side of the split.
+		 *
+		 * Handle the case where the incoming tuple is about to become the last
+		 * item on the left side of the split.
+		 */
+		if (newitemonleft && newitemoff == firstright)
+			lastleft = newitem;
+		else
+		{
+			lastleftoff = OffsetNumberPrev(firstright);
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		Assert(lastleft != item);
+		lefthikey = _bt_suffix_truncate(rel, lastleft, item);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
+#ifdef DEBUG_SPLITS
+		if (IsNormalProcessingMode() && !IsSystemRelation(rel))
+		{
+			TupleDesc	itupdesc = RelationGetDescr(rel);
+			Datum		values[INDEX_MAX_KEYS];
+			bool		isnull[INDEX_MAX_KEYS];
+			char	   *lastleftstr;
+			char	   *firstrightstr;
+			char	   *newstr;
+
+			index_deform_tuple(lastleft, itupdesc, values, isnull);
+			lastleftstr = BuildIndexValueDescription(rel, values, isnull);
+			index_deform_tuple(item, itupdesc, values, isnull);
+			firstrightstr = BuildIndexValueDescription(rel, values, isnull);
+			index_deform_tuple(newitem, itupdesc, values, isnull);
+			newstr = BuildIndexValueDescription(rel, values, isnull);
+
+			elog(LOG, "\"%s\" leaf block %u "
+				 "last left %s first right %s "
+				 "attributes truncated: %u from %u%s new item %s",
+				 RelationGetRelationName(rel), BufferGetBlockNumber(buf),
+				 lastleftstr, firstrightstr,
+				 IndexRelationGetNumberOfKeyAttributes(rel) - BTreeTupleGetNAtts(lefthikey, rel),
+				 IndexRelationGetNumberOfKeyAttributes(rel),
+				 BTreeTupleGetHeapTID(lefthikey) != NULL ? " (heap TID added back)":"",
+				 newstr);
+		}
+#endif
 	}
 	else
+	{
 		lefthikey = item;
+#ifdef DEBUG_SPLITS
+		if (IsNormalProcessingMode() && !IsSystemRelation(rel))
+		{
+			TupleDesc	itupdesc = RelationGetDescr(rel);
+			Datum		values[INDEX_MAX_KEYS];
+			bool		isnull[INDEX_MAX_KEYS];
+			char	   *newhighkey;
+			char	   *newstr;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+			index_deform_tuple(lefthikey, itupdesc, values, isnull);
+			newhighkey = BuildIndexValueDescription(rel, values, isnull);
+			index_deform_tuple(newitem, itupdesc, values, isnull);
+			newstr = BuildIndexValueDescription(rel, values, isnull);
+
+			elog(LOG, "\"%s\" internal block %u "
+				 "new high key %s "
+				 "attributes truncated: %u from %u%s new item %s",
+				 RelationGetRelationName(rel), BufferGetBlockNumber(buf),
+				 newhighkey,
+				 IndexRelationGetNumberOfKeyAttributes(rel) - BTreeTupleGetNAtts(lefthikey, rel),
+				 IndexRelationGetNumberOfKeyAttributes(rel),
+				 BTreeTupleGetHeapTID(lefthikey) != NULL ? " (heap TID added back)":"",
+				 newstr);
+		}
+#endif
+	}
+
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <=
+		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1447,7 +1540,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1476,22 +1568,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log left page.  We must also log the left page's high key. */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1509,9 +1589,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -1548,6 +1626,17 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * for it, we might find ourselves with too little room on the page that
  * it needs to go into!)
  *
+ * We also give some weight to suffix truncation in deciding a split point.
+ * With leaf pages, we try to select a point where distinguishing bytes
+ * appear earlier in the new high key for the left side of the split, in
+ * order to maximize the number of trailing attributes that can be truncated
+ * away.  This is also useful with leaf pages that only have a single
+ * attribute, since it's still important to avoid appending an explicit heap
+ * TID attribute to the new pivot tuple (high key/downlink).  The logic for
+ * internal page splits is a little different: we always want to choose the
+ * smallest possible downlink for the next level up that won't split the
+ * page in an unbalanced manner, to delay or prevent root page splits.
+ *
  * If the page is the rightmost page on its level, we instead try to arrange
  * to leave the left split page fillfactor% full.  In this way, when we are
  * inserting successively increasing keys (consider sequences, timestamps,
@@ -1556,6 +1645,19 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * This is the same as nbtsort.c produces for a newly-created tree.  Note
  * that leaf and nonleaf pages use different fillfactors.
  *
+ * As a last resort, we may call ourselves recursively in many duplicates
+ * mode.  This avoids choosing a split point that will create a high key
+ * that requires a heap TID attribute, at the cost of an entire second pass
+ * over the page.  This greatly helps with space management in cases where
+ * there are many logical duplicates spanning multiple pages.  Newly
+ * inserted duplicates will tend to have higher heap TID values, so we'll
+ * end up splitting the same page again and again, since the heap TID
+ * attribute has descending sort order.  The page splits will be lopsided,
+ * which will tend to produce space utilization for pages full of duplicates
+ * of around 65% to 80%, rather than 50%.  This is similar to the rightmost
+ * page special case.  Many duplicates mode is almost always avoided in
+ * cases where it won't help.
+ *
  * We are passed the intended insert position of the new tuple, expressed as
  * the offsetnumber of the tuple it must go in front of.  (This could be
  * maxoff+1 if the tuple is to go at the end.)
@@ -1563,13 +1665,16 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * We return the index of the first existing tuple that should go on the
  * righthand page, plus a boolean indicating whether the new tuple goes on
  * the left or right page.  The bool is necessary to disambiguate the case
- * where firstright == newitemoff.
+ * where firstright == newitemoff.  (In many duplicates mode, this could be
+ * InvalidOffsetNumber.)
  */
 static OffsetNumber
 _bt_findsplitloc(Relation rel,
 				 Page page,
+				 bool manyduplicates,
 				 OffsetNumber newitemoff,
 				 Size newitemsz,
+				 IndexTuple newitem,
 				 bool *newitemonleft)
 {
 	BTPageOpaque opaque;
@@ -1581,10 +1686,16 @@ _bt_findsplitloc(Relation rel,
 				rightspace,
 				goodenough,
 				olddataitemstotal,
-				olddataitemstoleft;
-	bool		goodenoughfound;
+				olddataitemstoleft,
+				bestdistance,
+				lowestdistance,
+				lowsplit,
+				i;
+	bool		goodenoughfound,
+				needsecondpass;
 
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
 
 	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
 	newitemsz += sizeof(ItemIdData);
@@ -1606,17 +1717,22 @@ _bt_findsplitloc(Relation rel,
 	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
 
 	state.newitemsz = newitemsz;
+	state.page = page;
+	state.is_duplicates = manyduplicates;
 	state.is_leaf = P_ISLEAF(opaque);
 	state.is_rightmost = P_RIGHTMOST(opaque);
-	state.have_split = false;
+	state.nvalidsplits = 0;
 	if (state.is_leaf)
+	{
 		state.fillfactor = RelationGetFillFactor(rel,
 												 BTREE_DEFAULT_FILLFACTOR);
+		state.maxsplit = Min(Max(3, maxoff / 16), MAX_SPLIT_POINTS);
+	}
 	else
+	{
 		state.fillfactor = BTREE_NONLEAF_FILLFACTOR;
-	state.newitemonleft = false;	/* these just to keep compiler quiet */
-	state.firstright = 0;
-	state.best_delta = 0;
+		state.maxsplit = 1;
+	}
 	state.leftspace = leftspace;
 	state.rightspace = rightspace;
 	state.olddataitemstotal = olddataitemstotal;
@@ -1639,7 +1755,6 @@ _bt_findsplitloc(Relation rel,
 	 */
 	olddataitemstoleft = 0;
 	goodenoughfound = false;
-	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (offnum = P_FIRSTDATAKEY(opaque);
 		 offnum <= maxoff;
@@ -1670,34 +1785,136 @@ _bt_findsplitloc(Relation rel,
 							  olddataitemstoleft, itemsz);
 		}
 
-		/* Abort scan once we find a good-enough choice */
-		if (state.have_split && state.best_delta <= goodenough)
-		{
+		/*
+		 * Abort scan once we find a good-enough choice, and stop adding new
+		 * choices
+		 */
+		if (state.nvalidsplits > 0 && state.splits[0].delta <= goodenough)
 			goodenoughfound = true;
+		if (goodenoughfound && state.nvalidsplits == state.maxsplit &&
+			state.splits[state.maxsplit - 1].delta <= goodenough)
 			break;
-		}
 
 		olddataitemstoleft += itemsz;
 	}
 
+	/*
+	 * If this is a second, many duplicates mode pass, we're only concerned
+	 * with avoiding a split point that requires a heap TID in the new pivot
+	 * tuple.  That may not be possible, in which case space utilization
+	 * alone determines choice of split point.
+	 */
+	if (state.is_duplicates)
+	{
+		if (state.nvalidsplits == 0)
+			return InvalidOffsetNumber;
+
+		*newitemonleft = state.splits[0].newitemonleft;
+		return state.splits[0].firstright;
+	}
+
 	/*
 	 * If the new item goes as the last item, check for splitting so that all
 	 * the old items go to the left page and the new item goes to the right
-	 * page.
+	 * page.  We deliberately don't do this in many duplicates mode, since we
+	 * want to avoid "clean breaks" when there are many duplicates.
+	 * (See also: _bt_split_between_dups.)
 	 */
 	if (newitemoff > maxoff && !goodenoughfound)
 		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
 
 	/*
-	 * I believe it is not possible to fail to find a feasible split, but just
-	 * in case ...
+	 * I believe it is not possible to fail to find a feasible split outside
+	 * of many duplicates mode, but just in case ...
 	 */
-	if (!state.have_split)
+	if (state.nvalidsplits == 0)
 		elog(ERROR, "could not find a feasible split point for index \"%s\"",
 			 RelationGetRelationName(rel));
 
-	*newitemonleft = state.newitemonleft;
-	return state.firstright;
+	needsecondpass = false;
+	bestdistance = 0;
+	if (state.is_leaf && !state.is_duplicates)
+		bestdistance = _bt_best_possible_distance(page, newitemoff, newitem,
+												  state.nvalidsplits,
+												  state.splits,
+												  &needsecondpass);
+
+	/*
+	 * It looks like we don't have a split point that distinguishes each
+	 * side of the split well enough to even avoid appending a heap TID
+	 * attribute to the new pivot tuple.  Perform a second scan of the page,
+	 * this time in many duplicates mode.
+	 */
+	if (needsecondpass)
+	{
+		OffsetNumber	firstdisjunct;
+
+		Assert(state.is_leaf);
+		Assert(!state.is_duplicates);
+
+		firstdisjunct = _bt_findsplitloc(rel, page, true, newitemoff,
+										 newitemsz - sizeof(ItemIdData),
+										 newitem, newitemonleft);
+
+		/* newitemonleft was set for us */
+		if (firstdisjunct != InvalidOffsetNumber)
+			return firstdisjunct;
+
+		/* Proceed with original plan */
+	}
+
+	lowestdistance = INT_MAX;
+	lowsplit = 0;
+	/* Search among candidate split points for point with lowest distance */
+	for (i = 0; i < state.nvalidsplits; i++)
+	{
+		int			distance;
+
+		distance = _bt_distance_for_split(page, newitemoff, newitem,
+										  state.splits + i);
+
+		if (distance <= bestdistance)
+		{
+			/*
+			 * Split point's distance is as good as the best possible split
+			 * point, so we can return early, avoiding uselessly calculating
+			 * further distances.
+			 *
+			 * This optimization is very important for common cases such as
+			 * primary key indexes on auto-incremented or monotonically
+			 * increasing values.  They'll often only incur two
+			 * _bt_tuple_distance() calls; one to establish the
+			 * best/lowest distance, and another that determines that the
+			 * delta-optimal split point (the first entry in the splits
+			 * array) also has the best/lowest possible distance.
+			 */
+#ifdef DEBUG_SPLITS
+			elog(LOG, "\"%s\" %s best distance %d returned at point %d out of %d",
+				 RelationGetRelationName(rel), state.is_leaf ? "leaf" : "internal",
+				 bestdistance, i, state.nvalidsplits);
+#endif
+			lowestdistance = distance;
+			lowsplit = i;
+			break;
+		}
+
+		/*
+		 * Remember the earliest (lowest delta) appearance of the overall
+		 * lowest distance candidate split point
+		 */
+		if (distance < lowestdistance)
+		{
+			lowestdistance = distance;
+			lowsplit = i;
+		}
+	}
+
+#ifdef DEBUG_SPLITS
+	elog(LOG, "\"%s\" split location %d out of %u",
+		 RelationGetRelationName(rel), state.splits[lowsplit].firstright, maxoff);
+#endif
+	*newitemonleft = state.splits[lowsplit].newitemonleft;
+	return state.splits[lowsplit].firstright;
 }
 
 /*
@@ -1759,10 +1976,15 @@ _bt_checksplitloc(FindSplitData *state,
 	/*
 	 * If we are not on the leaf level, we will be able to discard the key
 	 * data from the first item that winds up on the right page.
+	 *
+	 * If we are on the leaf level, conservatively assume that suffix
+	 * truncation cannot avoid adding a heap TID.
 	 */
 	if (!state->is_leaf)
 		rightfree += (int) firstrightitemsz -
 			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
+	else
+		leftfree -= MAXALIGN(sizeof(ItemPointerData));
 
 	/*
 	 * If feasible split point, remember best delta.
@@ -1788,16 +2010,258 @@ _bt_checksplitloc(FindSplitData *state,
 
 		if (delta < 0)
 			delta = -delta;
-		if (!state->have_split || delta < state->best_delta)
+		if (state->nvalidsplits < state->maxsplit ||
+			delta < state->splits[state->nvalidsplits - 1].delta)
 		{
-			state->have_split = true;
-			state->newitemonleft = newitemonleft;
-			state->firstright = firstoldonright;
-			state->best_delta = delta;
+			SplitPoint	newsplit;
+			int			j;
+
+			newsplit.delta = delta;
+			newsplit.newitemonleft = newitemonleft;
+			newsplit.firstright = firstoldonright;
+
+			/* In many duplicates mode, split may be unusable */
+			if (unlikely(state->is_duplicates &&
+						 _bt_split_between_dups(state->page, newsplit.firstright)))
+				return;
+
+			/*
+			 * Make space at the end of the state array for new candidate
+			 * split point if we haven't already reached the maximum number
+			 * of split points.
+			 */
+			if (state->nvalidsplits < state->maxsplit)
+				state->nvalidsplits++;
+
+			/*
+			 * Insert the new split point into the sorted splits array.  The
+			 * final slot is either a garbage still-uninitialized entry, or
+			 * the most marginal real entry when we already have as many
+			 * split points as we're willing to consider.
+			 */
+			for (j = state->nvalidsplits - 1;
+				 j > 0 && state->splits[j - 1].delta > newsplit.delta;
+				 j--)
+			{
+				state->splits[j] = state->splits[j - 1];
+			}
+			state->splits[j] = newsplit;
 		}
 	}
 }
 
+/*
+ * Subroutine to find the best possible "distance" in a list of candidate
+ * split points.
+ *
+ * If a candidate split point matches the best possible split point, it
+ * cannot be beat, and we can stop considering any other split point.
+ */
+static int
+_bt_best_possible_distance(Page page, OffsetNumber newitemoff,
+						   IndexTuple newitem, int nvalidsplits,
+						   SplitPoint *splits, bool *needsecondpass)
+{
+	int				distance;
+	OffsetNumber	mid, left;
+	IndexTuple		lefttup, righttup;
+	ItemId			itemid;
+	int				j;
+
+	*needsecondpass = false;
+
+	lefttup = NULL;
+	righttup = NULL;
+	mid = splits[0].firstright;
+	distance = 0;
+
+	/*
+	 * Find two split points that are furthest apart.
+	 *
+	 * Iterate backwards through the splits array, possibly until the
+	 * second entry in the array is considered.
+	 */
+	for (j = nvalidsplits - 1; j > 1; j--)
+	{
+		SplitPoint *split = splits + j;
+
+		if (!righttup && split->firstright >= mid)
+		{
+			if (!split->newitemonleft && newitemoff == split->firstright)
+				righttup = newitem;
+			else
+			{
+				itemid = PageGetItemId(page, split->firstright);
+				righttup = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+		else if (!lefttup && split->firstright <= mid)
+		{
+			if (split->newitemonleft && newitemoff == split->firstright)
+			{
+				left = split->firstright;
+				lefttup = newitem;
+			}
+			else
+			{
+				left = OffsetNumberPrev(split->firstright);
+				itemid = PageGetItemId(page, left);
+				lefttup = (IndexTuple) PageGetItem(page, itemid);
+			}
+			/* remember the left of last left */
+			left = OffsetNumberPrev(left);
+		}
+
+		if (lefttup && righttup)
+		{
+			distance = _bt_tuple_distance(lefttup, righttup, needsecondpass);
+			break;
+		}
+	}
+
+	return distance;
+}
+
+/*
+ * Subroutine to find the "distance" between the two tuples that enclose a
+ * candidate split point.
+ *
+ * Distance is an offset into tuple data, and is only used for leaf pages.
+ */
+static int
+_bt_distance_for_split(Page page, OffsetNumber newitemoff,
+					   IndexTuple newitem, SplitPoint *split)
+{
+	ItemId			itemid;
+	IndexTuple		lastlefttup;
+	IndexTuple		firstrighttup;
+
+	if (!split->newitemonleft && newitemoff == split->firstright)
+		firstrighttup = newitem;
+	else
+	{
+		itemid = PageGetItemId(page, split->firstright);
+		firstrighttup = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	if (split->newitemonleft && newitemoff == split->firstright)
+		lastlefttup = newitem;
+	else
+	{
+		itemid = PageGetItemId(page, OffsetNumberPrev(split->firstright));
+		lastlefttup = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	Assert(lastlefttup != firstrighttup);
+	return _bt_tuple_distance(lastlefttup, firstrighttup, NULL);
+}
+
+/*
+ * _bt_tuple_distance - at what point does the first difference appear
+ * between a pair of leaf tuples?
+ *
+ * This is used when determining which split point among the list of
+ * acceptable split points has the earliest difference.  Tuples with earlier
+ * differences are more likely to have suffix attributes truncated away.
+ *
+ * The approach used here is inherently approximate, since we don't want to
+ * go to the expense of using ordinary scan key comparisons.
+ */
+static int
+_bt_tuple_distance(IndexTuple lastlefttup, IndexTuple firstrighttup,
+				   bool *needsecondpass)
+{
+	char	   *left, *right;
+	Size		leftsz, rightsz, datasz;
+	int			result;
+
+	/*
+	 * We deliberately start our comparison in the NULL bitmap, if any.
+	 *
+	 * FIXME: What about covering index pivot tuples?  Need to exclude non-key
+	 * suffix attributes that are always truncated away later.
+	 */
+	leftsz = IndexTupleSize(lastlefttup) - sizeof(IndexTupleData);
+	rightsz = IndexTupleSize(firstrighttup) - sizeof(IndexTupleData);
+	datasz = Min(leftsz, rightsz);
+
+	left = (char *) lastlefttup + sizeof(IndexTupleData);
+	right = (char *) firstrighttup + sizeof(IndexTupleData);
+	for (result = 0; result < datasz; result++)
+	{
+		if (left[result] != right[result])
+			break;
+	}
+
+	/* A second "many duplicates" pass may be required */
+	if (leftsz == rightsz && result == leftsz)
+	{
+		if (needsecondpass)
+			*needsecondpass = true;
+	}
+
+	return result;
+}
+
+/*
+ * _bt_split_between_dups - should we skip this split point as a duplicate?
+ *
+ * This is used in many duplicates mode only.  Note that we deliberately don't
+ * consider candidate splits where either side of the split point is actually
+ * the new item rather than an existing item on the page, unlike
+ * _bt_distance_for_split().  This results in many duplicates mode never
+ * splitting the page immediately before the point where the large group of
+ * duplicates begins.
+ *
+ * If a split was allowed to take place that cleanly separated the large group
+ * of duplicates on to their own leftmost page, continual lopsided splits of
+ * the original leftmost page could not take place.  Ultimately, duplicate
+ * pages would have space utilization of about 50%, which is exactly what we're
+ * trying to avoid.  (Maybe it would be possible to recognize the leftmost page
+ * in a series of pages entirely full of the same duplicate value by using a
+ * "low key", but that doesn't seem worth it.)
+ */
+static bool
+_bt_split_between_dups(Page page, OffsetNumber firstoldonright)
+{
+	ItemId			itemid;
+	IndexTuple		lastleft;
+	IndexTuple		firstright;
+	char		   *left;
+	char		   *right;
+	Size			datasz;
+	BTPageOpaque	opaque;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	itemid = PageGetItemId(page, firstoldonright);
+	firstright = (IndexTuple) PageGetItem(page, itemid);
+
+	/*
+	 * Add a little slack, so that pages full of duplicates are not always as
+	 * packed as possible.  It seems like we're justified in packing things
+	 * very tight, but not using up all free space.
+	 */
+	if (OffsetNumberPrev(firstoldonright) <= P_FIRSTDATAKEY(opaque))
+		return false;
+
+	/*
+	 * FIXME: What about covering index pivot tuples?  Clearly using
+	 * memcmp() there does not approximate comparing key attributes very
+	 * well at all.
+	 */
+	itemid = PageGetItemId(page, OffsetNumberPrev(firstoldonright));
+	lastleft = (IndexTuple) PageGetItem(page, itemid);
+	left = (char *) lastleft + sizeof(IndexTupleData);
+	right = (char *) firstright + sizeof(IndexTupleData);
+	datasz = IndexTupleSize(firstright) - sizeof(IndexTupleData);
+
+	if (IndexTupleSize(lastleft) == IndexTupleSize(firstright) &&
+		memcmp(left, right, datasz) == 0)
+		return true;
+
+	return false;
+}
+
 /*
  * _bt_insert_parent() -- Insert downlink into parent after a page split.
  *
@@ -2199,7 +2663,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2311,8 +2776,8 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
@@ -2326,12 +2791,6 @@ _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
 	for (i = 1; i <= keysz; i++)
 	{
 		AttrNumber	attno;
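
The byte-wise "distance" heuristic used by _bt_tuple_distance() above boils
down to finding the first differing byte between the two candidate bounding
tuples.  A self-contained sketch of the idea follows (invented names, not
the patch's code):

#include <stddef.h>

/*
 * Illustrative-only sketch of the "first differing byte" heuristic: an
 * earlier difference between the tuples on either side of a candidate
 * split point means more trailing attributes can be truncated away from
 * the new high key.  Reports full equality so that a caller could decide
 * to fall back to a second, many-duplicates pass.
 */
static size_t
toy_first_difference(const unsigned char *left, size_t leftsz,
					 const unsigned char *right, size_t rightsz,
					 int *fully_equal)
{
	size_t		common = leftsz < rightsz ? leftsz : rightsz;
	size_t		i;

	for (i = 0; i < common; i++)
	{
		if (left[i] != right[i])
			break;
	}
	*fully_equal = (leftsz == rightsz && i == common);
	return i;
}
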
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 4082103fe2..f63615341c 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1421,10 +1421,12 @@ _bt_pagedel(Relation rel, Buffer buf)
 
 				/* we need an insertion scan key for the search, so build one */
 				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
+				/* get stack to leaf page by searching index */
 				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
+								   BTreeTupleGetNAtts(targetkey, rel),
+								   itup_scankey,
+								   BTreeTupleGetHeapTID(targetkey), false,
+								   &lbuf, BT_READ, NULL);
 				/* don't need a pin on the page */
 				_bt_relbuf(rel, lbuf);
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d3700bd082..0bfe94dd47 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -69,11 +69,13 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
 
 
 /*
- *	_bt_search() -- Search the tree for a particular scankey,
+ *	_bt_search() -- Search the tree for a particular scankey + scantid,
  *		or more precisely for the first leaf page it could be on.
  *
  * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * but it can omit the rightmost column(s) of the index.  The scantid
+ * argument may also be omitted (caller passes NULL); it is logically the
+ * "real" rightmost attribute.
  *
  * When nextkey is false (the usual case), we are looking for the first
  * item >= scankey.  When nextkey is true, we are looking for the first
@@ -94,8 +96,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, int keysz, ScanKey scankey, ItemPointer scantid,
+		   bool nextkey, Buffer *bufP, int access, Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -131,7 +133,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, scantid, nextkey,
 							  (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
@@ -145,7 +147,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, scantid,
+							 P_FIRSTDATAKEY(opaque), nextkey);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -158,8 +161,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link to disambiguate duplicate keys in the index, which is
+		 * faster than comparing the keys themselves.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -199,7 +202,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, scantid, nextkey,
 							  true, stack_in, BT_WRITE, snapshot);
 	}
 
@@ -245,6 +248,7 @@ _bt_moveright(Relation rel,
 			  Buffer buf,
 			  int keysz,
 			  ScanKey scankey,
+			  ItemPointer scantid,
 			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
@@ -305,7 +309,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, scantid, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -337,6 +341,12 @@ _bt_moveright(Relation rel,
  * particular, this means it is possible to return a value 1 greater than the
  * number of keys on the page, if the scankey is > all keys on the page.)
  *
+ * Caller passes its own low value for the binary search.  This can be used
+ * to resume a partial binary search without repeating comparisons, which is
+ * how _bt_check_unique callers avoid repeated work.  This only works when a
+ * buffer lock is held throughout, the same leaf page is passed both times,
+ * and nextkey is false.
+ *
  * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
  * of the last key < given scankey, or last key <= given scankey if nextkey
  * is true.  (Since _bt_compare treats the first data key of such a page as
@@ -354,19 +364,19 @@ _bt_binsrch(Relation rel,
 			Buffer buf,
 			int keysz,
 			ScanKey scankey,
+			ItemPointer scantid,
+			OffsetNumber low,
 			bool nextkey)
 {
 	Page		page;
 	BTPageOpaque opaque;
-	OffsetNumber low,
-				high;
+	OffsetNumber high;
 	int32		result,
 				cmpval;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	low = P_FIRSTDATAKEY(opaque);
 	high = PageGetMaxOffsetNumber(page);
 
 	/*
@@ -401,7 +411,7 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, keysz, scankey, scantid, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -431,6 +441,50 @@ _bt_binsrch(Relation rel,
 /*----------
  *	_bt_compare() -- Compare scankey to a particular tuple on the page.
  *
+ * Convenience wrapper for _bt_tuple_compare() callers that want to compare
+ * an offset on a particular page.
+ *
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey.  The actual key value stored (if any, which there probably isn't)
+ * does not matter.  This convention allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first key.
+ * See backend/access/nbtree/README for details.
+ *----------
+ */
+int32
+_bt_compare(Relation rel,
+			int keysz,
+			ScanKey scankey,
+			ItemPointer scantid,
+			Page page,
+			OffsetNumber offnum)
+{
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	IndexTuple	itup;
+
+	Assert(_bt_check_natts(rel, page, offnum));
+
+	/*
+	 * Force result ">" if target item is first data item on an internal page
+	 * --- see NOTE above.
+	 *
+	 * A minus infinity key has all attributes truncated away, so this test is
+	 * redundant with the minus infinity attribute tie-breaker.  However, the
+	 * number of attributes in minus infinity tuples was not explicitly
+	 * represented as 0 until PostgreSQL v11, so an explicit offnum test is
+	 * still required.
+	 */
+	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+		return 1;
+
+	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	return _bt_tuple_compare(rel, keysz, scankey, scantid, itup);
+}
+
+/*----------
+ *	_bt_tuple_compare() -- Compare scankey to a particular tuple.
+ *
  * The passed scankey must be an insertion-type scankey (see nbtree/README),
  * but it can omit the rightmost column(s) of the index.
  *
@@ -445,37 +499,23 @@ _bt_binsrch(Relation rel,
  *		NULLs in the keys are treated as sortable values.  Therefore
  *		"equality" does not necessarily mean that the item should be
  *		returned to the caller as a matching key!
- *
- * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
- * "minus infinity": this routine will always claim it is less than the
- * scankey.  The actual key value stored (if any, which there probably isn't)
- * does not matter.  This convention allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first key.
- * See backend/access/nbtree/README for details.
  *----------
  */
 int32
-_bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
-			Page page,
-			OffsetNumber offnum)
+_bt_tuple_compare(Relation rel,
+				  int keysz,
+				  ScanKey scankey,
+				  ItemPointer scantid,
+				  IndexTuple itup)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	IndexTuple	itup;
+	ItemPointer	heapTid;
+	int			ntupatts;
+	int			ncmpkey;
 	int			i;
 
-	Assert(_bt_check_natts(rel, page, offnum));
-
-	/*
-	 * Force result ">" if target item is first data item on an internal page
-	 * --- see NOTE above.
-	 */
-	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
-		return 1;
-
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	Assert(keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -489,7 +529,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	ncmpkey = Min(ntupatts, keysz);
+	for (i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -540,8 +581,35 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * Use the number of attributes as a tie-breaker, in order to treat
+	 * truncated attributes in index as minus infinity.
+	 */
+	if (keysz > ntupatts)
+		return 1;
+
+	/* If caller provided no heap TID tie-breaker for scan, they're equal */
+	if (!scantid)
+		return 0;
+
+	/*
+	 * Although it isn't counted as an attribute by BTreeTupleGetNAtts(), heap
+	 * TID is an implicit final key attribute that ensures that all index
+	 * tuples have a distinct set of key attribute values.
+	 *
+	 * This is often truncated away in pivot tuples, which makes the attribute
+	 * value implicitly negative infinity.
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (!heapTid)
+		return 1;
+
+#ifndef BTREE_ASC_HEAP_TID
+	/* Deliberately invert the order, since TIDs "sort DESC" */
+	return ItemPointerCompare(heapTid, scantid);
+#else
+	return ItemPointerCompare(scantid, heapTid);
+#endif
 }
 
 /*
@@ -570,6 +638,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	Relation	rel = scan->indexRelation;
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	Buffer		buf;
+	BTPageOpaque opaque;
 	BTStack		stack;
 	OffsetNumber offnum;
 	StrategyNumber strat;
@@ -577,6 +646,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	bool		goback;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
 	ScanKeyData scankeys[INDEX_MAX_KEYS];
+	ItemPointer scantid;
+	ItemPointerData minscantid;
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -826,6 +897,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * scankeys[] array, using the keys identified by startKeys[].
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
+	scantid = NULL;
 	for (i = 0; i < keysCount; i++)
 	{
 		ScanKey		cur = startKeys[i];
@@ -962,6 +1034,37 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		}
 	}
 
+	/*
+	 * When all key attributes will be in insertion scankey, manufacture
+	 * sentinel scan tid that's less than any possible heap TID in the index.
+	 * This is still greater than minus infinity to _bt_compare, allowing
+	 * _bt_search to follow a downlink with scankey-equal attributes, but a
+	 * truncated-away heap TID.
+	 *
+	 * If we didn't do this then affected index scans would have to
+	 * unnecessarily visit an extra page before moving right to the page they
+	 * should have landed on from the parent in the first place.
+	 *
+	 * (Note that hard-coding this behavior into _bt_compare is unworkable,
+	 * since some _bt_search callers need to re-find a leaf page using the
+	 * page's high key.)
+	 */
+	if (keysCount >= IndexRelationGetNumberOfKeyAttributes(rel))
+	{
+		scantid = &minscantid;
+
+#ifndef BTREE_ASC_HEAP_TID
+		/* Heap TID attribute uses DESC ordering */
+		ItemPointerSetBlockNumber(scantid, InvalidBlockNumber);
+		ItemPointerSetOffsetNumber(scantid, InvalidOffsetNumber);
+#else
+		/* Lowest possible block is 0 */
+		ItemPointerSetBlockNumber(scantid, 0);
+		/* InvalidOffsetNumber less than any real offset */
+		ItemPointerSetOffsetNumber(scantid, InvalidOffsetNumber);
+#endif
+	}
+
 	/*----------
 	 * Examine the selected initial-positioning strategy to determine exactly
 	 * where we need to start the scan, and set flag variables to control the
@@ -1054,11 +1157,11 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 
 	/*
-	 * Use the manufactured insertion scan key to descend the tree and
-	 * position ourselves on the target leaf page.
+	 * Use the manufactured insertion scan key (and possibly a scantid) to
+	 * descend the tree and position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, keysCount, scankeys, scantid, nextkey, &buf,
+					   BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
@@ -1087,7 +1190,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(BufferGetPage(buf));
+	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, scantid,
+						 P_FIRSTDATAKEY(opaque), nextkey);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..6133d3a5f4 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -796,8 +796,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -880,19 +878,28 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
-			IndexTuple	truncated;
-			Size		truncsz;
+			IndexTuple		lastleft;
+			IndexTuple		truncated;
+			Size			truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
-			 * in internal pages are either negative infinity items, or get
-			 * their contents from copying from one level down.  See also:
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks in
+			 * internal pages are either negative infinity items, or get their
+			 * contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it more
+			 * likely that _bt_suffix_truncate() can truncate away more
+			 * attributes, whereas the split point passed to _bt_split() is
+			 * chosen much more delicately.  Suffix truncation is mostly useful
+			 * because it can greatly improve space utilization for workloads
+			 * with random insertions.  It doesn't seem worthwhile to add
+			 * complex logic for choosing a split point here for a benefit that
+			 * is bound to be much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
@@ -905,7 +912,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_suffix_truncate(wstate->index, lastleft, oitup);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -924,8 +934,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -970,7 +981,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1029,8 +1040,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1127,6 +1139,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1134,7 +1148,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1151,6 +1164,25 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all keys
+				 * in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+#ifndef BTREE_ASC_HEAP_TID
+					/* Deliberately invert the order, since TIDs "sort DESC" */
+					compare = ItemPointerCompare(&itup2->t_tid, &itup->t_tid);
+#else
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+#endif
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 4528e87c83..f133c5a2f1 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,6 +49,8 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_leave_natts(Relation rel, IndexTuple lastleft,
+						   IndexTuple firstright);
 
 
 /*
@@ -56,27 +58,34 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().
+ *		The result is intended for use with _bt_compare().  If itup has
+ *		undergone suffix truncation of key attributes, caller had better
+ *		pass BTreeTupleGetNAtts(itup, rel) as keysz to routines like
+ *		_bt_search() and _bt_compare() when using returned scan key.  This
+ *		allows truncated attributes to participate in comparisons (truncated
+ *		attributes have implicit negative infinity values).  Note that
+ *		_bt_compare() never treats a scan key as containing negative
+ *		infinity attributes.
  */
 ScanKey
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
 	ScanKey		skey;
 	TupleDesc	itupdesc;
+	int			tupnatts;
 	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts > 0);
+	Assert(tupnatts <= indnatts);
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
@@ -96,7 +105,21 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Truncated key attributes may not be represented in index tuple
+		 * due to suffix truncation.  Keys built from truncated attributes
+		 * are defensively represented as NULL values, though they should
+		 * still not be allowed to participate in comparisons (caller must
+		 * be sure to pass a sane keysz to _bt_compare()).
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -2083,38 +2106,186 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_suffix_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  This is possible when there are
+ * attributes that follow an attribute in firstright that is not equal to the
+ * corresponding attribute in lastleft (equal according to an insertion scan
+ * key).
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that takes up more
+ * space than firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * Note that returned tuple's t_tid offset will hold the number of
+ * attributes present, so the original item pointer offset is not
+ * represented.  Caller should only change truncated tuple's downlink.  Note
+ * also that truncated key attributes are treated as containing "minus
+ * infinity" values by _bt_compare()/_bt_tuple_compare().  Returned tuple is
+ * guaranteed to be no larger than the original plus some extra space for a
+ * possible extra heap TID tie-breaker attribute, which is important for
+ * staying under the 1/3 of a page restriction on tuple size.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_suffix_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc		itupdesc = RelationGetDescr(rel);
+	int16			natts = IndexRelationGetNumberOfAttributes(rel);
+	int16			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int				leavenatts;
+	IndexTuple		pivot;
+	ItemPointer		pivotheaptid;
+	Size			newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples, which must have non-key
+	 * attributes in the case of INCLUDE indexes.  It's never okay to truncate
+	 * a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be left behind */
+	leavenatts = _bt_leave_natts(rel, lastleft, firstright);
 
-	return truncated;
+	if (leavenatts <= natts)
+	{
+		IndexTuple		tidpivot;
+
+		/*
+		 * Truncate away non-key attributes and/or key attributes.  Do a
+		 * straight copy in the case where the only attribute to be "truncated
+		 * away" is the implicit heap TID key attribute (i.e. the case where we
+		 * can at least avoid adding an explicit heap TID attribute to new
+		 * pivot).  We should only call index_truncate_tuple() when user
+		 * attributes need to be truncated.
+		 */
+		if (leavenatts < natts)
+			pivot = index_truncate_tuple(itupdesc, firstright, leavenatts);
+		else
+			pivot = CopyIndexTuple(firstright);
+
+		/*
+		 * If there is a distinguishing key attribute within leavenatts, there
+		 * is no need to add an explicit heap TID attribute to new pivot.
+		 */
+		if (leavenatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, leavenatts);
+			return pivot;
+		}
+
+		/*
+		 * Only non-key attributes could be truncated away from an INCLUDE
+		 * index's pivot tuple.  They are not considered part of the key space,
+		 * so it's still necessary to add a heap TID attribute to the new pivot
+		 * tuple.  Create enlarged copy of our truncated right tuple copy, to
+		 * fit heap TID.
+		 */
+		Assert(natts > nkeyatts);
+		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since attributes are all equal.  It's
+		 * necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * The enlarged copy of firstright made above has room for a heap TID.
+	 * We must use a heap TID as a unique-ifier in the new pivot tuple, since
+	 * no user key attribute distinguishes which values belong on each side
+	 * of the split point.  Set the enlarged size now.
+	 */
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/*
+	 * Generate a heap TID value for the new pivot tuple.  Simply use last
+	 * left heap TID as new pivot's heap TID value.  Only cases where there
+	 * are a large number of duplicates will get a heap TID added to new pivot
+	 * tuples, since the logic for picking a split point works hard to avoid
+	 * having the split end up here.
+	 *
+	 * Lehman and Yao require that the downlink to the right page, which is to
+	 * be inserted into the parent page in the second phase of a page split be
+	 * a strict lower bound on all current and future items on the right page
+	 * (this will be copied from the new high key for the left side of the
+	 * split).  Manufacturing a "median TID" value barely affects space
+	 * utilization, so we don't bother.
+	 */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  MAXALIGN(sizeof(ItemPointerData)));
+	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+
+#ifndef BTREE_ASC_HEAP_TID
+	/* Deliberately invert the order, since TIDs "sort DESC" */
+	Assert(ItemPointerCompare(&lastleft->t_tid, pivotheaptid) >= 0);
+	Assert(ItemPointerCompare(&firstright->t_tid, pivotheaptid) < 0);
+#else
+	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#endif
+
+	/* Mark tuple as containing all key attributes, plus heap TID attribute */
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
+
+/*
+ * _bt_leave_natts - how many key attributes to leave when truncating.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			leavenatts;
+	ScanKey		skey;
+
+	skey = _bt_mkscankey(rel, firstright);
+
+	/*
+	 * Test the nkeyatts (no truncated user attributes) case too, since the
+	 * caller cares about whether or not it can avoid appending a heap TID
+	 * as a unique-ifier.
+	 */
+	leavenatts = 1;
+	for (;;)
+	{
+		if (leavenatts > nkeyatts)
+			break;
+		if (_bt_tuple_compare(rel, leavenatts, skey, NULL, lastleft) > 0)
+			break;
+		leavenatts++;
+	}
+
+	/* Can't leak memory here */
+	_bt_freeskey(skey);
+
+	return leavenatts;
 }
 
 /*
@@ -2137,6 +2308,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2156,6 +2328,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
@@ -2165,7 +2338,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2176,7 +2349,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			Assert(!P_RIGHTMOST(opaque));
 
 			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2209,7 +2382,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Tuple contains only key attributes despite on is it page high
 			 * key or not
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 
 	}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 67a94cb80a..7c061e96d2 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -284,6 +268,8 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		OffsetNumber off;
 		IndexTuple	newitem = NULL;
 		Size		newitemsz = 0;
+		IndexTuple	left_hikey = NULL;
+		Size		left_hikeysz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 5c4457179d..667c906b2e 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index dfbda5458f..ffeb0624fe 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -854,10 +854,8 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
 	 * PageIndexTupleDelete is the best way.  Delete the items in reverse
 	 * order so we don't have to think about adjusting item numbers for
 	 * previous deletions.
-	 *
-	 * TODO: tune the magic number here
 	 */
-	if (nitems <= 2)
+	if (nitems <= 7)
 	{
 		while (--nitems >= 0)
 			PageIndexTupleDelete(page, itemnos[nitems]);
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 9fb33b9035..966c64eabf 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4057,23 +4057,35 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required
+	 * for btree indexes, since heap TID is treated as an implicit last
+	 * key attribute in order to ensure that all keys in the index are
+	 * physically unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
 		BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
-
+#ifndef BTREE_ASC_HEAP_TID
+		/* Deliberately invert the order, since TIDs "sort DESC" */
+		if (blk1 != blk2)
+			return (blk1 < blk2) ? 1 : -1;
+#else
 		if (blk1 != blk2)
 			return (blk1 < blk2) ? -1 : 1;
+#endif
 	}
 	{
 		OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
 		OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
 
+#ifndef BTREE_ASC_HEAP_TID
+		/* Deliberately invert the order, since TIDs "sort DESC" */
+		if (pos1 != pos2)
+			return (pos1 < pos2) ? 1 : -1;
+#else
 		if (pos1 != pos2)
 			return (pos1 < pos2) ? -1 : 1;
+#endif
 	}
 
 	return 0;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 04ecb4cbc0..c06560d9ae 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -114,16 +114,27 @@ typedef struct BTMetaPageData
 
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
-#define BTREE_MIN_VERSION	2	/* minimal supported version number */
+/* FIXME: Support versions 2 and 3 for the benefit of pg_upgrade users */
+#define BTREE_VERSION	4		/* current version number */
+#define BTREE_MIN_VERSION	4	/* minimal supported version number */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_suffix_truncate() will need to enlarge
+ * a heap index tuple to make space for a tie-breaker heap TID
+ * attribute, which we account for here.
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*MAXALIGN(sizeof(ItemPointerData))) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeOld(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
@@ -204,12 +215,11 @@ typedef struct BTMetaPageData
  * real offset (downlinks only need to store a block number).  The offset
  * field only stores the number of attributes when the INDEX_ALT_TID_MASK
  * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * number of attributes).  INDEX_ALT_TID_MASK is only used for pivot tuples
+ * at present, though it's possible that it will be used within non-pivot
+ * tuples in the future.  Do not assume that a tuple with INDEX_ALT_TID_MASK
+ * set must be a pivot tuple.  A pivot tuple must have INDEX_ALT_TID_MASK set
+ * as of BTREE_VERSION 4, however.
  *
  * The 12 least significant offset bits are used to represent the number of
  * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
@@ -219,6 +229,8 @@ typedef struct BTMetaPageData
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+/* Reserved to indicate if heap TID is represented at end of tuple */
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +253,15 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +270,58 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Heap TID behaves as a final key value within nbtree, in order to ensure
+ * that all entries keys are unique and relocatable.  By default, heap TIDs
+ * are sorted in DESC sort order within nbtree indexes.  ASC heap TID ordering
+ * may be useful during testing.
+ *
+ * DESC order results in superior space utilization when inserting many
+ * logical duplicates into an index.  Continual leaf page splits will split
+ * the same block again and again, because more recently inserted items sort
+ * lower, not higher.  The page split logic is biased to make the first page
+ * among pages full of logical duplicates have a small number of values that
+ * are less than the duplicated value.  The overall effect is a bit like a
+ * right-heavy page split.
+#define BTREE_ASC_HEAP_TID
+ */
+
+/*
+ * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
+ * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
+ * (since all pivot tuples must as of BTREE_VERSION 4).  When non-pivot
+ * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
+ * probably also contain a heap TID at the end of the tuple.  We avoid
+ * assuming that a tuple with INDEX_ALT_TID_MASK set is necessarily a pivot
+ * tuple.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   MAXALIGN(sizeof(ItemPointerData))) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -560,15 +621,18 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
  * prototypes for functions in nbtsearch.c
  */
 extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
+		   int keysz, ScanKey scankey, ItemPointer scantid, bool nextkey,
 		   Buffer *bufP, int access, Snapshot snapshot);
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
+			  ScanKey scankey, ItemPointer scantid, bool nextkey,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
 extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
+			ScanKey scankey, ItemPointer scantid, OffsetNumber low,
+			bool nextkey);
 extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+			ItemPointer scantid, Page page, OffsetNumber offnum);
+extern int32 _bt_tuple_compare(Relation rel, int keysz, ScanKey scankey,
+							   ItemPointer scantid, IndexTuple itup);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -601,7 +665,8 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
+extern IndexTuple _bt_suffix_truncate(Relation rel, IndexTuple lastleft,
+									  IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
 
 /*
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 819373031c..5f3c4a015a 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,7 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50 and 0x60 are unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -82,20 +81,16 @@ typedef struct xl_btree_insert
  *
  * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
  * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always explicitly log the left page high key value.
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * In the _R variants, the new item is one of the right page's tuples.  An
+ * IndexTuple representing the HIKEY of the left page follows.  We don't need
+ * this on leaf pages, because it's the same as the leftmost key in the new
+ * right page.
  *
  * Backup Blk 1: new right page
  *
diff --git a/src/test/regress/expected/domain.out b/src/test/regress/expected/domain.out
index 0b5a9041b0..f4899f2a38 100644
--- a/src/test/regress/expected/domain.out
+++ b/src/test/regress/expected/domain.out
@@ -643,10 +643,10 @@ update domnotnull set col1 = null; -- fails
 ERROR:  domain dnotnulltest does not allow null values
 alter domain dnotnulltest drop not null;
 update domnotnull set col1 = null;
+\set VERBOSITY terse
 drop domain dnotnulltest cascade;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to column col1 of table domnotnull
-drop cascades to column col2 of table domnotnull
+\set VERBOSITY default
 -- Test ALTER DOMAIN .. DEFAULT ..
 create table domdeftest (col1 ddef1);
 insert into domdeftest default values;
diff --git a/src/test/regress/expected/foreign_key.out b/src/test/regress/expected/foreign_key.out
index b90c4926e2..9e809269e0 100644
--- a/src/test/regress/expected/foreign_key.out
+++ b/src/test/regress/expected/foreign_key.out
@@ -253,13 +253,13 @@ SELECT * FROM FKTABLE;
 (5 rows)
 
 -- this should fail for lack of CASCADE
+\set VERBOSITY terse
 DROP TABLE PKTABLE;
 ERROR:  cannot drop table pktable because other objects depend on it
-DETAIL:  constraint constrname2 on table fktable depends on table pktable
-HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 DROP TABLE PKTABLE CASCADE;
 NOTICE:  drop cascades to constraint constrname2 on table fktable
 DROP TABLE FKTABLE;
+\set VERBOSITY default
 --
 -- First test, check with no on delete or on update
 --
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index dc6262be43..2c20cea4b9 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5896,8 +5896,8 @@ inner join j1 j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
 where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1;
  id1 | id2 | id1 | id2 
 -----+-----+-----+-----
-   1 |   1 |   1 |   1
    1 |   2 |   1 |   2
+   1 |   1 |   1 |   1
 (2 rows)
 
 reset enable_nestloop;
diff --git a/src/test/regress/expected/truncate.out b/src/test/regress/expected/truncate.out
index 2e26510522..c8b9a71689 100644
--- a/src/test/regress/expected/truncate.out
+++ b/src/test/regress/expected/truncate.out
@@ -276,11 +276,10 @@ SELECT * FROM trunc_faa;
 (0 rows)
 
 ROLLBACK;
+\set VERBOSITY terse
 DROP TABLE trunc_f CASCADE;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table trunc_fa
-drop cascades to table trunc_faa
-drop cascades to table trunc_fb
+\set VERBOSITY default
 -- Test ON TRUNCATE triggers
 CREATE TABLE trunc_trigger_test (f1 int, f2 text, f3 text);
 CREATE TABLE trunc_trigger_log (tgop text, tglevel text, tgwhen text,
diff --git a/src/test/regress/expected/typed_table.out b/src/test/regress/expected/typed_table.out
index 2e47ecbcf5..c76efee358 100644
--- a/src/test/regress/expected/typed_table.out
+++ b/src/test/regress/expected/typed_table.out
@@ -75,19 +75,12 @@ CREATE TABLE persons4 OF person_type (
     name WITH OPTIONS DEFAULT ''  -- error, specified more than once
 );
 ERROR:  column "name" specified more than once
+\set VERBOSITY terse
 DROP TYPE person_type RESTRICT;
 ERROR:  cannot drop type person_type because other objects depend on it
-DETAIL:  table persons depends on type person_type
-function get_all_persons() depends on type person_type
-table persons2 depends on type person_type
-table persons3 depends on type person_type
-HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 DROP TYPE person_type CASCADE;
 NOTICE:  drop cascades to 4 other objects
-DETAIL:  drop cascades to table persons
-drop cascades to function get_all_persons()
-drop cascades to table persons2
-drop cascades to table persons3
+\set VERBOSITY default
 CREATE TABLE persons5 OF stuff; -- only CREATE TYPE AS types may be used
 ERROR:  type stuff is not a composite type
 DROP TABLE stuff;
diff --git a/src/test/regress/expected/updatable_views.out b/src/test/regress/expected/updatable_views.out
index e64d693e9c..1ea90181d8 100644
--- a/src/test/regress/expected/updatable_views.out
+++ b/src/test/regress/expected/updatable_views.out
@@ -328,24 +328,10 @@ UPDATE ro_view20 SET b=upper(b);
 ERROR:  cannot update view "ro_view20"
 DETAIL:  Views that return set-returning functions are not automatically updatable.
 HINT:  To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 16 other objects
-DETAIL:  drop cascades to view ro_view1
-drop cascades to view ro_view17
-drop cascades to view ro_view2
-drop cascades to view ro_view3
-drop cascades to view ro_view5
-drop cascades to view ro_view6
-drop cascades to view ro_view7
-drop cascades to view ro_view8
-drop cascades to view ro_view9
-drop cascades to view ro_view11
-drop cascades to view ro_view13
-drop cascades to view rw_view15
-drop cascades to view rw_view16
-drop cascades to view ro_view20
-drop cascades to view ro_view4
-drop cascades to view rw_view14
+\set VERBOSITY default
 DROP VIEW ro_view10, ro_view12, ro_view18;
 DROP SEQUENCE uv_seq CASCADE;
 NOTICE:  drop cascades to view ro_view19
diff --git a/src/test/regress/sql/domain.sql b/src/test/regress/sql/domain.sql
index 68da27de22..d19e2c9d28 100644
--- a/src/test/regress/sql/domain.sql
+++ b/src/test/regress/sql/domain.sql
@@ -381,7 +381,9 @@ alter domain dnotnulltest drop not null;
 
 update domnotnull set col1 = null;
 
+\set VERBOSITY terse
 drop domain dnotnulltest cascade;
+\set VERBOSITY default
 
 -- Test ALTER DOMAIN .. DEFAULT ..
 create table domdeftest (col1 ddef1);
diff --git a/src/test/regress/sql/foreign_key.sql b/src/test/regress/sql/foreign_key.sql
index f3e7058583..8fbec438b5 100644
--- a/src/test/regress/sql/foreign_key.sql
+++ b/src/test/regress/sql/foreign_key.sql
@@ -159,9 +159,11 @@ UPDATE PKTABLE SET ptest1=1 WHERE ptest1=2;
 SELECT * FROM FKTABLE;
 
 -- this should fail for lack of CASCADE
+\set VERBOSITY terse
 DROP TABLE PKTABLE;
 DROP TABLE PKTABLE CASCADE;
 DROP TABLE FKTABLE;
+\set VERBOSITY default
 
 
 --
diff --git a/src/test/regress/sql/truncate.sql b/src/test/regress/sql/truncate.sql
index 6ddfb6dd1d..fee7e76ec3 100644
--- a/src/test/regress/sql/truncate.sql
+++ b/src/test/regress/sql/truncate.sql
@@ -125,7 +125,9 @@ SELECT * FROM trunc_fa;
 SELECT * FROM trunc_faa;
 ROLLBACK;
 
+\set VERBOSITY terse
 DROP TABLE trunc_f CASCADE;
+\set VERBOSITY default
 
 -- Test ON TRUNCATE triggers
 
diff --git a/src/test/regress/sql/typed_table.sql b/src/test/regress/sql/typed_table.sql
index 9ef0cdfcc7..953cd1f14b 100644
--- a/src/test/regress/sql/typed_table.sql
+++ b/src/test/regress/sql/typed_table.sql
@@ -43,8 +43,10 @@ CREATE TABLE persons4 OF person_type (
     name WITH OPTIONS DEFAULT ''  -- error, specified more than once
 );
 
+\set VERBOSITY terse
 DROP TYPE person_type RESTRICT;
 DROP TYPE person_type CASCADE;
+\set VERBOSITY default
 
 CREATE TABLE persons5 OF stuff; -- only CREATE TYPE AS types may be used
 
diff --git a/src/test/regress/sql/updatable_views.sql b/src/test/regress/sql/updatable_views.sql
index dc6d5cbe35..6eaa81b540 100644
--- a/src/test/regress/sql/updatable_views.sql
+++ b/src/test/regress/sql/updatable_views.sql
@@ -98,7 +98,9 @@ DELETE FROM ro_view18;
 UPDATE ro_view19 SET last_value=1000;
 UPDATE ro_view20 SET b=upper(b);
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 DROP VIEW ro_view10, ro_view12, ro_view18;
 DROP SEQUENCE uv_seq CASCADE;
 
-- 
2.17.1

In reply to: Peter Geoghegan (#15)
3 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Attached is v5, which significantly simplifies the _bt_findsplitloc()
logic. It's now a great deal easier to follow. It would be helpful if
someone could do code-level review of the overhauled
_bt_findsplitloc(). That's the most important part of the patch. It
involves relatively subjective trade-offs around total effort spent
during a page split, space utilization, and avoiding "false sharing"
(I call the situation where a range of duplicate values straddle two
leaf pages unnecessarily "false sharing", since it necessitates that
subsequent index scans visit two index scans rather than just one,
even when that's avoidable.)

This version has slightly improved performance, especially for cases
where an index gets bloated without any garbage being generated. With
the UK land registry data [1], an index on (county, city, locality) is
shrunk by just over 18% by the new logic (I recall that it was shrunk
by ~15% in an earlier version). In concrete terms, it goes from being
1.288 GiB on master to being 1.054 GiB with v5 of the patch. This is
mostly because the patch intelligently packs together duplicate-filled
pages tightly (in particular, it avoids "getting tired"), but also
because it makes pivots less restrictive about where leaf tuples can
go. I still manage to shrink the largest TPC-C and TPC-H indexes by at
least 5% following an initial load performed by successive INSERTs.
Those are unique indexes, so the benefits are certainly not limited to
cases involving many duplicates.

3 modes
-------

My new approach is to teach _bt_findsplitloc() 3 distinct modes of
operation: Regular/default mode, many duplicates mode, and single
value mode. The higher level split code always asks for a default mode
call to _bt_findsplitloc(), so that's always where we start. For leaf
page splits, _bt_findsplitloc() will occasionally call itself
recursively in either many duplicates mode or single value mode. This
happens when the default strategy doesn't work out.

* Default mode almost does what we do already, but remembers the top n
candidate split points, sorted by the delta between left and right
post-split free space, rather than just looking for the overall lowest
delta split point.

Then, we go through a second pass over the temp array of "acceptable"
split points, which considers the needs of suffix truncation.
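
As a rough sketch of that second pass (made-up types and helper names,
not the patch's actual code), given the array of acceptable low-delta
candidates it suffices to pick the one that needs the fewest leading
attributes to distinguish the halves:

typedef struct SplitCandidate
{
	int			firstright;		/* offset of first tuple to go to right page */
	int			delta;			/* left/right free space imbalance after split */
} SplitCandidate;

static int
second_pass(SplitCandidate *candidates, int ncandidates,
			int (*natts_needed) (int firstright))
{
	int			best = 0;
	int			bestnatts = natts_needed(candidates[0].firstright);
	int			i;

	for (i = 1; i < ncandidates; i++)
	{
		int			natts = natts_needed(candidates[i].firstright);

		if (natts < bestnatts)
		{
			best = i;
			bestnatts = natts;
		}
	}

	/* every candidate already had an acceptably low delta */
	return candidates[best].firstright;
}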

* Many duplicates mode is used when we fail to find a "distinguishing"
split point in regular mode, but have determined that it's possible to
get one if a new, exhaustive search is performed.

We go to great lengths to avoid having to append a heap TID to the new
left page high key -- that's what I mean by "distinguishing". We're
particularly concerned with false sharing by subsequent point lookup
index scans here.

* Single value mode is used when we see that even many duplicates mode
would be futile, as the leaf page is already *entirely* full of
logical duplicates.

Single value mode isn't exhaustive, since there is clearly nothing to
exhaustively search for. Instead, it packs together as many tuples as
possible on the right side of the split. Since heap TIDs sort in
descending order, this is very much like a "leftmost" split that tries
to free most of the space on the left side, and pack most of the page
contents on the right side. In effect it's the mirror image of the
traditional rightmost 90:10 split, except that it's leftmost only
among the pages full of logical duplicates, rather than among all the
pages on an entire level of the tree.
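
To make the relationship between the three modes concrete, here's a
control-flow sketch (hypothetical names and drastically reduced
arguments -- the real _bt_findsplitloc() derives all of this state
itself):

#include <stdbool.h>

typedef enum
{
	SPLIT_DEFAULT,				/* always the starting point */
	SPLIT_MANY_DUPLICATES,		/* exhaustive search for a distinguishing point */
	SPLIT_SINGLE_VALUE			/* page is entirely one value; pack the right side */
} SplitMode;

/* Sketch only: find_split() stands in for _bt_findsplitloc() */
static int
find_split(SplitMode mode, bool found_by_default_pass,
		   bool distinguishing_point_exists)
{
	if (mode == SPLIT_DEFAULT)
	{
		/* first pass: low free-space-delta candidates, try truncation */
		if (found_by_default_pass)
			return 1;			/* best default-mode candidate (placeholder) */
		if (distinguishing_point_exists)
			return find_split(SPLIT_MANY_DUPLICATES, false, true);
		return find_split(SPLIT_SINGLE_VALUE, false, false);
	}

	if (mode == SPLIT_MANY_DUPLICATES)
		return 2;				/* best point from exhaustive search (placeholder) */

	/* SPLIT_SINGLE_VALUE: fillfactor-weighted, leftmost-style split */
	return 3;					/* placeholder */
}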

Other changes
-------------

* I now explicitly use fillfactor in the manner of a rightmost split
to get the single value mode behavior.

I call these types of splits (rightmost and single value mode splits)
"weighted" splits in the patch. This is much more consistent with our
existing conventions than my previous approach.
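
Under assumed ratios (the real weighting comes from the index
fillfactor), the "weighted" idea amounts to something like this sketch:

#include <stdbool.h>

/*
 * Sketch only: returns the target amount of working space to leave in
 * use on the left half of the split.  The single value mode ratio shown
 * here is an assumption for illustration.
 */
static int
weighted_split_left_space(int workingspace, int leaffillfactor,
						  bool rightmost, bool single_value_mode)
{
	if (rightmost)
	{
		/* traditional behavior: left page ends up ~fillfactor% full */
		return workingspace * leaffillfactor / 100;
	}

	if (single_value_mode)
	{
		/*
		 * Mirror image under DESC heap TID order: pack the right page,
		 * and leave the left page mostly empty for the duplicates that
		 * will keep arriving "before" the existing ones.
		 */
		return workingspace * (100 - leaffillfactor) / 100;
	}

	/* default and many duplicates modes look for an even split instead */
	return workingspace / 2;
}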

* Improved approach to inexpensively determining how effective
suffix truncation will be for a given candidate split point.

I no longer naively probe the contents of index tuples to do char
comparisons. Instead, I use a tuple descriptor to get offsets to each
attribute in each tuple in turn, and then call datumIsEqual() to
determine if they're equal. This is almost as good as a full scan key
comparison. This actually seems to be a bit faster, and also handles
INCLUDE indexes without any special-case code (no need to worry about
probing non-key attributes, and reaching a faulty conclusion about
which split point helps with suffix truncation).
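
Assuming a helper along these lines (the patch's actual function name,
loop structure, and NULL handling may differ), the per-candidate check
reduces to counting how many leading key attributes two adjacent leaf
tuples have in common:

#include "postgres.h"

#include "access/itup.h"
#include "utils/datum.h"
#include "utils/rel.h"

/*
 * Sketch: number of leading key attributes that compare as binary-equal
 * between lastleft and firstright.  A low count means suffix truncation
 * can remove more attributes if we split between these two tuples.
 * NULLs are conservatively treated as "not equal" here.
 */
static int
count_equal_leading_atts(Relation rel, IndexTuple lastleft,
						 IndexTuple firstright)
{
	TupleDesc	itupdesc = RelationGetDescr(rel);
	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
	int			attnum;

	for (attnum = 1; attnum <= nkeyatts; attnum++)
	{
		Form_pg_attribute att = TupleDescAttr(itupdesc, attnum - 1);
		Datum		datum1,
					datum2;
		bool		isnull1,
					isnull2;

		datum1 = index_getattr(lastleft, attnum, itupdesc, &isnull1);
		datum2 = index_getattr(firstright, attnum, itupdesc, &isnull2);

		if (isnull1 || isnull2)
			break;
		if (!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
			break;
	}

	return attnum - 1;
}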

I still haven't managed to add pg_upgrade support, but that's my next
step. I am more or less happy with the substance of the patch in v5,
and feel that I can now work backwards towards figuring out the best
way to deal with on-disk compatibility. It shouldn't be too hard --
most of the effort will involve coming up with a good test suite.

[1]: https://wiki.postgresql.org/wiki/Sample_Databases
--
Peter Geoghegan

Attachments:

v5-0003-Allow-nbtree-to-use-ASC-heap-TID-attribute-order.patch
From cc11b7e5d4cc950a0adfa83e98335e1252187ec2 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 18 Sep 2018 18:25:45 -0700
Subject: [PATCH v5 3/3] Allow nbtree to use ASC heap TID attribute order.

When the macro BTREE_ASC_HEAP_TID is defined (uncommented), the patch
will change the implementation to use ASC sort order rather than DESC
sort order.  This may be useful to reviewers.
---
 src/backend/access/nbtree/nbtinsert.c |  4 ++++
 src/backend/access/nbtree/nbtsearch.c | 11 +++++++++++
 src/backend/access/nbtree/nbtsort.c   |  4 ++++
 src/backend/access/nbtree/nbtutils.c  | 12 ++++++++++++
 src/backend/utils/sort/tuplesort.c    | 10 ++++++++++
 src/include/access/nbtree.h           | 22 ++++++++++++++++++++++
 6 files changed, 63 insertions(+)

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 598e702bf1..507ed00373 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -2222,7 +2222,11 @@ _bt_perfect_firstdiff(Relation rel, Page page, OffsetNumber newitemoff,
 		 */
 		if (identical)
 		{
+#ifndef BTREE_ASC_HEAP_TID
 			if (P_FIRSTDATAKEY(opaque) == newitemoff)
+#else
+			if (maxoff < newitemoff)
+#endif
 				*secondmode = SPLIT_SINGLE_VALUE;
 			else
 			{
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c229b7eed2..6c149113c8 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -604,8 +604,12 @@ _bt_tuple_compare(Relation rel,
 	if (!heapTid)
 		return 1;
 
+#ifndef BTREE_ASC_HEAP_TID
 	/* Deliberately invert the order, since TIDs "sort DESC" */
 	return ItemPointerCompare(heapTid, scantid);
+#else
+	return ItemPointerCompare(scantid, heapTid);
+#endif
 }
 
 /*
@@ -1053,9 +1057,16 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	{
 		scantid = &minscantid;
 
+#ifndef BTREE_ASC_HEAP_TID
 		/* Heap TID attribute uses DESC ordering */
 		ItemPointerSetBlockNumber(scantid, InvalidBlockNumber);
 		ItemPointerSetOffsetNumber(scantid, InvalidOffsetNumber);
+#else
+		/* Lowest possible block is 0 */
+		ItemPointerSetBlockNumber(scantid, 0);
+		/* InvalidOffsetNumber less than any real offset */
+		ItemPointerSetOffsetNumber(scantid, InvalidOffsetNumber);
+#endif
 	}
 
 	/*----------
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index e8f506cc09..1a62683ee8 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1175,8 +1175,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				 */
 				if (compare == 0)
 				{
+#ifndef BTREE_ASC_HEAP_TID
 					/* Deliberately invert the order, since TIDs "sort DESC" */
 					compare = ItemPointerCompare(&itup2->t_tid, &itup->t_tid);
+#else
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+#endif
 					Assert(compare != 0);
 					if (compare > 0)
 						load1 = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index f9f3ec7914..8c8fdd62f4 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2242,7 +2242,14 @@ _bt_suffix_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  MAXALIGN(sizeof(ItemPointerData)));
+#ifndef BTREE_ASC_HEAP_TID
 	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+#else
+	/* Manufacture TID that's less than right TID, but only minimally */
+	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerSetOffsetNumber(pivotheaptid,
+							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+#endif
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2252,9 +2259,14 @@ _bt_suffix_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 	 * split).
 	 */
 
+#ifndef BTREE_ASC_HEAP_TID
 	/* Deliberately invert the order, since TIDs "sort DESC" */
 	Assert(ItemPointerCompare(&lastleft->t_tid, pivotheaptid) >= 0);
 	Assert(ItemPointerCompare(&firstright->t_tid, pivotheaptid) < 0);
+#else
+	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
 	BTreeTupleSetAltHeapTID(pivot);
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 5211cf5b98..1d9a2602d9 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4066,17 +4066,27 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
 		BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
 
+#ifndef BTREE_ASC_HEAP_TID
 		/* Deliberately invert the order, since TIDs "sort DESC" */
 		if (blk1 != blk2)
 			return (blk1 < blk2) ? 1 : -1;
+#else
+		if (blk1 != blk2)
+			return (blk1 < blk2) ? -1 : 1;
+#endif
 	}
 	{
 		OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
 		OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
 
+#ifndef BTREE_ASC_HEAP_TID
 		/* Deliberately invert the order, since TIDs "sort DESC" */
 		if (pos1 != pos2)
 			return (pos1 < pos2) ? 1 : -1;
+#else
+		if (pos1 != pos2)
+			return (pos1 < pos2) ? -1 : 1;
+#endif
 	}
 
 	return 0;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 12f57773e7..c686cb7e47 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -117,6 +117,24 @@ typedef struct BTMetaPageData
 #define BTREE_VERSION	4		/* current version number */
 #define BTREE_MIN_VERSION	4	/* minimal supported version number */
 
+/*
+ * Heap TID behaves as a final key value within nbtree as of
+ * BTREE_VERSION 4.  This ensures that all entry keys are unique
+ * and relocatable.  By default, heap TIDs are sorted in DESC sort
+ * order within nbtree indexes.  ASC heap TID ordering may be
+ * useful during testing.
+ *
+ * DESC order was chosen because it allowed BTREE_VERSION 4 to
+ * maintain compatibility with unspecified BTREE_VERSION 2 + 3
+ * behavior that dependency management nevertheless relied on.
+ * However, DESC order also seems like it might be slightly better
+ * on its own merits, since continually splitting the same leaf
+ * page may cut down on the total number of FPIs generated when
+ * continually inserting tuples with the same user-visible
+ * attribute values.
+#define BTREE_ASC_HEAP_TID
+ */
+
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
@@ -151,7 +169,11 @@ typedef struct BTMetaPageData
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#ifndef BTREE_ASC_HEAP_TID
 #define BTREE_SINGLEVAL_FILLFACTOR	1
+#else
+#define BTREE_SINGLEVAL_FILLFACTOR	99
+#endif
 
 /*
  *	In general, the btree code tries to localize its knowledge about
-- 
2.17.1

v5-0002-Add-temporary-page-split-debug-LOG-instrumentatio.patch
From fce43ad9d1f85e793bbcfc21efd796fbe4fa9b2a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 18 Sep 2018 18:18:55 -0700
Subject: [PATCH v5 2/3] Add temporary page split debug LOG instrumentation.

LOGs details on the new left high key in the event of a leaf page split.
This is an easy way to directly observe the effectiveness of suffix
truncation as it happens, which was useful during development.  The
macro DEBUG_SPLITS must be defined (uncommented) for the instrumentation
to be enabled.
---
 src/backend/access/nbtree/nbtinsert.c | 69 +++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 0cb8bb1816..598e702bf1 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -27,6 +27,11 @@
 #include "utils/datum.h"
 #include "utils/tqual.h"
 
+/* #define DEBUG_SPLITS */
+#ifdef DEBUG_SPLITS
+#include "catalog/catalog.h"
+#endif
+
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 #define STACK_SPLIT_POINTS			15
@@ -1287,9 +1292,64 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		lefthikey = _bt_suffix_truncate(rel, lastleft, item);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
+#ifdef DEBUG_SPLITS
+		if (IsNormalProcessingMode() && !IsSystemRelation(rel))
+		{
+			TupleDesc	itupdesc = RelationGetDescr(rel);
+			Datum		values[INDEX_MAX_KEYS];
+			bool		isnull[INDEX_MAX_KEYS];
+			char	   *lastleftstr;
+			char	   *firstrightstr;
+			char	   *newstr;
+
+			index_deform_tuple(lastleft, itupdesc, values, isnull);
+			lastleftstr = BuildIndexValueDescription(rel, values, isnull);
+			index_deform_tuple(item, itupdesc, values, isnull);
+			firstrightstr = BuildIndexValueDescription(rel, values, isnull);
+			index_deform_tuple(newitem, itupdesc, values, isnull);
+			newstr = BuildIndexValueDescription(rel, values, isnull);
+
+			elog(LOG, "\"%s\" leaf block %u "
+				 "last left %s first right %s "
+				 "attributes truncated: %u from %u%s new item %s",
+				 RelationGetRelationName(rel), BufferGetBlockNumber(buf),
+				 lastleftstr, firstrightstr,
+				 IndexRelationGetNumberOfKeyAttributes(rel) - BTreeTupleGetNAtts(lefthikey, rel),
+				 IndexRelationGetNumberOfKeyAttributes(rel),
+				 BTreeTupleGetHeapTID(lefthikey) != NULL ? " (heap TID added back)" : "",
+				 newstr);
+		}
+#endif
 	}
 	else
+	{
 		lefthikey = item;
+#ifdef DEBUG_SPLITS
+		if (IsNormalProcessingMode() && !IsSystemRelation(rel))
+		{
+			TupleDesc	itupdesc = RelationGetDescr(rel);
+			Datum		values[INDEX_MAX_KEYS];
+			bool		isnull[INDEX_MAX_KEYS];
+			char	   *newhighkey;
+			char	   *newstr;
+
+			index_deform_tuple(lefthikey, itupdesc, values, isnull);
+			newhighkey = BuildIndexValueDescription(rel, values, isnull);
+			index_deform_tuple(newitem, itupdesc, values, isnull);
+			newstr = BuildIndexValueDescription(rel, values, isnull);
+
+			elog(LOG, "\"%s\" internal block %u "
+				 "new high key %s "
+				 "attributes truncated: %u from %u%s new item %s",
+				 RelationGetRelationName(rel), BufferGetBlockNumber(buf),
+				 newhighkey,
+				 IndexRelationGetNumberOfKeyAttributes(rel) - BTreeTupleGetNAtts(lefthikey, rel),
+				 IndexRelationGetNumberOfKeyAttributes(rel),
+				 BTreeTupleGetHeapTID(lefthikey) != NULL ? " (heap TID added back)" : "",
+				 newstr);
+		}
+#endif
+	}
 
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) <=
@@ -1861,6 +1921,11 @@ _bt_findsplitloc(Relation rel,
 
 		if (firstdiff <= perfectfirstdiff)
 		{
+#ifdef DEBUG_SPLITS
+			elog(LOG, "\"%s\" perfect firstdiff %d returned at point %d out of %d",
+				 RelationGetRelationName(rel), perfectfirstdiff, i,
+				 state.nsplits);
+#endif
 			bestfirstdiff = firstdiff;
 			lowsplit = i;
 			break;
@@ -1879,6 +1944,10 @@ _bt_findsplitloc(Relation rel,
 	if (state.splits != splits)
 		pfree(state.splits);
 
+#ifdef DEBUG_SPLITS
+	elog(LOG, "\"%s\" split location %d out of %u",
+		 RelationGetRelationName(rel), finalfirstright, maxoff);
+#endif
 	return finalfirstright;
 }
 
-- 
2.17.1

v5-0001-Make-nbtree-indexes-have-unique-keys-in-tuples.patch
From f471aa15ffb79421bdb1db9c532aba82115f0b34 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v5 1/3] Make nbtree indexes have unique keys in tuples.

Make nbtree treat all index tuples as having a heap TID trailing
attribute.  Heap TID becomes a first class part of the key space on all
levels of the tree.  Index searches can distinguish duplicates by heap
TID, though for now this is only used by insertions that need to find a
leaf page to insert a tuple on.  This general approach has numerous
benefits for performance, and may enable a later enhancement that has
nbtree vacuuming perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is also introduced.  This will usually truncate away the "extra"
heap TID attribute from pivot tuples during a leaf page split, and may
also truncate away additional user attributes.  This can increase
fan-out when there are multiple indexed attributes, though this is of
secondary importance.  Truncation can only occur at the attribute
granularity, which isn't particularly effective, but works well enough
for now.

We completely remove the logic that allows a search for free space among
multiple pages full of duplicates to "get tired".  This has significant
benefits for free space management in secondary indexes on low
cardinality attributes.  Unique checking still has to start with the
first page that its heap-TID-free insertion scan key leads it to, though
insertion can then quickly find the leaf page and offset its new tuple
unambiguously belongs at (in the unique case there will rarely be
multiple pages full of duplicates, so being unable to descend the tree
to directly find the insertion target leaf page will seldom be much of a
problem).

Note that this version of the patch doesn't yet deal with on-disk
compatibility issues.  That will follow in a later revision.
---
 contrib/amcheck/verify_nbtree.c               | 259 ++++--
 contrib/pageinspect/expected/btree.out        |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out  |  10 +-
 src/backend/access/nbtree/README              | 114 ++-
 src/backend/access/nbtree/nbtinsert.c         | 862 ++++++++++++++----
 src/backend/access/nbtree/nbtpage.c           |   8 +-
 src/backend/access/nbtree/nbtsearch.c         | 190 +++-
 src/backend/access/nbtree/nbtsort.c           |  56 +-
 src/backend/access/nbtree/nbtutils.c          | 244 ++++-
 src/backend/access/nbtree/nbtxlog.c           |  41 +-
 src/backend/access/rmgrdesc/nbtdesc.c         |   8 -
 src/backend/storage/page/bufpage.c            |   4 +-
 src/backend/utils/sort/tuplesort.c            |  13 +-
 src/include/access/nbtree.h                   |  97 +-
 src/include/access/nbtxlog.h                  |  19 +-
 src/test/regress/expected/domain.out          |   4 +-
 src/test/regress/expected/foreign_key.out     |   4 +-
 src/test/regress/expected/join.out            |   2 +-
 src/test/regress/expected/truncate.out        |   5 +-
 src/test/regress/expected/typed_table.out     |  11 +-
 src/test/regress/expected/updatable_views.out |  18 +-
 src/test/regress/sql/domain.sql               |   2 +
 src/test/regress/sql/foreign_key.sql          |   2 +
 src/test/regress/sql/truncate.sql             |   2 +
 src/test/regress/sql/typed_table.sql          |   2 +
 src/test/regress/sql/updatable_views.sql      |   2 +
 src/tools/pgindent/typedefs.list              |   2 +
 27 files changed, 1475 insertions(+), 508 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index a1438a2855..87a929dff9 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -45,6 +45,13 @@ PG_MODULE_MAGIC;
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
 
+/*
+ * Convenience macro to get number of key attributes in tuple in low-context
+ * fashion
+ */
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
+
 /*
  * State associated with verifying a B-Tree index
  *
@@ -125,26 +132,30 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
+static IndexTuple bt_right_page_check_tuple(BtreeCheckState *state);
 static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+				  ScanKey targetkey, ItemPointer scantid, int tupnkeyatts);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
 						  bool tupleIsAlive, void *checkstate);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
+static inline bool invariant_l_offset(BtreeCheckState *state,
+				   int tupnkeyatts, ScanKey key, ItemPointer scantid,
+				   OffsetNumber upperbound);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 int tupnkeyatts, ScanKey key, ItemPointer scantid,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
-					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
-							   OffsetNumber upperbound);
+static inline bool invariant_g_offset(BtreeCheckState *state,
+				   int tupnkeyatts, ScanKey key, ItemPointer scantid,
+				   OffsetNumber lowerbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							 Page other, int tupnkeyatts, ScanKey key,
+							 ItemPointer scantid, OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+							IndexTuple itup, bool isleaf);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -834,8 +845,10 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		int			tupnkeyatts;
+		ScanKey		skey;
+		ItemPointer scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -902,8 +915,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
-		/* Build insertion scankey for current page offset */
+		/*
+		 * Build insertion scankey for current page offset/tuple.
+		 *
+		 * As required by _bt_mkscankey(), track number of key attributes,
+		 * which is needed so that _bt_compare() calls handle truncated
+		 * attributes correctly.  Never count non-key attributes in
+		 * non-truncated tuples as key attributes, though.
+		 */
+		tupnkeyatts = BTreeTupleGetNKeyAtts(itup, state->rel);
 		skey = _bt_mkscankey(state->rel, itup);
+		scantid = BTreeTupleGetHeapTIDCareful(state, itup, P_ISLEAF(topaque));
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -930,7 +952,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 * and probably not markedly more effective in practice.
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!invariant_leq_offset(state, tupnkeyatts, skey, scantid, P_HIKEY))
 		{
 			char	   *itid,
 					   *htid;
@@ -956,11 +978,11 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, tupnkeyatts, skey, scantid,
+								OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1017,16 +1039,28 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
+			IndexTuple	righttup;
 			ScanKey		rightkey;
+			int			righttupnkeyatts;
+			ItemPointer rightscantid;
 
 			/* Get item in next/right page */
-			rightkey = bt_right_page_check_scankey(state);
+			righttup = bt_right_page_check_tuple(state);
 
-			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+			/* Set up right item scankey */
+			if (righttup)
+			{
+				righttupnkeyatts = BTreeTupleGetNKeyAtts(righttup, state->rel);
+				rightkey = _bt_mkscankey(state->rel, righttup);
+				rightscantid = BTreeTupleGetHeapTIDCareful(state, righttup,
+														   P_ISLEAF(topaque));
+			}
+
+			if (righttup && !invariant_g_offset(state, righttupnkeyatts,
+												rightkey, rightscantid, max))
 			{
 				/*
-				 * As explained at length in bt_right_page_check_scankey(),
+				 * As explained at length in bt_right_page_check_tuple(),
 				 * there is a known !readonly race that could account for
 				 * apparent violation of invariant, which we must check for
 				 * before actually proceeding with raising error.  Our canary
@@ -1069,7 +1103,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, childblock, skey, scantid, tupnkeyatts);
 		}
 	}
 
@@ -1083,9 +1117,9 @@ bt_target_page_check(BtreeCheckState *state)
 }
 
 /*
- * Return a scankey for an item on page to right of current target (or the
+ * Return an index tuple for an item on page to right of current target (or the
  * first non-ignorable page), sufficient to check ordering invariant on last
- * item in current target page.  Returned scankey relies on local memory
+ * item in current target page.  Returned tuple relies on local memory
  * allocated for the child page, which caller cannot pfree().  Caller's memory
  * context should be reset between calls here.
  *
@@ -1098,8 +1132,8 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
-bt_right_page_check_scankey(BtreeCheckState *state)
+static IndexTuple
+bt_right_page_check_tuple(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
@@ -1287,11 +1321,10 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	}
 
 	/*
-	 * Return first real item scankey.  Note that this relies on right page
-	 * memory remaining allocated.
+	 * Return first real item.  Note that this relies on right page memory
+	 * remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	return (IndexTuple) PageGetItem(rightpage, rightitem);
 }
 
 /*
@@ -1305,7 +1338,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  */
 static void
 bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+				  ScanKey targetkey, ItemPointer scantid, int tupnkeyatts)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1354,7 +1387,8 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the
+	 * page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1404,14 +1438,14 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		/*
 		 * Skip comparison of target page key against "negative infinity"
 		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * bound, but that's only because of the hard-coding for negative
+		 * infinity items within _bt_compare().
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_l_nontarget_offset(state, child, tupnkeyatts,
+										  targetkey, scantid, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1751,6 +1785,54 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+				   ItemPointer scantid, OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  upperbound);
+
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as
+	 * meaning that they're not participating in a search, not as negative
+	 * infinity (only tuples within the index are treated as negative
+	 * infinity).  Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+
+		/* Get heap TID for item to the right */
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup,
+											   P_ISLEAF(topaque));
+
+		if (uppnkeyatts == tupnkeyatts)
+			return scantid == NULL && rheaptid != NULL;
+
+		return tupnkeyatts < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1759,57 +1841,93 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
-					 OffsetNumber upperbound)
+invariant_leq_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+					 ItemPointer scantid, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  upperbound);
 
 	return cmp <= 0;
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+				   ItemPointer scantid, OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	/*
+	 * No need to consider possibility that scankey has attributes that we
+	 * need to force to be interpreted as negative infinity, since scan key
+	 * has to be strictly greater than lower bound offset.
+	 */
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  lowerbound);
 
-	return cmp >= 0;
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e.  the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, Page nontarget,
+							 int tupnkeyatts, ScanKey key,
+							 ItemPointer scantid, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, nontarget,
+					  upperbound);
 
-	return cmp <= 0;
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as
+	 * meaning that they're not participating in a search, not as negative
+	 * infinity (only tuples within the index are treated as negative
+	 * infinity).  Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+		BTPageOpaque copaque;
+
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+
+		/* Get heap TID for item from child/non-target */
+		childheaptid =
+			BTreeTupleGetHeapTIDCareful(state, child, P_ISLEAF(copaque));
+
+		if (uppnkeyatts == tupnkeyatts)
+			return scantid == NULL && childheaptid != NULL;
+
+		return tupnkeyatts < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -1965,3 +2083,32 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
+ * be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ *
+ * Note that it is incorrect to specify the tuple as a non-pivot when passing a
+ * leaf tuple that came from the high key offset, since that is actually a
+ * pivot tuple.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	BlockNumber targetblock = state->targetblock;
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b..07c2dcd771 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d4..9920dbfd40 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89..dc6c65d201 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -34,30 +34,47 @@ Differences to the Lehman & Yao algorithm
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
+The requirement that all btree keys be unique is satisfied by treating
+heap TID as a tie-breaker attribute.  Logical duplicates are sorted in
+descending item pointer order.  We don't use btree keys to
+disambiguate downlinks from the internal pages during a page split,
+though: only one entry in the parent level will be pointing at the
+page we just split, so the link fields can be used to re-find
+downlinks in the parent via a linear search.
 
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
+Lehman and Yao require that the key range for a subtree S is described
+by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the
+parent page, but do not account for the need to search the tree based
+only on leading index attributes in a composite index.  Since heap TID
+is always used to make btree keys unique (even in unique indexes),
+every btree index is treated as a composite index internally.  A
+search that finds exact equality to a pivot tuple in an upper tree
+level must descend to the left of that key to ensure it finds any
+equal keys, even when scan values were provided for all attributes.
+An insertion that sees that the high key of its target page is equal
+to the key to be inserted cannot move right, since the downlink for
+the right sibling in the parent must always be strictly less than
+right sibling keys (this is always possible because the leftmost
+downlink on any non-leaf level is always a negative infinity
+downlink).
+
+We might be able to avoid moving left in the event of a full match on
+all attributes up to and including the heap TID attribute, but that
+would be a very narrow win, since it's rather unlikely that heap TID
+will be an exact match.  We can avoid moving left unnecessarily when
+all user-visible keys are equal by avoiding exact equality;  a
+sentinel value that's less than any possible heap TID is used by most
+index scans.  This is effective because of suffix truncation.  An
+"extra" heap TID attribute in pivot tuples is almost always avoided.
+All truncated attributes compare as minus infinity, even against a
+sentinel value, and the sentinel value is less than any real TID
+value, so an unnecessary move to the left is avoided regardless of
+whether or not a heap TID is present in the otherwise-equal pivot
+tuple.  Consistently moving left on full equality is also needed by
+page deletion, which re-finds a leaf page by descending the tree while
+searching on the leaf page's high key.  If we wanted to avoid moving
+left without breaking page deletion, we'd have to avoid suffix
+truncation, which could never be worth it.
 
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
@@ -610,21 +627,25 @@ scanned to decide whether to return the entry and whether the scan can
 stop (see _bt_checkkeys()).
 
 We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+to heap tuples, but are used only for tree navigation.  Pivot tuples
+include all tuples on non-leaf pages and high keys on leaf pages.  Note
+that pivot index tuples are only used to represent which part of the key
+space belongs on each page, and can have attribute values copied from
+non-pivot tuples that were deleted and killed by VACUUM some time ago.
+
+We truncate away attributes that are not needed for a page high key during
+a leaf page split, provided that the remaining attributes distinguish the
+last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page.  A truncated tuple logically retains the truncated
+suffix key attributes, which implicitly have "negative infinity" as their
+value.  This optimization is called suffix truncation.  Since the high key
+is subsequently reused as the downlink in the parent page for the new
+right page, suffix truncation can increase index fan-out considerably by
+keeping pivot tuples short.  INCLUDE indexes are guaranteed to have
+non-key attributes truncated at the time of a leaf page split, but may
+also have some key attributes truncated away, based on the usual criteria
+for key attributes.
 
 Notes About Data Representation
 -------------------------------
@@ -658,4 +679,19 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
+
+Non-leaf pages only truly need to truncate their first item to zero
+attributes at the leftmost level, since that truly is negative infinity.
+All other negative infinity items are only really negative infinity
+within the subtree that the page is at the root of (or is a leftmost
+page within).  We truncate away all attributes of the first item on
+non-leaf pages just the same, to save a little space.  If we ever
+avoided zero-truncating items on pages where that doesn't accurately
+represent the absolute separation of the keyspace, we'd be left with
+"low key" items on internal pages -- a key value that can be used as a
+lower bound on items on the page, much like the high key is an upper
+bound. (Actually, that would even be true of "true" negative infinity
+items.  One can think of rightmost pages as implicitly containing
+"positive infinity" high keys.)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 582e5b0652..0cb8bb1816 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,29 +24,44 @@
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
 #include "storage/smgr.h"
+#include "utils/datum.h"
 #include "utils/tqual.h"
 
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
+#define STACK_SPLIT_POINTS			15
+
+typedef enum
+{
+	/* strategy to use for a call to FindSplitData */
+	SPLIT_DEFAULT,				/* give some weight to truncation */
+	SPLIT_MANY_DUPLICATES,		/* find minimally distinguishing point */
+	SPLIT_SINGLE_VALUE			/* leave left page almost empty */
+} SplitMode;
+
+typedef struct
+{
+	/* FindSplitData candidate split */
+	int			delta;			/* size delta */
+	bool		newitemonleft;	/* new item on left or right of split */
+	OffsetNumber firstright;	/* split point */
+} SplitPoint;
 
 typedef struct
 {
 	/* context data for _bt_checksplitloc */
 	Size		newitemsz;		/* size of new item to be inserted */
-	int			fillfactor;		/* needed when splitting rightmost page */
+	int			fillfactor;		/* needed for weighted splits */
 	bool		is_leaf;		/* T if splitting a leaf page */
-	bool		is_rightmost;	/* T if splitting a rightmost page */
+	bool		is_weighted;	/* T if weighted (e.g. rightmost) split */
 	OffsetNumber newitemoff;	/* where the new item is to be inserted */
 	int			leftspace;		/* space available for items on left page */
 	int			rightspace;		/* space available for items on right page */
 	int			olddataitemstotal;	/* space taken by old items */
 
-	bool		have_split;		/* found a valid split? */
-
-	/* these fields valid only if have_split is true */
-	bool		newitemonleft;	/* new item on left or right of best split */
-	OffsetNumber firstright;	/* best split point */
-	int			best_delta;		/* best size delta so far */
+	int			maxsplit;		/* Maximum number of splits */
+	int			nsplits;		/* Current number of splits */
+	SplitPoint *splits;			/* Sorted by delta */
 } FindSplitData;
 
 
@@ -76,12 +91,18 @@ static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft);
-static void _bt_checksplitloc(FindSplitData *state,
+				 SplitMode mode, OffsetNumber newitemoff,
+				 Size newitemsz, IndexTuple newitem, bool *newitemonleft);
+static int _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright, bool newitemonleft,
 				  int dataitemstoleft, Size firstoldonrightsz);
+static int _bt_perfect_firstdiff(Relation rel, Page page,
+					  OffsetNumber newitemoff, IndexTuple newitem,
+					  int nsplits, SplitPoint *splits, SplitMode *secondmode);
+static int _bt_split_firstdiff(Relation rel, Page page, OffsetNumber newitemoff,
+					IndexTuple newitem, SplitPoint *split);
+static int _bt_tuple_firstdiff(Relation rel, IndexTuple lastleft,
+					IndexTuple firstright, bool *identical);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
@@ -113,9 +134,12 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	bool		is_unique = false;
 	int			indnkeyatts;
 	ScanKey		itup_scankey;
+	ItemPointer itup_scantid;
 	BTStack		stack = NULL;
 	Buffer		buf;
 	OffsetNumber offset;
+	Page		page;
+	BTPageOpaque lpageop;
 	bool		fastpath;
 
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -123,6 +147,8 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
+	/* we use a heap TID with scan key if this isn't unique case */
+	itup_scantid = (checkUnique == UNIQUE_CHECK_NO ? &itup->t_tid : NULL);
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -149,8 +175,6 @@ top:
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
 		Size		itemsz;
-		Page		page;
-		BTPageOpaque lpageop;
 
 		/*
 		 * Conditionally acquire exclusive lock on the buffer before doing any
@@ -180,8 +204,8 @@ top:
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, indnkeyatts, itup_scankey, itup_scantid,
+							page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -220,8 +244,8 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, indnkeyatts, itup_scankey, itup_scantid, false,
+						   &buf, BT_WRITE, NULL);
 	}
 
 	/*
@@ -231,12 +255,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tie-breaker attribute.  Any other would-be inserter
+	 * of the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -250,7 +275,11 @@ top:
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
+		page = BufferGetPage(buf);
+		lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+		Assert(itup_scantid == NULL);
+		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, NULL,
+							 P_FIRSTDATAKEY(lpageop), false);
 		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
 								 checkUnique, &is_unique, &speculativeToken);
 
@@ -288,7 +317,7 @@ top:
 		 * attributes are not considered part of the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
+		/* do the insertion, possibly on a page to the right in unique case */
 		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
 						  stack, heapRel);
 		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
@@ -553,11 +582,11 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
-			/* If scankey == hikey we gotta check the next page too */
+			/* If scankey <= hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			/* _bt_isequal()'s special NULL semantics not required here */
+			if (_bt_compare(rel, indnkeyatts, itup_scankey, NULL, page, P_HIKEY) > 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -601,31 +630,22 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 /*
  *	_bt_findinsertloc() -- Finds an insert location for a tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
- *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
- *
  *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
- *		page returned to the caller is exclusively locked instead.
+ *		where the new tuple could be inserted if we were to treat it as having
+ *		no implicit heap TID; only callers that just called _bt_check_unique()
+ *		provide this hint (all other callers should set *offsetptr to
+ *		InvalidOffsetNumber).  The caller should hold an exclusive lock on
+ *		*bufptr in all cases.  On exit, they both point to the chosen insert
+ *		location in all cases.  If _bt_findinsertloc decides to move right, the
+ *		lock and pin on the original page will be released, and the new page
+ *		returned to the caller is exclusively locked instead.
+ *
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.
  *
  *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		type scan key for it.  We take a "scantid" heap TID attribute value
+ *		from newtup directly.
  */
 static void
 _bt_findinsertloc(Relation rel,
@@ -641,9 +661,9 @@ _bt_findinsertloc(Relation rel,
 	Page		page = BufferGetPage(buf);
 	Size		itemsz;
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
+	bool		hintinvalidated;
 	OffsetNumber newitemoff;
+	OffsetNumber lowitemoff;
 	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -673,59 +693,30 @@ _bt_findinsertloc(Relation rel,
 				 errtableconstraint(heapRel,
 									RelationGetRelationName(rel))));
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
+	/* firstlegaloff/offsetptr hint (if any) assumed valid initially */
+	hintinvalidated = false;
+
+	/*
+	 * TODO: Restore the logic for finding a page to insert on in the event of
+	 * many duplicates for pre-pg_upgrade indexes.  The whole search through
+	 * pages of logical duplicates to determine where to insert seems like
+	 * something that has little upside, but that doesn't make it okay to
+	 * ignore the performance characteristics after pg_upgrade is run, but
+	 * before a REINDEX can run to bump BTREE_VERSION.
 	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	while (true)
 	{
 		Buffer		rbuf;
 		BlockNumber rblkno;
 
-		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
-		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
-		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
-
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
-		}
-
-		/*
-		 * nope, so check conditions (b) and (c) enumerated above
-		 */
 		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+			_bt_compare(rel, keysz, scankey, &newtup->t_tid, page, P_HIKEY) <= 0)
 			break;
 
 		/*
-		 * step right to next non-dead page
+		 * step right to next non-dead page.  this is only needed for unique
+		 * indexes, and pg_upgrade'd indexes that still use BTREE_VERSION 2 or
+		 * 3, where heap TID isn't considered to be a part of the keyspace.
 		 *
 		 * must write-lock that page before releasing write lock on current
 		 * page; else someone else's _bt_check_unique scan could fail to see
@@ -764,24 +755,40 @@ _bt_findinsertloc(Relation rel,
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		hintinvalidated = true;
+	}
+
+	Assert(P_ISLEAF(lpageop));
+
+	/*
+	 * Perform micro-vacuuming of the page we're about to insert the tuple
+	 * on, if it looks like it has LP_DEAD items.
+	 */
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < itemsz)
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		hintinvalidated = true;
 	}
 
 	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
+	 * Consider using caller's hint to avoid repeated binary search effort.
+	 *
+	 * Note that the hint is only provided by callers that checked uniqueness.
+	 * The hint is used as a lower bound for a new binary search, since
+	 * caller's original binary search won't have specified a scan tid.
 	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
+	if (firstlegaloff == InvalidOffsetNumber || hintinvalidated)
+		lowitemoff = P_FIRSTDATAKEY(lpageop);
 	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	{
+		Assert(firstlegaloff == _bt_binsrch(rel, buf, keysz, scankey, NULL,
+											P_FIRSTDATAKEY(lpageop), false));
+		lowitemoff = firstlegaloff;
+	}
+
+	newitemoff = _bt_binsrch(rel, buf, keysz, scankey, &newtup->t_tid,
+							 lowitemoff, false);
 
 	*bufptr = buf;
 	*offsetptr = newitemoff;
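
To illustrate the hint's role here: firstlegaloff can only seed the lower bound of the second binary search rather than replace it, because the caller's earlier _bt_binsrch() ran without a scantid and therefore stops at the start of any run of equal keys.  A minimal standalone sketch of resuming a binary search from a caller-supplied lower bound (plain C with invented data and names, not the nbtree routines themselves):

#include <stdio.h>

/*
 * Standalone illustration only.  Find the offset of the first element >= key,
 * starting the search at a caller-supplied lower bound.  Passing the result
 * of an earlier, less precise search as "low" avoids repeating the
 * comparisons that search already performed.
 */
static int
binsrch_from_low(const int *items, int nitems, int key, int low)
{
	int			high = nitems;

	while (low < high)
	{
		int			mid = low + (high - low) / 2;

		if (items[mid] < key)
			low = mid + 1;
		else
			high = mid;
	}
	return low;
}

int
main(void)
{
	/* pretend these are the keys on a leaf page, duplicates included */
	int			page[] = {10, 20, 20, 20, 30, 40};
	int			hint;

	/* first pass: no tie-breaker, lands at the start of the run of 20s */
	hint = binsrch_from_low(page, 6, 20, 0);

	/* second pass: resume from the hint rather than from offset 0 */
	printf("hint=%d final=%d\n", hint, binsrch_from_low(page, 6, 20, hint));
	return 0;
}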
@@ -840,11 +847,12 @@ _bt_insertonpg(Relation rel,
 	/* child buffer must be given iff inserting on an internal page */
 	Assert(P_ISLEAF(lpageop) == !BufferIsValid(cbuf));
 	/* tuple must have appropriate number of attributes */
+	Assert(BTreeTupleGetNAtts(itup, rel) > 0);
 	Assert(!P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) ==
 		   IndexRelationGetNumberOfAttributes(rel));
 	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
+		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/* The caller should've finished any incomplete splits already. */
@@ -889,8 +897,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* Choose the split point */
-		firstright = _bt_findsplitloc(rel, page,
-									  newitemoff, itemsz,
+		firstright = _bt_findsplitloc(rel, page, SPLIT_DEFAULT,
+									  newitemoff, itemsz, itup,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
@@ -1132,8 +1140,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	OffsetNumber i;
 	bool		isleaf;
 	IndexTuple	lefthikey;
-	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/* Acquire a new page to split into */
 	rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1203,7 +1209,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <=
+			   IndexRelationGetNumberOfKeyAttributes(rel));
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1217,8 +1225,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 
 	/*
 	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
-	 * item at position firstright, or the incoming tuple.
+	 * to go into the new right page, or possibly a truncated version if this
+	 * is a leaf page split.  This might be either the existing data item at
+	 * position firstright, or the incoming tuple.
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1236,25 +1245,55 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
-	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
+	 * Truncate attributes of the high key item before inserting it on the
+	 * left page.  This can only happen at the leaf level, since in general
+	 * all pivot tuple values originate from leaf level high keys.  This isn't
+	 * just about avoiding unnecessary work, though; truncating unneeded key
+	 * suffix attributes can only be performed at the leaf level anyway.  This
+	 * is because a pivot tuple in a grandparent page must guide a search not
 	 * only to the correct parent page, but also to the correct leaf page.
+	 *
+	 * Note that non-key (INCLUDE) attributes are always truncated away here.
+	 * Additional key attributes are truncated away when they're not required
+	 * to correctly separate the key space.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (isleaf)
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		OffsetNumber lastleftoff;
+		IndexTuple	lastleft;
+
+		/*
+		 * Determine which tuple is on the left side of the split point, and
+		 * generate a truncated copy of the right tuple.  Truncate as
+		 * aggressively as possible without generating a high key for the left
+		 * side of the split (and later downlink for the right side) that
+		 * fails to distinguish each side.  The new high key needs to be
+		 * strictly less than all tuples on the right side of the split, but
+		 * can be equal to items on the left side of the split.
+		 *
+		 * Handle the case where the incoming tuple is about to become the
+		 * last item on the left side of the split.
+		 */
+		if (newitemonleft && newitemoff == firstright)
+			lastleft = newitem;
+		else
+		{
+			lastleftoff = OffsetNumberPrev(firstright);
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		Assert(lastleft != item);
+		lefthikey = _bt_suffix_truncate(rel, lastleft, item);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <=
+		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
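
As a gloss on the truncation rule applied above: the new pivot only needs to keep key attributes up to and including the first one that distinguishes lastleft from firstright, and only when no key attribute distinguishes them does a heap TID have to be appended.  A standalone sketch of that rule, using made-up integer "attributes" rather than the real _bt_suffix_truncate()/_bt_leave_natts() machinery:

/*
 * Standalone illustration only.  Decide how many key "attributes" a
 * truncated pivot must keep so that it remains strictly greater than
 * lastleft while still being <= firstright.
 */
static int
pivot_natts_to_keep(const int *lastleft, const int *firstright, int nkeyatts)
{
	int			attnum;

	for (attnum = 0; attnum < nkeyatts; attnum++)
	{
		if (lastleft[attnum] != firstright[attnum])
			return attnum + 1;	/* keep the first distinguishing attribute */
	}

	/*
	 * Every key attribute is equal: the caller has no choice but to append
	 * the (implicit) heap TID attribute so that the pivot still separates
	 * the two halves of the split.
	 */
	return nkeyatts + 1;
}

For example, pivot_natts_to_keep((int[]){1, 7, 7}, (int[]){1, 9, 2}, 3) returns 2: the third attribute can be truncated away because the second already distinguishes the two halves.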
@@ -1447,7 +1486,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1476,22 +1514,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log left page.  We must also log the left page's high key. */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1509,9 +1535,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -1548,6 +1572,39 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * for it, we might find ourselves with too little room on the page that
  * it needs to go into!)
  *
+ * We also give some weight to suffix truncation in deciding a split point
+ * on leaf pages.  We try to select a point where a distinguishing attribute
+ * appears earlier in the new high key for the left side of the split, in
+ * order to maximize the number of trailing attributes that can be truncated
+ * away.  Generally speaking, only candidate split points that fall within
+ * an acceptable space utilization range are considered.  This is even
+ * useful with pages that only have a single (non-TID) attribute, since it's
+ * important to avoid appending an explicit heap TID attribute to the new
+ * pivot tuple (high key/downlink) when it cannot actually be truncated.
+ * Avoiding appending a heap TID can be thought of as a "logical" suffix
+ * truncation that "removes" the final attribute in the new high key for the
+ * new left page.
+ *
+ * We do all we can to avoid having to append a heap TID in the new high
+ * key.  We may have to call ourselves recursively in many duplicates mode.
+ * This happens when a heap TID would otherwise be appended, but the page
+ * isn't completely full of logical duplicates (there may be as few as two
+ * distinct values).  Many duplicates mode has no hard requirements for
+ * space utilization, though it still keeps the use of space balanced as a
+ * non-binding secondary goal.  This significantly improves fan-out in
+ * practice, at least with most affected workloads.
+ *
+ * Many duplicates mode may lead to slightly inferior space utilization when
+ * values are spaced apart at fixed intervals, even on levels above the leaf
+ * level.  Even when that happens, many duplicates mode will probably still
+ * beat the generic default strategy.  Not having groups of duplicates
+ * straddle two leaf pages is likely to more than make up for having sparser
+ * pages, since "false sharing" of leaf blocks by index scans is avoided.  A
+ * point lookup will only visit one leaf page, not two. (This kind of false
+ * sharing may also have negative implications for page deletion during
+ * vacuuming, and may artificially increase the number of pages subsequently
+ * dirtied.)
+ *
  * If the page is the rightmost page on its level, we instead try to arrange
  * to leave the left split page fillfactor% full.  In this way, when we are
  * inserting successively increasing keys (consider sequences, timestamps,
@@ -1556,6 +1613,17 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * This is the same as nbtsort.c produces for a newly-created tree.  Note
  * that leaf and nonleaf pages use different fillfactors.
  *
+ * If called recursively in single value mode, we also apply a fillfactor%
+ * weighting to the left split page, though with a fillfactor that leaves the
+ * left page mostly empty and the right page mostly full, rather than the
+ * other way around.  This greatly helps with
+ * space management in cases where tuples with the same attribute values
+ * span multiple pages.  Newly inserted duplicates will tend to have higher
+ * heap TID values, so we'll end up splitting the same page again and again
+ * as even more duplicates are inserted.  (The heap TID attribute has
+ * descending sort order, so ascending heap TID values continually split the
+ * same low page).
+ *
  * We are passed the intended insert position of the new tuple, expressed as
  * the offsetnumber of the tuple it must go in front of.  (This could be
  * maxoff+1 if the tuple is to go at the end.)
@@ -1568,8 +1636,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 static OffsetNumber
 _bt_findsplitloc(Relation rel,
 				 Page page,
+				 SplitMode mode,
 				 OffsetNumber newitemoff,
 				 Size newitemsz,
+				 IndexTuple newitem,
 				 bool *newitemonleft)
 {
 	BTPageOpaque opaque;
@@ -1581,19 +1651,32 @@ _bt_findsplitloc(Relation rel,
 				rightspace,
 				goodenough,
 				olddataitemstotal,
-				olddataitemstoleft;
+				olddataitemstoleft,
+				perfectfirstdiff,
+				bestfirstdiff,
+				lowsplit;
 	bool		goodenoughfound;
+	SplitPoint	splits[STACK_SPLIT_POINTS];
+	SplitMode	secondmode;
+	OffsetNumber finalfirstright;
 
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
-	newitemsz += sizeof(ItemIdData);
+	maxoff = PageGetMaxOffsetNumber(page);
 
 	/* Total free space available on a btree page, after fixed overhead */
 	leftspace = rightspace =
 		PageGetPageSize(page) - SizeOfPageHeaderData -
 		MAXALIGN(sizeof(BTPageOpaqueData));
 
+	/*
+	 * Conservatively assume that suffix truncation cannot avoid adding a heap
+	 * TID to the left half's new high key when splitting at the leaf level.
+	 * Accounting for the size of the rest of the high key comes later, since
+	 * it's considered for every candidate split point.
+	 */
+	if (P_ISLEAF(opaque))
+		leftspace -= MAXALIGN(sizeof(ItemPointerData));
+
 	/* The right page will have the same high key as the old page */
 	if (!P_RIGHTMOST(opaque))
 	{
@@ -1605,18 +1688,37 @@ _bt_findsplitloc(Relation rel,
 	/* Count up total space in data items without actually scanning 'em */
 	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
 
-	state.newitemsz = newitemsz;
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	state.newitemsz = newitemsz + sizeof(ItemIdData);
 	state.is_leaf = P_ISLEAF(opaque);
-	state.is_rightmost = P_RIGHTMOST(opaque);
-	state.have_split = false;
+	state.is_weighted = P_RIGHTMOST(opaque) || mode == SPLIT_SINGLE_VALUE;
 	if (state.is_leaf)
-		state.fillfactor = RelationGetFillFactor(rel,
-												 BTREE_DEFAULT_FILLFACTOR);
+	{
+		if (mode != SPLIT_SINGLE_VALUE)
+			state.fillfactor = RelationGetFillFactor(rel,
+													 BTREE_DEFAULT_FILLFACTOR);
+		else
+			state.fillfactor = BTREE_SINGLEVAL_FILLFACTOR;
+
+		if (mode == SPLIT_DEFAULT)
+			state.maxsplit = Min(Max(3, maxoff / 16), STACK_SPLIT_POINTS);
+		else if (mode == SPLIT_MANY_DUPLICATES)
+			state.maxsplit = maxoff;
+		else
+			state.maxsplit = 1;
+	}
 	else
+	{
+		Assert(mode == SPLIT_DEFAULT);
+
 		state.fillfactor = BTREE_NONLEAF_FILLFACTOR;
-	state.newitemonleft = false;	/* these just to keep compiler quiet */
-	state.firstright = 0;
-	state.best_delta = 0;
+		state.maxsplit = 1;
+	}
+	state.nsplits = 0;
+	if (mode != SPLIT_MANY_DUPLICATES)
+		state.splits = splits;
+	else
+		state.splits = palloc(sizeof(SplitPoint) * maxoff);
 	state.leftspace = leftspace;
 	state.rightspace = rightspace;
 	state.olddataitemstotal = olddataitemstotal;
@@ -1625,11 +1727,13 @@ _bt_findsplitloc(Relation rel,
 	/*
 	 * Finding the best possible split would require checking all the possible
 	 * split points, because of the high-key and left-key special cases.
-	 * That's probably more work than it's worth; instead, stop as soon as we
-	 * find a "good-enough" split, where good-enough is defined as an
-	 * imbalance in free space of no more than pagesize/16 (arbitrary...) This
-	 * should let us stop near the middle on most pages, instead of plowing to
-	 * the end.
+	 * That's probably more work than it's worth in default mode; instead,
+	 * stop as soon as we find all "good-enough" splits, where good-enough is
+	 * defined as an imbalance in free space of no more than pagesize/16
+	 * (arbitrary...) This should let us stop near the middle on most pages,
+	 * instead of plowing to the end.  Many duplicates mode does consider
+	 * all choices, while single value mode gives up as soon as it finds a
+	 * good enough split point.
 	 */
 	goodenough = leftspace / 16;
 
@@ -1639,13 +1743,13 @@ _bt_findsplitloc(Relation rel,
 	 */
 	olddataitemstoleft = 0;
 	goodenoughfound = false;
-	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (offnum = P_FIRSTDATAKEY(opaque);
 		 offnum <= maxoff;
 		 offnum = OffsetNumberNext(offnum))
 	{
 		Size		itemsz;
+		int			delta;
 
 		itemid = PageGetItemId(page, offnum);
 		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
@@ -1654,28 +1758,38 @@ _bt_findsplitloc(Relation rel,
 		 * Will the new item go to left or right of split?
 		 */
 		if (offnum > newitemoff)
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
+			delta = _bt_checksplitloc(&state, offnum, true,
+									  olddataitemstoleft, itemsz);
 
 		else if (offnum < newitemoff)
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
+			delta = _bt_checksplitloc(&state, offnum, false,
+									  olddataitemstoleft, itemsz);
 		else
 		{
 			/* need to try it both ways! */
 			_bt_checksplitloc(&state, offnum, true,
 							  olddataitemstoleft, itemsz);
 
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
+			delta = _bt_checksplitloc(&state, offnum, false,
+									  olddataitemstoleft, itemsz);
 		}
 
-		/* Abort scan once we find a good-enough choice */
-		if (state.have_split && state.best_delta <= goodenough)
-		{
+		/*
+		 * Abort the default mode scan once we've found a good-enough choice
+		 * and have reached the point where we stop finding new good-enough
+		 * choices.
+		 */
+		if (state.nsplits > 0 && state.splits[0].delta <= goodenough)
 			goodenoughfound = true;
+
+		if (mode == SPLIT_DEFAULT && goodenoughfound && delta > goodenough)
+			break;
+
+		/*
+		 * Single value mode does not expect to be able to truncate; might as
+		 * well give up quickly once a good enough split point is found.
+		 */
+		if (mode == SPLIT_SINGLE_VALUE && goodenoughfound)
 			break;
-		}
 
 		olddataitemstoleft += itemsz;
 	}
@@ -1692,12 +1806,80 @@ _bt_findsplitloc(Relation rel,
 	 * I believe it is not possible to fail to find a feasible split, but just
 	 * in case ...
 	 */
-	if (!state.have_split)
+	if (state.nsplits == 0)
 		elog(ERROR, "could not find a feasible split point for index \"%s\"",
 			 RelationGetRelationName(rel));
 
-	*newitemonleft = state.newitemonleft;
-	return state.firstright;
+	/*
+	 * Search among acceptable split points for the entry whose enclosing
+	 * pair of tuples has the earliest differing attribute.  The general idea
+	 * is to maximize the effectiveness of suffix truncation without
+	 * affecting the balance of space on each side of the split very much.
+	 *
+	 * First find the lowest possible first differing attribute among the
+	 * array of acceptable split points -- the "perfect" firstdiff.  This
+	 * allows us to return early without wasting cycles on calculating the
+	 * first differing attribute for all candidate splits when that clearly
+	 * cannot improve our choice.  This optimization is important for several
+	 * common cases, including insertion into a primary key index on an
+	 * auto-incremented or monotonically increasing integer column.
+	 *
+	 * This is also the point at which we decide to either finish splitting
+	 * the page using the default strategy, or, alternatively, to do a second
+	 * pass over the page using a different strategy.  The second pass may be
+	 * in many duplicates mode, or in single value mode.
+	 */
+	perfectfirstdiff = 0;
+	secondmode = SPLIT_DEFAULT;
+	if (state.is_leaf && mode == SPLIT_DEFAULT)
+		perfectfirstdiff = _bt_perfect_firstdiff(rel, page, newitemoff, newitem,
+												 state.nsplits, state.splits,
+												 &secondmode);
+
+	/* newitemonleft output parameter is set recursively */
+	if (secondmode != SPLIT_DEFAULT)
+		return _bt_findsplitloc(rel, page, secondmode, newitemoff, newitemsz,
+								newitem, newitemonleft);
+
+	/*
+	 * Now actually search among acceptable split points for the entry that
+	 * allows suffix truncation to truncate away the maximum possible number
+	 * of attributes.
+	 */
+	bestfirstdiff = INT_MAX;
+	lowsplit = 0;
+	for (int i = 0; i < state.nsplits; i++)
+	{
+		int			firstdiff;
+
+		/* Don't waste cycles */
+		if (perfectfirstdiff == INT_MAX || state.nsplits == 1)
+			break;
+
+		firstdiff = _bt_split_firstdiff(rel, page, newitemoff, newitem,
+										state.splits + i);
+
+		if (firstdiff <= perfectfirstdiff)
+		{
+			bestfirstdiff = firstdiff;
+			lowsplit = i;
+			break;
+		}
+
+		if (firstdiff < bestfirstdiff)
+		{
+			bestfirstdiff = firstdiff;
+			lowsplit = i;
+		}
+	}
+
+	*newitemonleft = state.splits[lowsplit].newitemonleft;
+	finalfirstright = state.splits[lowsplit].firstright;
+	/* Be tidy */
+	if (state.splits != splits)
+		pfree(state.splits);
+
+	return finalfirstright;
 }
 
 /*
@@ -1712,8 +1894,11 @@ _bt_findsplitloc(Relation rel,
  *
  * olddataitemstoleft is the total size of all old items to the left of
  * firstoldonright.
+ *
+ * Returns delta between space that will be left free on left and right side
+ * of split.
  */
-static void
+static int
 _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright,
 				  bool newitemonleft,
@@ -1745,8 +1930,13 @@ _bt_checksplitloc(FindSplitData *state,
 	 * index has included attributes, then those attributes of left page high
 	 * key will be truncated leaving that page with slightly more free space.
 	 * However, that shouldn't affect our ability to find valid split
-	 * location, because anyway split location should exists even without high
-	 * key truncation.
+	 * location, since we err in the direction of being pessimistic about free
+	 * space on the left half.
+	 *
+	 * Note that we've already conservatively subtracted away overhead
+	 * required for the left/new high key to have an explicit heap TID, on the
+	 * assumption that that cannot be avoided by suffix truncation.  (Leaf
+	 * pages only.)
 	 */
 	leftfree -= firstrightitemsz;
 
@@ -1765,17 +1955,20 @@ _bt_checksplitloc(FindSplitData *state,
 			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
 
 	/*
-	 * If feasible split point, remember best delta.
+	 * If this is a feasible split point with a lower delta than that of the
+	 * most marginal split point so far, or we haven't yet run out of space
+	 * for split points, remember it.
 	 */
 	if (leftfree >= 0 && rightfree >= 0)
 	{
 		int			delta;
 
-		if (state->is_rightmost)
+		if (state->is_weighted)
 		{
 			/*
-			 * If splitting a rightmost page, try to put (100-fillfactor)% of
-			 * free space on left page. See comments for _bt_findsplitloc.
+			 * If splitting a rightmost page, or in single value mode, try to
+			 * put (100-fillfactor)% of free space on left page. See comments
+			 * for _bt_findsplitloc.
 			 */
 			delta = (state->fillfactor * leftfree)
 				- ((100 - state->fillfactor) * rightfree);
@@ -1788,14 +1981,288 @@ _bt_checksplitloc(FindSplitData *state,
 
 		if (delta < 0)
 			delta = -delta;
-		if (!state->have_split || delta < state->best_delta)
+		if (state->nsplits < state->maxsplit ||
+			delta < state->splits[state->nsplits - 1].delta)
 		{
-			state->have_split = true;
-			state->newitemonleft = newitemonleft;
-			state->firstright = firstoldonright;
-			state->best_delta = delta;
+			SplitPoint	newsplit;
+			int			j;
+
+			newsplit.delta = delta;
+			newsplit.newitemonleft = newitemonleft;
+			newsplit.firstright = firstoldonright;
+
+			/*
+			 * Make space at the end of the state array for new candidate
+			 * split point if we haven't already reached the maximum number of
+			 * split points.
+			 */
+			if (state->nsplits < state->maxsplit)
+				state->nsplits++;
+
+			/*
+			 * Insert the new candidate into the delta-sorted array.  The
+			 * final slot that gets overwritten is either a garbage,
+			 * still-uninitialized entry, or the most marginal real entry
+			 * when we already have as many split points as we're willing to
+			 * consider.
+			 */
+			for (j = state->nsplits - 1;
+				 j > 0 && state->splits[j - 1].delta > newsplit.delta;
+				 j--)
+			{
+				state->splits[j] = state->splits[j - 1];
+			}
+			state->splits[j] = newsplit;
+		}
+
+		return delta;
+	}
+
+	return INT_MAX;
+}
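
For readers following the bookkeeping above: each legal split gets a (possibly fillfactor-weighted) free-space delta, and only the maxsplit lowest-delta candidates are kept, sorted by delta.  A standalone sketch of that bookkeeping, with simplified stand-ins for the patch's FindSplitData/SplitPoint structs:

#include <stdbool.h>
#include <stdlib.h>

/* Standalone illustration only -- simplified stand-ins for the patch's types */
typedef struct CandidateSplit
{
	int			delta;			/* imbalance in free space */
	int			firstright;		/* first offset that goes to the right page */
} CandidateSplit;

typedef struct CandidateState
{
	int			nsplits;		/* candidates recorded so far */
	int			maxsplit;		/* cap on recorded candidates */
	CandidateSplit *splits;		/* kept sorted by ascending delta */
} CandidateState;

/*
 * Weighted delta: aim to leave (100 - fillfactor)% of the free space on the
 * left page.  Unweighted delta: aim for an even left/right balance.
 */
static int
split_delta(int leftfree, int rightfree, int fillfactor, bool weighted)
{
	int			delta;

	if (weighted)
		delta = (fillfactor * leftfree) - ((100 - fillfactor) * rightfree);
	else
		delta = leftfree - rightfree;

	return abs(delta);
}

/* Record a candidate split, keeping only the maxsplit best (lowest delta) */
static void
record_candidate(CandidateState *state, int delta, int firstright)
{
	int			j;

	if (state->nsplits == state->maxsplit &&
		delta >= state->splits[state->nsplits - 1].delta)
		return;					/* worse than everything already kept */

	if (state->nsplits < state->maxsplit)
		state->nsplits++;

	/* shift worse entries right, then drop the new candidate into place */
	for (j = state->nsplits - 1;
		 j > 0 && state->splits[j - 1].delta > delta;
		 j--)
		state->splits[j] = state->splits[j - 1];

	state->splits[j].delta = delta;
	state->splits[j].firstright = firstright;
}

Keeping the array small and sorted means the later firstdiff pass only ever examines split points that were already acceptable on space-utilization grounds.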
+
+/*
+ * Subroutine to find the earliest possible attribute that differs for any
+ * entry within the array of acceptable candidate split points.
+ *
+ * This may be earlier than any real firstdiff for any of the candidate split
+ * points, in which case the optimization is ineffective.
+ */
+static int
+_bt_perfect_firstdiff(Relation rel, Page page, OffsetNumber newitemoff,
+					  IndexTuple newitem, int nsplits, SplitPoint *splits,
+					  SplitMode *secondmode)
+{
+	ItemId		itemid;
+	OffsetNumber center;
+	IndexTuple	leftmost,
+				rightmost;
+	int			perfectfirstdiff;
+	bool		identical;
+
+	/* Assume that a second pass over page won't be required for now */
+	*secondmode = SPLIT_DEFAULT;
+
+	/*
+	 * Iterate from the end of split array to the start, in search of the
+	 * firstright-wise leftmost and rightmost entries among acceptable split
+	 * points.  The split point with the lowest delta is at the start of the
+	 * array.  It is deemed to be the split point whose firstright offset is
+	 * at the center.  Split points with firstright offsets at both the left
+	 * and right extremes among acceptable split points will be found at the
+	 * end of caller's array.
+	 */
+	leftmost = NULL;
+	rightmost = NULL;
+	center = splits[0].firstright;
+
+	/*
+	 * Split points can be thought of as points _between_ tuples on the
+	 * original unsplit page image, at least if you pretend that the incoming
+	 * tuple is already on the page to be split (imagine that the original
+	 * unsplit page actually had enough space to fit the incoming tuple).  The
+	 * rightmost tuple is the tuple that is immediately to the right of a
+	 * split point that is itself rightmost.  Likewise, the leftmost tuple is
+	 * the tuple to the left of the leftmost split point.  This is slightly
+	 * arbitrary.
+	 *
+	 * When there are very few candidates, no sensible comparison can be made
+	 * here, resulting in the caller selecting the lowest-delta (center) split
+	 * point by default.  No great care is taken around boundary cases where the
+	 * center split point has the same firstright offset as either the
+	 * leftmost or rightmost split points (i.e. only newitemonleft differs).
+	 * We expect to find leftmost and rightmost tuples almost immediately.
+	 */
+	perfectfirstdiff = INT_MAX;
+	identical = false;
+	for (int j = nsplits - 1; j > 1; j--)
+	{
+		SplitPoint *split = splits + j;
+
+		if (!leftmost && split->firstright < center)
+		{
+			if (split->newitemonleft && newitemoff == split->firstright)
+				leftmost = newitem;
+			else
+			{
+				itemid = PageGetItemId(page,
+									   OffsetNumberPrev(split->firstright));
+				leftmost = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+
+		if (!rightmost && split->firstright > center)
+		{
+			if (!split->newitemonleft && newitemoff == split->firstright)
+				rightmost = newitem;
+			else
+			{
+				itemid = PageGetItemId(page, split->firstright);
+				rightmost = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+
+		if (leftmost && rightmost)
+		{
+			Assert(leftmost != rightmost);
+			perfectfirstdiff = _bt_tuple_firstdiff(rel, leftmost, rightmost,
+												   &identical);
+			break;
 		}
 	}
+
+	/* Work out which type of second pass will be performed, if any */
+	if (identical)
+	{
+		BTPageOpaque opaque;
+		OffsetNumber maxoff;
+
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		if (P_FIRSTDATAKEY(opaque) == newitemoff)
+			leftmost = newitem;
+		else
+		{
+			itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+			leftmost = (IndexTuple) PageGetItem(page, itemid);
+		}
+
+		if (newitemoff > maxoff)
+			rightmost = newitem;
+		else
+		{
+			itemid = PageGetItemId(page, maxoff);
+			rightmost = (IndexTuple) PageGetItem(page, itemid);
+		}
+
+		Assert(leftmost != rightmost);
+		(void) _bt_tuple_firstdiff(rel, leftmost, rightmost, &identical);
+
+		/*
+		 * If the page has many duplicates but is not entirely full of
+		 * duplicates, a many duplicates mode pass will be performed.  If the
+		 * page is entirely full of duplicates, a single value mode pass will
+		 * be performed.
+		 *
+		 * The caller should avoid a single value mode pass when the incoming
+		 * tuple doesn't sort lowest among items on the page, though.
+		 * Instead, we instruct the caller to continue with the original
+		 * default mode split, since an out-of-order new item suggests that
+		 * newer tuples have come from (non-HOT) updates, not inserts.  Evenly
+		 * sharing space among each half of the split avoids pathological
+		 * performance.
+		 */
+		if (identical)
+		{
+			if (P_FIRSTDATAKEY(opaque) == newitemoff)
+				*secondmode = SPLIT_SINGLE_VALUE;
+			else
+			{
+				perfectfirstdiff = INT_MAX;
+				*secondmode = SPLIT_DEFAULT;
+			}
+		}
+		else
+			*secondmode = SPLIT_MANY_DUPLICATES;
+	}
+
+	return perfectfirstdiff;
+}
+
+/*
+ * Subroutine to find first attribute that differs among the two tuples that
+ * enclose caller's candidate split point.
+ */
+static int
+_bt_split_firstdiff(Relation rel, Page page, OffsetNumber newitemoff,
+					IndexTuple newitem, SplitPoint *split)
+{
+	ItemId		itemid;
+	IndexTuple	lastleft;
+	IndexTuple	firstright;
+
+	if (split->newitemonleft && newitemoff == split->firstright)
+		lastleft = newitem;
+	else
+	{
+		itemid = PageGetItemId(page, OffsetNumberPrev(split->firstright));
+		lastleft = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	if (!split->newitemonleft && newitemoff == split->firstright)
+		firstright = newitem;
+	else
+	{
+		itemid = PageGetItemId(page, split->firstright);
+		firstright = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	Assert(lastleft != firstright);
+	return _bt_tuple_firstdiff(rel, lastleft, firstright, NULL);
+}
+
+/*
+ * Subroutine to find the first attribute that differs between two tuples,
+ * typically two tuples that enclose a candidate split point.  The caller may
+ * also be interested in whether or not the tuples are completely identical,
+ * in which case a non-NULL "identical" parameter is passed.
+ *
+ * A naive bitwise approach to datum comparisons is used to save cycles.  This
+ * is inherently approximate, but works just as well as real scan key
+ * comparisons in most cases, since the vast majority of types in Postgres
+ * cannot be equal unless they're bitwise equal.
+ *
+ * Testing has shown that an approach involving treating the tuple as a
+ * decomposed binary string would work almost as well as our current approach.
+ * It would also be faster.  It might actually be necessary to go that way in
+ * the future, if suffix truncation is made sophisticated enough to truncate
+ * at a finer granularity (i.e. truncate within an attribute, rather than just
+ * truncating away whole attributes).  The current approach isn't markedly
+ * slower, since it works particularly well with the "perfectfirstdiff"
+ * optimization (there are fewer, more expensive calls here).  It also works
+ * with INCLUDE indexes (indexes with non-key attributes) without any special
+ * effort.
+ */
+static int
+_bt_tuple_firstdiff(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *identical)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			result;
+
+	result = 0;
+	for (int attnum = 1; attnum <= keysz; attnum++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		att = TupleDescAttr(itupdesc, attnum - 1);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			break;
+
+		result++;
+	}
+
+	/* Report if left and right tuples are identical when requested */
+	if (identical)
+	{
+		if (result >= keysz)
+			*identical = true;
+		else
+			*identical = false;
+	}
+
+	return result;
 }
 
 /*
@@ -2199,7 +2666,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2311,8 +2779,8 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
@@ -2326,12 +2794,6 @@ _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
 	for (i = 1; i <= keysz; i++)
 	{
 		AttrNumber	attno;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 4082103fe2..f63615341c 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1421,10 +1421,12 @@ _bt_pagedel(Relation rel, Buffer buf)
 
 				/* we need an insertion scan key for the search, so build one */
 				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
+				/* get stack to leaf page by searching index */
 				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
+								   BTreeTupleGetNAtts(targetkey, rel),
+								   itup_scankey,
+								   BTreeTupleGetHeapTID(targetkey), false,
+								   &lbuf, BT_READ, NULL);
 				/* don't need a pin on the page */
 				_bt_relbuf(rel, lbuf);
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d3700bd082..c229b7eed2 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -69,11 +69,13 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
 
 
 /*
- *	_bt_search() -- Search the tree for a particular scankey,
+ *	_bt_search() -- Search the tree for a particular scankey + scantid,
  *		or more precisely for the first leaf page it could be on.
  *
  * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * but it can omit the rightmost column(s) of the index.  The scantid
+ * argument may also be omitted (caller passes NULL), since it's logically
+ * the "real" rightmost attribute.
  *
  * When nextkey is false (the usual case), we are looking for the first
  * item >= scankey.  When nextkey is true, we are looking for the first
@@ -94,8 +96,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, int keysz, ScanKey scankey, ItemPointer scantid,
+		   bool nextkey, Buffer *bufP, int access, Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -131,7 +133,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, scantid, nextkey,
 							  (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
@@ -145,7 +147,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, scantid,
+							 P_FIRSTDATAKEY(opaque), nextkey);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -158,8 +161,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link to disambiguate duplicate keys in the index, which is
+		 * faster than comparing the keys themselves.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -199,7 +202,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, scantid, nextkey,
 							  true, stack_in, BT_WRITE, snapshot);
 	}
 
@@ -245,6 +248,7 @@ _bt_moveright(Relation rel,
 			  Buffer buf,
 			  int keysz,
 			  ScanKey scankey,
+			  ItemPointer scantid,
 			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
@@ -305,7 +309,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, scantid, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -337,6 +341,12 @@ _bt_moveright(Relation rel,
  * particular, this means it is possible to return a value 1 greater than the
  * number of keys on the page, if the scankey is > all keys on the page.)
  *
+ * Caller passes its own low value for the binary search.  This can be used
+ * to resume a partial binary search without repeating work (_bt_check_unique
+ * callers rely on this).  It only works when a buffer lock is held
+ * throughout, the same leaf page is passed both times, and nextkey is false.
+ *
  * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
  * of the last key < given scankey, or last key <= given scankey if nextkey
  * is true.  (Since _bt_compare treats the first data key of such a page as
@@ -354,19 +364,19 @@ _bt_binsrch(Relation rel,
 			Buffer buf,
 			int keysz,
 			ScanKey scankey,
+			ItemPointer scantid,
+			OffsetNumber low,
 			bool nextkey)
 {
 	Page		page;
 	BTPageOpaque opaque;
-	OffsetNumber low,
-				high;
+	OffsetNumber high;
 	int32		result,
 				cmpval;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	low = P_FIRSTDATAKEY(opaque);
 	high = PageGetMaxOffsetNumber(page);
 
 	/*
@@ -401,7 +411,7 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, keysz, scankey, scantid, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -431,6 +441,50 @@ _bt_binsrch(Relation rel,
 /*----------
  *	_bt_compare() -- Compare scankey to a particular tuple on the page.
  *
+ * Convenience wrapper for _bt_tuple_compare() callers that want to compare
+ * an offset on a particular page.
+ *
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey.  The actual key value stored (if any, which there probably isn't)
+ * does not matter.  This convention allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first key.
+ * See backend/access/nbtree/README for details.
+ *----------
+ */
+int32
+_bt_compare(Relation rel,
+			int keysz,
+			ScanKey scankey,
+			ItemPointer scantid,
+			Page page,
+			OffsetNumber offnum)
+{
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	IndexTuple	itup;
+
+	Assert(_bt_check_natts(rel, page, offnum));
+
+	/*
+	 * Force result ">" if target item is first data item on an internal page
+	 * --- see NOTE above.
+	 *
+	 * A minus infinity key has all attributes truncated away, so this test is
+	 * redundant with the minus infinity attribute tie-breaker.  However, the
+	 * number of attributes in minus infinity tuples was not explicitly
+	 * represented as 0 until PostgreSQL v11, so an explicit offnum test is
+	 * still required.
+	 */
+	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+		return 1;
+
+	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	return _bt_tuple_compare(rel, keysz, scankey, scantid, itup);
+}
+
+/*----------
+ *	_bt_tuple_compare() -- Compare scankey to a particular tuple.
+ *
  * The passed scankey must be an insertion-type scankey (see nbtree/README),
  * but it can omit the rightmost column(s) of the index.
  *
@@ -445,37 +499,23 @@ _bt_binsrch(Relation rel,
  *		NULLs in the keys are treated as sortable values.  Therefore
  *		"equality" does not necessarily mean that the item should be
  *		returned to the caller as a matching key!
- *
- * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
- * "minus infinity": this routine will always claim it is less than the
- * scankey.  The actual key value stored (if any, which there probably isn't)
- * does not matter.  This convention allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first key.
- * See backend/access/nbtree/README for details.
  *----------
  */
 int32
-_bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
-			Page page,
-			OffsetNumber offnum)
+_bt_tuple_compare(Relation rel,
+				  int keysz,
+				  ScanKey scankey,
+				  ItemPointer scantid,
+				  IndexTuple itup)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	IndexTuple	itup;
+	ItemPointer heapTid;
+	int			ntupatts;
+	int			ncmpkey;
 	int			i;
 
-	Assert(_bt_check_natts(rel, page, offnum));
-
-	/*
-	 * Force result ">" if target item is first data item on an internal page
-	 * --- see NOTE above.
-	 */
-	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
-		return 1;
-
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	Assert(keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -489,7 +529,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	ncmpkey = Min(ntupatts, keysz);
+	for (i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -540,8 +581,31 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * Use the number of attributes as a tie-breaker, in order to treat
+	 * truncated attributes in the index as minus infinity.
+	 */
+	if (keysz > ntupatts)
+		return 1;
+
+	/* If caller provided no heap TID tie-breaker for scan, they're equal */
+	if (!scantid)
+		return 0;
+
+	/*
+	 * Although it isn't counted as an attribute by BTreeTupleGetNAtts(), heap
+	 * TID is an implicit final key attribute that ensures that all index
+	 * tuples have a distinct set of key attribute values.
+	 *
+	 * This is often truncated away in pivot tuples, which makes the attribute
+	 * value implicitly negative infinity.
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (!heapTid)
+		return 1;
+
+	/* Deliberately invert the order, since TIDs "sort DESC" */
+	return ItemPointerCompare(heapTid, scantid);
 }
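
To make the inverted comparison concrete: because ItemPointerCompare() is called with the tuple's TID on the left and the scan TID on the right, a tuple with a larger heap TID compares as "scan key > tuple" and therefore sorts earlier within a group of equal keys, and a sentinel scan TID built from InvalidBlockNumber (all ones) -- as _bt_first does further down in the patch -- compares below every real heap TID.  A standalone sketch with a hand-rolled TID type (not the real ItemPointerData):

#include <stdint.h>
#include <stdio.h>

/* Standalone illustration only -- a hand-rolled stand-in for ItemPointerData */
typedef struct TupleId
{
	uint32_t	block;			/* heap block number */
	uint16_t	offset;			/* heap line pointer offset */
} TupleId;

/* Ascending comparison, like ItemPointerCompare(): block first, then offset */
static int
tid_compare(TupleId a, TupleId b)
{
	if (a.block != b.block)
		return (a.block < b.block) ? -1 : 1;
	if (a.offset != b.offset)
		return (a.offset < b.offset) ? -1 : 1;
	return 0;
}

/*
 * Tie-break the way the patch does: compare the tuple's TID (left) against
 * the scan TID (right), so heap TID acts as a descending attribute.
 */
static int
heap_tid_tiebreak(TupleId tuple_tid, TupleId scan_tid)
{
	return tid_compare(tuple_tid, scan_tid);
}

int
main(void)
{
	TupleId		older = {.block = 10, .offset = 3};
	TupleId		newer = {.block = 42, .offset = 1};

	/* largest possible TID in ascending terms, like the _bt_first sentinel */
	TupleId		sentinel = {.block = 0xFFFFFFFF, .offset = 0};

	/* -1: the newer (larger) TID sorts before the older one, i.e. DESC */
	printf("%d\n", heap_tid_tiebreak(older, newer));

	/* both -1: the sentinel compares below every real heap TID */
	printf("%d %d\n",
		   heap_tid_tiebreak(older, sentinel),
		   heap_tid_tiebreak(newer, sentinel));
	return 0;
}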
 
 /*
@@ -570,6 +634,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	Relation	rel = scan->indexRelation;
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	Buffer		buf;
+	BTPageOpaque opaque;
 	BTStack		stack;
 	OffsetNumber offnum;
 	StrategyNumber strat;
@@ -577,6 +642,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	bool		goback;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
 	ScanKeyData scankeys[INDEX_MAX_KEYS];
+	ItemPointer scantid;
+	ItemPointerData minscantid;
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -826,6 +893,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * scankeys[] array, using the keys identified by startKeys[].
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
+	scantid = NULL;
 	for (i = 0; i < keysCount; i++)
 	{
 		ScanKey		cur = startKeys[i];
@@ -962,6 +1030,34 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		}
 	}
 
+	/*
+	 * When all key attributes will be in the insertion scankey, manufacture a
+	 * sentinel scan tid that's less than any possible heap TID in the index.
+	 * This is still greater than minus infinity to _bt_compare, allowing
+	 * _bt_search to follow a downlink with scankey-equal attributes, but a
+	 * truncated-away heap TID.
+	 *
+	 * If we didn't do this then affected index scans would have to
+	 * unnecessarily visit an extra page before moving right to the page they
+	 * should have landed on from the parent in the first place.  When
+	 * choosing a leaf page split point/new downlink, significant effort goes
+	 * towards avoiding a choice that necessitates appending a heap TID, so
+	 * this is likely to pay off.  See _bt_findsplitloc comments on "false
+	 * sharing".
+	 *
+	 * (Note that hard-coding this behavior into _bt_compare would be
+	 * unworkable, since some _bt_search callers need to re-find a leaf page
+	 * using the page's high key.)
+	 */
+	if (keysCount >= IndexRelationGetNumberOfKeyAttributes(rel))
+	{
+		scantid = &minscantid;
+
+		/* Heap TID attribute uses DESC ordering */
+		ItemPointerSetBlockNumber(scantid, InvalidBlockNumber);
+		ItemPointerSetOffsetNumber(scantid, InvalidOffsetNumber);
+	}
+
 	/*----------
 	 * Examine the selected initial-positioning strategy to determine exactly
 	 * where we need to start the scan, and set flag variables to control the
@@ -1054,11 +1150,11 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 
 	/*
-	 * Use the manufactured insertion scan key to descend the tree and
-	 * position ourselves on the target leaf page.
+	 * Use the manufactured insertion scan key (and possibly a scantid) to
+	 * descend the tree and position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, keysCount, scankeys, scantid, nextkey, &buf,
+					   BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
@@ -1087,7 +1183,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(BufferGetPage(buf));
+	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, scantid,
+						 P_FIRSTDATAKEY(opaque), nextkey);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..e8f506cc09 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -796,8 +796,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -880,19 +878,30 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
+			IndexTuple	lastleft;
 			IndexTuple	truncated;
 			Size		truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks
 			 * in internal pages are either negative infinity items, or get
 			 * their contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it more
+			 * likely that _bt_suffix_truncate() can truncate away more
+			 * attributes, whereas the split point passed to _bt_split() is
+			 * chosen much more delicately.  Suffix truncation is mostly
+			 * useful because it can greatly improve space utilization for
+			 * workloads with random insertions, or insertions of
+			 * monotonically increasing values at "local" points in the key
+			 * space.  It doesn't seem worthwhile to add complex logic for
+			 * choosing a split point here for a benefit that is bound to be
+			 * much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
@@ -905,7 +914,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_suffix_truncate(wstate->index, lastleft, oitup);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -924,8 +936,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -970,7 +983,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1029,8 +1042,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1127,6 +1141,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1134,7 +1150,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1151,6 +1166,21 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all
+				 * keys in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+					/* Deliberately invert the order, since TIDs "sort DESC" */
+					compare = ItemPointerCompare(&itup2->t_tid, &itup->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
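
The same inverted tie-break governs the merge of the two sorted spools above.  A standalone sketch of a comparator embodying the full ordering (key ascending, heap TID descending), with invented struct and field names; this is an illustration, not the tuplesort comparator the patch actually modifies:

#include <stdlib.h>

/* Standalone illustration only */
typedef struct ToyIndexEntry
{
	int			key;			/* stands in for the indexed attribute(s) */
	unsigned	block;			/* heap TID block number */
	unsigned	offset;			/* heap TID offset number */
} ToyIndexEntry;

/* Sort by key ascending; break ties by heap TID, treated as descending */
static int
toy_entry_cmp(const void *p1, const void *p2)
{
	const ToyIndexEntry *a = (const ToyIndexEntry *) p1;
	const ToyIndexEntry *b = (const ToyIndexEntry *) p2;

	if (a->key != b->key)
		return (a->key < b->key) ? -1 : 1;

	/* arguments deliberately swapped: larger TIDs sort earlier */
	if (a->block != b->block)
		return (b->block < a->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (b->offset < a->offset) ? -1 : 1;
	return 0;
}

/* Usage: qsort(entries, nentries, sizeof(ToyIndexEntry), toy_entry_cmp); */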
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 4528e87c83..f9f3ec7914 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,6 +49,8 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_leave_natts(Relation rel, IndexTuple lastleft,
+				IndexTuple firstright);
 
 
 /*
@@ -56,27 +58,34 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().
+ *		The result is intended for use with _bt_compare().  If itup has
+ *		undergone suffix truncation of key attributes, caller had better
+ *		pass BTreeTupleGetNAtts(itup, rel) as keysz to routines like
+ *		_bt_search() and _bt_compare() when using the returned scan key.  This
+ *		allows truncated attributes to participate in comparisons (truncated
+ *		attributes have implicit negative infinity values).  Note that
+ *		_bt_compare() never treats a scan key as containing negative
+ *		infinity attributes.
  */
 ScanKey
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
 	ScanKey		skey;
 	TupleDesc	itupdesc;
+	int			tupnatts;
 	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts > 0);
+	Assert(tupnatts <= indnatts);
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
@@ -96,7 +105,21 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Truncated key attributes may not be represented in index tuple due
+		 * to suffix truncation.  Keys built from truncated attributes are
+		 * defensively represented as NULL values, though they should still
+		 * not be allowed to participate in comparisons (caller must be sure
+		 * to pass a sane keysz to _bt_compare()).
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -2083,38 +2106,197 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_suffix_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  This is possible when there are
+ * attributes that follow an attribute in firstright that is not equal to the
+ * corresponding attribute in lastleft (equal according to an insertion scan
+ * key).
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that takes up more
+ * space than firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * Note that returned tuple's t_tid offset will hold the number of
+ * attributes present, so the original item pointer offset is not
+ * represented.  Caller should only change truncated tuple's downlink.  Note
+ * also that truncated key attributes are treated as containing "minus
+ * infinity" values by _bt_compare()/_bt_tuple_compare().
+ *
+ * Returned tuple is guaranteed to be no larger than the original plus some
+ * extra space for a possible extra heap TID tie-breaker attribute.  This
+ * guarantee is important for staying under the 1/3 of a page restriction on
+ * tuple size.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_suffix_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int16		natts = IndexRelationGetNumberOfAttributes(rel);
+	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			leavenatts;
+	IndexTuple	pivot;
+	ItemPointer pivotheaptid;
+	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples, which must have non-key
+	 * attributes in the case of INCLUDE indexes.  It's never okay to truncate
+	 * a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be left behind */
+	leavenatts = _bt_leave_natts(rel, lastleft, firstright);
 
-	return truncated;
+	if (leavenatts <= natts)
+	{
+		IndexTuple	tidpivot;
+
+		/*
+		 * Truncate away non-key attributes and/or key attributes.  Do a
+		 * straight copy in the case where the only attribute to be "truncated
+		 * away" is the implicit heap TID key attribute (i.e. the case where
+		 * we can at least avoid adding an explicit heap TID attribute to new
+		 * pivot).  We should only call index_truncate_tuple() when non-TID
+		 * attributes need to be truncated.
+		 */
+		if (leavenatts < natts)
+			pivot = index_truncate_tuple(itupdesc, firstright, leavenatts);
+		else
+			pivot = CopyIndexTuple(firstright);
+
+		/*
+		 * If there is a distinguishing key attribute within leavenatts, there
+		 * is no need to add an explicit heap TID attribute to new pivot.
+		 */
+		if (leavenatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, leavenatts);
+			return pivot;
+		}
+
+		/*
+		 * Only non-key attributes could be truncated away from an INCLUDE
+		 * index's pivot tuple.  They are not considered part of the key
+		 * space, so it's still necessary to add a heap TID attribute to the
+		 * new pivot tuple.  Create enlarged copy of our truncated right tuple
+		 * copy, to fit heap TID.
+		 */
+		Assert(natts > nkeyatts);
+		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since attributes are all equal.  It's
+		 * necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * Create enlarged copy of first right tuple to fit heap TID.  We must use
+	 * heap TID as a unique-ifier in new pivot tuple, since no non-TID
+	 * attribute distinguishes which values belong on each side of the split
+	 * point.
+	 */
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/*
+	 * Generate a heap TID value to go in enlarged (not truncated) pivot
+	 * tuple.  Simply use the last left heap TID as new pivot's heap TID
+	 * value.  This code path is mostly used by cases where the page to be
+	 * split only contains duplicates, since the logic for picking a split
+	 * point tries very hard to avoid that, using all means available to it.
+	 * "Single value" mode was likely to have been used to pick this split
+	 * point.
+	 *
+	 * We could easily manufacture a "median TID" value to use in the new
+	 * pivot, since optimizations like that often help fan-out when applied to
+	 * distinguishing/trailing non-TID attributes (adding opclass
+	 * infrastructure that gets called here to truncate non-TID attributes is
+	 * a possible future enhancement).  Using the last left heap TID actually
+	 * results in slightly better space utilization, though, because of the
+	 * specific properties of heap TID attributes.  This strategy maximizes
+	 * the number of duplicate tuples that will end up on the mostly-empty
+	 * left side of the split, and minimizes the number that will end up on
+	 * the mostly-full right side.  (This assumes that the split point was
+	 * likely chosen using "single value" mode.)
+	 */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  MAXALIGN(sizeof(ItemPointerData)));
+	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+
+	/*
+	 * Lehman and Yao require that the downlink to the right page, which is to
+	 * be inserted into the parent page in the second phase of a page split be
+	 * a strict lower bound on all current and future items on the right page
+	 * (this will be copied from the new high key for the left side of the
+	 * split).
+	 */
+
+	/* Deliberately invert the order, since TIDs "sort DESC" */
+	Assert(ItemPointerCompare(&lastleft->t_tid, pivotheaptid) >= 0);
+	Assert(ItemPointerCompare(&firstright->t_tid, pivotheaptid) < 0);
+
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
+
+/*
+ * _bt_leave_natts - how many key attributes to leave when truncating.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			leavenatts;
+	ScanKey		skey;
+
+	skey = _bt_mkscankey(rel, firstright);
+
+	/*
+	 * Even test nkeyatts (no truncated non-TID attributes) case, since caller
+	 * cares about whether or not it can avoid appending a heap TID as a
+	 * unique-ifier
+	 */
+	leavenatts = 1;
+	for (;;)
+	{
+		if (leavenatts > nkeyatts)
+			break;
+		if (_bt_tuple_compare(rel, leavenatts, skey, NULL, lastleft) > 0)
+			break;
+		leavenatts++;
+	}
+
+	/* Can't leak memory here */
+	_bt_freeskey(skey);
+
+	return leavenatts;
 }
 
 /*
@@ -2137,6 +2319,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2156,6 +2339,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
@@ -2165,7 +2349,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2176,7 +2360,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			Assert(!P_RIGHTMOST(opaque));
 
 			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2209,7 +2393,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Tuple contains only key attributes despite on is it page high
 			 * key or not
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 
 	}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 67a94cb80a..7c061e96d2 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -284,6 +268,8 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		OffsetNumber off;
 		IndexTuple	newitem = NULL;
 		Size		newitemsz = 0;
+		IndexTuple	left_hikey = NULL;
+		Size		left_hikeysz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 5c4457179d..667c906b2e 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index dfbda5458f..ffeb0624fe 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -854,10 +854,8 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
 	 * PageIndexTupleDelete is the best way.  Delete the items in reverse
 	 * order so we don't have to think about adjusting item numbers for
 	 * previous deletions.
-	 *
-	 * TODO: tune the magic number here
 	 */
-	if (nitems <= 2)
+	if (nitems <= 7)
 	{
 		while (--nitems >= 0)
 			PageIndexTupleDelete(page, itemnos[nitems]);
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 9fb33b9035..5211cf5b98 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4057,23 +4057,26 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required for
+	 * btree indexes, since heap TID is treated as an implicit last key
+	 * attribute in order to ensure that all keys in the index are physically
+	 * unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
 		BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
 
+		/* Deliberately invert the order, since TIDs "sort DESC" */
 		if (blk1 != blk2)
-			return (blk1 < blk2) ? -1 : 1;
+			return (blk1 < blk2) ? 1 : -1;
 	}
 	{
 		OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
 		OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
 
+		/* Deliberately invert the order, since TIDs "sort DESC" */
 		if (pos1 != pos2)
-			return (pos1 < pos2) ? -1 : 1;
+			return (pos1 < pos2) ? 1 : -1;
 	}
 
 	return 0;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 04ecb4cbc0..12f57773e7 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -114,16 +114,26 @@ typedef struct BTMetaPageData
 
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
-#define BTREE_MIN_VERSION	2	/* minimal supported version number */
+#define BTREE_VERSION	4		/* current version number */
+#define BTREE_MIN_VERSION	4	/* minimal supported version number */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_suffix_truncate() will need to
+ * enlarge a heap index tuple to make space for a tie-breaker heap
+ * TID attribute, which we account for here.
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*MAXALIGN(sizeof(ItemPointerData))) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeOld(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
@@ -133,11 +143,15 @@ typedef struct BTMetaPageData
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
  * a rightmost page; when splitting non-rightmost pages we try to
- * divide the data equally.
+ * divide the data equally.  When splitting a page that's entirely
+ * filled with a single value (duplicates), the leaf-page
+ * fillfactor is overridden, and is applied regardless of whether
+ * the page is a rightmost page.
  */
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#define BTREE_SINGLEVAL_FILLFACTOR	1
 
 /*
  *	In general, the btree code tries to localize its knowledge about
@@ -204,21 +218,23 @@ typedef struct BTMetaPageData
  * real offset (downlinks only need to store a block number).  The offset
  * field only stores the number of attributes when the INDEX_ALT_TID_MASK
  * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * number of attributes).  INDEX_ALT_TID_MASK is only used for pivot tuples
+ * at present, though it's possible that it will be used within non-pivot
+ * tuples in the future.  Do not assume that a tuple with INDEX_ALT_TID_MASK
+ * set must be a pivot tuple.  A pivot tuple must have INDEX_ALT_TID_MASK set
+ * as of BTREE_VERSION 4, however.
  *
  * The 12 least significant offset bits are used to represent the number of
- * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
- * for future use (BT_RESERVED_OFFSET_MASK bits). BT_N_KEYS_OFFSET_MASK should
- * be large enough to store any number <= INDEX_MAX_KEYS.
+ * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 status bits
+ * (BT_RESERVED_OFFSET_MASK bits): BT_HEAP_TID_ATTR, plus 3 bits that are
+ * reserved for future use.  BT_N_KEYS_OFFSET_MASK should be large enough to
+ * store any number <= INDEX_MAX_KEYS.
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+/* Reserved to indicate if heap TID is represented at end of tuple */
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +257,15 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +274,42 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
+ * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
+ * (since all pivot tuples must as of BTREE_VERSION 4).  When non-pivot
+ * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
+ * probably also contain a heap TID at the end of the tuple.  We avoid
+ * assuming that a tuple with INDEX_ALT_TID_MASK set is necessarily a pivot
+ * tuple.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   MAXALIGN(sizeof(ItemPointerData))) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -560,15 +609,18 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
  * prototypes for functions in nbtsearch.c
  */
 extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
+		   int keysz, ScanKey scankey, ItemPointer scantid, bool nextkey,
 		   Buffer *bufP, int access, Snapshot snapshot);
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
+			  ScanKey scankey, ItemPointer scantid, bool nextkey,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
 extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
+			ScanKey scankey, ItemPointer scantid, OffsetNumber low,
+			bool nextkey);
 extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+			ItemPointer scantid, Page page, OffsetNumber offnum);
+extern int32 _bt_tuple_compare(Relation rel, int keysz, ScanKey scankey,
+				  ItemPointer scantid, IndexTuple itup);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -601,7 +653,8 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
+extern IndexTuple _bt_suffix_truncate(Relation rel, IndexTuple lastleft,
+					IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
 
 /*
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 819373031c..5f3c4a015a 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,7 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50 and 0x60 are unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -82,20 +81,16 @@ typedef struct xl_btree_insert
  *
  * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
  * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always explicitly log the left page high key value.
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * In the _R variants, the new item is one of the right page's tuples.  An
+ * IndexTuple representing the HIKEY of the left page follows.  We don't need
+ * this on leaf pages, because it's the same as the leftmost key in the new
+ * right page.
  *
  * Backup Blk 1: new right page
  *
diff --git a/src/test/regress/expected/domain.out b/src/test/regress/expected/domain.out
index 0b5a9041b0..f4899f2a38 100644
--- a/src/test/regress/expected/domain.out
+++ b/src/test/regress/expected/domain.out
@@ -643,10 +643,10 @@ update domnotnull set col1 = null; -- fails
 ERROR:  domain dnotnulltest does not allow null values
 alter domain dnotnulltest drop not null;
 update domnotnull set col1 = null;
+\set VERBOSITY terse
 drop domain dnotnulltest cascade;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to column col1 of table domnotnull
-drop cascades to column col2 of table domnotnull
+\set VERBOSITY default
 -- Test ALTER DOMAIN .. DEFAULT ..
 create table domdeftest (col1 ddef1);
 insert into domdeftest default values;
diff --git a/src/test/regress/expected/foreign_key.out b/src/test/regress/expected/foreign_key.out
index fc3bbe4deb..1ec8264dfd 100644
--- a/src/test/regress/expected/foreign_key.out
+++ b/src/test/regress/expected/foreign_key.out
@@ -253,13 +253,13 @@ SELECT * FROM FKTABLE;
 (5 rows)
 
 -- this should fail for lack of CASCADE
+\set VERBOSITY terse
 DROP TABLE PKTABLE;
 ERROR:  cannot drop table pktable because other objects depend on it
-DETAIL:  constraint constrname2 on table fktable depends on table pktable
-HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 DROP TABLE PKTABLE CASCADE;
 NOTICE:  drop cascades to constraint constrname2 on table fktable
 DROP TABLE FKTABLE;
+\set VERBOSITY default
 --
 -- First test, check with no on delete or on update
 --
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index dc6262be43..2c20cea4b9 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5896,8 +5896,8 @@ inner join j1 j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
 where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1;
  id1 | id2 | id1 | id2 
 -----+-----+-----+-----
-   1 |   1 |   1 |   1
    1 |   2 |   1 |   2
+   1 |   1 |   1 |   1
 (2 rows)
 
 reset enable_nestloop;
diff --git a/src/test/regress/expected/truncate.out b/src/test/regress/expected/truncate.out
index 2e26510522..c8b9a71689 100644
--- a/src/test/regress/expected/truncate.out
+++ b/src/test/regress/expected/truncate.out
@@ -276,11 +276,10 @@ SELECT * FROM trunc_faa;
 (0 rows)
 
 ROLLBACK;
+\set VERBOSITY terse
 DROP TABLE trunc_f CASCADE;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table trunc_fa
-drop cascades to table trunc_faa
-drop cascades to table trunc_fb
+\set VERBOSITY default
 -- Test ON TRUNCATE triggers
 CREATE TABLE trunc_trigger_test (f1 int, f2 text, f3 text);
 CREATE TABLE trunc_trigger_log (tgop text, tglevel text, tgwhen text,
diff --git a/src/test/regress/expected/typed_table.out b/src/test/regress/expected/typed_table.out
index 2e47ecbcf5..c76efee358 100644
--- a/src/test/regress/expected/typed_table.out
+++ b/src/test/regress/expected/typed_table.out
@@ -75,19 +75,12 @@ CREATE TABLE persons4 OF person_type (
     name WITH OPTIONS DEFAULT ''  -- error, specified more than once
 );
 ERROR:  column "name" specified more than once
+\set VERBOSITY terse
 DROP TYPE person_type RESTRICT;
 ERROR:  cannot drop type person_type because other objects depend on it
-DETAIL:  table persons depends on type person_type
-function get_all_persons() depends on type person_type
-table persons2 depends on type person_type
-table persons3 depends on type person_type
-HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 DROP TYPE person_type CASCADE;
 NOTICE:  drop cascades to 4 other objects
-DETAIL:  drop cascades to table persons
-drop cascades to function get_all_persons()
-drop cascades to table persons2
-drop cascades to table persons3
+\set VERBOSITY default
 CREATE TABLE persons5 OF stuff; -- only CREATE TYPE AS types may be used
 ERROR:  type stuff is not a composite type
 DROP TABLE stuff;
diff --git a/src/test/regress/expected/updatable_views.out b/src/test/regress/expected/updatable_views.out
index e64d693e9c..1ea90181d8 100644
--- a/src/test/regress/expected/updatable_views.out
+++ b/src/test/regress/expected/updatable_views.out
@@ -328,24 +328,10 @@ UPDATE ro_view20 SET b=upper(b);
 ERROR:  cannot update view "ro_view20"
 DETAIL:  Views that return set-returning functions are not automatically updatable.
 HINT:  To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 16 other objects
-DETAIL:  drop cascades to view ro_view1
-drop cascades to view ro_view17
-drop cascades to view ro_view2
-drop cascades to view ro_view3
-drop cascades to view ro_view5
-drop cascades to view ro_view6
-drop cascades to view ro_view7
-drop cascades to view ro_view8
-drop cascades to view ro_view9
-drop cascades to view ro_view11
-drop cascades to view ro_view13
-drop cascades to view rw_view15
-drop cascades to view rw_view16
-drop cascades to view ro_view20
-drop cascades to view ro_view4
-drop cascades to view rw_view14
+\set VERBOSITY default
 DROP VIEW ro_view10, ro_view12, ro_view18;
 DROP SEQUENCE uv_seq CASCADE;
 NOTICE:  drop cascades to view ro_view19
diff --git a/src/test/regress/sql/domain.sql b/src/test/regress/sql/domain.sql
index 68da27de22..d19e2c9d28 100644
--- a/src/test/regress/sql/domain.sql
+++ b/src/test/regress/sql/domain.sql
@@ -381,7 +381,9 @@ alter domain dnotnulltest drop not null;
 
 update domnotnull set col1 = null;
 
+\set VERBOSITY terse
 drop domain dnotnulltest cascade;
+\set VERBOSITY default
 
 -- Test ALTER DOMAIN .. DEFAULT ..
 create table domdeftest (col1 ddef1);
diff --git a/src/test/regress/sql/foreign_key.sql b/src/test/regress/sql/foreign_key.sql
index d2cecdf4eb..2c26191980 100644
--- a/src/test/regress/sql/foreign_key.sql
+++ b/src/test/regress/sql/foreign_key.sql
@@ -159,9 +159,11 @@ UPDATE PKTABLE SET ptest1=1 WHERE ptest1=2;
 SELECT * FROM FKTABLE;
 
 -- this should fail for lack of CASCADE
+\set VERBOSITY terse
 DROP TABLE PKTABLE;
 DROP TABLE PKTABLE CASCADE;
 DROP TABLE FKTABLE;
+\set VERBOSITY default
 
 
 --
diff --git a/src/test/regress/sql/truncate.sql b/src/test/regress/sql/truncate.sql
index 6ddfb6dd1d..fee7e76ec3 100644
--- a/src/test/regress/sql/truncate.sql
+++ b/src/test/regress/sql/truncate.sql
@@ -125,7 +125,9 @@ SELECT * FROM trunc_fa;
 SELECT * FROM trunc_faa;
 ROLLBACK;
 
+\set VERBOSITY terse
 DROP TABLE trunc_f CASCADE;
+\set VERBOSITY default
 
 -- Test ON TRUNCATE triggers
 
diff --git a/src/test/regress/sql/typed_table.sql b/src/test/regress/sql/typed_table.sql
index 9ef0cdfcc7..953cd1f14b 100644
--- a/src/test/regress/sql/typed_table.sql
+++ b/src/test/regress/sql/typed_table.sql
@@ -43,8 +43,10 @@ CREATE TABLE persons4 OF person_type (
     name WITH OPTIONS DEFAULT ''  -- error, specified more than once
 );
 
+\set VERBOSITY terse
 DROP TYPE person_type RESTRICT;
 DROP TYPE person_type CASCADE;
+\set VERBOSITY default
 
 CREATE TABLE persons5 OF stuff; -- only CREATE TYPE AS types may be used
 
diff --git a/src/test/regress/sql/updatable_views.sql b/src/test/regress/sql/updatable_views.sql
index dc6d5cbe35..6eaa81b540 100644
--- a/src/test/regress/sql/updatable_views.sql
+++ b/src/test/regress/sql/updatable_views.sql
@@ -98,7 +98,9 @@ DELETE FROM ro_view18;
 UPDATE ro_view19 SET last_value=1000;
 UPDATE ro_view20 SET b=upper(b);
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 DROP VIEW ro_view10, ro_view12, ro_view18;
 DROP SEQUENCE uv_seq CASCADE;
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9fe950b29d..f46a0e745d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2207,6 +2207,8 @@ SpecialJoinInfo
 SpinDelayStatus
 SplitInterval
 SplitLR
+SplitMode
+SplitPoint
 SplitVar
 SplitedPageLayout
 StackElem
-- 
2.17.1

#17 Andrey Lepikhov
a.lepikhov@postgrespro.ru
In reply to: Peter Geoghegan (#15)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

I am using the v5 version in the quick vacuum strategy and in the heap &
index cleaner (I will post new patches to the corresponding thread a
little later). It works fine and gives quick vacuum a 2-3% performance
gain compared with version v3 on my 24-core test server.
Note that the interface of the _bt_moveright() and _bt_binsrch()
functions, with its combination of scankey, scantid and nextkey
parameters, is semantically overloaded.
Every time I read the code I have to spend time remembering what these
functions do exactly.
Maybe the comments need rewriting. For example, the _bt_moveright()
comments could include a phrase like:
nextkey=false: traverse to the next suitable index page if the current
page does not contain the value (scan key; scan tid).

What do you think about submitting the patch to the next CF?

On 12.09.2018 23:11, Peter Geoghegan wrote:

Attached is v4. I have two goals in mind for this revision, goals that
are of great significance to the project as a whole:

* Making better choices around leaf page split points, in order to
maximize suffix truncation and thereby maximize fan-out. This is
important when there are mostly-distinct index tuples on each leaf
page (i.e. most of the time). Maximizing the effectiveness of suffix
truncation needs to be weighed against the existing/main
consideration: evenly distributing space among each half of a page
split. This is tricky.

* Not regressing the logic that lets us pack leaf pages full when
there are a great many logical duplicates. That is, I still want to
get the behavior I described on the '"Write amplification" is made
worse by "getting tired" while inserting into nbtree secondary
indexes' thread [1]. This is not something that happens as a
consequence of thinking about suffix truncation specifically, and
seems like a fairly distinct thing to me. It's actually a bit similar
to the rightmost 90/10 page split case.

v4 adds significant new logic to make us do better on the first goal,
without hurting the second goal. It's easy to regress one while
focussing on the other, so I've leaned on a custom test suite
throughout development. Previous versions mostly got the first goal
wrong, but got the second goal right. For the time being, I'm
focussing on index size, on the assumption that I'll be able to
demonstrate a nice improvement in throughput or latency later. I can
get the main TPC-C order_line pkey about 7% smaller after an initial
bulk load with the new logic added to get the first goal (note that
the benefits with a fresh CREATE INDEX are close to zero). The index
is significantly smaller, even though the internal page index tuples
can themselves never be any smaller due to alignment -- this is all
about not restricting what can go on each leaf page by too much. 7% is
not as dramatic as the "get tired" case, which saw something like a
50% decrease in bloat for one pathological case, but it's still
clearly well worth having. The order_line primary key is the largest
TPC-C index, and I'm merely doing a standard bulk load to get this 7%
shrinkage. The TPC-C order_line primary key happens to be kind of
adversarial or pathological to B-Tree space management in general, but
it's still fairly realistic.

For the first goal, page splits now weigh what I've called the
"distance" between tuples, with a view to getting the most
discriminating split point -- the leaf split point that maximizes the
effectiveness of suffix truncation, within a range of acceptable split
points (acceptable from the point of view of not implying a lopsided
page split). This is based on probing IndexTuple contents naively when
deciding on a split point, without regard for the underlying
opclass/types. We mostly just use char integer comparisons to probe,
on the assumption that that's a good enough proxy for using real
insertion scankey comparisons (only actual truncation goes to those
lengths, since that's a strict matter of correctness). This distance
business might be considered a bit iffy by some, so I want to get
early feedback. This new "distance" code clearly needs more work, but
I felt that I'd gone too long without posting a new version.
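
To illustrate what I mean by probing naively -- this is not the patch's
actual code, just a crude stand-in for the idea:

/*
 * Illustration only.  One crude way to put a number on the "distance"
 * between two adjacent leaf tuples: find the first byte at which their
 * data areas differ.  The earlier the bytes diverge, the more
 * discriminating a split between the two tuples would be, and the more
 * effective suffix truncation can be.
 */
static int
crude_split_distance(IndexTuple a, IndexTuple b)
{
	char	   *dataa = (char *) a + IndexInfoFindDataOffset(a->t_info);
	char	   *datab = (char *) b + IndexInfoFindDataOffset(b->t_info);
	Size		lena = IndexTupleSize(a) - IndexInfoFindDataOffset(a->t_info);
	Size		lenb = IndexTupleSize(b) - IndexInfoFindDataOffset(b->t_info);
	Size		len = Min(lena, lenb);
	Size		i;

	for (i = 0; i < len; i++)
	{
		if (dataa[i] != datab[i])
			break;
	}

	/* a lower first-difference offset means a higher "distance" */
	return (int) (len - i);
}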

For the second goal, I've added a new macro that can be enabled for
debugging purposes. This has the implementation sort heap TIDs in ASC
order, rather than DESC order. This nicely demonstrates how my two
goals for v4 are fairly independent; uncommenting "#define
BTREE_ASC_HEAP_TID" will cause a huge regression with cases where many
duplicates are inserted, but won't regress things like the TPC-C
indexes. (Note that BTREE_ASC_HEAP_TID will break the regression
tests, though in a benign way that can safely be ignored.)

Open items:

* Do more traditional benchmarking.

* Add pg_upgrade support.

* Simplify _bt_findsplitloc() logic.

[1] /messages/by-id/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

In reply to: Andrey Lepikhov (#17)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Wed, Sep 19, 2018 at 9:56 PM, Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

Note that the interface of the _bt_moveright() and _bt_binsrch()
functions, with its combination of scankey, scantid and nextkey
parameters, is semantically overloaded.
Every time I read the code I have to spend time remembering what these
functions do exactly.
Maybe the comments need rewriting.

I think that it might be a good idea to create a "BTInsertionScankey"
struct, or similar, since keysz, nextkey, the scankey array and now
scantid are all part of that, and are all common to these 4 or so
functions. It could have a flexible array at the end, so that we still
only need a single palloc(). I'll look into that.
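
Roughly, I imagine something like the sketch below -- nothing is
settled, and the struct name and fields are just the ones mentioned
above:

typedef struct BTInsertionScankey
{
	ItemPointer	scantid;	/* tie-breaker heap TID, or NULL */
	bool		nextkey;	/* search for first item > scankey, not >= */
	int			keysz;		/* number of valid entries in scankeys[] */
	/* one entry per key column, allocated in the same palloc() chunk */
	ScanKeyData	scankeys[FLEXIBLE_ARRAY_MEMBER];
} BTInsertionScankey;

_bt_mkscankey() would then size the single allocation as
offsetof(BTInsertionScankey, scankeys) + indnkeyatts *
sizeof(ScanKeyData), and _bt_search(), _bt_moveright(), _bt_binsrch()
and _bt_compare() would take a pointer to the struct instead of
separate scankey/scantid/nextkey arguments.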

What do you think about submitting the patch to the next CF?

Clearly the project that you're working on is a difficult one. It's
easy for me to understand why you might want to take an iterative
approach, with lots of prototyping. Your patch needs attention to
advance, and IMV the CF is the best way to get that attention. So, I
think that it would be fine to go submit it now.

I must admit that I didn't even notice that your patch lacked a CF
entry. Everyone has a different process, perhaps.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#16)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Wed, Sep 19, 2018 at 11:23 AM Peter Geoghegan <pg@bowt.ie> wrote:

3 modes
-------

My new approach is to teach _bt_findsplitloc() 3 distinct modes of
operation: Regular/default mode, many duplicates mode, and single
value mode.

I think that I'll have to add a fourth mode, since I came up with
another strategy that is really effective though totally complementary
to the other 3 -- "multiple insertion point" mode. Credit goes to
Kevin Grittner for pointing out that this technique exists about 2
years ago [1]/messages/by-id/CACjxUsN5fV0kV=YirXwA0S7LqoOJuy7soPtipDhUCemhgwoVFg@mail.gmail.com. The general idea is to pick a split point just after
the insertion point of the new item (the incoming tuple that prompted
a page split) when it looks like there are localized monotonically
increasing ranges. This is like a rightmost 90:10 page split, except
the insertion point is not at the rightmost page on the level -- it's
rightmost within some local grouping of values.

This makes the two largest TPC-C indexes *much* smaller. Previously,
they were shrunk by a little over 5% by using the new generic
strategy, a win that now seems like small potatoes. With this new
mode, TPC-C's order_line primary key, which is the largest index of
all, is ~45% smaller following a standard initial bulk load at
scalefactor 50. It shrinks from 99,085 blocks (774.10 MiB) to 55,020
blocks (429.84 MiB). It's actually slightly smaller than it would be
after a fresh REINDEX with the new strategy. We see almost as big a
win with the second largest TPC-C index, the stock table's primary key
-- it's ~40% smaller.

Here is the definition of the biggest index, the order line primary key index:

pg@tpcc[3666]=# \d order_line_pkey
      Index "public.order_line_pkey"
  Column   │  Type   │ Key? │ Definition
───────────┼─────────┼──────┼────────────
 ol_w_id   │ integer │ yes  │ ol_w_id
 ol_d_id   │ integer │ yes  │ ol_d_id
 ol_o_id   │ integer │ yes  │ ol_o_id
 ol_number │ integer │ yes  │ ol_number
primary key, btree, for table "public.order_line"

The new strategy/mode works very well because we see monotonically
increasing inserts on ol_number (an order's item number), but those
are grouped by order. It's kind of an adversarial case for our
existing implementation, and yet it seems like it's probably a fairly
common scenario in the real world.
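
To make that concrete, the inserted keys arrive in roughly this pattern
(values invented purely for illustration):

(ol_w_id=1, ol_d_id=5, ol_o_id=3001, ol_number=1)
(ol_w_id=1, ol_d_id=5, ol_o_id=3001, ol_number=2)
    ...
(ol_w_id=1, ol_d_id=5, ol_o_id=3001, ol_number=11)
(ol_w_id=1, ol_d_id=5, ol_o_id=3002, ol_number=1)
(ol_w_id=1, ol_d_id=5, ol_o_id=3002, ol_number=2)
    ...

Each order is a short ascending run of ol_number values, so each new
tuple lands just to the right of the previous one within its own local
grouping, even though that grouping is nowhere near the rightmost leaf
page of the index as a whole.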

Obviously these are very significant improvements. They really exceed
my initial expectations for the patch. TPC-C is generally considered
to be by far the most influential database benchmark of all time, and
this is something that we need to pay more attention to. My sense is
that the TPC-C benchmark is deliberately designed to almost require
that the system under test have this "multiple insertion point" B-Tree
optimization, suffix truncation, etc. This is exactly the same index
that we've seen reports of out of control bloat on when people run
TPC-C over hours or days [2]https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/.

My next task is to find heuristics to make the new page split
mode/strategy kick in when it's likely to help, but not kick in when
it isn't (when we want something close to a generic 50:50 page split).
These heuristics should look similar to what I've already done to get
cases with lots of duplicates to behave sensibly. Anyone have any
ideas on how to do this? I might end up inferring a "multiple
insertion point" case from the fact that there are multiple
pass-by-value attributes for the index, with the new/incoming tuple
having distinct-to-immediate-left-tuple attribute values for the last
column, but not the first few. It also occurs to me to consider the
fragmentation of the page as a guide, though I'm less sure about that.
I'll probably need to experiment with a variety of datasets before I
settle on something that looks good. Forcing the new strategy without
considering any of this actually works surprisingly well on cases
where you'd think it wouldn't, since a 50:50 page split is already
something of a guess about where future insertions will end up.
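
To sketch the first idea in code -- this is not from the patch, and
first_differing_attribute() is a made-up helper that would compare the
two tuples column by column using the insertion scankey machinery:

/*
 * Hypothetical heuristic sketch: infer a "multiple insertion point"
 * case when the incoming tuple differs from its immediate neighbor to
 * the left only in the last key attribute.
 */
static bool
looks_like_multiple_insertion_point(Relation rel, IndexTuple lefttup,
									IndexTuple newtup)
{
	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
	int			firstdiff;

	/* 1-based index of the first attribute whose values differ */
	firstdiff = first_differing_attribute(rel, lefttup, newtup);

	/* all earlier attributes equal, only the last one is distinct */
	return firstdiff == nkeyatts;
}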

[1]: /messages/by-id/CACjxUsN5fV0kV=YirXwA0S7LqoOJuy7soPtipDhUCemhgwoVFg@mail.gmail.com
[2]: https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/
--
Peter Geoghegan

#20 Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Peter Geoghegan (#16)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 19/09/2018 20:23, Peter Geoghegan wrote:

Attached is v5,

So. I don't know much about the btree code, so don't believe anything I
say.

I was very interested in the bloat test case that you posted on
2018-07-09 and I tried to understand it more. The current method for
inserting a duplicate value into a btree is going to the leftmost point
for that value and then move right until we find some space or we get
"tired" of searching, in which case just make some space right there.
The problem is that it's tricky to decide when to stop searching, and
there are scenarios when we stop too soon and repeatedly miss all the
good free space to the right, leading to bloat even though the index is
perhaps quite empty.

I tried playing with the getting-tired factor (it could plausibly be a
reloption), but that wasn't very successful. You can use that to
postpone the bloat, but you won't stop it, and performance becomes terrible.
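
For anyone following along, the loop on master looks roughly like this
(paraphrased from _bt_findinsertloc(), not quoted exactly):

	/*
	 * Keep stepping right while the new tuple doesn't fit and there may
	 * still be duplicates of the scankey value on pages to the right.
	 * Each step has a roughly 1% chance of "getting tired", in which
	 * case we stop and split the current page instead.
	 */
	while (PageGetFreeSpace(page) < itemsz)
	{
		if (P_RIGHTMOST(lpageop) ||
			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
			random() <= (MAX_RANDOM_VALUE / 100))
			break;

		/* ... lock the right sibling, release this page, and retry ... */
	}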

You propose to address this by appending the tid to the index key, so
each key, even if its "payload" is a duplicate value, is unique and has
a unique place, so we never have to do this "tiresome" search. This
makes a lot of sense, and the results in the bloat test you posted are
impressive and reproducible.

I tried a silly alternative approach by placing a new duplicate key in a
random location. This should be equivalent since tids are effectively
random. I didn't quite get this to fully work yet, but at least it
doesn't blow up, and it gets the same regression test ordering
differences for pg_depend scans that you are trying to paper over. ;-)

As far as the code is concerned, I agree with Andrey Lepikhov that one
more abstraction layer that somehow combines the scankey and the tid or
some combination like that would be useful, instead of passing the tid
as a separate argument everywhere.

I think it might help this patch move along if it were split up a bit,
for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
That way, it would also be easier to test out each piece separately.
For example, how much space does suffix truncation save in what
scenario, are there any performance regressions, etc. In the last few
versions, the patches have still been growing significantly in size and
functionality, and most of the supposed benefits are not readily visible
in tests.

And of course we need to think about how to handle upgrades, but you
have already started a separate discussion about that.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In reply to: Peter Eisentraut (#20)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Fri, Sep 28, 2018 at 7:50 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

So. I don't know much about the btree code, so don't believe anything I
say.

I think that showing up and reviewing this patch makes you somewhat of
an expert, by default. There just isn't enough expertise in this area.

I was very interested in the bloat test case that you posted on
2018-07-09 and I tried to understand it more.

Up until recently, I thought that I would justify the patch primarily
as a project to make B-Trees less bloated when there are many
duplicates, with maybe as many as a dozen or more secondary benefits.
That's what I thought it would say in the release notes, even though
the patch was always a broader strategic thing. Now I think that the
TPC-C multiple insert point bloat issue might be the primary headline
benefit, though.

I hate to add more complexity to get it to work well, but just look at
how much smaller the indexes are following an initial bulk load (bulk
insertions) using my working copy of the patch:

Master

customer_pkey: 75 MB
district_pkey: 40 kB
idx_customer_name: 107 MB
item_pkey: 2216 kB
new_order_pkey: 22 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 60 MB
oorder_pkey: 78 MB
order_line_pkey: 774 MB
stock_pkey: 181 MB
warehouse_pkey: 24 kB

Patch

customer_pkey: 50 MB
district_pkey: 40 kB
idx_customer_name: 105 MB
item_pkey: 2216 kB
new_order_pkey: 12 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 61 MB
oorder_pkey: 42 MB
order_line_pkey: 429 MB
stock_pkey: 111 MB
warehouse_pkey: 24 kB

All of the indexes used by oltpbench to do TPC-C are listed, so you're
seeing the full picture for TPC-C bulk loading here (actually, there
is another index that has an identical definition to
oorder_o_w_id_o_d_id_o_c_id_o_id_key for some reason, which is omitted
as redundant). As you can see, all the largest indexes are
*significantly* smaller, with the exception of
oorder_o_w_id_o_d_id_o_c_id_o_id_key. You won't be able to see this
improvement until I post the next version, though, since this is a
brand new development. Note that VACUUM hasn't been run at all, and
doesn't need to be run, as there are no dead tuples. Note also that
this has *nothing* to do with getting tired -- almost all of these
indexes are unique indexes.
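
(Working the ratios out from the numbers above: order_line_pkey shrinks
by roughly 45%, oorder_pkey by roughly 46%, new_order_pkey by roughly
45%, stock_pkey by roughly 39%, and customer_pkey by roughly 33%, while
idx_customer_name barely moves and
oorder_o_w_id_o_d_id_o_c_id_o_id_key is the one index that grows very
slightly.)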

Note that I'm also testing TPC-E and TPC-H in a very similar way,
which have both been improved noticeably, but to a degree that's much
less compelling than what we see with TPC-C. They have "getting tired"
cases that benefit quite a bit, but those are the minority.

Have you ever used HammerDB? I got this data from oltpbench, but I
think that HammerDB might be the way to go for TPC-C testing of
Postgres.

You propose to address this by appending the tid to the index key, so
each key, even if its "payload" is a duplicate value, is unique and has
a unique place, so we never have to do this "tiresome" search. This
makes a lot of sense, and the results in the bloat test you posted are
impressive and reproducible.

Thanks.

I tried a silly alternative approach by placing a new duplicate key in a
random location. This should be equivalent since tids are effectively
random.

You're never going to get any other approach to work remotely as well,
because while the TIDs may seem to be random in some sense, they have
various properties that are very useful from a high level, data life
cycle point of view. For insertions of duplicates, heap TID has
temporal locality -- you are only going to dirty one or two leaf
pages, rather than potentially dirtying dozens or hundreds.
Furthermore, heap TID is generally strongly correlated with primary
key values, so VACUUM can be much much more effective at killing
duplicates in low cardinality secondary indexes when there are DELETEs
with a range predicate on the primary key. This is a lot more
realistic than the 2018-07-09 test case, but it still could make as
big of a difference.

I didn't quite get this to fully work yet, but at least it
doesn't blow up, and it gets the same regression test ordering
differences for pg_depend scans that you are trying to paper over. ;-)

FWIW, I actually just added to the papering over, rather than creating
a new problem. There are plenty of instances of "\set VERBOSITY terse"
in the regression tests already, for the same reason. If you run the
regression tests with ignore_system_indexes=on, there are very similar
failures [1]/messages/by-id/CAH2-Wz=wAKwhv0PqEBFuK2_s8E60kZRMzDdyLi=-MvcM_pHN_w@mail.gmail.com.

As far as the code is concerned, I agree with Andrey Lepikhov that one
more abstraction layer that somehow combines the scankey and the tid or
some combination like that would be useful, instead of passing the tid
as a separate argument everywhere.

I've already drafted this in my working copy. It is a clear
improvement. You can expect it in the next version.

I think it might help this patch move along if it were split up a bit,
for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
That way, it would also be easier to test out each piece separately.
For example, how much space does suffix truncation save in what
scenario, are there any performance regressions, etc.

I'll do my best. I don't think I can sensibly split out suffix
truncation from the TID stuff -- those seem truly inseparable, since
my mental model for suffix truncation breaks without fully unique
keys. I can break out all the cleverness around choosing a split point
into its own patch, though -- _bt_findsplitloc() has only been changed
to give weight to several factors that become important. It's the
"brain" of the optimization, where 90% of the complexity actually
lives.

Removing the _bt_findsplitloc() changes will make the performance of
the other stuff pretty poor, and in particular will totally remove the
benefit for cases that "become tired" on the master branch. That could
be slightly interesting, I suppose.

In the last few
versions, the patches have still been growing significantly in size and
functionality, and most of the supposed benefits are not readily visible
in tests.

I admit that this patch has continued to evolve up until this week,
despite the fact that I thought it would be a lot more settled by now.
It has actually become simpler in recent months, though. And, I think
that the results justify the iterative approach I've taken. This stuff
is inherently very subtle, and I've had to spend a lot of time paying
attention to tiny regressions across a fairly wide variety of test
cases.

And of course we need to think about how to handle upgrades, but you
have already started a separate discussion about that.

Right.

[1]: /messages/by-id/CAH2-Wz=wAKwhv0PqEBFuK2_s8E60kZRMzDdyLi=-MvcM_pHN_w@mail.gmail.com
--
Peter Geoghegan

#22 Andrey Lepikhov
a.lepikhov@postgrespro.ru
In reply to: Peter Geoghegan (#21)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

28.09.2018 23:08, Peter Geoghegan wrote:

On Fri, Sep 28, 2018 at 7:50 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

I think it might help this patch move along if it were split up a bit,
for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
That way, it would also be easier to test out each piece separately.
For example, how much space does suffix truncation save in what
scenario, are there any performance regressions, etc.

I'll do my best. I don't think I can sensibly split out suffix
truncation from the TID stuff -- those seem truly inseparable, since
my mental model for suffix truncation breaks without fully unique
keys. I can break out all the cleverness around choosing a split point
into its own patch, though -- _bt_findsplitloc() has only been changed
to give weight to several factors that become important. It's the
"brain" of the optimization, where 90% of the complexity actually
lives.

Removing the _bt_findsplitloc() changes will make the performance of
the other stuff pretty poor, and in particular will totally remove the
benefit for cases that "become tired" on the master branch. That could
be slightly interesting, I suppose.

I am reviewing this patch too, and I join Peter Eisentraut's opinion
about splitting the patch into a hierarchy of two or three patches:
"functional" - the TID stuff - and "optimizational" - suffix truncation &
splitting. My reasons are simplification of code review, investigation,
and benchmarking.
Right now, benchmarking is not clear: possible performance degradation
from TID ordering interferes with the positive effects of the
optimizations in a non-trivial way.

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

In reply to: Andrey Lepikhov (#22)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Fri, Sep 28, 2018 at 10:58 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

I am reviewing this patch too, and I join Peter Eisentraut's opinion
about splitting the patch into a hierarchy of two or three patches:
"functional" - the TID stuff - and "optimizational" - suffix truncation &
splitting. My reasons are simplification of code review, investigation,
and benchmarking.

As I mentioned to Peter, I don't think that I can split out the heap
TID stuff from the suffix truncation stuff. At least not without
making the patch even more complicated, for no benefit. I will split
out the "brain" of the patch (the _bt_findsplitloc() stuff, which
decides on a split point using sophisticated rules) from the "brawn"
(the actual changes to how index scans work, including the heap TID
stuff, as well as the code for actually physically performing suffix
truncation). The brain of the patch is where most of the complexity
is, as well as most of the code. The brawn of the patch is _totally
unusable_ without intelligence around split points, but I'll split
things up along those lines anyway. Doing so should make the whole
design a little easier to follow.

Right now, benchmarking is not clear: possible performance degradation
from TID ordering interferes with the positive effects of the
optimizations in a non-trivial way.

Is there any evidence of a regression in the last 2 versions? I've
been using pgbench, which didn't show any. That's not a sympathetic
case for the patch, though it would be nice to confirm if there was
some small improvement there. I've seen contradictory results (slight
improvements and slight regressions), but that was with a much earlier
version, so it just isn't relevant now. pgbench is mostly interesting
as a thing that we want to avoid regressing.

Once I post the next version, it would be great if somebody could use
HammerDB's OLTP test, which seems like the best fair use
implementation of TPC-C that's available. I would like to make that
the "this is why you should care, even if you happen to not believe in
the patch's strategic importance" benchmark. TPC-C is clearly the most
influential database benchmark ever, so I think that that's a fair
request. (See the TPC-C commentary at
https://www.hammerdb.com/docs/ch03s02.html, for example.)

--
Peter Geoghegan

In reply to: Peter Geoghegan (#23)
6 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sun, Sep 30, 2018 at 2:33 PM Peter Geoghegan <pg@bowt.ie> wrote:

Right now, benchmarking is not clear: possible performance degradation
from TID ordering interferes with the positive effects of the
optimizations in a non-trivial way.

Is there any evidence of a regression in the last 2 versions?

I did find a pretty clear regression, though only with writes to
unique indexes. Attached is v6, which fixes the issue. More on that
below.

v6 also:

* Adds a new-to-v6 "insert at new item's insertion point"
optimization, which is broken out into its own commit.

This *greatly* improves the index bloat situation with the TPC-C
benchmark in particular, even before the benchmark starts (just with
the initial bulk load). See the relevant commit message for full
details, or a couple of my previous mails on this thread. I will
provide my own TPC-C test data + test case to any reviewer that wants
to see this for themselves. It shouldn't be hard to verify the
improvement in raw index size with any TPC-C implementation, though.
Please make an off-list request if you're interested. The raw dump is
1.8GB.

The exact details of when this new optimization kicks in and how it
works are tentative. They should really be debated. Reviewers should
try to think of edge cases in which my "heap TID adjacency" approach
could make the optimization kick in when it shouldn't -- cases where
it causes bloat rather than preventing it. I couldn't find any such
regressions, but this code was written very recently.

I should also look into using HammerDB to do a real TPC-C benchmark,
and really put the patch to the test...anybody have experience with
it?

* Generally groups everything into a relatively manageable series of
cumulative improvements, starting with the infrastructure required to
physically truncate tuples correctly, without any of the smarts around
selecting a split point.

The base patch is useless on its own, since the split point selection
smarts are needed to see a consistent benefit.
Reviewers shouldn't waste their time doing any real benchmarking with
just the first patch applied.

* Adds a lot of new information to the nbtree README, about the
high-level thought process behind the design, including citing the
classic paper that this patch was primarily inspired by.

* Adds a new, dedicated insertion scan key struct --
BTScanInsert[Data]. This is passed around to a number of different
routines (_bt_search(), _bt_binsrch(), _bt_compare(), etc). This was
suggested by Andrey, and also requested by Peter Eisentraut.

While this BTScanInsert work started out as straightforward
refactoring, it actually led to my discovering and fixing the
regression I mentioned. Previously, I passed a lower bound on a binary
search to _bt_binsrch() within _bt_findinsertloc(). This wasn't nearly
as effective as what the master branch does for unique indexes at the
same point -- it usually manages to reuse a result from an earlier
_bt_binsrch() as the offset for the new tuple, since it has no need to
worry about the new tuple's position *among duplicates* on the page.
In earlier versions of my patch, most of the work of a second binary
search took place, despite being redundant and unnecessary. This
happened for every new insertion into a unique index -- I could
easily measure the problem with a simple serial test case. I can see
no regression there against master now, though.

My fix for the regression involves including some mutable state in the
new BTScanInsert struct (within v6-0001-*patch), to explicitly
remember and restore some internal details across two binary searches
against the same leaf page. We now remember a useful lower *and* upper
bound within _bt_binsrch(), which is what is truly required to fix the
regression. While there is still a second call to _bt_binsrch() within
_bt_findinsertloc() for unique indexes, it will do no comparisons in
the common case where there are no existing dead duplicate tuples in
the unique index. This means that the number of _bt_compare() calls we
get in this _bt_findinsertloc() unique index path is the same as the
master branch in almost all cases (I instrumented the regression tests
to make sure of this). I also think that having BTScanInsert will ease
things around pg_upgrade support, something that remains an open item.
Changes in this area seem to make everything clearer -- the signature
of _bt_findinsertloc() seemed a bit jumbled to me.
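
To make the bounds idea concrete, here's a toy, stand-alone sketch
(plain C; the SearchBounds struct and bounded_binsrch() function are
invented names for illustration, not code from the patch).  A first
binary search of a "page" records the range that it converged on, so a
second search of the same page for the same key doesn't have to repeat
any comparisons:

#include <stdio.h>

typedef struct SearchBounds
{
	int			low;		/* smallest offset that might hold the key */
	int			high;		/* one past the largest candidate offset */
	int			ncompares;	/* comparisons performed, for demonstration */
} SearchBounds;

/* Binary search that starts from (and then narrows) cached bounds */
static int
bounded_binsrch(const int *page, int key, SearchBounds *bounds)
{
	int			low = bounds->low;
	int			high = bounds->high;

	while (low < high)
	{
		int			mid = low + (high - low) / 2;

		bounds->ncompares++;
		if (page[mid] < key)
			low = mid + 1;
		else
			high = mid;
	}

	/* Remember the narrowed range for a later search of the same page */
	bounds->low = low;
	bounds->high = high;
	return low;				/* first offset whose value is >= key */
}

int
main(void)
{
	int			page[] = {10, 20, 20, 20, 30, 40, 50, 60, 70, 80};
	SearchBounds bounds = {0, 10, 0};
	int			off = bounded_binsrch(page, 20, &bounds);

	/* Second search of the same page: cached bounds already converged */
	int			off2 = bounded_binsrch(page, 20, &bounds);

	printf("offset=%d offset2=%d total compares=%d\n",
		   off, off2, bounds.ncompares);
	return 0;
}

The patch does something loosely analogous with the lower and upper
bounds stashed in the BTScanInsert struct between the first
_bt_binsrch() call and the later one in _bt_findinsertloc(), though the
real code obviously has to deal with scan keys, dead duplicates, and so
on.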

Aside: I think that this BTScanInsert mutable state idea could be
pushed even further in the future. "Dynamic prefix truncation" could
be implemented by taking a similar approach when descending composite
indexes for an index scan (doesn't have to be a unique index). We can
observe that earlier attributes must all be equal to our own scankey's
values once we descend the tree and pass between a pair of pivot
tuples where a common prefix (some number of leading attributes) is
fully equal. It's safe to just not bother comparing these prefix
attributes on lower levels, because we can reason about their values
transitively; _bt_compare() can be told to always skip the first
attribute or two during later/lower-in-the-tree binary searches. This
idea will not be implemented for Postgres v12 by me, though.
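
Here's an equally toy sketch of that aside, in case it's hard to
picture (again stand-alone C, with invented names -- ToyTuple and
toy_compare() -- rather than anything from the patch).  The point is
just that a comparison can safely start at the first attribute that
wasn't already proven equal by the pivot tuples passed on the way down
the tree:

#include <stdio.h>

#define NATTS 3

typedef struct ToyTuple
{
	int			atts[NATTS];
} ToyTuple;

/* Compare a scan key against a tuple, skipping a known-equal prefix */
static int
toy_compare(const int *scankey, const ToyTuple *tup, int skipatts)
{
	for (int i = skipatts; i < NATTS; i++)
	{
		if (scankey[i] < tup->atts[i])
			return -1;
		if (scankey[i] > tup->atts[i])
			return 1;
	}
	return 0;
}

int
main(void)
{
	int			scankey[NATTS] = {7, 7, 42};
	ToyTuple	tup = {{7, 7, 40}};

	/*
	 * Suppose the descent passed between two pivot tuples that were both
	 * equal on their first two attributes (7, 7).  Every tuple on the
	 * page below must share that prefix, so the comparison can begin at
	 * the third attribute and still give the same answer.
	 */
	printf("full compare: %d\n", toy_compare(scankey, &tup, 0));
	printf("prefix-skipping compare: %d\n", toy_compare(scankey, &tup, 2));
	return 0;
}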

--
Peter Geoghegan

Attachments:

v6-0005-DEBUG-Add-pageinspect-instrumentation.patchapplication/x-patch; name=v6-0005-DEBUG-Add-pageinspect-instrumentation.patchDownload
From 621965edb42e27d279333e23c031c2f2ca5aeef1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v6 5/6] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 66 +++++++++++++++----
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 +++++++
 3 files changed, 77 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 184ac62255..a6e61e224a 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
@@ -242,6 +243,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -253,9 +255,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[7];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -264,6 +266,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer htid;
+	BTPageOpaque opaque;
 
 	id = PageGetItemId(page, offset);
 
@@ -282,16 +286,51 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = natts;
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		indnkeyatts = rel->rd_index->indnkeyatts;
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (P_ISLEAF(opaque) && offset >= P_FIRSTDATAKEY(opaque))
+		htid = &itup->t_tid;
+	else
+		htid = BTreeTupleGetHeapTID(itup);
+
+	if (htid)
+		values[j] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(htid),
+							 ItemPointerGetOffsetNumberNoCheck(htid));
+	else
+		values[j] = NULL;
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -365,11 +404,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -396,12 +435,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -481,7 +521,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

v6-0006-DEBUG-Allow-nbtree-to-use-ASC-heap-TID-order.patchapplication/x-patch; name=v6-0006-DEBUG-Allow-nbtree-to-use-ASC-heap-TID-order.patchDownload
From cb9e115a3a665f8a6dcde5cee39b9fa46852dd7d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 3 Oct 2018 10:40:47 -0700
Subject: [PATCH v6 6/6] DEBUG: Allow nbtree to use ASC heap TID order.

When the macro BTREE_ASC_HEAP_TID is defined (uncommented), the patch
will change the implementation to use ASC sort order rather than DESC
sort order.  This may be useful to reviewers.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.
---
 src/backend/access/nbtree/nbtinsert.c |  4 ++++
 src/backend/access/nbtree/nbtsearch.c | 11 +++++++++++
 src/backend/access/nbtree/nbtsort.c   |  4 ++++
 src/backend/access/nbtree/nbtutils.c  | 12 ++++++++++++
 src/backend/utils/sort/tuplesort.c    | 10 ++++++++++
 src/include/access/nbtree.h           | 22 ++++++++++++++++++++++
 6 files changed, 63 insertions(+)

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 5bfafc0892..6336f90d4e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -2396,7 +2396,11 @@ _bt_perfect_penalty(Relation rel, Page page, bool is_leaf, SplitMode mode,
 		 */
 		if (outerpenalty > IndexRelationGetNumberOfKeyAttributes(rel))
 		{
+#ifndef BTREE_ASC_HEAP_TID
 			if (P_FIRSTDATAKEY(opaque) == newitemoff)
+#else
+			if (maxoff < newitemoff)
+#endif
 				*secondmode = SPLIT_SINGLE_VALUE;
 			else
 			{
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d3ac408a6d..2a3c915085 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -637,8 +637,12 @@ _bt_tuple_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+#ifndef BTREE_ASC_HEAP_TID
 	/* Deliberately invert the order, since TIDs "sort DESC" */
 	return ItemPointerCompare(heapTid, key->scantid);
+#else
+	return ItemPointerCompare(key->scantid, heapTid);
+#endif
 }
 
 /*
@@ -1182,9 +1186,16 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	{
 		key->scantid = &minscantid;
 
+#ifndef BTREE_ASC_HEAP_TID
 		/* Heap TID attribute uses DESC ordering */
 		ItemPointerSetBlockNumber(key->scantid, InvalidBlockNumber);
 		ItemPointerSetOffsetNumber(key->scantid, InvalidOffsetNumber);
+#else
+		/* Lowest possible block is 0 */
+		ItemPointerSetBlockNumber(key->scantid, 0);
+		/* InvalidOffsetNumber less than any real offset */
+		ItemPointerSetOffsetNumber(key->scantid, InvalidOffsetNumber);
+#endif
 	}
 
 	/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c8e0e75487..16416a97f9 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1156,8 +1156,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				 */
 				if (compare == 0)
 				{
+#ifndef BTREE_ASC_HEAP_TID
 					/* Deliberately invert the order, since TIDs "sort DESC" */
 					compare = ItemPointerCompare(&itup2->t_tid, &itup->t_tid);
+#else
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+#endif
 					Assert(compare != 0);
 					if (compare > 0)
 						load1 = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index aeca964716..7e4493cd8d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2237,7 +2237,14 @@ _bt_suffix_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
+#ifndef BTREE_ASC_HEAP_TID
 	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+#else
+	/* Manufacture TID that's less than right TID, but only minimally */
+	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerSetOffsetNumber(pivotheaptid,
+							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+#endif
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2247,9 +2254,14 @@ _bt_suffix_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 	 * split).
 	 */
 
+#ifndef BTREE_ASC_HEAP_TID
 	/* Deliberately invert the order, since TIDs "sort DESC" */
 	Assert(ItemPointerCompare(&lastleft->t_tid, pivotheaptid) >= 0);
 	Assert(ItemPointerCompare(&firstright->t_tid, pivotheaptid) < 0);
+#else
+	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
 	BTreeTupleSetAltHeapTID(pivot);
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d0397008db..ee93912626 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4066,17 +4066,27 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
 		BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
 
+#ifndef BTREE_ASC_HEAP_TID
 		/* Deliberately invert the order, since TIDs "sort DESC" */
 		if (blk1 != blk2)
 			return (blk1 < blk2) ? 1 : -1;
+#else
+		if (blk1 != blk2)
+			return (blk1 < blk2) ? -1 : 1;
+#endif
 	}
 	{
 		OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
 		OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
 
+#ifndef BTREE_ASC_HEAP_TID
 		/* Deliberately invert the order, since TIDs "sort DESC" */
 		if (pos1 != pos2)
 			return (pos1 < pos2) ? 1 : -1;
+#else
+		if (pos1 != pos2)
+			return (pos1 < pos2) ? -1 : 1;
+#endif
 	}
 
 	return 0;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 1e9869b30e..db6d850de8 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -117,6 +117,24 @@ typedef struct BTMetaPageData
 #define BTREE_VERSION	4		/* current version number */
 #define BTREE_MIN_VERSION	4	/* minimal supported version number */
 
+/*
+ * Heap TID behaves as a final key value within nbtree as of
+ * BTREE_VERSION 4.  This ensures that all entry keys are unique
+ * and relocatable.  By default, heap TIDs are sorted in DESC sort
+ * order within nbtree indexes.  ASC heap TID ordering may be
+ * useful during testing.
+ *
+ * DESC order was chosen because it allowed BTREE_VERSION 4 to
+ * maintain compatibility with unspecified BTREE_VERSION 2 + 3
+ * behavior that dependency management nevertheless relied on.
+ * However, DESC order also seems like it might be slightly better
+ * on its own merits, since continually splitting the same leaf
+ * page may cut down on the total number of FPIs generated when
+ * continually inserting tuples with the same user-visible
+ * attribute values.
+#define BTREE_ASC_HEAP_TID
+ */
+
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
@@ -151,7 +169,11 @@ typedef struct BTMetaPageData
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#ifndef BTREE_ASC_HEAP_TID
 #define BTREE_SINGLEVAL_FILLFACTOR	1
+#else
+#define BTREE_SINGLEVAL_FILLFACTOR	99
+#endif
 
 /*
  *	In general, the btree code tries to localize its knowledge about
-- 
2.17.1

v6-0003-Add-split-at-new-tuple-page-split-optimization.patchapplication/x-patch; name=v6-0003-Add-split-at-new-tuple-page-split-optimization.patchDownload
From e4b27a5e59f7acc4139a80888d2ffa7a88e95834 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 16:48:08 -0700
Subject: [PATCH v6 3/6] Add split-at-new-tuple page split optimization.

Add additional heuristics to the algorithm for locating an optimal split
location.  New logic identifies localized monotonically increasing
values by recognizing adjacent heap TIDs.  Only non-rightmost pages are
affected, to preserve existing behavior.

This enhancement is new to version 6 of the patch series.  This
enhancement has been demonstrated to be very effective at avoiding index
bloat when initial bulk INSERTs for the TPC-C benchmark are run.
Evidently, the primary keys for all of the largest indexes in the TPC-C
schema are populated through localized, monotonically increasing values:

Master
======

order_line_pkey: 774 MB
stock_pkey: 181 MB
idx_customer_name: 107 MB
oorder_pkey: 78 MB
customer_pkey: 75 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 60 MB
new_order_pkey: 22 MB
item_pkey: 2216 kB
district_pkey: 40 kB
warehouse_pkey: 24 kB

Patch series, up to and including this commit
=============================================

order_line_pkey: 451 MB
stock_pkey: 114 MB
idx_customer_name: 105 MB
oorder_pkey: 45 MB
customer_pkey: 48 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 61 MB
new_order_pkey: 13 MB
item_pkey: 2216 kB
district_pkey: 40 kB
warehouse_pkey: 24 kB

Without this patch, but with all previous patches in the series, a much
more modest reduction in the volume of bloat occurs when the same test
case is run.  There is a reduction in the size of the largest index (the
order line primary key) of ~5% of its original size, whereas we see a
reduction of ~42% here.  (Note that the patch series generally has very
little advantage over master if the indexes are rebuilt via a REINDEX,
with or without this later commit.)

I (Peter Geoghegan) will provide reviewers with a convenient copy of
this test data if asked.  It comes from the oltpbench fair-use
implementation of TPC-C [1], but the same issue has independently been
observed with the BenchmarkSQL implementation of TPC-C [2].

Note that this commit also recognizes and prevents bloat with
monotonically *decreasing* tuple insertions (e.g., single-DESC-attribute
index on a date column).  Affected cases will typically leave their
index structure slightly smaller than an equivalent monotonically
increasing case would.

[1] http://oltpbenchmark.com
[2] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c
---
 src/backend/access/nbtree/nbtinsert.c | 185 +++++++++++++++++++++++++-
 1 file changed, 183 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index a8332db7de..2b2f3a0be3 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -94,6 +94,8 @@ static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 static int _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright, bool newitemonleft,
 				  int dataitemstoleft, Size firstoldonrightsz);
+static bool _bt_dosplitatnewitem(Relation rel, Page page,
+					OffsetNumber newitemoff, IndexTuple newitem);
 static OffsetNumber _bt_bestsplitloc(Relation rel, Page page,
 				 FindSplitData *state,
 				 int perfectpenalty,
@@ -105,6 +107,7 @@ static int _bt_perfect_penalty(Relation rel, Page page, bool is_leaf,
 					SplitPoint *splits, SplitMode *secondmode);
 static int _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
 				  IndexTuple newitem, SplitPoint *split, bool is_leaf);
+static bool _bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey,
@@ -1577,7 +1580,13 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * etc) we will end up with a tree whose pages are about fillfactor% full,
  * instead of the 50% full result that we'd get without this special case.
  * This is the same as nbtsort.c produces for a newly-created tree.  Note
- * that leaf and nonleaf pages use different fillfactors.
+ * that leaf and nonleaf pages use different fillfactors.  Note also that
+ * the fillfactor% is determined dynamically when _bt_dosplitatnewitem()
+ * indicates that there are localized monotonically increasing insertions,
+ * or monotonically decreasing (DESC order) insertions. (This can only
+ * happen with the default strategy, and should be thought of as a variant
+ * of the fillfactor% special case that is applied only when inserting into
+ * non-rightmost pages.)
  *
  * If called recursively in single value mode, we also try to arrange to
  * leave the left split page fillfactor% full, though we arrange to use a
@@ -1651,7 +1660,22 @@ _bt_findsplitloc(Relation rel,
 	state.is_weighted = P_RIGHTMOST(opaque);
 	if (state.is_leaf)
 	{
-		if (mode != SPLIT_SINGLE_VALUE)
+		/*
+		 * Consider split at new tuple optimization.  See
+		 * _bt_dosplitatnewitem() for an explanation.
+		 */
+		if (mode == SPLIT_DEFAULT && !P_RIGHTMOST(opaque) &&
+			_bt_dosplitatnewitem(rel, page, newitemoff, newitem))
+		{
+			/*
+			 * fillfactor% is dynamically set through interpolation of the
+			 * new/incoming tuple's offset position
+			 */
+			state.fillfactor =
+				((double) newitemoff / (((double) maxoff + 1))) * 100;
+			state.is_weighted = true;
+		}
+		else if (mode != SPLIT_SINGLE_VALUE)
 		{
 			/* Only used on rightmost page */
 			state.fillfactor = RelationGetFillFactor(rel,
@@ -1971,6 +1995,126 @@ _bt_checksplitloc(FindSplitData *state,
 	return INT_MAX;
 }
 
+/*
+ * Subroutine to determine whether or not the page should be split at
+ * approximately the point that the new/incoming item would have been
+ * inserted.
+ *
+ * This routine infers two distinct cases in which splitting around the new
+ * item's insertion point is likely to lead to better space utilization over
+ * time:
+ *
+ * - Composite indexes that consist of one or more leading columns that
+ *   describe some grouping, plus a trailing, monotonically increasing
+ *   column.  If there happened to only be one grouping then the traditional
+ *   rightmost page split default fillfactor% would be used to good effect,
+ *   so it seems worth recognizing this case.  This usage pattern is
+ *   prevalent in the TPC-C benchmark, and is assumed to be common in real
+ *   world applications.
+ *
+ * - DESC-ordered insertions, including DESC-ordered single (non-heap-TID)
+ *   key attribute indexes.  We don't want the performance of explicitly
+ *   DESC-ordered indexes to be out of line with an equivalent ASC-ordered
+ *   index.  Also, there may be organic cases where items are continually
+ *   inserted in DESC order for an index with ASC sort order.
+ *
+ * Caller uses fillfactor% rather than using the new item offset directly
+ * because it allows suffix truncation to be applied using the usual
+ * criteria, which can still be helpful.  This approach is also more
+ * maintainable, since restrictions on split points can be handled in the
+ * usual way.
+ *
+ * Localized insert points are inferred here by observing that neighboring
+ * heap TIDs are "adjacent".  For example, if the new item has distinct key
+ * attribute values to the existing item that belongs to its immediate left,
+ * and the item to its left has a heap TID whose offset is exactly one less
+ * than the new item's offset, then caller is told to use its new-item-split
+ * strategy.  It isn't of much consequence if this routine incorrectly
+ * infers that an interesting case is taking place, provided that that
+ * doesn't happen very often.  In particular, it should not be possible to
+ * construct a test case where the routine consistently does the wrong
+ * thing.  Since heap TID "adjacency" is such a delicate condition, and
+ * since there is no reason to imagine that random insertions should ever
+ * consistent leave new tuples at the first or last position on the page
+ * when a split is triggered, that will never happen.
+ *
+ * Note that we avoid using the split-at-new fillfactor% when we'd have to
+ * append a heap TID during suffix truncation.  We also insist that there
+ * are no varwidth attributes or NULL attribute values in new item, since
+ * that invalidates interpolating from the new item offset.  Besides,
+ * varwidths generally imply the use of datatypes where ordered insertions
+ * are not a naturally occurring phenomenon.
+ */
+static bool
+_bt_dosplitatnewitem(Relation rel, Page page, OffsetNumber newitemoff,
+					 IndexTuple newitem)
+{
+	ItemId		itemid;
+	OffsetNumber maxoff;
+	BTPageOpaque opaque;
+	IndexTuple	tup;
+	int16		nkeyatts;
+
+	if (IndexTupleHasNulls(newitem) || IndexTupleHasVarwidths(newitem))
+		return false;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Avoid optimization entirely on pages with large items */
+	if (maxoff <= 3)
+		return false;
+
+	nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+	/*
+	 * When heap TIDs appear in DESC order, consider left-heavy split.
+	 *
+	 * Accept left-heavy split when new item, which will be inserted at first
+	 * data offset, has adjacent TID to extant item at that position.
+	 */
+	if (newitemoff == P_FIRSTDATAKEY(opaque))
+	{
+		itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+		tup = (IndexTuple) PageGetItem(page, itemid);
+
+		return _bt_adjacenthtid(&tup->t_tid, &newitem->t_tid) &&
+			_bt_leave_natts_fast(rel, tup, newitem) <= nkeyatts;
+	}
+
+	/* Single key indexes only use DESC optimization */
+	if (nkeyatts == 1)
+		return false;
+
+	/*
+	 * When tuple heap TIDs appear in ASC order, consider right-heavy split,
+	 * even though this may not be the right-most page.
+	 *
+	 * Accept right-heavy split when new item, which belongs after any
+	 * existing page offset, has adjacent TID to extant item that's the last
+	 * on the page.
+	 */
+	if (newitemoff > maxoff)
+	{
+		itemid = PageGetItemId(page, maxoff);
+		tup = (IndexTuple) PageGetItem(page, itemid);
+
+		return _bt_adjacenthtid(&tup->t_tid, &newitem->t_tid) &&
+			_bt_leave_natts_fast(rel, tup, newitem) <= nkeyatts;
+	}
+
+	/*
+	 * When new item is approximately in the middle of the page, look for
+	 * adjacency among new item, and extant item that belongs to the left of
+	 * the new item in the keyspace.
+	 */
+	itemid = PageGetItemId(page, OffsetNumberPrev(newitemoff));
+	tup = (IndexTuple) PageGetItem(page, itemid);
+
+	return _bt_adjacenthtid(&tup->t_tid, &newitem->t_tid) &&
+		_bt_leave_natts_fast(rel, tup, newitem) <= nkeyatts;
+}
+
 /*
  * Subroutine to find the "best" split point among an array of acceptable
  * candidate split points that split without there being an excessively high
@@ -2250,6 +2394,43 @@ _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
 	return _bt_leave_natts_fast(rel, lastleft, firstright);
 }
 
+/*
+ * Subroutine for determining if two heap TIDS are "adjacent".
+ *
+ * Adjacent means that the high TID is very likely to have been inserted into
+ * heap relation immediately after the low TID, probably by the same
+ * transaction, and probably not through heap_update().  This is not a
+ * commutative condition.
+ */
+static bool
+_bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid)
+{
+	BlockNumber lowblk,
+				highblk;
+	OffsetNumber lowoff,
+				highoff;
+
+	lowblk = ItemPointerGetBlockNumber(lowhtid);
+	highblk = ItemPointerGetBlockNumber(highhtid);
+
+	/* Not adjacent when blocks are not equal or highblk not one greater */
+	if (lowblk != highblk && lowblk + 1 != highblk)
+		return false;
+
+	lowoff = ItemPointerGetOffsetNumber(lowhtid);
+	highoff = ItemPointerGetOffsetNumber(highhtid);
+
+	/* When heap blocks match, second offset should be one up */
+	if (lowblk == highblk && OffsetNumberNext(lowoff) == highoff)
+		return true;
+
+	/* When heap block one up, second offset should be FirstOffsetNumber */
+	if (lowblk + 1 == highblk && highoff == FirstOffsetNumber)
+		return true;
+
+	return false;
+}
+
 /*
  * _bt_insert_parent() -- Insert downlink into parent after a page split.
  *
-- 
2.17.1

v6-0004-DEBUG-Add-page-split-instrumentation.patchapplication/x-patch; name=v6-0004-DEBUG-Add-page-split-instrumentation.patchDownload
From 58793ef42536b9db18b35969ab28584505b0fcf3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 16:09:28 -0700
Subject: [PATCH v6 4/6] DEBUG: Add page split instrumentation.

LOGs details on the new left high key in the event of a leaf page split.
This is an easy way to directly observe the effectiveness of suffix
truncation as it happens, which was useful during development.  The
macro DEBUG_SPLITS must be defined (uncommented) for the instrumentation
to be enabled.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.
---
 src/backend/access/nbtree/nbtinsert.c | 60 +++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 2b2f3a0be3..5bfafc0892 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -26,6 +26,11 @@
 #include "storage/smgr.h"
 #include "utils/tqual.h"
 
+/* #define DEBUG_SPLITS */
+#ifdef DEBUG_SPLITS
+#include "catalog/catalog.h"
+#endif
+
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 #define STACK_SPLIT_POINTS			15
@@ -1270,9 +1275,64 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		lefthikey = _bt_suffix_truncate(rel, lastleft, item);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
+#ifdef DEBUG_SPLITS
+		if (IsNormalProcessingMode() && !IsSystemRelation(rel))
+		{
+			TupleDesc	itupdesc = RelationGetDescr(rel);
+			Datum		values[INDEX_MAX_KEYS];
+			bool		isnull[INDEX_MAX_KEYS];
+			char	   *lastleftstr;
+			char	   *firstrightstr;
+			char	   *newstr;
+
+			index_deform_tuple(lastleft, itupdesc, values, isnull);
+			lastleftstr = BuildIndexValueDescription(rel, values, isnull);
+			index_deform_tuple(item, itupdesc, values, isnull);
+			firstrightstr = BuildIndexValueDescription(rel, values, isnull);
+			index_deform_tuple(newitem, itupdesc, values, isnull);
+			newstr = BuildIndexValueDescription(rel, values, isnull);
+
+			elog(LOG, "\"%s\" leaf block %u "
+				 "last left %s first right %s "
+				 "attributes truncated: %u from %u%s new item %s",
+				 RelationGetRelationName(rel), BufferGetBlockNumber(buf),
+				 lastleftstr, firstrightstr,
+				 IndexRelationGetNumberOfKeyAttributes(rel) - BTreeTupleGetNAtts(lefthikey, rel),
+				 IndexRelationGetNumberOfKeyAttributes(rel),
+				 BTreeTupleGetHeapTID(lefthikey) != NULL ? " (heap TID added back)" : "",
+				 newstr);
+		}
+#endif
 	}
 	else
+	{
 		lefthikey = item;
+#ifdef DEBUG_SPLITS
+		if (IsNormalProcessingMode() && !IsSystemRelation(rel))
+		{
+			TupleDesc	itupdesc = RelationGetDescr(rel);
+			Datum		values[INDEX_MAX_KEYS];
+			bool		isnull[INDEX_MAX_KEYS];
+			char	   *newhighkey;
+			char	   *newstr;
+
+			index_deform_tuple(lefthikey, itupdesc, values, isnull);
+			newhighkey = BuildIndexValueDescription(rel, values, isnull);
+			index_deform_tuple(newitem, itupdesc, values, isnull);
+			newstr = BuildIndexValueDescription(rel, values, isnull);
+
+			elog(LOG, "\"%s\" internal block %u "
+				 "new high key %s "
+				 "attributes truncated: %u from %u%s new item %s",
+				 RelationGetRelationName(rel), BufferGetBlockNumber(buf),
+				 newhighkey,
+				 IndexRelationGetNumberOfKeyAttributes(rel) - BTreeTupleGetNAtts(lefthikey, rel),
+				 IndexRelationGetNumberOfKeyAttributes(rel),
+				 BTreeTupleGetHeapTID(lefthikey) != NULL ? " (heap TID added back)" : "",
+				 newstr);
+		}
+#endif
+	}
 
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
 	Assert(BTreeTupleGetNAtts(lefthikey, rel) <=
-- 
2.17.1

v6-0002-Weigh-suffix-truncation-when-choosing-a-split-poi.patchapplication/x-patch; name=v6-0002-Weigh-suffix-truncation-when-choosing-a-split-poi.patchDownload
From 0b55832169d2939345c40f26f193f977b399e181 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 15:51:53 -0700
Subject: [PATCH v6 2/6] Weigh suffix truncation when choosing a split point.

Add infrastructure to determine where the earliest difference appears
among a pair of tuples enclosing a candidate split point.  Use this
within _bt_findsplitloc() to weigh how effective suffix truncation will
be.  This is primarily useful because it maximizes the effectiveness of
suffix truncation.  This should not noticeably affect the balance of
free space within each half of the split.

_bt_findsplitloc() is also taught to care about the case where there are
many duplicates, making it hard to find a distinguishing split point.
_bt_findsplitloc() may even conclude that it isn't possible to avoid
filling a page entirely with duplicates, in which case it packs pages
full of duplicates very tightly.

The number of cycles added is not very noticeable, which is important,
since _bt_findsplitloc() is run while an exclusive (leaf page) buffer
lock is held.  We avoid using authoritative insertion scankey
comparisons, unlike suffix truncation proper.

This patch is required to credibly assess anything about the performance
of the patch series.  Applying the patches up to and including this
patch in the series is sufficient to see much better space utilization
and space reuse with cases where many duplicates are inserted.  (Cases
resulting in searches for free space among many pages full of
duplicates, where the search inevitably "gets tired" on the master
branch [1]).

[1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
---
 src/backend/access/nbtree/README      |  64 ++-
 src/backend/access/nbtree/nbtinsert.c | 599 +++++++++++++++++++++++---
 src/backend/access/nbtree/nbtutils.c  |  78 ++++
 src/include/access/nbtree.h           |   8 +-
 4 files changed, 682 insertions(+), 67 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 48864910b4..8bb3bf5de8 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -159,9 +159,9 @@ Lehman and Yao assume fixed-size keys, but we must deal with
 variable-size keys.  Therefore there is not a fixed maximum number of
 keys per page; we just stuff in as many as will fit.  When we split a
 page, we try to equalize the number of bytes, not items, assigned to
-each of the resulting pages.  Note we must include the incoming item in
-this calculation, otherwise it is possible to find that the incoming
-item doesn't fit on the split page where it needs to go!
+pages (though suffix truncation is also considered).  Note we must include
+the incoming item in this calculation, otherwise it is possible to find
+that the incoming item doesn't fit on the split page where it needs to go!
 
 The Deletion Algorithm
 ----------------------
@@ -670,6 +670,64 @@ variable-length types, such as text.  An opclass support function could
 manufacture the shortest possible key value that still correctly separates
 each half of a leaf page split.
 
+There are sophisticated criteria for choosing a leaf page split point.  The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split.  The
+choice of split point can be thought of as a choice among points "between"
+items on the page to be split, at least if you pretend that the incoming
+tuple was placed on the page already, without provoking a split.  The split
+point between two index tuples with differences that appear as early as
+possible allows us to truncate away as many attributes as possible.
+Obviously suffix truncation is valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective.  There are cases where suffix truncation can
+leave a B-Tree significantly smaller in size than it would have otherwise
+been, without actually making any pivot tuple smaller because of
+restrictions to preserve alignment.  Truncation prevents internal nodes
+from being more restrictive than truly necessary in how they describe which
+values belong on which leaf pages.  Average space utilization on the leaf
+level can be improved.  The number of internal pages is still reduced, but
+only as an indirect consequence.
+
+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page.  The split point that implies a downlink be inserted in
+the parent that's the smallest one available within an acceptable range of
+the optimal fillfactor-wise split point is chosen.  This idea also comes
+from the Prefix B-Tree paper.  This process has much in common with what
+happens at the leaf level to make suffix truncation effective.  The overall
+effect is that suffix truncation tends to produce smaller and less
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, less discriminating
+pivot tuples end up in the root page, delaying root page splits.
+
+Note that suffix truncation occasionally makes a pivot tuple larger than
+the leaf tuple that it's based on, since a heap TID must be appended when
+nothing else would distinguish each side of a leaf split.  This is
+represented differently in pivot tuples, for historic reasons.  Every tuple
+at the leaf level must be individually locatable by an insertion scankey
+that's fully filled-out by _bt_mkscankey(), so there is no way to
+completely avoid this.  As of Postgres v12, heap TID is treated as a
+tie-breaker key attribute to make this work. Adding a heap TID attribute
+during a leaf page split should only occur when there is an entire page
+full of duplicates, though, since the logic for selecting a split point
+will do all it can to avoid this outcome.
+
+Avoiding appending a heap TID to a pivot tuple is about much more than just
+saving a single MAXALIGN() quantum.  It's worth going out of our way
+to avoid having a single value (or composition of key values) span two leaf
+pages when that isn't truly necessary, since if that's allowed to happen
+every point index scan will have to visit both pages.  It also makes it
+less likely that VACUUM will be able to perform page deletion on either
+page.  This is a kind of "false sharing".  Many duplicates mode is the
+mechanism that goes to great lengths to avoid appending a heap TID.  It
+may lead to slightly inferior space utilization in extreme cases, when
+values are spaced apart at fixed intervals, even on levels above the leaf
+level.  This is considered an acceptable price to avoid groups of
+duplicates that straddle two leaf pages. A point lookup will only visit one
+leaf page, not two, since the high key on the first page will indicate that
+all duplicate values must be located on the first leaf page.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 825be932b5..a8332db7de 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,25 +28,40 @@
 
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
+#define STACK_SPLIT_POINTS			15
+
+typedef enum
+{
+	/* strategy to use for a call to FindSplitData */
+	SPLIT_DEFAULT,				/* give some weight to truncation */
+	SPLIT_MANY_DUPLICATES,		/* find minimally distinguishing point */
+	SPLIT_SINGLE_VALUE			/* leave left page almost empty */
+} SplitMode;
+
+typedef struct
+{
+	/* FindSplitData candidate split */
+	int			delta;			/* size delta */
+	bool		newitemonleft;	/* new item on left or right of split */
+	OffsetNumber firstright;	/* split point */
+} SplitPoint;
 
 typedef struct
 {
 	/* context data for _bt_checksplitloc */
 	Size		newitemsz;		/* size of new item to be inserted */
-	int			fillfactor;		/* needed when splitting rightmost page */
+	int			fillfactor;		/* needed for weighted splits */
 	bool		is_leaf;		/* T if splitting a leaf page */
-	bool		is_rightmost;	/* T if splitting a rightmost page */
+	bool		is_weighted;	/* T if weighted (e.g. rightmost) split */
 	OffsetNumber newitemoff;	/* where the new item is to be inserted */
+	bool		hikeyheaptid;	/* T if high key will likely get heap TID */
 	int			leftspace;		/* space available for items on left page */
 	int			rightspace;		/* space available for items on right page */
 	int			olddataitemstotal;	/* space taken by old items */
 
-	bool		have_split;		/* found a valid split? */
-
-	/* these fields valid only if have_split is true */
-	bool		newitemonleft;	/* new item on left or right of best split */
-	OffsetNumber firstright;	/* best split point */
-	int			best_delta;		/* best size delta so far */
+	int			maxsplits;		/* Maximum number of splits */
+	int			nsplits;		/* Current number of splits */
+	SplitPoint *splits;			/* Sorted by delta */
 } FindSplitData;
 
 
@@ -74,12 +89,22 @@ static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft);
-static void _bt_checksplitloc(FindSplitData *state,
+				 SplitMode mode, OffsetNumber newitemoff,
+				 Size newitemsz, IndexTuple newitem, bool *newitemonleft);
+static int _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright, bool newitemonleft,
 				  int dataitemstoleft, Size firstoldonrightsz);
+static OffsetNumber _bt_bestsplitloc(Relation rel, Page page,
+				 FindSplitData *state,
+				 int perfectpenalty,
+				 OffsetNumber newitemoff,
+				 IndexTuple newitem, bool *newitemonleft);
+static int _bt_perfect_penalty(Relation rel, Page page, bool is_leaf,
+					SplitMode mode, OffsetNumber newitemoff,
+					IndexTuple newitem, int nsplits,
+					SplitPoint *splits, SplitMode *secondmode);
+static int _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split, bool is_leaf);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey,
@@ -852,8 +877,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* Choose the split point */
-		firstright = _bt_findsplitloc(rel, page,
-									  newitemoff, itemsz,
+		firstright = _bt_findsplitloc(rel, page, SPLIT_DEFAULT,
+									  newitemoff, itemsz, itup,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
@@ -1527,6 +1552,25 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * for it, we might find ourselves with too little room on the page that
  * it needs to go into!)
  *
+ * We also give some weight to suffix truncation in deciding a split point
+ * on leaf pages.  We try to select a point where a distinguishing attribute
+ * appears earlier in the new high key for the left side of the split, in
+ * order to maximize the number of trailing attributes that can be truncated
+ * away.  Generally speaking, only candidate split points that fall within
+ * an acceptable space utilization range are considered.  This is even
+ * useful with pages that only have a single (non-TID) attribute, since it's
+ * helpful to avoid appending an explicit heap TID attribute to the new pivot
+ * tuple (high key/downlink) when it cannot actually be truncated.
+ *
+ * We do all we can to avoid having to append a heap TID in the new high
+ * key.  We may have to call ourselves recursively in many duplicates mode.
+ * This happens when a heap TID would otherwise be appended, but the page
+ * isn't completely full of logical duplicates (there may be as few as two
+ * distinct values).  Many duplicates mode has no hard requirements for
+ * space utilization, though it still keeps the use of space balanced as a
+ * non-binding secondary goal.  This significantly improves fan-out in
+ * practice, at least with most affected workloads.
+ *
  * If the page is the rightmost page on its level, we instead try to arrange
  * to leave the left split page fillfactor% full.  In this way, when we are
  * inserting successively increasing keys (consider sequences, timestamps,
@@ -1535,6 +1579,18 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * This is the same as nbtsort.c produces for a newly-created tree.  Note
  * that leaf and nonleaf pages use different fillfactors.
  *
+ * If called recursively in single value mode, we also try to arrange to
+ * leave the left split page fillfactor% full, though we arrange to use a
+ * fillfactor that leaves the left page mostly empty and the right page
+ * mostly full, rather than the other way around.  This greatly helps with
+ * space management in cases where tuples with the same attribute values
+ * span multiple pages.  Newly inserted duplicates will tend to have higher
+ * heap TID values, so we'll end up splitting the same page again and again
+ * as even more duplicates are inserted.  (The heap TID attribute has
+ * descending sort order, so ascending heap TID values continually split the
+ * same low page).  See nbtree/README for more information about suffix
+ * truncation, and how a split point is chosen.
+ *
  * We are passed the intended insert position of the new tuple, expressed as
  * the offsetnumber of the tuple it must go in front of.  (This could be
  * maxoff+1 if the tuple is to go at the end.)
@@ -1547,8 +1603,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 static OffsetNumber
 _bt_findsplitloc(Relation rel,
 				 Page page,
+				 SplitMode mode,
 				 OffsetNumber newitemoff,
 				 Size newitemsz,
+				 IndexTuple newitem,
 				 bool *newitemonleft)
 {
 	BTPageOpaque opaque;
@@ -1560,13 +1618,15 @@ _bt_findsplitloc(Relation rel,
 				rightspace,
 				goodenough,
 				olddataitemstotal,
-				olddataitemstoleft;
+				olddataitemstoleft,
+				perfectpenalty;
 	bool		goodenoughfound;
+	SplitPoint	splits[STACK_SPLIT_POINTS];
+	SplitMode	secondmode;
+	OffsetNumber finalfirstright;
 
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
-	newitemsz += sizeof(ItemIdData);
+	maxoff = PageGetMaxOffsetNumber(page);
 
 	/* Total free space available on a btree page, after fixed overhead */
 	leftspace = rightspace =
@@ -1584,18 +1644,44 @@ _bt_findsplitloc(Relation rel,
 	/* Count up total space in data items without actually scanning 'em */
 	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
 
-	state.newitemsz = newitemsz;
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	state.newitemsz = newitemsz + sizeof(ItemIdData);
+	state.hikeyheaptid = (mode == SPLIT_SINGLE_VALUE);
 	state.is_leaf = P_ISLEAF(opaque);
-	state.is_rightmost = P_RIGHTMOST(opaque);
-	state.have_split = false;
+	state.is_weighted = P_RIGHTMOST(opaque);
 	if (state.is_leaf)
-		state.fillfactor = RelationGetFillFactor(rel,
-												 BTREE_DEFAULT_FILLFACTOR);
+	{
+		if (mode != SPLIT_SINGLE_VALUE)
+		{
+			/* Only used on rightmost page */
+			state.fillfactor = RelationGetFillFactor(rel,
+													 BTREE_DEFAULT_FILLFACTOR);
+		}
+		else
+		{
+			state.fillfactor = BTREE_SINGLEVAL_FILLFACTOR;
+			state.is_weighted = true;
+		}
+	}
 	else
+	{
+		Assert(mode == SPLIT_DEFAULT);
+		/* Only used on rightmost page */
 		state.fillfactor = BTREE_NONLEAF_FILLFACTOR;
-	state.newitemonleft = false;	/* these just to keep compiler quiet */
-	state.firstright = 0;
-	state.best_delta = 0;
+	}
+
+	if (mode == SPLIT_DEFAULT)
+		state.maxsplits = Min(Max(1, maxoff / 16), STACK_SPLIT_POINTS);
+	else if (mode == SPLIT_MANY_DUPLICATES)
+		state.maxsplits = maxoff + 2;
+	else
+		state.maxsplits = 1;
+	state.nsplits = 0;
+	if (mode != SPLIT_MANY_DUPLICATES)
+		state.splits = splits;
+	else
+		state.splits = palloc(sizeof(SplitPoint) * state.maxsplits);
+
 	state.leftspace = leftspace;
 	state.rightspace = rightspace;
 	state.olddataitemstotal = olddataitemstotal;
@@ -1604,11 +1690,13 @@ _bt_findsplitloc(Relation rel,
 	/*
 	 * Finding the best possible split would require checking all the possible
 	 * split points, because of the high-key and left-key special cases.
-	 * That's probably more work than it's worth; instead, stop as soon as we
-	 * find a "good-enough" split, where good-enough is defined as an
-	 * imbalance in free space of no more than pagesize/16 (arbitrary...) This
-	 * should let us stop near the middle on most pages, instead of plowing to
-	 * the end.
+	 * That's probably more work than it's worth in default mode; instead,
+	 * stop once we've collected the full run of "good-enough" split points,
+	 * where good-enough is defined as an imbalance in free space of no more
+	 * than pagesize/16 (arbitrary...).  This should let us stop near the
+	 * middle on most pages, instead of plowing to the end.  Many duplicates
+	 * mode does consider all choices, while single value mode stops as soon
+	 * as it finds a good-enough split point.
 	 */
 	goodenough = leftspace / 16;
 
@@ -1618,13 +1706,13 @@ _bt_findsplitloc(Relation rel,
 	 */
 	olddataitemstoleft = 0;
 	goodenoughfound = false;
-	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (offnum = P_FIRSTDATAKEY(opaque);
 		 offnum <= maxoff;
 		 offnum = OffsetNumberNext(offnum))
 	{
 		Size		itemsz;
+		int			delta;
 
 		itemid = PageGetItemId(page, offnum);
 		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
@@ -1633,27 +1721,41 @@ _bt_findsplitloc(Relation rel,
 		 * Will the new item go to left or right of split?
 		 */
 		if (offnum > newitemoff)
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
+			delta = _bt_checksplitloc(&state, offnum, true,
+									  olddataitemstoleft, itemsz);
 
 		else if (offnum < newitemoff)
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
+			delta = _bt_checksplitloc(&state, offnum, false,
+									  olddataitemstoleft, itemsz);
 		else
 		{
 			/* need to try it both ways! */
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
+			(void) _bt_checksplitloc(&state, offnum, true,
+									 olddataitemstoleft, itemsz);
 
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
+			delta = _bt_checksplitloc(&state, offnum, false,
+									  olddataitemstoleft, itemsz);
 		}
 
-		/* Abort scan once we find a good-enough choice */
-		if (state.have_split && state.best_delta <= goodenough)
-		{
+		/* Record when good-enough choice found */
+		if (state.nsplits > 0 && state.splits[0].delta <= goodenough)
 			goodenoughfound = true;
-			break;
+
+		/*
+		 * Abort the default mode scan once we've found a good-enough choice
+		 * and have reached the point where no new good-enough choices are
+		 * being found.  Might as well abort as soon as a good-enough split
+		 * point is found in single value mode, where we won't discriminate
+		 * among candidate split points based on their penalty.
+		 */
+		if (goodenoughfound)
+		{
+			if (mode == SPLIT_DEFAULT && delta > goodenough)
+				break;
+			else if (mode == SPLIT_SINGLE_VALUE)
+				break;
+
+			/* Many duplicates mode must be exhaustive */
 		}
 
 		olddataitemstoleft += itemsz;
@@ -1664,19 +1766,52 @@ _bt_findsplitloc(Relation rel,
 	 * the old items go to the left page and the new item goes to the right
 	 * page.
 	 */
-	if (newitemoff > maxoff && !goodenoughfound)
+	if (newitemoff > maxoff &&
+		(!goodenoughfound || mode == SPLIT_MANY_DUPLICATES))
 		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
 
 	/*
 	 * I believe it is not possible to fail to find a feasible split, but just
 	 * in case ...
 	 */
-	if (!state.have_split)
+	if (state.nsplits == 0)
 		elog(ERROR, "could not find a feasible split point for index \"%s\"",
 			 RelationGetRelationName(rel));
 
-	*newitemonleft = state.newitemonleft;
-	return state.firstright;
+	/*
+	 * Search among acceptable split points for the entry with the lowest
+	 * penalty.  See _bt_split_penalty() for the definition of penalty.  The
+	 * goal here is to increase fan-out, by choosing a split point which is
+	 * amenable to being made smaller by suffix truncation, or is already
+	 * small.
+	 *
+	 * First find lowest possible penalty among acceptable split points -- the
+	 * "perfect" penalty.  This will be passed to _bt_bestsplitloc() if it
+	 * determines that candidate split points are good enough to finish
+	 * default mode split.  Perfect penalty saves _bt_bestsplitloc()
+	 * additional work around calculating penalties.
+	 */
+	perfectpenalty = _bt_perfect_penalty(rel, page, state.is_leaf, mode,
+										 newitemoff, newitem,
+										 state.nsplits, state.splits,
+										 &secondmode);
+
+	/* Start second pass over page if _bt_perfect_penalty() told us to */
+	if (secondmode != SPLIT_DEFAULT)
+		return _bt_findsplitloc(rel, page, secondmode, newitemoff, newitemsz,
+								newitem, newitemonleft);
+
+	/*
+	 * Search among acceptable split points for the entry that has the lowest
+	 * penalty, and thus maximizes fan-out.
+	 */
+	finalfirstright = _bt_bestsplitloc(rel, page, &state, perfectpenalty,
+									   newitemoff, newitem, newitemonleft);
+	/* Be tidy */
+	if (state.splits != splits)
+		pfree(state.splits);
+
+	return finalfirstright;
 }
 
 /*
@@ -1691,8 +1826,11 @@ _bt_findsplitloc(Relation rel,
  *
  * olddataitemstoleft is the total size of all old items to the left of
  * firstoldonright.
+ *
+ * Returns delta between space that will be left free on left and right side
+ * of split.
  */
-static void
+static int
 _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright,
 				  bool newitemonleft,
@@ -1700,7 +1838,8 @@ _bt_checksplitloc(FindSplitData *state,
 				  Size firstoldonrightsz)
 {
 	int			leftfree,
-				rightfree;
+				rightfree,
+				leftleafheaptidsz;
 	Size		firstrightitemsz;
 	bool		newitemisfirstonright;
 
@@ -1720,15 +1859,38 @@ _bt_checksplitloc(FindSplitData *state,
 
 	/*
 	 * The first item on the right page becomes the high key of the left page;
-	 * therefore it counts against left space as well as right space. When
+	 * therefore it counts against left space as well as right space (we
+	 * cannot assume that suffix truncation will make it any smaller).  When
 	 * index has included attributes, then those attributes of left page high
 	 * key will be truncated leaving that page with slightly more free space.
 	 * However, that shouldn't affect our ability to find valid split
-	 * location, because anyway split location should exists even without high
-	 * key truncation.
+	 * location, since we err in the direction of being pessimistic about free
+	 * space on the left half.  Besides, even when suffix truncation of
+	 * non-TID attributes occurs, there often won't be an entire MAXALIGN()
+	 * quantum in pivot space savings.
 	 */
 	leftfree -= firstrightitemsz;
 
+	/*
+	 * Assume that suffix truncation cannot avoid adding a heap TID to the
+	 * left half's new high key when splitting at the leaf level.  Don't let
+	 * this impact the balance of free space in the common case where adding a
+	 * heap TID is considered very unlikely, though, since there is no reason
+	 * to accept a likely-suboptimal split.
+	 *
+	 * When adding a heap TID seems likely, actually factor that into the
+	 * delta calculation, rather than just treating it as a constraint on
+	 * whether or not a split is acceptable.
+	 */
+	leftleafheaptidsz = 0;
+	if (state->is_leaf)
+	{
+		if (!state->hikeyheaptid)
+			leftleafheaptidsz = sizeof(ItemPointerData);
+		else
+			leftfree -= (int) sizeof(ItemPointerData);
+	}
+
 	/* account for the new item */
 	if (newitemonleft)
 		leftfree -= (int) state->newitemsz;
@@ -1744,17 +1906,20 @@ _bt_checksplitloc(FindSplitData *state,
 			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
 
 	/*
-	 * If feasible split point, remember best delta.
+	 * If this is a feasible split point, remember it, provided that we
+	 * haven't yet run out of space for split points, or its delta is lower
+	 * than that of the most marginal split point recorded so far.
 	 */
-	if (leftfree >= 0 && rightfree >= 0)
+	if (leftfree - leftleafheaptidsz >= 0 && rightfree >= 0)
 	{
 		int			delta;
 
-		if (state->is_rightmost)
+		if (state->is_weighted)
 		{
 			/*
-			 * If splitting a rightmost page, try to put (100-fillfactor)% of
-			 * free space on left page. See comments for _bt_findsplitloc.
+			 * If splitting a rightmost page, or in single value mode, try to
+			 * put (100-fillfactor)% of free space on left page. See comments
+			 * for _bt_findsplitloc.
 			 */
 			delta = (state->fillfactor * leftfree)
 				- ((100 - state->fillfactor) * rightfree);
@@ -1767,14 +1932,322 @@ _bt_checksplitloc(FindSplitData *state,
 
 		if (delta < 0)
 			delta = -delta;
-		if (!state->have_split || delta < state->best_delta)
+		if (state->nsplits < state->maxsplits ||
+			delta < state->splits[state->nsplits - 1].delta)
 		{
-			state->have_split = true;
-			state->newitemonleft = newitemonleft;
-			state->firstright = firstoldonright;
-			state->best_delta = delta;
+			SplitPoint	newsplit;
+			int			j;
+
+			newsplit.delta = delta;
+			newsplit.newitemonleft = newitemonleft;
+			newsplit.firstright = firstoldonright;
+
+			/*
+			 * Make space at the end of the state array for new candidate
+			 * split point if we haven't already reached the maximum number of
+			 * split points.
+			 */
+			if (state->nsplits < state->maxsplits)
+				state->nsplits++;
+
+			/*
+			 * Replace the final item among the first nsplits entries of the
+			 * array.  The final item is either a still-uninitialized (garbage)
+			 * entry, or the most marginal real entry when we already have as
+			 * many split points as we're willing to consider.
+			 */
+			for (j = state->nsplits - 1;
+				 j > 0 && state->splits[j - 1].delta > newsplit.delta;
+				 j--)
+			{
+				state->splits[j] = state->splits[j - 1];
+			}
+			state->splits[j] = newsplit;
+		}
+
+		return delta;
+	}
+
+	return INT_MAX;
+}
+
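
To illustrate the bookkeeping in _bt_checksplitloc() above: candidate split
points are kept in a small array sorted by delta, capped at maxsplits
entries, and the most marginal entry gets dropped once the cap is reached.
Here is a minimal standalone sketch of that idea (SplitCand, MAX_CANDS and
record_candidate are names made up for the example, not taken from the
patch):

#include <stdio.h>

#define MAX_CANDS 4             /* cap, analogous to state->maxsplits */

typedef struct SplitCand
{
    int         delta;          /* free-space imbalance of this split */
    int         firstright;     /* offset of first tuple on right half */
} SplitCand;

/*
 * Insert a candidate into a delta-sorted array, keeping at most MAX_CANDS
 * entries.  The most marginal (highest delta) entry is dropped when full.
 */
static void
record_candidate(SplitCand *cands, int *ncands, SplitCand newc)
{
    int         j;

    if (*ncands < MAX_CANDS)
        (*ncands)++;            /* grow into the uninitialized final slot */
    else if (newc.delta >= cands[*ncands - 1].delta)
        return;                 /* no better than the worst kept entry */

    /* shift higher-delta entries right, then place the new candidate */
    for (j = *ncands - 1; j > 0 && cands[j - 1].delta > newc.delta; j--)
        cands[j] = cands[j - 1];
    cands[j] = newc;
}

int
main(void)
{
    SplitCand   cands[MAX_CANDS];
    int         ncands = 0;
    int         deltas[] = {90, 10, 50, 30, 70, 20};

    for (int i = 0; i < 6; i++)
    {
        SplitCand   c = {deltas[i], i + 1};

        record_candidate(cands, &ncands, c);
    }
    for (int i = 0; i < ncands; i++)
        printf("delta=%d firstright=%d\n", cands[i].delta, cands[i].firstright);
    return 0;                   /* prints the 4 lowest deltas, in order */
}
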
+/*
+ * Subroutine to find the "best" split point among an array of acceptable
+ * candidate split points that split without there being an excessively high
+ * delta between the space left free on the left and right halves.  The "best"
+ * split point is the split point with the lowest penalty, which is an
+ * abstract idea whose definition varies depending on whether we're splitting
+ * at the leaf level, or an internal level.  See _bt_split_penalty() for the
+ * definition.
+ *
+ * "perfectpenalty" is assumed to be the lowest possible penalty among
+ * candidate split points.  This allows us to return early without wasting
+ * cycles on calculating the first differing attribute for all candidate
+ * splits when that clearly cannot improve our choice.  This optimization is
+ * important for several common cases, including insertion into a primary key
+ * index on an auto-incremented or monotonically increasing integer column.
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating if new item is on left of split
+ * point.
+ */
+static OffsetNumber
+_bt_bestsplitloc(Relation rel,
+				 Page page,
+				 FindSplitData *state,
+				 int perfectpenalty,
+				 OffsetNumber newitemoff,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	int			bestpenalty,
+				lowsplit;
+
+	/* No point calculating penalties in trivial cases */
+	if (perfectpenalty == INT_MAX || state->nsplits == 1)
+	{
+		*newitemonleft = state->splits[0].newitemonleft;
+		return state->splits[0].firstright;
+	}
+
+	/*
+	 * Now actually search among acceptable split points for the entry that
+	 * allows suffix truncation to truncate away the maximum possible number
+	 * of attributes.
+	 */
+	bestpenalty = INT_MAX;
+	lowsplit = 0;
+	for (int i = 0; i < state->nsplits; i++)
+	{
+		int			penalty;
+
+		penalty = _bt_split_penalty(rel, page, newitemoff, newitem,
+									state->splits + i, state->is_leaf);
+
+		if (penalty <= perfectpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+			break;
+		}
+
+		if (penalty < bestpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
 		}
 	}
+
+	*newitemonleft = state->splits[lowsplit].newitemonleft;
+	return state->splits[lowsplit].firstright;
+}
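
The search in _bt_bestsplitloc() above amounts to an argmin over candidate
penalties, with an early exit once the precomputed "perfect" penalty is
reached.  A standalone sketch of just that loop, with penalties reduced to
plain integers (pick_best is a made-up name):

#include <limits.h>
#include <stdio.h>

/*
 * Pick the index of the lowest-penalty candidate, stopping early once a
 * candidate reaches the precomputed lowest possible ("perfect") penalty.
 */
static int
pick_best(const int *penalties, int ncands, int perfectpenalty)
{
    int         best = INT_MAX;
    int         bestidx = 0;

    for (int i = 0; i < ncands; i++)
    {
        if (penalties[i] < best)
        {
            best = penalties[i];
            bestidx = i;
        }
        if (penalties[i] <= perfectpenalty)
            break;              /* can't do any better; stop early */
    }
    return bestidx;
}

int
main(void)
{
    int         penalties[] = {3, 2, 1, 2, 3};

    /* a perfect penalty of 1 lets the search stop at index 2 */
    printf("%d\n", pick_best(penalties, 5, 1));
    return 0;
}
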
+
+/*
+ * Subroutine to find the lowest possible penalty for any acceptable candidate
+ * split point.  This may be lower than any real penalty for any of the
+ * candidate split points, in which case the optimization is ineffective.
+ * Split penalties are generally discrete rather than continuous, so an
+ * actually-obtainable penalty is the common case.
+ *
+ * This is also a convenient point to decide to either finish splitting
+ * the page using the default strategy, or, alternatively, to do a second pass
+ * over page using a different strategy.
+ */
+static int
+_bt_perfect_penalty(Relation rel, Page page, bool is_leaf, SplitMode mode,
+					OffsetNumber newitemoff, IndexTuple newitem,
+					int nsplits, SplitPoint *splits, SplitMode *secondmode)
+{
+	ItemId		itemid;
+	OffsetNumber center;
+	IndexTuple	leftmost,
+				rightmost;
+	int			perfectpenalty;
+
+	/* Assume that a second pass over page won't be required for now */
+	*secondmode = SPLIT_DEFAULT;
+
+	/*
+	 * During a many duplicates pass over page, we settle for a "perfect"
+	 * split point that merely avoids appending a heap TID in new pivot.
+	 * Appending a heap TID is harmful enough to fan-out that it's worth
+	 * avoiding at all costs, but it doesn't make sense to go to those lengths
+	 * to also be able to truncate an extra, earlier attribute.
+	 */
+	if (!is_leaf)
+		return MAXALIGN(sizeof(IndexTupleData) + 1);
+	else if (mode == SPLIT_MANY_DUPLICATES)
+		return IndexRelationGetNumberOfKeyAttributes(rel);
+	else if (mode == SPLIT_SINGLE_VALUE)
+		return INT_MAX;
+
+	/*
+	 * Complicated though common case -- leaf page default mode split.
+	 *
+	 * Iterate from the end of split array to the start, in search of the
+	 * firstright-wise leftmost and rightmost entries among acceptable split
+	 * points.  The split point with the lowest delta is at the start of the
+	 * array.  It is deemed to be the split point whose firstright offset is
+	 * at the center.  Split points with firstright offsets at both the left
+	 * and right extremes among acceptable split points will be found at the
+	 * end of caller's array.
+	 */
+	leftmost = NULL;
+	rightmost = NULL;
+	center = splits[0].firstright;
+
+	/*
+	 * Split points can be thought of as points _between_ tuples on the
+	 * original unsplit page image, at least if you pretend that the incoming
+	 * tuple is already on the page to be split (imagine that the original
+	 * unsplit page actually had enough space to fit the incoming tuple).  The
+	 * rightmost tuple is the tuple that is immediately to the right of a
+	 * split point that is itself rightmost.  Likewise, the leftmost tuple is
+	 * the tuple to the left of the leftmost split point.  This is slightly
+	 * arbitrary.
+	 *
+	 * When there are very few candidates, no sensible comparison can be made
+	 * here, resulting in caller selecting lowest delta/the center split point
+	 * by default.  No great care is taken around boundary cases where the
+	 * center split point has the same firstright offset as either the
+	 * leftmost or rightmost split points (i.e. only newitemonleft differs).
+	 * We expect to find leftmost and rightmost tuples almost immediately.
+	 */
+	perfectpenalty = IndexRelationGetNumberOfKeyAttributes(rel);
+	for (int j = nsplits - 1; j > 1; j--)
+	{
+		SplitPoint *split = splits + j;
+
+		if (!leftmost && split->firstright <= center)
+		{
+			if (split->newitemonleft && newitemoff == split->firstright)
+				leftmost = newitem;
+			else
+			{
+				itemid = PageGetItemId(page,
+									   OffsetNumberPrev(split->firstright));
+				leftmost = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+
+		if (!rightmost && split->firstright >= center)
+		{
+			if (!split->newitemonleft && newitemoff == split->firstright)
+				rightmost = newitem;
+			else
+			{
+				itemid = PageGetItemId(page, split->firstright);
+				rightmost = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+
+		if (leftmost && rightmost)
+		{
+			Assert(leftmost != rightmost);
+			perfectpenalty = _bt_leave_natts_fast(rel, leftmost, rightmost);
+			break;
+		}
+	}
+
+	/* Work out which type of second pass will be performed, if any */
+	if (perfectpenalty > IndexRelationGetNumberOfKeyAttributes(rel))
+	{
+		BTPageOpaque opaque;
+		OffsetNumber maxoff;
+		int			outerpenalty;
+
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		if (P_FIRSTDATAKEY(opaque) == newitemoff)
+			leftmost = newitem;
+		else
+		{
+			itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+			leftmost = (IndexTuple) PageGetItem(page, itemid);
+		}
+
+		if (newitemoff > maxoff)
+			rightmost = newitem;
+		else
+		{
+			itemid = PageGetItemId(page, maxoff);
+			rightmost = (IndexTuple) PageGetItem(page, itemid);
+		}
+
+		Assert(leftmost != rightmost);
+		outerpenalty = _bt_leave_natts_fast(rel, leftmost, rightmost);
+
+		/*
+		 * If page has many duplicates but is not entirely full of duplicates,
+		 * a many duplicates mode pass will be performed.  If page is entirely
+		 * full of duplicates, a single value mode pass will be performed.
+		 *
+		 * Caller should avoid a single value mode pass when incoming tuple
+		 * doesn't sort lowest among items on the page, though.  Instead, we
+		 * instruct caller to continue with original default mode split, since
+		 * an out-of-order new item suggests that newer tuples have come from
+		 * (non-HOT) updates, not inserts.  Evenly sharing space among each
+		 * half of the split avoids pathological performance.
+		 */
+		if (outerpenalty > IndexRelationGetNumberOfKeyAttributes(rel))
+		{
+			if (P_FIRSTDATAKEY(opaque) == newitemoff)
+				*secondmode = SPLIT_SINGLE_VALUE;
+			else
+			{
+				perfectpenalty = INT_MAX;
+				*secondmode = SPLIT_DEFAULT;
+			}
+		}
+		else
+			*secondmode = SPLIT_MANY_DUPLICATES;
+	}
+
+	return perfectpenalty;
+}
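
The second-pass decision at the end of _bt_perfect_penalty() comes down to a
small case analysis on two penalties: one taken over the acceptable candidate
split points, and one taken over the extremes of the whole page.  A
standalone sketch of that decision, assuming both penalties have already been
computed (SplitStrategy and choose_second_pass are invented names that only
loosely mirror the patch's SplitMode):

#include <stdbool.h>
#include <stdio.h>

typedef enum SplitStrategy
{
    STRAT_DEFAULT,
    STRAT_MANY_DUPLICATES,
    STRAT_SINGLE_VALUE
} SplitStrategy;

/*
 * Decide whether a second pass over the page is needed.  Penalty values
 * follow the convention used by the patch: a penalty greater than nkeyatts
 * means a heap TID would have to be added to the new pivot tuple.
 */
static SplitStrategy
choose_second_pass(int candidatepenalty, int pagepenalty, int nkeyatts,
                   bool newitemfirst)
{
    if (candidatepenalty <= nkeyatts)
        return STRAT_DEFAULT;   /* candidates are fine; no second pass */
    if (pagepenalty <= nkeyatts)
        return STRAT_MANY_DUPLICATES;   /* page isn't all duplicates */

    /*
     * Page consists entirely of duplicates.  Only use single value mode when
     * the incoming tuple sorts before everything already on the page; an
     * out-of-order new item suggests non-HOT updates, where an even split
     * behaves better.
     */
    return newitemfirst ? STRAT_SINGLE_VALUE : STRAT_DEFAULT;
}

int
main(void)
{
    /* two key attributes in each example */
    printf("%d\n", choose_second_pass(3, 3, 2, true));  /* 2: single value */
    printf("%d\n", choose_second_pass(3, 2, 2, false)); /* 1: many duplicates */
    printf("%d\n", choose_second_pass(1, 1, 2, false)); /* 0: default */
    return 0;
}
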
+
+/*
+ * Subroutine to find penalty for caller's candidate split point.
+ *
+ * On leaf pages, penalty is the attribute number that distinguishes each side
+ * of a split.  It's the last attribute that needs to be included in the new
+ * high key for the left page.  It can be greater than the number of key
+ * attributes in cases where a heap TID needs to be appended during truncation.
+ *
+ * On internal pages, penalty is simply the size of the first item on the
+ * right half of the split (excluding ItemId overhead) which becomes the new
+ * high key for the left page.  Internal page splits always use default mode.
+ */
+static int
+_bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split, bool is_leaf)
+{
+	ItemId		itemid;
+	IndexTuple	lastleft;
+	IndexTuple	firstright;
+
+	if (split->newitemonleft && newitemoff == split->firstright)
+		lastleft = newitem;
+	else
+	{
+		itemid = PageGetItemId(page, OffsetNumberPrev(split->firstright));
+		lastleft = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	if (!split->newitemonleft && newitemoff == split->firstright)
+		firstright = newitem;
+	else
+	{
+		itemid = PageGetItemId(page, split->firstright);
+		firstright = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	if (!is_leaf)
+		return IndexTupleSize(firstright);
+
+	Assert(lastleft != firstright);
+	return _bt_leave_natts_fast(rel, lastleft, firstright);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index fabc676ba3..aeca964716 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -22,6 +22,7 @@
 #include "access/relscan.h"
 #include "miscadmin.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -2297,6 +2298,83 @@ _bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 	return leavenatts;
 }
 
+/*
+ * _bt_leave_natts_fast - fast, approximate variant of _bt_leave_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location.  A naive bitwise approach to datum comparisons is used to
+ * save cycles.  This is inherently approximate, but usually provides the same
+ * answer as the authoritative approach that _bt_leave_natts takes, since the
+ * vast majority of types in Postgres cannot be equal according to any
+ * available opclass unless they're bitwise equal.
+ *
+ * Testing has shown that an approach involving treating the tuple as a
+ * decomposed binary string would work almost as well as the approach taken
+ * here.  It would also be faster.  It might actually be necessary to go that
+ * way in the future, if suffix truncation is made sophisticated enough to
+ * truncate at a finer granularity (i.e. truncate within an attribute, rather
+ * than just truncating away whole attributes).  The current approach isn't
+ * markedly slower, since it works particularly well with the "perfect
+ * penalty" optimization (there are fewer, more expensive calls here).  It
+ * also works with INCLUDE indexes (indexes with non-key attributes) without
+ * any special effort.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+int
+_bt_leave_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			result;
+
+	/*
+	 * Using authoritative comparisons makes no difference in almost all
+	 * cases. However, there are a small number of shipped opclasses where
+	 * there might occasionally be an inconsistency between the answers given
+	 * by this function and _bt_leave_natts().  This includes numeric_ops,
+	 * since display scale might vary among logically equal datums.
+	 * Case-insensitive collations may also be interesting.
+	 *
+	 * This is assumed to be okay, since there is no risk that inequality will
+	 * look like equality.  Suffix truncation may be less effective than it
+	 * could be in these narrow cases, but it should be impossible for caller
+	 * to spuriously perform a second pass to find a split location, where
+	 * evenly splitting the page is given secondary importance.
+	 */
+#ifdef AUTHORITATIVE_COMPARE_TEST
+	return _bt_leave_natts(rel, lastleft, firstright);
+#endif
+
+	result = 1;
+	for (int attnum = 1; attnum <= keysz; attnum++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		att = TupleDescAttr(itupdesc, attnum - 1);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			break;
+
+		result++;
+	}
+
+	return result;
+}
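
Stripped of IndexTuple decoding, the return convention here is simply "1 +
the number of leading attributes that compare as bitwise equal", which can be
one more than the number of key attributes.  A toy standalone version over
fixed-size integer attributes (toy_leave_natts is a made-up name):

#include <stdio.h>
#include <string.h>

#define NKEYATTS 3

/*
 * Return the 1-based number of the first attribute that distinguishes the
 * two keys: 1 + the count of leading bitwise-equal attributes.  Returns
 * NKEYATTS + 1 when every attribute is equal, meaning a heap TID would be
 * needed as a tie-breaker in the new pivot tuple.
 */
static int
toy_leave_natts(const int *lastleft, const int *firstright)
{
    int         result = 1;

    for (int attnum = 1; attnum <= NKEYATTS; attnum++)
    {
        if (memcmp(&lastleft[attnum - 1], &firstright[attnum - 1],
                   sizeof(int)) != 0)
            break;
        result++;
    }
    return result;
}

int
main(void)
{
    int         lastleft[NKEYATTS] = {7, 7, 1};
    int         firstright[NKEYATTS] = {7, 8, 1};
    int         alldup[NKEYATTS] = {7, 7, 1};

    printf("%d\n", toy_leave_natts(lastleft, firstright));  /* 2 */
    printf("%d\n", toy_leave_natts(lastleft, alldup));      /* 4: need TID */
    return 0;
}
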
+
 /*
  *  _bt_check_natts() -- Verify tuple has expected number of attributes.
  *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index b9d82f6dfc..1e9869b30e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -143,11 +143,15 @@ typedef struct BTMetaPageData
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
  * a rightmost page; when splitting non-rightmost pages we try to
- * divide the data equally.
+ * divide the data equally.  When splitting a page that's entirely
+ * filled with a single value (duplicates), the leaf-page
+ * fillfactor is overridden, and is applied regardless of whether
+ * the page is a rightmost page.
  */
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#define BTREE_SINGLEVAL_FILLFACTOR	1
 
 /*
  *	In general, the btree code tries to localize its knowledge about
@@ -693,6 +697,8 @@ extern bool btproperty(Oid index_oid, int attno,
 		   bool *res, bool *isnull);
 extern IndexTuple _bt_suffix_truncate(Relation rel, IndexTuple lastleft,
 					IndexTuple firstright);
+extern int _bt_leave_natts_fast(Relation rel, IndexTuple lastleft,
+					 IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap, Page page,
 					 Size itemsz);
-- 
2.17.1

Attachment: v6-0001-Make-nbtree-indexes-have-unique-keys-in-tuples.patch (application/x-patch)
From 6c2a17c499ff3c9cd2792234e17878b5102d8927 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v6 1/6] Make nbtree indexes have unique keys in tuples.

Make nbtree treat all index tuples as having a heap TID trailing
attribute.  Heap TID becomes a first class part of the key space on all
levels of the tree.  Index searches can distinguish duplicates by heap
TID, though for now this is only used by insertions that need to find a
leaf page to insert a tuple on.  This general approach has numerous
benefits for performance, and may enable a later enhancement that has
nbtree vacuuming perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is also introduced.  This will usually truncate away the "extra"
heap TID attribute from pivot tuples during a leaf page split, and may
also truncate away additional user attributes.  This can increase
fan-out when there are multiple indexed attributes, though this is of
secondary importance.  Truncation can only occur at the attribute
granularity, which isn't particularly effective, but works well enough
for now.

We completely remove the logic that allows a search for free space among
multiple pages full of duplicates to "get tired".  This has significant
benefits for free space management in secondary indexes on low
cardinality attributes.  However, without the next commit in the patch
series (without having "single value" mode and "many duplicates" mode
within _bt_findsplitloc()), these cases will be significantly regressed,
since they'll naively perform 50:50 splits without there being any hope
of reusing space left free on the right half of the split.

Note that this revision of the patch series doesn't yet deal with
on-disk compatibility issues/pg_upgrade.  That will follow in a later
version.
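
Conceptually, every key in the index now compares as (user attributes ...,
heap TID), with heap TID acting as a tie-breaker that sorts in descending
order.  A minimal standalone sketch of that comparison rule, using a single
integer attribute (IndexKey and cmp_key are names made up for the example;
the real code compares IndexTuples through opclass support functions):

#include <stdio.h>

/* Simplified model of an index tuple's key: one user attribute plus TID */
typedef struct IndexKey
{
    int         attr;           /* user-visible key attribute */
    unsigned    tidblock;       /* heap TID block number */
    unsigned    tidoffset;      /* heap TID offset number */
} IndexKey;

/*
 * Three-way comparison: user attribute first (ascending), then heap TID as
 * a tie-breaker.  The TID sorts in descending order, so among equal keys
 * the tuple with the higher TID sorts first.
 */
static int
cmp_key(const IndexKey *a, const IndexKey *b)
{
    if (a->attr != b->attr)
        return (a->attr < b->attr) ? -1 : 1;
    if (a->tidblock != b->tidblock)
        return (a->tidblock > b->tidblock) ? -1 : 1;
    if (a->tidoffset != b->tidoffset)
        return (a->tidoffset > b->tidoffset) ? -1 : 1;
    return 0;
}

int
main(void)
{
    IndexKey    olddup = {42, 10, 1};   /* older duplicate, lower TID */
    IndexKey    newdup = {42, 27, 5};   /* newer duplicate, higher TID */

    /* the newer duplicate sorts before the older one: prints -1 */
    printf("%d\n", cmp_key(&newdup, &olddup));
    return 0;
}
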
---
 contrib/amcheck/verify_nbtree.c               | 242 ++++++++---
 contrib/pageinspect/expected/btree.out        |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out  |  10 +-
 src/backend/access/nbtree/README              | 162 +++++---
 src/backend/access/nbtree/nbtinsert.c         | 383 ++++++++----------
 src/backend/access/nbtree/nbtpage.c           |   8 +-
 src/backend/access/nbtree/nbtsearch.c         | 309 ++++++++++----
 src/backend/access/nbtree/nbtsort.c           |  81 ++--
 src/backend/access/nbtree/nbtutils.c          | 332 +++++++++++++--
 src/backend/access/nbtree/nbtxlog.c           |  41 +-
 src/backend/access/rmgrdesc/nbtdesc.c         |   8 -
 src/backend/storage/page/bufpage.c            |   4 +-
 src/backend/utils/sort/tuplesort.c            |  17 +-
 src/include/access/nbtree.h                   | 145 +++++--
 src/include/access/nbtxlog.h                  |  19 +-
 src/test/regress/expected/collate.out         |   3 +-
 src/test/regress/expected/domain.out          |   4 +-
 src/test/regress/expected/foreign_key.out     |   4 +-
 src/test/regress/expected/join.out            |   2 +-
 src/test/regress/expected/truncate.out        |   5 +-
 src/test/regress/expected/typed_table.out     |  11 +-
 src/test/regress/expected/updatable_views.out |  56 +--
 src/test/regress/sql/collate.sql              |   2 +
 src/test/regress/sql/domain.sql               |   2 +
 src/test/regress/sql/foreign_key.sql          |   2 +
 src/test/regress/sql/truncate.sql             |   2 +
 src/test/regress/sql/typed_table.sql          |   2 +
 src/test/regress/sql/updatable_views.sql      |  20 +
 src/tools/pgindent/typedefs.list              |   4 +
 29 files changed, 1249 insertions(+), 633 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index a1438a2855..9afd912341 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -45,6 +45,13 @@ PG_MODULE_MAGIC;
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
 
+/*
+ * Convenience macro to get number of key attributes in tuple in low-context
+ * fashion
+ */
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
+
 /*
  * State associated with verifying a B-Tree index
  *
@@ -125,26 +132,28 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
-static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+static IndexTuple bt_right_page_check_tuple(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+							  BlockNumber childblock);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
 						  bool tupleIsAlive, void *checkstate);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
-static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound);
+static inline bool invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
-					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
-							   OffsetNumber upperbound);
+static inline bool invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							 BTScanInsert key,
+							 Page nontarget,
+							 OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+							IndexTuple itup, bool isleaf);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -834,8 +843,8 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		BTScanInsert skey;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -902,7 +911,14 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
-		/* Build insertion scankey for current page offset */
+		/*
+		 * Build insertion scankey for current page offset/tuple.
+		 *
+		 * As required by _bt_mkscankey(), track number of key attributes,
+		 * which is needed so that _bt_compare() calls handle truncated
+		 * attributes correctly.  Never count non-key attributes in
+		 * non-truncated tuples as key attributes, though.
+		 */
 		skey = _bt_mkscankey(state->rel, itup);
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
@@ -956,11 +972,10 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1017,16 +1032,20 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
-			ScanKey		rightkey;
+			IndexTuple	righttup;
+			BTScanInsert rightkey;
 
 			/* Get item in next/right page */
-			rightkey = bt_right_page_check_scankey(state);
+			righttup = bt_right_page_check_tuple(state);
 
-			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+			/* Set up right item scankey */
+			if (righttup)
+				rightkey = _bt_mkscankey(state->rel, righttup);
+
+			if (righttup && !invariant_g_offset(state, rightkey, max))
 			{
 				/*
-				 * As explained at length in bt_right_page_check_scankey(),
+				 * As explained at length in bt_right_page_check_tuple(),
 				 * there is a known !readonly race that could account for
 				 * apparent violation of invariant, which we must check for
 				 * before actually proceeding with raising error.  Our canary
@@ -1069,7 +1088,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, skey, childblock);
 		}
 	}
 
@@ -1083,9 +1102,9 @@ bt_target_page_check(BtreeCheckState *state)
 }
 
 /*
- * Return a scankey for an item on page to right of current target (or the
+ * Return an index tuple for an item on page to right of current target (or the
  * first non-ignorable page), sufficient to check ordering invariant on last
- * item in current target page.  Returned scankey relies on local memory
+ * item in current target page.  Returned tuple relies on local memory
  * allocated for the child page, which caller cannot pfree().  Caller's memory
  * context should be reset between calls here.
  *
@@ -1098,8 +1117,8 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
-bt_right_page_check_scankey(BtreeCheckState *state)
+static IndexTuple
+bt_right_page_check_tuple(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
@@ -1287,11 +1306,10 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	}
 
 	/*
-	 * Return first real item scankey.  Note that this relies on right page
-	 * memory remaining allocated.
+	 * Return first real item.  Note that this relies on right page memory
+	 * remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	return (IndexTuple) PageGetItem(rightpage, rightitem);
 }
 
 /*
@@ -1304,8 +1322,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  * verification this way around is much more practical.
  */
 static void
-bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1354,7 +1372,8 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the
+	 * page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1404,14 +1423,13 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		/*
 		 * Skip comparison of target page key against "negative infinity"
 		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * bound, but that's only because of the hard-coding for negative
+		 * infinity items within _bt_compare().
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_l_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1751,6 +1769,53 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as
+	 * meaning that they're not participating in a search, not as negative
+	 * infinity (only tuples within the index are treated as negative
+	 * infinity).  Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+
+		/* Get heap TID for item to the right */
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup,
+											   P_ISLEAF(topaque));
+
+		if (uppnkeyatts == key->keysz)
+			return key->scantid == NULL && rheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1759,57 +1824,89 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
+invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	/*
+	 * No need to consider possibility that scankey has attributes that we
+	 * need to force to be interpreted as negative infinity, since scan key
+	 * has to be strictly greater than lower bound offset.
+	 */
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
-	return cmp >= 0;
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e.  the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							 Page nontarget, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
-	return cmp <= 0;
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as
+	 * meaning that they're not participating in a search, not as negative
+	 * infinity (only tuples within the index are treated as negative
+	 * infinity).  Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+		BTPageOpaque copaque;
+
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+
+		/* Get heap TID for item from child/non-target */
+		childheaptid =
+			BTreeTupleGetHeapTIDCareful(state, child, P_ISLEAF(copaque));
+
+		if (uppnkeyatts == key->keysz)
+			return key->scantid == NULL && childheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -1965,3 +2062,32 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
+ * be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ *
+ * Note that it is incorrect to specify the tuple as a non-pivot when passing a
+ * leaf tuple that came from the high key offset, since that is actually a
+ * pivot tuple.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	BlockNumber targetblock = state->targetblock;
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b..07c2dcd771 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d4..9920dbfd40 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89..48864910b4 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -34,30 +34,47 @@ Differences to the Lehman & Yao algorithm
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
+The requirement that all btree keys be unique is satisfied by treating
+heap TID as a tie-breaker attribute.  Logical duplicates are sorted in
+descending item pointer order.  We don't use btree keys to
+disambiguate downlinks from the internal pages during a page split,
+though: only one entry in the parent level will be pointing at the
+page we just split, so the link fields can be used to re-find
+downlinks in the parent via a linear search.
 
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
+Lehman and Yao require that the key range for a subtree S is described
+by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the
+parent page, but do not account for the need to search the tree based
+only on leading index attributes in a composite index.  Since heap TID
+is always used to make btree keys unique (even in unique indexes),
+every btree index is treated as a composite index internally.  A
+search that finds exact equality to a pivot tuple in an upper tree
+level must descend to the left of that key to ensure it finds any
+equal keys, even when scan values were provided for all attributes.
+An insertion that sees that the high key of its target page is equal
+to the key to be inserted cannot move right, since the downlink for
+the right sibling in the parent must always be strictly less than
+right sibling keys (this is always possible because the leftmost
+downlink on any non-leaf level is always a negative infinity
+downlink).
+
+We might be able to avoid moving left in the event of a full match on
+all attributes up to and including the heap TID attribute, but that
+would be a very narrow win, since it's rather unlikely that heap TID
+will be an exact match.  We can avoid moving left unnecessarily when
+all user-visible keys are equal by avoiding exact equality;  a
+sentinel value that's less than any possible heap TID is used by most
+index scans.  This is effective because of suffix truncation.  An
+"extra" heap TID attribute in pivot tuples is almost always avoided.
+All truncated attributes compare as minus infinity, even against a
+sentinel value, and the sentinel value is less than any real TID
+value, so an unnecessary move to the left is avoided regardless of
+whether or not a heap TID is present in the otherwise-equal pivot
+tuple.  Consistently moving left on full equality is also needed by
+page deletion, which re-finds a leaf page by descending the tree while
+searching on the leaf page's high key.  If we wanted to avoid moving
+left without breaking page deletion, we'd have to avoid suffix
+truncation, which could never be worth it.
 
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
@@ -598,33 +615,60 @@ the order of multiple keys for a given column is unspecified.)  An
 insertion scankey uses the same array-of-ScanKey data structure, but the
 sk_func pointers point to btree comparison support functions (ie, 3-way
 comparators that return int4 values interpreted as <0, =0, >0).  In an
-insertion scankey there is exactly one entry per index column.  Insertion
-scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
-used to locate the starting point of a scan, as well as for locating the
-place to insert a new index tuple.  (Note: in the case of an insertion
-scankey built from a search scankey, there might be fewer keys than
-index columns, indicating that we have no constraints for the remaining
-index columns.)  After we have located the starting point of a scan, the
-original search scankey is consulted as each index entry is sequentially
-scanned to decide whether to return the entry and whether the scan can
-stop (see _bt_checkkeys()).
+insertion scankey there is exactly one entry per index column.  There is
+also other data about the rules used to locate where to begin the scan,
+such as whether or not the scan is a "nextkey" scan.  Insertion scankeys
+are built within the btree code (eg, by _bt_mkscankey()) and are used to
+locate the starting point of a scan, as well as for locating the place to
+insert a new index tuple.  (Note: in the case of an insertion scankey built
+from a search scankey, there might be fewer keys than index columns,
+indicating that we have no constraints for the remaining index columns.)
+After we have located the starting point of a scan, the original search
+scankey is consulted as each index entry is sequentially scanned to decide
+whether to return the entry and whether the scan can stop (see
+_bt_checkkeys()).
 
 We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+to heap tuples, and are used only for tree navigation.  Pivot tuples
+include all tuples on non-leaf pages and high keys on leaf pages.  Note
+that pivot index tuples are only used to represent which part of the key
+space belongs on each page, and can have attribute values copied from
+non-pivot tuples that were deleted and killed by VACUUM some time ago.
+
+Notes about suffix truncation
+-----------------------------
+
+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split when the remaining attributes distinguish the
+last index tuple on the post-split left page as belonging on the left page,
+and the first index tuple on the post-split right page as belonging on the
+right page.  A truncated tuple logically retains all key attributes, though
+they implicitly have "negative infinity" as their value, and have no
+storage overhead.  Since the high key is subsequently reused as the
+downlink in the parent page for the new right page, suffix truncation makes
+pivot tuples short.  INCLUDE indexes are guaranteed to have non-key
+attributes truncated at the time of a leaf page split, but may also have
+some key attributes truncated away, based on the usual criteria for key
+attributes.  They are not a special case, since non-key attributes are
+merely payload to B-Tree searches.
+
+The goal of suffix truncation of key attributes is to improve index
+fan-out.  The technique was first described by Bayer and Unterauer (R.Bayer
+and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
+2, No. 1, March 1977, pp 11-26).  The Postgres implementation is loosely
+based on their paper.  Note that Postgres only implements what the paper
+refers to as simple prefix B-Trees.  Note also that the paper assumes that
+the tree has keys that consist of single strings that maintain the "prefix
+property", much like strings that are stored in a suffix tree (comparisons
+of earlier bytes must always be more significant than comparisons of later
+bytes, and, in general, the strings must compare in a way that doesn't
+break transitive consistency as they're split into pieces).  Suffix
+truncation in Postgres currently only works at the whole-attribute
+granularity, but it would be straightforward to invent opclass
+infrastructure that manufactures a smaller attribute value in the case of
+variable-length types, such as text.  An opclass support function could
+manufacture the shortest possible key value that still correctly separates
+each half of a leaf page split.
 
 Notes About Data Representation
 -------------------------------
@@ -642,9 +686,10 @@ so we have to play some games.
 On a page that is not rightmost in its tree level, the "high key" is
 kept in the page's first item, and real data items start at item 2.
 The link portion of the "high key" item goes unused.  A page that is
-rightmost has no "high key", so data items start with the first item.
-Putting the high key at the left, rather than the right, may seem odd,
-but it avoids moving the high key as we add data items.
+rightmost has no "high key" (it's implicitly positive infinity), so data
+items start with the first item.  Putting the high key at the left, rather
+than the right, may seem odd, but it avoids moving the high key as we add
+data items.
 
 On a leaf page, the data items are simply links to (TIDs of) tuples
 in the relation being indexed, with the associated key values.
@@ -658,4 +703,17 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
+
+Non-leaf pages only truly need to truncate their first item to zero
+attributes on each level's leftmost page, since only there is the item
+truly negative infinity.
+All other negative infinity items are only really negative infinity
+within the subtree that the page is at the root of (or is a leftmost
+page within).  We truncate away all attributes of the first item on
+non-leaf pages just the same, to save a little space.  If we ever
+avoided zero-truncating items on pages where that doesn't accurately
+represent the absolute separation of the keyspace, we'd be left with
+"low key" items on internal pages -- a key value that can be used as a
+lower bound on items on the page, much like the high key is an upper
+bound.
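
To make the whole-attribute granularity described in the README concrete,
here is a minimal sketch of how the number of attributes a new pivot tuple
must keep could be derived from the two tuples that straddle a leaf split
point.  It only illustrates the rule, and is not the patch's code; the
comparison helper index_attrs_equal() is hypothetical, and the real work
happens in _bt_suffix_truncate()/_bt_leave_natts() further down:

static int
pivot_natts_needed(Relation rel, IndexTuple lastleft, IndexTuple firstright)
{
	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
	int			keepnatts = nkeyatts + 1;	/* worst case: keep heap TID too */
	int			attnum;

	for (attnum = 1; attnum <= nkeyatts; attnum++)
	{
		/* index_attrs_equal() stands in for a per-attribute comparison */
		if (!index_attrs_equal(rel, lastleft, firstright, attnum))
		{
			keepnatts = attnum; /* first attribute that tells the halves apart */
			break;
		}
	}

	return keepnatts;			/* > nkeyatts: heap TID tie-breaker needed too */
}

Everything past keepnatts is truncated away from the firstright copy that
becomes the new high key, which (per the rules above) is what lets the pivot
compare as strictly less than every tuple on the post-split right page.
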
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 582e5b0652..825be932b5 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -52,16 +52,14 @@ typedef struct
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
-static TransactionId _bt_check_unique(Relation rel, IndexTuple itup,
-				 Relation heapRel, Buffer buf, OffsetNumber offset,
-				 ScanKey itup_scankey,
-				 IndexUniqueCheck checkUnique, bool *is_unique,
-				 uint32 *speculativeToken);
-static void _bt_findinsertloc(Relation rel,
+static TransactionId _bt_check_unique(Relation rel, BTScanInsert itup_scankey,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
+				 OffsetNumber offset, IndexUniqueCheck checkUnique,
+				 bool *is_unique, uint32 *speculativeToken);
+static OffsetNumber  _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_scankey,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool restorebinsrch,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel);
@@ -84,8 +82,8 @@ static void _bt_checksplitloc(FindSplitData *state,
 				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey);
+static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey,
+			 Page page, OffsetNumber offnum);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
@@ -111,18 +109,20 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel)
 {
 	bool		is_unique = false;
-	int			indnkeyatts;
-	ScanKey		itup_scankey;
+	BTScanInsert itup_scankey;
 	BTStack		stack = NULL;
 	Buffer		buf;
-	OffsetNumber offset;
+	Page		page;
+	BTPageOpaque lpageop;
 	bool		fastpath;
 
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	Assert(indnkeyatts != 0);
+	Assert(IndexRelationGetNumberOfKeyAttributes(rel) != 0);
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
+	/* remove heap TID from scan key in unique case for now */
+	if (checkUnique != UNIQUE_CHECK_NO)
+		itup_scankey->scantid = NULL;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -145,12 +145,9 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	 */
 top:
 	fastpath = false;
-	offset = InvalidOffsetNumber;
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
 		Size		itemsz;
-		Page		page;
-		BTPageOpaque lpageop;
 
 		/*
 		 * Conditionally acquire exclusive lock on the buffer before doing any
@@ -174,14 +171,14 @@ top:
 			/*
 			 * Check if the page is still the rightmost leaf page, has enough
 			 * free space to accommodate the new tuple, and the insertion scan
-			 * key is strictly greater than the first key on the page.
+			 * key (with or without scantid set) is strictly greater than the
+			 * first key on the page.
 			 */
 			if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, itup_scankey, page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -220,8 +217,7 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, itup_scankey, &buf, BT_WRITE, NULL);
 	}
 
 	/*
@@ -231,12 +227,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tie-breaker attribute.  Any other would-be inserter
+	 * of the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -249,9 +246,14 @@ top:
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
+		OffsetNumber offset;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
-		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
+		page = BufferGetPage(buf);
+		lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+		/* Later _bt_findinsertloc() call to _bt_binsrch() can reuse result */
+		itup_scankey->savebinsrch = true;
+		offset = _bt_binsrch(rel, itup_scankey, buf);
+		xwait = _bt_check_unique(rel, itup_scankey, itup, heapRel, buf, offset,
 								 checkUnique, &is_unique, &speculativeToken);
 
 		if (TransactionIdIsValid(xwait))
@@ -274,10 +276,16 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/* Uniqueness is established -- restore scantid */
+		Assert(itup_scankey->scantid == NULL);
+		itup_scankey->scantid = &itup->t_tid;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		OffsetNumber	insertoff;
+
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -288,10 +296,11 @@ top:
 		 * attributes are not considered part of the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
-		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+		/* do the insertion, possibly on a page to the right in unique case */
+		insertoff = _bt_findinsertloc(rel, itup_scankey, &buf,
+									  checkUnique != UNIQUE_CHECK_NO, itup,
+									  stack, heapRel);
+		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, insertoff, false);
 	}
 	else
 	{
@@ -302,7 +311,7 @@ top:
 	/* be tidy */
 	if (stack)
 		_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
+	pfree(itup_scankey);
 
 	return is_unique;
 }
@@ -327,13 +336,12 @@ top:
  * core code must redo the uniqueness check later.
  */
 static TransactionId
-_bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
-				 Buffer buf, OffsetNumber offset, ScanKey itup_scankey,
-				 IndexUniqueCheck checkUnique, bool *is_unique,
-				 uint32 *speculativeToken)
+_bt_check_unique(Relation rel, BTScanInsert itup_scankey,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
+				 OffsetNumber offset, IndexUniqueCheck checkUnique,
+				 bool *is_unique, uint32 *speculativeToken)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	SnapshotData SnapshotDirty;
 	OffsetNumber maxoff;
 	Page		page;
@@ -392,7 +400,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 				 * in real comparison, but only for ordering/finding items on
 				 * pages. - vadim 03/24/97
 				 */
-				if (!_bt_isequal(itupdesc, page, offset, indnkeyatts, itup_scankey))
+				if (!_bt_isequal(itupdesc, itup_scankey, page, offset))
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
@@ -553,11 +561,12 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
-			/* If scankey == hikey we gotta check the next page too */
+			/* If scankey <= hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			/* _bt_isequal()'s special NULL semantics not required here */
+			Assert(itup_scankey->scantid == NULL);
+			if (_bt_compare(rel, itup_scankey, page, P_HIKEY) > 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -599,40 +608,40 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 
 
 /*
- *	_bt_findinsertloc() -- Finds an insert location for a tuple
+ *	_bt_findinsertloc() -- Finds an insert location for a new tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
+ *		On entry, *bufptr contains the page that the new tuple unambiguously
+ *		belongs on.  This may not be quite right for callers that just called
+ *		_bt_check_unique(), though, since they won't have initially searched
+ *		using a scantid.  They'll have to insert into a page somewhere to the
+ *		using a scantid.  In rare cases where a unique index has many
+ *		physical duplicates, the new tuple may actually have to go on a page
+ *		somewhere to the right.
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
+ *		_bt_check_unique() callers arrange for their insertion scan key to
+ *		save the progress of the last binary search performed.  No additional
+ *		binary search comparisons occur in the common case where there was no
+ *		existing duplicate tuple, though we may occasionally still be unable
+ *		to reuse that work (e.g. if we micro-vacuum the page, or have to move
+ *		right).  Even when there are garbage duplicates, very few binary
+ *		search comparisons will be performed without being strictly
+ *		necessary.  (Doesn't seem worthwhile to optimize this further.)
  *
- *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
- *		page returned to the caller is exclusively locked instead.
+ *		The caller should hold an exclusive lock on *bufptr in all cases.  On
+ *		exit, *bufptr points to the page chosen for the insert.  If we have
+ *		to move right, the lock and pin on the original page will be
+ *		released, and the new page returned to the caller is exclusively
+ *		locked instead.  In any case, we return the offset that the caller
+ *		should use to insert into the buffer pointed to by *bufptr.
  *
- *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.  It has to happen here, since it may invalidate the caller's
+ *		restorebinsrch hint.
  */
-static void
+static OffsetNumber
 _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_scankey,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool restorebinsrch,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel)
@@ -641,91 +650,40 @@ _bt_findinsertloc(Relation rel,
 	Page		page = BufferGetPage(buf);
 	Size		itemsz;
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
 	OffsetNumber newitemoff;
-	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	Assert(P_ISLEAF(lpageop));
+	Assert(itup_scankey->scantid != NULL);
 
 	itemsz = IndexTupleSize(newtup);
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itemsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
-	 */
 	if (itemsz > BTMaxItemSize(page))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itemsz, BTMaxItemSize(page),
-						RelationGetRelationName(rel)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(heapRel,
-									RelationGetRelationName(rel))));
+		_bt_check_third_page(rel, heapRel, page, itemsz);
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
+	/*
+	 * TODO: Restore the logic for finding a page to insert on in the event of
+	 * many duplicates for pre-pg_upgrade indexes.  The whole search through
+	 * pages of logical duplicates to determine where to insert seems to have
+	 * little upside, but that doesn't make it okay to ignore the performance
+	 * characteristics of such indexes in the window after pg_upgrade is run
+	 * and before a REINDEX can run to bump BTREE_VERSION.
 	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	while (true)
 	{
 		Buffer		rbuf;
 		BlockNumber rblkno;
 
-		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
-		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
-		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
-
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
-		}
-
-		/*
-		 * nope, so check conditions (b) and (c) enumerated above
-		 */
 		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+			likely(_bt_compare(rel, itup_scankey, page, P_HIKEY) <= 0))
 			break;
 
 		/*
-		 * step right to next non-dead page
+		 * step right to next non-dead page.  this is only needed for unique
+		 * indexes, and pg_upgrade'd indexes that still use BTREE_VERSION 2 or
+		 * 3, where heap TID isn't considered to be a part of the keyspace.
 		 *
 		 * must write-lock that page before releasing write lock on current
 		 * page; else someone else's _bt_check_unique scan could fail to see
@@ -747,7 +705,7 @@ _bt_findinsertloc(Relation rel,
 			 * good because finishing the split could be a fairly lengthy
 			 * operation.  But this should happen very seldom.
 			 */
-			if (P_INCOMPLETE_SPLIT(lpageop))
+			if (unlikely(P_INCOMPLETE_SPLIT(lpageop)))
 			{
 				_bt_finish_split(rel, rbuf, stack);
 				rbuf = InvalidBuffer;
@@ -764,27 +722,31 @@ _bt_findinsertloc(Relation rel,
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		restorebinsrch = false;
 	}
 
 	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
+	 * Perform micro-vacuuming of the page we're about to insert the tuple
+	 * onto, if it looks like it has LP_DEAD items.
 	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
-	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < itemsz)
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		restorebinsrch = false;
+	}
+
+	/* _bt_check_unique() callers often avoid binary search effort */
+	itup_scankey->restorebinsrch = restorebinsrch;
+	newitemoff = _bt_binsrch(rel, itup_scankey, buf);
+
+	/* Assert that hinted binary search is consistent with unhinted case */
+	Assert(!itup_scankey->restorebinsrch);
+	Assert(!restorebinsrch ||
+		   newitemoff == _bt_binsrch(rel, itup_scankey, buf));
 
 	*bufptr = buf;
-	*offsetptr = newitemoff;
+	return newitemoff;
 }
 
 /*----------
@@ -840,11 +802,12 @@ _bt_insertonpg(Relation rel,
 	/* child buffer must be given iff inserting on an internal page */
 	Assert(P_ISLEAF(lpageop) == !BufferIsValid(cbuf));
 	/* tuple must have appropriate number of attributes */
+	Assert(BTreeTupleGetNAtts(itup, rel) > 0);
 	Assert(!P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) ==
 		   IndexRelationGetNumberOfAttributes(rel));
 	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
+		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/* The caller should've finished any incomplete splits already. */
@@ -1132,8 +1095,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	OffsetNumber i;
 	bool		isleaf;
 	IndexTuple	lefthikey;
-	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/* Acquire a new page to split into */
 	rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1203,7 +1164,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <=
+			   IndexRelationGetNumberOfKeyAttributes(rel));
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1217,8 +1180,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 
 	/*
 	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
-	 * item at position firstright, or the incoming tuple.
+	 * to go into the new right page, or possibly a truncated version if this
+	 * is a leaf page split.  This might be either the existing data item at
+	 * position firstright, or the incoming tuple.
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1236,25 +1200,55 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
-	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
+	 * Truncate attributes of the high key item before inserting it on the
+	 * left page.  This can only happen at the leaf level, since in general
+	 * all pivot tuple values originate from leaf level high keys.  This isn't
+	 * just about avoiding unnecessary work, though; truncating unneeded key
+	 * suffix attributes can only be performed at the leaf level anyway.  This
+	 * is because a pivot tuple in a grandparent page must guide a search not
 	 * only to the correct parent page, but also to the correct leaf page.
+	 *
+	 * Note that non-key (INCLUDE) attributes are always truncated away here.
+	 * Additional key attributes are truncated away when they're not required
+	 * to correctly separate the key space.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (isleaf)
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		OffsetNumber lastleftoff;
+		IndexTuple	lastleft;
+
+		/*
+		 * Determine which tuple is on the left side of the split point, and
+		 * generate truncated copy of the right tuple.  Truncate as
+		 * generate a truncated copy of the right tuple.  Truncate as
+		 * side of the split (and later downlink for the right side) that
+		 * fails to distinguish each side.  The new high key needs to be
+		 * strictly less than all tuples on the right side of the split, but
+		 * can be equal to items on the left side of the split.
+		 *
+		 * Handle the case where the incoming tuple is about to become the
+		 * last item on the left side of the split.
+		 */
+		if (newitemonleft && newitemoff == firstright)
+			lastleft = newitem;
+		else
+		{
+			lastleftoff = OffsetNumberPrev(firstright);
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		Assert(lastleft != item);
+		lefthikey = _bt_suffix_truncate(rel, lastleft, item);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <=
+		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1447,7 +1441,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1476,22 +1469,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log left page.  We must also log the left page's high key. */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1509,9 +1490,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -2199,7 +2178,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2296,6 +2276,7 @@ _bt_pgaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -2311,28 +2292,24 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
-_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey)
+_bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey, Page page,
+			OffsetNumber offnum)
 {
 	IndexTuple	itup;
+	ScanKey		scankey;
 	int			i;
 
 	/* Better be comparing to a leaf item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
 
+	scankey = itup_scankey->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
-	for (i = 1; i <= keysz; i++)
+	for (i = 1; i <= itup_scankey->keysz; i++)
 	{
 		AttrNumber	attno;
 		Datum		datum;
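
One detail that is easy to get backwards when reading _bt_isequal() is its
NULL rule: for the purposes of unique checking, a NULL is never equal to
anything, not even another NULL.  A minimal sketch of just that rule (the
helper below is hypothetical and not part of the patch):

static bool
attr_equal_for_unique_check(ScanKey skey, Datum datum, bool isnull)
{
	if ((skey->sk_flags & SK_ISNULL) || isnull)
		return false;			/* NULL != NULL as far as uniqueness goes */

	/* btree support function 1 returns <0, 0, >0 */
	return DatumGetInt32(FunctionCall2Coll(&skey->sk_func,
										   skey->sk_collation,
										   datum,
										   skey->sk_argument)) == 0;
}
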
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 4082103fe2..26d97e884c 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1370,7 +1370,7 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 */
 			if (!stack)
 			{
-				ScanKey		itup_scankey;
+				BTScanInsert itup_scankey;
 				ItemId		itemid;
 				IndexTuple	targetkey;
 				Buffer		lbuf;
@@ -1421,10 +1421,8 @@ _bt_pagedel(Relation rel, Buffer buf)
 
 				/* we need an insertion scan key for the search, so build one */
 				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
-				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
+				/* get stack to leaf page by searching index */
+				stack = _bt_search(rel, itup_scankey, &lbuf, BT_READ, NULL);
 				/* don't need a pin on the page */
 				_bt_relbuf(rel, lbuf);
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d3700bd082..d3ac408a6d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -72,16 +72,13 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.  If the key was
+ * built from a leaf page's high key, that leaf page will be relocated.
  *
  * Return value is a stack of parent-page pointers.  *bufP is set to the
- * address of the leaf-page buffer, which is read-locked and pinned.
- * No locks are held on the parent pages, however!
+ * address of the leaf-page buffer, which is read-locked and pinned.  No locks
+ * are held on the parent pages, however!
  *
  * If the snapshot parameter is not NULL, "old snapshot" checking will take
  * place during the descent through the tree.  This is not needed when
@@ -94,8 +91,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+		   Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -131,7 +128,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+		*bufP = _bt_moveright(rel, key, *bufP,
 							  (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
@@ -145,7 +142,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -158,8 +155,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link to disambiguate duplicate keys in the index, which is
+		 * faster than comparing the keys themselves.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -199,8 +196,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  true, stack_in, BT_WRITE, snapshot);
+		*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+							  snapshot);
 	}
 
 	return stack_in;
@@ -216,16 +213,16 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * or strictly to the right of it.
  *
  * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page.  If that entry
- * is strictly less than the scankey, or <= the scankey in the nextkey=true
+ * tree by examining the high key entry on the page.  If that entry is
+ * strictly less than the scankey, or <= the scankey in the key.nextkey=true
  * case, then we followed the wrong link and we need to move right.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index (see nbtree/README).
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key.  When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
  *
  * If forupdate is true, we will attempt to finish any incomplete splits
  * that we encounter.  This is required when locking a target page for an
@@ -242,10 +239,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  */
 Buffer
 _bt_moveright(Relation rel,
+			  BTScanInsert key,
 			  Buffer buf,
-			  int keysz,
-			  ScanKey scankey,
-			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
 			  int access,
@@ -270,7 +265,7 @@ _bt_moveright(Relation rel,
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
 	 */
-	cmpval = nextkey ? 0 : 1;
+	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
@@ -284,7 +279,7 @@ _bt_moveright(Relation rel,
 		/*
 		 * Finish any incomplete splits we encounter along the way.
 		 */
-		if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+		if (unlikely(forupdate && P_INCOMPLETE_SPLIT(opaque)))
 		{
 			BlockNumber blkno = BufferGetBlockNumber(buf);
 
@@ -305,7 +300,8 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (unlikely(P_IGNORE(opaque) ||
+					 _bt_compare(rel, key, page, P_HIKEY) >= cmpval))
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -328,10 +324,6 @@ _bt_moveright(Relation rel,
  * The passed scankey must be an insertion-type scankey (see nbtree/README),
  * but it can omit the rightmost column(s) of the index.
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
  * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
  * particular, this means it is possible to return a value 1 greater than the
@@ -347,37 +339,74 @@ _bt_moveright(Relation rel,
  *
  * This procedure is not responsible for walking right, it just examines
  * the given page.  _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
+ * on the buffer.  When itup_scankey.savebinsrch is set, the final low and
+ * high bounds of the binary search are saved in mutable fields of the
+ * insertion scan key, so that a subsequent call where the caller sets
+ * itup_scankey.restorebinsrch can reuse them.  This lets the second binary
+ * search performed on the first leaf page landed on by inserters that do
+ * unique enforcement avoid doing any real comparisons in most cases.
+ * See _bt_findinsertloc() for further details.
  */
 OffsetNumber
 _bt_binsrch(Relation rel,
-			Buffer buf,
-			int keysz,
-			ScanKey scankey,
-			bool nextkey)
+			BTScanInsert key,
+			Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber low,
-				high;
+				high,
+				savehigh;
 	int32		result,
 				cmpval;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	low = P_FIRSTDATAKEY(opaque);
-	high = PageGetMaxOffsetNumber(page);
+	Assert(!(key->restorebinsrch && key->savebinsrch));
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+	/* Restore binary search state when scantid is available */
+	Assert(!key->savebinsrch || key->scantid == NULL);
+	Assert(!key->restorebinsrch || key->scantid != NULL);
+	Assert(P_ISLEAF(opaque) || (!key->restorebinsrch && !key->savebinsrch));
 
-	/*
-	 * If there are no keys on the page, return the first available slot. Note
-	 * this covers two cases: the page is really empty (no keys), or it
-	 * contains only a high key.  The latter case is possible after vacuuming.
-	 * This can never happen on an internal page, however, since they are
-	 * never empty (an internal page must have children).
-	 */
-	if (high < low)
-		return low;
+	if (!key->restorebinsrch)
+	{
+		low = P_FIRSTDATAKEY(opaque);
+		high = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If there are no keys on the page, return the first available
+		 * slot.  Note this covers two cases: the page is really empty (no
+		 * keys), or it contains only a high key.  The latter case is
+		 * possible after vacuuming.  This can never happen on an internal
+		 * page, however, since they are never empty (an internal page must
+		 * have children).
+		 */
+		if (unlikely(high < low))
+		{
+			if (key->savebinsrch)
+			{
+				key->low = low;
+				key->high = high;
+				key->savebinsrch = false;
+			}
+			return low;
+		}
+		high++;			/* establish the loop invariant for high */
+	}
+	else
+	{
+		/* Restore result of previous binary search against same page */
+		low = key->low;
+		high = key->high;
+		key->restorebinsrch = false;
+
+		/* Return the first slot, in line with original binary search */
+		if (unlikely(high < low))
+			return low;
+	}
 
 	/*
 	 * Binary search to find the first key on the page >= scan key, or first
@@ -391,22 +420,37 @@ _bt_binsrch(Relation rel,
 	 *
 	 * We can fall out when high == low.
 	 */
-	high++;						/* establish the loop invariant for high */
-
-	cmpval = nextkey ? 0 : 1;	/* select comparison value */
 
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
+	savehigh = high;
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
 		else
+		{
 			high = mid;
+
+			/*
+			 * high can only be reused by more restrictive binary search when
+			 * high can only be reused by a more restrictive binary search when
+			 */
+			if (result != 0)
+				savehigh = high;
+		}
+	}
+
+	if (key->savebinsrch)
+	{
+		key->low = low;
+		key->high = savehigh;
+		key->savebinsrch = false;
 	}
 
 	/*
@@ -421,7 +465,8 @@ _bt_binsrch(Relation rel,
 
 	/*
 	 * On a non-leaf page, return the last key < scan key (resp. <= scan key).
-	 * There must be one if _bt_compare() is playing by the rules.
+	 * There must be one if _bt_compare()/_bt_tuple_compare() is playing by
+	 * the rules.
 	 */
 	Assert(low > P_FIRSTDATAKEY(opaque));
 
@@ -431,6 +476,48 @@ _bt_binsrch(Relation rel,
 /*----------
  *	_bt_compare() -- Compare scankey to a particular tuple on the page.
  *
+ * Convenience wrapper for _bt_tuple_compare() callers that want to compare
+ * an offset on a particular page.
+ *
+ * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
+ * "minus infinity": this routine will always claim it is less than the
+ * scankey.  The actual key value stored (if any, which there probably isn't)
+ * does not matter.  This convention allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first key.
+ * See backend/access/nbtree/README for details.
+ *----------
+ */
+int32
+_bt_compare(Relation rel,
+			BTScanInsert key,
+			Page page,
+			OffsetNumber offnum)
+{
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	IndexTuple	itup;
+
+	Assert(_bt_check_natts(rel, page, offnum));
+
+	/*
+	 * Force result ">" if target item is first data item on an internal page
+	 * --- see NOTE above.
+	 *
+	 * A minus infinity key has all attributes truncated away, so this test is
+	 * redundant with the minus infinity attribute tie-breaker.  However, the
+	 * number of attributes in minus infinity tuples was not explicitly
+	 * represented as 0 until PostgreSQL v11, so an explicit offnum test is
+	 * still required.
+	 */
+	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
+		return 1;
+
+	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	return _bt_tuple_compare(rel, key, itup);
+}
+
+/*----------
+ *	_bt_tuple_compare() -- Compare scankey to a particular tuple.
+ *
  * The passed scankey must be an insertion-type scankey (see nbtree/README),
  * but it can omit the rightmost column(s) of the index.
  *
@@ -445,37 +532,22 @@ _bt_binsrch(Relation rel,
  *		NULLs in the keys are treated as sortable values.  Therefore
  *		"equality" does not necessarily mean that the item should be
  *		returned to the caller as a matching key!
- *
- * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
- * "minus infinity": this routine will always claim it is less than the
- * scankey.  The actual key value stored (if any, which there probably isn't)
- * does not matter.  This convention allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first key.
- * See backend/access/nbtree/README for details.
  *----------
  */
 int32
-_bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
-			Page page,
-			OffsetNumber offnum)
+_bt_tuple_compare(Relation rel,
+				  BTScanInsert key,
+				  IndexTuple itup)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	IndexTuple	itup;
+	ItemPointer heapTid;
+	int			ntupatts;
+	int			ncmpkey;
 	int			i;
+	ScanKey		scankey;
 
-	Assert(_bt_check_natts(rel, page, offnum));
-
-	/*
-	 * Force result ">" if target item is first data item on an internal page
-	 * --- see NOTE above.
-	 */
-	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
-		return 1;
-
-	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -489,7 +561,9 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	ncmpkey = Min(ntupatts, key->keysz);
+	scankey = key->scankeys;
+	for (i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -540,8 +614,31 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * Use the number of attributes as a tie-breaker, in order to treat
+	 * truncated attributes in the index as minus infinity.
+	 */
+	if (key->keysz > ntupatts)
+		return 1;
+
+	/* If caller provided no heap TID tie-breaker for scan, they're equal */
+	if (key->scantid == NULL)
+		return 0;
+
+	/*
+	 * Although it isn't counted as an attribute by BTreeTupleGetNAtts(), heap
+	 * TID is an implicit final key attribute that ensures that all index
+	 * tuples have a distinct set of key attribute values.
+	 *
+	 * This is often truncated away in pivot tuples, which makes the attribute
+	 * value implicitly negative infinity.
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (heapTid == NULL)
+		return 1;
+
+	/* Deliberately invert the order, since TIDs "sort DESC" */
+	return ItemPointerCompare(heapTid, key->scantid);
 }
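
A small worked illustration of the total ordering these rules produce may
help (values invented, not taken from the patch).  For a single-column text
index, with truncated attributes shown as -inf:

/*
 *   ('cough', -inf)      pivot/high key: heap TID truncated away
 *   ('cough', (73,2))    duplicates: numerically larger TIDs sort first
 *   ('cough', (12,5))
 *   ('crikey', (4,7))
 *
 * The snippet below just demonstrates the "sort DESC" inversion for the
 * two duplicates; scantid plays the role of the insertion scan key's TID.
 */
static void
tid_order_sketch(void)
{
	ItemPointerData tupletid;
	ItemPointerData scantid;

	ItemPointerSet(&tupletid, 73, 2);
	ItemPointerSet(&scantid, 12, 5);

	/*
	 * As in _bt_tuple_compare(), a positive result means the scan key is
	 * greater than the tuple: a search for ('cough', (12,5)) therefore
	 * lands after the duplicate with TID (73,2).
	 */
	Assert(ItemPointerCompare(&tupletid, &scantid) > 0);
}
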
 
 /*
@@ -575,8 +672,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat;
 	bool		nextkey;
 	bool		goback;
+	BTScanInsert key;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
-	ScanKeyData scankeys[INDEX_MAX_KEYS];
+	ScanKey		scankeys;
+	ItemPointerData minscantid;
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -822,10 +921,14 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
 	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built in the local
-	 * scankeys[] array, using the keys identified by startKeys[].
+	 * identified above.  The insertion scankey is built in the scankeys[]
+	 * array, using the keys identified by startKeys[].
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
+	key = palloc0(offsetof(BTScanInsertData, scankeys) +
+				  sizeof(ScanKeyData) * INDEX_MAX_KEYS);
+	scankeys = key->scankeys;
+
 	for (i = 0; i < keysCount; i++)
 	{
 		ScanKey		cur = startKeys[i];
@@ -1053,12 +1156,42 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
+	/*
+	 * Initialize insertion scankey.
+	 *
+	 * When all key attributes will be present in the insertion scankey, and
+	 * the scankey isn't a nextkey search, manufacture a sentinel scan tid
+	 * that's less
+	 * than any possible heap TID in the index.  This is still greater than
+	 * minus infinity to _bt_compare, allowing _bt_search to follow a
+	 * downlink with scankey-equal attributes, but a truncated-away heap
+	 * TID.
+	 *
+	 * If we didn't do this then affected index scans would have to
+	 * unnecessarily visit an extra page before moving right to the page
+	 * they should have landed on from the parent in the first place.
+	 *
+	 * (Note that implementing this by adding hard-coding to _bt_compare is
+	 * unworkable, since some _bt_search callers need to re-find a leaf page
+	 * using the page's high key.)
+	 */
+	key->nextkey = nextkey;
+	key->keysz = keysCount;
+	key->scantid = NULL;
+	if (key->keysz >= IndexRelationGetNumberOfKeyAttributes(rel) &&
+		!key->nextkey)
+	{
+		key->scantid = &minscantid;
+
+		/* Heap TID attribute uses DESC ordering */
+		ItemPointerSetBlockNumber(key->scantid, InvalidBlockNumber);
+		ItemPointerSetOffsetNumber(key->scantid, InvalidOffsetNumber);
+	}
+
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, key, &buf, BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
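
One step in the sentinel reasoning above is worth spelling out:
InvalidBlockNumber is the all-ones block number, so a TID built from it is
numerically greater than any TID that can actually appear in the index.
Under the prototype's inverted ("DESC") TID comparison, that makes the
sentinel sort before every real TID, while it still compares as greater
than a pivot whose TID was truncated away, since truncation means minus
infinity.  A standalone sketch of the argument, not patch code:

static void
sentinel_scantid_sketch(void)
{
	ItemPointerData sentinel;
	ItemPointerData realtid;

	ItemPointerSetBlockNumber(&sentinel, InvalidBlockNumber);
	ItemPointerSetOffsetNumber(&sentinel, InvalidOffsetNumber);
	ItemPointerSet(&realtid, 42, 7);	/* any TID that could be in the index */

	/*
	 * Negative result: per _bt_tuple_compare()'s inverted TID comparison, a
	 * scan key carrying the sentinel sorts before this (and every other)
	 * real tuple TID.
	 */
	Assert(ItemPointerCompare(&realtid, &sentinel) < 0);
}
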
@@ -1087,7 +1220,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, key, buf);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..c8e0e75487 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -743,6 +743,7 @@ _bt_sortaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -796,8 +797,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -813,28 +812,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	itupsz = IndexTupleSize(itup);
 	itupsz = MAXALIGN(itupsz);
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itupsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: similar code appears in _bt_insertonpg() to defend against
-	 * oversize items being inserted into an already-existing index. But
-	 * during creation of an index, we don't go through there.
-	 */
 	if (itupsz > BTMaxItemSize(npage))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itupsz, BTMaxItemSize(npage),
-						RelationGetRelationName(wstate->index)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(wstate->heap,
-									RelationGetRelationName(wstate->index))));
+		_bt_check_third_page(wstate->index, wstate->heap, npage, itupsz);
 
 	/*
 	 * Check to see if page is "full".  It's definitely full if the item won't
@@ -880,19 +859,30 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
+			IndexTuple	lastleft;
 			IndexTuple	truncated;
 			Size		truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks
 			 * in internal pages are either negative infinity items, or get
 			 * their contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it more
+			 * likely that _bt_suffix_truncate() can truncate away more
+			 * attributes, whereas the split point passed to _bt_split() is
+			 * chosen much more delicately.  Suffix truncation is mostly
+			 * useful because it can greatly improve space utilization for
+			 * workloads with random insertions, or insertions of
+			 * monotonically increasing values at "local" points in the key
+			 * space.  It doesn't seem worthwhile to add complex logic for
+			 * choosing a split point here for a benefit that is bound to be
+			 * much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
@@ -905,7 +895,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_suffix_truncate(wstate->index, lastleft, oitup);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -924,8 +917,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -970,7 +964,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1029,8 +1023,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1115,7 +1110,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
 		}
 
-		_bt_freeskey(indexScanKey);
+		pfree(indexScanKey);
 
 		for (;;)
 		{
@@ -1127,6 +1122,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1134,7 +1131,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1151,6 +1147,21 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all
+				 * keys in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+					/* Deliberately invert the order, since TIDs "sort DESC" */
+					compare = ItemPointerCompare(&itup2->t_tid, &itup->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 4528e87c83..fabc676ba3 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,41 +49,54 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_leave_natts(Relation rel, IndexTuple lastleft,
+				IndexTuple firstright);
 
 
 /*
  * _bt_mkscankey
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
+ *		When itup is a non-pivot tuple, the returned insertion scan key is
+ *		suitable for finding a place for it to go on the leaf level.  When
+ *		itup is a pivot tuple, the returned insertion scankey is suitable
+ *		for locating the leaf page with the pivot as its high key (there
+ *		must have been one at some point if the pivot tuple actually came
+ *		from the tree, barring the minus infinity special case).
  *
  *		The result is intended for use with _bt_compare().
  */
-ScanKey
+BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
+	BTScanInsert res;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
+	int			tupnatts;
 	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts > 0);
+	Assert(tupnatts <= indnatts);
 
 	/*
-	 * We'll execute search using scan key constructed on key columns. Non-key
-	 * (INCLUDE index) columns are always omitted from scan keys.
+	 * We'll execute search using scan key constructed on key columns.
+	 * Truncated attributes and non-key attributes are omitted from the final
+	 * scan key.
 	 */
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
+	res = palloc0(offsetof(BTScanInsertData, scankeys) +
+				  sizeof(ScanKeyData) * indnkeyatts);
+	skey = res->scankeys;
+	res->keysz = Min(indnkeyatts, tupnatts);
+	res->scantid = BTreeTupleGetHeapTID(itup);
 	for (i = 0; i < indnkeyatts; i++)
 	{
 		FmgrInfo   *procinfo;
@@ -96,7 +109,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Truncated key attributes may not be represented in index tuple due
+		 * to suffix truncation.  Keys built from truncated attributes are
+		 * defensively represented as NULL values, though they should still
+		 * not participate in comparisons.
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -108,7 +134,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 									   arg);
 	}
 
-	return skey;
+	return res;
 }
 
 /*
@@ -159,15 +185,6 @@ _bt_mkscankey_nodata(Relation rel)
 	return skey;
 }
 
-/*
- * free a scan key made by either _bt_mkscankey or _bt_mkscankey_nodata.
- */
-void
-_bt_freeskey(ScanKey skey)
-{
-	pfree(skey);
-}
-
 /*
  * free a retracement stack made by _bt_search.
  */
@@ -2083,38 +2100,201 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_suffix_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  This is possible when there are
+ * attributes that follow an attribute in firstright that is not equal to the
+ * corresponding attribute in lastleft (equal according to an insertion scan
+ * key).
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that takes up more
+ * space than firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * Note that returned tuple's t_tid offset will hold the number of
+ * attributes present, so the original item pointer offset is not
+ * represented.  Caller should only change truncated tuple's downlink.  Note
+ * also that truncated key attributes are treated as containing "minus
+ * infinity" values by _bt_compare()/_bt_tuple_compare().
+ *
+ * Returned tuple is guaranteed to be no larger than the original plus some
+ * extra space for a possible extra heap TID tie-breaker attribute.  This
+ * guarantee is important for staying under the 1/3 of a page restriction on
+ * tuple size.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_suffix_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int16		natts = IndexRelationGetNumberOfAttributes(rel);
+	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			leavenatts;
+	IndexTuple	pivot;
+	ItemPointer pivotheaptid;
+	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples, which must have non-key
+	 * attributes in the case of INCLUDE indexes.  It's never okay to truncate
+	 * a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be left behind */
+	leavenatts = _bt_leave_natts(rel, lastleft, firstright);
 
-	return truncated;
+	if (leavenatts <= natts)
+	{
+		IndexTuple	tidpivot;
+
+		/*
+		 * Truncate away non-key attributes and/or key attributes.  Do a
+		 * straight copy in the case where the only attribute to be "truncated
+		 * away" is the implicit heap TID key attribute (i.e. the case where
+		 * we can at least avoid adding an explicit heap TID attribute to new
+		 * pivot).  We should only call index_truncate_tuple() when non-TID
+		 * attributes need to be truncated.
+		 */
+		if (leavenatts < natts)
+			pivot = index_truncate_tuple(itupdesc, firstright, leavenatts);
+		else
+			pivot = CopyIndexTuple(firstright);
+
+		/*
+		 * If there is a distinguishing key attribute within leavenatts, there
+		 * is no need to add an explicit heap TID attribute to new pivot.
+		 */
+		if (leavenatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, leavenatts);
+			return pivot;
+		}
+
+		/*
+		 * Only non-key attributes could be truncated away from an INCLUDE
+		 * index's pivot tuple.  They are not considered part of the key
+		 * space, so it's still necessary to add a heap TID attribute to the
+		 * new pivot tuple.  Create enlarged copy of our truncated right tuple
+		 * copy, to fit heap TID.
+		 */
+		Assert(natts > nkeyatts);
+		newsize = MAXALIGN(IndexTupleSize(pivot) + sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since attributes are all equal.  It's
+		 * necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = MAXALIGN(IndexTupleSize(firstright) + sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * Create enlarged copy of first right tuple to fit heap TID.  We must use
+	 * heap TID as a unique-ifier in new pivot tuple, since no non-TID
+	 * attribute distinguishes which values belong on each side of the split
+	 * point.
+	 */
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/*
+	 * Generate a heap TID value to go in enlarged (not truncated) pivot
+	 * tuple.  Simply use the last left heap TID as new pivot's heap TID
+	 * value.  This code path is mostly used by cases where the page to be
+	 * split only contains duplicates, since the logic for picking a split
+	 * point tries very hard to avoid that, using all means available to it.
+	 * "Single value" mode was likely to have been used to pick this split
+	 * point.
+	 *
+	 * We could easily manufacture a "median TID" value to use in the new
+	 * pivot, since optimizations like that often help fan-out when applied to
+	 * distinguishing/trailing non-TID attributes (adding opclass
+	 * infrastructure that gets called here to truncate non-TID attributes is
+	 * a possible future enhancement).  Using the last left heap TID actually
+	 * results in slightly better space utilization, though, because of the
+	 * specific properties of heap TID attributes.  This strategy maximizes
+	 * the number of duplicate tuples that will end up on the mostly-empty
+	 * left side of the split, and minimizes the number that will end up on
+	 * the mostly-full right side.  (This assumes that the split point was
+	 * likely chosen using "single value" mode.)
+	 */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  sizeof(ItemPointerData));
+	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+
+	/*
+	 * Lehman and Yao require that the downlink to the right page, which is to
+	 * be inserted into the parent page in the second phase of a page split, be
+	 * a strict lower bound on all current and future items on the right page
+	 * (this will be copied from the new high key for the left side of the
+	 * split).
+	 */
+
+	/* Deliberately invert the order, since TIDs "sort DESC" */
+	Assert(ItemPointerCompare(&lastleft->t_tid, pivotheaptid) >= 0);
+	Assert(ItemPointerCompare(&firstright->t_tid, pivotheaptid) < 0);
+
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
+
+/*
+ * _bt_leave_natts - how many key attributes to leave when truncating.
+ *
+ * Caller provides two tuples that enclose a split point.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			leavenatts;
+	BTScanInsert key;
+
+	key = _bt_mkscankey(rel, firstright);
+	key->scantid = NULL;
+
+	/*
+	 * Test the nkeyatts (no truncated non-TID attributes) case as well, since
+	 * the caller cares about whether or not it can avoid appending a heap TID
+	 * as a unique-ifier.
+	 */
+	leavenatts = 1;
+	for (;;)
+	{
+		if (leavenatts > nkeyatts)
+			break;
+		key->keysz = leavenatts;
+		if (_bt_tuple_compare(rel, key, lastleft) > 0)
+			break;
+		leavenatts++;
+	}
+
+	/* Can't leak memory here */
+	pfree(key);
+
+	return leavenatts;
 }
 
 /*
@@ -2137,6 +2317,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2156,6 +2337,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
@@ -2165,7 +2347,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2176,7 +2358,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			Assert(!P_RIGHTMOST(opaque));
 
 			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2209,8 +2391,70 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Tuple contains only key attributes despite on is it page high
 			 * key or not
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 
 	}
 }
+
+/*
+ *
+ *  _bt_check_third_page() -- check whether tuple fits on a btree page at all.
+ *
+ * Eventually, we ought to try to apply TOAST methods if not. We actually need
+ * to be able to fit three items on every page, so restrict any one item to
+ * 1/3 the per-page available space.  Note that itemsz should not include the
+ * ItemId overhead.
+ */
+void
+_bt_check_third_page(Relation rel, Relation heap, Page page, Size itemsz)
+{
+	Buffer		metabuf;
+	Page		metapg;
+	BTMetaPageData *metad;
+	uint32		btm_version;
+	bool		needheaptidspace;
+
+	/* Double check item size against lower BTREE_VERSION 4 limit */
+	if (itemsz <= BTMaxItemSize(page))
+		return;
+
+	/*
+	 * Tuple is probably too large to fit on page, but it's possible that the
+	 * index is pre-BTREE_VERSION-4, in which case a slightly higher limit
+	 * applies.
+	 */
+	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+	metapg = BufferGetPage(metabuf);
+	metad = BTPageGetMeta(metapg);
+	btm_version = metad->btm_version;
+	_bt_relbuf(rel, metabuf);
+
+	/* Does tuple squeeze by due to fitting under legacy limit? */
+	needheaptidspace = btm_version >= 4;
+	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
+		return;
+
+	if (needheaptidspace)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
+						itemsz, BTREE_VERSION, BTMaxItemSize(page),
+						RelationGetRelationName(rel)),
+				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+						 "Consider a function index of an MD5 hash of the value, "
+						 "or use full text indexing."),
+				 errtableconstraint(heap,
+									RelationGetRelationName(rel))));
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("index row size %zu exceeds btree version 2 and 3 maximum %zu for index \"%s\"",
+						itemsz, BTMaxItemSizeNoHeapTid(page),
+						RelationGetRelationName(rel)),
+				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+						 "Consider a function index of an MD5 hash of the value, "
+						 "or use full text indexing."),
+				 errtableconstraint(heap,
+									RelationGetRelationName(rel))));
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 67a94cb80a..7c061e96d2 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -284,6 +268,8 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		OffsetNumber off;
 		IndexTuple	newitem = NULL;
 		Size		newitemsz = 0;
+		IndexTuple	left_hikey = NULL;
+		Size		left_hikeysz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 5c4457179d..667c906b2e 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index dfbda5458f..ffeb0624fe 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -854,10 +854,8 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
 	 * PageIndexTupleDelete is the best way.  Delete the items in reverse
 	 * order so we don't have to think about adjusting item numbers for
 	 * previous deletions.
-	 *
-	 * TODO: tune the magic number here
 	 */
-	if (nitems <= 2)
+	if (nitems <= 7)
 	{
 		while (--nitems >= 0)
 			PageIndexTupleDelete(page, itemnos[nitems]);
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index a649dc54eb..d0397008db 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -964,7 +964,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -1042,7 +1042,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -4057,23 +4057,26 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required for
+	 * btree indexes, since heap TID is treated as an implicit last key
+	 * attribute in order to ensure that all keys in the index are physically
+	 * unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
 		BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
 
+		/* Deliberately invert the order, since TIDs "sort DESC" */
 		if (blk1 != blk2)
-			return (blk1 < blk2) ? -1 : 1;
+			return (blk1 < blk2) ? 1 : -1;
 	}
 	{
 		OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
 		OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
 
+		/* Deliberately invert the order, since TIDs "sort DESC" */
 		if (pos1 != pos2)
-			return (pos1 < pos2) ? -1 : 1;
+			return (pos1 < pos2) ? 1 : -1;
 	}
 
 	return 0;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 04ecb4cbc0..b9d82f6dfc 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -114,16 +114,26 @@ typedef struct BTMetaPageData
 
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
-#define BTREE_MIN_VERSION	2	/* minimal supported version number */
+#define BTREE_VERSION	4		/* current version number */
+#define BTREE_MIN_VERSION	4	/* minimal supported version number */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_suffix_truncate() will need to
+ * enlarge an index tuple to make space for a tie-breaker heap
+ * TID attribute, which we account for here.
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*sizeof(ItemPointerData)) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeNoHeapTid(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
@@ -204,21 +214,23 @@ typedef struct BTMetaPageData
  * real offset (downlinks only need to store a block number).  The offset
  * field only stores the number of attributes when the INDEX_ALT_TID_MASK
  * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * number of attributes).  INDEX_ALT_TID_MASK is only used for pivot tuples
+ * at present, though it's possible that it will be used within non-pivot
+ * tuples in the future.  Do not assume that a tuple with INDEX_ALT_TID_MASK
+ * set must be a pivot tuple.  A pivot tuple must have INDEX_ALT_TID_MASK set
+ * as of BTREE_VERSION 4, however.
  *
  * The 12 least significant offset bits are used to represent the number of
- * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
- * for future use (BT_RESERVED_OFFSET_MASK bits). BT_N_KEYS_OFFSET_MASK should
- * be large enough to store any number <= INDEX_MAX_KEYS.
+ * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 status bits
+ * (BT_RESERVED_OFFSET_MASK bits): BT_HEAP_TID_ATTR, plus 3 bits that are
+ * reserved for future use.  BT_N_KEYS_OFFSET_MASK should be large enough to
+ * store any number <= INDEX_MAX_KEYS.
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+/* Reserved to indicate if heap TID is represented at end of tuple */
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +253,15 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +270,42 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
+ * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
+ * (since all pivot tuples must as of BTREE_VERSION 4).  When non-pivot
+ * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
+ * probably also contain a heap TID at the end of the tuple.  We avoid
+ * assuming that a tuple with INDEX_ALT_TID_MASK set is necessarily a pivot
+ * tuple.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   sizeof(ItemPointerData)) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -320,6 +365,54 @@ typedef struct BTStackData
 
 typedef BTStackData *BTStack;
 
+/*
+ * BTScanInsert is the btree-private state needed to find an initial position
+ * for an indexscan, or to insert new tuples.  For details on its mutable
+ * state, see _bt_binsrch and _bt_findinsertloc.
+ *
+ * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
+ * locate the first item >= scankey.  When nextkey is true, they will locate
+ * the first item > scan key.
+ *
+ * scantid is the heap TID that is used as a final tie-breaker attribute,
+ * which may be set to NULL to indicate its absence.  When inserting new
+ * tuples, it must be set, since every tuple in the tree unambiguously belongs
+ * in one exact position, even when there are entries in the tree that are
+ * considered duplicates by external code.  Unique insertions set scantid only
+ * after unique checking indicates that it's safe to insert.  Despite the
+ * representational difference, scantid is just another insertion scankey to
+ * routines like _bt_search().
+ *
+ * keysz is the number of scankeys present, not including scantid.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared
+ * before scantid (user-visible attributes).  Every attribute should have an
+ * entry during insertion, though not necessarily when a regular index scan
+ * uses an insertion scankey to find an initial leaf page.  See nbtree/README
+ * for full details.
+ */
+
+typedef struct BTScanInsertData
+{
+	/*
+	 * Mutable state.  Used by _bt_binsrch() to inexpensively repeat a binary
+	 * search when only scantid has changed.
+	 */
+	bool		 savebinsrch;
+	bool		 restorebinsrch;
+	OffsetNumber low;
+	OffsetNumber high;
+
+	/* Immutable state */
+	bool		nextkey;
+	ItemPointer scantid;
+	int			keysz;
+	ScanKeyData scankeys[FLEXIBLE_ARRAY_MEMBER];
+} BTScanInsertData;
+
+typedef BTScanInsertData *BTScanInsert;
+
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -560,15 +653,13 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
  * prototypes for functions in nbtsearch.c
  */
 extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
+		   BTScanInsert key,
 		   Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
-extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_tuple_compare(Relation rel, BTScanInsert key, IndexTuple itup);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -577,9 +668,8 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 /*
  * prototypes for functions in nbtutils.c
  */
-extern ScanKey _bt_mkscankey(Relation rel, IndexTuple itup);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
 extern ScanKey _bt_mkscankey_nodata(Relation rel);
-extern void _bt_freeskey(ScanKey skey);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -601,8 +691,11 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
+extern IndexTuple _bt_suffix_truncate(Relation rel, IndexTuple lastleft,
+					IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
+extern void _bt_check_third_page(Relation rel, Relation heap, Page page,
+					 Size itemsz);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 819373031c..5f3c4a015a 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,7 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50 and 0x60 are unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -82,20 +81,16 @@ typedef struct xl_btree_insert
  *
  * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
  * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always explicitly log the left page high key value.
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * In the _R variants, the new item is one of the right page's tuples.  An
+ * IndexTuple representing the HIKEY of the left page always follows, since
+ * suffix truncation means that it can no longer be reconstructed from the
+ * leftmost key in the new right page.
  *
  * Backup Blk 1: new right page
  *
diff --git a/src/test/regress/expected/collate.out b/src/test/regress/expected/collate.out
index fcbe3a5cc8..6a58a6ae8a 100644
--- a/src/test/regress/expected/collate.out
+++ b/src/test/regress/expected/collate.out
@@ -667,5 +667,6 @@ SELECT collation for ((SELECT b FROM collate_test1 LIMIT 1));
 -- must get rid of them.
 --
 \set VERBOSITY terse
+SET client_min_messages TO 'warning';
 DROP SCHEMA collate_tests CASCADE;
-NOTICE:  drop cascades to 17 other objects
+RESET client_min_messages;
diff --git a/src/test/regress/expected/domain.out b/src/test/regress/expected/domain.out
index 0b5a9041b0..f4899f2a38 100644
--- a/src/test/regress/expected/domain.out
+++ b/src/test/regress/expected/domain.out
@@ -643,10 +643,10 @@ update domnotnull set col1 = null; -- fails
 ERROR:  domain dnotnulltest does not allow null values
 alter domain dnotnulltest drop not null;
 update domnotnull set col1 = null;
+\set VERBOSITY terse
 drop domain dnotnulltest cascade;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to column col1 of table domnotnull
-drop cascades to column col2 of table domnotnull
+\set VERBOSITY default
 -- Test ALTER DOMAIN .. DEFAULT ..
 create table domdeftest (col1 ddef1);
 insert into domdeftest default values;
diff --git a/src/test/regress/expected/foreign_key.out b/src/test/regress/expected/foreign_key.out
index fc3bbe4deb..1ec8264dfd 100644
--- a/src/test/regress/expected/foreign_key.out
+++ b/src/test/regress/expected/foreign_key.out
@@ -253,13 +253,13 @@ SELECT * FROM FKTABLE;
 (5 rows)
 
 -- this should fail for lack of CASCADE
+\set VERBOSITY terse
 DROP TABLE PKTABLE;
 ERROR:  cannot drop table pktable because other objects depend on it
-DETAIL:  constraint constrname2 on table fktable depends on table pktable
-HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 DROP TABLE PKTABLE CASCADE;
 NOTICE:  drop cascades to constraint constrname2 on table fktable
 DROP TABLE FKTABLE;
+\set VERBOSITY default
 --
 -- First test, check with no on delete or on update
 --
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index dc6262be43..2c20cea4b9 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5896,8 +5896,8 @@ inner join j1 j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
 where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1;
  id1 | id2 | id1 | id2 
 -----+-----+-----+-----
-   1 |   1 |   1 |   1
    1 |   2 |   1 |   2
+   1 |   1 |   1 |   1
 (2 rows)
 
 reset enable_nestloop;
diff --git a/src/test/regress/expected/truncate.out b/src/test/regress/expected/truncate.out
index 2e26510522..c8b9a71689 100644
--- a/src/test/regress/expected/truncate.out
+++ b/src/test/regress/expected/truncate.out
@@ -276,11 +276,10 @@ SELECT * FROM trunc_faa;
 (0 rows)
 
 ROLLBACK;
+\set VERBOSITY terse
 DROP TABLE trunc_f CASCADE;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table trunc_fa
-drop cascades to table trunc_faa
-drop cascades to table trunc_fb
+\set VERBOSITY default
 -- Test ON TRUNCATE triggers
 CREATE TABLE trunc_trigger_test (f1 int, f2 text, f3 text);
 CREATE TABLE trunc_trigger_log (tgop text, tglevel text, tgwhen text,
diff --git a/src/test/regress/expected/typed_table.out b/src/test/regress/expected/typed_table.out
index 2e47ecbcf5..c76efee358 100644
--- a/src/test/regress/expected/typed_table.out
+++ b/src/test/regress/expected/typed_table.out
@@ -75,19 +75,12 @@ CREATE TABLE persons4 OF person_type (
     name WITH OPTIONS DEFAULT ''  -- error, specified more than once
 );
 ERROR:  column "name" specified more than once
+\set VERBOSITY terse
 DROP TYPE person_type RESTRICT;
 ERROR:  cannot drop type person_type because other objects depend on it
-DETAIL:  table persons depends on type person_type
-function get_all_persons() depends on type person_type
-table persons2 depends on type person_type
-table persons3 depends on type person_type
-HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 DROP TYPE person_type CASCADE;
 NOTICE:  drop cascades to 4 other objects
-DETAIL:  drop cascades to table persons
-drop cascades to function get_all_persons()
-drop cascades to table persons2
-drop cascades to table persons3
+\set VERBOSITY default
 CREATE TABLE persons5 OF stuff; -- only CREATE TYPE AS types may be used
 ERROR:  type stuff is not a composite type
 DROP TABLE stuff;
diff --git a/src/test/regress/expected/updatable_views.out b/src/test/regress/expected/updatable_views.out
index e64d693e9c..8eca01a8e7 100644
--- a/src/test/regress/expected/updatable_views.out
+++ b/src/test/regress/expected/updatable_views.out
@@ -328,24 +328,10 @@ UPDATE ro_view20 SET b=upper(b);
 ERROR:  cannot update view "ro_view20"
 DETAIL:  Views that return set-returning functions are not automatically updatable.
 HINT:  To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 16 other objects
-DETAIL:  drop cascades to view ro_view1
-drop cascades to view ro_view17
-drop cascades to view ro_view2
-drop cascades to view ro_view3
-drop cascades to view ro_view5
-drop cascades to view ro_view6
-drop cascades to view ro_view7
-drop cascades to view ro_view8
-drop cascades to view ro_view9
-drop cascades to view ro_view11
-drop cascades to view ro_view13
-drop cascades to view rw_view15
-drop cascades to view rw_view16
-drop cascades to view ro_view20
-drop cascades to view ro_view4
-drop cascades to view rw_view14
+\set VERBOSITY default
 DROP VIEW ro_view10, ro_view12, ro_view18;
 DROP SEQUENCE uv_seq CASCADE;
 NOTICE:  drop cascades to view ro_view19
@@ -1054,10 +1040,10 @@ SELECT * FROM base_tbl;
 (2 rows)
 
 RESET SESSION AUTHORIZATION;
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+\set VERBOSITY default
 -- nested-view permissions
 CREATE TABLE base_tbl(a int, b text, c float);
 INSERT INTO base_tbl VALUES (1, 'Row 1', 1.0);
@@ -1178,10 +1164,10 @@ ERROR:  permission denied for table base_tbl
 UPDATE rw_view2 SET b = 'bar' WHERE a = 1;  -- not allowed
 ERROR:  permission denied for table base_tbl
 RESET SESSION AUTHORIZATION;
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+\set VERBOSITY default
 DROP USER regress_view_user1;
 DROP USER regress_view_user2;
 -- column defaults
@@ -1439,11 +1425,10 @@ SELECT events & 4 != 0 AS upd,
  f   | f   | t
 (1 row)
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
-drop cascades to view rw_view3
+\set VERBOSITY default
 -- inheritance tests
 CREATE TABLE base_tbl_parent (a int);
 CREATE TABLE base_tbl_child (CHECK (a > 0)) INHERITS (base_tbl_parent);
@@ -1540,10 +1525,10 @@ SELECT * FROM base_tbl_child ORDER BY a;
  20
 (6 rows)
 
+\set VERBOSITY terse
 DROP TABLE base_tbl_parent, base_tbl_child CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+\set VERBOSITY default
 -- simple WITH CHECK OPTION
 CREATE TABLE base_tbl (a int, b int DEFAULT 10);
 INSERT INTO base_tbl VALUES (1,2), (2,3), (1,-1);
@@ -1711,10 +1696,10 @@ SELECT * FROM base_tbl;
   30
 (3 rows)
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+\set VERBOSITY default
 -- WITH CHECK OPTION with no local view qual
 CREATE TABLE base_tbl (a int);
 CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WITH CHECK OPTION;
@@ -1740,11 +1725,10 @@ INSERT INTO rw_view3 VALUES (-3); -- should fail
 ERROR:  new row violates check option for view "rw_view2"
 DETAIL:  Failing row contains (-3).
 INSERT INTO rw_view3 VALUES (3); -- ok
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
-drop cascades to view rw_view3
+\set VERBOSITY default
 -- WITH CHECK OPTION with scalar array ops
 CREATE TABLE base_tbl (a int, b int[]);
 CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a = ANY (b)
@@ -1911,10 +1895,10 @@ SELECT * FROM base_tbl;
   -5 | 10
 (7 rows)
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+\set VERBOSITY default
 DROP FUNCTION rw_view1_trig_fn();
 CREATE TABLE base_tbl (a int);
 CREATE VIEW rw_view1 AS SELECT a,10 AS b FROM base_tbl;
@@ -1923,10 +1907,10 @@ CREATE RULE rw_view1_ins_rule AS ON INSERT TO rw_view1
 CREATE VIEW rw_view2 AS
   SELECT * FROM rw_view1 WHERE a > b WITH LOCAL CHECK OPTION;
 INSERT INTO rw_view2 VALUES (2,3); -- ok, but not in view (doesn't fail rw_view2's check)
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+\set VERBOSITY default
 -- security barrier view
 CREATE TABLE base_tbl (person text, visibility text);
 INSERT INTO base_tbl VALUES ('Tom', 'public'),
@@ -2111,10 +2095,10 @@ EXPLAIN (costs off) DELETE FROM rw_view2 WHERE NOT snoop(person);
          Filter: ((visibility = 'public'::text) AND snoop(person) AND (NOT snoop(person)))
 (3 rows)
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+\set VERBOSITY default
 -- security barrier view on top of table with rules
 CREATE TABLE base_tbl(id int PRIMARY KEY, data text, deleted boolean);
 INSERT INTO base_tbl VALUES (1, 'Row 1', false), (2, 'Row 2', true);
diff --git a/src/test/regress/sql/collate.sql b/src/test/regress/sql/collate.sql
index 4ddde95a5e..94ef4e277e 100644
--- a/src/test/regress/sql/collate.sql
+++ b/src/test/regress/sql/collate.sql
@@ -260,4 +260,6 @@ SELECT collation for ((SELECT b FROM collate_test1 LIMIT 1));
 -- must get rid of them.
 --
 \set VERBOSITY terse
+SET client_min_messages TO 'warning';
 DROP SCHEMA collate_tests CASCADE;
+RESET client_min_messages;
diff --git a/src/test/regress/sql/domain.sql b/src/test/regress/sql/domain.sql
index 68da27de22..d19e2c9d28 100644
--- a/src/test/regress/sql/domain.sql
+++ b/src/test/regress/sql/domain.sql
@@ -381,7 +381,9 @@ alter domain dnotnulltest drop not null;
 
 update domnotnull set col1 = null;
 
+\set VERBOSITY terse
 drop domain dnotnulltest cascade;
+\set VERBOSITY default
 
 -- Test ALTER DOMAIN .. DEFAULT ..
 create table domdeftest (col1 ddef1);
diff --git a/src/test/regress/sql/foreign_key.sql b/src/test/regress/sql/foreign_key.sql
index d2cecdf4eb..2c26191980 100644
--- a/src/test/regress/sql/foreign_key.sql
+++ b/src/test/regress/sql/foreign_key.sql
@@ -159,9 +159,11 @@ UPDATE PKTABLE SET ptest1=1 WHERE ptest1=2;
 SELECT * FROM FKTABLE;
 
 -- this should fail for lack of CASCADE
+\set VERBOSITY terse
 DROP TABLE PKTABLE;
 DROP TABLE PKTABLE CASCADE;
 DROP TABLE FKTABLE;
+\set VERBOSITY default
 
 
 --
diff --git a/src/test/regress/sql/truncate.sql b/src/test/regress/sql/truncate.sql
index 6ddfb6dd1d..fee7e76ec3 100644
--- a/src/test/regress/sql/truncate.sql
+++ b/src/test/regress/sql/truncate.sql
@@ -125,7 +125,9 @@ SELECT * FROM trunc_fa;
 SELECT * FROM trunc_faa;
 ROLLBACK;
 
+\set VERBOSITY terse
 DROP TABLE trunc_f CASCADE;
+\set VERBOSITY default
 
 -- Test ON TRUNCATE triggers
 
diff --git a/src/test/regress/sql/typed_table.sql b/src/test/regress/sql/typed_table.sql
index 9ef0cdfcc7..953cd1f14b 100644
--- a/src/test/regress/sql/typed_table.sql
+++ b/src/test/regress/sql/typed_table.sql
@@ -43,8 +43,10 @@ CREATE TABLE persons4 OF person_type (
     name WITH OPTIONS DEFAULT ''  -- error, specified more than once
 );
 
+\set VERBOSITY terse
 DROP TYPE person_type RESTRICT;
 DROP TYPE person_type CASCADE;
+\set VERBOSITY default
 
 CREATE TABLE persons5 OF stuff; -- only CREATE TYPE AS types may be used
 
diff --git a/src/test/regress/sql/updatable_views.sql b/src/test/regress/sql/updatable_views.sql
index dc6d5cbe35..9103793ff4 100644
--- a/src/test/regress/sql/updatable_views.sql
+++ b/src/test/regress/sql/updatable_views.sql
@@ -98,7 +98,9 @@ DELETE FROM ro_view18;
 UPDATE ro_view19 SET last_value=1000;
 UPDATE ro_view20 SET b=upper(b);
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 DROP VIEW ro_view10, ro_view12, ro_view18;
 DROP SEQUENCE uv_seq CASCADE;
 
@@ -457,7 +459,9 @@ DELETE FROM rw_view2 WHERE aa=4; -- not allowed
 SELECT * FROM base_tbl;
 RESET SESSION AUTHORIZATION;
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 
 -- nested-view permissions
 
@@ -533,7 +537,9 @@ UPDATE rw_view2 SET b = 'bar' WHERE a = 1;  -- not allowed
 
 RESET SESSION AUTHORIZATION;
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 
 DROP USER regress_view_user1;
 DROP USER regress_view_user2;
@@ -678,7 +684,9 @@ SELECT events & 4 != 0 AS upd,
        events & 16 != 0 AS del
   FROM pg_catalog.pg_relation_is_updatable('rw_view3'::regclass, false) t(events);
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 
 -- inheritance tests
 
@@ -710,7 +718,9 @@ DELETE FROM ONLY rw_view2 WHERE a IN (-8, 8); -- Should delete -8 only
 SELECT * FROM ONLY base_tbl_parent ORDER BY a;
 SELECT * FROM base_tbl_child ORDER BY a;
 
+\set VERBOSITY terse
 DROP TABLE base_tbl_parent, base_tbl_child CASCADE;
+\set VERBOSITY default
 
 -- simple WITH CHECK OPTION
 
@@ -772,7 +782,9 @@ SELECT * FROM information_schema.views WHERE table_name = 'rw_view2';
 INSERT INTO rw_view2 VALUES (30); -- ok, but not in view
 SELECT * FROM base_tbl;
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 
 -- WITH CHECK OPTION with no local view qual
 
@@ -790,7 +802,9 @@ INSERT INTO rw_view2 VALUES (2); -- ok
 INSERT INTO rw_view3 VALUES (-3); -- should fail
 INSERT INTO rw_view3 VALUES (3); -- ok
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 
 -- WITH CHECK OPTION with scalar array ops
 
@@ -918,7 +932,9 @@ INSERT INTO rw_view2 VALUES (5); -- ok
 UPDATE rw_view2 SET a = -5 WHERE a = 5; -- ok, but not in view (doesn't fail rw_view2's check)
 SELECT * FROM base_tbl;
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 DROP FUNCTION rw_view1_trig_fn();
 
 CREATE TABLE base_tbl (a int);
@@ -928,7 +944,9 @@ CREATE RULE rw_view1_ins_rule AS ON INSERT TO rw_view1
 CREATE VIEW rw_view2 AS
   SELECT * FROM rw_view1 WHERE a > b WITH LOCAL CHECK OPTION;
 INSERT INTO rw_view2 VALUES (2,3); -- ok, but not in view (doesn't fail rw_view2's check)
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 
 -- security barrier view
 
@@ -1012,7 +1030,9 @@ EXPLAIN (costs off) SELECT * FROM rw_view2 WHERE snoop(person);
 EXPLAIN (costs off) UPDATE rw_view2 SET person=person WHERE snoop(person);
 EXPLAIN (costs off) DELETE FROM rw_view2 WHERE NOT snoop(person);
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 
 -- security barrier view on top of table with rules
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9fe950b29d..08cf72d670 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -167,6 +167,8 @@ BTArrayKeyInfo
 BTBuildState
 BTCycleId
 BTIndexStat
+BTInsertionKey
+BTInsertionKeyData
 BTLeader
 BTMetaPageData
 BTOneVacInfo
@@ -2207,6 +2209,8 @@ SpecialJoinInfo
 SpinDelayStatus
 SplitInterval
 SplitLR
+SplitMode
+SplitPoint
 SplitVar
 SplitedPageLayout
 StackElem
-- 
2.17.1

In reply to: Peter Geoghegan (#24)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Wed, Oct 3, 2018 at 4:39 PM Peter Geoghegan <pg@bowt.ie> wrote:

I did find a pretty clear regression, though only with writes to
unique indexes. Attached is v6, which fixes the issue. More on that
below.

I've been benchmarking my patch using oltpbench's TPC-C benchmark
these past few weeks, which has been very frustrating -- the picture
is very mixed. I'm testing a patch that has evolved from v6, but isn't
too different.

In one way, the patch does exactly what it's supposed to do when these
benchmarks are run: it leaves indexes *significantly* smaller than the
master branch will on the same (rate-limited) workload, without
affecting the size of tables in any noticeable way. The numbers that I
got from my much earlier synthetic single client benchmark mostly hold
up. For example, the stock table's primary key is about 35% smaller,
and the order line index is only about 20% smaller relative to master,
which isn't quite as good as in the synthetic case, but I'll take it
(this is all because of the
v6-0003-Add-split-at-new-tuple-page-split-optimization.patch stuff).
However, despite significant effort, and despite the fact that the
index shrinking is reliable, I cannot yet consistently show an increase
in transaction throughput or a reduction in transaction latency.
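
For anyone who wants to repeat the on-disk size comparison, something
along these lines works (a sketch only -- it assumes the contrib
pgstattuple extension is installed, and uses index names that
oltpbench's TPC-C schema creates):

CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT i.relname,
       pg_size_pretty(pg_relation_size(i.oid)) AS size,
       s.tree_level,
       s.leaf_pages,
       s.avg_leaf_density
FROM pg_class i
CROSS JOIN LATERAL pgstatindex(i.oid) s
WHERE i.relname IN ('stock_pkey', 'oorder_pkey', 'order_line_pkey')
ORDER BY pg_relation_size(i.oid) DESC;

Running that against a patched and an unpatched cluster after the same
workload shows both the total index size and the leaf density.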

I can show a nice improvement in latency on a slightly-rate-limited
TPC-C workload when backend_flush_after=0 (something like a 40%
reduction on average), but that doesn't hold up when oltpbench isn't
rate-limited and/or has backend_flush_after set. Usually, there is a
1% - 2% regression, despite the big improvements in index size, and
despite the big reduction in the amount of buffers that backends must
write out themselves.

The obvious explanation is that throughput is decreased due to our
doing extra work (truncation) while under an exclusive buffer lock.
However, I've worked hard on that, and, as I said, I can sometimes
observe a nice improvement in latency. This makes me doubt the obvious
explanation. My working theory is that this has something to do with
shared_buffers eviction. Maybe we're making worse decisions about
which buffer to evict, or maybe the scalability of eviction is hurt.
Perhaps both.
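
One way to poke at that theory (again just a sketch -- it assumes the
contrib pg_buffercache extension is installed, and uses index names
from this workload) is to snapshot the usage_count distribution for the
hottest indexes on both builds at the same point in a run:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT c.relname,
       b.usagecount,
       count(*) AS buffers,
       count(*) FILTER (WHERE b.isdirty) AS dirty
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
  AND c.relname IN ('stock_pkey', 'oorder_pkey', 'order_line_pkey')
GROUP BY 1, 2
ORDER BY 1, 2;

If eviction decisions really are getting worse, the shape of that
usagecount histogram ought to look noticeably different between master
and the patch for the same index.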

You can download results from a recent benchmark to get some sense of
this. It includes latency and throughput graphs, plus detailed
statistics collector stats:

https://drive.google.com/file/d/1oIjJ3YpSPiyRV_KF6cAfAi4gSm7JdPK1/view?usp=sharing
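
For anyone comparing runs themselves, the counters for buffers written
out by backends live in the standard pg_stat_bgwriter view; it can be
sampled before and after each run, and reset in between, with something
like:

SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint,
       buffers_clean, maxwritten_clean, buffers_backend,
       buffers_backend_fsync, buffers_alloc
FROM pg_stat_bgwriter;

SELECT pg_stat_reset_shared('bgwriter');  -- between runs
SELECT pg_stat_reset();                   -- per-database counters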

I would welcome any theories as to what could be the problem here. I'm
think that this is fixable, since the picture for the patch is very
positive, provided you only focus on bgwriter/checkpoint activity and
on-disk sizes. It seems likely that there is a very specific gap in my
understanding of how the patch affects buffer cleaning.

--
Peter Geoghegan

#26 Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#25)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Hi,

On 2018-10-18 12:54:27 -0700, Peter Geoghegan wrote:

I can show a nice improvement in latency on a slightly-rate-limited
TPC-C workload when backend_flush_after=0 (something like a 40%
reduction on average), but that doesn't hold up when oltpbench isn't
rate-limited and/or has backend_flush_after set. Usually, there is a
1% - 2% regression, despite the big improvements in index size, and
despite the big reduction in the amount of buffers that backends must
write out themselves.

What kind of backend_flush_after values were you trying?
backend_flush_after=0 obviously is the default, so I'm not clear on
that. How large is the database here, and how high is shared_buffers?

The obvious explanation is that throughput is decreased due to our
doing extra work (truncation) while under an exclusive buffer lock.
However, I've worked hard on that, and, as I said, I can sometimes
observe a nice improvement in latency. This makes me doubt the obvious
explanation. My working theory is that this has something to do with
shared_buffers eviction. Maybe we're making worse decisions about
which buffer to evict, or maybe the scalability of eviction is hurt.
Perhaps both.

Is it possible that there are new / prolonged cases where a buffer is read
from disk after the patch? Because that might require doing *write* IO
when evicting the previous contents of the victim buffer, and obviously
that can take longer if you're running with backend_flush_after > 0.

I wonder if it'd make sense to hack up a patch that logs when evicting a
buffer while already holding another lwlock. That shouldn't be too hard.

You can download results from a recent benchmark to get some sense of
this. It includes latency and throughput graphs, plus detailed
statistics collector stats:

https://drive.google.com/file/d/1oIjJ3YpSPiyRV_KF6cAfAi4gSm7JdPK1/view?usp=sharing

I'm unclear on which runs are what here. I assume "public" is your
patchset, and master is master? Do you reset the stats in between runs?

Greetings,

Andres Freund

In reply to: Andres Freund (#26)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Shared_buffers is 10gb iirc. The server has 32gb of memory. Yes, 'public'
is the patch case. Sorry for not mentioning it initially.

--
Peter Geoghegan
(Sent from my phone)

In reply to: Andres Freund (#26)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Thu, Oct 18, 2018 at 1:44 PM Andres Freund <andres@anarazel.de> wrote:

What kind of backend_flush_after values were you trying?
backend_flush_after=0 obviously is the default, so I'm not clear on
that. How large is the database here, and how high is shared_buffers?

I *was* trying backend_flush_after=512kB, but it's
backend_flush_after=0 in the benchmark I posted. See the
"postgres*settings" files.

On the master branch, things looked like this after the last run:

pg@tpcc_oltpbench[15547]=# \dt+
List of relations
 Schema │    Name    │ Type  │ Owner │   Size   │ Description
────────┼────────────┼───────┼───────┼──────────┼─────────────
 public │ customer   │ table │ pg    │ 4757 MB  │
 public │ district   │ table │ pg    │ 5240 kB  │
 public │ history    │ table │ pg    │ 1442 MB  │
 public │ item       │ table │ pg    │ 10192 kB │
 public │ new_order  │ table │ pg    │ 140 MB   │
 public │ oorder     │ table │ pg    │ 1185 MB  │
 public │ order_line │ table │ pg    │ 19 GB    │
 public │ stock      │ table │ pg    │ 9008 MB  │
 public │ warehouse  │ table │ pg    │ 4216 kB  │
(9 rows)

pg@tpcc_oltpbench[15547]=# \di+
List of relations
 Schema │                 Name                 │ Type  │ Owner │   Table    │  Size   │ Description
────────┼──────────────────────────────────────┼───────┼───────┼────────────┼─────────┼─────────────
 public │ customer_pkey                        │ index │ pg    │ customer   │ 367 MB  │
 public │ district_pkey                        │ index │ pg    │ district   │ 600 kB  │
 public │ idx_customer_name                    │ index │ pg    │ customer   │ 564 MB  │
 public │ idx_order                            │ index │ pg    │ oorder     │ 715 MB  │
 public │ item_pkey                            │ index │ pg    │ item       │ 2208 kB │
 public │ new_order_pkey                       │ index │ pg    │ new_order  │ 188 MB  │
 public │ oorder_o_w_id_o_d_id_o_c_id_o_id_key │ index │ pg    │ oorder     │ 715 MB  │
 public │ oorder_pkey                          │ index │ pg    │ oorder     │ 958 MB  │
 public │ order_line_pkey                      │ index │ pg    │ order_line │ 9624 MB │
 public │ stock_pkey                           │ index │ pg    │ stock      │ 904 MB  │
 public │ warehouse_pkey                       │ index │ pg    │ warehouse  │ 56 kB   │
(11 rows)

Is it possible that there's new / prolonged cases where a buffer is read
from disk after the patch? Because that might require doing *write* IO
when evicting the previous contents of the victim buffer, and obviously
that can take longer if you're running with backend_flush_after > 0.

Yes, I suppose that that's possible, because the buffer
popularity/usage_count will be affected in ways that cannot easily be
predicted. However, I'm not running with "backend_flush_after > 0"
here -- that was before.

I wonder if it'd make sense to hack up a patch that logs when evicting a
buffer while already holding another lwlock. That shouldn't be too hard.

I'll look into this.

Thanks
--
Peter Geoghegan

In reply to: Andres Freund (#26)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Thu, Oct 18, 2018 at 1:44 PM Andres Freund <andres@anarazel.de> wrote:

I wonder if it'd make sense to hack up a patch that logs when evicting a
buffer while already holding another lwlock. That shouldn't be too hard.

I tried this. It looks like we're calling FlushBuffer() with more than
a single LWLock held (not just the single buffer lock) somewhat *less*
with the patch. This is a positive sign for the patch, but also means
that I'm no closer to figuring out what's going on.

I tested a case with a 1GB shared_buffers + a TPC-C database sized at
about 10GB. I didn't want the extra LOG instrumentation to influence
the outcome.

--
Peter Geoghegan

#30 Andrey Lepikhov
a.lepikhov@postgrespro.ru
In reply to: Peter Geoghegan (#25)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 19.10.2018 0:54, Peter Geoghegan wrote:

I would welcome any theories as to what could be the problem here. I
think that this is fixable, since the picture for the patch is very
positive, provided you only focus on bgwriter/checkpoint activity and
on-disk sizes. It seems likely that there is a very specific gap in my
understanding of how the patch affects buffer cleaning.

I have the same problem with the background heap & index cleaner (based
on your patch). In this case the bottleneck is the WAL record that I
need to write for each cleaned block, plus the locks that are held while
the WAL record is being written.
Maybe you could run a test that doesn't write any data to disk?

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

In reply to: Andrey Lepikhov (#30)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Oct 23, 2018 at 11:35 AM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

I have the same problem with the background heap & index cleaner (based
on your patch). In this case the bottleneck is the WAL record that I
need to write for each cleaned block, plus the locks that are held while
the WAL record is being written.

Part of the problem here is that v6 uses up to 25 candidate split
points, even during regular calls to _bt_findsplitloc(). That was
based on some synthetic test-cases. I've found that I can get most of
the benefit in index size with far fewer split points, though. The
extra work done with an exclusive buffer lock held will be
considerably reduced in v7. I'll probably post that in a couple of
weeks, since I'm in Europe for pgConf.EU. I don't fully understand the
problems here, but even still I know that what you were testing wasn't
very well optimized for write-heavy workloads. It would be especially
bad with pgbench, since there isn't much opportunity to reduce the
size of indexes there.

Maybe you could run a test that doesn't write any data to disk?

Yeah, I should test that on its own. I'm particularly interested in
TPC-C, because it's a particularly good target for my patch. I can
find a way of only executing the read-only TPC-C queries, to see how
they perform on their own. TPC-C is particularly write-heavy, especially
compared to the much more recent though less influential TPC-E
benchmark.

--
Peter Geoghegan

#32 Andrey Lepikhov
a.lepikhov@postgrespro.ru
In reply to: Peter Geoghegan (#31)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

I have done a code review. For now, it covers the first patch --
v6-0001... -- which is dedicated to logical duplicate ordering.

The documentation is complete and clear. All non-trivial logic is
commented accurately.

The patch applies cleanly on top of current master. Regression tests
pass, and my "Retail Indextuple deletion" use cases work without
problems. But I have two comments on the code.

The new BTScanInsert structure reduces the parameter lists of many
functions and looks fine. However, it contains an optimization part
(the 'restorebinsrch' field et al.) that is used very locally in the
code -- only in the _bt_findinsertloc() -> _bt_binsrch() call path.
Maybe you could move this logic into a separate struct that is passed
to _bt_binsrch() as a pointer; other routines could simply pass NULL.
That might make the struct easier to use.

Due to the optimization, _bt_binsrch() has roughly doubled in size.
Maybe you could move that logic into a separate service routine?

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

In reply to: Andrey Lepikhov (#32)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Fri, Nov 2, 2018 at 3:06 AM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

The documentation is complete and clear. All non-trivial logic is
commented accurately.

Glad you think so.

I had the opportunity to discuss this patch at length with Heikki
during pgConf.EU. I don't want to speak on his behalf, but I will say
that he seemed to understand all aspects of the patch series, and
seemed generally well disposed towards the high level design. The
high-level design is the most important aspect -- B-Trees can be
optimized in many ways, all at once, and we must be sure to come up
with something that enables most or all of them. I really care about
the long term perspective.

That conversation with Heikki eventually turned into a conversation
about reimplementing GIN using the nbtree code, which is actually
related to my patch series (sorting on heap TID is the first step
toward optional run-length encoding for duplicates). Heikki seemed to think
that we can throw out a lot of the optimizations within GIN, and add a
few new ones to nbtree, while still coming out ahead. This made the
general nbtree-as-GIN idea (which we've been talking about casually
for years) seem a lot more realistic to me. Anyway, he requested that
I support this long term goal by getting rid of the DESC TID sort
order thing -- that breaks GIN-style TID compression. It also
increases the WAL volume unnecessarily when a page is split that
contains all duplicates.

The DESC heap TID sort order thing probably needs to go. I'll probably
have to go fix the regression test failures that occur when ASC heap
TID order is used. (Technically those failures are a pre-existing
problem, a problem that I mask by using DESC order...which is weird.
The problem is masked in the master branch by accidental behaviors
around nbtree duplicates, which is something that deserves to die.
DESC order is closer to the accidental current behavior.)

The patch applies cleanly on top of current master. Regression tests
pass, and my "Retail Indextuple deletion" use cases work without problems.

Cool.

The new BTScanInsert structure reduces the parameter lists of many
functions and looks fine. However, it contains an optimization part
(the 'restorebinsrch' field et al.) that is used very locally in the
code -- only in the _bt_findinsertloc() -> _bt_binsrch() call path.
Maybe you could move this logic into a separate struct that is passed
to _bt_binsrch() as a pointer; other routines could simply pass NULL.
That might make the struct easier to use.

Hmm. I see your point. I did it that way because the knowledge of
having cached an upper and lower bound for a binary search of a leaf
page needs to last for a relatively long time. I'll look into it
again, though.

Due to the optimization, _bt_binsrch() has roughly doubled in size.
Maybe you could move that logic into a separate service routine?

Maybe. There are some tricky details that seem to work against it.
I'll see if it's possible to polish that some more, though.

--
Peter Geoghegan

#34 Andrey Lepikhov
a.lepikhov@postgrespro.ru
In reply to: Peter Geoghegan (#33)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 03.11.2018 5:00, Peter Geoghegan wrote:

The DESC heap TID sort order thing probably needs to go. I'll probably
have to go fix the regression test failures that occur when ASC heap
TID order is used. (Technically those failures are a pre-existing
problem, a problem that I mask by using DESC order...which is weird.
The problem is masked in the master branch by accidental behaviors
around nbtree duplicates, which is something that deserves to die.
DESC order is closer to the accidental current behavior.)

I applied your patches on top of master. After correcting the tests
(related to TID ordering in index relations during the DROP...CASCADE
operation), 'make check-world' passed successfully many times.
In the case of the 'create view' regression test -- the 'drop cascades
to 62 other objects' problem -- I verified an Álvaro Herrera hypothesis
[1] and it holds. You can verify it by tracking the return value of the
object_address_present_add_flags() routine.
Some doubt remains, however, about the 'triggers' test.
Could you specify which test failures you mean?

[1]: /messages/by-id/20180504022601.fflymidf7eoencb2@alvherre.pgsql

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

In reply to: Peter Geoghegan (#33)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Fri, Nov 2, 2018 at 5:00 PM Peter Geoghegan <pg@bowt.ie> wrote:

I had the opportunity to discuss this patch at length with Heikki
during pgConf.EU.

The DESC heap TID sort order thing probably needs to go. I'll probably
have to go fix the regression test failures that occur when ASC heap
TID order is used.

I've found that TPC-C testing with ASC heap TID order fixes the
regression that I've been concerned about these past few weeks. Making
this change leaves the patch a little bit faster than the master
branch for TPC-C, while still leaving TPC-C indexes about as small as
they were with v6 of the patch (i.e. much smaller). I now get about a
1% improvement in transaction throughput, an improvement that seems
fairly consistent. It seems likely that the next revision of the patch
series will be an unambiguous, across-the-board win for performance. I
think that I come out ahead with ASC heap TID order because that has
the effect of reducing the volume of WAL generated by page splits.
Page splits are already optimized for splitting right, not left.

I should thank Heikki for pointing me in the right direction here.

--
Peter Geoghegan

In reply to: Andrey Lepikhov (#34)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sat, Nov 3, 2018 at 8:52 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

I applied your patches on top of master. After correcting the tests
(related to TID ordering in index relations during the DROP...CASCADE
operation), 'make check-world' passed successfully many times.
In the case of the 'create view' regression test -- the 'drop cascades
to 62 other objects' problem -- I verified an Álvaro Herrera hypothesis
[1] and it holds. You can verify it by tracking the return value of the
object_address_present_add_flags() routine.

I'll have to go and fix the problem directly, so that ASC sort order
can be used.

Some doubt remains, however, about the 'triggers' test.
Could you specify which test failures you mean?

Not sure what you mean. The order of items that are listed in the
DETAIL for a cascading DROP can have an "implementation defined"
order. I think that this is an example of the more general problem --
what you call the 'drop cascades to 62 other objects' problem is a
more specific subproblem, or, if you prefer, a more specific symptom
of the same problem.

Since I'm going to have to fix the problem head-on, I'll have to study
it in detail anyway.

--
Peter Geoghegan

#37 Andrey Lepikhov
a.lepikhov@postgrespro.ru
In reply to: Peter Geoghegan (#36)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 04.11.2018 9:31, Peter Geoghegan wrote:

On Sat, Nov 3, 2018 at 8:52 PM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

I applied your patches on top of master. After correcting the tests
(related to TID ordering in index relations during the DROP...CASCADE
operation), 'make check-world' passed successfully many times.
In the case of the 'create view' regression test -- the 'drop cascades
to 62 other objects' problem -- I verified an Álvaro Herrera hypothesis
[1] and it holds. You can verify it by tracking the return value of the
object_address_present_add_flags() routine.

I'll have to go and fix the problem directly, so that ASC sort order
can be used.

Some doubt remains, however, about the 'triggers' test.
Could you specify which test failures you mean?

Not sure what you mean. The order of items that are listed in the
DETAIL for a cascading DROP can have an "implementation defined"
order. I think that this is an example of the more general problem --
what you call the 'drop cascades to 62 other objects' problem is a
more specific subproblem, or, if you prefer, a more specific symptom
of the same problem.

I mean that your code has no problems that I can detect with the
regression tests or with the retail index tuple deletion patch.
The difference in the number of dropped objects is not a problem. It is
caused by line 2293 -- 'else if (thisobj->objectSubId == 0)' -- in
catalog/dependency.c, and it is legal behavior: the column object is
deleted without any report because we already decided to drop its
whole table.

Also, I checked the triggers test. The difference in the ERROR message
'cannot drop trigger trg1' is caused by a different order of tuples in
the relation with the dependDependerIndexId relid. It is legal
behavior, and we can simply replace the test results.

Maybe you know of other problems with the patch?

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

In reply to: Andrey Lepikhov (#37)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sun, Nov 4, 2018 at 8:21 AM Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

I mean that your code has no problems that I can detect with the
regression tests or with the retail index tuple deletion patch.
The difference in the number of dropped objects is not a problem. It is
caused by line 2293 -- 'else if (thisobj->objectSubId == 0)' -- in
catalog/dependency.c, and it is legal behavior: the column object is
deleted without any report because we already decided to drop its
whole table.

The behavior implied by using ASC heap TID order is always "legal",
but it may cause a regression in certain functionality -- something
that an ordinary user might complain about. There were some changes
when DESC heap TID order is used too, of course, but those were safe
to ignore (it seemed like nobody could ever care). It might have been
okay to just use DESC order, but since it now seems like I must use
ASC heap TID order for performance reasons, I have to tackle a couple
of these issues head-on (e.g. 'cannot drop trigger trg1').

Also, I checked the triggers test. The difference in the ERROR message
'cannot drop trigger trg1' is caused by a different order of tuples in
the relation with the dependDependerIndexId relid. It is legal
behavior, and we can simply replace the test results.

Let's look at this specific "trg1" case:

"""
 create table trigpart (a int, b int) partition by range (a);
 create table trigpart1 partition of trigpart for values from (0) to (1000);
 create trigger trg1 after insert on trigpart for each row execute
procedure trigger_nothing();
 ...
 drop trigger trg1 on trigpart1; -- fail
-ERROR:  cannot drop trigger trg1 on table trigpart1 because trigger
trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart1 because table
trigpart1 requires it
+HINT:  You can drop table trigpart1 instead.
"""

The original hint suggests "you need to drop the object on the
partition parent instead of its child", which is useful. The new hint
suggests "instead of dropping the trigger on the partition child,
maybe drop the child itself!". That's almost an insult to the user.

Now, I suppose that I could claim that it's not my responsibility to
fix this, since we get the useful behavior only due to accidental
implementation details. I'm not going to take that position, though. I
think that I am obliged to follow both the letter and the spirit of
the law. I'm almost certain that this regression test was written
because somebody specifically cared about getting the original, useful
message. The underlying assumptions may have been a bit shaky, but we
all know how common it is for software to evolve to depend on
implementation-defined details. We've all written code that does it,
but hopefully it didn't hurt us much because we also wrote regression
tests that exercised the useful behavior.

Maybe you know of other problems with the patch?

Just the lack of pg_upgrade support. That is progressing nicely,
though. I'll probably have that part in the next revision of the
patch. I've found what looks like a workable approach, though I need
to work on a testing strategy for pg_upgrade.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#38)
6 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sun, Nov 4, 2018 at 10:58 AM Peter Geoghegan <pg@bowt.ie> wrote:

Just the lack of pg_upgrade support.

Attached is v7 of the patch series. Changes:

* Pre-pg_upgrade indexes (indexes of an earlier BTREE_VERSION) are now
supported. Using pg_upgrade will be seamless to users. "Getting tired"
returns, for the benefit of old indexes that regularly have lots of
duplicates inserted.

Notably, the new/proposed version of btree (BTREE_VERSION 4) cannot be
upgraded on-the-fly -- we're changing more than the contents of the
metapage, so that won't work. Version 2 -> version 3 upgrades can
still take place dynamically/on-the-fly. If you want to upgrade to
version 4, you'll need to REINDEX. The performance of the patch with
pg_upgrade'd indexes has been validated. There don't seem to be any
regressions.

amcheck checks both the old invariants and the new/stricter/L&Y
invariants. Which set is checked depends on the btree version of the
index undergoing verification.

* ASC heap TID order is now used -- not DESC order, as before. This
fixed all performance regressions that I'm aware of, and seems quite a
lot more elegant overall.

I believe that the patch series is now an unambiguous,
across-the-board win for performance. I could see about a 1% increase in
transaction throughput with my own TPC-C tests, while the big drop in
the size of indexes was preserved. pgbench testing also showed as much
as a 3.5% increase in transaction throughput in some cases with
non-uniform distributions. Thanks for the suggestion, Heikki!

Unfortunately, and as predicted, this change created a new problem
that I need to fix directly: it makes certain diagnostic messages that
accidentally depend on a certain pg_depend scan order say something
different, and less useful (though still technically correct). I'll
tackle that problem over on the dedicated thread I started [1]. (For
now, I include a separate patch to paper over questionable regression
test changes in a controlled way:
v7-0005-Temporarily-paper-over-problematic-regress-output.patch.)

* New optimization that has index scans avoid visiting the next page
by checking the high key -- this is broken out into its own commit
(v7-0004-Add-high-key-continuescan-optimization.patch).

This is related to an optimization that has been around for years --
we're now using the high key, rather than using a normal (non-pivot)
index tuple. High keys are much more likely to indicate that the scan
doesn't need to visit the next page with the earlier patches in the
patch series applied, since the new logic for choosing a split point
favors a high key with earlier differences. It's pretty easy to take
advantage of that. With a composite index, or a secondary index, it's
particularly likely that we can avoid visiting the next leaf page. In
other words, now that we're being smarter about future locality of
access during page splits, we should take full advantage during index
scans.

The v7-0001-Make-nbtree-indexes-have-unique-keys-in-tuples.patch
commit uses a _bt_lowest_scantid() sentinel value to avoid
unnecessarily visiting a page to the left of the page we actually
ought to go to directly during a descent of a B-Tree -- that
optimization was around in all earlier versions of the patch series.
It seems natural to also have this new-to-v7 optimization. It avoids
unnecessarily going right once we reach the leaf level, so it "does
the same thing on the right side" -- the two optimizations mirror each
other. If you don't get what I mean by that, then imagine a secondary
index where each value appears a few hundred times. Literally every
simple lookup query will either benefit from the first optimization on
the way down the tree, or from the second optimization towards the end
of the scan. (The page split logic ought to pack large groups of
duplicates together, ideally confining them to one leaf page.)

Andrey: the BTScanInsert struct still has the restorebinsrch stuff
(mutable binary search optimization state) in v7. It seemed to make
sense to keep it there, because I think that we'll be able to add
similar optimizations in the future, that use similar mutable state.
See my remarks on "dynamic prefix truncation" [2]. I think that that
could be very helpful with skip scans, for example, so we'll probably
end up adding it before too long. I hope you don't feel too strongly
about it.

[1]: /messages/by-id/CAH2-Wzkypv1R+teZrr71U23J578NnTBt2X8+Y=Odr4pOdW1rXg@mail.gmail.com
[2]: /messages/by-id/CAH2-WzkpKeZJrXvR_p7VSY1b-s85E3gHyTbZQzR0BkJ5LrWF_A@mail.gmail.com
--
Peter Geoghegan

Attachments:

v7-0005-Temporarily-paper-over-problematic-regress-output.patch (application/x-patch)
From b5e992de5146166d6c5c50a169c9e1ede6f1f785 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Nov 2018 10:27:16 -0800
Subject: [PATCH v7 5/6] Temporarily paper-over problematic regress output.

The unique-nbtree-keys patch series unmasked the fact that
findDependentObjects() has a dependency on scan order that causes
certain diagnostic messages to be less useful, though still technically
accurate.  Similar issues exist when "ignore_system_indexes=on" is used.
Affected messages often involve DROP ... CASCADE and table partitioning,
though there are probably other areas that are affected.

I (Peter Geoghegan) have resolved to deal with these issues head-on.  A
dedicated -hackers thread has been started to discuss the issue [1].
However, since the problem is currently unresolved, temporarily paper
over the problematic regression test output.

[1] https://postgr.es/m/CAH2-Wzkypv1R+teZrr71U23J578NnTBt2X8+Y=Odr4pOdW1rXg@mail.gmail.com
---
 contrib/earthdistance/expected/earthdistance.out |  2 +-
 src/test/regress/expected/create_view.out        |  2 +-
 src/test/regress/expected/event_trigger.out      | 16 ++++++++--------
 src/test/regress/expected/indexing.out           | 12 ++++++------
 src/test/regress/expected/triggers.out           | 12 ++++++------
 5 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/contrib/earthdistance/expected/earthdistance.out b/contrib/earthdistance/expected/earthdistance.out
index 26a843c3fa..4395e619de 100644
--- a/contrib/earthdistance/expected/earthdistance.out
+++ b/contrib/earthdistance/expected/earthdistance.out
@@ -972,7 +972,7 @@ SELECT abs(cube_distance(ll_to_earth(-30,-90), '(0)'::cube) / earth() - 1) <
 
 drop extension cube;  -- fail, earthdistance requires it
 ERROR:  cannot drop extension cube because other objects depend on it
-DETAIL:  extension earthdistance depends on extension cube
+DETAIL:  extension earthdistance depends on function cube_out(cube)
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 drop extension earthdistance;
 drop type cube;  -- fail, extension cube requires it
diff --git a/src/test/regress/expected/create_view.out b/src/test/regress/expected/create_view.out
index 141fc6da62..8abcd7b3d9 100644
--- a/src/test/regress/expected/create_view.out
+++ b/src/test/regress/expected/create_view.out
@@ -1711,4 +1711,4 @@ select pg_get_ruledef(oid, true) from pg_rewrite
 DROP SCHEMA temp_view_test CASCADE;
 NOTICE:  drop cascades to 27 other objects
 DROP SCHEMA testviewschm2 CASCADE;
-NOTICE:  drop cascades to 62 other objects
+NOTICE:  drop cascades to 63 other objects
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index 0755931db8..ec4f6c300f 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -406,19 +406,19 @@ DROP INDEX evttrig.one_idx;
 NOTICE:  NORMAL: orig=t normal=f istemp=f type=index identity=evttrig.one_idx name={evttrig,one_idx} args={}
 DROP SCHEMA evttrig CASCADE;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table evttrig.one
+DETAIL:  drop cascades to table evttrig.parted
 drop cascades to table evttrig.two
-drop cascades to table evttrig.parted
+drop cascades to table evttrig.one
 NOTICE:  NORMAL: orig=t normal=f istemp=f type=schema identity=evttrig name={evttrig} args={}
+NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.parted name={evttrig,parted} args={}
+NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.part_10_20 name={evttrig,part_10_20} args={}
+NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.part_15_20 name={evttrig,part_15_20} args={}
+NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.part_10_15 name={evttrig,part_10_15} args={}
+NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.part_1_10 name={evttrig,part_1_10} args={}
+NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.two name={evttrig,two} args={}
 NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.one name={evttrig,one} args={}
 NOTICE:  NORMAL: orig=f normal=t istemp=f type=sequence identity=evttrig.one_col_a_seq name={evttrig,one_col_a_seq} args={}
 NOTICE:  NORMAL: orig=f normal=t istemp=f type=default value identity=for evttrig.one.col_a name={evttrig,one,col_a} args={}
-NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.two name={evttrig,two} args={}
-NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.parted name={evttrig,parted} args={}
-NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.part_1_10 name={evttrig,part_1_10} args={}
-NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.part_10_20 name={evttrig,part_10_20} args={}
-NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.part_10_15 name={evttrig,part_10_15} args={}
-NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.part_15_20 name={evttrig,part_15_20} args={}
 DROP TABLE a_temp_tbl;
 NOTICE:  NORMAL: orig=t normal=f istemp=t type=table identity=pg_temp.a_temp_tbl name={pg_temp,a_temp_tbl} args={}
 DROP EVENT TRIGGER regress_event_trigger_report_dropped;
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index ca27346f18..17e8d3f136 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -148,8 +148,8 @@ create table idxpart (a int) partition by range (a);
 create index on idxpart (a);
 create table idxpart1 partition of idxpart for values from (0) to (10);
 drop index idxpart1_a_idx;	-- no way
-ERROR:  cannot drop index idxpart1_a_idx because index idxpart_a_idx requires it
-HINT:  You can drop index idxpart_a_idx instead.
+ERROR:  cannot drop index idxpart1_a_idx because column a of table idxpart1 requires it
+HINT:  You can drop column a of table idxpart1 instead.
 drop index idxpart_a_idx;	-- both indexes go away
 select relname, relkind from pg_class
   where relname like 'idxpart%' order by relname;
@@ -998,11 +998,11 @@ select indrelid::regclass, indexrelid::regclass, inhparent::regclass, indisvalid
 (3 rows)
 
 drop index idxpart0_pkey;								-- fail
-ERROR:  cannot drop index idxpart0_pkey because index idxpart_pkey requires it
-HINT:  You can drop index idxpart_pkey instead.
+ERROR:  cannot drop index idxpart0_pkey because constraint idxpart0_pkey on table idxpart0 requires it
+HINT:  You can drop constraint idxpart0_pkey on table idxpart0 instead.
 drop index idxpart1_pkey;								-- fail
-ERROR:  cannot drop index idxpart1_pkey because index idxpart_pkey requires it
-HINT:  You can drop index idxpart_pkey instead.
+ERROR:  cannot drop index idxpart1_pkey because constraint idxpart1_pkey on table idxpart1 requires it
+HINT:  You can drop constraint idxpart1_pkey on table idxpart1 instead.
 alter table idxpart0 drop constraint idxpart0_pkey;		-- fail
 ERROR:  cannot drop inherited constraint "idxpart0_pkey" of relation "idxpart0"
 alter table idxpart1 drop constraint idxpart1_pkey;		-- fail
diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
index 70b7c6eead..ae04a89622 100644
--- a/src/test/regress/expected/triggers.out
+++ b/src/test/regress/expected/triggers.out
@@ -1896,14 +1896,14 @@ select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
 (4 rows)
 
 drop trigger trg1 on trigpart1;	-- fail
-ERROR:  cannot drop trigger trg1 on table trigpart1 because trigger trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart1 because table trigpart1 requires it
+HINT:  You can drop table trigpart1 instead.
 drop trigger trg1 on trigpart2;	-- fail
-ERROR:  cannot drop trigger trg1 on table trigpart2 because trigger trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart2 because table trigpart2 requires it
+HINT:  You can drop table trigpart2 instead.
 drop trigger trg1 on trigpart3;	-- fail
-ERROR:  cannot drop trigger trg1 on table trigpart3 because trigger trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart3 because table trigpart3 requires it
+HINT:  You can drop table trigpart3 instead.
 drop table trigpart2;			-- ok, trigger should be gone in that partition
 select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
   where tgrelid::regclass::text like 'trigpart%' order by tgrelid::regclass::text;
-- 
2.17.1

v7-0006-DEBUG-Add-pageinspect-instrumentation.patch (application/x-patch)
From 3d8d375671e0361b3378ca204ae1d96de4a923de Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v7 6/6] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 68 +++++++++++++++----
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 ++++++
 3 files changed, 79 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index bee1f1c9d9..cea927d01b 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
@@ -242,6 +243,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -253,9 +255,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[7];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -264,6 +266,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer htid;
+	BTPageOpaque opaque;
 
 	id = PageGetItemId(page, offset);
 
@@ -282,16 +286,53 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = natts;
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		indnkeyatts = rel->rd_index->indnkeyatts;
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (P_ISLEAF(opaque) && offset >= P_FIRSTDATAKEY(opaque))
+		htid = &itup->t_tid;
+	else if (_bt_hasuniquekeys(rel))
+		htid = BTreeTupleGetHeapTID(itup);
+	else
+		htid = NULL;
+
+	if (htid)
+		values[j] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(htid),
+							 ItemPointerGetOffsetNumberNoCheck(htid));
+	else
+		values[j] = NULL;
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -365,11 +406,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -396,12 +437,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -481,7 +523,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

v7-0004-Add-high-key-continuescan-optimization.patch (application/x-patch)
From cdaf4608ce14aa2b1fef4b87d9d9191db5dc136a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 12 Nov 2018 13:11:21 -0800
Subject: [PATCH v7 4/6] Add high key "continuescan" optimization.

Teach B-Tree forward index scans to check the high key before moving to
the next page in the hopes of finding that it isn't actually necessary
to move to the next page.  We already opportunistically force a key
check of the last item on leaf pages, even when it's clear that it
cannot be returned to the scan due to being dead-to-all, for the same
reason.  Since forcing the last item to be key checked no longer makes
any difference in the case of forward scans, the existing extra key
check is now only used for backwards scans.  Like the existing check,
the new check won't always work out, but that seems like an acceptable
price to pay.

The new approach is more effective than just checking non-pivot tuples,
especially with composite indexes and non-unique indexes.  The high key
represents an upper bound on all values that can appear on the page,
which is often greater than whatever tuple happens to appear last at the
time of the check.  Also, suffix truncation's new logic for picking a
split point will often result in high keys that are relatively
dissimilar to the other (non-pivot) tuples on the page, and therefore
more likely to indicate that the scan need not proceed to the next page.
---
 src/backend/access/nbtree/nbtsearch.c | 23 +++++++---
 src/backend/access/nbtree/nbtutils.c  | 60 +++++++++++++++++++++------
 2 files changed, 65 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7305e647b2..0dc4a4ac98 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1429,7 +1429,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	OffsetNumber maxoff;
 	int			itemIndex;
 	IndexTuple	itup;
-	bool		continuescan;
+	bool		continuescan = true;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1497,16 +1497,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 				_bt_saveitem(so, itemIndex, offnum, itup);
 				itemIndex++;
 			}
+			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
-			{
-				/* there can't be any more matches, so stop */
-				so->currPos.moreRight = false;
 				break;
-			}
 
 			offnum = OffsetNumberNext(offnum);
 		}
 
+		/*
+		 * Forward scans need not visit page to the right when high key
+		 * indicates no more matches will be found there.
+		 *
+		 * Checking the high key like this works out more often than you'd
+		 * think.  Leaf page splits pick a split point between the two most
+		 * dissimilar tuples within a range of acceptable split points.  There
+		 * is often natural locality around what ends up on each leaf page,
+		 * which is worth taking advantage of here.
+		 */
+		if (!P_RIGHTMOST(opaque) && continuescan)
+			(void) _bt_checkkeys(scan, page, P_HIKEY, dir, &continuescan);
+
+		if (!continuescan)
+			so->currPos.moreRight = false;
+
 		Assert(itemIndex <= MaxIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 982115549a..9a1988fc58 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -48,7 +48,7 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
 static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
-					 IndexTuple tuple, TupleDesc tupdesc,
+					 IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
 static int _bt_leave_natts(Relation rel, IndexTuple lastleft,
 				IndexTuple firstright, bool build);
@@ -1393,7 +1393,10 @@ _bt_mark_scankey_required(ScanKey skey)
  * dir: direction we are scanning in
  * continuescan: output parameter (will be set correctly in all cases)
  *
- * Caller must hold pin and lock on the index page.
+ * Caller must hold pin and lock on the index page.  Caller can pass a high
+ * key offnum in the hopes of discovering that the scan need not continue on
+ * to a page to the right.  We don't currently bother limiting high key
+ * comparisons to SK_BT_REQFWD scan keys.
  */
 IndexTuple
 _bt_checkkeys(IndexScanDesc scan,
@@ -1403,6 +1406,7 @@ _bt_checkkeys(IndexScanDesc scan,
 	ItemId		iid = PageGetItemId(page, offnum);
 	bool		tuple_alive;
 	IndexTuple	tuple;
+	int			tupnatts;
 	TupleDesc	tupdesc;
 	BTScanOpaque so;
 	int			keysz;
@@ -1416,21 +1420,24 @@ _bt_checkkeys(IndexScanDesc scan,
 	 * killed tuple as not passing the qual.  Most of the time, it's a win to
 	 * not bother examining the tuple's index keys, but just return
 	 * immediately with continuescan = true to proceed to the next tuple.
-	 * However, if this is the last tuple on the page, we should check the
-	 * index keys to prevent uselessly advancing to the next page.
+	 * However, if this is the first tuple on the page, and we're doing a
+	 * backward scan, we should check the index keys to prevent uselessly
+	 * advancing to the page to the left.  This is similar to the high key
+	 * optimization used by forward scan callers.
 	 */
 	if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
 	{
-		/* return immediately if there are more tuples on the page */
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+		Assert(offnum != P_HIKEY || P_RIGHTMOST(opaque));
 		if (ScanDirectionIsForward(dir))
 		{
-			if (offnum < PageGetMaxOffsetNumber(page))
-				return NULL;
+			/* forward scan callers check high key instead */
+			return NULL;
 		}
 		else
 		{
-			BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
+			/* return immediately if there are more tuples on the page */
 			if (offnum > P_FIRSTDATAKEY(opaque))
 				return NULL;
 		}
@@ -1445,6 +1452,7 @@ _bt_checkkeys(IndexScanDesc scan,
 		tuple_alive = true;
 
 	tuple = (IndexTuple) PageGetItem(page, iid);
+	tupnatts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
 
 	tupdesc = RelationGetDescr(scan->indexRelation);
 	so = (BTScanOpaque) scan->opaque;
@@ -1456,11 +1464,24 @@ _bt_checkkeys(IndexScanDesc scan,
 		bool		isNull;
 		Datum		test;
 
-		Assert(key->sk_attno <= BTreeTupleGetNAtts(tuple, scan->indexRelation));
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (key->sk_attno > tupnatts)
+		{
+			Assert(offnum == P_HIKEY);
+			Assert(ScanDirectionIsForward(dir));
+			continue;
+		}
+
 		/* row-comparison keys need special processing */
 		if (key->sk_flags & SK_ROW_HEADER)
 		{
-			if (_bt_check_rowcompare(key, tuple, tupdesc, dir, continuescan))
+			if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+									 continuescan))
 				continue;
 			return NULL;
 		}
@@ -1591,8 +1612,8 @@ _bt_checkkeys(IndexScanDesc scan,
  * This is a subroutine for _bt_checkkeys, which see for more info.
  */
 static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
-					 ScanDirection dir, bool *continuescan)
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
 {
 	ScanKey		subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
 	int32		cmpresult = 0;
@@ -1609,6 +1630,19 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
 
 		Assert(subkey->sk_flags & SK_ROW_MEMBER);
 
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (subkey->sk_attno > tupnatts)
+		{
+			Assert(ScanDirectionIsForward(dir));
+			cmpresult = 0;
+			continue;
+		}
+
 		datum = index_getattr(tuple,
 							  subkey->sk_attno,
 							  tupdesc,
-- 
2.17.1

v7-0002-Weigh-suffix-truncation-when-choosing-a-split-poi.patch (application/x-patch)
From 7aec8a7529c298f3aef412207a01c72a75831bf5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 15:51:53 -0700
Subject: [PATCH v7 2/6] Weigh suffix truncation when choosing a split point.

Add infrastructure to determine where the earliest difference appears
among a pair of tuples enclosing a candidate split point.  Use this
within _bt_findsplitloc() to weigh how effective suffix truncation will
be.  This is primarily useful because it maximizes the effectiveness of
suffix truncation.  This should not noticeably affect the balance of
free space within each half of the split.

_bt_findsplitloc() is also taught to care about the case where there are
many duplicates, making it hard to find a distinguishing split point.
_bt_findsplitloc() may even conclude that it isn't possible to avoid
filling a page entirely with duplicates, in which case it packs pages
full of duplicates very tightly.

The number of cycles added is not very noticeable, which is important,
since _bt_findsplitloc() is run while an exclusive (leaf page) buffer
lock is held.  We avoid using authoritative insertion scankey
comparisons, unlike suffix truncation proper.

This patch is required to credibly assess anything about the performance
of the patch series.  Applying the patches up to and including this
patch in the series is sufficient to see much better space utilization
and space reuse with cases where many duplicates are inserted.  (Cases
resulting in searches for free space among many pages full of
duplicates, where the search inevitably "gets tired" on the master
branch [1]).

[1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
---
 src/backend/access/nbtree/README      |  65 ++-
 src/backend/access/nbtree/nbtinsert.c | 618 +++++++++++++++++++++++---
 src/backend/access/nbtree/nbtutils.c  |  78 ++++
 src/include/access/nbtree.h           |   8 +-
 4 files changed, 698 insertions(+), 71 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 43545311da..052890452f 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -159,9 +159,9 @@ Lehman and Yao assume fixed-size keys, but we must deal with
 variable-size keys.  Therefore there is not a fixed maximum number of
 keys per page; we just stuff in as many as will fit.  When we split a
 page, we try to equalize the number of bytes, not items, assigned to
-each of the resulting pages.  Note we must include the incoming item in
-this calculation, otherwise it is possible to find that the incoming
-item doesn't fit on the split page where it needs to go!
+pages (though suffix truncation is also considered).  Note we must include
+the incoming item in this calculation, otherwise it is possible to find
+that the incoming item doesn't fit on the split page where it needs to go!
 
 The Deletion Algorithm
 ----------------------
@@ -670,6 +670,65 @@ variable-length types, such as text.  An opclass support function could
 manufacture the shortest possible key value that still correctly separates
 each half of a leaf page split.
 
+There is sophisticated criteria for choosing a leaf page split point.  The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split.  The
+choice of split point can be thought of as a choice among points "between"
+items on the page to be split, at least if you pretend that the incoming
+tuple was placed on the page already, without provoking a split.  The split
+point between two index tuples with differences that appear as early as
+possible allows us to truncate away as many attributes as possible.
+
+Obviously suffix truncation is valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective.  There are cases where suffix truncation can
+leave a B-Tree significantly smaller in size than it would have otherwise
+been, without actually making any pivot tuple smaller due to restrictions
+relating to alignment.  The criteria for choosing a leaf page split point
+for suffix truncation is often also predictive of future space utilization.
+Furthermore, even truncation that doesn't make pivot tuples smaller still
+prevents pivot tuples from being more restrictive than truly necessary in
+how they describe which values belong on which leaf pages.
+
+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page.  The split point that implies a downlink be inserted in
+the parent that's the smallest one available within an acceptable range of
+the optimal fillfactor-wise split point is chosen.  This idea also comes
+from the Prefix B-Tree paper.  This process has much in common with to what
+happens at the leaf level to make suffix truncation effective.  The overall
+effect is that suffix truncation tends to produce smaller and less
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, less discriminating
+pivot tuples end up in the root page, delaying root page splits.
+
+As of Postgres v12, every tuple at the leaf level must be individually
+locatable by an insertion scankey that's fully filled-out by
+_bt_mkscankey().  Heap TID is treated as a tie-breaker key attribute to make
+this work.  Suffix truncation must occasionally make a pivot tuple *larger*
+than the leaf tuple that it's based on, since a heap TID must be appended
+when nothing else distinguishes each side of a leaf split.  This is not
+represented in the same way as it is at the leaf level (we must append an
+additional attribute), since pivot tuples already use the generic IndexTuple
+fields to describe which child page they point to, and how many attributes
+are in the pivot tuple.  Adding a heap TID attribute during a leaf page
+split should only occur when there is an entire page full of duplicates,
+though, since the logic for selecting a split point will do all it can to
+avoid this outcome --- it may apply "many duplicates" mode, or "single
+value" mode.
+
+Avoiding appending a heap TID to a pivot tuple is about much more than just
+saving a single MAXALIGN() quantum in each of the pages that store the new
+pivot.  It's worth going out of our way to avoid having a single value (or
+composition of key values) span two leaf pages when that isn't truly
+necessary, since if that's allowed to happen every point index scan will
+have to visit both pages.  It also makes it less likely that VACUUM will be
+able to perform page deletion on either page.  Finally, it's not unheard of
+for unique indexes to have pages full of duplicates in the event of extreme
+contention (which appears as buffer lock contention) --- this is also
+ameliorated.  These are all examples of how "false sharing" across B-Tree
+pages can cause performance problems.
+
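To make the heap TID tie-breaker rule described above concrete, here is a
minimal standalone sketch of comparing two leaf tuples with heap TID as an
implicit trailing attribute.  The DemoTid/DemoTuple types and demo_compare()
are hypothetical stand-ins for illustration only, not the patch's actual
IndexTuple/_bt_compare() machinery:

    #include <stdint.h>
    #include <stdio.h>

    typedef struct DemoTid { uint32_t block; uint16_t offset; } DemoTid;
    typedef struct DemoTuple { int key; DemoTid tid; } DemoTuple;

    static int
    demo_compare(const DemoTuple *a, const DemoTuple *b)
    {
        /* user-visible key attribute first */
        if (a->key != b->key)
            return (a->key < b->key) ? -1 : 1;

        /* keys equal: heap TID breaks the tie, so no two tuples compare equal */
        if (a->tid.block != b->tid.block)
            return (a->tid.block < b->tid.block) ? -1 : 1;
        if (a->tid.offset != b->tid.offset)
            return (a->tid.offset < b->tid.offset) ? -1 : 1;
        return 0;
    }

    int
    main(void)
    {
        DemoTuple   t1 = {42, {10, 1}};
        DemoTuple   t2 = {42, {10, 2}};

        printf("%d\n", demo_compare(&t1, &t2));    /* -1: equal keys, lower TID */
        return 0;
    }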
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 77bc6ee9b3..7d5710ae09 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,25 +28,41 @@
 
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
+#define STACK_SPLIT_POINTS			7
+
+typedef enum
+{
+	/* strategy to use for a call to FindSplitData */
+	SPLIT_DEFAULT,				/* give some weight to truncation */
+	SPLIT_MANY_DUPLICATES,		/* find minimally distinguishing point */
+	SPLIT_SINGLE_VALUE			/* leave left page almost empty */
+} SplitMode;
+
+typedef struct
+{
+	/* FindSplitData candidate split */
+	int			delta;			/* size delta */
+	bool		newitemonleft;	/* new item on left or right of split */
+	OffsetNumber firstright;	/* split point */
+} SplitPoint;
 
 typedef struct
 {
 	/* context data for _bt_checksplitloc */
 	Size		newitemsz;		/* size of new item to be inserted */
-	int			fillfactor;		/* needed when splitting rightmost page */
+	double		fillfactor;		/* needed for weighted splits */
+	int			goodenough;		/* good enough free space delta */
 	bool		is_leaf;		/* T if splitting a leaf page */
-	bool		is_rightmost;	/* T if splitting a rightmost page */
+	bool		is_weighted;	/* T if weighted (e.g. rightmost) split */
 	OffsetNumber newitemoff;	/* where the new item is to be inserted */
+	bool		hikeyheaptid;	/* T if high key will likely get heap TID */
 	int			leftspace;		/* space available for items on left page */
 	int			rightspace;		/* space available for items on right page */
 	int			olddataitemstotal;	/* space taken by old items */
 
-	bool		have_split;		/* found a valid split? */
-
-	/* these fields valid only if have_split is true */
-	bool		newitemonleft;	/* new item on left or right of best split */
-	OffsetNumber firstright;	/* best split point */
-	int			best_delta;		/* best size delta so far */
+	int			maxsplits;		/* Maximum number of split points */
+	int			nsplits;		/* Current number of split points */
+	SplitPoint *splits;			/* Sorted by delta */
 } FindSplitData;
 
 
@@ -76,12 +92,22 @@ static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft);
-static void _bt_checksplitloc(FindSplitData *state,
+				 SplitMode mode, OffsetNumber newitemoff,
+				 Size newitemsz, IndexTuple newitem, bool *newitemonleft);
+static int _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright, bool newitemonleft,
 				  int dataitemstoleft, Size firstoldonrightsz);
+static OffsetNumber _bt_bestsplitloc(Relation rel, Page page,
+				 FindSplitData *state,
+				 int perfectpenalty,
+				 OffsetNumber newitemoff,
+				 IndexTuple newitem, bool *newitemonleft);
+static int _bt_perfect_penalty(Relation rel, Page page, bool is_leaf,
+					SplitMode mode, OffsetNumber newitemoff,
+					IndexTuple newitem, int nsplits,
+					SplitPoint *splits, SplitMode *secondmode);
+static int _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split, bool is_leaf);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey,
@@ -950,8 +976,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* Choose the split point */
-		firstright = _bt_findsplitloc(rel, page,
-									  newitemoff, itemsz,
+		firstright = _bt_findsplitloc(rel, page, SPLIT_DEFAULT,
+									  newitemoff, itemsz, itup,
 									  &newitemonleft);
 
 		/*
@@ -1641,6 +1667,30 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * for it, we might find ourselves with too little room on the page that
  * it needs to go into!)
  *
+ * We also give some weight to suffix truncation in deciding a split point
+ * on leaf pages.  We try to select a point where a distinguishing attribute
+ * appears earlier in the new high key for the left side of the split, in
+ * order to maximize the number of trailing attributes that can be truncated
+ * away.  Generally speaking, only candidate split points that fall within
+ * an acceptable space utilization range are considered.  This is even
+ * useful with pages that only have a single (non-TID) attribute, since it's
+ * helpful to avoid appending an explicit heap TID attribute to the new pivot
+ * tuple (high key/downlink) when it cannot actually be truncated.  Note that
+ * suffix truncation is always assumed, even with pg_upgrade'd indexes that
+ * won't actually go on to perform truncation.  There is still a modest
+ * benefit to choosing a split location while weighing suffix truncation: the
+ * resulting (untruncated) pivot tuples are nevertheless more predictive of
+ * future space utilization.
+ *
+ * We do all we can to avoid having to append a heap TID in the new high
+ * key.  We may have to call ourselves recursively in many duplicates mode.
+ * This happens when a heap TID would otherwise be appended, but the page
+ * isn't completely full of logical duplicates (there may be as few as two
+ * distinct values).  Many duplicates mode has no hard requirements for
+ * space utilization, though it still keeps the use of space balanced as a
+ * non-binding secondary goal.  This significantly improves fan-out in
+ * practice, at least with most affected workloads.
+ *
  * If the page is the rightmost page on its level, we instead try to arrange
  * to leave the left split page fillfactor% full.  In this way, when we are
  * inserting successively increasing keys (consider sequences, timestamps,
@@ -1649,6 +1699,16 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * This is the same as nbtsort.c produces for a newly-created tree.  Note
  * that leaf and nonleaf pages use different fillfactors.
  *
+ * If called recursively in single value mode, we also try to arrange to
+ * leave the left split page fillfactor% full, though we arrange to use a
+ * fillfactor that's even more left-heavy than the fillfactor used for
+ * rightmost pages.  This greatly helps with space management in cases where
+ * tuples with the same attribute values span multiple pages.  Newly
+ * inserted duplicates will tend to have higher heap TID values, so we'll
+ * end up splitting to the right in the manner of ascending insertions of
+ * monotonically increasing values.  See nbtree/README for more information
+ * about suffix truncation, and how a split point is chosen.
+ *
  * We are passed the intended insert position of the new tuple, expressed as
  * the offsetnumber of the tuple it must go in front of.  (This could be
  * maxoff+1 if the tuple is to go at the end.)
@@ -1661,8 +1721,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 static OffsetNumber
 _bt_findsplitloc(Relation rel,
 				 Page page,
+				 SplitMode mode,
 				 OffsetNumber newitemoff,
 				 Size newitemsz,
+				 IndexTuple newitem,
 				 bool *newitemonleft)
 {
 	BTPageOpaque opaque;
@@ -1672,15 +1734,16 @@ _bt_findsplitloc(Relation rel,
 	FindSplitData state;
 	int			leftspace,
 				rightspace,
-				goodenough,
 				olddataitemstotal,
-				olddataitemstoleft;
+				olddataitemstoleft,
+				perfectpenalty;
 	bool		goodenoughfound;
+	SplitPoint	splits[STACK_SPLIT_POINTS];
+	SplitMode	secondmode;
+	OffsetNumber finalfirstright;
 
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
-	newitemsz += sizeof(ItemIdData);
+	maxoff = PageGetMaxOffsetNumber(page);
 
 	/* Total free space available on a btree page, after fixed overhead */
 	leftspace = rightspace =
@@ -1698,18 +1761,44 @@ _bt_findsplitloc(Relation rel,
 	/* Count up total space in data items without actually scanning 'em */
 	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
 
-	state.newitemsz = newitemsz;
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	state.newitemsz = newitemsz + sizeof(ItemIdData);
+	state.hikeyheaptid = (mode == SPLIT_SINGLE_VALUE);
 	state.is_leaf = P_ISLEAF(opaque);
-	state.is_rightmost = P_RIGHTMOST(opaque);
-	state.have_split = false;
+	state.is_weighted = P_RIGHTMOST(opaque);
 	if (state.is_leaf)
-		state.fillfactor = RelationGetFillFactor(rel,
-												 BTREE_DEFAULT_FILLFACTOR);
+	{
+		if (mode != SPLIT_SINGLE_VALUE)
+		{
+			/* Only used on rightmost page */
+			state.fillfactor = RelationGetFillFactor(rel,
+													 BTREE_DEFAULT_FILLFACTOR) / 100.0;
+		}
+		else
+		{
+			state.fillfactor = BTREE_SINGLEVAL_FILLFACTOR / 100.0;
+			state.is_weighted = true;
+		}
+	}
 	else
-		state.fillfactor = BTREE_NONLEAF_FILLFACTOR;
-	state.newitemonleft = false;	/* these just to keep compiler quiet */
-	state.firstright = 0;
-	state.best_delta = 0;
+	{
+		Assert(mode == SPLIT_DEFAULT);
+		/* Only used on rightmost page */
+		state.fillfactor = BTREE_NONLEAF_FILLFACTOR / 100.0;
+	}
+
+	if (mode == SPLIT_DEFAULT)
+		state.maxsplits = Min(Max(1, maxoff / 16), STACK_SPLIT_POINTS);
+	else if (mode == SPLIT_MANY_DUPLICATES)
+		state.maxsplits = maxoff + 2;
+	else
+		state.maxsplits = 1;
+	state.nsplits = 0;
+	if (mode != SPLIT_MANY_DUPLICATES)
+		state.splits = splits;
+	else
+		state.splits = palloc(sizeof(SplitPoint) * state.maxsplits);
+
 	state.leftspace = leftspace;
 	state.rightspace = rightspace;
 	state.olddataitemstotal = olddataitemstotal;
@@ -1718,13 +1807,20 @@ _bt_findsplitloc(Relation rel,
 	/*
 	 * Finding the best possible split would require checking all the possible
 	 * split points, because of the high-key and left-key special cases.
-	 * That's probably more work than it's worth; instead, stop as soon as we
-	 * find a "good-enough" split, where good-enough is defined as an
-	 * imbalance in free space of no more than pagesize/16 (arbitrary...) This
-	 * should let us stop near the middle on most pages, instead of plowing to
-	 * the end.
+	 * That's probably more work than it's worth in default mode; instead,
+	 * stop as soon as we find all "good-enough" splits, where good-enough is
+	 * defined as an imbalance in free space of no more than pagesize/16
+	 * (arbitrary...) This should let us stop near the middle on most pages,
+	 * instead of plowing to the end.  Many duplicates mode must consider all
+	 * possible choices, and we set its goodenough rather high so we can
+	 * distinguish between marginal split points within _bt_checksplitloc.
+	 * Single value mode gives up as soon as it finds a good enough split
+	 * point.
 	 */
-	goodenough = leftspace / 16;
+	if (mode != SPLIT_MANY_DUPLICATES)
+		state.goodenough = leftspace / 16;
+	else
+		state.goodenough = leftspace;
 
 	/*
 	 * Scan through the data items and calculate space usage for a split at
@@ -1732,13 +1828,13 @@ _bt_findsplitloc(Relation rel,
 	 */
 	olddataitemstoleft = 0;
 	goodenoughfound = false;
-	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (offnum = P_FIRSTDATAKEY(opaque);
 		 offnum <= maxoff;
 		 offnum = OffsetNumberNext(offnum))
 	{
 		Size		itemsz;
+		int			delta;
 
 		itemid = PageGetItemId(page, offnum);
 		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
@@ -1747,27 +1843,41 @@ _bt_findsplitloc(Relation rel,
 		 * Will the new item go to left or right of split?
 		 */
 		if (offnum > newitemoff)
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
+			delta = _bt_checksplitloc(&state, offnum, true,
+									  olddataitemstoleft, itemsz);
 
 		else if (offnum < newitemoff)
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
+			delta = _bt_checksplitloc(&state, offnum, false,
+									  olddataitemstoleft, itemsz);
 		else
 		{
 			/* need to try it both ways! */
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
+			(void) _bt_checksplitloc(&state, offnum, true,
+									 olddataitemstoleft, itemsz);
 
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
+			delta = _bt_checksplitloc(&state, offnum, false,
+									  olddataitemstoleft, itemsz);
 		}
 
-		/* Abort scan once we find a good-enough choice */
-		if (state.have_split && state.best_delta <= goodenough)
-		{
+		/* Record when good-enough choice found */
+		if (state.nsplits > 0 && state.splits[0].delta <= state.goodenough)
 			goodenoughfound = true;
-			break;
+
+		/*
+		 * Abort default mode scan once we've found a good-enough choice, and
+		 * reach the point where we stop finding new good-enough choices.
+		 * Might as well abort as soon as a good-enough split point is found in
+		 * single value mode, where we won't discriminate between a selection
+		 * of split points based on their penalty.
+		 */
+		if (goodenoughfound)
+		{
+			if (mode == SPLIT_DEFAULT && delta > state.goodenough)
+				break;
+			else if (mode == SPLIT_SINGLE_VALUE)
+				break;
+
+			/* Many duplicates mode must be exhaustive */
 		}
 
 		olddataitemstoleft += itemsz;
@@ -1778,19 +1888,52 @@ _bt_findsplitloc(Relation rel,
 	 * the old items go to the left page and the new item goes to the right
 	 * page.
 	 */
-	if (newitemoff > maxoff && !goodenoughfound)
+	if (newitemoff > maxoff &&
+		(!goodenoughfound || mode == SPLIT_MANY_DUPLICATES))
 		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
 
 	/*
 	 * I believe it is not possible to fail to find a feasible split, but just
 	 * in case ...
 	 */
-	if (!state.have_split)
+	if (state.nsplits == 0)
 		elog(ERROR, "could not find a feasible split point for index \"%s\"",
 			 RelationGetRelationName(rel));
 
-	*newitemonleft = state.newitemonleft;
-	return state.firstright;
+	/*
+	 * Search among acceptable split points for the entry with the lowest
+	 * penalty.  See _bt_split_penalty() for the definition of penalty.  The
+	 * goal here is to increase fan-out, by choosing a split point which is
+	 * amenable to being made smaller by suffix truncation, or is already
+	 * small.
+	 *
+	 * First find lowest possible penalty among acceptable split points -- the
+	 * "perfect" penalty.  This will be passed to _bt_bestsplitloc() if it
+	 * determines that candidate split points are good enough to finish
+	 * default mode split.  Perfect penalty saves _bt_bestsplitloc()
+	 * additional work around calculating penalties.
+	 */
+	perfectpenalty = _bt_perfect_penalty(rel, page, state.is_leaf, mode,
+										 newitemoff, newitem,
+										 state.nsplits, state.splits,
+										 &secondmode);
+
+	/* Start second pass over page if _bt_perfect_penalty() told us to */
+	if (secondmode != SPLIT_DEFAULT)
+		return _bt_findsplitloc(rel, page, secondmode, newitemoff, newitemsz,
+								newitem, newitemonleft);
+
+	/*
+	 * Search among acceptable split points for the entry that has the lowest
+	 * penalty, and thus maximizes fan-out.
+	 */
+	finalfirstright = _bt_bestsplitloc(rel, page, &state, perfectpenalty,
+									   newitemoff, newitem, newitemonleft);
+	/* Be tidy */
+	if (state.splits != splits)
+		pfree(state.splits);
+
+	return finalfirstright;
 }
 
 /*
@@ -1805,8 +1948,11 @@ _bt_findsplitloc(Relation rel,
  *
  * olddataitemstoleft is the total size of all old items to the left of
  * firstoldonright.
+ *
+ * Returns the delta between the space that will be left free on the left
+ * and right sides of the split.
  */
-static void
+static int
 _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright,
 				  bool newitemonleft,
@@ -1814,7 +1960,8 @@ _bt_checksplitloc(FindSplitData *state,
 				  Size firstoldonrightsz)
 {
 	int			leftfree,
-				rightfree;
+				rightfree,
+				leftleafheaptidsz;
 	Size		firstrightitemsz;
 	bool		newitemisfirstonright;
 
@@ -1834,15 +1981,38 @@ _bt_checksplitloc(FindSplitData *state,
 
 	/*
 	 * The first item on the right page becomes the high key of the left page;
-	 * therefore it counts against left space as well as right space. When
+	 * therefore it counts against left space as well as right space (we
+	 * cannot assume that suffix truncation will make it any smaller).  When
 	 * index has included attributes, then those attributes of left page high
 	 * key will be truncated leaving that page with slightly more free space.
 	 * However, that shouldn't affect our ability to find valid split
-	 * location, because anyway split location should exists even without high
-	 * key truncation.
+	 * location, since we err in the direction of being pessimistic about free
+	 * space on the left half.  Besides, even when suffix truncation of
+	 * non-TID attributes occurs, there often won't be an entire MAXALIGN()
+	 * quantum in pivot space savings.
 	 */
 	leftfree -= firstrightitemsz;
 
+	/*
+	 * Assume that suffix truncation cannot avoid adding a heap TID to the
+	 * left half's new high key when splitting at the leaf level.  Don't let
+	 * this impact the balance of free space in the common case where adding a
+	 * heap TID is considered very unlikely, though, since there is no reason
+	 * to accept a likely-suboptimal split.
+	 *
+	 * When adding a heap TID seems likely, then actually factor that in to
+	 * delta calculation, rather than just having it as a constraint on
+	 * whether or not a split is acceptable.
+	 */
+	leftleafheaptidsz = 0;
+	if (state->is_leaf)
+	{
+		if (!state->hikeyheaptid)
+			leftleafheaptidsz = sizeof(ItemPointerData);
+		else
+			leftfree -= (int) sizeof(ItemPointerData);
+	}
+
 	/* account for the new item */
 	if (newitemonleft)
 		leftfree -= (int) state->newitemsz;
@@ -1858,20 +2028,23 @@ _bt_checksplitloc(FindSplitData *state,
 			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
 
 	/*
-	 * If feasible split point, remember best delta.
+	 * If this is a feasible split point with a lower delta than that of the
+	 * most marginal split point so far, or we haven't run out of space for
+	 * split points, remember it.
 	 */
-	if (leftfree >= 0 && rightfree >= 0)
+	if (leftfree - leftleafheaptidsz >= 0 && rightfree >= 0)
 	{
 		int			delta;
 
-		if (state->is_rightmost)
+		if (state->is_weighted)
 		{
 			/*
-			 * If splitting a rightmost page, try to put (100-fillfactor)% of
-			 * free space on left page. See comments for _bt_findsplitloc.
+			 * If splitting a rightmost page, or in single value mode, try to
+			 * put (100-fillfactor)% of free space on left page. See comments
+			 * for _bt_findsplitloc.
 			 */
 			delta = (state->fillfactor * leftfree)
-				- ((100 - state->fillfactor) * rightfree);
+				- ((1.0 - state->fillfactor) * rightfree);
 		}
 		else
 		{
@@ -1881,14 +2054,325 @@ _bt_checksplitloc(FindSplitData *state,
 
 		if (delta < 0)
 			delta = -delta;
-		if (!state->have_split || delta < state->best_delta)
+		/* Don't recognize differences among marginal split points */
+		if (delta > state->goodenough)
+			delta = state->goodenough + 1;
+		if (state->nsplits < state->maxsplits ||
+			delta < state->splits[state->nsplits - 1].delta)
 		{
-			state->have_split = true;
-			state->newitemonleft = newitemonleft;
-			state->firstright = firstoldonright;
-			state->best_delta = delta;
+			SplitPoint	newsplit;
+			int			j;
+
+			newsplit.delta = delta;
+			newsplit.newitemonleft = newitemonleft;
+			newsplit.firstright = firstoldonright;
+
+			/*
+			 * Make space at the end of the state array for new candidate
+			 * split point if we haven't already reached the maximum number of
+			 * split points.
+			 */
+			if (state->nsplits < state->maxsplits)
+				state->nsplits++;
+
+			/*
+			 * Insert the new split point, keeping the array sorted by delta.
+			 * The entry displaced at the end of the array is either a garbage
+			 * still-uninitialized entry, or the most marginal real entry when
+			 * we already have as many split points as we're willing to consider.
+			 */
+			for (j = state->nsplits - 1;
+				 j > 0 && state->splits[j - 1].delta > newsplit.delta;
+				 j--)
+			{
+				state->splits[j] = state->splits[j - 1];
+			}
+			state->splits[j] = newsplit;
+		}
+
+		return delta;
+	}
+
+	return INT_MAX;
+}
+
+/*
+ * Subroutine to find the "best" split point among an array of acceptable
+ * candidate split points that split without there being an excessively high
+ * delta between the space left free on the left and right halves.  The "best"
+ * split point is the split point with the lowest penalty, which is an
+ * abstract idea whose definition varies depending on whether we're splitting
+ * at the leaf level, or an internal level.  See _bt_split_penalty() for the
+ * definition.
+ *
+ * "perfectpenalty" is assumed to be the lowest possible penalty among
+ * candidate split points.  This allows us to return early without wasting
+ * cycles on calculating the first differing attribute for all candidate
+ * splits when that clearly cannot improve our choice.  This optimization is
+ * important for several common cases, including insertion into a primary key
+ * index on an auto-incremented or monotonically increasing integer column.
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating if new item is on left of split
+ * point.
+ */
+static OffsetNumber
+_bt_bestsplitloc(Relation rel,
+				 Page page,
+				 FindSplitData *state,
+				 int perfectpenalty,
+				 OffsetNumber newitemoff,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	int			bestpenalty,
+				lowsplit;
+
+	/* No point calculating penalties in trivial cases */
+	if (perfectpenalty == INT_MAX || state->nsplits == 1)
+	{
+		*newitemonleft = state->splits[0].newitemonleft;
+		return state->splits[0].firstright;
+	}
+
+	/*
+	 * Now actually search among acceptable split points for the entry that
+	 * allows suffix truncation to truncate away the maximum possible number
+	 * of attributes.
+	 */
+	bestpenalty = INT_MAX;
+	lowsplit = 0;
+	for (int i = 0; i < state->nsplits; i++)
+	{
+		int			penalty;
+
+		penalty = _bt_split_penalty(rel, page, newitemoff, newitem,
+									state->splits + i, state->is_leaf);
+
+		if (penalty <= perfectpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+			break;
+		}
+
+		if (penalty < bestpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
 		}
 	}
+
+	*newitemonleft = state->splits[lowsplit].newitemonleft;
+	return state->splits[lowsplit].firstright;
+}
+
+/*
+ * Subroutine to find the lowest possible penalty for any acceptable candidate
+ * split point.  This may be lower than any real penalty for any of the
+ * candidate split points, in which case the optimization is ineffective.
+ * Split penalties are generally discrete rather than continuous, so an
+ * actually-obtainable penalty is the common case.
+ *
+ * This is also a convenient point to decide to either finish splitting
+ * the page using the default strategy, or, alternatively, to do a second pass
+ * over page using a different strategy.
+ */
+static int
+_bt_perfect_penalty(Relation rel, Page page, bool is_leaf, SplitMode mode,
+					OffsetNumber newitemoff, IndexTuple newitem,
+					int nsplits, SplitPoint *splits, SplitMode *secondmode)
+{
+	ItemId		itemid;
+	OffsetNumber center;
+	IndexTuple	leftmost,
+				rightmost;
+	int			perfectpenalty;
+
+	/* Assume that a second pass over page won't be required for now */
+	*secondmode = SPLIT_DEFAULT;
+
+	/*
+	 * During a many duplicates pass over page, we settle for a "perfect"
+	 * split point that merely avoids appending a heap TID in new pivot.
+	 * Appending a heap TID is harmful enough to fan-out that it's worth
+	 * avoiding at all costs, but it doesn't make sense to go to those lengths
+	 * to also be able to truncate an extra, earlier attribute.
+	 */
+	if (!is_leaf)
+		return MAXALIGN(sizeof(IndexTupleData) + 1);
+	else if (mode == SPLIT_MANY_DUPLICATES)
+		return IndexRelationGetNumberOfKeyAttributes(rel);
+	else if (mode == SPLIT_SINGLE_VALUE)
+		return INT_MAX;
+
+	/*
+	 * Complicated though common case -- leaf page default mode split.
+	 *
+	 * Iterate from the end of split array to the start, in search of the
+	 * firstright-wise leftmost and rightmost entries among acceptable split
+	 * points.  The split point with the lowest delta is at the start of the
+	 * array.  It is deemed to be the split point whose firstright offset is
+	 * at the center.  Split points with firstright offsets at both the left
+	 * and right extremes among acceptable split points will be found at the
+	 * end of caller's array.
+	 */
+	leftmost = NULL;
+	rightmost = NULL;
+	center = splits[0].firstright;
+
+	/*
+	 * Split points can be thought of as points _between_ tuples on the
+	 * original unsplit page image, at least if you pretend that the incoming
+	 * tuple is already on the page to be split (imagine that the original
+	 * unsplit page actually had enough space to fit the incoming tuple).  The
+	 * rightmost tuple is the tuple that is immediately to the right of a
+	 * split point that is itself rightmost.  Likewise, the leftmost tuple is
+	 * the tuple to the left of the leftmost split point.  This is slightly
+	 * arbitrary.
+	 *
+	 * When there are very few candidates, no sensible comparison can be made
+	 * here, resulting in caller selecting lowest delta/the center split point
+	 * by default.  No great care is taken around boundary cases where the
+	 * center split point has the same firstright offset as either the
+	 * leftmost or rightmost split points (i.e. only newitemonleft differs).
+	 * We expect to find leftmost and rightmost tuples almost immediately.
+	 */
+	perfectpenalty = IndexRelationGetNumberOfKeyAttributes(rel);
+	for (int j = nsplits - 1; j > 1; j--)
+	{
+		SplitPoint *split = splits + j;
+
+		if (!leftmost && split->firstright <= center)
+		{
+			if (split->newitemonleft && newitemoff == split->firstright)
+				leftmost = newitem;
+			else
+			{
+				itemid = PageGetItemId(page,
+									   OffsetNumberPrev(split->firstright));
+				leftmost = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+
+		if (!rightmost && split->firstright >= center)
+		{
+			if (!split->newitemonleft && newitemoff == split->firstright)
+				rightmost = newitem;
+			else
+			{
+				itemid = PageGetItemId(page, split->firstright);
+				rightmost = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+
+		if (leftmost && rightmost)
+		{
+			Assert(leftmost != rightmost);
+			perfectpenalty = _bt_leave_natts_fast(rel, leftmost, rightmost);
+			break;
+		}
+	}
+
+	/* Work out which type of second pass will be performed, if any */
+	if (perfectpenalty > IndexRelationGetNumberOfKeyAttributes(rel))
+	{
+		BTPageOpaque opaque;
+		OffsetNumber maxoff;
+		int			outerpenalty;
+
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		if (P_FIRSTDATAKEY(opaque) == newitemoff)
+			leftmost = newitem;
+		else
+		{
+			itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+			leftmost = (IndexTuple) PageGetItem(page, itemid);
+		}
+
+		if (newitemoff > maxoff)
+			rightmost = newitem;
+		else
+		{
+			itemid = PageGetItemId(page, maxoff);
+			rightmost = (IndexTuple) PageGetItem(page, itemid);
+		}
+
+		Assert(leftmost != rightmost);
+		outerpenalty = _bt_leave_natts_fast(rel, leftmost, rightmost);
+
+		/*
+		 * If page has many duplicates but is not entirely full of duplicates,
+		 * a many duplicates mode pass will be performed.  If page is entirely
+		 * full of duplicates, a single value mode pass will be performed.
+		 *
+		 * Caller should avoid a single value mode pass when incoming tuple
+		 * doesn't sort lowest among items on the page, though.  Instead, we
+		 * instruct caller to continue with original default mode split, since
+		 * an out-of-order new item suggests that newer tuples have come from
+		 * (non-HOT) updates, not inserts.  Evenly sharing space among each
+		 * half of the split avoids pathological performance.
+		 */
+		if (outerpenalty > IndexRelationGetNumberOfKeyAttributes(rel))
+		{
+			if (maxoff < newitemoff)
+				*secondmode = SPLIT_SINGLE_VALUE;
+			else
+			{
+				perfectpenalty = INT_MAX;
+				*secondmode = SPLIT_DEFAULT;
+			}
+		}
+		else
+			*secondmode = SPLIT_MANY_DUPLICATES;
+	}
+
+	return perfectpenalty;
+}
+
+/*
+ * Subroutine to find penalty for caller's candidate split point.
+ *
+ * On leaf pages, penalty is the attribute number that distinguishes each side
+ * of a split.  It's the last attribute that needs to be included in the new
+ * high key for the left page.  It can be greater than the number of key
+ * attributes in cases where a heap TID needs to be appended during truncation.
+ *
+ * On internal pages, penalty is simply the size of the first item on the
+ * right half of the split (excluding ItemId overhead) which becomes the new
+ * high key for the left page.  Internal page splits always use default mode.
+ */
+static int
+_bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split, bool is_leaf)
+{
+	ItemId		itemid;
+	IndexTuple	lastleft;
+	IndexTuple	firstright;
+
+	if (split->newitemonleft && newitemoff == split->firstright)
+		lastleft = newitem;
+	else
+	{
+		itemid = PageGetItemId(page, OffsetNumberPrev(split->firstright));
+		lastleft = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	if (!split->newitemonleft && newitemoff == split->firstright)
+		firstright = newitem;
+	else
+	{
+		itemid = PageGetItemId(page, split->firstright);
+		firstright = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	if (!is_leaf)
+		return IndexTupleSize(firstright);
+
+	Assert(lastleft != firstright);
+	return _bt_leave_natts_fast(rel, lastleft, firstright);
 }
 
 /*
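For reviewers trying to follow the new _bt_checksplitloc() bookkeeping above,
the following standalone sketch shows the general idea of keeping only the
lowest-delta candidate split points in a small array ordered by delta, with
the most marginal entry displaced once the array is full.  The Candidate type,
constant, and function names here are hypothetical, for illustration only:

    #include <stdio.h>

    #define MAX_CANDIDATES 7

    typedef struct { int delta; int firstright; } Candidate;

    static void
    record_candidate(Candidate *cands, int *ncands, Candidate newc)
    {
        int     j;

        if (*ncands < MAX_CANDIDATES)
            (*ncands)++;                    /* grow into unused slot */
        else if (newc.delta >= cands[*ncands - 1].delta)
            return;                         /* worse than the worst kept entry */

        /* shift larger-delta entries right; most marginal entry falls off the end */
        for (j = *ncands - 1; j > 0 && cands[j - 1].delta > newc.delta; j--)
            cands[j] = cands[j - 1];
        cands[j] = newc;
    }

    int
    main(void)
    {
        Candidate   cands[MAX_CANDIDATES];
        int         ncands = 0;
        int         deltas[] = {900, 120, 480, 60, 300, 750, 90, 30, 600, 210};

        for (int i = 0; i < 10; i++)
        {
            Candidate   c = {deltas[i], i + 1};

            record_candidate(cands, &ncands, c);
        }
        for (int i = 0; i < ncands; i++)
            printf("delta=%d firstright=%d\n", cands[i].delta, cands[i].firstright);
        return 0;
    }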
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 83298ff257..982115549a 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -22,6 +22,7 @@
 #include "access/relscan.h"
 #include "miscadmin.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -2325,6 +2326,83 @@ _bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	return leavenatts;
 }
 
+/*
+ * _bt_leave_natts_fast - fast, approximate variant of _bt_leave_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location.  A naive bitwise approach to datum comparisons is used to
+ * save cycles.  This is inherently approximate, but usually provides the same
+ * answer as the authoritative approach that _bt_leave_natts takes, since the
+ * vast majority of types in Postgres cannot be equal according to any
+ * available opclass unless they're bitwise equal.
+ *
+ * Testing has shown that an approach involving treating the tuple as a
+ * decomposed binary string would work almost as well as the approach taken
+ * here.  It would also be faster.  It might actually be necessary to go that
+ * way in the future, if suffix truncation is made sophisticated enough to
+ * truncate at a finer granularity (i.e. truncate within an attribute, rather
+ * than just truncating away whole attributes).  The current approach isn't
+ * markedly slower, since it works particularly well with the "perfect
+ * penalty" optimization (there are fewer, more expensive calls here).  It
+ * also works with INCLUDE indexes (indexes with non-key attributes) without
+ * any special effort.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in the new pivot tuple.
+ */
+int
+_bt_leave_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			result;
+
+	/*
+	 * Using authoritative comparisons makes no difference in almost all
+	 * cases. However, there are a small number of shipped opclasses where
+	 * there might occasionally be an inconsistency between the answers given
+	 * by this function and _bt_leave_natts().  This includes numeric_ops,
+	 * since display scale might vary among logically equal datums.
+	 * Case-insensitive collations may also be interesting.
+	 *
+	 * This is assumed to be okay, since there is no risk that inequality will
+	 * look like equality.  Suffix truncation may be less effective than it
+	 * could be in these narrow cases, but it should be impossible for caller
+	 * to spuriously perform a second pass to find a split location, where
+	 * evenly splitting the page is given secondary importance.
+	 */
+#ifdef AUTHORITATIVE_COMPARE_TEST
+	return _bt_leave_natts(rel, lastleft, firstright);
+#endif
+
+	result = 1;
+	for (int attnum = 1; attnum <= keysz; attnum++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		att = TupleDescAttr(itupdesc, attnum - 1);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			break;
+
+		result++;
+	}
+
+	return result;
+}
+
 /*
  *  _bt_check_natts() -- Verify tuple has expected number of attributes.
  *
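To illustrate what _bt_leave_natts_fast() (added just above) computes, here is
a standalone sketch that uses plain fixed-width integer keys in place of index
attribute datums; the function and array names are hypothetical.  A result one
greater than the number of key attributes means that only the heap TID can
distinguish the two tuples:

    #include <stdio.h>

    static int
    leave_natts_fast_demo(const int *lastleft, const int *firstright, int nkeyatts)
    {
        int     result = 1;

        for (int attnum = 1; attnum <= nkeyatts; attnum++)
        {
            /* stand-in for the bitwise datumIsEqual() test */
            if (lastleft[attnum - 1] != firstright[attnum - 1])
                break;
            result++;
        }
        return result;
    }

    int
    main(void)
    {
        int     lastleft[] = {7, 3, 100};
        int     firstright[] = {7, 3, 200};

        /* attributes 1 and 2 match; attribute 3 distinguishes the tuples */
        printf("%d\n", leave_natts_fast_demo(lastleft, firstright, 3));    /* 3 */
        return 0;
    }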
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7baef7d685..fc0f7dea18 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -144,11 +144,15 @@ typedef struct BTMetaPageData
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
  * a rightmost page; when splitting non-rightmost pages we try to
- * divide the data equally.
+ * divide the data equally.  When splitting a page that's entirely
+ * filled with a single value (duplicates), a higher "single value"
+ * fillfactor is applied instead, regardless of whether the page is
+ * a rightmost page.
  */
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#define BTREE_SINGLEVAL_FILLFACTOR	99
 
 /*
  *	In general, the btree code tries to localize its knowledge about
@@ -705,6 +709,8 @@ extern bool btproperty(Oid index_oid, int attno,
 		   bool *res, bool *isnull);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 					IndexTuple firstright, bool build);
+extern int _bt_leave_natts_fast(Relation rel, IndexTuple lastleft,
+					 IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap, Page page,
 					 IndexTuple newtup);
-- 
2.17.1
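
A side note on the switch to a fractional fillfactor in the weighted delta
calculation within _bt_checksplitloc(): a delta of zero now means the free
space is divided fillfactor% against (100-fillfactor)%.  The following
standalone sketch (illustrative names only, not the patch's code) shows the
arithmetic:

    #include <stdio.h>
    #include <stdlib.h>

    static int
    weighted_delta(double fillfactor, int leftfree, int rightfree)
    {
        /* delta == 0 when the left page ends up exactly fillfactor% full */
        return abs((int) (fillfactor * leftfree - (1.0 - fillfactor) * rightfree));
    }

    int
    main(void)
    {
        /* 90% fillfactor: ideal outcome leaves 9x as much free space on the right */
        printf("%d\n", weighted_delta(0.90, 100, 900));    /* 0 */
        printf("%d\n", weighted_delta(0.90, 500, 500));    /* 400 */
        return 0;
    }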

Attachment: v7-0003-Add-split-at-new-tuple-page-split-optimization.patch (application/x-patch)
From aa5e31644a490b21610957556ec34a7c468db928 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 16:48:08 -0700
Subject: [PATCH v7 3/6] Add split-at-new-tuple page split optimization.

Add additional heuristics to the algorithm for locating an optimal split
location.  New logic identifies localized monotonically increasing
values by recognizing adjacent heap TIDs.  Only non-rightmost pages are
affected, to preserve existing behavior.

This enhancement is new to version 6 of the patch series.  It has been
demonstrated to be very effective at avoiding index
bloat when initial bulk INSERTs for the TPC-C benchmark are run.
Evidently, the primary keys for all of the largest indexes in the TPC-C
schema are populated through localized, monotonically increasing values:

Master
======

order_line_pkey: 774 MB
stock_pkey: 181 MB
idx_customer_name: 107 MB
oorder_pkey: 78 MB
customer_pkey: 75 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 60 MB
new_order_pkey: 22 MB
item_pkey: 2216 kB
district_pkey: 40 kB
warehouse_pkey: 24 kB

Patch series, up to and including this commit
=============================================

order_line_pkey: 451 MB
stock_pkey: 114 MB
idx_customer_name: 105 MB
oorder_pkey: 45 MB
customer_pkey: 48 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 61 MB
new_order_pkey: 13 MB
item_pkey: 2216 kB
district_pkey: 40 kB
warehouse_pkey: 24 kB

Without this patch, but with all previous patches in the series, a much
more modest reduction in the volume of bloat occurs when the same test
case is run.  There is a reduction in the size of the largest index (the
order line primary key) of ~5% of its original size, whereas we see a
reduction of ~42% here.  (Note that the patch series generally has very
little advantage over master if the indexes are rebuilt via a REINDEX,
with or without this later commit.)

I (Peter Geoghegan) will provide reviewers with a convenient copy of
this test data if asked.  It comes from the oltpbench fair-use
implementation of TPC-C [1], but the same issue has independently been
observed with the BenchmarkSQL implementation of TPC-C [2].

Note that this commit also recognizes and prevents bloat with
monotonically *decreasing* tuple insertions (e.g., single-DESC-attribute
index on a date column).  Affected cases will typically leave their
index structure slightly smaller than an equivalent monotonically
increasing case would.

[1] http://oltpbenchmark.com
[2] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c
---
 src/backend/access/nbtree/nbtinsert.c | 191 +++++++++++++++++++++++++-
 1 file changed, 189 insertions(+), 2 deletions(-)
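
The interpolation of fillfactor% from the new item's offset (see the
_bt_findsplitloc() hunk below) can be summarized by this standalone sketch.
The constants and function name are illustrative stand-ins for the real
BTREE_* fillfactor macros and patch code:

    #include <stdio.h>

    #define DEMO_DEFAULT_FILLFACTOR 90
    #define DEMO_MIN_FILLFACTOR     10

    static double
    split_at_new_fillfactor(int newitemoff, int maxoff, int firstdatakey)
    {
        if (newitemoff > maxoff)            /* new item goes at the end of page */
            return DEMO_DEFAULT_FILLFACTOR / 100.0;
        if (newitemoff == firstdatakey)     /* new item goes at the start of page */
            return DEMO_MIN_FILLFACTOR / 100.0;

        /* otherwise, leave the items that sort before the new item on the left */
        return (double) newitemoff / ((double) maxoff + 1);
    }

    int
    main(void)
    {
        /* 200 items on the page, new tuple would land at offset 150 */
        printf("%.2f\n", split_at_new_fillfactor(150, 200, 1));    /* 0.75 */
        return 0;
    }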

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 7d5710ae09..296d42dc3b 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -97,6 +97,8 @@ static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 static int _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright, bool newitemonleft,
 				  int dataitemstoleft, Size firstoldonrightsz);
+static bool _bt_dosplitatnewitem(Relation rel, Page page,
+					OffsetNumber newitemoff, IndexTuple newitem);
 static OffsetNumber _bt_bestsplitloc(Relation rel, Page page,
 				 FindSplitData *state,
 				 int perfectpenalty,
@@ -108,6 +110,7 @@ static int _bt_perfect_penalty(Relation rel, Page page, bool is_leaf,
 					SplitPoint *splits, SplitMode *secondmode);
 static int _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
 				  IndexTuple newitem, SplitPoint *split, bool is_leaf);
+static bool _bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey,
@@ -1697,7 +1700,13 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * etc) we will end up with a tree whose pages are about fillfactor% full,
  * instead of the 50% full result that we'd get without this special case.
  * This is the same as nbtsort.c produces for a newly-created tree.  Note
- * that leaf and nonleaf pages use different fillfactors.
+ * that leaf and nonleaf pages use different fillfactors.  Note also that
+ * the fillfactor% is determined dynamically when _bt_dosplitatnewitem()
+ * indicates that there are localized monotonically increasing insertions,
+ * or monotonically decreasing (DESC order) insertions. (This can only
+ * happen with the default strategy, and should be thought of as a variant
+ * of the fillfactor% special case that is applied only when inserting into
+ * non-rightmost pages.)
  *
  * If called recursively in single value mode, we also try to arrange to
  * leave the left split page fillfactor% full, though we arrange to use a
@@ -1768,7 +1777,28 @@ _bt_findsplitloc(Relation rel,
 	state.is_weighted = P_RIGHTMOST(opaque);
 	if (state.is_leaf)
 	{
-		if (mode != SPLIT_SINGLE_VALUE)
+		/*
+		 * Consider split at new tuple optimization.  See
+		 * _bt_dosplitatnewitem() for an explanation.
+		 */
+		if (mode == SPLIT_DEFAULT && !P_RIGHTMOST(opaque) &&
+			_bt_dosplitatnewitem(rel, page, newitemoff, newitem))
+		{
+			/*
+			 * fillfactor% is dynamically set through interpolation of the
+			 * new/incoming tuple's offset position
+			 */
+			if (newitemoff > maxoff)
+				state.fillfactor = (double) BTREE_DEFAULT_FILLFACTOR / 100.0;
+			else if (newitemoff == P_FIRSTDATAKEY(opaque))
+				state.fillfactor = (double) BTREE_MIN_FILLFACTOR / 100.0;
+			else
+				state.fillfactor =
+					((double) newitemoff / (((double) maxoff + 1)));
+
+			state.is_weighted = true;
+		}
+		else if (mode != SPLIT_SINGLE_VALUE)
 		{
 			/* Only used on rightmost page */
 			state.fillfactor = RelationGetFillFactor(rel,
@@ -2096,6 +2126,126 @@ _bt_checksplitloc(FindSplitData *state,
 	return INT_MAX;
 }
 
+/*
+ * Subroutine to determine whether or not the page should be split at
+ * approximately the point that the new/incoming item would have been
+ * inserted.
+ *
+ * This routine infers two distinct cases in which splitting around the new
+ * item's insertion point is likely to lead to better space utilization over
+ * time:
+ *
+ * - Composite indexes that consist of one or more leading columns that
+ *   describe some grouping, plus a trailing, monotonically increasing
+ *   column.  If there happened to only be one grouping then the traditional
+ *   rightmost page split default fillfactor% would be used to good effect,
+ *   so it seems worth recognizing this case.  This usage pattern is
+ *   prevalent in the TPC-C benchmark, and is assumed to be common in real
+ *   world applications.
+ *
+ * - DESC-ordered insertions, including DESC-ordered single (non-heap-TID)
+ *   key attribute indexes.  We don't want the performance of explicitly
+ *   DESC-ordered indexes to be out of line with an equivalent ASC-ordered
+ *   index.  Also, there may be organic cases where items are continually
+ *   inserted in DESC order for an index with ASC sort order.
+ *
+ * Caller uses fillfactor% rather than using the new item offset directly
+ * because it allows suffix truncation to be applied using the usual
+ * criteria, which can still be helpful.  This approach is also more
+ * maintainable, since restrictions on split points can be handled in the
+ * usual way.
+ *
+ * Localized insert points are inferred here by observing that neighboring
+ * heap TIDs are "adjacent".  For example, if the new item has distinct key
+ * attribute values to the existing item that belongs to its immediate left,
+ * and the item to its left has a heap TID whose offset is exactly one less
+ * than the new item's offset, then caller is told to use its new-item-split
+ * strategy.  It isn't of much consequence if this routine incorrectly
+ * infers that an interesting case is taking place, provided that that
+ * doesn't happen very often.  In particular, it should not be possible to
+ * construct a test case where the routine consistently does the wrong
+ * thing.  Since heap TID "adjacency" is such a delicate condition, and
+ * since there is no reason to imagine that random insertions should ever
+ * consistent leave new tuples at the first or last position on the page
+ * when a split is triggered, that will never happen.
+ *
+ * Note that we avoid using the split-at-new fillfactor% when we'd have to
+ * append a heap TID during suffix truncation.  We also insist that there
+ * are no varwidth attributes or NULL attribute values in new item, since
+ * that invalidates interpolating from the new item offset.  Besides,
+ * varwidths generally imply the use of datatypes where ordered insertions
+ * are not a naturally occurring phenomenon.
+ */
+static bool
+_bt_dosplitatnewitem(Relation rel, Page page, OffsetNumber newitemoff,
+					 IndexTuple newitem)
+{
+	ItemId		itemid;
+	OffsetNumber maxoff;
+	BTPageOpaque opaque;
+	IndexTuple	tup;
+	int16		nkeyatts;
+
+	if (IndexTupleHasNulls(newitem) || IndexTupleHasVarwidths(newitem))
+		return false;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Avoid optimization entirely on pages with large items */
+	if (maxoff <= 3)
+		return false;
+
+	nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+	/*
+	 * When heap TIDs appear in DESC order, consider left-heavy split.
+	 *
+	 * Accept left-heavy split when new item, which will be inserted at first
+	 * data offset, has adjacent TID to extant item at that position.
+	 */
+	if (newitemoff == P_FIRSTDATAKEY(opaque))
+	{
+		itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+		tup = (IndexTuple) PageGetItem(page, itemid);
+
+		return _bt_adjacenthtid(&tup->t_tid, &newitem->t_tid) &&
+			_bt_leave_natts_fast(rel, tup, newitem) <= nkeyatts;
+	}
+
+	/* Single key indexes only use DESC optimization */
+	if (nkeyatts == 1)
+		return false;
+
+	/*
+	 * When tuple heap TIDs appear in ASC order, consider right-heavy split,
+	 * even though this may not be the right-most page.
+	 *
+	 * Accept right-heavy split when new item, which belongs after any
+	 * existing page offset, has adjacent TID to extant item that's the last
+	 * on the page.
+	 */
+	if (newitemoff > maxoff)
+	{
+		itemid = PageGetItemId(page, maxoff);
+		tup = (IndexTuple) PageGetItem(page, itemid);
+
+		return _bt_adjacenthtid(&tup->t_tid, &newitem->t_tid) &&
+			_bt_leave_natts_fast(rel, tup, newitem) <= nkeyatts;
+	}
+
+	/*
+	 * When new item is approximately in the middle of the page, look for
+	 * adjacency among new item, and extant item that belongs to the left of
+	 * the new item in the keyspace.
+	 */
+	itemid = PageGetItemId(page, OffsetNumberPrev(newitemoff));
+	tup = (IndexTuple) PageGetItem(page, itemid);
+
+	return _bt_adjacenthtid(&tup->t_tid, &newitem->t_tid) &&
+		_bt_leave_natts_fast(rel, tup, newitem) <= nkeyatts;
+}
+
 /*
  * Subroutine to find the "best" split point among an array of acceptable
  * candidate split points that split without there being an excessively high
@@ -2375,6 +2525,43 @@ _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
 	return _bt_leave_natts_fast(rel, lastleft, firstright);
 }
 
+/*
+ * Subroutine for determining if two heap TIDs are "adjacent".
+ *
+ * Adjacent means that the high TID is very likely to have been inserted into
+ * heap relation immediately after the low TID, probably by the same
+ * transaction, and probably not through heap_update().  This is not a
+ * commutative condition.
+ */
+static bool
+_bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid)
+{
+	BlockNumber lowblk,
+				highblk;
+	OffsetNumber lowoff,
+				highoff;
+
+	lowblk = ItemPointerGetBlockNumber(lowhtid);
+	highblk = ItemPointerGetBlockNumber(highhtid);
+
+	/* Not adjacent unless blocks are equal, or highblk is exactly one greater */
+	if (lowblk != highblk && lowblk + 1 != highblk)
+		return false;
+
+	lowoff = ItemPointerGetOffsetNumber(lowhtid);
+	highoff = ItemPointerGetOffsetNumber(highhtid);
+
+	/* When heap blocks match, second offset should be one up */
+	if (lowblk == highblk && OffsetNumberNext(lowoff) == highoff)
+		return true;
+
+	/* When heap block one up, second offset should be FirstOffsetNumber */
+	if (lowblk + 1 == highblk && highoff == FirstOffsetNumber)
+		return true;
+
+	return false;
+}
+
 /*
  * _bt_insert_parent() -- Insert downlink into parent after a page split.
  *
-- 
2.17.1
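
To make the "adjacency" condition easy to eyeball, here is a standalone sketch
that mirrors the _bt_adjacenthtid() test added above, using a hypothetical
DemoTid type rather than ItemPointerData:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint32_t block; uint16_t offset; } DemoTid;

    static bool
    adjacent_htid_demo(DemoTid low, DemoTid high)
    {
        if (low.block == high.block)
            return low.offset + 1 == high.offset;   /* next line pointer, same block */
        if (low.block + 1 == high.block)
            return high.offset == 1;                /* first line pointer, next block */
        return false;
    }

    int
    main(void)
    {
        DemoTid a = {10, 5}, b = {10, 6}, c = {11, 1}, d = {12, 3};

        printf("%d %d %d\n",
               adjacent_htid_demo(a, b),    /* 1 */
               adjacent_htid_demo(b, c),    /* 1 */
               adjacent_htid_demo(c, d));   /* 0 */
        return 0;
    }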

Attachment: v7-0001-Make-nbtree-indexes-have-unique-keys-in-tuples.patch (application/x-patch)
From 065b2861c0acec3125fbb67336a6461e64dca285 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v7 1/6] Make nbtree indexes have unique keys in tuples.

Make nbtree treat all index tuples as having a heap TID trailing
attribute.  Heap TID becomes a first class part of the key space on all
levels of the tree.  Index searches can distinguish duplicates by heap
TID.  This general approach has numerous benefits for performance, and
is prerequisite to teaching VACUUM to perform "retail index tuple
deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is also introduced.  This will generally truncate away the
"extra" heap TID attribute from pivot tuples during a leaf page split,
and may also truncate away additional user attributes.  This can
increase fan-out, especially when there are several attributes in an
index.  Truncation can only occur at the attribute granularity, which
isn't particularly effective, but works well enough for now.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tie-breaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with BTREE_VERSIONs 2 and
3, while also enforcing the newer/more strict invariants with
BTREE_VERSION 4 indexes.

We no longer allow a search for free space among multiple pages full of
duplicates to "get tired", except when needed to preserve compatibility
with earlier versions.  This has significant benefits for free space
management in secondary indexes on low cardinality attributes.  However,
without the next commit in the patch series (without having "single
value" mode and "many duplicates" mode within _bt_findsplitloc()), these
cases will be significantly regressed, since they'll naively perform
50:50 splits without there being any hope of reusing space left free on
the left half of the split.
---
 contrib/amcheck/verify_nbtree.c               | 266 +++++++--
 contrib/file_fdw/output/file_fdw.source       |  10 +-
 contrib/pageinspect/btreefuncs.c              |   2 +-
 contrib/pageinspect/expected/btree.out        |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out  |  10 +-
 src/backend/access/nbtree/README              | 162 ++++--
 src/backend/access/nbtree/nbtinsert.c         | 525 +++++++++++-------
 src/backend/access/nbtree/nbtpage.c           | 199 +++++--
 src/backend/access/nbtree/nbtree.c            |   2 +-
 src/backend/access/nbtree/nbtsearch.c         | 363 +++++++++---
 src/backend/access/nbtree/nbtsort.c           |  80 +--
 src/backend/access/nbtree/nbtutils.c          | 365 ++++++++++--
 src/backend/access/nbtree/nbtxlog.c           |  43 +-
 src/backend/access/rmgrdesc/nbtdesc.c         |   8 -
 src/backend/utils/sort/tuplesort.c            |  11 +-
 src/include/access/nbtree.h                   | 158 +++++-
 src/include/access/nbtxlog.h                  |  20 +-
 .../expected/test_extensions.out              |   6 +-
 src/test/regress/expected/aggregates.out      |   4 +-
 src/test/regress/expected/alter_table.out     |  20 +-
 src/test/regress/expected/collate.out         |   3 +-
 src/test/regress/expected/create_type.out     |   8 +-
 src/test/regress/expected/dependency.out      |   4 +-
 src/test/regress/expected/domain.out          |   4 +-
 src/test/regress/expected/event_trigger.out   |  69 ++-
 src/test/regress/expected/foreign_data.out    |   8 +-
 src/test/regress/expected/foreign_key.out     |   4 +-
 src/test/regress/expected/inherit.out         |  16 +-
 src/test/regress/expected/matview.out         |  18 +-
 src/test/regress/expected/rowsecurity.out     |   4 +-
 src/test/regress/expected/select_into.out     |   4 +-
 src/test/regress/expected/triggers.out        |   4 +-
 src/test/regress/expected/truncate.out        |   5 +-
 src/test/regress/expected/typed_table.out     |  11 +-
 src/test/regress/expected/updatable_views.out |  56 +-
 src/test/regress/output/tablespace.source     |   8 +-
 src/test/regress/sql/collate.sql              |   2 +
 src/test/regress/sql/domain.sql               |   2 +
 src/test/regress/sql/foreign_key.sql          |   2 +
 src/test/regress/sql/truncate.sql             |   2 +
 src/test/regress/sql/typed_table.sql          |   2 +
 src/test/regress/sql/updatable_views.sql      |  20 +
 src/tools/pgindent/typedefs.list              |   4 +
 43 files changed, 1736 insertions(+), 780 deletions(-)
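
The stricter amcheck item-order invariant enforced below (each item must be
strictly less than its successor once heap TID is part of the key space)
amounts to the following standalone sketch.  DemoItem and the helper names
are hypothetical, not amcheck's real invariant_l_offset() machinery:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { int key; uint32_t block; uint16_t offset; } DemoItem;

    static int
    demo_cmp(DemoItem a, DemoItem b)
    {
        if (a.key != b.key)
            return (a.key < b.key) ? -1 : 1;
        if (a.block != b.block)
            return (a.block < b.block) ? -1 : 1;
        if (a.offset != b.offset)
            return (a.offset < b.offset) ? -1 : 1;
        return 0;
    }

    static bool
    page_order_ok(const DemoItem *items, int nitems)
    {
        for (int i = 1; i < nitems; i++)
        {
            /* exact duplicates or misordering now count as corruption */
            if (demo_cmp(items[i - 1], items[i]) >= 0)
                return false;
        }
        return true;
    }

    int
    main(void)
    {
        DemoItem    page[] = {{1, 3, 1}, {1, 3, 2}, {2, 9, 7}};

        printf("%s\n", page_order_ok(page, 3) ? "ok" : "corrupt");    /* ok */
        return 0;
    }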

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index a1438a2855..09bcd4442d 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -45,6 +45,13 @@ PG_MODULE_MAGIC;
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
 
+/*
+ * Convenience macro to get number of key attributes in tuple in low-context
+ * fashion
+ */
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
+
 /*
  * State associated with verifying a B-Tree index
  *
@@ -125,26 +132,28 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
-static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+static IndexTuple bt_right_page_check_tuple(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+							  BlockNumber childblock);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
 						  bool tupleIsAlive, void *checkstate);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
-static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound);
+static inline bool invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
-					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
-							   OffsetNumber upperbound);
+static inline bool invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							 BTScanInsert key,
+							 Page nontarget,
+							 OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+							IndexTuple itup, bool isleaf);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -834,8 +843,8 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		BTScanInsert skey;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -902,8 +911,15 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
-		/* Build insertion scankey for current page offset */
-		skey = _bt_mkscankey(state->rel, itup);
+		/*
+		 * Build insertion scankey for current page offset/tuple.
+		 *
+		 * As required by _bt_mkscankey(), track number of key attributes,
+		 * which is needed so that _bt_compare() calls handle truncated
+		 * attributes correctly.  Never count non-key attributes in
+		 * non-truncated tuples as key attributes, though.
+		 */
+		skey = _bt_mkscankey(state->rel, itup, false);
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -956,11 +972,10 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1017,16 +1032,20 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
-			ScanKey		rightkey;
+			IndexTuple	righttup;
+			BTScanInsert rightkey;
 
 			/* Get item in next/right page */
-			rightkey = bt_right_page_check_scankey(state);
+			righttup = bt_right_page_check_tuple(state);
 
-			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+			/* Set up right item scankey */
+			if (righttup)
+				rightkey = _bt_mkscankey(state->rel, righttup, false);
+
+			if (righttup && !invariant_g_offset(state, rightkey, max))
 			{
 				/*
-				 * As explained at length in bt_right_page_check_scankey(),
+				 * As explained at length in bt_right_page_check_tuple(),
 				 * there is a known !readonly race that could account for
 				 * apparent violation of invariant, which we must check for
 				 * before actually proceeding with raising error.  Our canary
@@ -1069,7 +1088,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, skey, childblock);
 		}
 	}
 
@@ -1083,9 +1102,9 @@ bt_target_page_check(BtreeCheckState *state)
 }
 
 /*
- * Return a scankey for an item on page to right of current target (or the
+ * Return an index tuple for an item on page to right of current target (or the
  * first non-ignorable page), sufficient to check ordering invariant on last
- * item in current target page.  Returned scankey relies on local memory
+ * item in current target page.  Returned tuple relies on local memory
  * allocated for the child page, which caller cannot pfree().  Caller's memory
  * context should be reset between calls here.
  *
@@ -1098,8 +1117,8 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
-bt_right_page_check_scankey(BtreeCheckState *state)
+static IndexTuple
+bt_right_page_check_tuple(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
@@ -1287,11 +1306,10 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	}
 
 	/*
-	 * Return first real item scankey.  Note that this relies on right page
-	 * memory remaining allocated.
+	 * Return first real item.  Note that this relies on right page memory
+	 * remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	return (IndexTuple) PageGetItem(rightpage, rightitem);
 }
 
 /*
@@ -1304,8 +1322,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  * verification this way around is much more practical.
  */
 static void
-bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1354,7 +1372,8 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the
+	 * page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1404,14 +1423,13 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		/*
 		 * Skip comparison of target page key against "negative infinity"
 		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * bound, but that's only because of the hard-coding for negative
+		 * infinity items within _bt_compare().
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_l_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1751,6 +1769,60 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	/*
+	 * pg_upgrade'd indexes may legally have equal sibling tuples.  Their
+	 * pivot tuples can never have key attributes truncated away.
+	 */
+	if (!key->uniquekeys)
+		return invariant_leq_offset(state, key, upperbound);
+
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as
+	 * meaning that they're not participating in a search, not as negative
+	 * infinity (only tuples within the index are treated as negative
+	 * infinity).  Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+
+		/* Get heap TID for item to the right */
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup,
+											   P_ISLEAF(topaque));
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && rheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1759,57 +1831,104 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
+invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
-	return cmp >= 0;
+	/*
+	 * pg_upgrade'd indexes may legally have equal sibling tuples.  Their
+	 * pivot tuples can never have key attributes truncated away.
+	 */
+	if (!key->uniquekeys)
+		return cmp >= 0;
+
+	/*
+	 * No need to consider the possibility that scankey has attributes that we
+	 * need to force to be interpreted as negative infinity.  That could cause
+	 * us to miss the fact that the scankey is less than rather than equal to
+	 * its lower bound, but the index is corrupt either way.
+	 */
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e.  the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							 Page nontarget, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
-	return cmp <= 0;
+	/*
+	 * pg_upgrade'd indexes may legally have equal sibling tuples.  Their
+	 * pivot tuples can never have key attributes truncated away.
+	 */
+	if (!key->uniquekeys)
+		return cmp <= 0;
+
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as
+	 * meaning that they're not participating in a search, not as negative
+	 * infinity (only tuples within the index are treated as negative
+	 * infinity).  Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+		BTPageOpaque copaque;
+
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+
+		/* Get heap TID for item from child/non-target */
+		childheaptid =
+			BTreeTupleGetHeapTIDCareful(state, child, P_ISLEAF(copaque));
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && childheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -1965,3 +2084,32 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
+ * be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ *
+ * Note that it is incorrect to specify the tuple as a non-pivot when passing a
+ * leaf tuple that came from the high key offset, since that is actually a
+ * pivot tuple.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	BlockNumber targetblock = state->targetblock;
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
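
To make the new amcheck invariants easier to follow, here is a toy sketch of
the rule that invariant_l_offset() and invariant_l_nontarget_offset() apply
when _bt_compare() returns 0 only because the scan key has fewer attributes
than the tuple it's compared against, or lacks a scantid.  The types and
names are made up for illustration; this is not code from the patch:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for BTScanInsert and on-page tuple state */
typedef struct
{
	int			keysz;			/* attributes present in scan key */
	bool		has_scantid;	/* scan key carries a heap TID? */
} SketchKey;

typedef struct
{
	int			nkeyatts;		/* non-truncated key attributes in tuple */
	bool		has_heaptid;	/* tuple carries a heap TID? */
} SketchTuple;

/*
 * Decide whether "key < tuple" holds, given a prefix comparison result
 * (cmp) that ignored attributes missing from the scan key.  Mirrors the
 * cmp == 0 compensation above: attributes missing from the scan key are
 * "not participating", whereas attributes truncated away from the tuple
 * behave as minus infinity.
 */
static bool
sketch_invariant_l(int cmp, SketchKey *key, SketchTuple *tup)
{
	if (cmp != 0)
		return cmp < 0;

	if (key->keysz == tup->nkeyatts)
	{
		/* equal prefix: key is smaller iff it lacks a TID and tuple has one */
		return !key->has_scantid && tup->has_heaptid;
	}

	/* otherwise the side with fewer (i.e. truncated) attributes sorts lower */
	return key->keysz < tup->nkeyatts;
}

int
main(void)
{
	SketchKey	key = {.keysz = 2, .has_scantid = false};
	SketchTuple tup = {.nkeyatts = 2, .has_heaptid = true};

	/* prefix compares equal, but the tuple's heap TID makes it larger */
	printf("key < tuple: %s\n", sketch_invariant_l(0, &key, &tup) ? "yes" : "no");
	return 0;
}

The real functions obviously work against a BTScanInsert and the item at an
offset of the target (or non-target) page; the sketch only isolates the
tie-breaking rule.
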
diff --git a/contrib/file_fdw/output/file_fdw.source b/contrib/file_fdw/output/file_fdw.source
index 853c9f9b28..42bf16ba70 100644
--- a/contrib/file_fdw/output/file_fdw.source
+++ b/contrib/file_fdw/output/file_fdw.source
@@ -432,10 +432,10 @@ RESET ROLE;
 DROP EXTENSION file_fdw CASCADE;
 NOTICE:  drop cascades to 7 other objects
 DETAIL:  drop cascades to server file_server
-drop cascades to user mapping for regress_file_fdw_superuser on server file_server
-drop cascades to user mapping for regress_no_priv_user on server file_server
-drop cascades to foreign table agg_text
-drop cascades to foreign table agg_csv
-drop cascades to foreign table agg_bad
 drop cascades to foreign table text_csv
+drop cascades to foreign table agg_bad
+drop cascades to foreign table agg_csv
+drop cascades to foreign table agg_text
+drop cascades to user mapping for regress_no_priv_user on server file_server
+drop cascades to user mapping for regress_file_fdw_superuser on server file_server
 DROP ROLE regress_file_fdw_superuser, regress_file_fdw_user, regress_no_priv_user;
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 184ac62255..bee1f1c9d9 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -560,7 +560,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	 * Get values of extended metadata if available, use default values
 	 * otherwise.
 	 */
-	if (metad->btm_version == BTREE_VERSION)
+	if (metad->btm_version >= BTREE_META_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b..07c2dcd771 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d4..9920dbfd40 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89..43545311da 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -34,30 +34,47 @@ Differences to the Lehman & Yao algorithm
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
+The requirement that all btree keys be unique is satisfied by treating
+heap TID as a tie-breaker attribute.  Logical duplicates are sorted in
+heap item pointer order.  We don't use btree keys to disambiguate
+downlinks from the internal pages during a page split, though: only
+one entry in the parent level will be pointing at the page we just
+split, so the link fields can be used to re-find downlinks in the
+parent via a linear search.
 
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
+Lehman and Yao require that the key range for a subtree S is described
+by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the
+parent page, but do not account for the need to search the tree based
+only on leading index attributes in a composite index.  Since heap TID
+is always used to make btree keys unique (even in unique indexes),
+every btree index is treated as a composite index internally.  A
+search that finds exact equality to a pivot tuple in an upper tree
+level must descend to the left of that key to ensure it finds any
+equal keys, even when scan values were provided for all attributes.
+An insertion that sees that the high key of its target page is equal
+to the key to be inserted cannot move right, since the downlink for
+the right sibling in the parent must always be strictly less than
+right sibling keys (this is always possible because the leftmost
+downlink on any non-leaf level is always a negative infinity
+downlink).
+
+We might be able to avoid moving left in the event of a full match on
+all attributes up to and including the heap TID attribute, but that
+would be a very narrow win, since it's rather unlikely that heap TID
+will be an exact match.  We can avoid moving left unnecessarily when
+all user-visible keys are equal by avoiding exact equality;  a
+sentinel value that's less than any possible heap TID is used by most
+index scans.  This is effective because of suffix truncation.  An
+"extra" heap TID attribute in pivot tuples is almost always avoided.
+All truncated attributes compare as minus infinity, even against a
+sentinel value, and the sentinel value is less than any real TID
+value, so an unnecessary move to the left is avoided regardless of
+whether or not a heap TID is present in the otherwise-equal pivot
+tuple.  Consistently moving left on full equality is also needed by
+page deletion, which re-finds a leaf page by descending the tree while
+searching on the leaf page's high key.  If we wanted to avoid moving
+left without breaking page deletion, we'd have to avoid suffix
+truncation, which could never be worth it.
 
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
@@ -598,33 +615,60 @@ the order of multiple keys for a given column is unspecified.)  An
 insertion scankey uses the same array-of-ScanKey data structure, but the
 sk_func pointers point to btree comparison support functions (ie, 3-way
 comparators that return int4 values interpreted as <0, =0, >0).  In an
-insertion scankey there is exactly one entry per index column.  Insertion
-scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
-used to locate the starting point of a scan, as well as for locating the
-place to insert a new index tuple.  (Note: in the case of an insertion
-scankey built from a search scankey, there might be fewer keys than
-index columns, indicating that we have no constraints for the remaining
-index columns.)  After we have located the starting point of a scan, the
-original search scankey is consulted as each index entry is sequentially
-scanned to decide whether to return the entry and whether the scan can
-stop (see _bt_checkkeys()).
+insertion scankey there is exactly one entry per index column.  There is
+also other data about the rules used to locate where to begin the scan,
+such as whether or not the scan is a "nextkey" scan.  Insertion scankeys
+are built within the btree code (eg, by _bt_mkscankey()) and are used to
+locate the starting point of a scan, as well as for locating the place to
+insert a new index tuple.  (Note: in the case of an insertion scankey built
+from a search scankey, there might be fewer keys than index columns,
+indicating that we have no constraints for the remaining index columns.)
+After we have located the starting point of a scan, the original search
+scankey is consulted as each index entry is sequentially scanned to decide
+whether to return the entry and whether the scan can stop (see
+_bt_checkkeys()).
 
 We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+to heap tuples, that are used only for tree navigation.  Pivot tuples
+includes all tuples on non-leaf pages and high keys on leaf pages.  Note
+that pivot index tuples are only used to represent which part of the key
+space belongs on each page, and can have attribute values copied from
+non-pivot tuples that were deleted and killed by VACUUM some time ago.
+
+Notes about suffix truncation
+-----------------------------
+
+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split when the remaining attributes distinguish the
+last index tuple on the post-split left page as belonging on the left page,
+and the first index tuple on the post-split right page as belonging on the
+right page.  A truncated tuple logically retains all key attributes, though
+they implicitly have "negative infinity" as their value, and have no
+storage overhead.  Since the high key is subsequently reused as the
+downlink in the parent page for the new right page, suffix truncation makes
+pivot tuples short.  INCLUDE indexes are guaranteed to have non-key
+attributes truncated at the time of a leaf page split, but may also have
+some key attributes truncated away, based on the usual criteria for key
+attributes.  They are not a special case, since non-key attributes are
+merely payload to B-Tree searches.
+
+The goal of suffix truncation of key attributes is to improve index
+fan-out.  The technique was first described by Bayer and Unterauer (R.Bayer
+and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
+2, No. 1, March 1977, pp 11-26).  The Postgres implementation is loosely
+based on their paper.  Note that Postgres only implements what the paper
+refers to as simple prefix B-Trees.  Note also that the paper assumes that
+the tree has keys that consist of single strings that maintain the "prefix
+property", much like strings that are stored in a suffix tree (comparisons
+of earlier bytes must always be more significant than comparisons of later
+bytes, and, in general, the strings must compare in a way that doesn't
+break transitive consistency as they're split into pieces).  Suffix
+truncation in Postgres currently only works at the whole-attribute
+granularity, but it would be straightforward to invent opclass
+infrastructure that manufactures a smaller attribute value in the case of
+variable-length types, such as text.  An opclass support function could
+manufacture the shortest possible key value that still correctly separates
+each half of a leaf page split.
 
 Notes About Data Representation
 -------------------------------
@@ -642,9 +686,10 @@ so we have to play some games.
 On a page that is not rightmost in its tree level, the "high key" is
 kept in the page's first item, and real data items start at item 2.
 The link portion of the "high key" item goes unused.  A page that is
-rightmost has no "high key", so data items start with the first item.
-Putting the high key at the left, rather than the right, may seem odd,
-but it avoids moving the high key as we add data items.
+rightmost has no "high key" (it's implicitly positive infinity), so data
+items start with the first item.  Putting the high key at the left, rather
+than the right, may seem odd, but it avoids moving the high key as we add
+data items.
 
 On a leaf page, the data items are simply links to (TIDs of) tuples
 in the relation being indexed, with the associated key values.
@@ -658,4 +703,17 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
+
+Non-leaf pages only truly need to truncate their first item to zero
+attributes on leftmost pages, since only that item is truly negative infinity.
+All other negative infinity items are only really negative infinity
+within the subtree that the page is at the root of (or is a leftmost
+page within).  We truncate away all attributes of the first item on
+non-leaf pages just the same, to save a little space.  If we ever
+avoided zero-truncating items on pages where that doesn't accurately
+represent the absolute separation of the keyspace, we'd be left with
+"low key" items on internal pages -- a key value that can be used as a
+lower bound on items on the page, much like the high key is an upper
+bound.
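
As a self-contained illustration of the comparison rules described above
(heap TID acting as a tie-breaker attribute, truncated attributes comparing
as minus infinity, and an absent scantid simply not participating), here is
a toy C sketch.  The types are invented for the example; the real
_bt_compare() is of course driven by opclass support functions:

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model: integer key attributes only, an optional heap TID flattened
 * into a single long, and truncated attributes treated as minus infinity.
 */
typedef struct
{
	int			natts;			/* attributes present (rest truncated) */
	int			atts[3];
	bool		has_tid;
	long		tid;
} SketchItem;

static int
sketch_compare(const SketchItem *key, const SketchItem *tup)
{
	int			i;

	for (i = 0; i < key->natts; i++)
	{
		if (i >= tup->natts)
			return 1;			/* truncated tuple attribute: minus infinity */
		if (key->atts[i] != tup->atts[i])
			return key->atts[i] < tup->atts[i] ? -1 : 1;
	}

	/* user attributes equal; fall back on the heap TID tie-breaker */
	if (!key->has_tid || !tup->has_tid)
		return 0;				/* absent TID does not participate */
	if (key->tid != tup->tid)
		return key->tid < tup->tid ? -1 : 1;
	return 0;
}

int
main(void)
{
	SketchItem	key = {.natts = 2, .atts = {42, 7}, .has_tid = true, .tid = 100};
	SketchItem	pivot = {.natts = 1, .atts = {42}};		/* suffix-truncated */

	/* prints 1: the pivot's second attribute is minus infinity */
	printf("cmp = %d\n", sketch_compare(&key, &pivot));
	return 0;
}

The _bt_lowest_scantid() sentinel described above fits into the same scheme:
a TID value lower than any real heap TID, but still greater than a truncated
(minus infinity) TID attribute.
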
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 582e5b0652..77bc6ee9b3 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -52,19 +52,19 @@ typedef struct
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
-static TransactionId _bt_check_unique(Relation rel, IndexTuple itup,
-				 Relation heapRel, Buffer buf, OffsetNumber offset,
-				 ScanKey itup_scankey,
-				 IndexUniqueCheck checkUnique, bool *is_unique,
-				 uint32 *speculativeToken);
-static void _bt_findinsertloc(Relation rel,
+static TransactionId _bt_check_unique(Relation rel, BTScanInsert itup_scankey,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
+				 OffsetNumber offset, IndexUniqueCheck checkUnique,
+				 bool *is_unique, uint32 *speculativeToken);
+static OffsetNumber  _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_scankey,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool restorebinsrch,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel);
+static bool _bt_findinsertrandom(Relation rel, Relation heapRel, Buffer buf,
+				  bool *restorebinsrch, Size itemsz);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -72,7 +72,7 @@ static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   bool split_only_page);
 static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
 		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
-		  IndexTuple newitem, bool newitemonleft);
+		  IndexTuple newitem, bool newitemonleft, bool truncate);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
@@ -84,8 +84,8 @@ static void _bt_checksplitloc(FindSplitData *state,
 				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey);
+static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey,
+			 Page page, OffsetNumber offnum);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
@@ -111,18 +111,21 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel)
 {
 	bool		is_unique = false;
-	int			indnkeyatts;
-	ScanKey		itup_scankey;
+	BTScanInsert itup_scankey;
 	BTStack		stack = NULL;
 	Buffer		buf;
-	OffsetNumber offset;
+	Page		page;
+	BTPageOpaque lpageop;
 	bool		fastpath;
 
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	Assert(indnkeyatts != 0);
+	Assert(IndexRelationGetNumberOfKeyAttributes(rel) != 0);
 
 	/* we need an insertion scan key to do our search, so build one */
-	itup_scankey = _bt_mkscankey(rel, itup);
+	itup_scankey = _bt_mkscankey(rel, itup, false);
+top:
+	/* Cannot use real heap TID in unique case -- it'll be restored later */
+	if (itup_scankey->uniquekeys && checkUnique != UNIQUE_CHECK_NO)
+		itup_scankey->scantid = _bt_lowest_scantid();
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -143,14 +146,10 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	 * other backend might be concurrently inserting into the page, thus
 	 * reducing our chances to finding an insertion place in this page.
 	 */
-top:
 	fastpath = false;
-	offset = InvalidOffsetNumber;
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
 		Size		itemsz;
-		Page		page;
-		BTPageOpaque lpageop;
 
 		/*
 		 * Conditionally acquire exclusive lock on the buffer before doing any
@@ -174,14 +173,17 @@ top:
 			/*
 			 * Check if the page is still the rightmost leaf page, has enough
 			 * free space to accommodate the new tuple, and the insertion scan
-			 * key is strictly greater than the first key on the page.
+			 * key is strictly greater than the first key on the page.  The
+			 * itup_scankey scantid that was temporarily set to a sentinel low
+			 * value may cause the optimization to be skipped even though it
+			 * would be safe, though that only happens when the page is full
+			 * of duplicates.
 			 */
 			if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, itup_scankey, page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -220,8 +222,7 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, itup_scankey, &buf, BT_WRITE, NULL);
 	}
 
 	/*
@@ -231,12 +232,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tie-breaker attribute.  Any other would-be inserter
+	 * of the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -249,9 +251,24 @@ top:
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
+		OffsetNumber offset;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
-		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
+		page = BufferGetPage(buf);
+		lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+		/*
+		 * Arrange for the later _bt_findinsertloc call to _bt_binsrch to
+		 * avoid repeating the work done during this initial _bt_binsrch call.
+		 * Clear the _bt_lowest_scantid-supplied scantid value first, though,
+		 * so that the itup_scankey-cached low and high bounds will enclose a
+		 * range of offsets in the event of multiple duplicates. (Our
+		 * _bt_binsrch call cannot be allowed to incorrectly enclose a single
+		 * offset: the offset of the first duplicate among many on the page.)
+		 */
+		itup_scankey->scantid = NULL;
+		itup_scankey->savebinsrch = true;
+		offset = _bt_binsrch(rel, itup_scankey, buf);
+		xwait = _bt_check_unique(rel, itup_scankey, itup, heapRel, buf, offset,
 								 checkUnique, &is_unique, &speculativeToken);
 
 		if (TransactionIdIsValid(xwait))
@@ -274,10 +291,16 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/* Uniqueness is established -- restore heap tid as scantid */
+		if (itup_scankey->uniquekeys)
+			itup_scankey->scantid = &itup->t_tid;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		OffsetNumber	insertoff;
+
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -288,10 +311,11 @@ top:
 		 * attributes are not considered part of the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
-		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+		/* do the insertion, possibly on a page to the right in unique case */
+		insertoff = _bt_findinsertloc(rel, itup_scankey, &buf,
+									  checkUnique != UNIQUE_CHECK_NO, itup,
+									  stack, heapRel);
+		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, insertoff, false);
 	}
 	else
 	{
@@ -302,7 +326,7 @@ top:
 	/* be tidy */
 	if (stack)
 		_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
+	pfree(itup_scankey);
 
 	return is_unique;
 }
@@ -327,13 +351,12 @@ top:
  * core code must redo the uniqueness check later.
  */
 static TransactionId
-_bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
-				 Buffer buf, OffsetNumber offset, ScanKey itup_scankey,
-				 IndexUniqueCheck checkUnique, bool *is_unique,
-				 uint32 *speculativeToken)
+_bt_check_unique(Relation rel, BTScanInsert itup_scankey,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
+				 OffsetNumber offset, IndexUniqueCheck checkUnique,
+				 bool *is_unique, uint32 *speculativeToken)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	SnapshotData SnapshotDirty;
 	OffsetNumber maxoff;
 	Page		page;
@@ -344,6 +367,10 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
 
+	/* Fast path for case where there are clearly no duplicates */
+	if (itup_scankey->low == itup_scankey->high)
+		return InvalidTransactionId;
+
 	InitDirtySnapshot(SnapshotDirty);
 
 	page = BufferGetPage(buf);
@@ -392,7 +419,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 				 * in real comparison, but only for ordering/finding items on
 				 * pages. - vadim 03/24/97
 				 */
-				if (!_bt_isequal(itupdesc, page, offset, indnkeyatts, itup_scankey))
+				if (!_bt_isequal(itupdesc, itup_scankey, page, offset))
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
@@ -553,11 +580,29 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
-			/* If scankey == hikey we gotta check the next page too */
+			/*
+			 * If scankey <= hikey (leaving out the heap TID attribute), we
+			 * gotta check the next page too.
+			 *
+			 * We cannot get away with giving up without going to the next
+			 * page when true key values are all == hikey, because heap TID is
+			 * ignored when considering duplicates (caller is sure to not
+			 * provide a scantid in scankey).  We could get away with this in
+			 * a hypothetical world where unique indexes certainly never
+			 * contain physical duplicates, since heap TID would never be
+			 * treated as part of the keyspace --- not here, and not at any
+			 * other point.
+			 *
+			 * If we end up moving right when scankey == hikey, then in
+			 * practice there is a very strong chance that visiting the next
+			 * page will find duplicates that need to be checked.  Caller's
+			 * _bt_lowest_scantid() optimization already eliminates all cases
+			 * where visiting an extra leaf page is truly unnecessary.
+			 */
+			Assert(itup_scankey->scantid == NULL);
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			if (_bt_compare(rel, itup_scankey, page, P_HIKEY) > 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -599,40 +644,40 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 
 
 /*
- *	_bt_findinsertloc() -- Finds an insert location for a tuple
+ *	_bt_findinsertloc() -- Finds an insert location for a new tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
+ *		On entry, *bufptr contains the page that the new tuple unambiguously
+ *		belongs on.  This may not be quite right for callers that just called
+ *		_bt_check_unique(), though, since they won't have initially searched
+ *		using a scantid.  They'll have to insert into a page somewhere to the
+ *		right in rare cases where there are many physical duplicates in a
+ *		unique index.
  *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
+ *		_bt_check_unique() callers arrange for their insertion scan key to
+ *		save the progress of the last binary search performed.  No additional
+ *		binary search comparisons occur in the common case where there was no
+ *		existing duplicate tuple, though we may occasionally still not be able
+ *		to reuse their work for our own reasons.  Even when there are garbage
+ *		duplicates, very few binary search comparisons that are not
+ *		strictly necessary will be performed.  (Doesn't seem worthwhile to
+ *		optimize this further.)
  *
- *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
- *		page returned to the caller is exclusively locked instead.
+ *		The caller should hold an exclusive lock on *bufptr in all cases.  On
+ *		exit, *bufptr points to the page chosen for insertion.  If
+ *		we have to move right, the lock and pin on the original page will be
+ *		released, and the new page returned to the caller is exclusively
+ *		locked instead.  In any case, we return the offset that caller should
+ *		use to insert into the buffer pointed to by bufptr on return.
  *
- *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.  It has to happen here, since it may invalidate caller's
+ *		restorebinsrch hint.
  */
-static void
+static OffsetNumber
 _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_scankey,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool restorebinsrch,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel)
@@ -641,91 +686,55 @@ _bt_findinsertloc(Relation rel,
 	Page		page = BufferGetPage(buf);
 	Size		itemsz;
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
 	OffsetNumber newitemoff;
-	OffsetNumber firstlegaloff = *offsetptr;
+
+	Assert(!itup_scankey->uniquekeys || itup_scankey->scantid != NULL);
+	Assert(itup_scankey->uniquekeys || itup_scankey->scantid == NULL);
+	Assert(itup_scankey->scantid == NULL ||
+		   ItemPointerCompare(itup_scankey->scantid, _bt_lowest_scantid()) > 0);
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
-
 	itemsz = IndexTupleSize(newtup);
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itemsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
-	 */
 	if (itemsz > BTMaxItemSize(page))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itemsz, BTMaxItemSize(page),
-						RelationGetRelationName(rel)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(heapRel,
-									RelationGetRelationName(rel))));
+		_bt_check_third_page(rel, heapRel, page, newtup);
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
-	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	for (;;)
 	{
 		Buffer		rbuf;
 		BlockNumber rblkno;
+		int			cmpval;
 
-		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
-		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
+		Assert(P_ISLEAF(lpageop));
+
+		if (P_RIGHTMOST(lpageop))
+			break;
+
+		cmpval = _bt_compare(rel, itup_scankey, page, P_HIKEY);
+		if (itup_scankey->uniquekeys)
+		{
+			/* Version 4 -- handle possible concurrent page splits */
+			if (cmpval <= 0)
+				break;
+		}
+		else
 		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
-
 			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
+			 * Version 2 or 3 -- handle possible concurrent page splits, and
+			 * case when there are many duplicates, and there is a choice of
+			 * which page to place new tuple on
 			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
+			if (cmpval != 0 || _bt_findinsertrandom(rel, heapRel, buf,
+													&restorebinsrch, itemsz))
+				break;
 		}
 
 		/*
-		 * nope, so check conditions (b) and (c) enumerated above
-		 */
-		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
-			break;
-
-		/*
-		 * step right to next non-dead page
+		 * step right to next non-dead page.  this is only needed for unique
+		 * indexes, and pg_upgrade'd indexes that still use BTREE_VERSION 2 or
+		 * 3, where heap TID isn't considered to be a part of the keyspace.
 		 *
 		 * must write-lock that page before releasing write lock on current
 		 * page; else someone else's _bt_check_unique scan could fail to see
@@ -764,27 +773,79 @@ _bt_findinsertloc(Relation rel,
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		restorebinsrch = false;
 	}
 
 	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
+	 * Perform micro-vacuuming of the page we're about to insert the tuple on
+	 * if it looks like it has LP_DEAD items.  Only micro-vacuum when it might
+	 * forestall a page split, though. (Micro-vacuuming occasionally fails to
+	 * prevent a split, since we're not guaranteed to free more space than
+	 * what will be needed for our single new tuple.)
 	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
-	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < itemsz)
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		restorebinsrch = false;
+	}
+
+	/* _bt_check_unique() callers often avoid binary search effort */
+	itup_scankey->restorebinsrch = restorebinsrch;
+	newitemoff = _bt_binsrch(rel, itup_scankey, buf);
+	Assert(!itup_scankey->restorebinsrch);
+	Assert(!restorebinsrch ||
+		   newitemoff == _bt_binsrch(rel, itup_scankey, buf));
 
 	*bufptr = buf;
-	*offsetptr = newitemoff;
+	return newitemoff;
+}
+
+/*
+ * If we will need to split the page to put the item on this page, check
+ * whether we can put the tuple somewhere to the right, instead.  Keep
+ * scanning right until we (a) find a page with enough free space, (b) reach
+ * the last page where the tuple can legally go, or (c) get tired of
+ * searching.  (c) is not flippant; it is important because if there are many
+ * pages' worth of equal keys, it's better to split one of the early pages
+ * than to scan all the way to the end of the run of equal keys on every
+ * insert.  We implement "get tired" as a random choice, since stopping after
+ * scanning a fixed number of pages wouldn't work well (we'd never reach the
+ * right-hand side of previously split pages).  Currently the probability of
+ * moving right is set at 0.99, which may seem too high to change the behavior
+ * much, but it does an excellent job of preventing O(N^2) behavior with many
+ * equal keys.
+ *
+ * Returns value indicating if caller should insert on candidate leaf page.
+ */
+static bool
+_bt_findinsertrandom(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz)
+{
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque lpageop;
+
+	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	Assert(P_ISLEAF(lpageop));
+
+	if (PageGetFreeSpace(page) >= itemsz)
+		return true;
+
+	/*
+	 * before considering moving right, see if we can obtain enough space
+	 * by erasing LP_DEAD items
+	 */
+	if (P_HAS_GARBAGE(lpageop))
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		*restorebinsrch = false;
+		if (PageGetFreeSpace(page) >= itemsz)
+			return true;	/* OK, now we have enough space */
+	}
+
+	/* "Get tired" at random */
+	return random() <= (MAX_RANDOM_VALUE / 100);
 }
 
 /*----------
@@ -833,6 +894,8 @@ _bt_insertonpg(Relation rel,
 	BTPageOpaque lpageop;
 	OffsetNumber firstright = InvalidOffsetNumber;
 	Size		itemsz;
+	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
+	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -840,12 +903,9 @@ _bt_insertonpg(Relation rel,
 	/* child buffer must be given iff inserting on an internal page */
 	Assert(P_ISLEAF(lpageop) == !BufferIsValid(cbuf));
 	/* tuple must have appropriate number of attributes */
-	Assert(!P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
-	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(BTreeTupleGetNAtts(itup, rel) > 0);
+	Assert(!P_ISLEAF(lpageop) || BTreeTupleGetNAtts(itup, rel) == indnatts);
+	Assert(P_ISLEAF(lpageop) || BTreeTupleGetNAtts(itup, rel) <= indnkeyatts);
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -867,6 +927,7 @@ _bt_insertonpg(Relation rel,
 	{
 		bool		is_root = P_ISROOT(lpageop);
 		bool		is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(lpageop);
+		bool		truncate;
 		bool		newitemonleft;
 		Buffer		rbuf;
 
@@ -893,9 +954,16 @@ _bt_insertonpg(Relation rel,
 									  newitemoff, itemsz,
 									  &newitemonleft);
 
+		/*
+		 * Perform truncation of the new high key for the left half of the
+		 * split when splitting a leaf page.  Don't do so with version 3
+		 * indexes unless the index has non-key attributes.
+		 */
+		truncate = P_ISLEAF(lpageop) &&
+				   (_bt_hasuniquekeys(rel) || indnatts != indnkeyatts);
 		/* split the buffer into left and right halves */
 		rbuf = _bt_split(rel, buf, cbuf, firstright,
-						 newitemoff, itemsz, itup, newitemonleft);
+						 newitemoff, itemsz, itup, newitemonleft, truncate);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -977,7 +1045,7 @@ _bt_insertonpg(Relation rel,
 		if (BufferIsValid(metabuf))
 		{
 			/* upgrade meta-page if needed */
-			if (metad->btm_version < BTREE_VERSION)
+			if (metad->btm_version < BTREE_META_VERSION)
 				_bt_upgrademetapage(metapg);
 			metad->btm_fastroot = itup_blkno;
 			metad->btm_fastlevel = lpageop->btpo.level;
@@ -1032,6 +1100,9 @@ _bt_insertonpg(Relation rel,
 
 			if (BufferIsValid(metabuf))
 			{
+				Assert(metad->btm_version == BTREE_META_VERSION ||
+					   metad->btm_version == BTREE_VERSION);
+				xlmeta.version = metad->btm_version;
 				xlmeta.root = metad->btm_root;
 				xlmeta.level = metad->btm_level;
 				xlmeta.fastroot = metad->btm_fastroot;
@@ -1097,7 +1168,10 @@ _bt_insertonpg(Relation rel,
  *		On entry, buf is the page to split, and is pinned and write-locked.
  *		firstright is the item index of the first item to be moved to the
  *		new right page.  newitemoff etc. tell us about the new item that
- *		must be inserted along with the data from the old page.
+ *		must be inserted along with the data from the old page.  truncate
+ *		tells us if the new high key should undergo suffix truncation.
+ *		(Version 4 pivot tuples always have an explicit representation of
+ *		the number of non-truncated attributes that remain.)
  *
  *		When splitting a non-leaf page, 'cbuf' is the left-sibling of the
  *		page we're inserting the downlink for.  This function will clear the
@@ -1109,7 +1183,7 @@ _bt_insertonpg(Relation rel,
 static Buffer
 _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
-		  bool newitemonleft)
+		  bool newitemonleft, bool truncate)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1132,8 +1206,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	OffsetNumber i;
 	bool		isleaf;
 	IndexTuple	lefthikey;
-	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/* Acquire a new page to split into */
 	rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1203,7 +1275,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <=
+			   IndexRelationGetNumberOfKeyAttributes(rel));
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1217,8 +1291,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 
 	/*
 	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
-	 * item at position firstright, or the incoming tuple.
+	 * to go into the new right page, or possibly a truncated version if this
+	 * is a leaf page split.  This might be either the existing data item at
+	 * position firstright, or the incoming tuple.
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1236,25 +1311,58 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
+	 * Truncate nondistinguishing key attributes of the high key item before
+	 * inserting it on the left page.  This can only happen at the leaf
 	 * level, since in general all pivot tuple values originate from leaf
 	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * though; truncating unneeded key suffix attributes can only be
+	 * performed at the leaf level anyway.  This is because a pivot tuple in
+	 * a grandparent page must guide a search not only to the correct parent
+	 * page, but also to the correct leaf page.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (truncate)
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		OffsetNumber lastleftoff;
+		IndexTuple	lastleft;
+
+		/*
+		 * Determine which tuple will become the last on the left page.  The
+		 * last left tuple and the first right tuple enclose the split point,
+		 * and are needed to determine how far truncation can go while still
+		 * leaving us with a high key that distinguishes the left side from
+		 * the right side.
+		 */
+		Assert(isleaf);
+		if (newitemonleft && newitemoff == firstright)
+		{
+			/* incoming tuple will become last on left page */
+			lastleft = newitem;
+		}
+		else
+		{
+			/* item just before firstright will become last on left page */
+			lastleftoff = OffsetNumberPrev(firstright);
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		/*
+		 * Truncate first item on the right side to create a new high key for
+		 * the left side.  The high key must be strictly less than all tuples
+		 * on the right side of the split, but can be equal to the last item
+		 * on the left side of the split.
+		 */
+		Assert(lastleft != item);
+		lefthikey = _bt_truncate(rel, lastleft, item, false);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <=
+		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
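
To make the truncation rule concrete, here is a minimal standalone sketch of
the attribute-counting step (not part of the patch; the helper name and the
use of plain C strings for attribute values are illustrative assumptions,
while the real logic belongs to _bt_leave_natts() and goes through the
index's scan key machinery):

#include <string.h>

/*
 * Count how many leading key attributes must be kept so that the new high
 * key still distinguishes the last tuple on the left half of the split from
 * the first tuple on the right half.
 */
static int
keep_natts_sketch(const char **lastleft, const char **firstright, int nkeyatts)
{
	int			keepnatts = 1;

	for (int attnum = 0; attnum < nkeyatts; attnum++)
	{
		if (strcmp(lastleft[attnum], firstright[attnum]) != 0)
			break;			/* this attribute already distinguishes the halves */
		keepnatts++;		/* equal so far; one more attribute is needed */
	}

	/*
	 * keepnatts == nkeyatts + 1 means that no user-visible key attribute
	 * distinguishes the halves, so the new pivot must carry an explicit
	 * heap TID tie-breaker.
	 */
	return keepnatts;
}

For a two column index, the sketch returns 1 for lastleft ("foo", "5") /
firstright ("gar", "1"), 2 for ("foo", "5") / ("foo", "9"), and nkeyatts + 1
when the key attributes are fully equal.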
@@ -1447,7 +1555,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1476,22 +1583,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log left page.  We must also log the left page's high key. */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1509,9 +1604,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -2164,7 +2257,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	START_CRIT_SECTION();
 
 	/* upgrade metapage if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_META_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* set btree special data */
@@ -2199,7 +2292,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2232,6 +2326,9 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
 		XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version == BTREE_META_VERSION ||
+			   metad->btm_version == BTREE_VERSION);
+		md.version = metad->btm_version;
 		md.root = rootblknum;
 		md.level = metad->btm_level;
 		md.fastroot = rootblknum;
@@ -2296,6 +2393,7 @@ _bt_pgaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -2311,28 +2409,25 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
-_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey)
+_bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey, Page page,
+			OffsetNumber offnum)
 {
 	IndexTuple	itup;
+	ScanKey		scankey;
 	int			i;
 
-	/* Better be comparing to a leaf item */
+	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
 
+	scankey = itup_scankey->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
-	for (i = 1; i <= keysz; i++)
+	for (i = 1; i <= itup_scankey->keysz; i++)
 	{
 		AttrNumber	attno;
 		Datum		datum;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 4082103fe2..716f7c1f40 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -34,6 +34,7 @@
 #include "utils/snapmgr.h"
 
 static void _bt_cachemetadata(Relation rel, BTMetaPageData *metad);
+static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 						 bool *rightsib_empty);
@@ -77,7 +78,9 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 }
 
 /*
- *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to the new.
+ *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to version
+ *	3, the last version that can be updated without broadly affecting on-disk
+ *	compatibility.  (A REINDEX is required to upgrade to version 4.)
  *
  *		This routine does purely in-memory image upgrade.  Caller is
  *		responsible for locking, WAL-logging etc.
@@ -93,11 +96,11 @@ _bt_upgrademetapage(Page page)
 
 	/* It must be really a meta page of upgradable version */
 	Assert(metaopaque->btpo_flags & BTP_META);
-	Assert(metad->btm_version < BTREE_VERSION);
+	Assert(metad->btm_version < BTREE_META_VERSION);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 
 	/* Set version number and fill extra fields added into version 3 */
-	metad->btm_version = BTREE_VERSION;
+	metad->btm_version = BTREE_META_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
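
As an aside, here is the version scheme assumed throughout (constants live in
nbtree.h; the exact numeric values shown are my reading of the patch):

/*
 *     BTREE_MIN_VERSION  (2)   metapage-only differences; upgradable in place
 *     BTREE_META_VERSION (3)   adds btm_oldest_btpo_xact etc.; upgradable
 *     BTREE_VERSION      (4)   heap TID is part of the key space; REINDEX only
 */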
 
@@ -107,43 +110,79 @@ _bt_upgrademetapage(Page page)
 }
 
 /*
- * Cache metadata from meta page to rel->rd_amcache.
+ * Cache metadata from input meta page to rel->rd_amcache.
  */
 static void
-_bt_cachemetadata(Relation rel, BTMetaPageData *metad)
+_bt_cachemetadata(Relation rel, BTMetaPageData *input)
 {
+	BTMetaPageData   *cached_metad;
+
 	/* We assume rel->rd_amcache was already freed by caller */
 	Assert(rel->rd_amcache == NULL);
 	rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
 										 sizeof(BTMetaPageData));
 
-	/*
-	 * Meta page should be of supported version (should be already checked by
-	 * caller).
-	 */
-	Assert(metad->btm_version >= BTREE_MIN_VERSION &&
-		   metad->btm_version <= BTREE_VERSION);
+	/* Meta page should be of supported version */
+	Assert(input->btm_version >= BTREE_MIN_VERSION &&
+		   input->btm_version <= BTREE_VERSION);
 
-	if (metad->btm_version == BTREE_VERSION)
+	cached_metad = (BTMetaPageData *) rel->rd_amcache;
+	if (input->btm_version >= BTREE_META_VERSION)
 	{
-		/* Last version of meta-data, no need to upgrade */
-		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		/* Version with compatible meta-data, no need to upgrade */
+		memcpy(cached_metad, input, sizeof(BTMetaPageData));
 	}
 	else
 	{
-		BTMetaPageData *cached_metad = (BTMetaPageData *) rel->rd_amcache;
-
 		/*
 		 * Upgrade meta-data: copy available information from meta-page and
 		 * fill new fields with default values.
+		 *
+		 * Note that we cannot upgrade to BTREE_VERSION/version 4 without a
+		 * REINDEX, since extensive on-disk changes are required.
 		 */
-		memcpy(rel->rd_amcache, metad, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
-		cached_metad->btm_version = BTREE_VERSION;
+		memcpy(cached_metad, input, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
+		cached_metad->btm_version = BTREE_META_VERSION;
 		cached_metad->btm_oldest_btpo_xact = InvalidTransactionId;
 		cached_metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	}
 }
 
+/*
+ * Get metadata from share-locked buffer containing metapage, while performing
+ * standard sanity checks.  Sanity checks here must match _bt_getroot().
+ */
+static BTMetaPageData *
+_bt_getmeta(Relation rel, Buffer metabuf)
+{
+	Page		metapg;
+	BTPageOpaque metaopaque;
+	BTMetaPageData *metad;
+
+	metapg = BufferGetPage(metabuf);
+	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
+	metad = BTPageGetMeta(metapg);
+
+	/* sanity-check the metapage */
+	if (!P_ISMETA(metaopaque) ||
+		metad->btm_magic != BTREE_MAGIC)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("index \"%s\" is not a btree",
+						RelationGetRelationName(rel))));
+
+	if (metad->btm_version < BTREE_MIN_VERSION ||
+		metad->btm_version > BTREE_VERSION)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("version mismatch in index \"%s\": file version %d, "
+						"current version %d, minimal supported version %d",
+						RelationGetRelationName(rel),
+						metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+
+	return metad;
+}
+
 /*
  *	_bt_update_meta_cleanup_info() -- Update cleanup-related information in
  *									  the metapage.
@@ -186,7 +225,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_META_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* update cleanup-related information */
@@ -202,6 +241,9 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version == BTREE_META_VERSION ||
+			   metad->btm_version == BTREE_VERSION);
+		md.version = metad->btm_version;
 		md.root = metad->btm_root;
 		md.level = metad->btm_level;
 		md.fastroot = metad->btm_fastroot;
@@ -376,7 +418,7 @@ _bt_getroot(Relation rel, int access)
 		START_CRIT_SECTION();
 
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_META_VERSION)
 			_bt_upgrademetapage(metapg);
 
 		metad->btm_root = rootblkno;
@@ -400,6 +442,9 @@ _bt_getroot(Relation rel, int access)
 			XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
 			XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version == BTREE_META_VERSION ||
+				   metad->btm_version == BTREE_VERSION);
+			md.version = metad->btm_version;
 			md.root = rootblkno;
 			md.level = 0;
 			md.fastroot = rootblkno;
@@ -595,37 +640,12 @@ _bt_getrootheight(Relation rel)
 {
 	BTMetaPageData *metad;
 
-	/*
-	 * We can get what we need from the cached metapage data.  If it's not
-	 * cached yet, load it.  Sanity checks here must match _bt_getroot().
-	 */
 	if (rel->rd_amcache == NULL)
 	{
 		Buffer		metabuf;
-		Page		metapg;
-		BTPageOpaque metaopaque;
 
 		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
-		metapg = BufferGetPage(metabuf);
-		metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-		metad = BTPageGetMeta(metapg);
-
-		/* sanity-check the metapage */
-		if (!P_ISMETA(metaopaque) ||
-			metad->btm_magic != BTREE_MAGIC)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("index \"%s\" is not a btree",
-							RelationGetRelationName(rel))));
-
-		if (metad->btm_version < BTREE_MIN_VERSION ||
-			metad->btm_version > BTREE_VERSION)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("version mismatch in index \"%s\": file version %d, "
-							"current version %d, minimal supported version %d",
-							RelationGetRelationName(rel),
-							metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+		metad = _bt_getmeta(rel, metabuf);
 
 		/*
 		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
@@ -642,19 +662,70 @@ _bt_getrootheight(Relation rel)
 		 * Cache the metapage data for next time
 		 */
 		_bt_cachemetadata(rel, metad);
-
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_META_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
 		_bt_relbuf(rel, metabuf);
 	}
 
+	/* Get cached page */
 	metad = (BTMetaPageData *) rel->rd_amcache;
-	/* We shouldn't have cached it if any of these fail */
-	Assert(metad->btm_magic == BTREE_MAGIC);
-	Assert(metad->btm_version == BTREE_VERSION);
-	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
+/*
+ *	_bt_hasuniquekeys() -- Determine if heap TID should be treated as a key.
+ *
+ *		This is used to determine the rules that must be used to descend a
+ *		btree.  Version 4 indexes treat heap TID as a tie-breaker attribute.
+ *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
+ *		performance when inserting a new BTScanInsert-wise duplicate tuple
+ *		among many leaf pages already full of such duplicates.
+ */
+bool
+_bt_hasuniquekeys(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			uint32		btm_version = metad->btm_version;
+
+			_bt_relbuf(rel, metabuf);
+			return btm_version > BTREE_META_VERSION;
+		}
+
+		/*
+		 * Cache the metapage data for next time
+		 */
+		_bt_cachemetadata(rel, metad);
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_META_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached page */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+
+	return metad->btm_version > BTREE_META_VERSION;
+}
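
The intended calling pattern is the one _bt_mkscankey() and _bt_first() follow
later in the patch; roughly (a sketch, not additional patch code):

	BTScanInsertData key;

	/* heap TID participates in comparisons only on version 4 indexes */
	key.uniquekeys = _bt_hasuniquekeys(rel);
	key.scantid = key.uniquekeys ? BTreeTupleGetHeapTID(itup) : NULL;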
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -1370,7 +1441,7 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 */
 			if (!stack)
 			{
-				ScanKey		itup_scankey;
+				BTScanInsert itup_scankey;
 				ItemId		itemid;
 				IndexTuple	targetkey;
 				Buffer		lbuf;
@@ -1420,12 +1491,19 @@ _bt_pagedel(Relation rel, Buffer buf)
 				}
 
 				/* we need an insertion scan key for the search, so build one */
-				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
-				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
-				/* don't need a pin on the page */
+				itup_scankey = _bt_mkscankey(rel, targetkey, false);
+				/* get stack to leaf page by searching index */
+				stack = _bt_search(rel, itup_scankey, &lbuf, BT_READ, NULL);
+
+				/*
+				 * Prior to version 4, search is for the leftmost leaf page
+				 * containing this key, which is okay because we have to match
+				 * on block number to deal with concurrent splits anyway.
+				 * On version 4 indexes, the search will reliably relocate the
+				 * same leaf page.
+				 */
+				Assert(!itup_scankey->uniquekeys ||
+					   BufferGetBlockNumber(buf) == BufferGetBlockNumber(lbuf));
+				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
 				/*
@@ -1970,7 +2048,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	if (BufferIsValid(metabuf))
 	{
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_META_VERSION)
 			_bt_upgrademetapage(metapg);
 		metad->btm_fastroot = rightsib;
 		metad->btm_fastlevel = targetlevel;
@@ -2018,6 +2096,9 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 		{
 			XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version == BTREE_META_VERSION ||
+				   metad->btm_version == BTREE_VERSION);
+			xlmeta.version = metad->btm_version;
 			xlmeta.root = metad->btm_root;
 			xlmeta.level = metad->btm_level;
 			xlmeta.fastroot = metad->btm_fastroot;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e8725fbbe1..9cf760ffa0 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -794,7 +794,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_META_VERSION)
 	{
 		/*
 		 * Do cleanup if metapage needs upgrade, because we don't have
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 16223d01ec..7305e647b2 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,6 +25,10 @@
 #include "utils/tqual.h"
 
 
+static inline int32 _bt_nonpivot_compare(Relation rel,
+					 BTScanInsert key,
+					 Page page,
+					 OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 			 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -38,6 +42,7 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
+static ItemPointerData lowest;
 
 /*
  *	_bt_drop_lock_and_maybe_pin()
@@ -72,12 +77,9 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.  If the key was
+ * built from a leaf page's high key, the search will relocate that same
+ * leaf page.
  *
  * Return value is a stack of parent-page pointers.  *bufP is set to the
  * address of the leaf-page buffer, which is read-locked and pinned.
@@ -94,8 +96,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+		   Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -131,7 +133,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+		*bufP = _bt_moveright(rel, key, *bufP,
 							  (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
@@ -145,7 +147,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -158,8 +160,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link to disambiguate duplicate keys in the index, which is
+		 * faster than comparing the keys themselves.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -199,8 +201,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  true, stack_in, BT_WRITE, snapshot);
+		*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+							  snapshot);
 	}
 
 	return stack_in;
@@ -216,16 +218,16 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * or strictly to the right of it.
  *
  * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page.  If that entry
- * is strictly less than the scankey, or <= the scankey in the nextkey=true
+ * tree by examining the high key entry on the page.  If that entry is
+ * strictly less than the scankey, or <= the scankey in the key.nextkey=true
  * case, then we followed the wrong link and we need to move right.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index (see nbtree/README).
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key.  When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
  *
  * If forupdate is true, we will attempt to finish any incomplete splits
  * that we encounter.  This is required when locking a target page for an
@@ -242,10 +244,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  */
 Buffer
 _bt_moveright(Relation rel,
+			  BTScanInsert key,
 			  Buffer buf,
-			  int keysz,
-			  ScanKey scankey,
-			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
 			  int access,
@@ -270,7 +270,7 @@ _bt_moveright(Relation rel,
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
 	 */
-	cmpval = nextkey ? 0 : 1;
+	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
@@ -284,7 +284,7 @@ _bt_moveright(Relation rel,
 		/*
 		 * Finish any incomplete splits we encounter along the way.
 		 */
-		if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+		if (unlikely(forupdate && P_INCOMPLETE_SPLIT(opaque)))
 		{
 			BlockNumber blkno = BufferGetBlockNumber(buf);
 
@@ -305,7 +305,8 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (unlikely(P_IGNORE(opaque) ||
+					 _bt_compare(rel, key, page, P_HIKEY) >= cmpval))
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
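
Unpacking the cmpval encoding used just above (nothing new, just restating
the test):

		/*
		 * With nextkey = false, cmpval is 1, so we step right when
		 * _bt_compare() returns >= 1, i.e. when the scankey is strictly
		 * greater than the page's high key.  With nextkey = true, cmpval is
		 * 0, so we also step right when the scankey is equal to the high
		 * key, which is what "first item strictly greater than scankey"
		 * requires.
		 */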
@@ -328,10 +329,6 @@ _bt_moveright(Relation rel,
  * The passed scankey must be an insertion-type scankey (see nbtree/README),
  * but it can omit the rightmost column(s) of the index.
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
  * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
  * particular, this means it is possible to return a value 1 greater than the
@@ -347,37 +344,76 @@ _bt_moveright(Relation rel,
  *
  * This procedure is not responsible for walking right, it just examines
  * the given page.  _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
+ * on the buffer.  When itup_scankey.savebinsrch is set, modifies
+ * mutable fields of insertion scan key, so that a subsequent call where
+ * caller sets itup_scankey.savebinsrch can reuse the low and high bound
+ * of original binary search.  This makes the second binary search
+ * performed on the first leaf page landed on by inserters that do
+ * unique enforcement avoid doing any real comparisons in most cases.
+ * See _bt_findinsertloc() for further details.
  */
 OffsetNumber
 _bt_binsrch(Relation rel,
-			Buffer buf,
-			int keysz,
-			ScanKey scankey,
-			bool nextkey)
+			BTScanInsert key,
+			Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber low,
-				high;
+				high,
+				savehigh;
 	int32		result,
 				cmpval;
+	bool		isleaf;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	isleaf = P_ISLEAF(opaque);
 
-	low = P_FIRSTDATAKEY(opaque);
-	high = PageGetMaxOffsetNumber(page);
+	Assert(!(key->restorebinsrch && key->savebinsrch));
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+	/* State is saved without a scantid, and restored once scantid is set */
+	Assert(!key->savebinsrch || key->scantid == NULL);
+	Assert(!key->uniquekeys || !key->restorebinsrch || key->scantid != NULL);
+	Assert(P_ISLEAF(opaque) || (!key->restorebinsrch && !key->savebinsrch));
 
-	/*
-	 * If there are no keys on the page, return the first available slot. Note
-	 * this covers two cases: the page is really empty (no keys), or it
-	 * contains only a high key.  The latter case is possible after vacuuming.
-	 * This can never happen on an internal page, however, since they are
-	 * never empty (an internal page must have children).
-	 */
-	if (high < low)
-		return low;
+	if (!key->restorebinsrch)
+	{
+		low = P_FIRSTDATAKEY(opaque);
+		high = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If there are no keys on the page, return the first available
+		 * slot.  Note this covers two cases: the page is really empty (no
+		 * keys), or it contains only a high key.  The latter case is
+		 * possible after vacuuming.  This can never happen on an internal
+		 * page, however, since they are never empty (an internal page must
+		 * have children).
+		 */
+		if (unlikely(high < low))
+		{
+			if (key->savebinsrch)
+			{
+				key->low = low;
+				key->high = high;
+				key->savebinsrch = false;
+			}
+			return low;
+		}
+		high++;			/* establish the loop invariant for high */
+	}
+	else
+	{
+		/* Restore result of previous binary search against same page */
+		low = key->low;
+		high = key->high;
+		key->restorebinsrch = false;
+
+		/* Return the first slot, in line with original binary search */
+		if (unlikely(high < low))
+			return low;
+	}
 
 	/*
 	 * Binary search to find the first key on the page >= scan key, or first
@@ -391,22 +427,40 @@ _bt_binsrch(Relation rel,
 	 *
 	 * We can fall out when high == low.
 	 */
-	high++;						/* establish the loop invariant for high */
-
-	cmpval = nextkey ? 0 : 1;	/* select comparison value */
 
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
+	savehigh = high;
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		if (!isleaf)
+			result = _bt_compare(rel, key, page, mid);
+		else
+			result = _bt_nonpivot_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
 		else
+		{
 			high = mid;
+
+			/*
+			 * The high bound can only be reused by the later, more
+			 * restrictive binary search when the tuple at that offset is
+			 * known to be strictly greater than the original scankey
+			 */
+			if (result != 0)
+				savehigh = high;
+		}
+	}
+
+	if (key->savebinsrch)
+	{
+		key->low = low;
+		key->high = savehigh;
+		key->savebinsrch = false;
 	}
 
 	/*
@@ -421,7 +475,8 @@ _bt_binsrch(Relation rel,
 
 	/*
 	 * On a non-leaf page, return the last key < scan key (resp. <= scan key).
-	 * There must be one if _bt_compare() is playing by the rules.
+	 * There must be one if _bt_compare()/_bt_tuple_compare() are playing
+	 * by the rules.
 	 */
 	Assert(low > P_FIRSTDATAKEY(opaque));
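
Putting the save/restore flags together, the pattern I have in mind for
unique index insertions looks roughly like this (a simplified sketch of what
_bt_doinsert()/_bt_findinsertloc() are expected to do, not code taken from
the patch):

static OffsetNumber
unique_insert_binsrch_sketch(Relation rel, Buffer buf, BTScanInsert key,
							 IndexTuple itup)
{
	OffsetNumber offset;

	/* first pass: no heap TID tie-breaker yet; remember the bounds */
	key->scantid = NULL;
	key->savebinsrch = true;
	offset = _bt_binsrch(rel, key, buf);	/* fills key->low/key->high */

	/* ... _bt_check_unique() would run here, with the lock still held ... */

	/* second pass: add the tie-breaker and pick up where we left off */
	key->scantid = &itup->t_tid;
	key->restorebinsrch = true;
	offset = _bt_binsrch(rel, key, buf);	/* reuses the saved bounds */

	return offset;
}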
 
@@ -431,21 +486,11 @@ _bt_binsrch(Relation rel,
 /*----------
  *	_bt_compare() -- Compare scankey to a particular tuple on the page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * Convenience wrapper for _bt_tuple_compare() callers that want to compare
+ * an offset on a particular page.
  *
- *	keysz: number of key conditions to be checked (might be less than the
- *		number of index columns!)
  *	page/offnum: location of btree item to be compared to.
  *
- *		This routine returns:
- *			<0 if scankey < tuple at offnum;
- *			 0 if scankey == tuple at offnum;
- *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
- *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
  * scankey.  The actual key value stored (if any, which there probably isn't)
@@ -456,26 +501,82 @@ _bt_binsrch(Relation rel,
  */
 int32
 _bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
+			BTScanInsert key,
 			Page page,
 			OffsetNumber offnum)
 {
-	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
-	int			i;
+	int			ntupatts;
 
 	Assert(_bt_check_natts(rel, page, offnum));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
 	 * --- see NOTE above.
+	 *
+	 * A minus infinity key has all attributes truncated away, so this test is
+	 * redundant with the minus infinity attribute tie-breaker.  However, the
+	 * number of attributes in minus infinity tuples was not explicitly
+	 * represented as 0 until PostgreSQL v11, so an explicit offnum test is
+	 * still required.
 	 */
 	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
+	return _bt_tuple_compare(rel, key, itup, ntupatts);
+}
+
+/*
+ * Optimized version of _bt_compare().  Only works on non-pivot tuples.
+ */
+static inline int32
+_bt_nonpivot_compare(Relation rel,
+					 BTScanInsert key,
+					 Page page,
+					 OffsetNumber offnum)
+{
+	IndexTuple	itup;
+
+	Assert(_bt_check_natts(rel, page, offnum));
+
+	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	Assert(BTreeTupleGetNAtts(itup, rel) ==
+		   IndexRelationGetNumberOfAttributes(rel));
+	return _bt_tuple_compare(rel, key, itup, key->keysz);
+}
+
+/*----------
+ *	_bt_tuple_compare() -- Compare scankey to a particular tuple.
+ *
+ * The passed scankey must be an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ *		This routine returns:
+ *			<0 if scankey < tuple;
+ *			 0 if scankey == tuple;
+ *			>0 if scankey > tuple.
+ *		NULLs in the keys are treated as sortable values.  Therefore
+ *		"equality" does not necessarily mean that the item should be
+ *		returned to the caller as a matching key!
+ *----------
+ */
+int32
+_bt_tuple_compare(Relation rel,
+				  BTScanInsert key,
+				  IndexTuple itup,
+				  int ntupatts)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	ItemPointer heapTid;
+	int			ncmpkey;
+	int			i;
+	ScanKey		scankey;
+
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(key->uniquekeys || key->scantid == NULL);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -489,7 +590,9 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	ncmpkey = Min(ntupatts, key->keysz);
+	scankey = key->scankeys;
+	for (i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -540,8 +643,83 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * Use the number of attributes as a tie-breaker, in order to treat
+	 * truncated attributes in index as minus infinity.
+	 */
+	if (key->keysz > ntupatts)
+		return 1;
+
+	/* If caller provided no heap TID tie-breaker for scan, they're equal */
+	if (key->scantid == NULL)
+		return 0;
+
+	/*
+	 * Although it isn't counted as an attribute by BTreeTupleGetNAtts(), heap
+	 * TID is an implicit final key attribute that ensures that all index
+	 * tuples have a distinct set of key attribute values.
+	 *
+	 * This is often truncated away in pivot tuples, which makes the attribute
+	 * value implicitly negative infinity.
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (heapTid == NULL)
+		return 1;
+
+	return ItemPointerCompare(key->scantid, heapTid);
+}
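
A quick worked example of how these rules combine, for an index on two text
columns (a, b); this just tabulates _bt_tuple_compare()'s return value:

/*
 * scankey (a='foo'), no scantid           vs  pivot ('foo')               =>  0
 * scankey (a='foo', b='bar'), any scantid vs  pivot ('foo')               =>  1
 *     (keysz > ntupatts: the truncated attribute is minus infinity)
 * scankey (a='foo', b='bar'), TID (17,5)  vs  pivot ('foo', 'bar')        =>  1
 *     (the pivot's truncated heap TID is minus infinity)
 * scankey (a='foo', b='bar'), TID (17,5)  vs  leaf ('foo', 'bar', (17,9)) => -1
 */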
+
+/*
+ * _bt_lowest_scantid() -- Manufacture low heap TID.
+ *
+ *		Create a heap TID that, as far as _bt_tuple_compare() is concerned,
+ *		is strictly less than any possible real heap TID, while still being
+ *		treated as greater than minus infinity.  The overall effect is that
+ *		_bt_search follows downlinks from pivot tuples whose non-TID
+ *		attributes are equal to the scankey but whose heap TID attribute was
+ *		truncated away, since the scankey compares as greater than such a
+ *		pivot tuple as a whole.  (Obviously this can only be of use when a
+ *		scankey has values for all key attributes other than the heap TID
+ *		tie-breaker attribute/scantid.)
+ *
+ * If we didn't do this then affected index scans would have to
+ * unnecessarily visit an extra page before moving right to the page they
+ * should have landed on from the parent in the first place.  There would
+ * even be a useless binary search on the left/first page, since a high key
+ * check won't have the search move right immediately (the high key will be
+ * identical to the downlink we should have followed in the parent, barring
+ * a concurrent page split).
+ *
+ * This is particularly important with unique index insertions, since "the
+ * first page the value could be on" has an exclusive buffer lock held while
+ * a subsequent page (usually the actual first page the value could be on)
+ * has a shared buffer lock held.  (There may also be heap buffer locks
+ * acquired during this process.)
+ *
+ * Note that implementing this by adding hard-coding to _bt_compare is
+ * unworkable, since that would break nextkey semantics in the common case
+ * where all non-TID key attributes have been provided.
+ */
+ItemPointer
+_bt_lowest_scantid(void)
+{
+	static ItemPointer low = NULL;
+
+	/*
+	 * A heap TID that's less than or equal to any possible real heap TID
+	 * would also work.  Generating an impossibly-low TID value seems
+	 * slightly simpler.
+	 */
+	if (!low)
+	{
+		low = &lowest;
+
+		/* Lowest possible block is 0 */
+		ItemPointerSetBlockNumber(low, 0);
+		/* InvalidOffsetNumber less than any real offset */
+		ItemPointerSetOffsetNumber(low, InvalidOffsetNumber);
+	}
+
+	return low;
 }
 
 /*
@@ -575,8 +753,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat;
 	bool		nextkey;
 	bool		goback;
+	BTScanInsertData key;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
-	ScanKeyData scankeys[INDEX_MAX_KEYS];
+	ScanKey		scankeys;
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -822,10 +1001,12 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
 	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built in the local
-	 * scankeys[] array, using the keys identified by startKeys[].
+	 * identified above.  The insertion scankey is built in the scankeys[]
+	 * array, using the keys identified by startKeys[].
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
+	scankeys = key.scankeys;
+
 	for (i = 0; i < keysCount; i++)
 	{
 		ScanKey		cur = startKeys[i];
@@ -1053,12 +1234,38 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
+	/*
+	 * Initialize insertion scankey.
+	 *
+	 * Manufacture sentinel scan tid that's less than any possible heap TID
+	 * in the index when that might allow us to avoid unnecessary moves
+	 * right while descending the tree.
+	 *
+	 * Never do this for any nextkey case, since that would make
+	 * _bt_search() incorrectly land on the leaf page with the second
+	 * user-attribute-wise duplicate tuple, rather than landing on the leaf
+	 * page with the next user-attribute-distinct key > scankey, which is
+	 * the intended behavior.  We could invent a _bt_highest_scantid() to
+	 * use in nextkey cases, but that would never actually save any cycles
+	 * during the descent of the tree; "_bt_binsrch() + nextkey = true"
+	 * already behaves as if all tuples <= scankey (in terms of the
+	 * attributes/keys actually supplied in the scankey) are < scankey.
+	 */
+	key.uniquekeys = _bt_hasuniquekeys(rel);
+	key.savebinsrch = key.restorebinsrch = false;
+	key.low = key.high = 0;
+	key.nextkey = nextkey;
+	key.keysz = keysCount;
+	key.scantid = NULL;
+	if (key.keysz >= IndexRelationGetNumberOfKeyAttributes(rel) &&
+		!key.nextkey && key.uniquekeys)
+		key.scantid = _bt_lowest_scantid();
+
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, &key, &buf, BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
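
To spell out when the sentinel scantid set up above actually gets used
(illustrative only, not part of the patch):

	/*
	 * With a two-column version 4 index on (a, b), a forward scan on
	 * "a = 5 AND b = 7" supplies both key attributes with nextkey = false,
	 * so scantid is set to the sentinel and the descent lands directly on
	 * the first leaf page that could hold (5, 7).  A scan on just "a = 5"
	 * leaves scantid NULL, because keysz is less than the number of key
	 * attributes, and any nextkey case leaves it NULL as well.
	 */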
@@ -1087,7 +1294,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, &key, buf);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..3922d9252b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -743,6 +743,7 @@ _bt_sortaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -796,8 +797,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -813,28 +812,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	itupsz = IndexTupleSize(itup);
 	itupsz = MAXALIGN(itupsz);
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itupsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: similar code appears in _bt_insertonpg() to defend against
-	 * oversize items being inserted into an already-existing index. But
-	 * during creation of an index, we don't go through there.
-	 */
 	if (itupsz > BTMaxItemSize(npage))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itupsz, BTMaxItemSize(npage),
-						RelationGetRelationName(wstate->index)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(wstate->heap,
-									RelationGetRelationName(wstate->index))));
+		_bt_check_third_page(wstate->index, wstate->heap, npage, itup);
 
 	/*
 	 * Check to see if page is "full".  It's definitely full if the item won't
@@ -880,19 +859,30 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
+			IndexTuple	lastleft;
 			IndexTuple	truncated;
 			Size		truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks
 			 * in internal pages are either negative infinity items, or get
 			 * their contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it
+			 * more likely that _bt_truncate() can truncate away more
+			 * attributes, whereas the split point passed by _bt_split() is
+			 * chosen much more delicately.  Suffix truncation is mostly
+			 * useful because it can greatly improve space utilization for
+			 * workloads with random insertions, or insertions of
+			 * monotonically increasing values at "local" points in the key
+			 * space.  It doesn't seem worthwhile to add complex logic for
+			 * choosing a split point here for a benefit that is bound to be
+			 * much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
@@ -905,7 +895,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_truncate(wstate->index, lastleft, oitup, true);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -924,8 +917,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -970,7 +964,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1029,8 +1023,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1115,7 +1110,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
 		}
 
-		_bt_freeskey(indexScanKey);
+		pfree(indexScanKey);
 
 		for (;;)
 		{
@@ -1127,6 +1122,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1134,7 +1131,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1151,6 +1147,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all
+				 * keys in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
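
For reference, the ordering that ItemPointerCompare() contributes to the
merge above amounts to comparing block number first and offset number second;
a standalone sketch of that effect (not the actual ItemPointerCompare()
implementation, which also deals with the packed on-disk representation):

static int
tid_compare_sketch(unsigned int blk1, unsigned short off1,
				   unsigned int blk2, unsigned short off2)
{
	/* block number is the most significant part of a TID's sort order */
	if (blk1 != blk2)
		return (blk1 < blk2) ? -1 : 1;
	/* offset number breaks ties within a block; no two live TIDs are equal */
	if (off1 != off2)
		return (off1 < off2) ? -1 : 1;
	return 0;
}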
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 205457ef99..83298ff257 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,6 +49,8 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_leave_natts(Relation rel, IndexTuple lastleft,
+				IndexTuple firstright, bool build);
 
 
 /*
@@ -56,34 +58,56 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
+ *		When itup is a non-pivot tuple, the returned insertion scan key is
+ *		suitable for finding a place for it to go on the leaf level.  When
+ *		itup is a pivot tuple, the returned insertion scankey is suitable
+ *		for locating the leaf page with the pivot as its high key (there
+ *		must have been one at some point if the pivot tuple actually came
+ *		from the tree, barring the minus infinity special case).
+ *
+ *		Note that we may occasionally have to share lock the metapage, in
+ *		order to determine whether or not the keys in the index are expected
+ *		to be unique (i.e. whether or not heap TID is treated as a tie-breaker
+ *		attribute).  Callers that cannot tolerate this can request that we
+ *		assume that all entries in the index are unique.
+ *
  *		The result is intended for use with _bt_compare().
  */
-ScanKey
-_bt_mkscankey(Relation rel, IndexTuple itup)
+BTScanInsert
+_bt_mkscankey(Relation rel, IndexTuple itup, bool assumeunique)
 {
+	BTScanInsert res;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
+	int			tupnatts;
 	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts > 0);
+	Assert(tupnatts <= indnatts);
 
 	/*
-	 * We'll execute search using scan key constructed on key columns. Non-key
-	 * (INCLUDE index) columns are always omitted from scan keys.
+	 * We'll execute search using scan key constructed on key columns.
+	 * Truncated attributes and non-key attributes are omitted from the final
+	 * scan key.
 	 */
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
+	res = palloc(offsetof(BTScanInsertData, scankeys) +
+				 sizeof(ScanKeyData) * indnkeyatts);
+	res->uniquekeys = assumeunique || _bt_hasuniquekeys(rel);
+	res->savebinsrch = res->restorebinsrch = false;
+	res->low = res->high = 0;
+	res->nextkey = false;
+	res->keysz = Min(indnkeyatts, tupnatts);
+	res->scantid = res->uniquekeys ? BTreeTupleGetHeapTID(itup) : NULL;
+	skey = res->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
 		FmgrInfo   *procinfo;
@@ -96,7 +120,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Truncated key attributes may not be represented in the index tuple
+		 * due to suffix truncation.  Keys built from truncated attributes are
+		 * defensively represented as NULL values, though they should still
+		 * not participate in comparisons.
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -108,7 +145,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 									   arg);
 	}
 
-	return skey;
+	return res;
 }
 
 /*
@@ -159,15 +196,6 @@ _bt_mkscankey_nodata(Relation rel)
 	return skey;
 }
 
-/*
- * free a scan key made by either _bt_mkscankey or _bt_mkscankey_nodata.
- */
-void
-_bt_freeskey(ScanKey skey)
-{
-	pfree(skey);
-}
-
 /*
  * free a retracement stack made by _bt_search.
  */
@@ -2083,38 +2111,218 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  That is the case when some
+ * attribute in firstright is unequal to the corresponding attribute in
+ * lastleft (unequal according to insertion scan key semantics); every
+ * attribute that follows it can then be truncated away.
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that takes up more
+ * space than firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
- * Note that returned tuple's t_tid offset will hold the number of attributes
- * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * Note that returned tuple's t_tid offset will hold the number of
+ * attributes present, so the original item pointer offset is not
+ * represented.  Caller should only change truncated tuple's downlink.  Note
+ * also that truncated key attributes are treated as containing "minus
+ * infinity" values by _bt_compare()/_bt_tuple_compare().
+ *
+ * Returned tuple is guaranteed to be no larger than the original plus some
+ * extra space for a possible extra heap TID tie-breaker attribute.  This
+ * guarantee is important for staying under the 1/3 of a page restriction on
+ * tuple size.
+ *
+ * CREATE INDEX callers must pass build = true, in order to avoid metapage
+ * access.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			 bool build)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int16		natts = IndexRelationGetNumberOfAttributes(rel);
+	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			leavenatts;
+	IndexTuple	pivot;
+	ItemPointer pivotheaptid;
+	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples, which must have non-key
+	 * attributes in the case of INCLUDE indexes.  It's never okay to truncate
+	 * a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be left behind */
+	leavenatts = _bt_leave_natts(rel, lastleft, firstright, build);
 
-	return truncated;
+	if (leavenatts <= natts)
+	{
+		IndexTuple	tidpivot;
+
+		/*
+		 * Truncate away non-key attributes and/or key attributes.  Do a
+		 * straight copy in the case where the only attribute to be "truncated
+		 * away" is the implicit heap TID key attribute (i.e. the case where
+		 * we can at least avoid adding an explicit heap TID attribute to new
+		 * pivot).  We should only call index_truncate_tuple() when non-TID
+		 * attributes need to be truncated.
+		 */
+		if (leavenatts < natts)
+			pivot = index_truncate_tuple(itupdesc, firstright, leavenatts);
+		else
+			pivot = CopyIndexTuple(firstright);
+
+		/*
+		 * If there is a distinguishing key attribute within leavenatts, there
+		 * is no need to add an explicit heap TID attribute to the new pivot.
+		 */
+		if (leavenatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, leavenatts);
+			return pivot;
+		}
+
+		/*
+		 * Only non-key attributes could be truncated away from an INCLUDE
+		 * index's pivot tuple.  They are not considered part of the key
+		 * space, so it's still necessary to add a heap TID attribute to the
+		 * new pivot tuple.  Create an enlarged copy of the truncated
+		 * firstright tuple, with room for a heap TID at the end.
+		 */
+		Assert(natts > nkeyatts);
+		newsize = MAXALIGN(IndexTupleSize(pivot) + sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since attributes are all equal.  It's
+		 * necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = MAXALIGN(IndexTupleSize(firstright) + sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * We have to use heap TID as a unique-ifier in the new pivot tuple, since
+	 * no non-TID attribute in the right item readily distinguishes the right
+	 * side of the split from the left side.  Use enlarged space that holds a
+	 * copy of first right tuple; place a heap TID value within the extra
+	 * space that remains at the end.
+	 *
+	 * nbtree conceptualizes this case as an inability to truncate away any
+	 * attribute.  We must use an alternative representation of heap TID
+	 * within pivots because heap TID is only treated as an attribute within
+	 * nbtree (e.g., there is no explicit pg_attribute entry).
+	 *
+	 * Callers generally try to avoid choosing a split point that necessitates
+	 * that we do this.  Splits of pages that only involve a single distinct
+	 * value (or set of values) must end up here, though.
+	 */
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/*
+	 * Generate an artificial heap TID value that's immediately before the
+	 * first right item's heap TID.  The goal is to maximize the number of
+	 * future duplicates that will end up on the mostly-empty right side of
+	 * the split, while minimizing the number inserted on the mostly-full left
+	 * side.  (We expect a continual right-heavy split pattern.)
+	 */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  sizeof(ItemPointerData));
+	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerSetOffsetNumber(pivotheaptid,
+							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+
+	/*
+	 * Lehman and Yao require that the downlink to the right page, which is to
+	 * be inserted into the parent page in the second phase of a page split, be
+	 * a strict lower bound on all current and future items on the right page.
+	 * That's why we didn't just directly copy the first right item's heap
+	 * TID.
+	 */
+	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
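
To make the artificial TID step concrete: the pivot's heap TID is just
firstright's TID with its offset number stepped back by one, which keeps
it strictly below everything that is (or ever will be) on the new right
page.  A minimal standalone sketch, with a bare (block, offset) pair
standing in for ItemPointerData (TidSketch and its field names are
illustrative, not patch code):

    #include <stdio.h>

    typedef struct TidSketch
    {
        unsigned int   block;       /* cf. BlockNumber */
        unsigned short offset;      /* cf. OffsetNumber, 1-based */
    } TidSketch;

    int
    main(void)
    {
        TidSketch   firstright = {42, 7};
        TidSketch   pivot = firstright;

        pivot.offset--;             /* the OffsetNumberPrev() step */

        /*
         * (42,6) sorts before (42,7), so the pivot is a strict lower bound
         * on the right page and a non-strict upper bound on the last item
         * currently on the left page.
         */
        printf("pivot TID = (%u,%u)\n", pivot.block, pivot.offset);
        return 0;
    }
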
+
+/*
+ * _bt_leave_natts - how many key attributes to leave when truncating.
+ *
+ * Caller provides two tuples that enclose a split point.  CREATE INDEX
+ * callers must pass build = true so that we may avoid metapage access.  (This
+ * is okay because CREATE INDEX always creates an index on the latest btree
+ * version, where all keys are unique.)
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+				bool build)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			leavenatts;
+	BTScanInsert key;
+
+	key = _bt_mkscankey(rel, firstright, build);
+
+	/*
+	 * Be consistent about the representation of BTREE_VERSION 3 tuples across
+	 * Postgres versions; don't allow new pivot tuples to have truncated key
+	 * attributes there.  This keeps things consistent and simple for
+	 * verification tools that have to handle multiple versions.
+	 */
+	if (!key->uniquekeys)
+	{
+		Assert(nkeyatts != IndexRelationGetNumberOfAttributes(rel));
+		return nkeyatts;
+	}
+
+	key->scantid = NULL;
+
+	/*
+	 * Test even the nkeyatts case (no truncated non-TID attributes), since
+	 * the caller cares about whether or not it can avoid appending a heap
+	 * TID as a unique-ifier.
+	 */
+	leavenatts = 1;
+	for (;;)
+	{
+		if (leavenatts > nkeyatts)
+			break;
+		key->keysz = leavenatts;
+		if (_bt_tuple_compare(rel, key, lastleft, nkeyatts) > 0)
+			break;
+		leavenatts++;
+	}
+
+	/* Can't leak memory here */
+	pfree(key);
+
+	return leavenatts;
 }
 
 /*
@@ -2137,6 +2345,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2156,6 +2365,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
@@ -2165,7 +2375,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2176,7 +2386,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			Assert(!P_RIGHTMOST(opaque));
 
 			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2209,8 +2419,73 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Tuple contains only key attributes, whether or not it is the
 			 * page high key
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 
 	}
 }
+
+/*
+ *	_bt_check_third_page() -- check whether tuple fits on a btree page at all.
+ *
+ * We actually need to be able to fit three items on every page, so restrict
+ * any one item to 1/3 the per-page available space.  Note that itemsz should
+ * not include the ItemId overhead.
+ *
+ * It might be useful to apply TOAST methods rather than throw an error here.
+ * Using out of line storage would break assumptions made by suffix truncation
+ * and by contrib/amcheck, though.
+ */
+void
+_bt_check_third_page(Relation rel, Relation heap, Page page, IndexTuple newtup)
+{
+	bool		needheaptidspace;
+	Size		itemsz;
+
+	itemsz = MAXALIGN(IndexTupleSize(newtup));
+
+	/* Double check item size against limit */
+	if (itemsz <= BTMaxItemSize(page))
+		return;
+
+	/*
+	 * Tuple is probably too large to fit on page, but it's possible that the
+	 * index uses version 2 or version 3, in which case a slightly higher
+	 * limit applies.
+	 */
+	needheaptidspace = _bt_hasuniquekeys(rel);
+	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
+		return;
+
+	if (needheaptidspace)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
+						itemsz, BTREE_VERSION, BTMaxItemSize(page),
+						RelationGetRelationName(rel)),
+				 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+						   ItemPointerGetBlockNumber(&newtup->t_tid),
+						   ItemPointerGetOffsetNumber(&newtup->t_tid),
+						   RelationGetRelationName(heap)),
+				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+						 "Consider a function index of an MD5 hash of the value, "
+						 "or use full text indexing."),
+				 errtableconstraint(heap,
+									RelationGetRelationName(rel))));
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("index row size %zu exceeds btree version 3 maximum %zu for index \"%s\"",
+						itemsz, BTMaxItemSizeNoHeapTid(page),
+						RelationGetRelationName(rel)),
+				 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+						   ItemPointerGetBlockNumber(&newtup->t_tid),
+						   ItemPointerGetOffsetNumber(&newtup->t_tid),
+						   RelationGetRelationName(heap)),
+				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+						 "Consider a function index of an MD5 hash of the value, "
+						 "or use full text indexing."),
+				 errtableconstraint(heap,
+									RelationGetRelationName(rel))));
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 67a94cb80a..fe8f4fe2a7 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -103,7 +103,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 
 	md = BTPageGetMeta(metapg);
 	md->btm_magic = BTREE_MAGIC;
-	md->btm_version = BTREE_VERSION;
+	md->btm_version = xlrec->version;
 	md->btm_root = xlrec->root;
 	md->btm_level = xlrec->level;
 	md->btm_fastroot = xlrec->fastroot;
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -284,6 +268,8 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		OffsetNumber off;
 		IndexTuple	newitem = NULL;
 		Size		newitemsz = 0;
+		IndexTuple	left_hikey = NULL;
+		Size		left_hikeysz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 5c4457179d..667c906b2e 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 5909404e1e..3932d22b62 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -964,7 +964,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -1042,7 +1042,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -4057,9 +4057,10 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required for
+	 * btree indexes, since heap TID is treated as an implicit last key
+	 * attribute in order to ensure that all keys in the index are physically
+	 * unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
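
The practical consequence of the comment above is that the comparator has
to induce a total order: when the user keys tie, block number and then
offset number decide.  A minimal sketch of such a comparator over a
made-up IndexEntry struct (the struct and its names are illustrative, not
Postgres types):

    #include <stdlib.h>

    typedef struct IndexEntry
    {
        int            key;        /* user-visible key attribute */
        unsigned int   block;      /* heap block number */
        unsigned short offset;     /* heap offset number */
    } IndexEntry;

    static int
    compare_entries(const void *a, const void *b)
    {
        const IndexEntry *ea = (const IndexEntry *) a;
        const IndexEntry *eb = (const IndexEntry *) b;

        if (ea->key != eb->key)
            return (ea->key < eb->key) ? -1 : 1;
        /* Keys tie: heap TID decides, so no two entries compare equal */
        if (ea->block != eb->block)
            return (ea->block < eb->block) ? -1 : 1;
        if (ea->offset != eb->offset)
            return (ea->offset < eb->offset) ? -1 : 1;
        return 0;
    }

qsort(entries, nentries, sizeof(IndexEntry), compare_entries) then yields
exactly the kind of leaf-level order that the rest of the patch assumes.
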
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ea495f1724..7baef7d685 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -97,7 +97,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
 typedef struct BTMetaPageData
 {
 	uint32		btm_magic;		/* should contain BTREE_MAGIC */
-	uint32		btm_version;	/* should contain BTREE_VERSION */
+	uint32		btm_version;	/* should be >= BTREE_META_VERSION */
 	BlockNumber btm_root;		/* current root location */
 	uint32		btm_level;		/* tree level of the root page */
 	BlockNumber btm_fastroot;	/* current "fast" root location */
@@ -114,16 +114,27 @@ typedef struct BTMetaPageData
 
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
+#define BTREE_VERSION	4		/* current version number */
 #define BTREE_MIN_VERSION	2	/* minimal supported version number */
+#define BTREE_META_VERSION	3	/* minimal version with all meta fields */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_truncate() will need to enlarge a new
+ * pivot tuple to make space for a tie-breaker heap TID attribute, which
+ * we account for here.
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*sizeof(ItemPointerData)) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeNoHeapTid(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
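
For the common case of an 8kB block size and 8-byte maximum alignment,
the two limits can be worked out directly.  The sketch below assumes the
usual sizes of 24 bytes for the page header, 4 for an ItemIdData, 6 for
an ItemPointerData and 16 for BTPageOpaqueData; those numbers are
assumptions of the sketch, not values taken from the patch:

    #include <stdio.h>

    #define MAXALIGN(LEN)       (((LEN) + 7) & ~((size_t) 7))
    #define MAXALIGN_DOWN(LEN)  ((LEN) & ~((size_t) 7))

    int
    main(void)
    {
        size_t  blcksz = 8192;
        size_t  pagehdr = 24;   /* SizeOfPageHeaderData (assumed) */
        size_t  itemid = 4;     /* sizeof(ItemIdData) (assumed) */
        size_t  itemptr = 6;    /* sizeof(ItemPointerData) (assumed) */
        size_t  opaque = 16;    /* sizeof(BTPageOpaqueData) (assumed) */

        printf("BTMaxItemSize:          %zu\n",
               MAXALIGN_DOWN((blcksz -
                              MAXALIGN(pagehdr + 3 * itemid + 3 * itemptr) -
                              MAXALIGN(opaque)) / 3));
        printf("BTMaxItemSizeNoHeapTid: %zu\n",
               MAXALIGN_DOWN((blcksz -
                              MAXALIGN(pagehdr + 3 * itemid) -
                              MAXALIGN(opaque)) / 3));
        return 0;
    }

Under those assumptions this prints 2704 and 2712, i.e. the version 4
format gives up a handful of bytes per item so that _bt_truncate() can
always append a heap TID to a pivot without overflowing the page.
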
@@ -204,21 +215,23 @@ typedef struct BTMetaPageData
  * real offset (downlinks only need to store a block number).  The offset
  * field only stores the number of attributes when the INDEX_ALT_TID_MASK
  * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * number of attributes).  INDEX_ALT_TID_MASK is only used for pivot tuples
+ * at present, though it's possible that it will be used within non-pivot
+ * tuples in the future.  Do not assume that a tuple with INDEX_ALT_TID_MASK
+ * set must be a pivot tuple.  A pivot tuple must have INDEX_ALT_TID_MASK set
+ * as of BTREE_VERSION 4, however.
  *
  * The 12 least significant offset bits are used to represent the number of
- * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
- * for future use (BT_RESERVED_OFFSET_MASK bits). BT_N_KEYS_OFFSET_MASK should
- * be large enough to store any number <= INDEX_MAX_KEYS.
+ * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 status bits
+ * (BT_RESERVED_OFFSET_MASK bits): BT_HEAP_TID_ATTR, plus 3 bits that are
+ * reserved for future use.  BT_N_KEYS_OFFSET_MASK should be large enough to
+ * store any number <= INDEX_MAX_KEYS.
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+/* Reserved to indicate if heap TID is represented at end of tuple */
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +254,15 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +271,42 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
+ * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
+ * (as of BTREE_VERSION 4, pivot tuples must have it set).  When non-pivot
+ * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
+ * probably also contain a heap TID at the end of the tuple.  We avoid
+ * assuming that a tuple with INDEX_ALT_TID_MASK set is necessarily a pivot
+ * tuple.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   sizeof(ItemPointerData)) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
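
In other words, a pivot's t_tid offset field does double duty: the low 12
bits carry the attribute count, while the 0x1000 bit says "a heap TID is
appended at the end of this tuple", with the TID itself read from the
final sizeof(ItemPointerData) bytes.  A quick sketch of the bit packing
using the same mask values (encode_pivot_offset is an illustrative
helper, not part of the patch):

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BT_N_KEYS_OFFSET_MASK  0x0FFF
    #define BT_HEAP_TID_ATTR       0x1000

    static uint16_t
    encode_pivot_offset(int nkeyatts, int has_heap_tid)
    {
        assert((nkeyatts & ~BT_N_KEYS_OFFSET_MASK) == 0);
        return (uint16_t) (nkeyatts | (has_heap_tid ? BT_HEAP_TID_ATTR : 0));
    }

    int
    main(void)
    {
        uint16_t    off = encode_pivot_offset(2, 1);

        printf("natts = %d, heap TID present = %d\n",
               off & BT_N_KEYS_OFFSET_MASK,
               (off & BT_HEAP_TID_ATTR) != 0);
        return 0;
    }
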
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -319,6 +365,62 @@ typedef struct BTStackData
 
 typedef BTStackData *BTStack;
 
+/*
+ * BTScanInsert is the btree-private state needed to find an initial position
+ * for an indexscan, or to insert new tuples.  For details on its mutable
+ * state, see _bt_binsrch and _bt_findinsertloc.
+ *
+ * uniquekeys indicates whether we expect all keys in the index to be unique,
+ * by treating heap TID as a tie-breaker attribute (i.e. the index is
+ * BTREE_VERSION 4+).  scantid should never be set when the index is not a
+ * uniquekeys index.
+ *
+ * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
+ * locate the first item >= scankey.  When nextkey is true, they will locate
+ * the first item > scan key.
+ *
+ * scantid is the heap TID that is used as a final tie-breaker attribute,
+ * which may be set to NULL to indicate its absence.  When inserting new
+ * tuples, it must be set, since every tuple in the tree unambiguously belongs
+ * in one exact position, even when there are entries in the tree that are
+ * considered duplicates by external code.  Unique insertions set scantid only
+ * after unique checking indicates that it's safe to insert.  Despite the
+ * representational difference, scantid is just another insertion scankey to
+ * routines like _bt_search().
+ *
+ * keysz is the number of insertion scankeys present (scantid is not included
+ * in this count).
+ *
+ * scankeys is an array of scan key entries for attributes that are compared
+ * before scantid (user-visible attributes).  Every attribute should have an
+ * entry during insertion, though not necessarily when a regular index scan
+ * uses an insertion scankey to find an initial leaf page.  The array is
+ * declared with INDEX_MAX_KEYS entries so that stack allocation works, but
+ * is otherwise treated like a flexible array member.  See nbtree/README.
+ */
+
+typedef struct BTScanInsertData
+{
+	/*
+	 * Mutable state.  Used by _bt_binsrch() to inexpensively repeat a binary
+	 * search when only scantid has changed.
+	 */
+	bool		 savebinsrch;
+	bool		 restorebinsrch;
+	OffsetNumber low;
+	OffsetNumber high;
+
+	/* State used to find tuples on the leaf level */
+	bool		uniquekeys;
+	bool		nextkey;
+	ItemPointer scantid;						/* Not used in !uniquekeys case */
+	int			keysz;							/* Size of scankeys */
+	ScanKeyData scankeys[INDEX_MAX_KEYS];		/* Must appear last */
+} BTScanInsertData;
+
+typedef BTScanInsertData *BTScanInsert;
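
A concrete reading of the nextkey flag: with nextkey = false the descent
ends up at the first item >= the scan key, with nextkey = true at the
first item strictly greater, which is the usual lower-bound/upper-bound
distinction.  A minimal sketch over a sorted int array (binsrch_first is
an illustrative helper, not a function from the patch):

    /*
     * Return the index of the first element >= key (nextkey = false), or
     * the first element > key (nextkey = true); returns n if none qualify.
     */
    static int
    binsrch_first(const int *arr, int n, int key, int nextkey)
    {
        int     low = 0;
        int     high = n;

        while (low < high)
        {
            int     mid = low + (high - low) / 2;

            /* "item <= key" versus "item < key" is the whole difference */
            if (nextkey ? (arr[mid] <= key) : (arr[mid] < key))
                low = mid + 1;
            else
                high = mid;
        }
        return low;
    }

When scantid is set it simply behaves as one more, lowest-order
comparison inside the same search.
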
+
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -541,6 +643,7 @@ extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
+extern bool _bt_hasuniquekeys(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -559,15 +662,15 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
  * prototypes for functions in nbtsearch.c
  */
 extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
+		   BTScanInsert key,
 		   Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
-extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_tuple_compare(Relation rel, BTScanInsert key, IndexTuple itup,
+				 int ntupatts);
+extern ItemPointer _bt_lowest_scantid(void);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -576,9 +679,9 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 /*
  * prototypes for functions in nbtutils.c
  */
-extern ScanKey _bt_mkscankey(Relation rel, IndexTuple itup);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup,
+								  bool assumeunique);
 extern ScanKey _bt_mkscankey_nodata(Relation rel);
-extern void _bt_freeskey(ScanKey skey);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -600,8 +703,11 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+					IndexTuple firstright, bool build);
 extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
+extern void _bt_check_third_page(Relation rel, Relation heap, Page page,
+					 IndexTuple newtup);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 819373031c..06da0965f7 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,7 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50 and 0x60 are unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -47,6 +46,7 @@
  */
 typedef struct xl_btree_metadata
 {
+	uint32		version;
 	BlockNumber root;
 	uint32		level;
 	BlockNumber fastroot;
@@ -82,20 +82,16 @@ typedef struct xl_btree_insert
  *
- * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
+ * Note: the two XLOG_BTREE_SPLIT xl_info codes both use this data record.
  * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always explicitly log the left page high key value.
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * In the _R variants, the new item is one of the right page's tuples.  An
+ * IndexTuple representing the HIKEY of the left page always follows, since
+ * with suffix truncation the left high key can no longer be assumed to match
+ * the leftmost key in the new right page.
  *
  * Backup Blk 1: new right page
  *
diff --git a/src/test/modules/test_extensions/expected/test_extensions.out b/src/test/modules/test_extensions/expected/test_extensions.out
index 28d86c4b87..29b4ec95c1 100644
--- a/src/test/modules/test_extensions/expected/test_extensions.out
+++ b/src/test/modules/test_extensions/expected/test_extensions.out
@@ -30,10 +30,10 @@ NOTICE:  installing required extension "test_ext_cyclic2"
 ERROR:  cyclic dependency detected between extensions "test_ext_cyclic1" and "test_ext_cyclic2"
 DROP SCHEMA test_ext CASCADE;
 NOTICE:  drop cascades to 5 other objects
-DETAIL:  drop cascades to extension test_ext3
-drop cascades to extension test_ext5
-drop cascades to extension test_ext2
+DETAIL:  drop cascades to extension test_ext5
 drop cascades to extension test_ext4
+drop cascades to extension test_ext3
+drop cascades to extension test_ext2
 drop cascades to extension test_ext1
 CREATE EXTENSION test_ext6;
 DROP EXTENSION test_ext6;
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 717e965f30..e815a657b1 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -1070,9 +1070,9 @@ select distinct min(f1), max(f1) from minmaxtest;
 
 drop table minmaxtest cascade;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table minmaxtest1
+DETAIL:  drop cascades to table minmaxtest3
 drop cascades to table minmaxtest2
-drop cascades to table minmaxtest3
+drop cascades to table minmaxtest1
 -- check for correct detection of nested-aggregate errors
 select max(min(unique1)) from tenk1;
 ERROR:  aggregate function calls cannot be nested
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 68cd3e5676..d2ece1355a 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -2684,19 +2684,19 @@ select alter2.plus1(41);
 -- clean up
 drop schema alter2 cascade;
 NOTICE:  drop cascades to 13 other objects
-DETAIL:  drop cascades to table alter2.t1
-drop cascades to view alter2.v1
-drop cascades to function alter2.plus1(integer)
-drop cascades to type alter2.posint
-drop cascades to operator family alter2.ctype_hash_ops for access method hash
+DETAIL:  drop cascades to text search template alter2.tmpl
+drop cascades to text search dictionary alter2.dict
+drop cascades to text search parser alter2.prs
+drop cascades to text search configuration alter2.cfg
+drop cascades to conversion alter2.ascii_to_utf8
 drop cascades to type alter2.ctype
 drop cascades to function alter2.same(alter2.ctype,alter2.ctype)
 drop cascades to operator alter2.=(alter2.ctype,alter2.ctype)
-drop cascades to conversion alter2.ascii_to_utf8
-drop cascades to text search parser alter2.prs
-drop cascades to text search configuration alter2.cfg
-drop cascades to text search template alter2.tmpl
-drop cascades to text search dictionary alter2.dict
+drop cascades to operator family alter2.ctype_hash_ops for access method hash
+drop cascades to type alter2.posint
+drop cascades to function alter2.plus1(integer)
+drop cascades to table alter2.t1
+drop cascades to view alter2.v1
 --
 -- composite types
 --
diff --git a/src/test/regress/expected/collate.out b/src/test/regress/expected/collate.out
index fcbe3a5cc8..6a58a6ae8a 100644
--- a/src/test/regress/expected/collate.out
+++ b/src/test/regress/expected/collate.out
@@ -667,5 +667,6 @@ SELECT collation for ((SELECT b FROM collate_test1 LIMIT 1));
 -- must get rid of them.
 --
 \set VERBOSITY terse
+SET client_min_messages TO 'warning';
 DROP SCHEMA collate_tests CASCADE;
-NOTICE:  drop cascades to 17 other objects
+RESET client_min_messages;
diff --git a/src/test/regress/expected/create_type.out b/src/test/regress/expected/create_type.out
index 2f7d5f94d7..8309756030 100644
--- a/src/test/regress/expected/create_type.out
+++ b/src/test/regress/expected/create_type.out
@@ -161,13 +161,13 @@ DROP FUNCTION base_fn_out(opaque); -- error
 ERROR:  function base_fn_out(opaque) does not exist
 DROP TYPE base_type; -- error
 ERROR:  cannot drop type base_type because other objects depend on it
-DETAIL:  function base_fn_out(base_type) depends on type base_type
-function base_fn_in(cstring) depends on type base_type
+DETAIL:  function base_fn_in(cstring) depends on type base_type
+function base_fn_out(base_type) depends on type base_type
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 DROP TYPE base_type CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to function base_fn_out(base_type)
-drop cascades to function base_fn_in(cstring)
+DETAIL:  drop cascades to function base_fn_in(cstring)
+drop cascades to function base_fn_out(base_type)
 -- Check usage of typmod with a user-defined type
 -- (we have borrowed numeric's typmod functions)
 CREATE TEMP TABLE mytab (foo widget(42,13,7));     -- should fail
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8e50f8ffbb..8d31110b87 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 -- doesn't work: grant still exists
 DROP USER regress_dep_user1;
 ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
-DETAIL:  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
+DETAIL:  privileges for table deptest1
 privileges for database regression
-privileges for table deptest1
+owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
 DROP OWNED BY regress_dep_user1;
 DROP USER regress_dep_user1;
 \set VERBOSITY terse
diff --git a/src/test/regress/expected/domain.out b/src/test/regress/expected/domain.out
index 0b5a9041b0..f4899f2a38 100644
--- a/src/test/regress/expected/domain.out
+++ b/src/test/regress/expected/domain.out
@@ -643,10 +643,10 @@ update domnotnull set col1 = null; -- fails
 ERROR:  domain dnotnulltest does not allow null values
 alter domain dnotnulltest drop not null;
 update domnotnull set col1 = null;
+\set VERBOSITY terse
 drop domain dnotnulltest cascade;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to column col1 of table domnotnull
-drop cascades to column col2 of table domnotnull
+\set VERBOSITY default
 -- Test ALTER DOMAIN .. DEFAULT ..
 create table domdeftest (col1 ddef1);
 insert into domdeftest default values;
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index 0e32d5c427..0755931db8 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -187,9 +187,9 @@ ERROR:  event trigger "regress_event_trigger" does not exist
 -- should fail, regress_evt_user owns some objects
 drop role regress_evt_user;
 ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
-DETAIL:  owner of event trigger regress_event_trigger3
+DETAIL:  owner of user mapping for regress_evt_user on server useless_server
 owner of default privileges on new relations belonging to role regress_evt_user
-owner of user mapping for regress_evt_user on server useless_server
+owner of event trigger regress_event_trigger3
 -- cleanup before next test
 -- these are all OK; the second one should emit a NOTICE
 drop event trigger if exists regress_event_trigger2;
@@ -276,14 +276,13 @@ CREATE EVENT TRIGGER regress_event_trigger_drop_objects ON sql_drop
 ALTER TABLE schema_one.table_one DROP COLUMN a;
 DROP SCHEMA schema_one, schema_two CASCADE;
 NOTICE:  drop cascades to 7 other objects
-DETAIL:  drop cascades to table schema_two.table_two
-drop cascades to table schema_two.table_three
-drop cascades to function schema_two.add(integer,integer)
+DETAIL:  drop cascades to function schema_two.add(integer,integer)
 drop cascades to function schema_two.newton(integer)
-drop cascades to table schema_one.table_one
-drop cascades to table schema_one."table two"
+drop cascades to table schema_two.table_three
+drop cascades to table schema_two.table_two
 drop cascades to table schema_one.table_three
-NOTICE:  table "schema_two_table_two" does not exist, skipping
+drop cascades to table schema_one."table two"
+drop cascades to table schema_one.table_one
 NOTICE:  table "audit_tbls_schema_two_table_three" does not exist, skipping
 ERROR:  object audit_tbls.schema_two_table_three of type table cannot be dropped
 CONTEXT:  PL/pgSQL function undroppable() line 14 at RAISE
@@ -292,61 +291,61 @@ PL/pgSQL function test_evtrig_dropped_objects() line 8 at EXECUTE
 DELETE FROM undroppable_objs WHERE object_identity = 'audit_tbls.schema_two_table_three';
 DROP SCHEMA schema_one, schema_two CASCADE;
 NOTICE:  drop cascades to 7 other objects
-DETAIL:  drop cascades to table schema_two.table_two
-drop cascades to table schema_two.table_three
-drop cascades to function schema_two.add(integer,integer)
+DETAIL:  drop cascades to function schema_two.add(integer,integer)
 drop cascades to function schema_two.newton(integer)
-drop cascades to table schema_one.table_one
-drop cascades to table schema_one."table two"
+drop cascades to table schema_two.table_three
+drop cascades to table schema_two.table_two
 drop cascades to table schema_one.table_three
-NOTICE:  table "schema_two_table_two" does not exist, skipping
+drop cascades to table schema_one."table two"
+drop cascades to table schema_one.table_one
 NOTICE:  table "audit_tbls_schema_two_table_three" does not exist, skipping
-NOTICE:  table "schema_one_table_one" does not exist, skipping
-NOTICE:  table "schema_one_table two" does not exist, skipping
+NOTICE:  table "schema_two_table_two" does not exist, skipping
 NOTICE:  table "schema_one_table_three" does not exist, skipping
+NOTICE:  table "schema_one_table two" does not exist, skipping
+NOTICE:  table "schema_one_table_one" does not exist, skipping
 ERROR:  object schema_one.table_three of type table cannot be dropped
 CONTEXT:  PL/pgSQL function undroppable() line 14 at RAISE
 DELETE FROM undroppable_objs WHERE object_identity = 'schema_one.table_three';
 DROP SCHEMA schema_one, schema_two CASCADE;
 NOTICE:  drop cascades to 7 other objects
-DETAIL:  drop cascades to table schema_two.table_two
-drop cascades to table schema_two.table_three
-drop cascades to function schema_two.add(integer,integer)
+DETAIL:  drop cascades to function schema_two.add(integer,integer)
 drop cascades to function schema_two.newton(integer)
-drop cascades to table schema_one.table_one
-drop cascades to table schema_one."table two"
+drop cascades to table schema_two.table_three
+drop cascades to table schema_two.table_two
 drop cascades to table schema_one.table_three
-NOTICE:  table "schema_two_table_two" does not exist, skipping
+drop cascades to table schema_one."table two"
+drop cascades to table schema_one.table_one
 NOTICE:  table "audit_tbls_schema_two_table_three" does not exist, skipping
-NOTICE:  table "schema_one_table_one" does not exist, skipping
-NOTICE:  table "schema_one_table two" does not exist, skipping
+NOTICE:  table "schema_two_table_two" does not exist, skipping
 NOTICE:  table "schema_one_table_three" does not exist, skipping
+NOTICE:  table "schema_one_table two" does not exist, skipping
+NOTICE:  table "schema_one_table_one" does not exist, skipping
 SELECT * FROM dropped_objects WHERE schema IS NULL OR schema <> 'pg_toast';
      type     |   schema   |               object                
 --------------+------------+-------------------------------------
  table column | schema_one | schema_one.table_one.a
  schema       |            | schema_two
- table        | schema_two | schema_two.table_two
- type         | schema_two | schema_two.table_two
- type         | schema_two | schema_two.table_two[]
+ function     | schema_two | schema_two.add(integer,integer)
+ aggregate    | schema_two | schema_two.newton(integer)
  table        | audit_tbls | audit_tbls.schema_two_table_three
  type         | audit_tbls | audit_tbls.schema_two_table_three
  type         | audit_tbls | audit_tbls.schema_two_table_three[]
  table        | schema_two | schema_two.table_three
  type         | schema_two | schema_two.table_three
  type         | schema_two | schema_two.table_three[]
- function     | schema_two | schema_two.add(integer,integer)
- aggregate    | schema_two | schema_two.newton(integer)
+ table        | schema_two | schema_two.table_two
+ type         | schema_two | schema_two.table_two
+ type         | schema_two | schema_two.table_two[]
  schema       |            | schema_one
- table        | schema_one | schema_one.table_one
- type         | schema_one | schema_one.table_one
- type         | schema_one | schema_one.table_one[]
- table        | schema_one | schema_one."table two"
- type         | schema_one | schema_one."table two"
- type         | schema_one | schema_one."table two"[]
  table        | schema_one | schema_one.table_three
  type         | schema_one | schema_one.table_three
  type         | schema_one | schema_one.table_three[]
+ table        | schema_one | schema_one."table two"
+ type         | schema_one | schema_one."table two"
+ type         | schema_one | schema_one."table two"[]
+ table        | schema_one | schema_one.table_one
+ type         | schema_one | schema_one.table_one
+ type         | schema_one | schema_one.table_one[]
 (23 rows)
 
 DROP OWNED BY regress_evt_user;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 75365501d4..1171a8865f 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
 RESET ROLE;
 DROP ROLE regress_test_indirect;                            -- ERROR
 ERROR:  role "regress_test_indirect" cannot be dropped because some objects depend on it
-DETAIL:  owner of server s1
-privileges for foreign-data wrapper foo
+DETAIL:  privileges for foreign-data wrapper foo
+owner of server s1
 \des+
                                                                                  List of foreign servers
  Name |           Owner           | Foreign-data wrapper |                   Access privileges                   |  Type  | Version |             FDW options              | Description 
@@ -2060,9 +2060,9 @@ DROP TABLE temp_parted;
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 ERROR:  role "regress_test_role" cannot be dropped because some objects depend on it
-DETAIL:  privileges for server s4
+DETAIL:  owner of user mapping for regress_test_role on server s6
 privileges for foreign-data wrapper foo
-owner of user mapping for regress_test_role on server s6
+privileges for server s4
 DROP SERVER t1 CASCADE;
 NOTICE:  drop cascades to user mapping for public on server t1
 DROP USER MAPPING FOR regress_test_role SERVER s6;
diff --git a/src/test/regress/expected/foreign_key.out b/src/test/regress/expected/foreign_key.out
index fee594531d..0e1f7ad7ea 100644
--- a/src/test/regress/expected/foreign_key.out
+++ b/src/test/regress/expected/foreign_key.out
@@ -253,13 +253,13 @@ SELECT * FROM FKTABLE;
 (5 rows)
 
 -- this should fail for lack of CASCADE
+\set VERBOSITY terse
 DROP TABLE PKTABLE;
 ERROR:  cannot drop table pktable because other objects depend on it
-DETAIL:  constraint constrname2 on table fktable depends on table pktable
-HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 DROP TABLE PKTABLE CASCADE;
 NOTICE:  drop cascades to constraint constrname2 on table fktable
 DROP TABLE FKTABLE;
+\set VERBOSITY default
 --
 -- First test, check with no on delete or on update
 --
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index d768e5df2c..007ad3738d 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -952,8 +952,8 @@ Inherits: c1,
 
 drop table p1 cascade;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table c1
-drop cascades to table c2
+DETAIL:  drop cascades to table c2
+drop cascades to table c1
 drop cascades to table c3
 drop table p2 cascade;
 create table pp1 (f1 int);
@@ -1098,9 +1098,9 @@ SELECT a.attrelid::regclass, a.attname, a.attinhcount, e.expected
 
 DROP TABLE inht1, inhs1 CASCADE;
 NOTICE:  drop cascades to 4 other objects
-DETAIL:  drop cascades to table inht2
+DETAIL:  drop cascades to table inht3
+drop cascades to table inht2
 drop cascades to table inhts
-drop cascades to table inht3
 drop cascades to table inht4
 -- Test non-inheritable indices [UNIQUE, EXCLUDE] constraints
 CREATE TABLE test_constraints (id int, val1 varchar, val2 int, UNIQUE(val1, val2));
@@ -1352,8 +1352,8 @@ select * from patest0 join (select f1 from int4_tbl limit 1) ss on id = f1;
 
 drop table patest0 cascade;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to table patest1
-drop cascades to table patest2
+DETAIL:  drop cascades to table patest2
+drop cascades to table patest1
 --
 -- Test merge-append plans for inheritance trees
 --
@@ -1499,9 +1499,9 @@ reset enable_seqscan;
 reset enable_parallel_append;
 drop table matest0 cascade;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table matest1
+DETAIL:  drop cascades to table matest3
 drop cascades to table matest2
-drop cascades to table matest3
+drop cascades to table matest1
 --
 -- Check that use of an index with an extraneous column doesn't produce
 -- a plan with extraneous sorting
diff --git a/src/test/regress/expected/matview.out b/src/test/regress/expected/matview.out
index 08cd4bea48..dc7878454d 100644
--- a/src/test/regress/expected/matview.out
+++ b/src/test/regress/expected/matview.out
@@ -310,15 +310,15 @@ SELECT type, m.totamt AS mtot, v.totamt AS vtot FROM mvtest_tm m LEFT JOIN mvtes
 -- make sure that dependencies are reported properly when they block the drop
 DROP TABLE mvtest_t;
 ERROR:  cannot drop table mvtest_t because other objects depend on it
-DETAIL:  view mvtest_tv depends on table mvtest_t
+DETAIL:  materialized view mvtest_tm depends on table mvtest_t
+materialized view mvtest_tmm depends on materialized view mvtest_tm
+view mvtest_tv depends on table mvtest_t
 view mvtest_tvv depends on view mvtest_tv
 materialized view mvtest_tvvm depends on view mvtest_tvv
 view mvtest_tvvmv depends on materialized view mvtest_tvvm
 materialized view mvtest_bb depends on view mvtest_tvvmv
 materialized view mvtest_mvschema.mvtest_tvm depends on view mvtest_tv
 materialized view mvtest_tvmm depends on materialized view mvtest_mvschema.mvtest_tvm
-materialized view mvtest_tm depends on table mvtest_t
-materialized view mvtest_tmm depends on materialized view mvtest_tm
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 -- make sure dependencies are dropped and reported
 -- and make sure that transactional behavior is correct on rollback
@@ -326,15 +326,15 @@ HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 BEGIN;
 DROP TABLE mvtest_t CASCADE;
 NOTICE:  drop cascades to 9 other objects
-DETAIL:  drop cascades to view mvtest_tv
+DETAIL:  drop cascades to materialized view mvtest_tm
+drop cascades to materialized view mvtest_tmm
+drop cascades to view mvtest_tv
 drop cascades to view mvtest_tvv
 drop cascades to materialized view mvtest_tvvm
 drop cascades to view mvtest_tvvmv
 drop cascades to materialized view mvtest_bb
 drop cascades to materialized view mvtest_mvschema.mvtest_tvm
 drop cascades to materialized view mvtest_tvmm
-drop cascades to materialized view mvtest_tm
-drop cascades to materialized view mvtest_tmm
 ROLLBACK;
 -- some additional tests not using base tables
 CREATE VIEW mvtest_vt1 AS SELECT 1 moo;
@@ -484,10 +484,10 @@ SELECT * FROM mvtest_mv_v_4;
 
 DROP TABLE mvtest_v CASCADE;
 NOTICE:  drop cascades to 4 other objects
-DETAIL:  drop cascades to materialized view mvtest_mv_v
-drop cascades to materialized view mvtest_mv_v_2
+DETAIL:  drop cascades to materialized view mvtest_mv_v_4
 drop cascades to materialized view mvtest_mv_v_3
-drop cascades to materialized view mvtest_mv_v_4
+drop cascades to materialized view mvtest_mv_v_2
+drop cascades to materialized view mvtest_mv_v
 -- Check that unknown literals are converted to "text" in CREATE MATVIEW,
 -- so that we don't end up with unknown-type columns.
 CREATE MATERIALIZED VIEW mv_unspecified_types AS
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index bc16ca4c43..d91f6305f6 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3502,8 +3502,8 @@ SELECT refclassid::regclass, deptype
 SAVEPOINT q;
 DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
 ERROR:  role "regress_rls_eve" cannot be dropped because some objects depend on it
-DETAIL:  target of policy p on table tbl1
-privileges for table tbl1
+DETAIL:  privileges for table tbl1
+target of policy p on table tbl1
 ROLLBACK TO q;
 ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
 SAVEPOINT q;
diff --git a/src/test/regress/expected/select_into.out b/src/test/regress/expected/select_into.out
index 942f975e95..30e10b12d2 100644
--- a/src/test/regress/expected/select_into.out
+++ b/src/test/regress/expected/select_into.out
@@ -46,9 +46,9 @@ CREATE TABLE selinto_schema.tmp3 (a,b,c)
 RESET SESSION AUTHORIZATION;
 DROP SCHEMA selinto_schema CASCADE;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table selinto_schema.tmp1
+DETAIL:  drop cascades to table selinto_schema.tmp3
 drop cascades to table selinto_schema.tmp2
-drop cascades to table selinto_schema.tmp3
+drop cascades to table selinto_schema.tmp1
 DROP USER regress_selinto_user;
 -- Tests for WITH NO DATA and column name consistency
 CREATE TABLE ctas_base (i int, j int);
diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
index 7d59de98eb..70b7c6eead 100644
--- a/src/test/regress/expected/triggers.out
+++ b/src/test/regress/expected/triggers.out
@@ -568,9 +568,9 @@ LINE 2: FOR EACH STATEMENT WHEN (OLD.* IS DISTINCT FROM NEW.*)
 -- check dependency restrictions
 ALTER TABLE main_table DROP COLUMN b;
 ERROR:  cannot drop column b of table main_table because other objects depend on it
-DETAIL:  trigger after_upd_b_row_trig on table main_table depends on column b of table main_table
+DETAIL:  trigger after_upd_b_stmt_trig on table main_table depends on column b of table main_table
 trigger after_upd_a_b_row_trig on table main_table depends on column b of table main_table
-trigger after_upd_b_stmt_trig on table main_table depends on column b of table main_table
+trigger after_upd_b_row_trig on table main_table depends on column b of table main_table
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 -- this should succeed, but we'll roll it back to keep the triggers around
 begin;
diff --git a/src/test/regress/expected/truncate.out b/src/test/regress/expected/truncate.out
index 2e26510522..c8b9a71689 100644
--- a/src/test/regress/expected/truncate.out
+++ b/src/test/regress/expected/truncate.out
@@ -276,11 +276,10 @@ SELECT * FROM trunc_faa;
 (0 rows)
 
 ROLLBACK;
+\set VERBOSITY terse
 DROP TABLE trunc_f CASCADE;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table trunc_fa
-drop cascades to table trunc_faa
-drop cascades to table trunc_fb
+\set VERBOSITY default
 -- Test ON TRUNCATE triggers
 CREATE TABLE trunc_trigger_test (f1 int, f2 text, f3 text);
 CREATE TABLE trunc_trigger_log (tgop text, tglevel text, tgwhen text,
diff --git a/src/test/regress/expected/typed_table.out b/src/test/regress/expected/typed_table.out
index 2e47ecbcf5..c76efee358 100644
--- a/src/test/regress/expected/typed_table.out
+++ b/src/test/regress/expected/typed_table.out
@@ -75,19 +75,12 @@ CREATE TABLE persons4 OF person_type (
     name WITH OPTIONS DEFAULT ''  -- error, specified more than once
 );
 ERROR:  column "name" specified more than once
+\set VERBOSITY terse
 DROP TYPE person_type RESTRICT;
 ERROR:  cannot drop type person_type because other objects depend on it
-DETAIL:  table persons depends on type person_type
-function get_all_persons() depends on type person_type
-table persons2 depends on type person_type
-table persons3 depends on type person_type
-HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 DROP TYPE person_type CASCADE;
 NOTICE:  drop cascades to 4 other objects
-DETAIL:  drop cascades to table persons
-drop cascades to function get_all_persons()
-drop cascades to table persons2
-drop cascades to table persons3
+\set VERBOSITY default
 CREATE TABLE persons5 OF stuff; -- only CREATE TYPE AS types may be used
 ERROR:  type stuff is not a composite type
 DROP TABLE stuff;
diff --git a/src/test/regress/expected/updatable_views.out b/src/test/regress/expected/updatable_views.out
index e64d693e9c..8eca01a8e7 100644
--- a/src/test/regress/expected/updatable_views.out
+++ b/src/test/regress/expected/updatable_views.out
@@ -328,24 +328,10 @@ UPDATE ro_view20 SET b=upper(b);
 ERROR:  cannot update view "ro_view20"
 DETAIL:  Views that return set-returning functions are not automatically updatable.
 HINT:  To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 16 other objects
-DETAIL:  drop cascades to view ro_view1
-drop cascades to view ro_view17
-drop cascades to view ro_view2
-drop cascades to view ro_view3
-drop cascades to view ro_view5
-drop cascades to view ro_view6
-drop cascades to view ro_view7
-drop cascades to view ro_view8
-drop cascades to view ro_view9
-drop cascades to view ro_view11
-drop cascades to view ro_view13
-drop cascades to view rw_view15
-drop cascades to view rw_view16
-drop cascades to view ro_view20
-drop cascades to view ro_view4
-drop cascades to view rw_view14
+\set VERBOSITY default
 DROP VIEW ro_view10, ro_view12, ro_view18;
 DROP SEQUENCE uv_seq CASCADE;
 NOTICE:  drop cascades to view ro_view19
@@ -1054,10 +1040,10 @@ SELECT * FROM base_tbl;
 (2 rows)
 
 RESET SESSION AUTHORIZATION;
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+\set VERBOSITY default
 -- nested-view permissions
 CREATE TABLE base_tbl(a int, b text, c float);
 INSERT INTO base_tbl VALUES (1, 'Row 1', 1.0);
@@ -1178,10 +1164,10 @@ ERROR:  permission denied for table base_tbl
 UPDATE rw_view2 SET b = 'bar' WHERE a = 1;  -- not allowed
 ERROR:  permission denied for table base_tbl
 RESET SESSION AUTHORIZATION;
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+\set VERBOSITY default
 DROP USER regress_view_user1;
 DROP USER regress_view_user2;
 -- column defaults
@@ -1439,11 +1425,10 @@ SELECT events & 4 != 0 AS upd,
  f   | f   | t
 (1 row)
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
-drop cascades to view rw_view3
+\set VERBOSITY default
 -- inheritance tests
 CREATE TABLE base_tbl_parent (a int);
 CREATE TABLE base_tbl_child (CHECK (a > 0)) INHERITS (base_tbl_parent);
@@ -1540,10 +1525,10 @@ SELECT * FROM base_tbl_child ORDER BY a;
  20
 (6 rows)
 
+\set VERBOSITY terse
 DROP TABLE base_tbl_parent, base_tbl_child CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+\set VERBOSITY default
 -- simple WITH CHECK OPTION
 CREATE TABLE base_tbl (a int, b int DEFAULT 10);
 INSERT INTO base_tbl VALUES (1,2), (2,3), (1,-1);
@@ -1711,10 +1696,10 @@ SELECT * FROM base_tbl;
   30
 (3 rows)
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+\set VERBOSITY default
 -- WITH CHECK OPTION with no local view qual
 CREATE TABLE base_tbl (a int);
 CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WITH CHECK OPTION;
@@ -1740,11 +1725,10 @@ INSERT INTO rw_view3 VALUES (-3); -- should fail
 ERROR:  new row violates check option for view "rw_view2"
 DETAIL:  Failing row contains (-3).
 INSERT INTO rw_view3 VALUES (3); -- ok
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
-drop cascades to view rw_view3
+\set VERBOSITY default
 -- WITH CHECK OPTION with scalar array ops
 CREATE TABLE base_tbl (a int, b int[]);
 CREATE VIEW rw_view1 AS SELECT * FROM base_tbl WHERE a = ANY (b)
@@ -1911,10 +1895,10 @@ SELECT * FROM base_tbl;
   -5 | 10
 (7 rows)
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+\set VERBOSITY default
 DROP FUNCTION rw_view1_trig_fn();
 CREATE TABLE base_tbl (a int);
 CREATE VIEW rw_view1 AS SELECT a,10 AS b FROM base_tbl;
@@ -1923,10 +1907,10 @@ CREATE RULE rw_view1_ins_rule AS ON INSERT TO rw_view1
 CREATE VIEW rw_view2 AS
   SELECT * FROM rw_view1 WHERE a > b WITH LOCAL CHECK OPTION;
 INSERT INTO rw_view2 VALUES (2,3); -- ok, but not in view (doesn't fail rw_view2's check)
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+\set VERBOSITY default
 -- security barrier view
 CREATE TABLE base_tbl (person text, visibility text);
 INSERT INTO base_tbl VALUES ('Tom', 'public'),
@@ -2111,10 +2095,10 @@ EXPLAIN (costs off) DELETE FROM rw_view2 WHERE NOT snoop(person);
          Filter: ((visibility = 'public'::text) AND snoop(person) AND (NOT snoop(person)))
 (3 rows)
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+\set VERBOSITY default
 -- security barrier view on top of table with rules
 CREATE TABLE base_tbl(id int PRIMARY KEY, data text, deleted boolean);
 INSERT INTO base_tbl VALUES (1, 'Row 1', false), (2, 'Row 2', true);
diff --git a/src/test/regress/output/tablespace.source b/src/test/regress/output/tablespace.source
index fe3614cd76..9b2e95973d 100644
--- a/src/test/regress/output/tablespace.source
+++ b/src/test/regress/output/tablespace.source
@@ -242,10 +242,10 @@ NOTICE:  no matching relations in tablespace "regress_tblspace_renamed" found
 DROP TABLESPACE regress_tblspace_renamed;
 DROP SCHEMA testschema CASCADE;
 NOTICE:  drop cascades to 5 other objects
-DETAIL:  drop cascades to table testschema.foo
-drop cascades to table testschema.asselect
-drop cascades to table testschema.asexecute
+DETAIL:  drop cascades to table testschema.tablespace_acl
 drop cascades to table testschema.atable
-drop cascades to table testschema.tablespace_acl
+drop cascades to table testschema.asexecute
+drop cascades to table testschema.asselect
+drop cascades to table testschema.foo
 DROP ROLE regress_tablespace_user1;
 DROP ROLE regress_tablespace_user2;
diff --git a/src/test/regress/sql/collate.sql b/src/test/regress/sql/collate.sql
index 4ddde95a5e..94ef4e277e 100644
--- a/src/test/regress/sql/collate.sql
+++ b/src/test/regress/sql/collate.sql
@@ -260,4 +260,6 @@ SELECT collation for ((SELECT b FROM collate_test1 LIMIT 1));
 -- must get rid of them.
 --
 \set VERBOSITY terse
+SET client_min_messages TO 'warning';
 DROP SCHEMA collate_tests CASCADE;
+RESET client_min_messages;
diff --git a/src/test/regress/sql/domain.sql b/src/test/regress/sql/domain.sql
index 68da27de22..d19e2c9d28 100644
--- a/src/test/regress/sql/domain.sql
+++ b/src/test/regress/sql/domain.sql
@@ -381,7 +381,9 @@ alter domain dnotnulltest drop not null;
 
 update domnotnull set col1 = null;
 
+\set VERBOSITY terse
 drop domain dnotnulltest cascade;
+\set VERBOSITY default
 
 -- Test ALTER DOMAIN .. DEFAULT ..
 create table domdeftest (col1 ddef1);
diff --git a/src/test/regress/sql/foreign_key.sql b/src/test/regress/sql/foreign_key.sql
index 068ab2aab7..591916871a 100644
--- a/src/test/regress/sql/foreign_key.sql
+++ b/src/test/regress/sql/foreign_key.sql
@@ -159,9 +159,11 @@ UPDATE PKTABLE SET ptest1=1 WHERE ptest1=2;
 SELECT * FROM FKTABLE;
 
 -- this should fail for lack of CASCADE
+\set VERBOSITY terse
 DROP TABLE PKTABLE;
 DROP TABLE PKTABLE CASCADE;
 DROP TABLE FKTABLE;
+\set VERBOSITY default
 
 
 --
diff --git a/src/test/regress/sql/truncate.sql b/src/test/regress/sql/truncate.sql
index 6ddfb6dd1d..fee7e76ec3 100644
--- a/src/test/regress/sql/truncate.sql
+++ b/src/test/regress/sql/truncate.sql
@@ -125,7 +125,9 @@ SELECT * FROM trunc_fa;
 SELECT * FROM trunc_faa;
 ROLLBACK;
 
+\set VERBOSITY terse
 DROP TABLE trunc_f CASCADE;
+\set VERBOSITY default
 
 -- Test ON TRUNCATE triggers
 
diff --git a/src/test/regress/sql/typed_table.sql b/src/test/regress/sql/typed_table.sql
index 9ef0cdfcc7..953cd1f14b 100644
--- a/src/test/regress/sql/typed_table.sql
+++ b/src/test/regress/sql/typed_table.sql
@@ -43,8 +43,10 @@ CREATE TABLE persons4 OF person_type (
     name WITH OPTIONS DEFAULT ''  -- error, specified more than once
 );
 
+\set VERBOSITY terse
 DROP TYPE person_type RESTRICT;
 DROP TYPE person_type CASCADE;
+\set VERBOSITY default
 
 CREATE TABLE persons5 OF stuff; -- only CREATE TYPE AS types may be used
 
diff --git a/src/test/regress/sql/updatable_views.sql b/src/test/regress/sql/updatable_views.sql
index dc6d5cbe35..9103793ff4 100644
--- a/src/test/regress/sql/updatable_views.sql
+++ b/src/test/regress/sql/updatable_views.sql
@@ -98,7 +98,9 @@ DELETE FROM ro_view18;
 UPDATE ro_view19 SET last_value=1000;
 UPDATE ro_view20 SET b=upper(b);
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 DROP VIEW ro_view10, ro_view12, ro_view18;
 DROP SEQUENCE uv_seq CASCADE;
 
@@ -457,7 +459,9 @@ DELETE FROM rw_view2 WHERE aa=4; -- not allowed
 SELECT * FROM base_tbl;
 RESET SESSION AUTHORIZATION;
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 
 -- nested-view permissions
 
@@ -533,7 +537,9 @@ UPDATE rw_view2 SET b = 'bar' WHERE a = 1;  -- not allowed
 
 RESET SESSION AUTHORIZATION;
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 
 DROP USER regress_view_user1;
 DROP USER regress_view_user2;
@@ -678,7 +684,9 @@ SELECT events & 4 != 0 AS upd,
        events & 16 != 0 AS del
   FROM pg_catalog.pg_relation_is_updatable('rw_view3'::regclass, false) t(events);
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 
 -- inheritance tests
 
@@ -710,7 +718,9 @@ DELETE FROM ONLY rw_view2 WHERE a IN (-8, 8); -- Should delete -8 only
 SELECT * FROM ONLY base_tbl_parent ORDER BY a;
 SELECT * FROM base_tbl_child ORDER BY a;
 
+\set VERBOSITY terse
 DROP TABLE base_tbl_parent, base_tbl_child CASCADE;
+\set VERBOSITY default
 
 -- simple WITH CHECK OPTION
 
@@ -772,7 +782,9 @@ SELECT * FROM information_schema.views WHERE table_name = 'rw_view2';
 INSERT INTO rw_view2 VALUES (30); -- ok, but not in view
 SELECT * FROM base_tbl;
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 
 -- WITH CHECK OPTION with no local view qual
 
@@ -790,7 +802,9 @@ INSERT INTO rw_view2 VALUES (2); -- ok
 INSERT INTO rw_view3 VALUES (-3); -- should fail
 INSERT INTO rw_view3 VALUES (3); -- ok
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 
 -- WITH CHECK OPTION with scalar array ops
 
@@ -918,7 +932,9 @@ INSERT INTO rw_view2 VALUES (5); -- ok
 UPDATE rw_view2 SET a = -5 WHERE a = 5; -- ok, but not in view (doesn't fail rw_view2's check)
 SELECT * FROM base_tbl;
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 DROP FUNCTION rw_view1_trig_fn();
 
 CREATE TABLE base_tbl (a int);
@@ -928,7 +944,9 @@ CREATE RULE rw_view1_ins_rule AS ON INSERT TO rw_view1
 CREATE VIEW rw_view2 AS
   SELECT * FROM rw_view1 WHERE a > b WITH LOCAL CHECK OPTION;
 INSERT INTO rw_view2 VALUES (2,3); -- ok, but not in view (doesn't fail rw_view2's check)
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 
 -- security barrier view
 
@@ -1012,7 +1030,9 @@ EXPLAIN (costs off) SELECT * FROM rw_view2 WHERE snoop(person);
 EXPLAIN (costs off) UPDATE rw_view2 SET person=person WHERE snoop(person);
 EXPLAIN (costs off) DELETE FROM rw_view2 WHERE NOT snoop(person);
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 
 -- security barrier view on top of table with rules
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9fe950b29d..08cf72d670 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -167,6 +167,8 @@ BTArrayKeyInfo
 BTBuildState
 BTCycleId
 BTIndexStat
+BTInsertionKey
+BTInsertionKeyData
 BTLeader
 BTMetaPageData
 BTOneVacInfo
@@ -2207,6 +2209,8 @@ SpecialJoinInfo
 SpinDelayStatus
 SplitInterval
 SplitLR
+SplitMode
+SplitPoint
 SplitVar
 SplitedPageLayout
 StackElem
-- 
2.17.1

In reply to: Peter Geoghegan (#39)
6 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Attached is v8 of the patch series, which has some relatively minor changes:

* A new commit adds an artificial tie-breaker column to pg_depend
indexes, comprehensively solving the issues with regression test
instability. This is the only really notable change.

* Clean-up of how the design is described in the nbtree README, and
elsewhere. I want to make it clear that we're now more or less using
the Lehman and Yao design. I re-read the Lehman and Yao paper to make
sure that the patch acknowledges what Lehman and Yao say to expect, at
least in cases that seemed to matter.

* Stricter verification by contrib/amcheck. Not likely to catch a case
that wouldn't have been caught by previous revisions, but should make
the design a bit clearer to somebody following L&Y.

* Tweaks to how _bt_findsplitloc() accumulates candidate split points.
We're less aggressive in choosing a smaller tuple during an internal
page split in this revision.
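
To make the bookkeeping behind that last point concrete, here is a
standalone sketch of the general idea: keep a small array of candidate
split points sorted by free-space delta, and discard the most marginal
entry once the array is full.  The struct and function names below are
illustrative only -- the real logic operates on page offsets and lives
in _bt_checksplitloc()/_bt_findsplitloc() in the 0003 patch.

/*
 * Minimal, self-contained sketch of keeping the N lowest-delta candidate
 * split points.  SplitCand/CandState/record_split are illustrative names,
 * not the patch's actual data structures.
 */
#include <stdio.h>

#define MAX_CANDIDATES 9		/* cf. MAX_LEAF_SPLIT_POINTS in 0003 */

typedef struct
{
	int			delta;			/* imbalance in free space */
	int			firstright;		/* stand-in for the split point itself */
} SplitCand;

typedef struct
{
	int			ncands;
	SplitCand	cands[MAX_CANDIDATES];
} CandState;

/* Remember a candidate, keeping the array sorted by delta (ascending) */
static void
record_split(CandState *state, int delta, int firstright)
{
	int			j;

	if (state->ncands == MAX_CANDIDATES &&
		delta >= state->cands[state->ncands - 1].delta)
		return;					/* worse than the most marginal kept entry */

	if (state->ncands < MAX_CANDIDATES)
		state->ncands++;

	/* shift more marginal entries right, then place the new candidate */
	for (j = state->ncands - 1;
		 j > 0 && state->cands[j - 1].delta > delta;
		 j--)
		state->cands[j] = state->cands[j - 1];

	state->cands[j].delta = delta;
	state->cands[j].firstright = firstright;
}

int
main(void)
{
	CandState	state = {0};
	int			deltas[] = {900, 120, 44, 310, 44, 7, 980, 15, 230, 61, 505};
	int			i;

	for (i = 0; i < (int) (sizeof(deltas) / sizeof(deltas[0])); i++)
		record_split(&state, deltas[i], i + 1);

	for (i = 0; i < state.ncands; i++)
		printf("delta=%d firstright=%d\n",
			   state.cands[i].delta, state.cands[i].firstright);
	return 0;
}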

The overall impact of the pg_depend change is that required regression
test output changes are *far* less numerous than they were in v7.
There are now only trivial differences in the output order of items.
And, there are very few diagnostic message changes overall -- we see
exactly 5 changes now, rather than dozens. Importantly, there is no
longer any question about whether I could make diagnostic messages
less useful to users, because the existing behavior for
findDependentObjects() is retained. This is an independent
improvement, since it fixes an independent problem with test
flappiness that we've been papering-over for some time [2] -- I make
the required order actually-deterministic, removing heap TID ordering
as a factor that can cause seemingly-random regression test failures
on slow/overloaded buildfarm animals.
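
To spell out where the remaining ordering differences come from: with
the patch, heap TID behaves like an implicit trailing key attribute, so
logically-equal index entries are returned in TID order rather than in
whatever order insertion happened to leave them.  The toy program below
is only an illustration of that comparison rule -- it is not the
patch's code (the real comparisons are done inside nbtree itself) --
but it shows why equal-keyed rows now come back in a TID-determined
order.

/*
 * Toy illustration of the comparison rule only: when every user-visible
 * key attribute compares equal, heap TID (block, offset) breaks the tie.
 * ToyIndexTuple/toy_compare are made-up names; the patch's real
 * comparison logic is in nbtree's _bt_compare().
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	int			key;			/* stand-in for the user-visible key */
	unsigned	block;			/* heap block number */
	unsigned	offset;			/* heap line pointer offset */
} ToyIndexTuple;

static int
toy_compare(const void *a, const void *b)
{
	const ToyIndexTuple *x = (const ToyIndexTuple *) a;
	const ToyIndexTuple *y = (const ToyIndexTuple *) b;

	if (x->key != y->key)
		return (x->key < y->key) ? -1 : 1;
	/* keys equal: heap TID acts as an implicit trailing attribute */
	if (x->block != y->block)
		return (x->block < y->block) ? -1 : 1;
	if (x->offset != y->offset)
		return (x->offset < y->offset) ? -1 : 1;
	return 0;
}

int
main(void)
{
	ToyIndexTuple tuples[] = {
		{42, 7, 3}, {42, 2, 9}, {7, 1, 1}, {42, 2, 4}
	};
	int			i;

	qsort(tuples, 4, sizeof(ToyIndexTuple), toy_compare);
	for (i = 0; i < 4; i++)
		printf("key=%d tid=(%u,%u)\n",
			   tuples[i].key, tuples[i].block, tuples[i].offset);
	return 0;
}

Sorting the sample data puts the equal-keyed entries in (block, offset)
order, which is the kind of reordering visible in the regression test
diffs earlier in this thread.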

Robert Haas remarked that he thought that the pg_depend index
tie-breaker commit's approach is acceptable [1] -- see the other
thread that Robert weighed in on for all the gory details. The patch's
draft commit message may also be interesting. Note that adding a new
column turns out to have *zero* storage overhead, because we only ever
end up filling up space that was already getting lost to alignment.
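
For anyone who wants to check the alignment argument, the arithmetic is
just MAXALIGN rounding.  The sizes used below (an 8-byte index tuple
header plus three 4-byte key columns) are an assumed illustration
rather than a dump of the actual pg_depend indexes, but they show the
effect on a MAXIMUM_ALIGNOF == 8 build: the extra 4-byte column
disappears into what was previously alignment padding.

/*
 * Back-of-the-envelope check of the alignment claim.  The 20-byte
 * starting size (8-byte index tuple header plus three 4-byte key
 * columns) is an assumed illustration, not a dump of pg_depend; the
 * point is only the MAXALIGN arithmetic on a MAXIMUM_ALIGNOF == 8 build.
 */
#include <stdio.h>

#define MAXALIGN(LEN)  (((LEN) + 7) & ~((size_t) 7))

int
main(void)
{
	size_t		before = 8 + 4 + 4 + 4; /* header + three 4-byte attributes */
	size_t		after = before + 4;		/* add a 4-byte tie-breaker column */

	printf("before: %zu -> %zu aligned\n", before, MAXALIGN(before));
	printf("after:  %zu -> %zu aligned\n", after, MAXALIGN(after));
	/* both aligned sizes are 24: the new column fits in former padding */
	return 0;
}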

The pg_depend thing is clearly a kludge. It's ugly, though in no small
part because it acknowledges the existing reality of how
findDependentObjects() already depends on scan order. I'm optimistic
that I'll be able to push this groundwork commit before too long; it
doesn't hinge on whether or not the nbtree patches are any good.

[1]: /messages/by-id/CA+TgmoYNeFxdPimiXGL=tCiCXN8zWosUFxUfyDBaTd2VAg-D9w@mail.gmail.com
[2]: /messages/by-id/11852.1501610262@sss.pgh.pa.us
--
Peter Geoghegan

Attachments:

v8-0003-Pick-nbtree-split-points-discerningly.patch (text/x-patch; charset=US-ASCII)
From 482386008d03013e525fd4024a1dc9f376eceb52 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 15:51:53 -0700
Subject: [PATCH v8 3/6] Pick nbtree split points discerningly.

Add infrastructure to determine where the earliest difference appears
among a pair of tuples enclosing a candidate split point.  Use this
within _bt_findsplitloc() to weigh how effective suffix truncation will
be.  This is primarily useful because it maximizes the effectiveness of
suffix truncation.  This should not noticeably affect the balance of
free space within each half of the split.

_bt_findsplitloc() is also taught to care about the case where there are
many duplicates, making it hard to find a distinguishing split point.
_bt_findsplitloc() may even conclude that it isn't possible to avoid
filling a page entirely with duplicates, in which case it packs pages
full of duplicates very tightly.

The number of cycles added is not very noticeable, which is important,
since _bt_findsplitloc() is run while an exclusive (leaf page) buffer
lock is held.  We avoid using authoritative insertion scankey
comparisons, unlike suffix truncation proper.

This patch is required to credibly assess anything about the performance
of the patch series.  Applying the patches up to and including this
patch in the series is sufficient to see much better space utilization
and space reuse with cases where many duplicates are inserted.  (Cases
resulting in searches for free space among many pages full of
duplicates, where the search inevitably "gets tired" on the master
branch [1]).

[1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
---
 src/backend/access/nbtree/README      |  66 ++-
 src/backend/access/nbtree/nbtinsert.c | 638 +++++++++++++++++++++++---
 src/backend/access/nbtree/nbtutils.c  |  78 ++++
 src/include/access/nbtree.h           |   8 +-
 4 files changed, 719 insertions(+), 71 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 75cb1d1e22..6f7297b522 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -165,9 +165,9 @@ Lehman and Yao assume fixed-size keys, but we must deal with
 variable-size keys.  Therefore there is not a fixed maximum number of
 keys per page; we just stuff in as many as will fit.  When we split a
 page, we try to equalize the number of bytes, not items, assigned to
-each of the resulting pages.  Note we must include the incoming item in
-this calculation, otherwise it is possible to find that the incoming
-item doesn't fit on the split page where it needs to go!
+pages (though suffix truncation is also considered).  Note we must include
+the incoming item in this calculation, otherwise it is possible to find
+that the incoming item doesn't fit on the split page where it needs to go!
 
 The Deletion Algorithm
 ----------------------
@@ -669,6 +669,66 @@ variable-length types, such as text.  An opclass support function could
 manufacture the shortest possible key value that still correctly separates
 each half of a leaf page split.
 
+There are sophisticated criteria for choosing a leaf page split point.  The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split.  The
+choice of leaf split point can be thought of as a choice among points
+"between" items on the page to be split, at least if you pretend that the
+incoming tuple was placed on the page already, without provoking a split.
+The split point between two index tuples with differences that appear as
+early as possible allows us to truncate away as many attributes as
+possible.
+
+Obviously suffix truncation is valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective.  There are cases where suffix truncation can
+leave a B-Tree significantly smaller in size than it would have otherwise
+been, without actually making any pivot tuple smaller due to restrictions
+relating to alignment.  The criteria for choosing a leaf page split point
+for suffix truncation are often also predictive of future space utilization.
+Furthermore, even truncation that doesn't make pivot tuples smaller still
+prevents pivot tuples from being more restrictive than truly necessary in
+how they describe which values belong on which leaf pages.
+
+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page.  The chosen split point is the one whose downlink (the
+tuple inserted into the parent) is the smallest available within an acceptable
+range of the optimal fillfactor-wise split point.  This idea also comes from
+the Prefix B-Tree paper.  This process has much in common with what
+happens at the leaf level to make suffix truncation effective.  The overall
+effect is that suffix truncation tends to produce smaller and less
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, less discriminating
+pivot tuples end up in the root page, delaying root page splits.
+
+With v4 B-Trees, every tuple at the leaf level must be individually
+locatable by an insertion scankey that's fully filled-out by
+_bt_mkscankey().  Heap TID is treated as a tie-breaker key attribute to
+make this work.  Suffix truncation must occasionally make a pivot tuple
+*larger* than the leaf tuple that it's based on, since a heap TID must be
+appended when nothing else distinguishes each side of a leaf split.  This
+is not represented in the same way as it is at the leaf level (we must
+append an additional attribute), since pivot tuples already use the generic
+IndexTuple fields to describe which child page they point to, and how many
+attributes are in the pivot tuple.  Adding a heap TID attribute during a
+leaf page split should only occur when there is an entire page full of
+duplicates, though, since the logic for selecting a split point will do all
+it can to avoid this outcome --- it may apply "many duplicates" mode, or
+"single value" mode.
+
+Avoiding appending a heap TID to a pivot tuple is about much more than just
+saving a single MAXALIGN() quantum in each of the pages that store the new
+pivot.  It's worth going out of our way to avoid having a single value (or
+composition of key values) span two leaf pages when that isn't truly
+necessary, since if that's allowed to happen every point index scan will
+have to visit both pages.  It also makes it less likely that VACUUM will be
+able to perform page deletion on either page.  Finally, it's not unheard of
+for unique indexes to have pages full of duplicates in the event of extreme
+contention (which appears as buffer lock contention) --- this is also
+ameliorated.  These are all examples of how "false sharing" across B-Tree
+pages can cause performance problems.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 318cbd3551..0e37b8b23a 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,25 +28,44 @@
 
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
+/* _bt_findsplitloc limits on suffix truncation split interval */
+#define MAX_LEAF_SPLIT_POINTS		9
+#define MAX_INTERNAL_SPLIT_POINTS	3
+
+typedef enum
+{
+	/* strategy to use for a call to FindSplitData */
+	SPLIT_DEFAULT,				/* give some weight to truncation */
+	SPLIT_MANY_DUPLICATES,		/* find minimally distinguishing point */
+	SPLIT_SINGLE_VALUE			/* leave left page almost empty */
+} SplitMode;
+
+typedef struct
+{
+	/* FindSplitData candidate split */
+	int			delta;			/* size delta */
+	bool		newitemonleft;	/* new item on left or right of split */
+	OffsetNumber firstright;	/* split point */
+} SplitPoint;
 
 typedef struct
 {
 	/* context data for _bt_checksplitloc */
+	SplitMode	mode;			/* strategy for deciding split point */
 	Size		newitemsz;		/* size of new item to be inserted */
-	int			fillfactor;		/* needed when splitting rightmost page */
+	double		fillfactor;		/* needed for weighted splits */
+	int			goodenough;		/* good-enough free space delta limit */
 	bool		is_leaf;		/* T if splitting a leaf page */
-	bool		is_rightmost;	/* T if splitting a rightmost page */
+	bool		is_weighted;	/* T if weighted (e.g. rightmost) split */
 	OffsetNumber newitemoff;	/* where the new item is to be inserted */
+	bool		hikeyheaptid;	/* T if high key will likely get heap TID */
 	int			leftspace;		/* space available for items on left page */
 	int			rightspace;		/* space available for items on right page */
 	int			olddataitemstotal;	/* space taken by old items */
 
-	bool		have_split;		/* found a valid split? */
-
-	/* these fields valid only if have_split is true */
-	bool		newitemonleft;	/* new item on left or right of best split */
-	OffsetNumber firstright;	/* best split point */
-	int			best_delta;		/* best size delta so far */
+	int			maxsplits;		/* Maximum number of splits */
+	int			nsplits;		/* Current number of splits */
+	SplitPoint *splits;			/* Sorted by delta */
 } FindSplitData;
 
 
@@ -76,12 +95,21 @@ static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft);
-static void _bt_checksplitloc(FindSplitData *state,
+				 SplitMode mode, OffsetNumber newitemoff,
+				 Size newitemsz, IndexTuple newitem, bool *newitemonleft);
+static int _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright, bool newitemonleft,
 				  int dataitemstoleft, Size firstoldonrightsz);
+static OffsetNumber _bt_bestsplitloc(Relation rel, Page page,
+				 FindSplitData *state,
+				 int perfectpenalty,
+				 OffsetNumber newitemoff,
+				 IndexTuple newitem, bool *newitemonleft);
+static int  _bt_perfect_penalty(Relation rel, Page page, FindSplitData *state,
+				 OffsetNumber newitemoff, IndexTuple newitem,
+				 SplitMode *secondmode);
+static int _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split, bool is_leaf);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey,
@@ -990,8 +1018,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* Choose the split point */
-		firstright = _bt_findsplitloc(rel, page,
-									  newitemoff, itemsz,
+		firstright = _bt_findsplitloc(rel, page, SPLIT_DEFAULT,
+									  newitemoff, itemsz, itup,
 									  &newitemonleft);
 
 		/*
@@ -1687,6 +1715,30 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * for it, we might find ourselves with too little room on the page that
  * it needs to go into!)
  *
+ * We also give some weight to suffix truncation in deciding a split point
+ * on leaf pages.  We try to select a point where a distinguishing attribute
+ * appears earlier in the new high key for the left side of the split, in
+ * order to maximize the number of trailing attributes that can be truncated
+ * away.  Initially, only candidate split points that imply an acceptable
+ * balance of free space on each side are considered.  This is even useful
+ * with pages that only have a single (non-TID) attribute, since it's
+ * helpful to avoid appending an explicit heap TID attribute to the new
+ * pivot tuple (high key/downlink) when it cannot actually be truncated.
+ * Note that it is always assumed that caller goes on to perform truncation,
+ * even with pg_upgrade'd indexes where that isn't actually the case.  There
+ * is still a modest benefit to choosing a split location while weighing
+ * suffix truncation: the resulting (untruncated) pivot tuples are
+ * nevertheless more predictive of future space utilization.
+ *
+ * We do all we can to avoid having to append a heap TID in the new high
+ * key.  We may have to call ourselves recursively in many duplicates mode.
+ * This happens when a heap TID would otherwise be appended, but the page
+ * isn't completely full of logical duplicates (there may be as few as two
+ * distinct values).  Many duplicates mode has no hard requirements for
+ * space utilization, though it still keeps the use of space balanced as a
+ * non-binding secondary goal.  This significantly improves fan-out in
+ * practice, at least with most affected workloads.
+ *
  * If the page is the rightmost page on its level, we instead try to arrange
  * to leave the left split page fillfactor% full.  In this way, when we are
  * inserting successively increasing keys (consider sequences, timestamps,
@@ -1695,6 +1747,16 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * This is the same as nbtsort.c produces for a newly-created tree.  Note
  * that leaf and nonleaf pages use different fillfactors.
  *
+ * If called recursively in single value mode, we also try to arrange to
+ * leave the left split page fillfactor% full, though we arrange to use a
+ * fillfactor that's even more left-heavy than the fillfactor used for
+ * rightmost pages.  This greatly helps with space management in cases where
+ * tuples with the same attribute values span multiple pages.  Newly
+ * inserted duplicates will tend to have higher heap TID values, so we'll
+ * end up splitting to the right in the manner of ascending insertions of
+ * monotonically increasing values.  See nbtree/README for more information
+ * about suffix truncation, and how a split point is chosen.
+ *
  * We are passed the intended insert position of the new tuple, expressed as
  * the offsetnumber of the tuple it must go in front of.  (This could be
  * maxoff+1 if the tuple is to go at the end.)
@@ -1725,8 +1787,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 static OffsetNumber
 _bt_findsplitloc(Relation rel,
 				 Page page,
+				 SplitMode mode,
 				 OffsetNumber newitemoff,
 				 Size newitemsz,
+				 IndexTuple newitem,
 				 bool *newitemonleft)
 {
 	BTPageOpaque opaque;
@@ -1736,15 +1800,16 @@ _bt_findsplitloc(Relation rel,
 	FindSplitData state;
 	int			leftspace,
 				rightspace,
-				goodenough,
 				olddataitemstotal,
-				olddataitemstoleft;
+				olddataitemstoleft,
+				perfectpenalty;
 	bool		goodenoughfound;
+	SplitPoint	splits[MAX_LEAF_SPLIT_POINTS];
+	SplitMode	secondmode;
+	OffsetNumber finalfirstright;
 
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
-	newitemsz += sizeof(ItemIdData);
+	maxoff = PageGetMaxOffsetNumber(page);
 
 	/* Total free space available on a btree page, after fixed overhead */
 	leftspace = rightspace =
@@ -1762,18 +1827,60 @@ _bt_findsplitloc(Relation rel,
 	/* Count up total space in data items without actually scanning 'em */
 	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
 
-	state.newitemsz = newitemsz;
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	state.mode = mode;
+	state.newitemsz = newitemsz + sizeof(ItemIdData);
+	state.hikeyheaptid = (mode == SPLIT_SINGLE_VALUE);
 	state.is_leaf = P_ISLEAF(opaque);
-	state.is_rightmost = P_RIGHTMOST(opaque);
-	state.have_split = false;
+	state.is_weighted = P_RIGHTMOST(opaque);
 	if (state.is_leaf)
-		state.fillfactor = RelationGetFillFactor(rel,
-												 BTREE_DEFAULT_FILLFACTOR);
+	{
+		if (state.mode != SPLIT_SINGLE_VALUE)
+		{
+			/* Only used on rightmost page */
+			state.fillfactor = RelationGetFillFactor(rel,
+													 BTREE_DEFAULT_FILLFACTOR) / 100.0;
+		}
+		else
+		{
+			state.fillfactor = BTREE_SINGLEVAL_FILLFACTOR / 100.0;
+			state.is_weighted = true;
+		}
+	}
 	else
-		state.fillfactor = BTREE_NONLEAF_FILLFACTOR;
-	state.newitemonleft = false;	/* these just to keep compiler quiet */
-	state.firstright = 0;
-	state.best_delta = 0;
+	{
+		Assert(state.mode == SPLIT_DEFAULT);
+		/* Only used on rightmost page */
+		state.fillfactor = BTREE_NONLEAF_FILLFACTOR / 100.0;
+	}
+
+	/*
+	 * Set limits on the split interval/number of candidate split points as
+	 * appropriate.  The "Prefix B-Trees" paper refers to this as sigma l for
+	 * leaf splits and sigma b for internal ("branch") splits.  It's hard to
+	 * provide a theoretical justification for the size of the split interval,
+	 * though it's clear that a small split interval improves space
+	 * utilization.
+	 *
+	 * (Also set interval for case when we split a page that has many
+	 * duplicates, or split a page that's entirely full of tuples of a single
+	 * value.  Future locality of access is prioritized over short-term space
+	 * utilization in these cases.)
+	 */
+	if (!state.is_leaf)
+		state.maxsplits = MAX_INTERNAL_SPLIT_POINTS;
+	else if (state.mode == SPLIT_DEFAULT)
+		state.maxsplits = Min(Max(3, maxoff * 0.05), MAX_LEAF_SPLIT_POINTS);
+	else if (state.mode == SPLIT_MANY_DUPLICATES)
+		state.maxsplits = maxoff + 2;
+	else
+		state.maxsplits = 1;
+	state.nsplits = 0;
+	if (state.mode != SPLIT_MANY_DUPLICATES)
+		state.splits = splits;
+	else
+		state.splits = palloc(sizeof(SplitPoint) * state.maxsplits);
+
 	state.leftspace = leftspace;
 	state.rightspace = rightspace;
 	state.olddataitemstotal = olddataitemstotal;
@@ -1782,13 +1889,15 @@ _bt_findsplitloc(Relation rel,
 	/*
 	 * Finding the best possible split would require checking all the possible
 	 * split points, because of the high-key and left-key special cases.
-	 * That's probably more work than it's worth; instead, stop as soon as we
-	 * find a "good-enough" split, where good-enough is defined as an
-	 * imbalance in free space of no more than pagesize/16 (arbitrary...) This
-	 * should let us stop near the middle on most pages, instead of plowing to
-	 * the end.
+	 * That's probably more work than it's worth outside of many duplicates
+	 * mode; instead, stop as soon as we find sufficiently-many "good-enough"
+	 * splits, where good-enough is defined as an imbalance in free space of
+	 * no more than pagesize/16 (arbitrary...) This should let us stop near
+	 * the middle on most pages, instead of plowing to the end.  Many
+	 * duplicates mode must consider all possible choices, and so does not use
+	 * this threshold for anything.
 	 */
-	goodenough = leftspace / 16;
+	state.goodenough = leftspace / 16;
 
 	/*
 	 * Scan through the data items and calculate space usage for a split at
@@ -1796,13 +1905,13 @@ _bt_findsplitloc(Relation rel,
 	 */
 	olddataitemstoleft = 0;
 	goodenoughfound = false;
-	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (offnum = P_FIRSTDATAKEY(opaque);
 		 offnum <= maxoff;
 		 offnum = OffsetNumberNext(offnum))
 	{
 		Size		itemsz;
+		int			delta;
 
 		itemid = PageGetItemId(page, offnum);
 		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
@@ -1811,28 +1920,35 @@ _bt_findsplitloc(Relation rel,
 		 * Will the new item go to left or right of split?
 		 */
 		if (offnum > newitemoff)
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
+			delta = _bt_checksplitloc(&state, offnum, true,
+									  olddataitemstoleft, itemsz);
 
 		else if (offnum < newitemoff)
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
+			delta = _bt_checksplitloc(&state, offnum, false,
+									  olddataitemstoleft, itemsz);
 		else
 		{
 			/* need to try it both ways! */
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
+			(void) _bt_checksplitloc(&state, offnum, true,
+									 olddataitemstoleft, itemsz);
 
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
+			delta = _bt_checksplitloc(&state, offnum, false,
+									  olddataitemstoleft, itemsz);
 		}
 
-		/* Abort scan once we find a good-enough choice */
-		if (state.have_split && state.best_delta <= goodenough)
-		{
+		/* Record when good-enough choice found */
+		if (state.nsplits > 0 && state.splits[0].delta <= state.goodenough)
 			goodenoughfound = true;
+
+		/*
+		 * Abort scan once we've found a good-enough choice, and reach the
+		 * point where we stop finding new good-enough choices.  Don't do this
+		 * in many duplicates mode, though, since that has to be completely
+		 * exhaustive.
+		 */
+		if (goodenoughfound && state.mode != SPLIT_MANY_DUPLICATES &&
+			delta > state.goodenough)
 			break;
-		}
 
 		olddataitemstoleft += itemsz;
 	}
@@ -1842,19 +1958,50 @@ _bt_findsplitloc(Relation rel,
 	 * the old items go to the left page and the new item goes to the right
 	 * page.
 	 */
-	if (newitemoff > maxoff && !goodenoughfound)
+	if (newitemoff > maxoff &&
+		(!goodenoughfound || state.mode == SPLIT_MANY_DUPLICATES))
 		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
 
 	/*
 	 * I believe it is not possible to fail to find a feasible split, but just
 	 * in case ...
 	 */
-	if (!state.have_split)
+	if (state.nsplits == 0)
 		elog(ERROR, "could not find a feasible split point for index \"%s\"",
 			 RelationGetRelationName(rel));
 
-	*newitemonleft = state.newitemonleft;
-	return state.firstright;
+	/*
+	 * Search among acceptable split points for the entry with the lowest
+	 * penalty.  See _bt_split_penalty() for the definition of penalty.  The
+	 * goal here is to increase fan-out, by choosing a split point which is
+	 * amenable to being made smaller by suffix truncation, or is already
+	 * small.
+	 *
+	 * First find lowest possible penalty among acceptable split points -- the
+	 * "perfect" penalty.  This will be passed to _bt_bestsplitloc() if it
+	 * determines that candidate split points are good enough to finish
+	 * default mode split.  Perfect penalty saves _bt_bestsplitloc()
+	 * additional work around calculating penalties.
+	 */
+	perfectpenalty = _bt_perfect_penalty(rel, page, &state, newitemoff,
+										 newitem, &secondmode);
+
+	/* Start second pass over page if _bt_perfect_penalty() told us to */
+	if (secondmode != SPLIT_DEFAULT)
+		return _bt_findsplitloc(rel, page, secondmode, newitemoff, newitemsz,
+								newitem, newitemonleft);
+
+	/*
+	 * Search among acceptable split points for the entry that has the lowest
+	 * penalty, and thus maximizes fan-out.  Sets *newitemonleft for us.
+	 */
+	finalfirstright = _bt_bestsplitloc(rel, page, &state, perfectpenalty,
+									   newitemoff, newitem, newitemonleft);
+	/* Be tidy */
+	if (state.splits != splits)
+		pfree(state.splits);
+
+	return finalfirstright;
 }
 
 /*
@@ -1869,8 +2016,11 @@ _bt_findsplitloc(Relation rel,
  *
  * olddataitemstoleft is the total size of all old items to the left of
  * firstoldonright.
+ *
+ * Returns delta between space that will be left free on left and right side
+ * of split.
  */
-static void
+static int
 _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright,
 				  bool newitemonleft,
@@ -1878,7 +2028,8 @@ _bt_checksplitloc(FindSplitData *state,
 				  Size firstoldonrightsz)
 {
 	int			leftfree,
-				rightfree;
+				rightfree,
+				leftleafheaptidsz;
 	Size		firstrightitemsz;
 	bool		newitemisfirstonright;
 
@@ -1898,15 +2049,38 @@ _bt_checksplitloc(FindSplitData *state,
 
 	/*
 	 * The first item on the right page becomes the high key of the left page;
-	 * therefore it counts against left space as well as right space. When
+	 * therefore it counts against left space as well as right space (we
+	 * cannot assume that suffix truncation will make it any smaller).  When
 	 * index has included attributes, then those attributes of left page high
 	 * key will be truncated leaving that page with slightly more free space.
 	 * However, that shouldn't affect our ability to find valid split
-	 * location, because anyway split location should exists even without high
-	 * key truncation.
+	 * location, since we err in the direction of being pessimistic about free
+	 * space on the left half.  Besides, even when suffix truncation of
+	 * non-TID attributes occurs, there often won't be an entire MAXALIGN()
+	 * quantum in pivot space savings.
 	 */
 	leftfree -= firstrightitemsz;
 
+	/*
+	 * Assume that suffix truncation cannot avoid adding a heap TID to the
+	 * left half's new high key when splitting at the leaf level.  Don't let
+	 * this impact the balance of free space in the common case where adding a
+	 * heap TID is considered very unlikely, though, since there is no reason
+	 * to accept a likely-suboptimal split.
+	 *
+	 * When adding a heap TID seems likely, then actually factor that in to
+	 * delta calculation, rather than just having it as a constraint on
+	 * whether or not a split is acceptable.
+	 */
+	leftleafheaptidsz = 0;
+	if (state->is_leaf)
+	{
+		if (!state->hikeyheaptid)
+			leftleafheaptidsz = sizeof(ItemPointerData);
+		else
+			leftfree -= (int) sizeof(ItemPointerData);
+	}
+
 	/* account for the new item */
 	if (newitemonleft)
 		leftfree -= (int) state->newitemsz;
@@ -1922,20 +2096,23 @@ _bt_checksplitloc(FindSplitData *state,
 			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
 
 	/*
-	 * If feasible split point, remember best delta.
+	 * If feasible split point with lower delta than that of most marginal
+	 * split point so far, or we haven't run out of space for split points,
+	 * remember it.
 	 */
-	if (leftfree >= 0 && rightfree >= 0)
+	if (leftfree - leftleafheaptidsz >= 0 && rightfree >= 0)
 	{
 		int			delta;
 
-		if (state->is_rightmost)
+		if (state->is_weighted)
 		{
 			/*
-			 * If splitting a rightmost page, try to put (100-fillfactor)% of
-			 * free space on left page. See comments for _bt_findsplitloc.
+			 * If splitting a rightmost page, or in single value mode, try to
+			 * put (100-fillfactor)% of free space on left page. See comments
+			 * for _bt_findsplitloc.
 			 */
 			delta = (state->fillfactor * leftfree)
-				- ((100 - state->fillfactor) * rightfree);
+				- ((1.0 - state->fillfactor) * rightfree);
 		}
 		else
 		{
@@ -1945,14 +2122,341 @@ _bt_checksplitloc(FindSplitData *state,
 
 		if (delta < 0)
 			delta = -delta;
-		if (!state->have_split || delta < state->best_delta)
+		/*
+		 * Optimization: Don't recognize differences among marginal split
+		 * points that are unlikely to end up being used anyway.
+		 *
+		 * We cannot do this in many duplicates mode, because that hurts cases
+		 * where there are a small number of available distinguishing split
+		 * points, and consistently picking the least worst choice among them
+		 * matters. (e.g., a non-unique index whose leaf pages each contain a
+		 * small number of distinct values, with each value duplicated a
+		 * uniform number of times.)
+		 */
+		if (delta > state->goodenough && state->mode != SPLIT_MANY_DUPLICATES)
+			delta = state->goodenough + 1;
+		if (state->nsplits < state->maxsplits ||
+			delta < state->splits[state->nsplits - 1].delta)
 		{
-			state->have_split = true;
-			state->newitemonleft = newitemonleft;
-			state->firstright = firstoldonright;
-			state->best_delta = delta;
+			SplitPoint	newsplit;
+			int			j;
+
+			newsplit.delta = delta;
+			newsplit.newitemonleft = newitemonleft;
+			newsplit.firstright = firstoldonright;
+
+			/*
+			 * Make space at the end of the state array for new candidate
+			 * split point if we haven't already reached the maximum number of
+			 * split points.
+			 */
+			if (state->nsplits < state->maxsplits)
+				state->nsplits++;
+
+			/*
+			 * Replace the final item in the nsplits-wise array.  The final
+			 * item is either a garbage still-uninitialized entry, or the most
+			 * marginal real entry when we already have as many split points
+			 * as we're willing to consider.
+			 */
+			for (j = state->nsplits - 1;
+				 j > 0 && state->splits[j - 1].delta > newsplit.delta;
+				 j--)
+			{
+				state->splits[j] = state->splits[j - 1];
+			}
+			state->splits[j] = newsplit;
+		}
+
+		return delta;
+	}
+
+	return INT_MAX;
+}
+
+/*
+ * Subroutine to find the "best" split point among an array of acceptable
+ * candidate split points that split without there being an excessively high
+ * delta between the space left free on the left and right halves.  The "best"
+ * split point is the split point with the lowest penalty, which is an
+ * abstract idea whose definition varies depending on whether we're splitting
+ * at the leaf level, or an internal level.  See _bt_split_penalty() for the
+ * definition.
+ *
+ * "perfectpenalty" is assumed to be the lowest possible penalty among
+ * candidate split points.  This allows us to return early without wasting
+ * cycles on calculating the first differing attribute for all candidate
+ * splits when that clearly cannot improve our choice.  This optimization is
+ * important for several common cases, including insertion into a primary key
+ * index on an auto-incremented or monotonically increasing integer column.
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating if new item is on left of split
+ * point.
+ */
+static OffsetNumber
+_bt_bestsplitloc(Relation rel,
+				 Page page,
+				 FindSplitData *state,
+				 int perfectpenalty,
+				 OffsetNumber newitemoff,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	int			bestpenalty,
+				lowsplit;
+
+	/* No point calculating penalties in trivial cases */
+	if (perfectpenalty == INT_MAX || state->nsplits == 1)
+	{
+		*newitemonleft = state->splits[0].newitemonleft;
+		return state->splits[0].firstright;
+	}
+
+	bestpenalty = INT_MAX;
+	lowsplit = 0;
+	for (int i = 0; i < state->nsplits; i++)
+	{
+		int			penalty;
+
+		penalty = _bt_split_penalty(rel, page, newitemoff, newitem,
+									state->splits + i, state->is_leaf);
+
+		if (penalty <= perfectpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+			break;
+		}
+
+		if (penalty < bestpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
 		}
 	}
+
+	*newitemonleft = state->splits[lowsplit].newitemonleft;
+	return state->splits[lowsplit].firstright;
+}
+
+/*
+ * Subroutine to find the lowest possible penalty for any acceptable candidate
+ * split point.  This may be lower than any real penalty for any of the
+ * candidate split points, in which case the optimization is ineffective.
+ * Split penalties are generally discrete rather than continuous, so an
+ * actually-obtainable penalty is common.
+ *
+ * This is also a convenient point to decide to either finish splitting
+ * the page using the default strategy, or, alternatively, to do a second pass
+ * over page using a different strategy.  (This only happens with leaf pages.)
+ */
+static int
+_bt_perfect_penalty(Relation rel, Page page, FindSplitData *state,
+					OffsetNumber newitemoff, IndexTuple newitem,
+					SplitMode *secondmode)
+{
+	ItemId		itemid;
+	OffsetNumber center;
+	IndexTuple	leftmost,
+				rightmost;
+	int			perfectpenalty;
+
+	/* Assume that a second pass over page won't be required for now */
+	*secondmode = SPLIT_DEFAULT;
+
+	/*
+	 * There are a much smaller number of candidate split points when
+	 * splitting an internal page, so we can afford to be exhaustive.  Only
+	 * give up when the pivot that will be inserted into the parent is as small
+	 * as possible.
+	 */
+	if (!state->is_leaf)
+		return MAXALIGN(sizeof(IndexTupleData) + 1);
+
+	/*
+	 * During a many duplicates pass over page, we settle for a "perfect"
+	 * split point that merely avoids appending a heap TID in new pivot.
+	 * Appending a heap TID is harmful enough to fan-out that it's worth
+	 * avoiding at all costs, but it doesn't make sense to go to those lengths
+	 * to also be able to truncate an extra, earlier attribute.
+	 */
+	if (state->mode == SPLIT_MANY_DUPLICATES)
+		return IndexRelationGetNumberOfKeyAttributes(rel);
+	else if (state->mode == SPLIT_SINGLE_VALUE)
+		return INT_MAX;
+
+	/*
+	 * Complicated though common case -- leaf page default mode split.
+	 *
+	 * Iterate from the end of split array to the start, in search of the
+	 * firstright-wise leftmost and rightmost entries among acceptable split
+	 * points.  The split point with the lowest delta is at the start of the
+	 * array.  It is deemed to be the split point whose firstright offset is
+	 * at the center.  Split points with firstright offsets at both the left
+	 * and right extremes among acceptable split points will be found at the
+	 * end of caller's array.
+	 */
+	leftmost = NULL;
+	rightmost = NULL;
+	center = state->splits[0].firstright;
+
+	/*
+	 * Leaf split points can be thought of as points _between_ tuples on the
+	 * original unsplit page image, at least if you pretend that the incoming
+	 * tuple is already on the page to be split (imagine that the original
+	 * unsplit page actually had enough space to fit the incoming tuple).  The
+	 * rightmost tuple is the tuple that is immediately to the right of a
+	 * split point that is itself rightmost.  Likewise, the leftmost tuple is
+	 * the tuple to the left of the leftmost split point.  It's important that
+	 * many duplicates mode has every opportunity to avoid picking a split
+	 * point that requires that suffix truncation append a heap TID to new
+	 * pivot tuple.
+	 *
+	 * When there are very few candidates, no sensible comparison can be made
+	 * here, resulting in caller selecting lowest delta/the center split point
+	 * by default.  Typically, leftmost and rightmost tuples will be located
+	 * almost immediately.
+	 */
+	perfectpenalty = IndexRelationGetNumberOfKeyAttributes(rel);
+	for (int j = state->nsplits - 1; j > 1; j--)
+	{
+		SplitPoint *split = state->splits + j;
+
+		if (!leftmost && split->firstright <= center)
+		{
+			if (split->newitemonleft && newitemoff == split->firstright)
+				leftmost = newitem;
+			else
+			{
+				itemid = PageGetItemId(page,
+									   OffsetNumberPrev(split->firstright));
+				leftmost = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+
+		if (!rightmost && split->firstright >= center)
+		{
+			if (!split->newitemonleft && newitemoff == split->firstright)
+				rightmost = newitem;
+			else
+			{
+				itemid = PageGetItemId(page, split->firstright);
+				rightmost = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+
+		if (leftmost && rightmost)
+		{
+			Assert(leftmost != rightmost);
+			perfectpenalty = _bt_leave_natts_fast(rel, leftmost, rightmost);
+			break;
+		}
+	}
+
+	/*
+	 * Work out which type of second pass caller must perform when even a
+	 * "perfect" penalty fails to avoid appending a heap TID to new pivot
+	 * tuple.
+	 */
+	if (perfectpenalty > IndexRelationGetNumberOfKeyAttributes(rel))
+	{
+		BTPageOpaque opaque;
+		OffsetNumber maxoff;
+		int			outerpenalty;
+
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		if (P_FIRSTDATAKEY(opaque) == newitemoff)
+			leftmost = newitem;
+		else
+		{
+			itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+			leftmost = (IndexTuple) PageGetItem(page, itemid);
+		}
+
+		if (newitemoff > maxoff)
+			rightmost = newitem;
+		else
+		{
+			itemid = PageGetItemId(page, maxoff);
+			rightmost = (IndexTuple) PageGetItem(page, itemid);
+		}
+
+		Assert(leftmost != rightmost);
+		outerpenalty = _bt_leave_natts_fast(rel, leftmost, rightmost);
+
+		/*
+		 * If page has many duplicates but is not entirely full of duplicates,
+		 * a many duplicates mode pass will be performed.  If page is entirely
+		 * full of duplicates, a single value mode pass will be performed.
+		 *
+		 * Caller should avoid a single value mode pass when incoming tuple
+		 * doesn't sort highest among items on the page, though.  Instead, we
+		 * instruct caller to continue with original default mode split, since
+		 * an out-of-order new duplicate item predicts further inserts towards
+		 * the left/middle of the original page's keyspace.  Evenly sharing
+		 * space among each half of the split avoids pathological performance.
+		 */
+		if (outerpenalty > IndexRelationGetNumberOfKeyAttributes(rel))
+		{
+			if (maxoff < newitemoff)
+				*secondmode = SPLIT_SINGLE_VALUE;
+			else
+			{
+				perfectpenalty = INT_MAX;
+				*secondmode = SPLIT_DEFAULT;
+			}
+		}
+		else
+			*secondmode = SPLIT_MANY_DUPLICATES;
+	}
+
+	return perfectpenalty;
+}
+
+/*
+ * Subroutine to find penalty for caller's candidate split point.
+ *
+ * On leaf pages, penalty is the attribute number that distinguishes each side
+ * of a split.  It's the last attribute that needs to be included in new high
+ * key for left page.  It can be greater than the number of key attributes in
+ * cases where a heap TID needs to be appending during truncation.
+ *
+ * On internal pages, penalty is simply the size of the first item on the
+ * right half of the split (excluding ItemId overhead) which becomes the new
+ * high key for the left page.  Internal page splits always use default mode.
+ */
+static int
+_bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split, bool is_leaf)
+{
+	ItemId		itemid;
+	IndexTuple	lastleft;
+	IndexTuple	firstright;
+
+	if (split->newitemonleft && newitemoff == split->firstright)
+		lastleft = newitem;
+	else
+	{
+		itemid = PageGetItemId(page, OffsetNumberPrev(split->firstright));
+		lastleft = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	if (!split->newitemonleft && newitemoff == split->firstright)
+		firstright = newitem;
+	else
+	{
+		itemid = PageGetItemId(page, split->firstright);
+		firstright = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	if (!is_leaf)
+		return IndexTupleSize(firstright);
+
+	Assert(lastleft != firstright);
+	return _bt_leave_natts_fast(rel, lastleft, firstright);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 629066fcf9..449b5bc63b 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -22,6 +22,7 @@
 #include "access/relscan.h"
 #include "miscadmin.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -2323,6 +2324,83 @@ _bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	return leavenatts;
 }
 
+/*
+ * _bt_leave_natts_fast - fast, approximate variant of _bt_leave_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location.  A naive bitwise approach to datum comparisons is used to
+ * save cycles.  This is inherently approximate, but usually provides the same
+ * answer as the authoritative approach that _bt_leave_natts takes, since the
+ * vast majority of types in Postgres cannot be equal according to any
+ * available opclass unless they're bitwise equal.
+ *
+ * Testing has shown that an approach involving treating the tuple as a
+ * decomposed binary string would work almost as well as the approach taken
+ * here.  It would also be faster.  It might actually be necessary to go that
+ * way in the future, if suffix truncation is made sophisticated enough to
+ * truncate at a finer granularity (i.e. truncate within an attribute, rather
+ * than just truncating away whole attributes).  The current approach isn't
+ * markedly slower, since it works particularly well with the "perfect
+ * penalty" optimization (there are fewer, more expensive calls here).  It
+ * also works with INCLUDE indexes (indexes with non-key attributes) without
+ * any special effort.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+int
+_bt_leave_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			leavenatts;
+
+	/*
+	 * Using authoritative comparisons makes no difference in almost all
+	 * cases. However, there are a small number of shipped opclasses where
+	 * there might occasionally be an inconsistency between the answers given
+	 * by this function and _bt_leave_natts().  This includes numeric_ops,
+	 * since display scale might vary among logically equal datums.
+	 * Case-insensitive collations may also be interesting.
+	 *
+	 * This is assumed to be okay, since there is no risk that inequality will
+	 * look like equality.  Suffix truncation may be less effective than it
+	 * could be in these narrow cases, but it should be impossible for caller
+	 * to spuriously perform a second pass to find a split location, where
+	 * evenly splitting the page is given secondary importance.
+	 */
+#ifdef AUTHORITATIVE_COMPARE_TEST
+	return _bt_leave_natts(rel, lastleft, firstright);
+#endif
+
+	leavenatts = 1;
+	for (int attnum = 1; attnum <= keysz; attnum++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		att = TupleDescAttr(itupdesc, attnum - 1);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			break;
+
+		leavenatts++;
+	}
+
+	return leavenatts;
+}
+
 /*
  *  _bt_check_natts() -- Verify tuple has expected number of attributes.
  *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 30340e9c02..995fb8cc8d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -144,11 +144,15 @@ typedef struct BTMetaPageData
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
  * a rightmost page; when splitting non-rightmost pages we try to
- * divide the data equally.
+ * divide the data equally.  When splitting a page that's entirely
+ * filled with a single value (duplicates), the leaf-page
+ * fillfactor is overridden, and is applied regardless of whether
+ * the page is a rightmost page.
  */
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#define BTREE_SINGLEVAL_FILLFACTOR	99
 
 /*
  *	In general, the btree code tries to localize its knowledge about
@@ -706,6 +710,8 @@ extern bool btproperty(Oid index_oid, int attno,
 		   bool *res, bool *isnull);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 			 IndexTuple firstright, bool build);
+extern int _bt_leave_natts_fast(Relation rel, IndexTuple lastleft,
+					 IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap, Page page,
 					 IndexTuple newtup);
-- 
2.17.1
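
To see the shape of _bt_leave_natts_fast() without the nbtree scaffolding, here
is a minimal standalone sketch.  It is not code from the patch: the tuple
representation is simplified to fixed-width attributes compared with memcmp(),
standing in for index_getattr() plus datumIsEqual().  The loop keeps counting
attributes until it hits the first pair that isn't bitwise equal, and reports
one more than the length of the common prefix -- a result of natts + 1 means a
heap TID tie-breaker would be needed.

"""
#include <stdio.h>
#include <string.h>

/*
 * Simplified stand-in for an index tuple: natts fixed-width attributes of
 * attlen bytes each, stored back to back.  The real code works with
 * IndexTuple, index_getattr() and datumIsEqual(); only the loop shape is
 * illustrated here.
 */
typedef struct SimpleTuple
{
	int			natts;
	int			attlen;
	const char *data;
} SimpleTuple;

/*
 * Number of attributes a pivot tuple must keep to still separate lastleft
 * from firstright, assuming bitwise inequality implies opclass inequality
 * (true for the vast majority of types).  natts + 1 means every key
 * attribute is bitwise equal, so a heap TID tie-breaker is required.
 */
static int
leave_natts_fast(const SimpleTuple *lastleft, const SimpleTuple *firstright)
{
	int			leavenatts = 1;

	for (int attnum = 1; attnum <= lastleft->natts; attnum++)
	{
		const char *d1 = lastleft->data + (attnum - 1) * lastleft->attlen;
		const char *d2 = firstright->data + (attnum - 1) * firstright->attlen;

		if (memcmp(d1, d2, lastleft->attlen) != 0)
			break;				/* first distinguishing attribute found */
		leavenatts++;
	}

	return leavenatts;
}

int
main(void)
{
	/* two 3-attribute tuples of int-sized attributes: (1,7,2) vs (1,7,9) */
	int			left[3] = {1, 7, 2};
	int			right[3] = {1, 7, 9};
	SimpleTuple lastleft = {3, sizeof(int), (const char *) left};
	SimpleTuple firstright = {3, sizeof(int), (const char *) right};

	/* prints 3: attributes 1 and 2 are equal, attribute 3 distinguishes */
	printf("leave natts: %d\n", leave_natts_fast(&lastleft, &firstright));
	return 0;
}
"""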

Attachment: v8-0004-Add-split-at-new-tuple-page-split-optimization.patch (text/x-patch)
From 98197e834343b804308f681b7110444499c79eed Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 16:48:08 -0700
Subject: [PATCH v8 4/6] Add split-at-new-tuple page split optimization.

Add additional heuristics to the algorithm for locating an optimal split
location.  New logic identifies localized monotonically increasing
values by recognizing adjacent heap TIDs.  Only non-rightmost pages are
affected, to preserve existing behavior.

This enhancement is new to version 6 of the patch series.  This
enhancement has been demonstrated to be very effective at avoiding index
bloat when initial bulk INSERTs for the TPC-C benchmark are run.
Evidently, the primary keys for all of the largest indexes in the TPC-C
schema are populated through localized, monotonically increasing values:

Master
======

order_line_pkey: 774 MB
stock_pkey: 181 MB
idx_customer_name: 107 MB
oorder_pkey: 78 MB
customer_pkey: 75 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 60 MB
new_order_pkey: 22 MB
item_pkey: 2216 kB
district_pkey: 40 kB
warehouse_pkey: 24 kB

Patch series, up to and including this commit
=============================================

order_line_pkey: 451 MB
stock_pkey: 114 MB
idx_customer_name: 105 MB
oorder_pkey: 45 MB
customer_pkey: 48 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 61 MB
new_order_pkey: 13 MB
item_pkey: 2216 kB
district_pkey: 40 kB
warehouse_pkey: 24 kB

Without this patch, but with all previous patches in the series, a much
more modest reduction in the volume of bloat occurs when the same test
case is run.  There is a reduction in the size of the largest index (the
order line primary key) of ~5% of its original size, whereas we see a
reduction of ~42% here.  (Note that the patch series generally has very
little advantage over master if the indexes are rebuilt via a REINDEX,
with or without this later commit.)

I (Peter Geoghegan) will provide reviewers with a convenient copy of
this test data if asked.  It comes from the oltpbench fair-use
implementation of TPC-C [1], but the same issue has independently been
observed with the BenchmarkSQL implementation of TPC-C [2].

Note that this commit also recognizes and prevents bloat with
monotonically *decreasing* tuple insertions (e.g., single-DESC-attribute
index on a date column).  Affected cases will typically leave their
index structure slightly smaller than an equivalent monotonically
increasing case would.

[1] http://oltpbenchmark.com
[2] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c
---
 src/backend/access/nbtree/nbtinsert.c | 186 +++++++++++++++++++++++++-
 1 file changed, 184 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 0e37b8b23a..778805d6c1 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -100,6 +100,8 @@ static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 static int _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright, bool newitemonleft,
 				  int dataitemstoleft, Size firstoldonrightsz);
+static bool _bt_dosplitatnewitem(Relation rel, Page page,
+					 OffsetNumber newitemoff, IndexTuple newitem);
 static OffsetNumber _bt_bestsplitloc(Relation rel, Page page,
 				 FindSplitData *state,
 				 int perfectpenalty,
@@ -110,6 +112,7 @@ static int  _bt_perfect_penalty(Relation rel, Page page, FindSplitData *state,
 				 SplitMode *secondmode);
 static int _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
 				  IndexTuple newitem, SplitPoint *split, bool is_leaf);
+static bool _bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey,
@@ -1745,7 +1748,13 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * etc) we will end up with a tree whose pages are about fillfactor% full,
  * instead of the 50% full result that we'd get without this special case.
  * This is the same as nbtsort.c produces for a newly-created tree.  Note
- * that leaf and nonleaf pages use different fillfactors.
+ * that leaf and nonleaf pages use different fillfactors.  Note also that
+ * the fillfactor% is determined dynamically when _bt_dosplitatnewitem()
+ * indicates that there are localized monotonically increasing insertions,
+ * or monotonically decreasing (DESC order) insertions. (This can only
+ * happen with the default strategy, and should be thought of as a variant
+ * of the fillfactor% special case that is applied only when inserting into
+ * non-rightmost pages.)
  *
  * If called recursively in single value mode, we also try to arrange to
  * leave the left split page fillfactor% full, though we arrange to use a
@@ -1835,7 +1844,28 @@ _bt_findsplitloc(Relation rel,
 	state.is_weighted = P_RIGHTMOST(opaque);
 	if (state.is_leaf)
 	{
-		if (state.mode != SPLIT_SINGLE_VALUE)
+		/*
+		 * Consider split at new tuple optimization.  See
+		 * _bt_dosplitatnewitem() for an explanation.
+		 */
+		if (state.mode == SPLIT_DEFAULT && !P_RIGHTMOST(opaque) &&
+			_bt_dosplitatnewitem(rel, page, newitemoff, newitem))
+		{
+			/*
+			 * fillfactor% is dynamically set through interpolation of the
+			 * new/incoming tuple's offset position
+			 */
+			if (newitemoff > maxoff)
+				state.fillfactor = (double) BTREE_DEFAULT_FILLFACTOR / 100.0;
+			else if (newitemoff == P_FIRSTDATAKEY(opaque))
+				state.fillfactor = (double) BTREE_MIN_FILLFACTOR / 100.0;
+			else
+				state.fillfactor =
+					((double) newitemoff / (((double) maxoff + 1)));
+
+			state.is_weighted = true;
+		}
+		else if (state.mode != SPLIT_SINGLE_VALUE)
 		{
 			/* Only used on rightmost page */
 			state.fillfactor = RelationGetFillFactor(rel,
@@ -2174,6 +2204,126 @@ _bt_checksplitloc(FindSplitData *state,
 	return INT_MAX;
 }
 
+/*
+ * Subroutine to determine whether or not the page should be split at
+ * approximately the point that the new/incoming item would have been
+ * inserted.
+ *
+ * This routine infers two distinct cases in which splitting around the new
+ * item's insertion point is likely to lead to better space utilization over
+ * time:
+ *
+ * - Composite indexes that consist of one or more leading columns that
+ *   describe some grouping, plus a trailing, monotonically increasing
+ *   column.  If there happened to only be one grouping then the traditional
+ *   rightmost page split default fillfactor% would be used to good effect,
+ *   so it seems worth recognizing this case.  This usage pattern is
+ *   prevalent in the TPC-C benchmark, and is assumed to be common in real
+ *   world applications.
+ *
+ * - DESC-ordered insertions, including DESC-ordered single (non-heap-TID)
+ *   key attribute indexes.  We don't want the performance of explicitly
+ *   DESC-ordered indexes to be out of line with an equivalent ASC-ordered
+ *   index.  Also, there may be organic cases where items are continually
+ *   inserted in DESC order for an index with ASC sort order.
+ *
+ * Caller uses fillfactor% rather than using the new item offset directly
+ * because it allows suffix truncation to be applied using the usual
+ * criteria, which can still be helpful.  This approach is also more
+ * maintainable, since restrictions on split points can be handled in the
+ * usual way.
+ *
+ * Localized insert points are inferred here by observing that neighboring
+ * heap TIDs are "adjacent".  For example, if the new item has distinct key
+ * attribute values to the existing item that belongs to its immediate left,
+ * and the item to its left has a heap TID whose offset is exactly one less
+ * than the new item's offset, then caller is told to use its new-item-split
+ * strategy.  It isn't of much consequence if this routine incorrectly
+ * infers that an interesting case is taking place, provided that that
+ * doesn't happen very often.  In particular, it should not be possible to
+ * construct a test case where the routine consistently does the wrong
+ * thing.  Since heap TID "adjacency" is such a delicate condition, and
+ * since there is no reason to imagine that random insertions should ever
+ * consistently leave new tuples at the first or last position on the page
+ * when a split is triggered, that will never happen.
+ *
+ * Note that we avoid using the split-at-new fillfactor% when we'd have to
+ * append a heap TID during suffix truncation.  We also insist that there
+ * are no varwidth attributes or NULL attribute values in new item, since
+ * that invalidates interpolating from the new item offset.  Besides,
+ * varwidths generally imply the use of datatypes where ordered insertions
+ * are not a naturally occurring phenomenon.
+ */
+static bool
+_bt_dosplitatnewitem(Relation rel, Page page, OffsetNumber newitemoff,
+					 IndexTuple newitem)
+{
+	ItemId		itemid;
+	OffsetNumber maxoff;
+	BTPageOpaque opaque;
+	IndexTuple	tup;
+	int16		nkeyatts;
+
+	if (IndexTupleHasNulls(newitem) || IndexTupleHasVarwidths(newitem))
+		return false;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Avoid optimization entirely on pages with large items */
+	if (maxoff <= 3)
+		return false;
+
+	nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+	/*
+	 * When heap TIDs appear in DESC order, consider left-heavy split.
+	 *
+	 * Accept left-heavy split when new item, which will be inserted at first
+	 * data offset, has adjacent TID to extant item at that position.
+	 */
+	if (newitemoff == P_FIRSTDATAKEY(opaque))
+	{
+		itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+		tup = (IndexTuple) PageGetItem(page, itemid);
+
+		return _bt_adjacenthtid(&tup->t_tid, &newitem->t_tid) &&
+			_bt_leave_natts_fast(rel, tup, newitem) <= nkeyatts;
+	}
+
+	/* Single key indexes only use DESC optimization */
+	if (nkeyatts == 1)
+		return false;
+
+	/*
+	 * When tuple heap TIDs appear in ASC order, consider right-heavy split,
+	 * even though this may not be the right-most page.
+	 *
+	 * Accept right-heavy split when new item, which belongs after any
+	 * existing page offset, has adjacent TID to extant item that's the last
+	 * on the page.
+	 */
+	if (newitemoff > maxoff)
+	{
+		itemid = PageGetItemId(page, maxoff);
+		tup = (IndexTuple) PageGetItem(page, itemid);
+
+		return _bt_adjacenthtid(&tup->t_tid, &newitem->t_tid) &&
+			_bt_leave_natts_fast(rel, tup, newitem) <= nkeyatts;
+	}
+
+	/*
+	 * When new item is approximately in the middle of the page, look for
+	 * adjacency among new item, and extant item that belongs to the left of
+	 * the new item in the keyspace.
+	 */
+	itemid = PageGetItemId(page, OffsetNumberPrev(newitemoff));
+	tup = (IndexTuple) PageGetItem(page, itemid);
+
+	return _bt_adjacenthtid(&tup->t_tid, &newitem->t_tid) &&
+		_bt_leave_natts_fast(rel, tup, newitem) <= nkeyatts;
+}
+
 /*
  * Subroutine to find the "best" split point among an array of acceptable
  * candidate split points that split without there being an excessively high
@@ -2459,6 +2609,38 @@ _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
 	return _bt_leave_natts_fast(rel, lastleft, firstright);
 }
 
+/*
+ * Subroutine for determining if two heap TIDs are "adjacent".
+ *
+ * Adjacent means that the high TID is very likely to have been inserted into
+ * heap relation immediately after the low TID, probably by the same
+ * transaction, and probably not through heap_update().  This is not a
+ * commutative condition.
+ */
+static bool
+_bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid)
+{
+	BlockNumber lowblk,
+				highblk;
+	OffsetNumber lowoff,
+				highoff;
+
+	lowblk = ItemPointerGetBlockNumber(lowhtid);
+	highblk = ItemPointerGetBlockNumber(highhtid);
+	lowoff = ItemPointerGetOffsetNumber(lowhtid);
+	highoff = ItemPointerGetOffsetNumber(highhtid);
+
+	/* When heap blocks match, second offset should be one up */
+	if (lowblk == highblk && OffsetNumberNext(lowoff) == highoff)
+		return true;
+
+	/* When heap block one up, second offset should be FirstOffsetNumber */
+	if (lowblk + 1 == highblk && highoff == FirstOffsetNumber)
+		return true;
+
+	return false;
+}
+
 /*
  * _bt_insert_parent() -- Insert downlink into parent after a page split.
  *
-- 
2.17.1
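
The 0004 heuristic boils down to two ingredients: a heap TID adjacency test,
and interpolating a leaf fillfactor from where the incoming tuple would land
on the page.  Below is a minimal standalone sketch of both, not code from the
patch: offsets start at 1 and the page high key is ignored, and the 0.90/0.10
endpoints simply mirror BTREE_DEFAULT_FILLFACTOR and BTREE_MIN_FILLFACTOR.

"""
#include <stdbool.h>
#include <stdio.h>

/* Simplified heap TID: block number plus 1-based offset within the block */
typedef struct SimpleTid
{
	unsigned	block;
	unsigned	offset;
} SimpleTid;

/*
 * "Adjacent" means the high TID was very probably created immediately after
 * the low TID: same heap block and next offset, or else the first offset of
 * the next heap block.  Mirrors the shape of _bt_adjacenthtid() above,
 * without the ItemPointer macros.
 */
static bool
adjacent_htid(const SimpleTid *low, const SimpleTid *high)
{
	if (low->block == high->block && low->offset + 1 == high->offset)
		return true;
	if (low->block + 1 == high->block && high->offset == 1)
		return true;
	return false;
}

/*
 * Interpolate a leaf fillfactor from where the new item would land, with
 * the endpoints pinned to the 90/10 values used for rightmost/leftmost
 * insertion points.
 */
static double
split_at_new_item_fillfactor(unsigned newitemoff, unsigned maxoff)
{
	if (newitemoff > maxoff)
		return 0.90;			/* new item goes at the end of the page */
	if (newitemoff == 1)
		return 0.10;			/* new item goes at the start of the page */
	return (double) newitemoff / ((double) maxoff + 1);
}

int
main(void)
{
	SimpleTid	last_on_page = {42, 7};
	SimpleTid	incoming = {42, 8};	/* next offset in the same heap block */

	if (adjacent_htid(&last_on_page, &incoming))
		printf("fillfactor: %.2f\n",
			   split_at_new_item_fillfactor(101, 100));	/* prints 0.90 */
	return 0;
}
"""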

Attachment: v8-0005-Add-high-key-continuescan-optimization.patch (text/x-patch)
From 0bd4225b9214e0dcbcdb498004c48d1f3b2ad34d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 12 Nov 2018 13:11:21 -0800
Subject: [PATCH v8 5/6] Add high key "continuescan" optimization.

Teach B-Tree forward index scans to check the high key before moving to
the next page in the hopes of finding that it isn't actually necessary
to move to the next page.  We already opportunistically force a key
check of the last item on leaf pages, even when it's clear that it
cannot be returned to the scan due to being dead-to-all, for the same
reason.  Since forcing the last item to be key checked no longer makes
any difference in the case of forward scans, the existing extra key
check is now only used for backwards scans.  Like the existing check,
the new check won't always work out, but that seems like an acceptable
price to pay.

The new approach is more effective than just checking non-pivot tuples,
especially with composite indexes and non-unique indexes.  The high key
represents an upper bound on all values that can appear on the page,
which is often greater than whatever tuple happens to appear last at the
time of the check.  Also, suffix truncation's new logic for picking a
split point will often result in high keys that are relatively
dissimilar to the other (non-pivot) tuples on the page, and therefore
more likely to indicate that the scan need not proceed to the next page.
---
 src/backend/access/nbtree/nbtsearch.c | 23 +++++++---
 src/backend/access/nbtree/nbtutils.c  | 60 +++++++++++++++++++++------
 2 files changed, 65 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index aceadf44ed..c3d6bf8b9b 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1427,7 +1427,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	OffsetNumber maxoff;
 	int			itemIndex;
 	IndexTuple	itup;
-	bool		continuescan;
+	bool		continuescan = true;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1495,16 +1495,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 				_bt_saveitem(so, itemIndex, offnum, itup);
 				itemIndex++;
 			}
+			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
-			{
-				/* there can't be any more matches, so stop */
-				so->currPos.moreRight = false;
 				break;
-			}
 
 			offnum = OffsetNumberNext(offnum);
 		}
 
+		/*
+		 * Forward scans need not visit page to the right when high key
+		 * indicates no more matches will be found there.
+		 *
+		 * Checking the high key like this works out more often than you'd
+		 * think.  Leaf page splits pick a split point between the two most
+		 * dissimilar tuples within a range of acceptable split points.  There
+		 * is often natural locality around what ends up on each leaf page,
+		 * which is worth taking advantage of here.
+		 */
+		if (!P_RIGHTMOST(opaque) && continuescan)
+			(void) _bt_checkkeys(scan, page, P_HIKEY, dir, &continuescan);
+
+		if (!continuescan)
+			so->currPos.moreRight = false;
+
 		Assert(itemIndex <= MaxIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 449b5bc63b..0d2f43ee58 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -48,7 +48,7 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
 static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
-					 IndexTuple tuple, TupleDesc tupdesc,
+					 IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
 static int _bt_leave_natts(Relation rel, IndexTuple lastleft,
 				IndexTuple firstright, bool build);
@@ -1393,7 +1393,10 @@ _bt_mark_scankey_required(ScanKey skey)
  * dir: direction we are scanning in
  * continuescan: output parameter (will be set correctly in all cases)
  *
- * Caller must hold pin and lock on the index page.
+ * Caller must hold pin and lock on the index page.  Caller can pass a high
+ * key offnum in the hopes of discovering that the scan need not continue on
+ * to a page to the right.  We don't currently bother limiting high key
+ * comparisons to SK_BT_REQFWD scan keys.
  */
 IndexTuple
 _bt_checkkeys(IndexScanDesc scan,
@@ -1403,6 +1406,7 @@ _bt_checkkeys(IndexScanDesc scan,
 	ItemId		iid = PageGetItemId(page, offnum);
 	bool		tuple_alive;
 	IndexTuple	tuple;
+	int			tupnatts;
 	TupleDesc	tupdesc;
 	BTScanOpaque so;
 	int			keysz;
@@ -1416,21 +1420,24 @@ _bt_checkkeys(IndexScanDesc scan,
 	 * killed tuple as not passing the qual.  Most of the time, it's a win to
 	 * not bother examining the tuple's index keys, but just return
 	 * immediately with continuescan = true to proceed to the next tuple.
-	 * However, if this is the last tuple on the page, we should check the
-	 * index keys to prevent uselessly advancing to the next page.
+	 * However, if this is the first tuple on the page, and we're doing a
+	 * backward scan, we should check the index keys to prevent uselessly
+	 * advancing to the page to the left.  This is similar to the high key
+	 * optimization used by forward scan callers.
 	 */
 	if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
 	{
-		/* return immediately if there are more tuples on the page */
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+		Assert(offnum != P_HIKEY || P_RIGHTMOST(opaque));
 		if (ScanDirectionIsForward(dir))
 		{
-			if (offnum < PageGetMaxOffsetNumber(page))
-				return NULL;
+			/* forward scan callers check high key instead */
+			return NULL;
 		}
 		else
 		{
-			BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
+			/* return immediately if there are more tuples on the page */
 			if (offnum > P_FIRSTDATAKEY(opaque))
 				return NULL;
 		}
@@ -1445,6 +1452,7 @@ _bt_checkkeys(IndexScanDesc scan,
 		tuple_alive = true;
 
 	tuple = (IndexTuple) PageGetItem(page, iid);
+	tupnatts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
 
 	tupdesc = RelationGetDescr(scan->indexRelation);
 	so = (BTScanOpaque) scan->opaque;
@@ -1456,11 +1464,24 @@ _bt_checkkeys(IndexScanDesc scan,
 		bool		isNull;
 		Datum		test;
 
-		Assert(key->sk_attno <= BTreeTupleGetNAtts(tuple, scan->indexRelation));
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (key->sk_attno > tupnatts)
+		{
+			Assert(offnum == P_HIKEY);
+			Assert(ScanDirectionIsForward(dir));
+			continue;
+		}
+
 		/* row-comparison keys need special processing */
 		if (key->sk_flags & SK_ROW_HEADER)
 		{
-			if (_bt_check_rowcompare(key, tuple, tupdesc, dir, continuescan))
+			if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+									 continuescan))
 				continue;
 			return NULL;
 		}
@@ -1591,8 +1612,8 @@ _bt_checkkeys(IndexScanDesc scan,
  * This is a subroutine for _bt_checkkeys, which see for more info.
  */
 static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
-					 ScanDirection dir, bool *continuescan)
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
 {
 	ScanKey		subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
 	int32		cmpresult = 0;
@@ -1609,6 +1630,19 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
 
 		Assert(subkey->sk_flags & SK_ROW_MEMBER);
 
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (subkey->sk_attno > tupnatts)
+		{
+			Assert(ScanDirectionIsForward(dir));
+			cmpresult = 0;
+			continue;
+		}
+
 		datum = index_getattr(tuple,
 							  subkey->sk_attno,
 							  tupdesc,
-- 
2.17.1
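
The 0005 optimization can be illustrated with a toy model of a leaf page:
integer keys, a single inclusive upper bound, and a high key that bounds
everything on the page.  This is a simplified sketch rather than the patch's
code (the real check goes through _bt_checkkeys() and must cope with
truncated high key attributes), but it shows the payoff: even when every item
on the page passes the qual, the high key can prove that the page to the
right cannot contain further matches.

"""
#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model of a leaf page seen by a forward scan with an inclusive upper
 * bound: items[] holds the keys in ascending order, and highkey is an upper
 * bound on everything that can ever appear on this page (and a lower bound
 * on everything on the page to the right).
 */
typedef struct ToyLeafPage
{
	int			nitems;
	int			items[8];
	int			highkey;
	bool		rightmost;		/* no high key / no right sibling */
} ToyLeafPage;

/*
 * Should the scan continue onto the right sibling after reading this page?
 * Checking the high key can answer "no" even when the last item on the page
 * happened to satisfy the qual, which is the point of the optimization.
 */
static bool
continue_right(const ToyLeafPage *page, int scan_upper_bound)
{
	for (int i = 0; i < page->nitems; i++)
	{
		if (page->items[i] > scan_upper_bound)
			return false;		/* required key failed: no more matches */
	}

	if (page->rightmost)
		return false;			/* nowhere to go */

	/* all items matched; only move right if the high key also matches */
	return page->highkey <= scan_upper_bound;
}

int
main(void)
{
	/*
	 * Every item passes "key <= 30", but the high key proves that the right
	 * sibling cannot contain matches, so the scan stops here.
	 */
	ToyLeafPage page = {3, {10, 20, 30}, 45, false};

	printf("move right: %s\n", continue_right(&page, 30) ? "yes" : "no");
	return 0;
}
"""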

Attachment: v8-0006-DEBUG-Add-pageinspect-instrumentation.patch (text/x-patch)
From 54e90728f929c6045f221d12efade801c9bbf222 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v8 6/6] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 68 +++++++++++++++----
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 ++++++
 3 files changed, 79 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index bee1f1c9d9..93660de557 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
@@ -242,6 +243,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -253,9 +255,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[7];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -264,6 +266,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer htid;
+	BTPageOpaque opaque;
 
 	id = PageGetItemId(page, offset);
 
@@ -282,16 +286,53 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = natts;
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		indnkeyatts = rel->rd_index->indnkeyatts;
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (P_ISLEAF(opaque) && offset >= P_FIRSTDATAKEY(opaque))
+		htid = &itup->t_tid;
+	else if (_bt_heapkeyspace(rel))
+		htid = BTreeTupleGetHeapTID(itup);
+	else
+		htid = NULL;
+
+	if (htid)
+		values[j] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(htid),
+							 ItemPointerGetOffsetNumberNoCheck(htid));
+	else
+		values[j] = NULL;
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -365,11 +406,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -396,12 +437,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -481,7 +523,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

Attachment: v8-0002-Treat-heap-TID-as-part-of-the-nbtree-key-space.patch (text/x-patch)
From 03ba25231a44322df619d2c40ed7381f14050543 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v8 2/6] Treat heap TID as part of the nbtree key space.

Make nbtree treat all index tuples as having a heap TID trailing key
attribute.  Heap TID becomes a first class part of the key space on all
levels of the tree.  Index searches can distinguish duplicates by heap
TID, at least in principle.  Secondary index insertions will descend
straight to the leaf page that they'll insert on to (unless there is a
concurrent page split).  This general approach has numerous benefits for
performance, and is prerequisite to teaching VACUUM to perform "retail
index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will generally truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially when there are several attributes in an index.  Truncation
can only occur at the attribute granularity, which isn't particularly
effective, but works well enough for now.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tie-breaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with BTREE_VERSIONs 2 and
3, while also enforcing the newer/more strict invariants with
BTREE_VERSION 4 indexes.

We no longer allow a search for free space among multiple pages full of
duplicates to "get tired", except when needed to preserve compatibility
with earlier versions.  This has significant benefits for free space
management in secondary indexes on low cardinality attributes.  However,
without the next commit in the patch series (without having "single
value" mode and "many duplicates" mode within _bt_findsplitloc()), these
cases will be significantly regressed, since they'll naively perform
50:50 splits without there being any hope of reusing space left free on
the left half of the split.
---
 contrib/amcheck/verify_nbtree.c              | 364 ++++++++++--
 contrib/pageinspect/btreefuncs.c             |   2 +-
 contrib/pageinspect/expected/btree.out       |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out |  10 +-
 src/backend/access/common/indextuple.c       |   6 +-
 src/backend/access/nbtree/README             | 164 ++++--
 src/backend/access/nbtree/nbtinsert.c        | 584 ++++++++++++-------
 src/backend/access/nbtree/nbtpage.c          | 197 +++++--
 src/backend/access/nbtree/nbtree.c           |   2 +-
 src/backend/access/nbtree/nbtsearch.c        | 371 +++++++++---
 src/backend/access/nbtree/nbtsort.c          |  79 +--
 src/backend/access/nbtree/nbtutils.c         | 359 ++++++++++--
 src/backend/access/nbtree/nbtxlog.c          |  43 +-
 src/backend/access/rmgrdesc/nbtdesc.c        |   8 -
 src/backend/utils/sort/tuplesort.c           |  11 +-
 src/include/access/nbtree.h                  | 161 ++++-
 src/include/access/nbtxlog.h                 |  20 +-
 src/test/regress/expected/dependency.out     |   4 +-
 src/test/regress/expected/event_trigger.out  |   4 +-
 src/test/regress/expected/foreign_data.out   |   8 +-
 src/test/regress/expected/rowsecurity.out    |   4 +-
 src/tools/pgindent/typedefs.list             |   4 +
 22 files changed, 1751 insertions(+), 656 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index a1438a2855..85c227dc35 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -45,6 +45,13 @@ PG_MODULE_MAGIC;
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
 
+/*
+ * Convenience macro to get number of key attributes in tuple in low-context
+ * fashion
+ */
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
+
 /*
  * State associated with verifying a B-Tree index
  *
@@ -125,26 +132,28 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
-static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+static IndexTuple bt_right_page_check_tuple(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
 						  bool tupleIsAlive, void *checkstate);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
-static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound);
+static inline bool invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
-					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
-							   OffsetNumber upperbound);
+static inline bool invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							 BTScanInsert key,
+							 Page nontarget,
+							 OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+							IndexTuple itup, bool isleaf);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -834,8 +843,9 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		BTScanInsert skey;
+		bool		lowersizelimit;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -860,7 +870,6 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn),
 					 errhint("This could be a torn page problem.")));
-
 		/* Check the number of index tuple attributes */
 		if (!_bt_check_natts(state->rel, state->target, offset))
 		{
@@ -902,8 +911,66 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
-		/* Build insertion scankey for current page offset */
-		skey = _bt_mkscankey(state->rel, itup);
+		/*
+		 * Build insertion scankey for current page offset/tuple.
+		 *
+		 * As required by _bt_mkscankey(), track number of key attributes,
+		 * which is needed so that _bt_compare() calls handle truncated
+		 * attributes correctly.  Never count non-key attributes in
+		 * non-truncated tuples as key attributes, though.
+		 */
+		skey = _bt_mkscankey(state->rel, itup, false);
+
+		/*
+		 * Make sure tuple size does not exceed the relevant BTREE_VERSION
+		 * specific limit.
+		 *
+		 * BTREE_VERSION 4 (which introduced heapkeyspace rules) requisitioned
+		 * a MAXALIGN() quantum of space from BTMaxItemSize() in order to
+		 * ensure that suffix truncation always has enough space to add an
+		 * explicit heap TID back to a tuple -- we pessimistically assume that
+		 * every newly inserted tuple will eventually need to have a heap TID
+		 * appended during a future leaf page split, when the tuple becomes
+		 * the basis of the new high key (pivot tuple) for the leaf page.
+		 *
+		 * Since a MAXALIGN() quantum is reserved for that purpose, we must
+		 * not enforce the slightly lower limit when the extra quantum has
+		 * been used as intended.  In other words, there is only a
+		 * cross-version difference in the limit on tuple size within leaf
+		 * pages.
+		 *
+		 * Still, we're particular about the details within BTREE_VERSION 4
+		 * internal pages.  Pivot tuples may only use the extra quantum for
+		 * its designated purpose.  Enforce the lower limit for pivot tuples
+		 * when an explicit heap TID isn't actually present. (In all other
+		 * cases suffix truncation is guaranteed to generate a pivot tuple
+		 * that's no larger than the first right tuple provided to it by its
+		 * caller.)
+		 */
+		lowersizelimit = skey->heapkeyspace &&
+			(P_ISLEAF(topaque) || BTreeTupleGetHeapTID(itup) == NULL);
+		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
+					   BTMaxItemSizeNoHeapTid(state->target)))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
+							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index row size %zu exceeds maximum for index \"%s\"",
+							tupsize, RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap" : "index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -928,9 +995,35 @@ bt_target_page_check(BtreeCheckState *state)
 		 * grandparents (as well as great-grandparents, and so on).  We don't
 		 * go to those lengths because that would be prohibitively expensive,
 		 * and probably not markedly more effective in practice.
+		 *
+		 * On the leaf level, we check that the key is <= the highkey.
+		 * However, on non-leaf levels we check that the key is < the highkey,
+		 * because the high key is "just another separator" rather than a copy
+		 * of some existing key item; we expect it to be unique among all keys
+		 * on the same level.  (Suffix truncation will sometimes produce a
+		 * leaf highkey that is an untruncated copy of the lastleft item, but
+		 * never any other item, which necessitates weakening the leaf level
+		 * check to <=.)
+		 *
+		 * Full explanation for why a highkey is never truly a copy of another
+		 * item from the same level on internal levels:
+		 *
+		 * While the new left page's high key is copied from the first offset
+		 * on the right page during an internal page split, that's not the
+		 * full story.  In effect, internal pages are split in the middle of
+		 * the firstright tuple, not between the would-be lastleft and
+		 * firstright tuples: the firstright key ends up on the left side as
+		 * left's new highkey, and the firstright downlink ends up on the
+		 * right side as right's new "negative infinity" item.  The negative
+		 * infinity tuple is truncated to zero attributes, so we're only left
+		 * with the downlink.  In other words, the copying is just an
+		 * implementation detail of splitting in the middle of a (pivot)
+		 * tuple. (See also: "Notes About Data Representation" in the nbtree
+		 * README.)
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
+								  invariant_l_offset(state, skey, P_HIKEY)))
 		{
 			char	   *itid,
 					   *htid;
@@ -956,11 +1049,10 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1017,16 +1109,20 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
-			ScanKey		rightkey;
+			IndexTuple	righttup;
+			BTScanInsert rightkey;
 
 			/* Get item in next/right page */
-			rightkey = bt_right_page_check_scankey(state);
+			righttup = bt_right_page_check_tuple(state);
 
-			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+			/* Set up right item scankey */
+			if (righttup)
+				rightkey = _bt_mkscankey(state->rel, righttup, false);
+
+			if (righttup && !invariant_g_offset(state, rightkey, max))
 			{
 				/*
-				 * As explained at length in bt_right_page_check_scankey(),
+				 * As explained at length in bt_right_page_check_tuple(),
 				 * there is a known !readonly race that could account for
 				 * apparent violation of invariant, which we must check for
 				 * before actually proceeding with raising error.  Our canary
@@ -1069,7 +1165,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, skey, childblock);
 		}
 	}
 
@@ -1083,9 +1179,9 @@ bt_target_page_check(BtreeCheckState *state)
 }
 
 /*
- * Return a scankey for an item on page to right of current target (or the
+ * Return an index tuple for an item on page to right of current target (or the
  * first non-ignorable page), sufficient to check ordering invariant on last
- * item in current target page.  Returned scankey relies on local memory
+ * item in current target page.  Returned tuple relies on local memory
  * allocated for the child page, which caller cannot pfree().  Caller's memory
  * context should be reset between calls here.
  *
@@ -1098,8 +1194,8 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
-bt_right_page_check_scankey(BtreeCheckState *state)
+static IndexTuple
+bt_right_page_check_tuple(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
@@ -1287,11 +1383,10 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	}
 
 	/*
-	 * Return first real item scankey.  Note that this relies on right page
-	 * memory remaining allocated.
+	 * Return first real item.  Note that this relies on right page memory
+	 * remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	return (IndexTuple) PageGetItem(rightpage, rightitem);
 }
 
 /*
@@ -1304,8 +1399,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  * verification this way around is much more practical.
  */
 static void
-bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1354,7 +1449,8 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the
+	 * page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1403,15 +1499,29 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 	{
 		/*
 		 * Skip comparison of target page key against "negative infinity"
-		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * item, if any.  Checking it would indicate that it's not a strict
+		 * lower bound, but that's only because of the hard-coding for
+		 * negative infinity items within _bt_compare().
+		 *
+		 * If nbtree didn't truncate negative infinity tuples during internal
+		 * page splits then we'd expect child's negative infinity key to be
+		 * equal to the scankey/downlink from target/parent (it would be a
+		 * "low key" in this hypothetical scenario, and so it would still need
+		 * to be treated as a special case here).
+		 *
+		 * Negative infinity items can be thought of as a strict lower bound
+		 * that works transitively, with the last non-negative-infinity pivot
+		 * followed during a descent from the root as its "true" strict lower
+		 * bound.  Only a small number of negative infinity items are truly
+		 * negative infinity; those that are the first items of leftmost
+		 * internal pages.  In more general terms, a negative infinity item is
+		 * only negative infinity with respect to the subtree that the page is
+		 * at the root of.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_l_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1751,6 +1861,60 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	/*
+	 * pg_upgrade'd indexes may legally have equal sibling tuples.  Their
+	 * pivot tuples can never have key attributes truncated away.
+	 */
+	if (!key->heapkeyspace)
+		return invariant_leq_offset(state, key, upperbound);
+
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as
+	 * meaning that they're not participating in a search, not as negative
+	 * infinity (only tuples within the index are treated as negative
+	 * infinity).  Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+
+		/* Get heap TID for item to the right */
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup,
+											   P_ISLEAF(topaque));
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && rheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1759,57 +1923,104 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
+invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
-	return cmp >= 0;
+	/*
+	 * pg_upgrade'd indexes may legally have equal sibling tuples.  Their
+	 * pivot tuples can never have key attributes truncated away.
+	 */
+	if (!key->heapkeyspace)
+		return cmp >= 0;
+
+	/*
+	 * No need to consider the possibility that scankey has attributes that we
+	 * need to force to be interpreted as negative infinity.  That could cause
+	 * us to miss the fact that the scankey is less than rather than equal to
+	 * its lower bound, but the index is corrupt either way.
+	 */
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e.  the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							 Page nontarget, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
-	return cmp <= 0;
+	/*
+	 * pg_upgrade'd indexes may legally have equal sibling tuples.  Their
+	 * pivot tuples can never have key attributes truncated away.
+	 */
+	if (!key->heapkeyspace)
+		return cmp <= 0;
+
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as
+	 * meaning that they're not participating in a search, not as negative
+	 * infinity (only tuples within the index are treated as negative
+	 * infinity).  Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+		BTPageOpaque copaque;
+
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+
+		/* Get heap TID for item from child/non-target */
+		childheaptid =
+			BTreeTupleGetHeapTIDCareful(state, child, P_ISLEAF(copaque));
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && childheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -1965,3 +2176,32 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
+ * be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ *
+ * Note that it is incorrect to specify the tuple as a non-pivot when passing a
+ * leaf tuple that came from the high key offset, since that is actually a
+ * pivot tuple.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	BlockNumber targetblock = state->targetblock;
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 184ac62255..bee1f1c9d9 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -560,7 +560,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	 * Get values of extended metadata if available, use default values
 	 * otherwise.
 	 */
-	if (metad->btm_version == BTREE_VERSION)
+	if (metad->btm_version >= BTREE_META_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b..07c2dcd771 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d4..9920dbfd40 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/src/backend/access/common/indextuple.c b/src/backend/access/common/indextuple.c
index aa52a96259..7a18da2e97 100644
--- a/src/backend/access/common/indextuple.c
+++ b/src/backend/access/common/indextuple.c
@@ -475,7 +475,11 @@ index_truncate_tuple(TupleDesc sourceDescriptor, IndexTuple source,
 	bool		isnull[INDEX_MAX_KEYS];
 	IndexTuple	truncated;
 
-	Assert(leavenatts < sourceDescriptor->natts);
+	Assert(leavenatts <= sourceDescriptor->natts);
+
+	/* Easy case: no truncation actually required */
+	if (leavenatts == sourceDescriptor->natts)
+		return CopyIndexTuple(source);
 
 	/* Create temporary descriptor to scribble on */
 	truncdesc = palloc(TupleDescSize(sourceDescriptor));
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89..75cb1d1e22 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -28,37 +28,60 @@ right-link to find the new page containing the key range you're looking
 for.  This might need to be repeated, if the page has been split more than
 once.
 
+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute.  Logical duplicates are sorted in heap item
+pointer order.  We don't use btree keys to re-find a page's downlink in the
+parent during a page split, though: only one entry in the parent level will
+be pointing at the page we just split, so the link fields can be used to
+re-find that downlink via a linear search.  (This is
+actually a legacy of when heap TID was not treated as part of the keyspace,
+but it does no harm to keep things that way.)
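+
+As an illustrative example, consider a single-column index on an int4
+attribute: two entries with the same key value of 7 that point to heap TIDs
+(17,4) and (43,1) are nevertheless distinct keys under this scheme, and
+7/(17,4) sorts before 7/(43,1).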
+
+Lehman and Yao talk about pairs of "separator" keys and downlinks in
+internal pages rather than tuples or records.  We use the term "pivot"
+tuple to distinguish tuples which don't point to heap tuples, that are
+used only for tree navigation.  Pivot tuples include all tuples on
+non-leaf pages and high keys on leaf pages.  Note that pivot tuples are
+only used to represent which part of the key space belongs on each page,
+and can have attribute values copied from non-pivot tuples that were
+deleted and killed by VACUUM some time ago.  A pivot tuple may contain a
+"separator" key and downlink, just a separator key (in practice the
+downlink will be garbage), or just a downlink.  We aren't always clear on
+which case applies, but it should be obvious from context.
+
+Lehman and Yao require that the key range for a subtree S is described by
+Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent page.
+A search where the scan key is equal to a pivot tuple in an upper tree
+level must descend to the left of that pivot to ensure it finds any equal
+keys.  Pivot tuples are always a _strict_ lower bound on items on their
+downlink page; the equal item(s) being searched for must therefore be to
+the left of that downlink page on the next level down.  (It's possible to
+arrange for internal page tuples to be strict lower bounds in all cases
+because their values come from leaf tuples, which are guaranteed unique by
+the use of heap TID as a tiebreaker.  We also make use of hard-coded
+negative infinity values in internal pages.  Rightmost pages don't have a
+high key, though they conceptually have a positive infinity high key).  A
+handy property of this design is that there is never any need to
+distinguish equality where all attributes/keys are used in a scan from
+equality where only some prefix of them is used.
+
+In practice, exact equality with pivot tuples on internal pages is
+extremely rare when all attributes (including even the heap TID attribute)
+are used in a search.  This is due to suffix truncation: truncated
+attributes are treated as having the value negative infinity, and
+truncation almost always manages to at least truncate away the trailing
+heap TID attribute.  While Lehman and Yao don't have anything to say about
+suffix truncation, the design used by nbtree is perfectly complementary.
+See the later section on suffix truncation if it's unclear how the Lehman &
+Yao invariants work in the presence of suffix truncation.
+
 Differences to the Lehman & Yao algorithm
 -----------------------------------------
 
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
-
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
-
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
 among backends.  As a result, we do page-level read locking on btree
@@ -598,33 +621,53 @@ the order of multiple keys for a given column is unspecified.)  An
 insertion scankey uses the same array-of-ScanKey data structure, but the
 sk_func pointers point to btree comparison support functions (ie, 3-way
 comparators that return int4 values interpreted as <0, =0, >0).  In an
-insertion scankey there is exactly one entry per index column.  Insertion
-scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
-used to locate the starting point of a scan, as well as for locating the
-place to insert a new index tuple.  (Note: in the case of an insertion
-scankey built from a search scankey, there might be fewer keys than
-index columns, indicating that we have no constraints for the remaining
-index columns.)  After we have located the starting point of a scan, the
-original search scankey is consulted as each index entry is sequentially
-scanned to decide whether to return the entry and whether the scan can
-stop (see _bt_checkkeys()).
+insertion scankey there is exactly one entry per index column.  There is
+also other data about the rules used to locate where to begin the scan,
+such as whether or not the scan is a "nextkey" scan.  Insertion scankeys
+are built within the btree code (eg, by _bt_mkscankey()) and are used to
+locate the starting point of a scan, as well as for locating the place to
+insert a new index tuple.  (Note: in the case of an insertion scankey built
+from a search scankey, there might be fewer keys than index columns,
+indicating that we have no constraints for the remaining index columns.)
+After we have located the starting point of a scan, the original search
+scankey is consulted as each index entry is sequentially scanned to decide
+whether to return the entry and whether the scan can stop (see
+_bt_checkkeys()).
 
-We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+Notes about suffix truncation
+-----------------------------
+
+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split when the remaining attributes distinguish the
+last index tuple on the post-split left page as belonging on the left page,
+and the first index tuple on the post-split right page as belonging on the
+right page.  A truncated tuple logically retains all key attributes, though
+they implicitly have "negative infinity" as their value, and have no
+storage overhead.  Since the high key is subsequently reused as the
+downlink in the parent page for the new right page, suffix truncation makes
+pivot tuples short.  INCLUDE indexes are guaranteed to have non-key
+attributes truncated at the time of a leaf page split, but may also have
+some key attributes truncated away, based on the usual criteria for key
+attributes.  They are not a special case, since non-key attributes are
+merely payload to B-Tree searches.
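+(As a hypothetical example, in an index declared as (a, b) INCLUDE (c), a
+leaf page split always truncates c away from the new high key, and can
+also truncate b when the value of a alone distinguishes the last tuple on
+the left half from the first tuple on the right half.)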
+
+The goal of suffix truncation of key attributes is to improve index
+fan-out.  The technique was first described by Bayer and Unterauer (R.Bayer
+and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
+2, No. 1, March 1977, pp 11-26).  The Postgres implementation is loosely
+based on their paper.  Note that Postgres only implements what the paper
+refers to as simple prefix B-Trees.  Note also that the paper assumes that
+the tree has keys that consist of single strings that maintain the "prefix
+property", much like strings that are stored in a suffix tree (comparisons
+of earlier bytes must always be more significant than comparisons of later
+bytes, and, in general, the strings must compare in a way that doesn't
+break transitive consistency as they're split into pieces).  Suffix
+truncation in Postgres currently only works at the whole-attribute
+granularity, but it would be straightforward to invent opclass
+infrastructure that manufactures a smaller attribute value in the case of
+variable-length types, such as text.  An opclass support function could
+manufacture the shortest possible key value that still correctly separates
+each half of a leaf page split.
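+
+As a hypothetical example of how this plays out, suppose a leaf page of an
+index on (lastname text, firstname text) is split such that the last tuple
+on the left half is ('Smith', 'Alice') and the first tuple on the right
+half is ('Smith', 'Bob').  Since firstname already distinguishes the two
+halves, the new high key is ('Smith', 'Bob') with the implicit heap TID
+attribute truncated away (it is treated as negative infinity).  Only when
+the enclosing tuples are equal on every user attribute must the new high
+key retain a heap TID attribute, so that it still sorts strictly below
+everything on the right half.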
 
 Notes About Data Representation
 -------------------------------
@@ -637,20 +680,26 @@ don't need to renumber any existing pages when splitting the root.)
 
 The Postgres disk block data format (an array of items) doesn't fit
 Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
-so we have to play some games.
+so we have to play some games.  (Presumably things are explained this
+way because of internal page splits, which conceptually split at the
+middle of an existing pivot tuple -- the tuple's "separator" key goes on
+the left side of the split as the left side's new high key, while the
+tuple's pointer/downlink goes on the right side as the first/minus
+infinity downlink.)
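+As a concrete illustration, splitting an internal page at a pivot tuple
+whose separator key is 7 and whose downlink points to block 12 leaves 7 as
+the left half's new high key, while the downlink to block 12 becomes the
+minus infinity first item of the right half.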
 
 On a page that is not rightmost in its tree level, the "high key" is
 kept in the page's first item, and real data items start at item 2.
 The link portion of the "high key" item goes unused.  A page that is
-rightmost has no "high key", so data items start with the first item.
-Putting the high key at the left, rather than the right, may seem odd,
-but it avoids moving the high key as we add data items.
+rightmost has no "high key" (it's implicitly positive infinity), so
+data items start with the first item.  Putting the high key at the
+left, rather than the right, may seem odd, but it avoids moving the
+high key as we add data items.
 
 On a leaf page, the data items are simply links to (TIDs of) tuples
 in the relation being indexed, with the associated key values.
 
 On a non-leaf page, the data items are down-links to child pages with
-bounding keys.  The key in each data item is the *lower* bound for
+bounding keys.  The key in each data item is a strict lower bound for
 keys on that child page, so logically the key is to the left of that
 downlink.  The high key (if present) is the upper bound for the last
 downlink.  The first data item on each such page has no lower bound
@@ -658,4 +707,5 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 582e5b0652..318cbd3551 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -52,19 +52,19 @@ typedef struct
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
-static TransactionId _bt_check_unique(Relation rel, IndexTuple itup,
-				 Relation heapRel, Buffer buf, OffsetNumber offset,
-				 ScanKey itup_scankey,
-				 IndexUniqueCheck checkUnique, bool *is_unique,
-				 uint32 *speculativeToken);
-static void _bt_findinsertloc(Relation rel,
+static TransactionId _bt_check_unique(Relation rel, BTScanInsert itup_scankey,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
+				 OffsetNumber offset, IndexUniqueCheck checkUnique,
+				 bool *is_unique, uint32 *speculativeToken);
+static OffsetNumber _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_scankey,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool uniqueindex,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel);
+static bool _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -72,7 +72,7 @@ static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   bool split_only_page);
 static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
 		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
-		  IndexTuple newitem, bool newitemonleft);
+		  IndexTuple newitem, bool newitemonleft, bool truncate);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
@@ -84,8 +84,8 @@ static void _bt_checksplitloc(FindSplitData *state,
 				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey);
+static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey,
+			Page page, OffsetNumber offnum);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
@@ -111,18 +111,21 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel)
 {
 	bool		is_unique = false;
-	int			indnkeyatts;
-	ScanKey		itup_scankey;
+	BTScanInsert itup_scankey;
 	BTStack		stack = NULL;
 	Buffer		buf;
-	OffsetNumber offset;
+	Page		page;
+	BTPageOpaque lpageop;
 	bool		fastpath;
 
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	Assert(indnkeyatts != 0);
+	Assert(IndexRelationGetNumberOfKeyAttributes(rel) != 0);
 
 	/* we need an insertion scan key to do our search, so build one */
-	itup_scankey = _bt_mkscankey(rel, itup);
+	itup_scankey = _bt_mkscankey(rel, itup, false);
+top:
+	/*
+	 * Cannot use the real heap TID in the unique case.  Use a sentinel
+	 * value that sorts lower than any real heap TID instead, so that the
+	 * descent lands on the first page the value could be on.  The real
+	 * heap TID is restored as scantid later, once uniqueness has been
+	 * established.
+	 */
+	if (itup_scankey->heapkeyspace && checkUnique != UNIQUE_CHECK_NO)
+		itup_scankey->scantid = _bt_lowest_scantid();
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -143,14 +146,10 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	 * other backend might be concurrently inserting into the page, thus
 	 * reducing our chances to finding an insertion place in this page.
 	 */
-top:
 	fastpath = false;
-	offset = InvalidOffsetNumber;
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
 		Size		itemsz;
-		Page		page;
-		BTPageOpaque lpageop;
 
 		/*
 		 * Conditionally acquire exclusive lock on the buffer before doing any
@@ -174,14 +173,17 @@ top:
 			/*
 			 * Check if the page is still the rightmost leaf page, has enough
 			 * free space to accommodate the new tuple, and the insertion scan
-			 * key is strictly greater than the first key on the page.
+			 * key is strictly greater than the first key on the page.  Note
+			 * that the sentinel low scantid temporarily set for the unique
+			 * case can prevent the optimization from being applied even when
+			 * it would be safe, though only when the page is full of
+			 * duplicates.
 			 */
 			if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, itup_scankey, page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -220,8 +222,7 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, itup_scankey, &buf, BT_WRITE, NULL);
 	}
 
 	/*
@@ -231,12 +232,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tie-breaker attribute.  Any other would-be inserter
+	 * of the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -249,9 +251,24 @@ top:
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
+		OffsetNumber offset;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
-		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
+		page = BufferGetPage(buf);
+		lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+		/*
+		 * Arrange for the later _bt_findinsertloc call to _bt_binsrch to
+		 * avoid repeating the work done during this initial _bt_binsrch call.
+		 * Clear the _bt_lowest_scantid-supplied scantid value first, though,
+		 * so that the itup_scankey-cached low and high bounds will enclose a
+		 * range of offsets in the event of multiple duplicates. (Our
+		 * _bt_binsrch call cannot be allowed to incorrectly enclose a single
+		 * offset: the offset of the first duplicate among many on the page.)
+		 */
+		itup_scankey->scantid = NULL;
+		itup_scankey->savebinsrch = true;
+		offset = _bt_binsrch(rel, itup_scankey, buf);
+		xwait = _bt_check_unique(rel, itup_scankey, itup, heapRel, buf, offset,
 								 checkUnique, &is_unique, &speculativeToken);
 
 		if (TransactionIdIsValid(xwait))
@@ -274,10 +291,16 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/* Uniqueness is established -- restore heap tid as scantid */
+		if (itup_scankey->heapkeyspace)
+			itup_scankey->scantid = &itup->t_tid;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		OffsetNumber insertoff;
+
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -288,10 +311,11 @@ top:
 		 * attributes are not considered part of the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
-		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+		/* do the insertion, possibly on a page to the right in unique case */
+		insertoff = _bt_findinsertloc(rel, itup_scankey, &buf,
+									  checkUnique != UNIQUE_CHECK_NO, itup,
+									  stack, heapRel);
+		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, insertoff, false);
 	}
 	else
 	{
@@ -302,7 +326,7 @@ top:
 	/* be tidy */
 	if (stack)
 		_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
+	pfree(itup_scankey);
 
 	return is_unique;
 }
@@ -327,13 +351,12 @@ top:
  * core code must redo the uniqueness check later.
  */
 static TransactionId
-_bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
-				 Buffer buf, OffsetNumber offset, ScanKey itup_scankey,
-				 IndexUniqueCheck checkUnique, bool *is_unique,
-				 uint32 *speculativeToken)
+_bt_check_unique(Relation rel, BTScanInsert itup_scankey,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
+				 OffsetNumber offset, IndexUniqueCheck checkUnique,
+				 bool *is_unique, uint32 *speculativeToken)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	SnapshotData SnapshotDirty;
 	OffsetNumber maxoff;
 	Page		page;
@@ -344,6 +367,10 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
 
+	/* Fast path for case where there are clearly no duplicates */
+	if (itup_scankey->low >= itup_scankey->high)
+		return InvalidTransactionId;
+
 	InitDirtySnapshot(SnapshotDirty);
 
 	page = BufferGetPage(buf);
@@ -392,7 +419,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 				 * in real comparison, but only for ordering/finding items on
 				 * pages. - vadim 03/24/97
 				 */
-				if (!_bt_isequal(itupdesc, page, offset, indnkeyatts, itup_scankey))
+				if (!_bt_isequal(itupdesc, itup_scankey, page, offset))
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
@@ -553,11 +580,23 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
-			/* If scankey == hikey we gotta check the next page too */
+			/*
+			 * If scankey <= hikey (leaving out the heap TID attribute), we
+			 * gotta check the next page too.
+			 *
+			 * We cannot get away with giving up without going to the next
+			 * page when true key values are all == hikey, because heap TID is
+			 * ignored when considering duplicates (caller is sure to not
+			 * provide a scantid in scankey).  We could get away with this in
+			 * a hypothetical world where unique indexes certainly never
+			 * contain physical duplicates, since heap TID would never be
+			 * treated as part of the keyspace --- not here, and not at any
+			 * other point.
+			 */
+			Assert(itup_scankey->scantid == NULL);
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			if (_bt_compare(rel, itup_scankey, page, P_HIKEY) > 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -599,52 +638,53 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 
 
 /*
- *	_bt_findinsertloc() -- Finds an insert location for a tuple
+ *	_bt_findinsertloc() -- Finds an insert location for a new tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
+ *		On entry, *bufptr contains the page that the new tuple unambiguously
+ *		belongs on.  This may not be quite right for callers that just called
+ *		_bt_check_unique(), though, since they won't have initially searched
+ *		using a scantid.  They'll have to insert into a page somewhere to the
+ *		right in rare cases where there are many physical duplicates in a
+ *		unique index, and their scantid directs us to some page full of
+ *		duplicates to the right, where the new tuple must go.  (Actually,
+ *		since !heapkeyspace pg_upgrade'd non-unique indexes never get a
+ *		scantid, they too may require that we move right.  We treat them
+ *		somewhat like unique indexes.)
  *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
+ *		_bt_check_unique() callers arrange for their insertion scan key to
+ *		save the progress of the last binary search performed.  No additional
+ *		binary search comparisons occur in the common case where there was no
+ *		existing duplicate tuple, though we may occasionally still be unable
+ *		to reuse their work (e.g. when the page was micro-vacuumed, or we
+ *		had to move right).  Even when there are garbage duplicates, very
+ *		few binary search comparisons will be performed without being
+ *		strictly necessary.
  *
- *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
- *		page returned to the caller is exclusively locked instead.
+ *		The caller should hold an exclusive lock on *bufptr in all cases.  On
+ *		exit, *bufptr points to the page chosen for the insert.  If we have to
+ *		move right, the lock and pin on the original page will be released,
+ *		and the new page returned to the caller is exclusively locked instead.
+ *		In any case, we return the offset that the caller should use to insert
+ *		into the buffer pointed to by *bufptr on return.
  *
- *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.  It has to happen here, since it may invalidate a
+ *		_bt_check_unique() caller's cached binary search work.
  */
-static void
+static OffsetNumber
 _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_scankey,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool uniqueindex,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel)
 {
 	Buffer		buf = *bufptr;
 	Page		page = BufferGetPage(buf);
+	bool		restorebinsrch = uniqueindex;
 	Size		itemsz;
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
 	OffsetNumber newitemoff;
-	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -652,77 +692,65 @@ _bt_findinsertloc(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itemsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
-	 */
+	/* Check 1/3 of a page restriction */
 	if (itemsz > BTMaxItemSize(page))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itemsz, BTMaxItemSize(page),
-						RelationGetRelationName(rel)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(heapRel,
-									RelationGetRelationName(rel))));
+		_bt_check_third_page(rel, heapRel, page, newtup);
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
+	/*
+	 * We may have to walk right through leaf pages to find the one leaf page
+	 * that we must insert onto, though only when inserting into unique
+	 * indexes.  This is necessary because a scantid is not used by the
+	 * insertion scan key initially in the case of unique indexes -- a scantid
+	 * is only set after the absence of duplicates (whose heap tuples are not
+	 * dead or recently dead) has been established by _bt_check_unique().
+	 * Non-unique index insertions will break out of the loop immediately.
+	 *
+	 * (Actually, non-unique indexes may still need to grovel through leaf
+	 * pages full of duplicates with a pg_upgrade'd !heapkeyspace index.)
 	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	Assert(P_ISLEAF(lpageop));
+	Assert(!itup_scankey->heapkeyspace || itup_scankey->scantid != NULL);
+	Assert(itup_scankey->heapkeyspace || itup_scankey->scantid == NULL);
+	for (;;)
 	{
 		Buffer		rbuf;
 		BlockNumber rblkno;
+		int			cmpval;
 
 		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
+		 * No need to check high key when inserting into a non-unique index --
+		 * _bt_search() already checked this when it checked if a move to the
+		 * right was required.  Insertion scankey's scantid would have been
+		 * filled out at the time.
 		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
+		if (itup_scankey->heapkeyspace && !uniqueindex)
 		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
-
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
+			Assert(P_RIGHTMOST(lpageop) ||
+				   _bt_compare(rel, itup_scankey, page, P_HIKEY) <= 0);
+			break;
 		}
 
-		/*
-		 * nope, so check conditions (b) and (c) enumerated above
-		 */
-		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+		if (P_RIGHTMOST(lpageop))
 			break;
+		cmpval = _bt_compare(rel, itup_scankey, page, P_HIKEY);
+		if (itup_scankey->heapkeyspace)
+		{
+			if (cmpval <= 0)
+				break;
+		}
+		else
+		{
+			/*
+			 * pg_upgrade'd !heapkeyspace index.
+			 *
+			 * May have to handle legacy case where there is a choice of which
+			 * page to place new tuple on, and we must balance space
+			 * utilization as best we can.
+			 */
+			if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
+													&restorebinsrch, itemsz))
+				break;
+		}
 
 		/*
 		 * step right to next non-dead page
@@ -731,6 +759,8 @@ _bt_findinsertloc(Relation rel,
 		 * page; else someone else's _bt_check_unique scan could fail to see
 		 * our insertion.  write locks on intermediate dead pages won't do
 		 * because we don't know when they will get de-linked from the tree.
+		 * (This is more aggressive than it needs to be for non-unique
+		 * !heapkeyspace indexes.)
 		 */
 		rbuf = InvalidBuffer;
 
@@ -764,27 +794,98 @@ _bt_findinsertloc(Relation rel,
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		restorebinsrch = false;
 	}
 
 	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
+	 * Micro-vacuum the page we're about to insert the tuple on if
+	 * it looks like it has LP_DEAD items.  Only micro-vacuum when it might
+	 * forestall a page split, though.
 	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
-	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < itemsz)
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		restorebinsrch = false;
+	}
+
+	/* _bt_check_unique() callers often avoid binary search effort */
+	itup_scankey->restorebinsrch = restorebinsrch;
+	newitemoff = _bt_binsrch(rel, itup_scankey, buf);
+	Assert(!itup_scankey->restorebinsrch);
+	/* XXX: may use too many cycles to be a simple assertion */
+	Assert(!restorebinsrch ||
+		   newitemoff == _bt_binsrch(rel, itup_scankey, buf));
 
 	*bufptr = buf;
-	*offsetptr = newitemoff;
+	return newitemoff;
+}
+
+/*
+ *	_bt_useduplicatepage() -- Settle for this page of duplicates?
+ *
+ *		Prior to PostgreSQL 12/Btree version 4, heap TID was never treated
+ *		as a part of the keyspace.  If there were many tuples of the same
+ *		value spanning more than one page, a new tuple of that same value
+ *		could legally be placed on any one of the pages.  This function
+ *		decides whether a duplicate being inserted into a pg_upgrade'd
+ *		!heapkeyspace index should go on the page contained in buf when
+ *		that's a legal choice.
+ *
+ *		Returns true if caller should proceed with insert on buf's page.
+ *		Otherwise, caller should move to page to the right.  Caller calls
+ *		here again if that next page isn't where the duplicates end, and
+ *		another choice must be made.
+ */
+static bool
+_bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz)
+{
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque lpageop;
+
+	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	Assert(P_ISLEAF(lpageop));
+
+	/* Easy case -- there is space free on this page already */
+	if (PageGetFreeSpace(page) >= itemsz)
+		return true;
+
+	/*
+	 * Perform micro-vacuuming of the page if it looks like it has LP_DEAD
+	 * items
+	 */
+	if (P_HAS_GARBAGE(lpageop))
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		*restorebinsrch = false;
+		if (PageGetFreeSpace(page) >= itemsz)
+			return true;		/* OK, now we have enough space */
+	}
+
+	/*----------
+	 * If _bt_findinsertloc() would need to split the page to put the item
+	 * on this page, check whether we can put the tuple somewhere to the
+	 * right instead.
+	 *
+	 * _bt_findinsertloc() keeps scanning right until it
+	 *		(a) reaches the last page where the tuple can legally go,
+	 * or until we (here)
+	 *		(b) find a page with enough free space, or
+	 *		(c) get tired of searching.
+	 * (c) is not flippant; it is important because if there are many
+	 * pages' worth of equal keys, it's better to split one of the early
+	 * pages than to scan all the way to the end of the run of equal keys
+	 * on every insert.  We implement "get tired" as a random choice,
+	 * since stopping after scanning a fixed number of pages wouldn't work
+	 * well (we'd never reach the right-hand side of previously split
+	 * pages).  The probability of moving right is set at 0.99, which may
+	 * seem too high to change the behavior much, but it does an excellent
+	 * job of preventing O(N^2) behavior with many equal keys.
+	 *----------
+	 */
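+	/* Settle for this page with probability ~0.01; otherwise move right */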
+	return random() <= (MAX_RANDOM_VALUE / 100);
 }
 
 /*----------
@@ -833,6 +934,8 @@ _bt_insertonpg(Relation rel,
 	BTPageOpaque lpageop;
 	OffsetNumber firstright = InvalidOffsetNumber;
 	Size		itemsz;
+	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
+	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -840,12 +943,9 @@ _bt_insertonpg(Relation rel,
 	/* child buffer must be given iff inserting on an internal page */
 	Assert(P_ISLEAF(lpageop) == !BufferIsValid(cbuf));
 	/* tuple must have appropriate number of attributes */
-	Assert(!P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
-	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(BTreeTupleGetNAtts(itup, rel) > 0);
+	Assert(!P_ISLEAF(lpageop) || BTreeTupleGetNAtts(itup, rel) == indnatts);
+	Assert(P_ISLEAF(lpageop) || BTreeTupleGetNAtts(itup, rel) <= indnkeyatts);
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -867,6 +967,7 @@ _bt_insertonpg(Relation rel,
 	{
 		bool		is_root = P_ISROOT(lpageop);
 		bool		is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(lpageop);
+		bool		truncate;
 		bool		newitemonleft;
 		Buffer		rbuf;
 
@@ -893,9 +994,16 @@ _bt_insertonpg(Relation rel,
 									  newitemoff, itemsz,
 									  &newitemonleft);
 
+		/*
+		 * Perform truncation of the new high key for the left half of the
+		 * split when splitting a leaf page.  Don't do so with version 3
+		 * indexes unless the index has non-key attributes.
+		 */
+		truncate = P_ISLEAF(lpageop) &&
+			(_bt_heapkeyspace(rel) || indnatts != indnkeyatts);
 		/* split the buffer into left and right halves */
 		rbuf = _bt_split(rel, buf, cbuf, firstright,
-						 newitemoff, itemsz, itup, newitemonleft);
+						 newitemoff, itemsz, itup, newitemonleft, truncate);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -977,7 +1085,7 @@ _bt_insertonpg(Relation rel,
 		if (BufferIsValid(metabuf))
 		{
 			/* upgrade meta-page if needed */
-			if (metad->btm_version < BTREE_VERSION)
+			if (metad->btm_version < BTREE_META_VERSION)
 				_bt_upgrademetapage(metapg);
 			metad->btm_fastroot = itup_blkno;
 			metad->btm_fastlevel = lpageop->btpo.level;
@@ -1032,6 +1140,8 @@ _bt_insertonpg(Relation rel,
 
 			if (BufferIsValid(metabuf))
 			{
+				Assert(metad->btm_version >= BTREE_META_VERSION);
+				xlmeta.version = metad->btm_version;
 				xlmeta.root = metad->btm_root;
 				xlmeta.level = metad->btm_level;
 				xlmeta.fastroot = metad->btm_fastroot;
@@ -1097,7 +1207,10 @@ _bt_insertonpg(Relation rel,
  *		On entry, buf is the page to split, and is pinned and write-locked.
  *		firstright is the item index of the first item to be moved to the
  *		new right page.  newitemoff etc. tell us about the new item that
- *		must be inserted along with the data from the old page.
+ *		must be inserted along with the data from the old page.  truncate
+ *		tells us if the new high key should undergo suffix truncation.
+ *		(Version 4 pivot tuples always have an explicit representation of
+ *		the number of non-truncated attributes that remain.)
  *
  *		When splitting a non-leaf page, 'cbuf' is the left-sibling of the
  *		page we're inserting the downlink for.  This function will clear the
@@ -1109,7 +1222,7 @@ _bt_insertonpg(Relation rel,
 static Buffer
 _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
-		  bool newitemonleft)
+		  bool newitemonleft, bool truncate)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1132,8 +1245,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	OffsetNumber i;
 	bool		isleaf;
 	IndexTuple	lefthikey;
-	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/* Acquire a new page to split into */
 	rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1203,7 +1314,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <=
+			   IndexRelationGetNumberOfKeyAttributes(rel));
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1217,8 +1330,14 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 
 	/*
 	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
-	 * item at position firstright, or the incoming tuple.
+	 * to go into the new right page, or possibly a truncated version if this
+	 * is a leaf page split.  This might be either the existing data item at
+	 * position firstright, or the incoming tuple.
+	 *
+	 * Lehman and Yao use the last left item as the new high key for the left
+	 * page.  Despite appearances, the new high key is generated in a way
+	 * that's consistent with their approach.  See comments above
+	 * _bt_findsplitloc for an explanation.
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1236,25 +1355,60 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
-	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * Truncate nondistinguishing key attributes of the high key item before
+	 * inserting it on the left page.  This can only happen at the leaf level,
+	 * since in general all pivot tuple values originate from leaf level high
+	 * keys.  This isn't just about avoiding unnecessary work, though;
+	 * truncating unneeded key suffix attributes can only be performed at the
+	 * leaf level anyway.  This is because a pivot tuple in a grandparent page
+	 * must guide a search not only to the correct parent page, but also to
+	 * the correct leaf page.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (truncate)
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		IndexTuple	lastleft;
+
+		/*
+		 * Determine which tuple will become the last on the left page.  The
+		 * last left tuple and the first right tuple enclose the split point,
+		 * and are needed to determine how far truncation can go while still
+		 * leaving us with a high key that distinguishes the left side from
+		 * the right side.
+		 */
+		Assert(isleaf);
+		if (newitemonleft && newitemoff == firstright)
+		{
+			/* incoming tuple will become last on left page */
+			lastleft = newitem;
+		}
+		else
+		{
+			OffsetNumber lastleftoff;
+
+			/* item just before firstright will become last on left page */
+			lastleftoff = OffsetNumberPrev(firstright);
+			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		/*
+		 * Truncate first item on the right side to create a new high key for
+		 * the left side.  The high key must be strictly less than all tuples
+		 * on the right side of the split, but can be equal to the last item
+		 * on the left side of the split.
+		 */
+		Assert(lastleft != item);
+		lefthikey = _bt_truncate(rel, lastleft, item, false);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <=
+		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1447,7 +1601,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1476,22 +1629,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log left page.  We must also log the left page's high key. */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1509,9 +1650,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -1564,6 +1703,24 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * righthand page, plus a boolean indicating whether the new tuple goes on
  * the left or right page.  The bool is necessary to disambiguate the case
  * where firstright == newitemoff.
+ *
+ * The high key for the left page is formed using the first item on the
+ * right page, which may seem to be contrary to Lehman & Yao's approach of
+ * using the left page's last item as its new high key.  It isn't, though;
+ * suffix truncation will leave the left page's high key equal to the last
+ * item on the left page when two tuples with equal key values enclose the
+ * split point.  It's convenient to always express a split point as a
+ * firstright offset due to internal page splits, which leave us with a
+ * right half whose first item becomes a negative infinity item through
+ * truncation to 0 attributes.  In effect, internal page splits store
+ * firstright's "separator" key at the end of the left page (as left's new
+ * high key), and store its downlink at the start of the right page.  In
+ * other words, internal page splits conceptually split in the middle of the
+ * firstright tuple, not on either side of it.  Crucially, when splitting
+ * either a leaf page or an internal page, the new high key will be strictly
+ * less than the first item on the right page in all cases, despite the fact
+ * that we start with the assumption that firstright becomes the new high
+ * key.
  */
 static OffsetNumber
 _bt_findsplitloc(Relation rel,
@@ -1874,7 +2031,7 @@ _bt_insert_parent(Relation rel,
 			_bt_relbuf(rel, pbuf);
 		}
 
-		/* get high key from left page == lower bound for new right page */
+		/* get high key from left, a strict lower bound for new right page */
 		ritem = (IndexTuple) PageGetItem(page,
 										 PageGetItemId(page, P_HIKEY));
 
@@ -2164,7 +2321,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	START_CRIT_SECTION();
 
 	/* upgrade metapage if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_META_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* set btree special data */
@@ -2199,7 +2356,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2232,6 +2390,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
 		XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_META_VERSION);
+		md.version = metad->btm_version;
 		md.root = rootblknum;
 		md.level = metad->btm_level;
 		md.fastroot = rootblknum;
@@ -2296,6 +2456,7 @@ _bt_pgaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -2311,28 +2472,25 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
-_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey)
+_bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey, Page page,
+			OffsetNumber offnum)
 {
 	IndexTuple	itup;
+	ScanKey		scankey;
 	int			i;
 
-	/* Better be comparing to a leaf item */
+	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
 
+	scankey = itup_scankey->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
-	for (i = 1; i <= keysz; i++)
+	for (i = 1; i <= itup_scankey->keysz; i++)
 	{
 		AttrNumber	attno;
 		Datum		datum;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 4082103fe2..171427c94f 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -34,6 +34,7 @@
 #include "utils/snapmgr.h"
 
 static void _bt_cachemetadata(Relation rel, BTMetaPageData *metad);
+static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 						 bool *rightsib_empty);
@@ -77,7 +78,9 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 }
 
 /*
- *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to the new.
+ *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to version
+ *	3, the last version that can be updated without broadly affecting on-disk
+ *	compatibility.  (A REINDEX is required to upgrade to version 4.)
  *
  *		This routine does purely in-memory image upgrade.  Caller is
  *		responsible for locking, WAL-logging etc.
@@ -93,11 +96,11 @@ _bt_upgrademetapage(Page page)
 
 	/* It must be really a meta page of upgradable version */
 	Assert(metaopaque->btpo_flags & BTP_META);
-	Assert(metad->btm_version < BTREE_VERSION);
+	Assert(metad->btm_version < BTREE_META_VERSION);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 
 	/* Set version number and fill extra fields added into version 3 */
-	metad->btm_version = BTREE_VERSION;
+	metad->btm_version = BTREE_META_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 
@@ -107,43 +110,79 @@ _bt_upgrademetapage(Page page)
 }
 
 /*
- * Cache metadata from meta page to rel->rd_amcache.
+ * Cache metadata from input meta page to rel->rd_amcache.
  */
 static void
-_bt_cachemetadata(Relation rel, BTMetaPageData *metad)
+_bt_cachemetadata(Relation rel, BTMetaPageData *input)
 {
+	BTMetaPageData *cached_metad;
+
 	/* We assume rel->rd_amcache was already freed by caller */
 	Assert(rel->rd_amcache == NULL);
 	rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
 										 sizeof(BTMetaPageData));
 
-	/*
-	 * Meta page should be of supported version (should be already checked by
-	 * caller).
-	 */
-	Assert(metad->btm_version >= BTREE_MIN_VERSION &&
-		   metad->btm_version <= BTREE_VERSION);
+	/* Meta page should be of supported version */
+	Assert(input->btm_version >= BTREE_MIN_VERSION &&
+		   input->btm_version <= BTREE_VERSION);
 
-	if (metad->btm_version == BTREE_VERSION)
+	cached_metad = (BTMetaPageData *) rel->rd_amcache;
+	if (input->btm_version >= BTREE_META_VERSION)
 	{
-		/* Last version of meta-data, no need to upgrade */
-		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		/* Version with compatible meta-data, no need to upgrade */
+		memcpy(cached_metad, input, sizeof(BTMetaPageData));
 	}
 	else
 	{
-		BTMetaPageData *cached_metad = (BTMetaPageData *) rel->rd_amcache;
-
 		/*
 		 * Upgrade meta-data: copy available information from meta-page and
 		 * fill new fields with default values.
+		 *
+		 * Note that we cannot upgrade to BTREE_VERSION/version 4 without a
+		 * REINDEX, since extensive on-disk changes are required.
 		 */
-		memcpy(rel->rd_amcache, metad, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
-		cached_metad->btm_version = BTREE_VERSION;
+		memcpy(cached_metad, input, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
+		cached_metad->btm_version = BTREE_META_VERSION;
 		cached_metad->btm_oldest_btpo_xact = InvalidTransactionId;
 		cached_metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	}
 }
 
+/*
+ * Get metadata from share-locked buffer containing metapage, while performing
+ * standard sanity checks.  Sanity checks here must match _bt_getroot().
+ */
+static BTMetaPageData *
+_bt_getmeta(Relation rel, Buffer metabuf)
+{
+	Page		metapg;
+	BTPageOpaque metaopaque;
+	BTMetaPageData *metad;
+
+	metapg = BufferGetPage(metabuf);
+	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
+	metad = BTPageGetMeta(metapg);
+
+	/* sanity-check the metapage */
+	if (!P_ISMETA(metaopaque) ||
+		metad->btm_magic != BTREE_MAGIC)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("index \"%s\" is not a btree",
+						RelationGetRelationName(rel))));
+
+	if (metad->btm_version < BTREE_MIN_VERSION ||
+		metad->btm_version > BTREE_VERSION)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("version mismatch in index \"%s\": file version %d, "
+						"current version %d, minimal supported version %d",
+						RelationGetRelationName(rel),
+						metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+
+	return metad;
+}
+
 /*
  *	_bt_update_meta_cleanup_info() -- Update cleanup-related information in
  *									  the metapage.
@@ -186,7 +225,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_META_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* update cleanup-related information */
@@ -202,6 +241,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_META_VERSION);
+		md.version = metad->btm_version;
 		md.root = metad->btm_root;
 		md.level = metad->btm_level;
 		md.fastroot = metad->btm_fastroot;
@@ -376,7 +417,7 @@ _bt_getroot(Relation rel, int access)
 		START_CRIT_SECTION();
 
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_META_VERSION)
 			_bt_upgrademetapage(metapg);
 
 		metad->btm_root = rootblkno;
@@ -400,6 +441,8 @@ _bt_getroot(Relation rel, int access)
 			XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
 			XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_META_VERSION);
+			md.version = metad->btm_version;
 			md.root = rootblkno;
 			md.level = 0;
 			md.fastroot = rootblkno;
@@ -595,37 +638,12 @@ _bt_getrootheight(Relation rel)
 {
 	BTMetaPageData *metad;
 
-	/*
-	 * We can get what we need from the cached metapage data.  If it's not
-	 * cached yet, load it.  Sanity checks here must match _bt_getroot().
-	 */
 	if (rel->rd_amcache == NULL)
 	{
 		Buffer		metabuf;
-		Page		metapg;
-		BTPageOpaque metaopaque;
 
 		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
-		metapg = BufferGetPage(metabuf);
-		metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-		metad = BTPageGetMeta(metapg);
-
-		/* sanity-check the metapage */
-		if (!P_ISMETA(metaopaque) ||
-			metad->btm_magic != BTREE_MAGIC)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("index \"%s\" is not a btree",
-							RelationGetRelationName(rel))));
-
-		if (metad->btm_version < BTREE_MIN_VERSION ||
-			metad->btm_version > BTREE_VERSION)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("version mismatch in index \"%s\": file version %d, "
-							"current version %d, minimal supported version %d",
-							RelationGetRelationName(rel),
-							metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+		metad = _bt_getmeta(rel, metabuf);
 
 		/*
 		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
@@ -642,19 +660,70 @@ _bt_getrootheight(Relation rel)
 		 * Cache the metapage data for next time
 		 */
 		_bt_cachemetadata(rel, metad);
-
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_META_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
 		_bt_relbuf(rel, metabuf);
 	}
 
+	/* Get cached page */
 	metad = (BTMetaPageData *) rel->rd_amcache;
-	/* We shouldn't have cached it if any of these fail */
-	Assert(metad->btm_magic == BTREE_MAGIC);
-	Assert(metad->btm_version == BTREE_VERSION);
-	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
+/*
+ *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *
+ *		This is used to determine the rules that must be used to descend a
+ *		btree.  Version 4 indexes treat heap TID as a tie-breaker attribute.
+ *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
+ *		performance when inserting a new BTScanInsert-wise duplicate tuple
+ *		among many leaf pages already full of such duplicates.
+ */
+bool
+_bt_heapkeyspace(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			uint32		btm_version = metad->btm_version;
+
+			_bt_relbuf(rel, metabuf);
+			return btm_version > BTREE_META_VERSION;
+		}
+
+		/*
+		 * Cache the metapage data for next time
+		 */
+		_bt_cachemetadata(rel, metad);
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_META_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached page */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+
+	return metad->btm_version > BTREE_META_VERSION;
+}
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -1370,7 +1439,7 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 */
 			if (!stack)
 			{
-				ScanKey		itup_scankey;
+				BTScanInsert itup_scankey;
 				ItemId		itemid;
 				IndexTuple	targetkey;
 				Buffer		lbuf;
@@ -1420,12 +1489,20 @@ _bt_pagedel(Relation rel, Buffer buf)
 				}
 
 				/* we need an insertion scan key for the search, so build one */
-				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
-				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
-				/* don't need a pin on the page */
+				itup_scankey = _bt_mkscankey(rel, targetkey, false);
+				/* get stack to leaf page by searching index */
+				stack = _bt_search(rel, itup_scankey, &lbuf, BT_READ, NULL);
+
+				/*
+				 * Search will reliably relocate the same leaf page.
+				 *
+				 * (However, prior to version 4 the search is for the leftmost
+				 * leaf page containing this key, which is okay because we
+				 * will tiebreak on downlink block number.)
+				 */
+				Assert(!itup_scankey->heapkeyspace ||
+					   BufferGetBlockNumber(buf) == BufferGetBlockNumber(lbuf));
+				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
 				/*
@@ -1970,7 +2047,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	if (BufferIsValid(metabuf))
 	{
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_META_VERSION)
 			_bt_upgrademetapage(metapg);
 		metad->btm_fastroot = rightsib;
 		metad->btm_fastlevel = targetlevel;
@@ -2018,6 +2095,8 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 		{
 			XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_META_VERSION);
+			xlmeta.version = metad->btm_version;
 			xlmeta.root = metad->btm_root;
 			xlmeta.level = metad->btm_level;
 			xlmeta.fastroot = metad->btm_fastroot;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e8725fbbe1..9cf760ffa0 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -794,7 +794,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_META_VERSION)
 	{
 		/*
 		 * Do cleanup if metapage needs upgrade, because we don't have
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 16223d01ec..aceadf44ed 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,6 +25,10 @@
 #include "utils/tqual.h"
 
 
+static inline int32 _bt_nonpivot_compare(Relation rel,
+					 BTScanInsert key,
+					 Page page,
+					 OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 			 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -38,6 +42,7 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
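+/* Backing storage for the sentinel heap TID returned by _bt_lowest_scantid() */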
+static ItemPointerData lowest;
 
 /*
  *	_bt_drop_lock_and_maybe_pin()
@@ -72,12 +77,9 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.  If the key was
+ * built from a leaf page's high key, the search will relocate that leaf page.
  *
  * Return value is a stack of parent-page pointers.  *bufP is set to the
  * address of the leaf-page buffer, which is read-locked and pinned.
@@ -94,8 +96,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+		   Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -131,7 +133,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+		*bufP = _bt_moveright(rel, key, *bufP,
 							  (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
@@ -145,7 +147,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -158,8 +160,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link to disambiguate duplicate keys in the index, which is
+		 * required when dealing with pg_upgrade'd !heapkeyspace indexes.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -199,8 +201,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  true, stack_in, BT_WRITE, snapshot);
+		*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+							  snapshot);
 	}
 
 	return stack_in;
@@ -216,16 +218,16 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * or strictly to the right of it.
  *
  * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page.  If that entry
- * is strictly less than the scankey, or <= the scankey in the nextkey=true
+ * tree by examining the high key entry on the page.  If that entry is
+ * strictly less than the scankey, or <= the scankey in the key.nextkey=true
  * case, then we followed the wrong link and we need to move right.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index (see nbtree/README).
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key.  When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
  *
  * If forupdate is true, we will attempt to finish any incomplete splits
  * that we encounter.  This is required when locking a target page for an
@@ -242,10 +244,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  */
 Buffer
 _bt_moveright(Relation rel,
+			  BTScanInsert key,
 			  Buffer buf,
-			  int keysz,
-			  ScanKey scankey,
-			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
 			  int access,
@@ -258,11 +258,15 @@ _bt_moveright(Relation rel,
 	/*
 	 * When nextkey = false (normal case): if the scan key that brought us to
 	 * this page is > the high key stored on the page, then the page has split
-	 * and we need to move right.  (If the scan key is equal to the high key,
-	 * we might or might not need to move right; have to scan the page first
-	 * anyway.)
+	 * and we need to move right.  (pg_upgrade'd !heapkeyspace indexes could
+	 * have some duplicates to the right as well as the left, but that's
+	 * something that's only ever dealt with on the leaf level, after
+	 * _bt_search has found an initial leaf page.  Duplicate pivots on
+	 * internal pages are useless to all index scans, which was a flaw in the
+	 * old design.)
 	 *
 	 * When nextkey = true: move right if the scan key is >= page's high key.
+	 * (Note that key.scantid cannot be set in this case.)
 	 *
 	 * The page could even have split more than once, so scan as far as
 	 * needed.
@@ -270,7 +274,7 @@ _bt_moveright(Relation rel,
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
 	 */
-	cmpval = nextkey ? 0 : 1;
+	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
@@ -305,7 +309,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -325,13 +329,6 @@ _bt_moveright(Relation rel,
 /*
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
  * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
  * particular, this means it is possible to return a value 1 greater than the
@@ -347,37 +344,75 @@ _bt_moveright(Relation rel,
  *
  * This procedure is not responsible for walking right, it just examines
  * the given page.  _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
+ * on the buffer.  When itup_scankey.savebinsrch is set, the bounds of the
+ * binary search are saved in the insertion scan key's mutable fields, so
+ * that a subsequent call that sets itup_scankey.restorebinsrch can reuse
+ * the low and high bounds of the original binary search.  This lets the
+ * second binary search performed on the first leaf page landed on by
+ * unique-enforcing inserters avoid doing any real comparisons in most
+ * cases.  See _bt_findinsertloc() for further details.
  */
 OffsetNumber
 _bt_binsrch(Relation rel,
-			Buffer buf,
-			int keysz,
-			ScanKey scankey,
-			bool nextkey)
+			BTScanInsert key,
+			Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber low,
-				high;
+				high,
+				savehigh;
 	int32		result,
 				cmpval;
+	bool		isleaf;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	isleaf = P_ISLEAF(opaque);
 
-	low = P_FIRSTDATAKEY(opaque);
-	high = PageGetMaxOffsetNumber(page);
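+	/*
+	 * Callers can save the bounds of this binary search, or restore bounds
+	 * saved by an earlier call, but never both at once.  Bounds are only
+	 * saved or restored on leaf pages; they are saved before the caller has
+	 * a scantid, and restored (on heapkeyspace indexes) only once a scantid
+	 * has been supplied.
+	 */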
+	Assert(!(key->restorebinsrch && key->savebinsrch));
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+	/* Restore binary search state when scantid is available */
+	Assert(!key->savebinsrch || key->scantid == NULL);
+	Assert(!key->heapkeyspace || !key->restorebinsrch || key->scantid != NULL);
+	Assert(P_ISLEAF(opaque) || (!key->restorebinsrch && !key->savebinsrch));
 
-	/*
-	 * If there are no keys on the page, return the first available slot. Note
-	 * this covers two cases: the page is really empty (no keys), or it
-	 * contains only a high key.  The latter case is possible after vacuuming.
-	 * This can never happen on an internal page, however, since they are
-	 * never empty (an internal page must have children).
-	 */
-	if (high < low)
-		return low;
+	if (!key->restorebinsrch)
+	{
+		low = P_FIRSTDATAKEY(opaque);
+		high = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If there are no keys on the page, return the first available slot.
+		 * Note this covers two cases: the page is really empty (no keys), or
+		 * it contains only a high key.  The latter case is possible after
+		 * vacuuming.  This can never happen on an internal page, however,
+		 * since they are never empty (an internal page must have children).
+		 */
+		if (unlikely(high < low))
+		{
+			if (key->savebinsrch)
+			{
+				key->low = low;
+				key->high = high;
+				key->savebinsrch = false;
+			}
+			return low;
+		}
+		high++;					/* establish the loop invariant for high */
+	}
+	else
+	{
+		/* Restore result of previous binary search against same page */
+		low = key->low;
+		high = key->high;
+		key->restorebinsrch = false;
+
+		/* Return the first slot, in line with original binary search */
+		if (high < low)
+			return low;
+	}
 
 	/*
 	 * Binary search to find the first key on the page >= scan key, or first
@@ -391,22 +426,40 @@ _bt_binsrch(Relation rel,
 	 *
 	 * We can fall out when high == low.
 	 */
-	high++;						/* establish the loop invariant for high */
-
-	cmpval = nextkey ? 0 : 1;	/* select comparison value */
 
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
+	savehigh = high;
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		if (!isleaf)
+			result = _bt_compare(rel, key, page, mid);
+		else
+			result = _bt_nonpivot_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
 		else
+		{
 			high = mid;
+
+			/*
+			 * high can only be reused by a more restrictive binary search
+			 * when the tuple at that offset is known to be strictly greater
+			 * than the original scankey
+			 */
+			if (result != 0)
+				savehigh = high;
+		}
+	}
+
+	if (key->savebinsrch)
+	{
+		key->low = low;
+		key->high = savehigh;
+		key->savebinsrch = false;
 	}
 
 	/*
@@ -421,7 +474,8 @@ _bt_binsrch(Relation rel,
 
 	/*
 	 * On a non-leaf page, return the last key < scan key (resp. <= scan key).
-	 * There must be one if _bt_compare() is playing by the rules.
+	 * There must be one if _bt_compare()/_bt_tuple_compare() is playing by
+	 * the rules.
 	 */
 	Assert(low > P_FIRSTDATAKEY(opaque));
 
@@ -431,21 +485,11 @@ _bt_binsrch(Relation rel,
 /*----------
  *	_bt_compare() -- Compare scankey to a particular tuple on the page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * Convenience wrapper for _bt_tuple_compare() callers that want to compare
+ * against the tuple at a given offset on a particular page.
  *
- *	keysz: number of key conditions to be checked (might be less than the
- *		number of index columns!)
  *	page/offnum: location of btree item to be compared to.
  *
- *		This routine returns:
- *			<0 if scankey < tuple at offnum;
- *			 0 if scankey == tuple at offnum;
- *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
- *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
  * scankey.  The actual key value stored (if any, which there probably isn't)
@@ -456,26 +500,82 @@ _bt_binsrch(Relation rel,
  */
 int32
 _bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
+			BTScanInsert key,
 			Page page,
 			OffsetNumber offnum)
 {
-	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
-	int			i;
+	int			ntupatts;
 
 	Assert(_bt_check_natts(rel, page, offnum));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
 	 * --- see NOTE above.
+	 *
+	 * A minus infinity key has all attributes truncated away, so this test is
+	 * redundant with the minus infinity attribute tie-breaker.  However, the
+	 * number of attributes in minus infinity tuples was not explicitly
+	 * represented as 0 until PostgreSQL v11, so an explicit offnum test is
+	 * still required.
 	 */
 	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
+	return _bt_tuple_compare(rel, key, itup, ntupatts);
+}
+
+/*
+ * Optimized version of _bt_compare().  Only works on non-pivot tuples.
+ */
+static inline int32
+_bt_nonpivot_compare(Relation rel,
+					 BTScanInsert key,
+					 Page page,
+					 OffsetNumber offnum)
+{
+	IndexTuple	itup;
+
+	Assert(_bt_check_natts(rel, page, offnum));
+
+	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	Assert(BTreeTupleGetNAtts(itup, rel) ==
+		   IndexRelationGetNumberOfAttributes(rel));
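+
+	/*
+	 * Non-pivot tuples are never truncated, so the scan key's keysz already
+	 * bounds the comparison; passing it avoids fetching the tuple's
+	 * attribute count again on this hot path.
+	 */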
+	return _bt_tuple_compare(rel, key, itup, key->keysz);
+}
+
+/*----------
+ *	_bt_tuple_compare() -- Compare scankey to a particular tuple.
+ *
+ * The passed scankey must be an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ *		This routine returns:
+ *			<0 if scankey < tuple;
+ *			 0 if scankey == tuple;
+ *			>0 if scankey > tuple.
+ *		NULLs in the keys are treated as sortable values.  Therefore
+ *		"equality" does not necessarily mean that the item should be
+ *		returned to the caller as a matching key!
+ *----------
+ */
+int32
+_bt_tuple_compare(Relation rel,
+				  BTScanInsert key,
+				  IndexTuple itup,
+				  int ntupatts)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	ItemPointer heapTid;
+	int			ncmpkey;
+	int			i;
+	ScanKey		scankey;
+
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(key->heapkeyspace || key->scantid == NULL);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -489,7 +589,9 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
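+	/*
+	 * Compare only as many attributes as are present in both the scankey and
+	 * the tuple; pivot tuples may have fewer attributes than the scankey
+	 * because of suffix truncation.
+	 */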
+	ncmpkey = Min(ntupatts, key->keysz);
+	scankey = key->scankeys;
+	for (i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -540,8 +642,82 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * Use the number of attributes as a tie-breaker, in order to treat
+	 * truncated attributes in index as minus infinity
+	 */
+	if (key->keysz > ntupatts)
+		return 1;
+
+	/* If caller provided no heap TID tie-breaker for scan, they're equal */
+	if (key->scantid == NULL)
+		return 0;
+
+	/*
+	 * Although it isn't counted as an attribute by BTreeTupleGetNAtts(), heap
+	 * TID is an implicit final key attribute that ensures that all index
+	 * tuples have a distinct set of key attribute values.
+	 *
+	 * This is often truncated away in pivot tuples, which makes the attribute
+	 * value implicitly negative infinity.
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (heapTid == NULL)
+		return 1;
+
+	return ItemPointerCompare(key->scantid, heapTid);
+}
+
+/*
+ * _bt_lowest_scantid() -- Manufacture low heap TID.
+ *
+ *		Create a heap TID that _bt_tuple_compare treats as strictly less
+ *		than any possible real heap TID, while still greater than minus
+ *		infinity.  The overall effect is that _bt_search follows downlinks
+ *		whose non-TID attribute(s) equal the scankey's but whose TID
+ *		attribute has been truncated away, since the scankey compares
+ *		greater than such a downlink/pivot tuple as a whole.  (Obviously
+ *		this can only be of use when a scankey has values for all key
+ *		attributes other than the heap TID tie-breaker attribute/scantid.)
+ *
+ * If we didn't do this then affected index scans would have to
+ * unnecessarily visit an extra page before moving right to the page they
+ * should have landed on from the parent in the first place.  There would
+ * even be a useless binary search on the left/first page, since a high key
+ * check won't have the search move right immediately (the high key will be
+ * identical to the downlink we should have followed in the parent, barring
+ * a concurrent page split).
+ *
+ * This is particularly important with unique index insertions, since "the
+ * first page the value could be on" has an exclusive buffer lock held while
+ * a subsequent page (usually the actual first page the value could be on)
+ * has a shared buffer lock held.  (There may also be heap buffer locks
+ * acquired during this process.)
+ *
+ * Note that implementing this by hard-coding the behavior in _bt_compare is
+ * unworkable, since that would break nextkey semantics in the common case
+ * where all non-TID key attributes have been provided.
+ */
+ItemPointer
+_bt_lowest_scantid(void)
+{
+	static ItemPointer low = NULL;
+
+	/*
+	 * A heap TID that's less than or equal to any possible real heap TID
+	 * would also work
+	 */
+	if (!low)
+	{
+		low = &lowest;
+
+		/* Lowest possible block is 0 */
+		ItemPointerSetBlockNumber(low, 0);
+		/* InvalidOffsetNumber less than any real offset */
+		ItemPointerSetOffsetNumber(low, InvalidOffsetNumber);
+	}
+
+	return low;
 }
 
 /*
@@ -575,8 +751,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat;
 	bool		nextkey;
 	bool		goback;
+	BTScanInsertData key;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
-	ScanKeyData scankeys[INDEX_MAX_KEYS];
+	ScanKey		scankeys;
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -822,10 +999,12 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
 	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built in the local
-	 * scankeys[] array, using the keys identified by startKeys[].
+	 * identified above.  The insertion scankey is built in the scankeys[]
+	 * array, using the keys identified by startKeys[].
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
+	scankeys = key.scankeys;
+
 	for (i = 0; i < keysCount; i++)
 	{
 		ScanKey		cur = startKeys[i];
@@ -1053,12 +1232,38 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
+	/*
+	 * Initialize insertion scankey.
+	 *
+	 * Manufacture sentinel scan tid that's less than any possible heap TID in
+	 * the index when that might allow us to avoid unnecessary moves right
+	 * while descending the tree.
+	 *
+	 * Never do this for any nextkey case, since that would make _bt_search()
+	 * incorrectly land on the leaf page with the second user-attribute-wise
+	 * duplicate tuple, rather than landing on the leaf page with the next
+	 * user-attribute-distinct key > scankey, which is the intended behavior.
+	 * We could invent a _bt_highest_scantid() to use in nextkey cases, but
+	 * that would never actually save any cycles during the descent of the
+	 * tree; "_bt_binsrch() + nextkey = true" already behaves as if all tuples
+	 * <= scankey (in terms of the attributes/keys actually supplied in the
+	 * scankey) are < scankey.
+	 */
+	key.heapkeyspace = _bt_heapkeyspace(rel);
+	key.savebinsrch = key.restorebinsrch = false;
+	key.low = key.high = 0;
+	key.nextkey = nextkey;
+	key.keysz = keysCount;
+	key.scantid = NULL;
+	if (key.keysz >= IndexRelationGetNumberOfKeyAttributes(rel) &&
+		!key.nextkey && key.heapkeyspace)
+		key.scantid = _bt_lowest_scantid();
+
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, &key, &buf, BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
@@ -1087,7 +1292,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, &key, buf);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..fa07390749 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -743,6 +743,7 @@ _bt_sortaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -796,8 +797,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -813,28 +812,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	itupsz = IndexTupleSize(itup);
 	itupsz = MAXALIGN(itupsz);
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itupsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: similar code appears in _bt_insertonpg() to defend against
-	 * oversize items being inserted into an already-existing index. But
-	 * during creation of an index, we don't go through there.
-	 */
 	if (itupsz > BTMaxItemSize(npage))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itupsz, BTMaxItemSize(npage),
-						RelationGetRelationName(wstate->index)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(wstate->heap,
-									RelationGetRelationName(wstate->index))));
+		_bt_check_third_page(wstate->index, wstate->heap, npage, itup);
 
 	/*
 	 * Check to see if page is "full".  It's definitely full if the item won't
@@ -880,19 +859,29 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
+			IndexTuple	lastleft;
 			IndexTuple	truncated;
 			Size		truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks
 			 * in internal pages are either negative infinity items, or get
 			 * their contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it more
+			 * likely that _bt_truncate() can truncate away more attributes,
+			 * whereas the split point passed by _bt_split() is chosen much
+			 * more delicately.  Suffix truncation is mostly useful because it
+			 * can greatly improve space utilization for workloads with random
+			 * insertions, or insertions of monotonically increasing values at
+			 * "local" points in the key space.  It doesn't seem worthwhile to
+			 * add complex logic for choosing a split point here for a benefit
+			 * that is bound to be much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
@@ -905,7 +894,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_truncate(wstate->index, lastleft, oitup, true);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -924,8 +916,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -970,7 +963,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1029,8 +1022,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1115,7 +1109,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
 		}
 
-		_bt_freeskey(indexScanKey);
+		pfree(indexScanKey);
 
 		for (;;)
 		{
@@ -1127,6 +1121,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
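+				/*
+				 * compare is declared and initialized outside the
+				 * per-attribute loop so that the heap TID tie-break below
+				 * can tell whether every key attribute compared equal
+				 */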
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1134,7 +1130,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1151,6 +1146,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all
+				 * keys in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 205457ef99..629066fcf9 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,6 +49,8 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_leave_natts(Relation rel, IndexTuple lastleft,
+				IndexTuple firstright, bool build);
 
 
 /*
@@ -56,34 +58,56 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
+ *		When itup is a non-pivot tuple, the returned insertion scan key is
+ *		suitable for finding a place for it to go on the leaf level.  When
+ *		itup is a pivot tuple, the returned insertion scankey is suitable
+ *		for locating the leaf page with the pivot as its high key (there
+ *		must have been one at some point if the pivot tuple actually came
+ *		from the tree, barring the minus infinity special case).
+ *
+ *		Note that we may occasionally have to share lock the metapage, in
+ *		order to determine whether or not the keys in the index are expected
+ *		to be unique (i.e. whether or not heap TID is treated as a tie-breaker
+ *		attribute).  Callers that cannot tolerate this can request that we
+ *		assume that all entries in the index are unique.
+ *
  *		The result is intended for use with _bt_compare().
  */
-ScanKey
-_bt_mkscankey(Relation rel, IndexTuple itup)
+BTScanInsert
+_bt_mkscankey(Relation rel, IndexTuple itup, bool assumeheapkeyspace)
 {
+	BTScanInsert res;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
+	int			tupnatts;
 	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts > 0);
+	Assert(tupnatts <= indnatts);
 
 	/*
-	 * We'll execute search using scan key constructed on key columns. Non-key
-	 * (INCLUDE index) columns are always omitted from scan keys.
+	 * We'll execute search using scan key constructed on key columns.
+	 * Truncated key attributes and non-key (INCLUDE index) attributes never
+	 * participate in comparisons performed with the insertion scan key.
 	 */
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
+	res = palloc(offsetof(BTScanInsertData, scankeys) +
+				 sizeof(ScanKeyData) * indnkeyatts);
+	res->heapkeyspace = assumeheapkeyspace || _bt_heapkeyspace(rel);
+	res->savebinsrch = res->restorebinsrch = false;
+	res->low = res->high = 0;
+	res->nextkey = false;
+	res->keysz = Min(indnkeyatts, tupnatts);
+	res->scantid = res->heapkeyspace ? BTreeTupleGetHeapTID(itup) : NULL;
+	skey = res->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
 		FmgrInfo   *procinfo;
@@ -96,7 +120,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Key attributes may be missing from the index tuple because of
+		 * suffix truncation.  Keys built for such missing attributes are
+		 * defensively represented as NULL values, though they should never
+		 * participate in comparisons, since they are excluded from keysz.
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -108,7 +145,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 									   arg);
 	}
 
-	return skey;
+	return res;
 }
 
 /*
@@ -159,15 +196,6 @@ _bt_mkscankey_nodata(Relation rel)
 	return skey;
 }
 
-/*
- * free a scan key made by either _bt_mkscankey or _bt_mkscankey_nodata.
- */
-void
-_bt_freeskey(ScanKey skey)
-{
-	pfree(skey);
-}
-
 /*
  * free a retracement stack made by _bt_search.
  */
@@ -2083,38 +2111,216 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  This is possible when one or more
+ * key attributes follow the first attribute in firstright that is unequal to
+ * the corresponding attribute in lastleft (unequal according to insertion
+ * scan key semantics).
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that takes up more
+ * space than firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
  * Note that returned tuple's t_tid offset will hold the number of attributes
  * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * should only change truncated tuple's downlink.  Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare()/_bt_tuple_compare().
+ *
+ * In the worst case (when a heap TID is appended) the size of the returned
+ * tuple is the size of the first right tuple plus an additional MAXALIGN()
+ * quantum.  This guarantee is important, since callers need to stay under
+ * the 1/3 of a page restriction on tuple size.  If this routine is ever
+ * taught to truncate within an attribute/datum, it will need to avoid
+ * returning an enlarged tuple to caller when truncation + TOAST compression
+ * ends up enlarging the final datum.
+ *
+ * CREATE INDEX callers must pass build = true, in order to avoid metapage
+ * access.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			 bool build)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int16		natts = IndexRelationGetNumberOfAttributes(rel);
+	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			leavenatts;
+	IndexTuple	pivot;
+	ItemPointer pivotheaptid;
+	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples.  It's never okay to
+	 * truncate a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be left behind */
+	leavenatts = _bt_leave_natts(rel, lastleft, firstright, build);
 
-	return truncated;
+#ifdef DEBUG_NO_TRUNCATE
+	/* Artificially force truncation to always append heap TID */
+	leavenatts = nkeyatts + 1;
+#endif
+
+	if (leavenatts <= natts)
+	{
+		IndexTuple	tidpivot;
+
+		pivot = index_truncate_tuple(itupdesc, firstright, leavenatts);
+
+		/*
+		 * If there is a distinguishing key attribute within leavenatts, there
+		 * is no need to add an explicit heap TID attribute to new pivot.
+		 */
+		if (leavenatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, leavenatts);
+			return pivot;
+		}
+
+		/*
+		 * Only non-key attributes could be truncated away from an INCLUDE
+		 * index's pivot tuple.  They are not considered part of the key
+		 * space, so it's still necessary to add a heap TID attribute to the
+		 * new pivot tuple.  Create an enlarged copy of the truncated
+		 * firstright tuple, with room at the end for the heap TID.
+		 */
+		Assert(natts > nkeyatts);
+		newsize = MAXALIGN(IndexTupleSize(pivot) + sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal (and
+		 * there are no non-key attributes).  It's necessary to add a heap TID
+		 * attribute to the new pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = MAXALIGN(IndexTupleSize(firstright) + sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * We have to use heap TID as a unique-ifier in the new pivot tuple, since
+	 * no non-TID attribute in the right item readily distinguishes the right
+	 * side of the split from the left side.  Use enlarged space that holds a
+	 * copy of first right tuple; place a heap TID value within the extra
+	 * space that remains at the end.
+	 *
+	 * nbtree conceptualizes this case as an inability to truncate away any
+	 * attribute.  We must use an alternative representation of heap TID
+	 * within pivots because heap TID is only treated as an attribute within
+	 * nbtree (e.g., there is no explicit pg_attribute entry).
+	 *
+	 * Callers generally try to avoid choosing a split point that necessitates
+	 * that we do this.  Splits of pages that only involve a single distinct
+	 * value (or set of values) must end up here, though.
+	 */
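+	/* Set the enlarged tuple size, which includes space for the heap TID */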
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/* Use last left item's heap TID */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  sizeof(ItemPointerData));
+	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+
+	/*
+	 * Lehman and Yao require that the downlink to the right page, which is to
+	 * be inserted into the parent page in the second phase of a page split,
+	 * be a strict lower bound on all current and future items on the right
+	 * page.
+	 */
+#ifndef DEBUG_NO_TRUNCATE
+	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#endif
+
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
+
+/*
+ * _bt_leave_natts - how many key attributes to leave when truncating.
+ *
+ * Caller provides two tuples that enclose a split point.  CREATE INDEX
+ * callers must pass build = true so that we may avoid metapage access.  (This
+ * is okay because CREATE INDEX always creates an index on the latest btree
+ * version.)
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in the new pivot tuple.
+ */
+static int
+_bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+				bool build)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			leavenatts;
+	ScanKey		scankey;
+	BTScanInsert key;
+
+	key = _bt_mkscankey(rel, firstright, build);
+
+	/*
+	 * Be consistent about the representation of BTREE_VERSION 3 tuples across
+	 * Postgres versions; don't allow new pivot tuples to have truncated key
+	 * attributes there.  This keeps things simple for
+	 * verification tools that have to handle multiple versions.
+	 */
+	if (!key->heapkeyspace)
+	{
+		Assert(nkeyatts != IndexRelationGetNumberOfAttributes(rel));
+		return nkeyatts;
+	}
+
+	scankey = key->scankeys;
+	leavenatts = 1;
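+
+	/*
+	 * The scan key was built from firstright, so comparing each of lastleft's
+	 * attributes against it finds the first attribute that distinguishes the
+	 * tuples enclosing the split point; that attribute, and everything before
+	 * it, must be left in the new pivot.
+	 */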
+	for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+	{
+		Datum		datum1;
+		bool		isNull1,
+					isNull2;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		isNull2 = (scankey->sk_flags & SK_ISNULL) != 0;
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+											scankey->sk_collation,
+											datum1,
+											scankey->sk_argument)) != 0)
+			break;
+
+		leavenatts++;
+	}
+
+	/* Can't leak memory here */
+	pfree(key);
+
+	return leavenatts;
 }
 
 /*
@@ -2137,6 +2343,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2156,6 +2363,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
@@ -2165,7 +2373,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2176,7 +2384,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			Assert(!P_RIGHTMOST(opaque));
 
 			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2209,8 +2417,73 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Tuple contains only key attributes, regardless of whether it
 			 * is the page high key or not
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 
 	}
 }
+
+/*
+ * _bt_check_third_page() -- check whether tuple fits on a btree page at all.
+ *
+ * We actually need to be able to fit three items on every page, so restrict
+ * any one item to 1/3 the per-page available space.  Note that itemsz should
+ * not include the ItemId overhead.
+ *
+ * It might be useful to apply TOAST methods rather than throw an error here.
+ * Using out of line storage would break assumptions made by suffix truncation
+ * and by contrib/amcheck, though.
+ */
+void
+_bt_check_third_page(Relation rel, Relation heap, Page page, IndexTuple newtup)
+{
+	bool		needheaptidspace;
+	Size		itemsz;
+
+	itemsz = MAXALIGN(IndexTupleSize(newtup));
+
+	/* Double check item size against limit */
+	if (itemsz <= BTMaxItemSize(page))
+		return;
+
+	/*
+	 * Tuple is probably too large to fit on page, but it's possible that the
+	 * index uses version 2 or version 3, in which case a slightly higher
+	 * limit applies.
+	 */
+	needheaptidspace = _bt_heapkeyspace(rel);
+	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
+		return;
+
+	if (needheaptidspace)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
+						itemsz, BTREE_VERSION, BTMaxItemSize(page),
+						RelationGetRelationName(rel)),
+				 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+						   ItemPointerGetBlockNumber(&newtup->t_tid),
+						   ItemPointerGetOffsetNumber(&newtup->t_tid),
+						   RelationGetRelationName(heap)),
+				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+						 "Consider a function index of an MD5 hash of the value, "
+						 "or use full text indexing."),
+				 errtableconstraint(heap,
+									RelationGetRelationName(rel))));
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("index row size %zu exceeds btree version 3 maximum %zu for index \"%s\"",
+						itemsz, BTMaxItemSizeNoHeapTid(page),
+						RelationGetRelationName(rel)),
+				 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+						   ItemPointerGetBlockNumber(&newtup->t_tid),
+						   ItemPointerGetOffsetNumber(&newtup->t_tid),
+						   RelationGetRelationName(heap)),
+				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+						 "Consider a function index of an MD5 hash of the value, "
+						 "or use full text indexing."),
+				 errtableconstraint(heap,
+									RelationGetRelationName(rel))));
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 67a94cb80a..fe8f4fe2a7 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -103,7 +103,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 
 	md = BTPageGetMeta(metapg);
 	md->btm_magic = BTREE_MAGIC;
-	md->btm_version = BTREE_VERSION;
+	md->btm_version = xlrec->version;
 	md->btm_root = xlrec->root;
 	md->btm_level = xlrec->level;
 	md->btm_fastroot = xlrec->fastroot;
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -284,6 +268,8 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		OffsetNumber off;
 		IndexTuple	newitem = NULL;
 		Size		newitemsz = 0;
+		IndexTuple	left_hikey = NULL;
+		Size		left_hikeysz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 5c4457179d..667c906b2e 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index ee7fd83c02..f675758bde 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -964,7 +964,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -1042,7 +1042,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -4057,9 +4057,10 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required for
+	 * btree indexes, since heap TID is treated as an implicit last key
+	 * attribute in order to ensure that all keys in the index are physically
+	 * unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ea495f1724..30340e9c02 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -97,7 +97,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
 typedef struct BTMetaPageData
 {
 	uint32		btm_magic;		/* should contain BTREE_MAGIC */
-	uint32		btm_version;	/* should contain BTREE_VERSION */
+	uint32		btm_version;	/* should be >= BTREE_META_VERSION */
 	BlockNumber btm_root;		/* current root location */
 	uint32		btm_level;		/* tree level of the root page */
 	BlockNumber btm_fastroot;	/* current "fast" root location */
@@ -114,16 +114,27 @@ typedef struct BTMetaPageData
 
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
+#define BTREE_VERSION	4		/* current version number */
 #define BTREE_MIN_VERSION	2	/* minimal supported version number */
+#define BTREE_META_VERSION	3	/* minimal version with all meta fields */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_truncate() will need to enlarge
+ * a heap index tuple to make space for a tie-breaker heap TID
+ * attribute, which we account for here.
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*sizeof(ItemPointerData)) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeNoHeapTid(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
@@ -203,22 +214,25 @@ typedef struct BTMetaPageData
  * their item pointer offset field, since pivot tuples never need to store a
  * real offset (downlinks only need to store a block number).  The offset
  * field only stores the number of attributes when the INDEX_ALT_TID_MASK
- * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * bit is set, though that number doesn't include the trailing heap TID
+ * attribute sometimes stored in pivot tuples -- that's represented by the
+ * presence of BT_HEAP_TID_ATTR.  INDEX_ALT_TID_MASK is only used for pivot
+ * tuples at present, though it's possible that it will be used within
+ * non-pivot tuples in the future.  All pivot tuples must have
+ * INDEX_ALT_TID_MASK set as of BTREE_VERSION 4.
  *
  * The 12 least significant offset bits are used to represent the number of
- * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
- * for future use (BT_RESERVED_OFFSET_MASK bits). BT_N_KEYS_OFFSET_MASK should
- * be large enough to store any number <= INDEX_MAX_KEYS.
+ * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 status bits
+ * (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for future
+ * use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any number of
+ * attributes <= INDEX_MAX_KEYS.
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
+
+/* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +255,15 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +272,42 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
+ * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
+ * (since all pivot tuples must as of BTREE_VERSION 4).  When non-pivot
+ * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
+ * probably also contain a heap TID at the end of the tuple.  We avoid
+ * assuming that a tuple with INDEX_ALT_TID_MASK set is necessarily a pivot
+ * tuple.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   sizeof(ItemPointerData)) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -319,6 +366,62 @@ typedef struct BTStackData
 
 typedef BTStackData *BTStack;
 
+/*
+ * BTScanInsert is the btree-private state needed to find an initial position
+ * for an indexscan, or to insert new tuples.  For details on its mutable
+ * state, see _bt_binsrch and _bt_findinsertloc.
+ *
+ * heapkeyspace indicates if we expect all keys in the index to be unique by
+ * treating heap TID as a tie-breaker attribute (i.e. the index is
+ * BTREE_VERSION 4+).  scantid should never be set when index is not a
+ * heapkeyspace index.
+ *
+ * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
+ * locate the first item >= scankey.  When nextkey is true, they will locate
+ * the first item > scan key.
+ *
+ * scantid is the heap TID that is used as a final tie-breaker attribute,
+ * which may be set to NULL to indicate its absence.  When inserting new
+ * tuples, it must be set, since every tuple in the tree unambiguously belongs
+ * in one exact position, even when there are entries in the tree that are
+ * considered duplicates by external code.  Unique insertions set scantid only
+ * after unique checking indicates that it's safe to insert.  Despite the
+ * representational difference, scantid is just another insertion scankey to
+ * routines like _bt_search().
+ *
+ * keysz is the number of insertion scankeys present (scantid is counted
+ * separately).
+ *
+ * scankeys is an array of scan key entries for attributes that are compared
+ * before scantid (user-visible attributes).  Every attribute should have an
+ * entry during insertion, though not necessarily when a regular index scan
+ * uses an insertion scankey to find an initial leaf page.   The array is
+ * used as a flexible array member, though it's sized in a way that makes it
+ * possible to use stack allocations.  See nbtree/README for full details.
+ */
+
+typedef struct BTScanInsertData
+{
+	/*
+	 * Mutable state.  Used by _bt_binsrch() to inexpensively repeat a binary
+	 * search when only scantid has changed.
+	 */
+	bool		savebinsrch;
+	bool		restorebinsrch;
+	OffsetNumber low;
+	OffsetNumber high;
+
+	/* State used to locate a position at the leaf level */
+	bool		heapkeyspace;
+	bool		nextkey;
+	ItemPointer scantid;		/* tiebreaker for scankeys */
+	int			keysz;			/* Size of scankeys */
+	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
+} BTScanInsertData;
+
+typedef BTScanInsertData *BTScanInsert;
+
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -541,6 +644,7 @@ extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
+extern bool _bt_heapkeyspace(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -559,15 +663,15 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
  * prototypes for functions in nbtsearch.c
  */
 extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
+		   BTScanInsert key,
 		   Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
-extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_tuple_compare(Relation rel, BTScanInsert key, IndexTuple itup,
+				  int ntupatts);
+extern ItemPointer _bt_lowest_scantid(void);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -576,9 +680,9 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 /*
  * prototypes for functions in nbtutils.c
  */
-extern ScanKey _bt_mkscankey(Relation rel, IndexTuple itup);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup,
+								  bool assumeheapkeyspace);
 extern ScanKey _bt_mkscankey_nodata(Relation rel);
-extern void _bt_freeskey(ScanKey skey);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -600,8 +704,11 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+			 IndexTuple firstright, bool build);
 extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
+extern void _bt_check_third_page(Relation rel, Relation heap, Page page,
+					 IndexTuple newtup);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 819373031c..06da0965f7 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,7 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50 and 0x60 are unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -47,6 +46,7 @@
  */
 typedef struct xl_btree_metadata
 {
+	uint32		version;
 	BlockNumber root;
 	uint32		level;
 	BlockNumber fastroot;
@@ -82,20 +82,16 @@ typedef struct xl_btree_insert
  *
  * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
  * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always explicitly log the left page high key value.
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * In the _R variants, the new item is one of the right page's tuples.  An
+ * IndexTuple representing the HIKEY of the left page always follows.  (It
+ * can no longer be inferred from the leftmost key on the right page, since
+ * suffix truncation may have made the two differ.)
  *
  * Backup Blk 1: new right page
  *
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8e50f8ffbb..8d31110b87 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 -- doesn't work: grant still exists
 DROP USER regress_dep_user1;
 ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
-DETAIL:  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
+DETAIL:  privileges for table deptest1
 privileges for database regression
-privileges for table deptest1
+owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
 DROP OWNED BY regress_dep_user1;
 DROP USER regress_dep_user1;
 \set VERBOSITY terse
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index 0e32d5c427..ac41419c7b 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -187,9 +187,9 @@ ERROR:  event trigger "regress_event_trigger" does not exist
 -- should fail, regress_evt_user owns some objects
 drop role regress_evt_user;
 ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
-DETAIL:  owner of event trigger regress_event_trigger3
+DETAIL:  owner of user mapping for regress_evt_user on server useless_server
 owner of default privileges on new relations belonging to role regress_evt_user
-owner of user mapping for regress_evt_user on server useless_server
+owner of event trigger regress_event_trigger3
 -- cleanup before next test
 -- these are all OK; the second one should emit a NOTICE
 drop event trigger if exists regress_event_trigger2;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 4d82d3a7e8..9c763ec184 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
 RESET ROLE;
 DROP ROLE regress_test_indirect;                            -- ERROR
 ERROR:  role "regress_test_indirect" cannot be dropped because some objects depend on it
-DETAIL:  owner of server s1
-privileges for foreign-data wrapper foo
+DETAIL:  privileges for foreign-data wrapper foo
+owner of server s1
 \des+
                                                                                  List of foreign servers
  Name |           Owner           | Foreign-data wrapper |                   Access privileges                   |  Type  | Version |             FDW options              | Description 
@@ -1998,9 +1998,9 @@ DROP TABLE temp_parted;
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 ERROR:  role "regress_test_role" cannot be dropped because some objects depend on it
-DETAIL:  privileges for server s4
+DETAIL:  owner of user mapping for regress_test_role on server s6
 privileges for foreign-data wrapper foo
-owner of user mapping for regress_test_role on server s6
+privileges for server s4
 DROP SERVER t1 CASCADE;
 NOTICE:  drop cascades to user mapping for public on server t1
 DROP USER MAPPING FOR regress_test_role SERVER s6;
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index 1d12b01068..06fe44d39a 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3502,8 +3502,8 @@ SELECT refclassid::regclass, deptype
 SAVEPOINT q;
 DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
 ERROR:  role "regress_rls_eve" cannot be dropped because some objects depend on it
-DETAIL:  target of policy p on table tbl1
-privileges for table tbl1
+DETAIL:  privileges for table tbl1
+target of policy p on table tbl1
 ROLLBACK TO q;
 ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
 SAVEPOINT q;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9fe950b29d..08cf72d670 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -167,6 +167,8 @@ BTArrayKeyInfo
 BTBuildState
 BTCycleId
 BTIndexStat
+BTInsertionKey
+BTInsertionKeyData
 BTLeader
 BTMetaPageData
 BTOneVacInfo
@@ -2207,6 +2209,8 @@ SpecialJoinInfo
 SpinDelayStatus
 SplitInterval
 SplitLR
+SplitMode
+SplitPoint
 SplitVar
 SplitedPageLayout
 StackElem
-- 
2.17.1

v8-0001-Add-pg_depend-index-scan-tiebreaker-column.patch (text/x-patch; charset=US-ASCII)
From 8592cf795cfc0315386136e3519e22be420b77d6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 13 Nov 2018 18:14:23 -0800
Subject: [PATCH v8 1/6] Add pg_depend index scan tiebreaker column.

findDependentObjects() and other code that scans pg_depend evolved to
depend on pg_depend_depender_index and pg_depend_reference_index scan
order.  This is clear from running the regression tests with
"ignore_system_indexes=on": much of the test output changes for
regression tests that happen to have DROP diagnostic messages.  More
importantly, a small number of those test failures involve alternative
messages that are objectively useless or even harmful.  These
regressions all involve internal dependencies of one form or another.
For example, we can go from a HINT that suggests dropping a partitioning
parent table's trigger in order to drop the child's trigger to a HINT
that suggests simply dropping the child table.  This HINT is technically
still correct from the point of view of findDependentObjects(), and yet
it's clearly not the intended user-visible behavior.

Make dependency.c take responsibility for its dependency on scan order
by commenting on it directly.  Ensure that the behavior of
findDependentObjects() is deterministic in the event of duplicates by
adding a new per-backend sequentially assigned number column to
pg_depend.  Both indexes now use this new column as a final trailing
attribute, effectively making it a tiebreaker wherever processing order
happens to matter.  The new column is a per-backend sequentially
assigned number.  New values are assigned in decreasing order.  That
produces the behavior that's already expected from nbtree scans in the
event of duplicate index entries (nbtree usually leaves duplicate index
entries in reverse insertion order at present).  A similar new column is
not needed for pg_shdepend because the aforementioned harmful changes
only occur with cases involving internal dependencies.

The overall effect is to stabilize the behavior of DROP diagnostic
messages, making it possible to avoid the "\set VERBOSITY=terse" hack
that has been used to paper over test instability in the past.  We may
wish to go through the regression tests and remove existing instances of
the "\set VERBOSITY=terse" hack (see also: commit 8e753726), but that's
left for later.

An upcoming patch to make nbtree store duplicate entries in a
well-defined order by treating heap TID as a tiebreaker tuple attribute
more or less flips the order that duplicates appear (the order will
change from approximately descending to perfectly ascending).  That
would cause all sorts of problems for findDependentObjects() callers if
this groundwork was skipped.

Note that adding the new column has no appreciable storage overhead.
pg_depend indexes are made no larger, at least on 64-bit platforms,
because values can fit in a hole that was previously unused due to
alignment -- both pg_depend_depender_index and pg_depend_reference_index
continue to have 24 byte IndexTuples.  There is also no change in the
on-disk size of pg_depend heap relations on 64-bit platforms, for the
same reason.  The MAXALIGN()'d size of pg_depend heap tuples remains 56
bytes (including tuple header overhead).

Discussion: https://postgr.es/m/CAH2-Wzkypv1R+teZrr71U23J578NnTBt2X8+Y=Odr4pOdW1rXg@mail.gmail.com
Discussion: https://postgr.es/m/11852.1501610262%40sss.pgh.pa.us
---
 doc/src/sgml/catalogs.sgml                | 17 ++++++++-
 src/backend/catalog/dependency.c          | 10 ++++++
 src/backend/catalog/pg_depend.c           | 11 +++++-
 src/bin/initdb/initdb.c                   | 44 +++++++++++------------
 src/include/catalog/indexing.h            |  4 +--
 src/include/catalog/pg_depend.h           |  1 +
 src/test/regress/expected/alter_table.out |  2 +-
 src/test/regress/expected/misc_sanity.out |  4 +--
 8 files changed, 64 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c134bca809..1cdd932c38 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2943,6 +2943,20 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
       </entry>
      </row>
 
+     <row>
+      <entry><structfield>depcreate</structfield></entry>
+      <entry><type>int4</type></entry>
+      <entry></entry>
+      <entry>
+       A per-backend sequentially assigned number for this dependency
+       relationship.  Used as a tiebreaker in the event of multiple
+       internal dependency relationships of otherwise equal
+       precedence.  Identifiers are assigned in descending order to
+       ensure that the most recently entered dependency is the one
+       referenced by <literal>HINT</literal> fields.
+      </entry>
+     </row>
+
     </tbody>
    </tgroup>
   </table>
@@ -3060,7 +3074,8 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
        that the system itself depends on the referenced object, and so
        that object must never be deleted.  Entries of this type are
        created only by <command>initdb</command>.  The columns for the
-       dependent object contain zeroes.
+       dependent object and the <structfield>depcreate</structfield>
+       column contain zeroes.
       </para>
      </listitem>
     </varlistentry>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 7dfa3278a5..d7889c2cd0 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -532,6 +532,11 @@ findDependentObjects(const ObjectAddress *object,
 	else
 		nkeys = 2;
 
+	/*
+	 * Note that we rely on DependDependerIndexId scan order to make
+	 * diagnostic messages deterministic.  (e.g., objsubid = 0 entries will be
+	 * processed before other entries for the same dependent object.)
+	 */
 	scan = systable_beginscan(*depRel, DependDependerIndexId, true,
 							  NULL, nkeys, key);
 
@@ -727,6 +732,11 @@ findDependentObjects(const ObjectAddress *object,
 	else
 		nkeys = 2;
 
+	/*
+	 * Note that we rely on DependReferenceIndexId scan order to make
+	 * diagnostic messages deterministic.  (e.g., refobjsubid = 0 entries will
+	 * be processed before other entries for the same referenced object.)
+	 */
 	scan = systable_beginscan(*depRel, DependReferenceIndexId, true,
 							  NULL, nkeys, key);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index 2ea05f350b..1f609d4dc0 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -29,6 +29,8 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
+/* Per-backend pg_depend tiebreaker value */
+static int32 depcreate = INT32_MAX;
 
 static bool isObjectPinned(const ObjectAddress *object, Relation rel);
 
@@ -93,7 +95,10 @@ recordMultipleDependencies(const ObjectAddress *depender,
 		{
 			/*
 			 * Record the Dependency.  Note we don't bother to check for
-			 * duplicate dependencies; there's no harm in them.
+			 * duplicate dependencies; there's no harm in them.  Note that
+			 * depcreate ensures deterministic processing among dependencies
+			 * of otherwise equal precedence (e.g., among multiple entries of
+			 * the same refclassid + refobjid + refobjsubid).
 			 */
 			values[Anum_pg_depend_classid - 1] = ObjectIdGetDatum(depender->classId);
 			values[Anum_pg_depend_objid - 1] = ObjectIdGetDatum(depender->objectId);
@@ -104,6 +109,7 @@ recordMultipleDependencies(const ObjectAddress *depender,
 			values[Anum_pg_depend_refobjsubid - 1] = Int32GetDatum(referenced->objectSubId);
 
 			values[Anum_pg_depend_deptype - 1] = CharGetDatum((char) behavior);
+			values[Anum_pg_depend_depcreate - 1] = Int32GetDatum(depcreate--);
 
 			tup = heap_form_tuple(dependDesc->rd_att, values, nulls);
 
@@ -114,6 +120,9 @@ recordMultipleDependencies(const ObjectAddress *depender,
 			CatalogTupleInsertWithInfo(dependDesc, tup, indstate);
 
 			heap_freetuple(tup);
+			/* avoid signed underflow */
+			if (depcreate == INT_MIN)
+				depcreate = INT_MAX;
 		}
 	}
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 211a96380e..3679c5f8aa 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1619,55 +1619,55 @@ setup_depend(FILE *cmdfd)
 		"DELETE FROM pg_shdepend;\n\n",
 		"VACUUM pg_shdepend;\n\n",
 
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_class;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_proc;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_type;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_cast;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_constraint;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_conversion;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_attrdef;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_language;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_operator;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_opclass;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_opfamily;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_am;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_amop;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_amproc;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_rewrite;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_trigger;\n\n",
 
 		/*
 		 * restriction here to avoid pinning the public namespace
 		 */
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_namespace "
 		"    WHERE nspname LIKE 'pg%';\n\n",
 
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_ts_parser;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_ts_dict;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_ts_template;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_ts_config;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_collation;\n\n",
 		"INSERT INTO pg_shdepend SELECT 0,0,0,0, tableoid,oid, 'p' "
 		" FROM pg_authid;\n\n",
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index 2359b4c629..1ee7149a49 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -142,9 +142,9 @@ DECLARE_UNIQUE_INDEX(pg_database_datname_index, 2671, on pg_database using btree
 DECLARE_UNIQUE_INDEX(pg_database_oid_index, 2672, on pg_database using btree(oid oid_ops));
 #define DatabaseOidIndexId	2672
 
-DECLARE_INDEX(pg_depend_depender_index, 2673, on pg_depend using btree(classid oid_ops, objid oid_ops, objsubid int4_ops));
+DECLARE_INDEX(pg_depend_depender_index, 2673, on pg_depend using btree(classid oid_ops, objid oid_ops, objsubid int4_ops, depcreate int4_ops));
 #define DependDependerIndexId  2673
-DECLARE_INDEX(pg_depend_reference_index, 2674, on pg_depend using btree(refclassid oid_ops, refobjid oid_ops, refobjsubid int4_ops));
+DECLARE_INDEX(pg_depend_reference_index, 2674, on pg_depend using btree(refclassid oid_ops, refobjid oid_ops, refobjsubid int4_ops, depcreate int4_ops));
 #define DependReferenceIndexId	2674
 
 DECLARE_UNIQUE_INDEX(pg_description_o_c_o_index, 2675, on pg_description using btree(objoid oid_ops, classoid oid_ops, objsubid int4_ops));
diff --git a/src/include/catalog/pg_depend.h b/src/include/catalog/pg_depend.h
index 8f2d95210f..47cb387a6f 100644
--- a/src/include/catalog/pg_depend.h
+++ b/src/include/catalog/pg_depend.h
@@ -61,6 +61,7 @@ CATALOG(pg_depend,2608,DependRelationId)
 	 * field.  See DependencyType in catalog/dependency.h.
 	 */
 	char		deptype;		/* see codes in dependency.h */
+	int32		depcreate;		/* per-backend identifier; tiebreaker */
 } FormData_pg_depend;
 
 /* ----------------
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 38ede0a473..f9d4d38b3d 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -2561,10 +2561,10 @@ DETAIL:  drop cascades to table alter2.t1
 drop cascades to view alter2.v1
 drop cascades to function alter2.plus1(integer)
 drop cascades to type alter2.posint
-drop cascades to operator family alter2.ctype_hash_ops for access method hash
 drop cascades to type alter2.ctype
 drop cascades to function alter2.same(alter2.ctype,alter2.ctype)
 drop cascades to operator alter2.=(alter2.ctype,alter2.ctype)
+drop cascades to operator family alter2.ctype_hash_ops for access method hash
 drop cascades to conversion alter2.ascii_to_utf8
 drop cascades to text search parser alter2.prs
 drop cascades to text search configuration alter2.cfg
diff --git a/src/test/regress/expected/misc_sanity.out b/src/test/regress/expected/misc_sanity.out
index 1d4b000acf..6ee8a9424f 100644
--- a/src/test/regress/expected/misc_sanity.out
+++ b/src/test/regress/expected/misc_sanity.out
@@ -18,8 +18,8 @@ WHERE refclassid = 0 OR refobjid = 0 OR
       deptype NOT IN ('a', 'e', 'i', 'n', 'p') OR
       (deptype != 'p' AND (classid = 0 OR objid = 0)) OR
       (deptype = 'p' AND (classid != 0 OR objid != 0 OR objsubid != 0));
- classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype 
----------+-------+----------+------------+----------+-------------+---------
+ classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype | depcreate 
+---------+-------+----------+------------+----------+-------------+---------+-----------
 (0 rows)
 
 -- **************** pg_shdepend ****************
-- 
2.17.1

#41 Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Peter Geoghegan (#40)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sun, Nov 25, 2018 at 12:14 AM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v8 of the patch series, which has some relatively minor changes:

Thank you for working on this patch,

Just for the information, cfbot says there are problems on windows:

src/backend/catalog/pg_depend.c(33): error C2065: 'INT32_MAX' :
undeclared identifier

#42 Peter Geoghegan
pg@bowt.ie
In reply to: Dmitry Dolgov (#41)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sat, Dec 1, 2018 at 4:10 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Just for the information, cfbot says there are problems on windows:

src/backend/catalog/pg_depend.c(33): error C2065: 'INT32_MAX' :
undeclared identifier

Thanks. Looks like I should have used PG_INT32_MAX.
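
To spell the fix out for anyone following along at home, the only change
needed should be to switch to the portable macros that c.h provides (they
come in via postgres.h, so no new include is required).  Only the
initializer change is confirmed by v9 below; the matching change to the
underflow check is my assumption, and the helper function is purely
illustrative -- the patch itself decrements the counter inline in
recordMultipleDependencies():

/* Per-backend pg_depend tiebreaker value */
static int32 depcreate = PG_INT32_MAX;	/* INT32_MAX was undeclared in the
										 * cfbot Windows build */

static int32
next_depcreate(void)
{
	int32		result = depcreate--;

	/* avoid signed underflow; mirrors the v8 INT_MIN/INT_MAX check */
	if (depcreate == PG_INT32_MIN)
		depcreate = PG_INT32_MAX;

	return result;
}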

--
Peter Geoghegan

In reply to: Peter Geoghegan (#42)
6 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sat, Dec 1, 2018 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote:

Thanks. Looks like I should have used PG_INT32_MAX.

Attached is v9, which does things that way. There are no interesting
changes, though I have set things up so that a later patch in the
series can add "dynamic prefix truncation" -- I do not include any
such patch in v9, though. I'm going to start a new thread on that
topic, and include the patch there, since it's largely unrelated to
this work, and in any case still isn't in scope for Postgres 12 (the
patch is still experimental, for reasons that are of general
interest). If nothing else, Andrey and Peter E. will probably get a
better idea of why I thought that an insertion scan key was a good
place to put mutable state if they go read that other thread -- there
really was a bigger picture to setting things up that way.
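
For quick reference, the mutable state in question is the group of fields at
the start of BTScanInsertData, as the nbtree patch upthread defines it in
nbtree.h (copied here from the v8 diff); _bt_binsrch() uses the save/restore
flags to inexpensively repeat a binary search when only scantid has changed:

typedef struct BTScanInsertData
{
	/*
	 * Mutable state.  Used by _bt_binsrch() to inexpensively repeat a binary
	 * search when only scantid has changed.
	 */
	bool		savebinsrch;
	bool		restorebinsrch;
	OffsetNumber low;
	OffsetNumber high;

	/* State used to locate a position at the leaf level */
	bool		heapkeyspace;
	bool		nextkey;
	ItemPointer scantid;		/* tiebreaker for scankeys */
	int			keysz;			/* Size of scankeys */
	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
} BTScanInsertData;

typedef BTScanInsertData *BTScanInsert;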

--
Peter Geoghegan

Attachments:

v9-0006-DEBUG-Add-pageinspect-instrumentation.patch (application/x-patch)
From 77585c46bb6d2b14867692334bbeeab3c7a7d15c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v9 6/7] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 68 +++++++++++++++----
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 ++++++
 3 files changed, 79 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index bee1f1c9d9..93660de557 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
@@ -242,6 +243,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -253,9 +255,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[7];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -264,6 +266,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer htid;
+	BTPageOpaque opaque;
 
 	id = PageGetItemId(page, offset);
 
@@ -282,16 +286,53 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = natts;
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		indnkeyatts = rel->rd_index->indnkeyatts;
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (P_ISLEAF(opaque) && offset >= P_FIRSTDATAKEY(opaque))
+		htid = &itup->t_tid;
+	else if (_bt_heapkeyspace(rel))
+		htid = BTreeTupleGetHeapTID(itup);
+	else
+		htid = NULL;
+
+	if (htid)
+		values[j] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(htid),
+							 ItemPointerGetOffsetNumberNoCheck(htid));
+	else
+		values[j] = NULL;
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -365,11 +406,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -396,12 +437,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -481,7 +523,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

v9-0001-Add-pg_depend-index-scan-tiebreaker-column.patch (application/x-patch)
From 85d8c3dd50f3042f74e3d4e8b0ab67b5b038375d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 13 Nov 2018 18:14:23 -0800
Subject: [PATCH v9 1/7] Add pg_depend index scan tiebreaker column.

findDependentObjects() and other code that scans pg_depend evolved to
depend on pg_depend_depender_index and pg_depend_reference_index scan
order.  This is clear from running the regression tests with
"ignore_system_indexes=on": much of the test output changes for
regression tests that happen to have DROP diagnostic messages.  More
importantly, a small number of those test failures involve alternative
messages that are objectively useless or even harmful.  These
regressions all involve internal dependencies of one form or another.
For example, we can go from a HINT that suggests dropping a partitioning
parent table's trigger in order to drop the child's trigger to a HINT
that suggests simply dropping the child table.  This HINT is technically
still correct from the point of view of findDependentObjects(), and yet
it's clearly not the intended user-visible behavior.

Make dependency.c take responsibility for its dependency on scan order
by commenting on it directly.  Ensure that the behavior of
findDependentObjects() is deterministic in the event of duplicates by
adding a new per-backend sequentially assigned number column to
pg_depend.  Both indexes now use this new column as a final trailing
attribute, effectively making it a tiebreaker wherever processing order
happens to matter.  The new column is a per-backend sequentially
assigned number.  New values are assigned in decreasing order.  That
produces the behavior that's already expected from nbtree scans in the
event of duplicate index entries (nbtree usually leaves duplicate index
entries in reverse insertion order at present).  A similar new column is
not needed for pg_shdepend because the aforementioned harmful changes
only occur with cases involving internal dependencies.

The overall effect is to stabilize the behavior of DROP diagnostic
messages, making it possible to avoid the "\set VERBOSITY=terse" hack
that has been used to paper over test instability in the past.  We may
wish to go through the regression tests and remove existing instances of
the "\set VERBOSITY=terse" hack (see also: commit 8e753726), but that's
left for later.

An upcoming patch to make nbtree store duplicate entries in a
well-defined order by treating heap TID as a tiebreaker tuple attribute
more or less flips the order that duplicates appear (the order will
change from approximately descending to perfectly ascending).  That
would cause all sorts of problems for findDependentObjects() callers if
this groundwork was skipped.

Note that adding the new column has no appreciable storage overhead.
pg_depend indexes are made no larger, at least on 64-bit platforms,
because values can fit in a hole that was previously unused due to
alignment -- both pg_depend_depender_index and pg_depend_reference_index
continue to have 24 byte IndexTuples.  There is also no change in the
on-disk size of pg_depend heap relations on 64-bit platforms, for the
same reason.  The MAXALIGN()'d size of pg_depend heap tuples remains 56
bytes (including tuple header overhead).

Discussion: https://postgr.es/m/CAH2-Wzkypv1R+teZrr71U23J578NnTBt2X8+Y=Odr4pOdW1rXg@mail.gmail.com
Discussion: https://postgr.es/m/11852.1501610262%40sss.pgh.pa.us
---
 doc/src/sgml/catalogs.sgml                | 17 ++++++++-
 src/backend/catalog/dependency.c          | 10 ++++++
 src/backend/catalog/pg_depend.c           | 11 +++++-
 src/bin/initdb/initdb.c                   | 44 +++++++++++------------
 src/include/catalog/indexing.h            |  4 +--
 src/include/catalog/pg_depend.h           |  1 +
 src/test/regress/expected/alter_table.out |  2 +-
 src/test/regress/expected/misc_sanity.out |  4 +--
 8 files changed, 64 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 18c38e42de..cef6e40cad 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2937,6 +2937,20 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
       </entry>
      </row>
 
+     <row>
+      <entry><structfield>depcreate</structfield></entry>
+      <entry><type>int4</type></entry>
+      <entry></entry>
+      <entry>
+       A per-backend sequentially assigned number for this dependency
+       relationship.  Used as a tiebreaker in the event of multiple
+       internal dependency relationships of otherwise equal
+       precedence.  Identifiers are assigned in descending order to
+       ensure that the most recently entered dependency is the one
+       referenced by <literal>HINT</literal> fields.
+      </entry>
+     </row>
+
     </tbody>
    </tgroup>
   </table>
@@ -3054,7 +3068,8 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
        that the system itself depends on the referenced object, and so
        that object must never be deleted.  Entries of this type are
        created only by <command>initdb</command>.  The columns for the
-       dependent object contain zeroes.
+       dependent object and the <structfield>depcreate</structfield>
+       column contain zeroes.
       </para>
      </listitem>
     </varlistentry>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 7dfa3278a5..d7889c2cd0 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -532,6 +532,11 @@ findDependentObjects(const ObjectAddress *object,
 	else
 		nkeys = 2;
 
+	/*
+	 * Note that we rely on DependDependerIndexId scan order to make
+	 * diagnostic messages deterministic.  (e.g., objsubid = 0 entries will be
+	 * processed before other entries for the same dependent object.)
+	 */
 	scan = systable_beginscan(*depRel, DependDependerIndexId, true,
 							  NULL, nkeys, key);
 
@@ -727,6 +732,11 @@ findDependentObjects(const ObjectAddress *object,
 	else
 		nkeys = 2;
 
+	/*
+	 * Note that we rely on DependReferenceIndexId scan order to make
+	 * diagnostic messages deterministic.  (e.g., refobjsubid = 0 entries will
+	 * be processed before other entries for the same referenced object.)
+	 */
 	scan = systable_beginscan(*depRel, DependReferenceIndexId, true,
 							  NULL, nkeys, key);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index 2ea05f350b..d7a32c05f4 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -29,6 +29,8 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
+/* Per-backend pg_depend tiebreaker value */
+static int32 depcreate = PG_INT32_MAX;
 
 static bool isObjectPinned(const ObjectAddress *object, Relation rel);
 
@@ -93,7 +95,10 @@ recordMultipleDependencies(const ObjectAddress *depender,
 		{
 			/*
 			 * Record the Dependency.  Note we don't bother to check for
-			 * duplicate dependencies; there's no harm in them.
+			 * duplicate dependencies; there's no harm in them.  Note that
+			 * depcreate ensures deterministic processing among dependencies
+			 * of otherwise equal precedence (e.g., among multiple entries of
+			 * the same refclassid + refobjid + refobjsubid).
 			 */
 			values[Anum_pg_depend_classid - 1] = ObjectIdGetDatum(depender->classId);
 			values[Anum_pg_depend_objid - 1] = ObjectIdGetDatum(depender->objectId);
@@ -104,6 +109,7 @@ recordMultipleDependencies(const ObjectAddress *depender,
 			values[Anum_pg_depend_refobjsubid - 1] = Int32GetDatum(referenced->objectSubId);
 
 			values[Anum_pg_depend_deptype - 1] = CharGetDatum((char) behavior);
+			values[Anum_pg_depend_depcreate - 1] = Int32GetDatum(depcreate--);
 
 			tup = heap_form_tuple(dependDesc->rd_att, values, nulls);
 
@@ -114,6 +120,9 @@ recordMultipleDependencies(const ObjectAddress *depender,
 			CatalogTupleInsertWithInfo(dependDesc, tup, indstate);
 
 			heap_freetuple(tup);
+			/* avoid signed underflow */
+			if (depcreate == PG_INT32_MIN)
+				depcreate = PG_INT32_MAX;
 		}
 	}
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 211a96380e..3679c5f8aa 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1619,55 +1619,55 @@ setup_depend(FILE *cmdfd)
 		"DELETE FROM pg_shdepend;\n\n",
 		"VACUUM pg_shdepend;\n\n",
 
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_class;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_proc;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_type;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_cast;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_constraint;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_conversion;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_attrdef;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_language;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_operator;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_opclass;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_opfamily;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_am;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_amop;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_amproc;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_rewrite;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_trigger;\n\n",
 
 		/*
 		 * restriction here to avoid pinning the public namespace
 		 */
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_namespace "
 		"    WHERE nspname LIKE 'pg%';\n\n",
 
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_ts_parser;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_ts_dict;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_ts_template;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_ts_config;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_collation;\n\n",
 		"INSERT INTO pg_shdepend SELECT 0,0,0,0, tableoid,oid, 'p' "
 		" FROM pg_authid;\n\n",
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index 2359b4c629..1ee7149a49 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -142,9 +142,9 @@ DECLARE_UNIQUE_INDEX(pg_database_datname_index, 2671, on pg_database using btree
 DECLARE_UNIQUE_INDEX(pg_database_oid_index, 2672, on pg_database using btree(oid oid_ops));
 #define DatabaseOidIndexId	2672
 
-DECLARE_INDEX(pg_depend_depender_index, 2673, on pg_depend using btree(classid oid_ops, objid oid_ops, objsubid int4_ops));
+DECLARE_INDEX(pg_depend_depender_index, 2673, on pg_depend using btree(classid oid_ops, objid oid_ops, objsubid int4_ops, depcreate int4_ops));
 #define DependDependerIndexId  2673
-DECLARE_INDEX(pg_depend_reference_index, 2674, on pg_depend using btree(refclassid oid_ops, refobjid oid_ops, refobjsubid int4_ops));
+DECLARE_INDEX(pg_depend_reference_index, 2674, on pg_depend using btree(refclassid oid_ops, refobjid oid_ops, refobjsubid int4_ops, depcreate int4_ops));
 #define DependReferenceIndexId	2674
 
 DECLARE_UNIQUE_INDEX(pg_description_o_c_o_index, 2675, on pg_description using btree(objoid oid_ops, classoid oid_ops, objsubid int4_ops));
diff --git a/src/include/catalog/pg_depend.h b/src/include/catalog/pg_depend.h
index 8f2d95210f..47cb387a6f 100644
--- a/src/include/catalog/pg_depend.h
+++ b/src/include/catalog/pg_depend.h
@@ -61,6 +61,7 @@ CATALOG(pg_depend,2608,DependRelationId)
 	 * field.  See DependencyType in catalog/dependency.h.
 	 */
 	char		deptype;		/* see codes in dependency.h */
+	int32		depcreate;		/* per-backend identifier; tiebreaker */
 } FormData_pg_depend;
 
 /* ----------------
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 38ede0a473..f9d4d38b3d 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -2561,10 +2561,10 @@ DETAIL:  drop cascades to table alter2.t1
 drop cascades to view alter2.v1
 drop cascades to function alter2.plus1(integer)
 drop cascades to type alter2.posint
-drop cascades to operator family alter2.ctype_hash_ops for access method hash
 drop cascades to type alter2.ctype
 drop cascades to function alter2.same(alter2.ctype,alter2.ctype)
 drop cascades to operator alter2.=(alter2.ctype,alter2.ctype)
+drop cascades to operator family alter2.ctype_hash_ops for access method hash
 drop cascades to conversion alter2.ascii_to_utf8
 drop cascades to text search parser alter2.prs
 drop cascades to text search configuration alter2.cfg
diff --git a/src/test/regress/expected/misc_sanity.out b/src/test/regress/expected/misc_sanity.out
index 1d4b000acf..6ee8a9424f 100644
--- a/src/test/regress/expected/misc_sanity.out
+++ b/src/test/regress/expected/misc_sanity.out
@@ -18,8 +18,8 @@ WHERE refclassid = 0 OR refobjid = 0 OR
       deptype NOT IN ('a', 'e', 'i', 'n', 'p') OR
       (deptype != 'p' AND (classid = 0 OR objid = 0)) OR
       (deptype = 'p' AND (classid != 0 OR objid != 0 OR objsubid != 0));
- classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype 
----------+-------+----------+------------+----------+-------------+---------
+ classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype | depcreate 
+---------+-------+----------+------------+----------+-------------+---------+-----------
 (0 rows)
 
 -- **************** pg_shdepend ****************
-- 
2.17.1

Attachment: v9-0003-Pick-nbtree-split-points-discerningly.patch (application/x-patch)
From ac24f73659c3bb6352406379dd3627a868d8c4aa Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 15:51:53 -0700
Subject: [PATCH v9 3/7] Pick nbtree split points discerningly.

Add infrastructure to determine where the earliest difference appears
among a pair of tuples enclosing a candidate split point.  Use this
within _bt_findsplitloc() to weigh how effective suffix truncation will
be at each candidate split point, preferring points that allow more
suffix attributes to be truncated away.  This should not noticeably
affect the balance of free space within each half of the split.

_bt_findsplitloc() is also taught to care about the case where there are
many duplicates, making it hard to find a distinguishing split point.
_bt_findsplitloc() may even conclude that it isn't possible to avoid
filling a page entirely with duplicates, in which case it packs pages
full of duplicates very tightly.

The number of cycles added is not very noticeable, which is important,
since _bt_findsplitloc() is run while an exclusive (leaf page) buffer
lock is held.  We avoid using authoritative insertion scankey
comparisons, unlike suffix truncation proper.
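
To make the "penalty" notion concrete, here is a minimal, self-contained
sketch (hypothetical types and names, not the patch's code) of what is
computed for a leaf-level candidate split point: the number of the first
attribute that differs between the tuples enclosing the split point,
found with a cheap bitwise comparison rather than authoritative scankey
comparisons:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NKEYATTS 3              /* hypothetical 3-attribute index */

typedef struct
{
    bool isnull[NKEYATTS];
    int  values[NKEYATTS];      /* stand-in for arbitrary datums */
} FakeTuple;

/*
 * Return the attribute number of the earliest difference between lastleft
 * and firstright, i.e. how many leading attributes the would-be pivot must
 * keep.  NKEYATTS + 1 means even the last key attribute matches, so a heap
 * TID tiebreaker would be needed.
 */
static int
split_penalty(const FakeTuple *lastleft, const FakeTuple *firstright)
{
    for (int att = 0; att < NKEYATTS; att++)
    {
        if (lastleft->isnull[att] != firstright->isnull[att])
            return att + 1;
        if (!lastleft->isnull[att] &&
            memcmp(&lastleft->values[att], &firstright->values[att],
                   sizeof(int)) != 0)
            return att + 1;
    }
    return NKEYATTS + 1;
}

int
main(void)
{
    FakeTuple a = {{false, false, false}, {1, 7, 42}};
    FakeTuple b = {{false, false, false}, {1, 9, 42}};
    FakeTuple c = {{false, false, false}, {1, 7, 42}};

    printf("penalty(a,b) = %d\n", split_penalty(&a, &b));   /* 2 */
    printf("penalty(a,c) = %d\n", split_penalty(&a, &c));   /* 4: needs heap TID */
    return 0;
}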

This patch is required to credibly assess anything about the performance
of the patch series.  Applying the patches up to and including this
patch in the series is sufficient to see much better space utilization
and space reuse with cases where many duplicates are inserted.  (Cases
resulting in searches for free space among many pages full of
duplicates, where the search inevitably "gets tired" on the master
branch [1]).

[1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
---
 src/backend/access/nbtree/README      |  84 +++-
 src/backend/access/nbtree/nbtinsert.c | 646 +++++++++++++++++++++++---
 src/backend/access/nbtree/nbtutils.c  |  78 ++++
 src/include/access/nbtree.h           |   8 +-
 4 files changed, 745 insertions(+), 71 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 75cb1d1e22..700b940b79 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -165,9 +165,9 @@ Lehman and Yao assume fixed-size keys, but we must deal with
 variable-size keys.  Therefore there is not a fixed maximum number of
 keys per page; we just stuff in as many as will fit.  When we split a
 page, we try to equalize the number of bytes, not items, assigned to
-each of the resulting pages.  Note we must include the incoming item in
-this calculation, otherwise it is possible to find that the incoming
-item doesn't fit on the split page where it needs to go!
+pages (though suffix truncation is also considered).  Note we must include
+the incoming item in this calculation, otherwise it is possible to find
+that the incoming item doesn't fit on the split page where it needs to go!
 
 The Deletion Algorithm
 ----------------------
@@ -669,6 +669,84 @@ variable-length types, such as text.  An opclass support function could
 manufacture the shortest possible key value that still correctly separates
 each half of a leaf page split.
 
+There are sophisticated criteria for choosing a leaf page split point.  The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split.  The
+choice of leaf split point can be thought of as a choice among points
+*between* items on the page to be split, at least if you pretend that the
+incoming tuple was placed on the page already, without provoking a split.
+Choosing the split point between two index tuples with differences that
+appear as early as possible results in truncating away as many suffix
+attributes as possible.  An array of acceptable candidate split points
+(points that balance free space on either side of the split sufficiently
+well) is assembled in a pass over the page to be split, sorted by delta.
+An optimal split point is chosen during a pass over the assembled array.
+There are often several split points that allow the maximum number of
+attributes to be truncated away -- we choose whichever one has the lowest
+free space delta.
+
+Suffix truncation is primarily valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective.  There are cases where suffix truncation can
+leave a B-Tree significantly smaller in size than it would have otherwise
+been without actually making any pivot tuple smaller due to restrictions
+relating to alignment.  The criteria for choosing a leaf page split point
+for suffix truncation are also predictive of future space utilization.
+Furthermore, even truncation that doesn't make pivot tuples smaller still
+prevents pivot tuples from being more restrictive than truly necessary in
+how they describe which values belong on which pages.
+
+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page.  The split point chosen is the one whose resulting
+downlink (to be inserted in the parent) is the smallest available within
+an acceptable range of the fillfactor-wise optimal split point.  This idea
+also comes from the Prefix B-Tree paper.  This process has much in common
+with what happens at the leaf level to make suffix truncation effective.  The overall
+effect is that suffix truncation tends to produce smaller and less
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, less discriminating
+pivot tuples end up in the root page, delaying root page splits.
+
+Suffix truncation may make a pivot tuple *larger* than the non-pivot/leaf
+tuple that it's based on (the first item on the right page), since a heap
+TID must be appended when nothing else distinguishes each side of a leaf
+split.  Truncation cannot simply reuse the leaf level representation: we
+must append an additional attribute, rather than incorrectly leaving a heap
+TID in the generic IndexTuple item pointer field.  (The field is already
+used by pivot tuples to store their downlink, plus some additional
+metadata.)
+
+Adding a heap TID attribute during a leaf page split should only occur when
+the page to be split is entirely full of duplicates (the new item must also
+be a duplicate).  The logic for selecting a split point goes to great
+lengths to avoid heap TIDs in pivots --- "many duplicates" mode almost
+always manages to pick a split point between two user-key-distinct tuples,
+accepting a completely lopsided split if it must.  Once appending a heap
+TID to a split's pivot becomes completely unavoidable, there is a fallback
+strategy --- "single value" mode is used, which makes page splits pack the
+new left half full by using a high fillfactor.  Single value mode leads to
+better overall space utilization when a large number of duplicates are the
+norm, and thereby also limits the total number of pivot tuples with an
+untruncated heap TID attribute.
+
+Avoiding appending a heap TID to a pivot tuple is about much more than just
+saving a single MAXALIGN() quantum in pivot tuples.  It's worth going out
+of our way to avoid having a single value (or single composition of key
+values) span two leaf pages when that isn't truly necessary, since that
+usually leads to many more index scans that must visit a second leaf page.
+In general, pivot tuples should describe the key space in a way that's
+likely to remain balanced over time.  It's important for the description of
+the key space within internal pages to be resistant to short-term duplicate
+bubbles (these are often caused by long-running transactions that make the
+HOT optimization less effective).  Aggressively avoiding heap TIDs in pivot
+tuples can often paradoxically improve leaf page space utilization over the
+long haul, especially when the number of page deletions performed by VACUUM
+is thereby increased (*completely* empty leaf pages are more likely when
+groups of duplicates align with page boundaries).  In any case, getting
+better locality of access for index scans is more important than getting
+optimal space utilization.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 402912db12..3011f89333 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,25 +28,44 @@
 
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
+/* _bt_findsplitloc limits on suffix truncation split interval */
+#define MAX_LEAF_SPLIT_POINTS		9
+#define MAX_INTERNAL_SPLIT_POINTS	3
+
+typedef enum
+{
+	/* strategy to use for a call to FindSplitData */
+	SPLIT_DEFAULT,				/* give some weight to truncation */
+	SPLIT_MANY_DUPLICATES,		/* find minimally distinguishing point */
+	SPLIT_SINGLE_VALUE			/* leave left page almost empty */
+} SplitMode;
+
+typedef struct
+{
+	/* FindSplitData candidate split */
+	int			delta;			/* size delta */
+	bool		newitemonleft;	/* new item on left or right of split */
+	OffsetNumber firstright;	/* split point */
+} SplitPoint;
 
 typedef struct
 {
 	/* context data for _bt_checksplitloc */
+	SplitMode	mode;			/* strategy for deciding split point */
 	Size		newitemsz;		/* size of new item to be inserted */
-	int			fillfactor;		/* needed when splitting rightmost page */
+	double		fillfactor;		/* needed for weighted splits */
+	int			goodenough;		/* good enough left/right space delta */
 	bool		is_leaf;		/* T if splitting a leaf page */
-	bool		is_rightmost;	/* T if splitting a rightmost page */
+	bool		is_weighted;	/* T if weighted (e.g. rightmost) split */
 	OffsetNumber newitemoff;	/* where the new item is to be inserted */
+	bool		hikeyheaptid;	/* T if high key will likely get heap TID */
 	int			leftspace;		/* space available for items on left page */
 	int			rightspace;		/* space available for items on right page */
 	int			olddataitemstotal;	/* space taken by old items */
 
-	bool		have_split;		/* found a valid split? */
-
-	/* these fields valid only if have_split is true */
-	bool		newitemonleft;	/* new item on left or right of best split */
-	OffsetNumber firstright;	/* best split point */
-	int			best_delta;		/* best size delta so far */
+	int			maxsplits;		/* Maximum number of splits */
+	int			nsplits;		/* Current number of splits */
+	SplitPoint *splits;			/* Sorted by delta */
 } FindSplitData;
 
 
@@ -76,12 +95,21 @@ static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft);
-static void _bt_checksplitloc(FindSplitData *state,
+				 SplitMode mode, OffsetNumber newitemoff,
+				 Size newitemsz, IndexTuple newitem, bool *newitemonleft);
+static int _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright, bool newitemonleft,
 				  int dataitemstoleft, Size firstoldonrightsz);
+static OffsetNumber _bt_bestsplitloc(Relation rel, Page page,
+				 FindSplitData *state,
+				 int perfectpenalty,
+				 OffsetNumber newitemoff,
+				 IndexTuple newitem, bool *newitemonleft);
+static int  _bt_perfect_penalty(Relation rel, Page page, FindSplitData *state,
+				 OffsetNumber newitemoff, IndexTuple newitem,
+				 SplitMode *secondmode);
+static int _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split, bool is_leaf);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey,
@@ -994,8 +1022,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* Choose the split point */
-		firstright = _bt_findsplitloc(rel, page,
-									  newitemoff, itemsz,
+		firstright = _bt_findsplitloc(rel, page, SPLIT_DEFAULT,
+									  newitemoff, itemsz, itup,
 									  &newitemonleft);
 
 		/*
@@ -1691,6 +1719,30 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * for it, we might find ourselves with too little room on the page that
  * it needs to go into!)
  *
+ * We also give some weight to suffix truncation in deciding a split point
+ * on leaf pages.  We try to select a point where a distinguishing attribute
+ * appears earlier in the new high key for the left side of the split, in
+ * order to maximize the number of trailing attributes that can be truncated
+ * away.  Initially, only candidate split points that imply an acceptable
+ * balance of free space on each side are considered.  This is even useful
+ * with pages that only have a single (non-TID) attribute, since it's
+ * helpful to avoid appending an explicit heap TID attribute to the new
+ * pivot tuple (high key/downlink) when it cannot actually be truncated.
+ * Note that it is always assumed that caller goes on to perform truncation,
+ * even with pg_upgrade'd indexes where that isn't actually the case.  There
+ * is still a modest benefit to choosing a split location while weighing
+ * suffix truncation: the resulting (untruncated) pivot tuples are
+ * nevertheless more predictive of future space utilization.
+ *
+ * We do all we can to avoid having to append a heap TID in the new high
+ * key.  We may have to call ourselves recursively in many duplicates mode.
+ * This happens when a heap TID would otherwise be appended, but the page
+ * isn't completely full of logical duplicates (there may be as few as two
+ * distinct values).  Many duplicates mode has no hard requirements for
+ * space utilization, though it still keeps the use of space balanced as a
+ * non-binding secondary goal.  This significantly improves fan-out in
+ * practice, at least with most affected workloads.
+ *
  * If the page is the rightmost page on its level, we instead try to arrange
  * to leave the left split page fillfactor% full.  In this way, when we are
  * inserting successively increasing keys (consider sequences, timestamps,
@@ -1699,6 +1751,16 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * This is the same as nbtsort.c produces for a newly-created tree.  Note
  * that leaf and nonleaf pages use different fillfactors.
  *
+ * If called recursively in single value mode, we also try to arrange to
+ * leave the left split page fillfactor% full, though we arrange to use a
+ * fillfactor that's even more left-heavy than the fillfactor used for
+ * rightmost pages.  This greatly helps with space management in cases where
+ * tuples with the same attribute values span multiple pages.  Newly
+ * inserted duplicates will tend to have higher heap TID values, so we'll
+ * end up splitting to the right in the manner of ascending insertions of
+ * monotonically increasing values.  See nbtree/README for more information
+ * about suffix truncation, and how a split point is chosen.
+ *
  * We are passed the intended insert position of the new tuple, expressed as
  * the offsetnumber of the tuple it must go in front of.  (This could be
  * maxoff+1 if the tuple is to go at the end.)
@@ -1729,8 +1791,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 static OffsetNumber
 _bt_findsplitloc(Relation rel,
 				 Page page,
+				 SplitMode mode,
 				 OffsetNumber newitemoff,
 				 Size newitemsz,
+				 IndexTuple newitem,
 				 bool *newitemonleft)
 {
 	BTPageOpaque opaque;
@@ -1740,15 +1804,16 @@ _bt_findsplitloc(Relation rel,
 	FindSplitData state;
 	int			leftspace,
 				rightspace,
-				goodenough,
 				olddataitemstotal,
-				olddataitemstoleft;
+				olddataitemstoleft,
+				perfectpenalty;
 	bool		goodenoughfound;
+	SplitPoint	splits[MAX_LEAF_SPLIT_POINTS];
+	SplitMode	secondmode;
+	OffsetNumber finalfirstright;
 
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
-	newitemsz += sizeof(ItemIdData);
+	maxoff = PageGetMaxOffsetNumber(page);
 
 	/* Total free space available on a btree page, after fixed overhead */
 	leftspace = rightspace =
@@ -1766,18 +1831,60 @@ _bt_findsplitloc(Relation rel,
 	/* Count up total space in data items without actually scanning 'em */
 	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
 
-	state.newitemsz = newitemsz;
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	state.mode = mode;
+	state.newitemsz = newitemsz + sizeof(ItemIdData);
+	state.hikeyheaptid = (mode == SPLIT_SINGLE_VALUE);
 	state.is_leaf = P_ISLEAF(opaque);
-	state.is_rightmost = P_RIGHTMOST(opaque);
-	state.have_split = false;
+	state.is_weighted = P_RIGHTMOST(opaque);
 	if (state.is_leaf)
-		state.fillfactor = RelationGetFillFactor(rel,
-												 BTREE_DEFAULT_FILLFACTOR);
+	{
+		if (state.mode != SPLIT_SINGLE_VALUE)
+		{
+			/* Only used on rightmost page */
+			state.fillfactor = RelationGetFillFactor(rel,
+													 BTREE_DEFAULT_FILLFACTOR) / 100.0;
+		}
+		else
+		{
+			state.fillfactor = BTREE_SINGLEVAL_FILLFACTOR / 100.0;
+			state.is_weighted = true;
+		}
+	}
 	else
-		state.fillfactor = BTREE_NONLEAF_FILLFACTOR;
-	state.newitemonleft = false;	/* these just to keep compiler quiet */
-	state.firstright = 0;
-	state.best_delta = 0;
+	{
+		Assert(state.mode == SPLIT_DEFAULT);
+		/* Only used on rightmost page */
+		state.fillfactor = BTREE_NONLEAF_FILLFACTOR / 100.0;
+	}
+
+	/*
+	 * Set limits on the split interval/number of candidate split points as
+	 * appropriate.  The "Prefix B-Trees" paper refers to this as sigma l for
+	 * leaf splits and sigma b for internal ("branch") splits.  It's hard to
+	 * provide a theoretical justification for the size of the split interval,
+	 * though it's clear that a small split interval improves space
+	 * utilization.
+	 *
+	 * (Also set interval for case when we split a page that has many
+	 * duplicates, or split a page that's entirely full of tuples of a single
+	 * value.  Future locality of access is prioritized over short-term space
+	 * utilization in these cases.)
+	 */
+	if (!state.is_leaf)
+		state.maxsplits = MAX_INTERNAL_SPLIT_POINTS;
+	else if (state.mode == SPLIT_DEFAULT)
+		state.maxsplits = Min(Max(3, maxoff * 0.05), MAX_LEAF_SPLIT_POINTS);
+	else if (state.mode == SPLIT_MANY_DUPLICATES)
+		state.maxsplits = maxoff + 2;
+	else
+		state.maxsplits = 1;
+	state.nsplits = 0;
+	if (state.mode != SPLIT_MANY_DUPLICATES)
+		state.splits = splits;
+	else
+		state.splits = palloc(sizeof(SplitPoint) * state.maxsplits);
+
 	state.leftspace = leftspace;
 	state.rightspace = rightspace;
 	state.olddataitemstotal = olddataitemstotal;
@@ -1786,13 +1893,15 @@ _bt_findsplitloc(Relation rel,
 	/*
 	 * Finding the best possible split would require checking all the possible
 	 * split points, because of the high-key and left-key special cases.
-	 * That's probably more work than it's worth; instead, stop as soon as we
-	 * find a "good-enough" split, where good-enough is defined as an
-	 * imbalance in free space of no more than pagesize/16 (arbitrary...) This
-	 * should let us stop near the middle on most pages, instead of plowing to
-	 * the end.
+	 * That's probably more work than it's worth outside of many duplicates
+	 * mode; instead, stop as soon as we find sufficiently-many "good-enough"
+	 * splits, where good-enough is defined as an imbalance in free space of
+	 * no more than pagesize/16 (arbitrary...) This should let us stop near
+	 * the middle on most pages, instead of plowing to the end.  Many
+	 * duplicates mode must consider all possible choices, and so does not use
+	 * this threshold for anything.
 	 */
-	goodenough = leftspace / 16;
+	state.goodenough = leftspace / 16;
 
 	/*
 	 * Scan through the data items and calculate space usage for a split at
@@ -1800,13 +1909,13 @@ _bt_findsplitloc(Relation rel,
 	 */
 	olddataitemstoleft = 0;
 	goodenoughfound = false;
-	maxoff = PageGetMaxOffsetNumber(page);
 
 	for (offnum = P_FIRSTDATAKEY(opaque);
 		 offnum <= maxoff;
 		 offnum = OffsetNumberNext(offnum))
 	{
 		Size		itemsz;
+		int			delta;
 
 		itemid = PageGetItemId(page, offnum);
 		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
@@ -1815,28 +1924,35 @@ _bt_findsplitloc(Relation rel,
 		 * Will the new item go to left or right of split?
 		 */
 		if (offnum > newitemoff)
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
+			delta = _bt_checksplitloc(&state, offnum, true,
+									  olddataitemstoleft, itemsz);
 
 		else if (offnum < newitemoff)
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
+			delta = _bt_checksplitloc(&state, offnum, false,
+									  olddataitemstoleft, itemsz);
 		else
 		{
 			/* need to try it both ways! */
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
+			(void) _bt_checksplitloc(&state, offnum, true,
+									 olddataitemstoleft, itemsz);
 
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
+			delta = _bt_checksplitloc(&state, offnum, false,
+									  olddataitemstoleft, itemsz);
 		}
 
-		/* Abort scan once we find a good-enough choice */
-		if (state.have_split && state.best_delta <= goodenough)
-		{
+		/* Record when good-enough choice found */
+		if (state.nsplits > 0 && state.splits[0].delta <= state.goodenough)
 			goodenoughfound = true;
+
+		/*
+		 * Abort scan once we've found a good-enough choice, and reach the
+		 * point where we stop finding new good-enough choices.  Don't do this
+		 * in many duplicates mode, though, since that has to be completely
+		 * exhaustive.
+		 */
+		if (goodenoughfound && state.mode != SPLIT_MANY_DUPLICATES &&
+			delta > state.goodenough)
 			break;
-		}
 
 		olddataitemstoleft += itemsz;
 	}
@@ -1846,19 +1962,50 @@ _bt_findsplitloc(Relation rel,
 	 * the old items go to the left page and the new item goes to the right
 	 * page.
 	 */
-	if (newitemoff > maxoff && !goodenoughfound)
+	if (newitemoff > maxoff &&
+		(!goodenoughfound || state.mode == SPLIT_MANY_DUPLICATES))
 		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
 
 	/*
 	 * I believe it is not possible to fail to find a feasible split, but just
 	 * in case ...
 	 */
-	if (!state.have_split)
+	if (state.nsplits == 0)
 		elog(ERROR, "could not find a feasible split point for index \"%s\"",
 			 RelationGetRelationName(rel));
 
-	*newitemonleft = state.newitemonleft;
-	return state.firstright;
+	/*
+	 * Search among acceptable split points for the entry with the lowest
+	 * penalty.  See _bt_split_penalty() for the definition of penalty.  The
+	 * goal here is to increase fan-out, by choosing a split point which is
+	 * amenable to being made smaller by suffix truncation, or is already
+	 * small.
+	 *
+	 * First find lowest possible penalty among acceptable split points -- the
+	 * "perfect" penalty.  This will be passed to _bt_bestsplitloc() if it
+	 * determines that candidate split points are good enough to finish
+	 * default mode split.  Perfect penalty saves _bt_bestsplitloc()
+	 * additional work around calculating penalties.
+	 */
+	perfectpenalty = _bt_perfect_penalty(rel, page, &state, newitemoff,
+										 newitem, &secondmode);
+
+	/* Start second pass over page if _bt_perfect_penalty() told us to */
+	if (secondmode != SPLIT_DEFAULT)
+		return _bt_findsplitloc(rel, page, secondmode, newitemoff, newitemsz,
+								newitem, newitemonleft);
+
+	/*
+	 * Search among acceptable split points for the entry that has the lowest
+	 * penalty, and thus maximizes fan-out.  Sets *newitemonleft for us.
+	 */
+	finalfirstright = _bt_bestsplitloc(rel, page, &state, perfectpenalty,
+									   newitemoff, newitem, newitemonleft);
+	/* Be tidy */
+	if (state.splits != splits)
+		pfree(state.splits);
+
+	return finalfirstright;
 }
 
 /*
@@ -1873,8 +2020,11 @@ _bt_findsplitloc(Relation rel,
  *
  * olddataitemstoleft is the total size of all old items to the left of
  * firstoldonright.
+ *
+ * Returns delta between space that will be left free on left and right side
+ * of split.
  */
-static void
+static int
 _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright,
 				  bool newitemonleft,
@@ -1882,7 +2032,8 @@ _bt_checksplitloc(FindSplitData *state,
 				  Size firstoldonrightsz)
 {
 	int			leftfree,
-				rightfree;
+				rightfree,
+				leftleafheaptidsz;
 	Size		firstrightitemsz;
 	bool		newitemisfirstonright;
 
@@ -1902,15 +2053,38 @@ _bt_checksplitloc(FindSplitData *state,
 
 	/*
 	 * The first item on the right page becomes the high key of the left page;
-	 * therefore it counts against left space as well as right space. When
+	 * therefore it counts against left space as well as right space (we
+	 * cannot assume that suffix truncation will make it any smaller).  When
 	 * index has included attributes, then those attributes of left page high
 	 * key will be truncated leaving that page with slightly more free space.
 	 * However, that shouldn't affect our ability to find valid split
-	 * location, because anyway split location should exists even without high
-	 * key truncation.
+	 * location, since we err in the direction of being pessimistic about free
+	 * space on the left half.  Besides, even when suffix truncation of
+	 * non-TID attributes occurs, there often won't be an entire MAXALIGN()
+	 * quantum in pivot space savings.
 	 */
 	leftfree -= firstrightitemsz;
 
+	/*
+	 * Assume that suffix truncation cannot avoid adding a heap TID to the
+	 * left half's new high key when splitting at the leaf level.  Don't let
+	 * this impact the balance of free space in the common case where adding a
+	 * heap TID is considered very unlikely, though, since there is no reason
+	 * to accept a likely-suboptimal split.
+	 *
+	 * When adding a heap TID seems likely, then actually factor that in to
+	 * delta calculation, rather than just having it as a constraint on
+	 * whether or not a split is acceptable.
+	 */
+	leftleafheaptidsz = 0;
+	if (state->is_leaf)
+	{
+		if (!state->hikeyheaptid)
+			leftleafheaptidsz = sizeof(ItemPointerData);
+		else
+			leftfree -= (int) sizeof(ItemPointerData);
+	}
+
 	/* account for the new item */
 	if (newitemonleft)
 		leftfree -= (int) state->newitemsz;
@@ -1926,20 +2100,23 @@ _bt_checksplitloc(FindSplitData *state,
 			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
 
 	/*
-	 * If feasible split point, remember best delta.
+	 * If feasible split point with lower delta than that of the most
+	 * marginal split point so far, or we haven't run out of space for split points,
+	 * remember it.
 	 */
-	if (leftfree >= 0 && rightfree >= 0)
+	if (leftfree - leftleafheaptidsz >= 0 && rightfree >= 0)
 	{
 		int			delta;
 
-		if (state->is_rightmost)
+		if (state->is_weighted)
 		{
 			/*
-			 * If splitting a rightmost page, try to put (100-fillfactor)% of
-			 * free space on left page. See comments for _bt_findsplitloc.
+			 * If splitting a rightmost page, or in single value mode, try to
+			 * put (100-fillfactor)% of free space on left page. See comments
+			 * for _bt_findsplitloc.
 			 */
 			delta = (state->fillfactor * leftfree)
-				- ((100 - state->fillfactor) * rightfree);
+				- ((1.0 - state->fillfactor) * rightfree);
 		}
 		else
 		{
@@ -1949,14 +2126,349 @@ _bt_checksplitloc(FindSplitData *state,
 
 		if (delta < 0)
 			delta = -delta;
-		if (!state->have_split || delta < state->best_delta)
+		/*
+		 * Optimization: Don't recognize differences among marginal split
+		 * points that are unlikely to end up being used anyway.
+		 *
+		 * We cannot do this in many duplicates mode, because that hurts cases
+		 * where there are a small number of available distinguishing split
+		 * points, and consistently picking the least worst choice among them
+		 * matters. (e.g., a non-unique index whose leaf pages each contain a
+		 * small number of distinct values, with each value duplicated a
+		 * uniform number of times.)
+		 */
+		if (delta > state->goodenough && state->mode != SPLIT_MANY_DUPLICATES)
+			delta = state->goodenough + 1;
+		if (state->nsplits < state->maxsplits ||
+			delta < state->splits[state->nsplits - 1].delta)
 		{
-			state->have_split = true;
-			state->newitemonleft = newitemonleft;
-			state->firstright = firstoldonright;
-			state->best_delta = delta;
+			SplitPoint	newsplit;
+			int			j;
+
+			newsplit.delta = delta;
+			newsplit.newitemonleft = newitemonleft;
+			newsplit.firstright = firstoldonright;
+
+			/*
+			 * Make space at the end of the state array for new candidate
+			 * split point if we haven't already reached the maximum number of
+			 * split points.
+			 */
+			if (state->nsplits < state->maxsplits)
+				state->nsplits++;
+
+			/*
+			 * Replace the final item in the nsplits-wise array.  The final
+			 * item is either a garbage still-uninitialized entry, or the most
+			 * marginal real entry when we already have as many split points
+			 * as we're willing to consider.
+			 */
+			for (j = state->nsplits - 1;
+				 j > 0 && state->splits[j - 1].delta > newsplit.delta;
+				 j--)
+			{
+				state->splits[j] = state->splits[j - 1];
+			}
+			state->splits[j] = newsplit;
+		}
+
+		return delta;
+	}
+
+	return INT_MAX;
+}
+
+/*
+ * Subroutine to find the "best" split point among an array of acceptable
+ * candidate split points (those that avoid an excessively high delta between
+ * the space left free on the left and right halves).  The "best"
+ * split point is the split point with the lowest penalty, which is an
+ * abstract idea whose definition varies depending on whether we're splitting
+ * at the leaf level, or an internal level.  See _bt_split_penalty() for the
+ * definition.
+ *
+ * "perfectpenalty" is assumed to be the lowest possible penalty among
+ * candidate split points.  This allows us to return early without wasting
+ * cycles on calculating the first differing attribute for all candidate
+ * splits when that clearly cannot improve our choice.  This optimization is
+ * important for several common cases, including insertion into a primary key
+ * index on an auto-incremented or monotonically increasing integer column.
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating if new item is on left of split
+ * point.
+ */
+static OffsetNumber
+_bt_bestsplitloc(Relation rel,
+				 Page page,
+				 FindSplitData *state,
+				 int perfectpenalty,
+				 OffsetNumber newitemoff,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	int			bestpenalty,
+				lowsplit;
+
+	/*
+	 * No point in calculating penalty when there's only one choice.  Note
+	 * that single value mode always has one choice.
+	 */
+	if (state->nsplits == 1)
+	{
+		*newitemonleft = state->splits[0].newitemonleft;
+		return state->splits[0].firstright;
+	}
+
+	Assert(state->mode == SPLIT_DEFAULT ||
+		   state->mode == SPLIT_MANY_DUPLICATES);
+	bestpenalty = INT_MAX;
+	lowsplit = 0;
+	for (int i = lowsplit; i < state->nsplits; i++)
+	{
+		int			penalty;
+
+		penalty = _bt_split_penalty(rel, page, newitemoff, newitem,
+									state->splits + i, state->is_leaf);
+
+		if (penalty <= perfectpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+			break;
+		}
+
+		if (penalty < bestpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
 		}
 	}
+
+	*newitemonleft = state->splits[lowsplit].newitemonleft;
+	return state->splits[lowsplit].firstright;
+}
+
+/*
+ * Subroutine to find the lowest possible penalty for any acceptable candidate
+ * split point.  This may be lower than any real penalty for any of the
+ * candidate split points, in which case the optimization is ineffective.
+ * Split penalties are generally discrete rather than continuous, so an
+ * actually-obtainable penalty is common.
+ *
+ * This is also a convenient point to decide to either finish splitting
+ * the page using the default strategy, or, alternatively, to do a second pass
+ * over page using a different strategy.  (This only happens with leaf pages.)
+ */
+static int
+_bt_perfect_penalty(Relation rel, Page page, FindSplitData *state,
+					OffsetNumber newitemoff, IndexTuple newitem,
+					SplitMode *secondmode)
+{
+	ItemId		itemid;
+	OffsetNumber center;
+	IndexTuple	leftmost,
+				rightmost;
+	int			perfectpenalty;
+	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+	/* Assume that a second pass over page won't be required for now */
+	*secondmode = SPLIT_DEFAULT;
+
+	/*
+	 * There are a much smaller number of candidate split points when
+	 * splitting an internal page, so we can afford to be exhaustive.  Only
+	 * give up when pivot that will be inserted into parent is as small as
+	 * possible.
+	 */
+	if (!state->is_leaf)
+		return MAXALIGN(sizeof(IndexTupleData) + 1);
+
+	/*
+	 * During a many duplicates pass over page, we settle for a "perfect"
+	 * split point that merely avoids appending a heap TID in new pivot.
+	 * Appending a heap TID is harmful enough to fan-out that it's worth
+	 * avoiding at all costs, but it doesn't make sense to go to those lengths
+	 * to also be able to truncate an extra, earlier attribute.
+	 *
+	 * Single value mode splits only occur when appending a heap TID was
+	 * already deemed necessary.  Don't waste any more cycles trying to avoid
+	 * it.
+	 */
+	if (state->mode == SPLIT_MANY_DUPLICATES)
+		return indnkeyatts;
+	else if (state->mode == SPLIT_SINGLE_VALUE)
+		return indnkeyatts + 1;
+
+	/*
+	 * Complicated though common case -- leaf page default mode split.
+	 *
+	 * Iterate from the end of split array to the start, in search of the
+	 * firstright-wise leftmost and rightmost entries among acceptable split
+	 * points.  The split point with the lowest delta is at the start of the
+	 * array.  It is deemed to be the split point whose firstright offset is
+	 * at the center.  Split points with firstright offsets at both the left
+	 * and right extremes among acceptable split points will be found at the
+	 * end of caller's array.
+	 */
+	leftmost = NULL;
+	rightmost = NULL;
+	center = state->splits[0].firstright;
+
+	/*
+	 * Leaf split points can be thought of as points _between_ tuples on the
+	 * original unsplit page image, at least if you pretend that the incoming
+	 * tuple is already on the page to be split (imagine that the original
+	 * unsplit page actually had enough space to fit the incoming tuple).  The
+	 * rightmost tuple is the tuple that is immediately to the right of a
+	 * split point that is itself rightmost.  Likewise, the leftmost tuple is
+	 * the tuple to the left of the leftmost split point.  It's important that
+	 * many duplicates mode has every opportunity to avoid picking a split
+	 * point that requires that suffix truncation append a heap TID to new
+	 * pivot tuple.
+	 *
+	 * When there are very few candidates, no sensible comparison can be made
+	 * here, resulting in caller selecting lowest delta/the center split point
+	 * by default.  Typically, leftmost and rightmost tuples will be located
+	 * almost immediately.
+	 */
+	perfectpenalty = indnkeyatts;
+	for (int j = state->nsplits - 1; j > 1; j--)
+	{
+		SplitPoint *split = state->splits + j;
+
+		if (!leftmost && split->firstright <= center)
+		{
+			if (split->newitemonleft && newitemoff == split->firstright)
+				leftmost = newitem;
+			else
+			{
+				itemid = PageGetItemId(page,
+									   OffsetNumberPrev(split->firstright));
+				leftmost = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+
+		if (!rightmost && split->firstright >= center)
+		{
+			if (!split->newitemonleft && newitemoff == split->firstright)
+				rightmost = newitem;
+			else
+			{
+				itemid = PageGetItemId(page, split->firstright);
+				rightmost = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+
+		if (leftmost && rightmost)
+		{
+			Assert(leftmost != rightmost);
+			perfectpenalty = _bt_leave_natts_fast(rel, leftmost, rightmost);
+			break;
+		}
+	}
+
+	/*
+	 * Work out which type of second pass caller must perform when even a
+	 * "perfect" penalty fails to avoid appending a heap TID to new pivot
+	 * tuple
+	 */
+	if (perfectpenalty > indnkeyatts)
+	{
+		BTPageOpaque opaque;
+		OffsetNumber maxoff;
+		int			outerpenalty;
+
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		if (P_FIRSTDATAKEY(opaque) == newitemoff)
+			leftmost = newitem;
+		else
+		{
+			itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+			leftmost = (IndexTuple) PageGetItem(page, itemid);
+		}
+
+		if (newitemoff > maxoff)
+			rightmost = newitem;
+		else
+		{
+			itemid = PageGetItemId(page, maxoff);
+			rightmost = (IndexTuple) PageGetItem(page, itemid);
+		}
+
+		Assert(leftmost != rightmost);
+		outerpenalty = _bt_leave_natts_fast(rel, leftmost, rightmost);
+
+		/*
+		 * If page has many duplicates but is not entirely full of duplicates,
+		 * a many duplicates mode pass will be performed.  If page is entirely
+		 * full of duplicates and it appears that the duplicates have been
+		 * inserted in sequential order (i.e. heap TID order), a single value
+		 * mode pass will be performed.
+		 *
+		 * Instruct caller to continue with original default mode split when
+		 * new duplicate item would not go at the end of the page.
+		 * Out-of-order duplicate insertions predict further inserts towards
+		 * the left/middle of the original page's keyspace.  Evenly sharing
+		 * space among each half of the split avoids pathological performance.
+		 */
+		if (outerpenalty <= indnkeyatts)
+			*secondmode = SPLIT_MANY_DUPLICATES;
+		else if (maxoff < newitemoff)
+			*secondmode = SPLIT_SINGLE_VALUE;
+		else
+			*secondmode = SPLIT_DEFAULT;
+	}
+
+	return perfectpenalty;
+}
+
+/*
+ * Subroutine to find penalty for caller's candidate split point.
+ *
+ * On leaf pages, penalty is the attribute number that distinguishes each side
+ * of a split.  It's the last attribute that needs to be included in new high
+ * key for left page.  It can be greater than the number of key attributes in
+ * cases where a heap TID needs to be appended during truncation.
+ *
+ * On internal pages, penalty is simply the size of the first item on the
+ * right half of the split (excluding ItemId overhead) which becomes the new
+ * high key for the left page.  Internal page splits always use default mode.
+ */
+static int
+_bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split, bool is_leaf)
+{
+	ItemId		itemid;
+	IndexTuple	firstright;
+	IndexTuple	lastleft;
+
+	if (!split->newitemonleft && newitemoff == split->firstright)
+		firstright = newitem;
+	else
+	{
+		itemid = PageGetItemId(page, split->firstright);
+		firstright = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	if (!is_leaf)
+		return IndexTupleSize(firstright);
+
+	if (split->newitemonleft && newitemoff == split->firstright)
+		lastleft = newitem;
+	else
+	{
+		OffsetNumber lastleftoff;
+
+		lastleftoff = OffsetNumberPrev(split->firstright);
+		itemid = PageGetItemId(page, lastleftoff);
+		lastleft = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	Assert(lastleft != firstright);
+	return _bt_leave_natts_fast(rel, lastleft, firstright);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c0d46e2beb..2c1be82acb 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -22,6 +22,7 @@
 #include "access/relscan.h"
 #include "miscadmin.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -2378,6 +2379,83 @@ _bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	return leavenatts;
 }
 
+/*
+ * _bt_leave_natts_fast - fast, approximate variant of _bt_leave_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location.  A naive bitwise approach to datum comparisons is used to
+ * save cycles.  This is inherently approximate, but usually provides the same
+ * answer as the authoritative approach that _bt_leave_natts takes, since the
+ * vast majority of types in Postgres cannot be equal according to any
+ * available opclass unless they're bitwise equal.
+ *
+ * Testing has shown that an approach involving treating the tuple as a
+ * decomposed binary string would work almost as well as the approach taken
+ * here.  It would also be faster.  It might actually be necessary to go that
+ * way in the future, if suffix truncation is made sophisticated enough to
+ * truncate at a finer granularity (i.e. truncate within an attribute, rather
+ * than just truncating away whole attributes).  The current approach isn't
+ * markedly slower, since it works particularly well with the "perfect
+ * penalty" optimization (there are fewer, more expensive calls here).  It
+ * also works with INCLUDE indexes (indexes with non-key attributes) without
+ * any special effort.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+int
+_bt_leave_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			leavenatts;
+
+	/*
+	 * Using authoritative comparisons makes no difference in almost all
+	 * cases. However, there are a small number of shipped opclasses where
+	 * there might occasionally be an inconsistency between the answers given
+	 * by this function and _bt_leave_natts().  This includes numeric_ops,
+	 * since display scale might vary among logically equal datums.
+	 * Case-insensitive collations may also be interesting.
+	 *
+	 * This is assumed to be okay, since there is no risk that inequality will
+	 * look like equality.  Suffix truncation may be less effective than it
+	 * could be in these narrow cases, but it should be impossible for caller
+	 * to spuriously perform a second pass to find a split location, where
+	 * evenly splitting the page is given secondary importance.
+	 */
+#ifdef AUTHORITATIVE_COMPARE_TEST
+	return _bt_leave_natts(rel, lastleft, firstright, false);
+#endif
+
+	leavenatts = 1;
+	for (int attnum = 1; attnum <= keysz; attnum++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		att = TupleDescAttr(itupdesc, attnum - 1);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			break;
+
+		leavenatts++;
+	}
+
+	return leavenatts;
+}
+
 /*
  *  _bt_check_natts() -- Verify tuple has expected number of attributes.
  *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4f66ab5845..d7fa9e8c49 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -144,11 +144,15 @@ typedef struct BTMetaPageData
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
  * a rightmost page; when splitting non-rightmost pages we try to
- * divide the data equally.
+ * divide the data equally.  When splitting a page that's entirely
+ * filled with a single value (duplicates), the leaf-page
+ * fillfactor is overridden, and is applied regardless of whether
+ * the page is a rightmost page.
  */
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#define BTREE_SINGLEVAL_FILLFACTOR	99
 
 /*
  *	In general, the btree code tries to localize its knowledge about
@@ -708,6 +712,8 @@ extern bool btproperty(Oid index_oid, int attno,
 		   bool *res, bool *isnull);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 			 IndexTuple firstright, bool build);
+extern int _bt_leave_natts_fast(Relation rel, IndexTuple lastleft,
+					 IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 		   bool needheaptidspace, Page page, IndexTuple newtup);
-- 
2.17.1

Attachment: v9-0004-Add-split-at-new-tuple-page-split-optimization.patch (application/x-patch)
From 12fab394260956b041fd601803d61829fcf5c7c0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 16:48:08 -0700
Subject: [PATCH v9 4/7] Add split-at-new-tuple page split optimization.

Add additional heuristics to the algorithm for locating an optimal split
location.  New logic identifies localized monotonically increasing
values by recognizing cases where a newly inserted tuple has a heap TID
that's slightly greater than that of the existing tuple to the immediate
left, but isn't just a duplicate.  It can greatly help space utilization
to split between two groups of localized monotonically increasing
values.
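
For illustration, the "slightly greater heap TID" condition reduces to a
simple adjacency test on (heap block, offset) pairs.  A minimal sketch of
that test with hypothetical types, mirroring what the patch's
_bt_adjacenthtid() does and assuming the first usable offset on a heap page
is 1 (FirstOffsetNumber):

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
	uint32_t	block;			/* heap block number */
	uint16_t	offset;			/* line pointer offset within the block */
} HeapTidSketch;

/*
 * True when "high" was very likely created immediately after "low": either
 * the next offset on the same heap block, or the first offset of the next
 * heap block.  Deliberately not commutative.
 */
bool
heap_tid_adjacent(HeapTidSketch low, HeapTidSketch high)
{
	if (low.block == high.block && low.offset + 1 == high.offset)
		return true;
	if (low.block + 1 == high.block && high.offset == 1)
		return true;
	return false;
}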

This is very similar to the long established fillfactor optimization
used during rightmost page splits, where we usually leave the new left
side of the split 90% full.  Split-at-new-tuple page splits target
essentially the same case, except that the splits are at the rightmost
point of a localized grouping of values, rather than at the rightmost
point of the entire key space.
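
Concretely, when the heuristic fires, the left page's target fillfactor is
interpolated from where the incoming tuple would have landed on the page
(see the _bt_findsplitloc() changes below).  A rough standalone restatement
of that interpolation, using hypothetical names for the sketch's constants
and parameters (BTREE_DEFAULT_FILLFACTOR is 90 and BTREE_MIN_FILLFACTOR is
10 in nbtree.h):

#define SKETCH_DEFAULT_FILLFACTOR	90
#define SKETCH_MIN_FILLFACTOR		10

/*
 * newitemoff is where the incoming tuple would be inserted (1-based),
 * maxoff is the number of existing items, and firstdatakey is the offset
 * of the first data item on the page.
 */
double
split_at_new_item_fillfactor(int newitemoff, int maxoff, int firstdatakey)
{
	if (newitemoff > maxoff)
		return SKETCH_DEFAULT_FILLFACTOR / 100.0;	/* right-heavy split */
	if (newitemoff == firstdatakey)
		return SKETCH_MIN_FILLFACTOR / 100.0;		/* left-heavy split */

	/* otherwise aim the split point at roughly the new item's position */
	return (double) newitemoff / ((double) maxoff + 1);
}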

This enhancement has been demonstrated to be very effective at avoiding
index bloat when initial bulk INSERTs for the TPC-C benchmark are run.
Evidently, the primary keys for all of the largest indexes in the TPC-C
schema are populated through localized, monotonically increasing values:

Master
======

order_line_pkey: 774 MB
stock_pkey: 181 MB
idx_customer_name: 107 MB
oorder_pkey: 78 MB
customer_pkey: 75 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 60 MB
new_order_pkey: 22 MB
item_pkey: 2216 kB
district_pkey: 40 kB
warehouse_pkey: 24 kB

Patch series, up to and including this commit
=============================================

order_line_pkey: 451 MB
stock_pkey: 114 MB
idx_customer_name: 105 MB
oorder_pkey: 45 MB
customer_pkey: 48 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 61 MB
new_order_pkey: 13 MB
item_pkey: 2216 kB
district_pkey: 40 kB
warehouse_pkey: 24 kB

Without this patch, but with all previous patches in the series, a much
more modest reduction in the volume of bloat occurs when the same test
case is run.  There is a reduction in the size of the largest index (the
order line primary key) of ~5% of its original size, whereas we see a
reduction of ~42% here.  (Note that the patch series generally has very
little advantage over master if the indexes are rebuilt via a REINDEX,
with or without this later commit.)

I (Peter Geoghegan) will provide reviewers with a convenient copy of
this test data if asked.  It comes from the oltpbench fair-use
implementation of TPC-C [1], but the same issue has independently been
observed with the BenchmarkSQL implementation of TPC-C [2].

Note that this commit also recognizes and prevents bloat with
monotonically *decreasing* tuple insertions (e.g., single-DESC-attribute
index on a date column).  Affected cases will typically leave their
index structure slightly smaller than an equivalent monotonically
increasing case would.

[1] http://oltpbenchmark.com
[2] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c
---
 src/backend/access/nbtree/nbtinsert.c | 186 +++++++++++++++++++++++++-
 1 file changed, 184 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 3011f89333..52c274eca6 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -100,6 +100,8 @@ static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
 static int _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright, bool newitemonleft,
 				  int dataitemstoleft, Size firstoldonrightsz);
+static bool _bt_dosplitatnewitem(Relation rel, Page page,
+					 OffsetNumber newitemoff, IndexTuple newitem);
 static OffsetNumber _bt_bestsplitloc(Relation rel, Page page,
 				 FindSplitData *state,
 				 int perfectpenalty,
@@ -110,6 +112,7 @@ static int  _bt_perfect_penalty(Relation rel, Page page, FindSplitData *state,
 				 SplitMode *secondmode);
 static int _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
 				  IndexTuple newitem, SplitPoint *split, bool is_leaf);
+static bool _bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey,
@@ -1749,7 +1752,13 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * etc) we will end up with a tree whose pages are about fillfactor% full,
  * instead of the 50% full result that we'd get without this special case.
  * This is the same as nbtsort.c produces for a newly-created tree.  Note
- * that leaf and nonleaf pages use different fillfactors.
+ * that leaf and nonleaf pages use different fillfactors.  Note also that
+ * the fillfactor% is determined dynamically when _bt_dosplitatnewitem()
+ * indicates that there are localized monotonically increasing insertions,
+ * or monotonically decreasing (DESC order) insertions. (This can only
+ * happen with the default strategy, and should be thought of as a variant
+ * of the fillfactor% special case that is applied only when inserting into
+ * non-rightmost pages.)
  *
  * If called recursively in single value mode, we also try to arrange to
  * leave the left split page fillfactor% full, though we arrange to use a
@@ -1839,7 +1848,28 @@ _bt_findsplitloc(Relation rel,
 	state.is_weighted = P_RIGHTMOST(opaque);
 	if (state.is_leaf)
 	{
-		if (state.mode != SPLIT_SINGLE_VALUE)
+		/*
+		 * Consider split at new tuple optimization.  See
+		 * _bt_dosplitatnewitem() for an explanation.
+		 */
+		if (state.mode == SPLIT_DEFAULT && !P_RIGHTMOST(opaque) &&
+			_bt_dosplitatnewitem(rel, page, newitemoff, newitem))
+		{
+			/*
+			 * fillfactor% is dynamically set through interpolation of the
+			 * new/incoming tuple's offset position
+			 */
+			if (newitemoff > maxoff)
+				state.fillfactor = (double) BTREE_DEFAULT_FILLFACTOR / 100.0;
+			else if (newitemoff == P_FIRSTDATAKEY(opaque))
+				state.fillfactor = (double) BTREE_MIN_FILLFACTOR / 100.0;
+			else
+				state.fillfactor =
+					((double) newitemoff / (((double) maxoff + 1)));
+
+			state.is_weighted = true;
+		}
+		else if (state.mode != SPLIT_SINGLE_VALUE)
 		{
 			/* Only used on rightmost page */
 			state.fillfactor = RelationGetFillFactor(rel,
@@ -2178,6 +2208,126 @@ _bt_checksplitloc(FindSplitData *state,
 	return INT_MAX;
 }
 
+/*
+ * Subroutine to determine whether or not the page should be split at
+ * approximately the point that the new/incoming item would have been
+ * inserted.
+ *
+ * This routine infers two distinct cases in which splitting around the new
+ * item's insertion point is likely to lead to better space utilization over
+ * time:
+ *
+ * - Composite indexes that consist of one or more leading columns that
+ *   describe some grouping, plus a trailing, monotonically increasing
+ *   column.  If there happened to only be one grouping then the traditional
+ *   rightmost page split default fillfactor% would be used to good effect,
+ *   so it seems worth recognizing this case.  This usage pattern is
+ *   prevalent in the TPC-C benchmark, and is assumed to be common in real
+ *   world applications.
+ *
+ * - DESC-ordered insertions, including DESC-ordered single (non-heap-TID)
+ *   key attribute indexes.  We don't want the performance of explicitly
+ *   DESC-ordered indexes to be out of line with an equivalent ASC-ordered
+ *   index.  Also, there may be organic cases where items are continually
+ *   inserted in DESC order for an index with ASC sort order.
+ *
+ * Caller uses fillfactor% rather than using the new item offset directly
+ * because it allows suffix truncation to be applied using the usual
+ * criteria, which can still be helpful.  This approach is also more
+ * maintainable, since restrictions on split points can be handled in the
+ * usual way.
+ *
+ * Localized insert points are inferred here by observing that neighboring
+ * heap TIDs are "adjacent".  For example, if the new item has distinct key
+ * attribute values to the existing item that belongs to its immediate left,
+ * and the item to its left has a heap TID whose offset is exactly one less
+ * than the new item's offset, then caller is told to use its new-item-split
+ * strategy.  It isn't of much consequence if this routine incorrectly
+ * infers that an interesting case is taking place, provided that that
+ * doesn't happen very often.  In particular, it should not be possible to
+ * construct a test case where the routine consistently does the wrong
+ * thing.  Since heap TID "adjacency" is such a delicate condition, and
+ * since there is no reason to imagine that random insertions should ever
+ * consistently leave new tuples at the first or last position on the page
+ * when a split is triggered, that will never happen.
+ *
+ * Note that we avoid using the split-at-new fillfactor% when we'd have to
+ * append a heap TID during suffix truncation.  We also insist that there
+ * are no varwidth attributes or NULL attribute values in new item, since
+ * that invalidates interpolating from the new item offset.  Besides,
+ * varwidths generally imply the use of datatypes where ordered insertions
+ * are not a naturally occurring phenomenon.
+ */
+static bool
+_bt_dosplitatnewitem(Relation rel, Page page, OffsetNumber newitemoff,
+					 IndexTuple newitem)
+{
+	ItemId		itemid;
+	OffsetNumber maxoff;
+	BTPageOpaque opaque;
+	IndexTuple	tup;
+	int16		nkeyatts;
+
+	if (IndexTupleHasNulls(newitem) || IndexTupleHasVarwidths(newitem))
+		return false;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Avoid optimization entirely on pages with large items */
+	if (maxoff <= 3)
+		return false;
+
+	nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+	/*
+	 * When heap TIDs appear in DESC order, consider left-heavy split.
+	 *
+	 * Accept left-heavy split when new item, which will be inserted at first
+	 * data offset, has adjacent TID to extant item at that position.
+	 */
+	if (newitemoff == P_FIRSTDATAKEY(opaque))
+	{
+		itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+		tup = (IndexTuple) PageGetItem(page, itemid);
+
+		return _bt_adjacenthtid(&tup->t_tid, &newitem->t_tid) &&
+			_bt_leave_natts_fast(rel, tup, newitem) <= nkeyatts;
+	}
+
+	/* Single key indexes only use DESC optimization */
+	if (nkeyatts == 1)
+		return false;
+
+	/*
+	 * When tuple heap TIDs appear in ASC order, consider right-heavy split,
+	 * even though this may not be the right-most page.
+	 *
+	 * Accept right-heavy split when new item, which belongs after any
+	 * existing page offset, has adjacent TID to extant item that's the last
+	 * on the page.
+	 */
+	if (newitemoff > maxoff)
+	{
+		itemid = PageGetItemId(page, maxoff);
+		tup = (IndexTuple) PageGetItem(page, itemid);
+
+		return _bt_adjacenthtid(&tup->t_tid, &newitem->t_tid) &&
+			_bt_leave_natts_fast(rel, tup, newitem) <= nkeyatts;
+	}
+
+	/*
+	 * When new item is approximately in the middle of the page, look for
+	 * adjacency among new item, and extant item that belongs to the left of
+	 * the new item in the keyspace.
+	 */
+	itemid = PageGetItemId(page, OffsetNumberPrev(newitemoff));
+	tup = (IndexTuple) PageGetItem(page, itemid);
+
+	return _bt_adjacenthtid(&tup->t_tid, &newitem->t_tid) &&
+		_bt_leave_natts_fast(rel, tup, newitem) <= nkeyatts;
+}
+
 /*
  * Subroutine to find the "best" split point among an array of acceptable
  * candidate split points that split without there being an excessively high
@@ -2471,6 +2621,38 @@ _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
 	return _bt_leave_natts_fast(rel, lastleft, firstright);
 }
 
+/*
+ * Subroutine for determining if two heap TIDs are "adjacent".
+ *
+ * Adjacent means that the high TID is very likely to have been inserted into
+ * the heap relation immediately after the low TID, probably by the same
+ * transaction, and probably not through heap_update().  This is not a
+ * commutative condition.
+ */
+static bool
+_bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid)
+{
+	BlockNumber lowblk,
+				highblk;
+	OffsetNumber lowoff,
+				highoff;
+
+	lowblk = ItemPointerGetBlockNumber(lowhtid);
+	highblk = ItemPointerGetBlockNumber(highhtid);
+	lowoff = ItemPointerGetOffsetNumber(lowhtid);
+	highoff = ItemPointerGetOffsetNumber(highhtid);
+
+	/* When heap blocks match, second offset should be one up */
+	if (lowblk == highblk && OffsetNumberNext(lowoff) == highoff)
+		return true;
+
+	/* When heap block one up, second offset should be FirstOffsetNumber */
+	if (lowblk + 1 == highblk && highoff == FirstOffsetNumber)
+		return true;
+
+	return false;
+}
+
 /*
  * _bt_insert_parent() -- Insert downlink into parent after a page split.
  *
-- 
2.17.1

Attachment: v9-0002-Treat-heap-TID-as-part-of-the-nbtree-key-space.patch (application/x-patch)
From bcc27562613a013d42e15cd266498d2ded6ada2a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v9 2/7] Treat heap TID as part of the nbtree key space.

Make nbtree treat all index tuples as having a trailing heap TID key
attribute.  Heap TID becomes a first class part of the key space on all
levels of the tree.  Index searches can distinguish duplicates by heap
TID, at least in principle.  Secondary index insertions will descend
straight to the leaf page that they'll insert onto (unless there is a
concurrent page split).  This general approach has numerous benefits for
performance, and is prerequisite to teaching VACUUM to perform "retail
index tuple deletion".
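
Informally, the new ordering rule can be thought of as an ordinary
multi-column comparison with one extra, implicit trailing column.  A
minimal sketch with hypothetical types (the real patch threads this through
insertion scan keys and _bt_compare(), and only consults the heap TID once
all user key attributes compare as equal):

#include <stdint.h>
#include <stdio.h>

typedef struct
{
	uint32_t	block;
	uint16_t	offset;
} TidSketch;

/* three-way heap TID comparison: block number first, then offset */
static int
tid_cmp(TidSketch a, TidSketch b)
{
	if (a.block != b.block)
		return a.block < b.block ? -1 : 1;
	if (a.offset != b.offset)
		return a.offset < b.offset ? -1 : 1;
	return 0;
}

/*
 * Compare two leaf tuples: user key attributes first (plain ints here, for
 * illustration), with the heap TID acting as an implicit last attribute
 * that breaks ties among logical duplicates.
 */
static int
leaf_tuple_cmp(const int *keys1, TidSketch tid1,
			   const int *keys2, TidSketch tid2, int nkeyatts)
{
	for (int i = 0; i < nkeyatts; i++)
	{
		if (keys1[i] != keys2[i])
			return keys1[i] < keys2[i] ? -1 : 1;
	}
	return tid_cmp(tid1, tid2);
}

int
main(void)
{
	int			key[] = {5};
	TidSketch	tid1 = {10, 3};
	TidSketch	tid2 = {10, 4};

	/* equal user keys; the heap TID breaks the tie, so this prints -1 */
	printf("%d\n", leaf_tuple_cmp(key, tid1, key, tid2, 1));
	return 0;
}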

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will generally truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially when there are several attributes in an index.  Truncation
can only occur at the attribute granularity, which isn't particularly
effective, but works well enough for now.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tie-breaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with BTREE_VERSIONs 2 and
3, while also enforcing the newer/more strict invariants with
BTREE_VERSION 4 indexes.

We no longer allow a search for free space among multiple pages full of
duplicates to "get tired", except when needed to preserve compatibility
with earlier versions.  This has significant benefits for free space
management in secondary indexes on low cardinality attributes.  However,
without the next commit in the patch series (without having "single
value" mode and "many duplicates" mode within _bt_findsplitloc()), these
cases will be significantly regressed, since they'll naively perform
50:50 splits without there being any hope of reusing space left free on
the left half of the split.
---
 contrib/amcheck/verify_nbtree.c              | 366 ++++++++++--
 contrib/pageinspect/btreefuncs.c             |   2 +-
 contrib/pageinspect/expected/btree.out       |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out |  10 +-
 src/backend/access/common/indextuple.c       |   6 +-
 src/backend/access/nbtree/README             | 164 ++++--
 src/backend/access/nbtree/nbtinsert.c        | 590 ++++++++++++-------
 src/backend/access/nbtree/nbtpage.c          | 197 +++++--
 src/backend/access/nbtree/nbtree.c           |   2 +-
 src/backend/access/nbtree/nbtsearch.c        | 374 +++++++++---
 src/backend/access/nbtree/nbtsort.c          |  90 +--
 src/backend/access/nbtree/nbtutils.c         | 423 +++++++++++--
 src/backend/access/nbtree/nbtxlog.c          |  43 +-
 src/backend/access/rmgrdesc/nbtdesc.c        |   8 -
 src/backend/utils/sort/tuplesort.c           |  17 +-
 src/include/access/nbtree.h                  | 163 ++++-
 src/include/access/nbtxlog.h                 |  20 +-
 src/test/regress/expected/dependency.out     |   4 +-
 src/test/regress/expected/event_trigger.out  |   4 +-
 src/test/regress/expected/foreign_data.out   |   8 +-
 src/test/regress/expected/rowsecurity.out    |   4 +-
 src/tools/pgindent/typedefs.list             |   4 +
 22 files changed, 1845 insertions(+), 656 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index a1438a2855..6e473a3ac8 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -45,6 +45,13 @@ PG_MODULE_MAGIC;
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
 
+/*
+ * Convenience macro to get number of key attributes in tuple in low-context
+ * fashion
+ */
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
+
 /*
  * State associated with verifying a B-Tree index
  *
@@ -125,26 +132,28 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
-static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+static IndexTuple bt_right_page_check_tuple(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
 						  bool tupleIsAlive, void *checkstate);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
-static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound);
+static inline bool invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
-					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
-							   OffsetNumber upperbound);
+static inline bool invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							 BTScanInsert key,
+							 Page nontarget,
+							 OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+							IndexTuple itup, bool nonpivot);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -834,8 +843,9 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		BTScanInsert skey;
+		bool		lowersizelimit;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -860,7 +870,6 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn),
 					 errhint("This could be a torn page problem.")));
-
 		/* Check the number of index tuple attributes */
 		if (!_bt_check_natts(state->rel, state->target, offset))
 		{
@@ -902,8 +911,66 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
-		/* Build insertion scankey for current page offset */
-		skey = _bt_mkscankey(state->rel, itup);
+		/*
+		 * Build insertion scankey for current page offset/tuple.
+		 *
+		 * As required by _bt_mkscankey(), track number of key attributes,
+		 * which is needed so that _bt_compare() calls handle truncated
+		 * attributes correctly.  Never count non-key attributes in
+		 * non-truncated tuples as key attributes, though.
+		 */
+		skey = _bt_mkscankey(state->rel, itup, false);
+
+		/*
+		 * Make sure tuple size does not exceed the relevant BTREE_VERSION
+		 * specific limit.
+		 *
+		 * BTREE_VERSION 4 (which introduced heapkeyspace rules) requisitioned
+		 * a MAXALIGN() quantum of space from BTMaxItemSize() in order to
+		 * ensure that suffix truncation always has enough space to add an
+		 * explicit heap TID back to a tuple -- we pessimistically assume that
+		 * every newly inserted tuple will eventually need to have a heap TID
+		 * appended during a future leaf page split, when the tuple becomes
+		 * the basis of the new high key (pivot tuple) for the leaf page.
+		 *
+		 * Since a MAXALIGN() quantum is reserved for that purpose, we must
+		 * not enforce the slightly lower limit when the extra quantum has
+		 * been used as intended.  In other words, there is only a
+		 * cross-version difference in the limit on tuple size within leaf
+		 * pages.
+		 *
+		 * Still, we're particular about the details within BTREE_VERSION 4
+		 * internal pages.  Pivot tuples may only use the extra quantum for
+		 * its designated purpose.  Enforce the lower limit for pivot tuples
+		 * when an explicit heap TID isn't actually present. (In all other
+		 * cases suffix truncation is guaranteed to generate a pivot tuple
+		 * that's no larger than the first right tuple provided to it by its
+		 * caller.)
+		 */
+		lowersizelimit = skey->heapkeyspace &&
+			(P_ISLEAF(topaque) || BTreeTupleGetHeapTID(itup) == NULL);
+		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
+					   BTMaxItemSizeNoHeapTid(state->target)))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
+							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index row size %zu exceeds maximum for index \"%s\"",
+							tupsize, RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap" : "index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -928,9 +995,35 @@ bt_target_page_check(BtreeCheckState *state)
 		 * grandparents (as well as great-grandparents, and so on).  We don't
 		 * go to those lengths because that would be prohibitively expensive,
 		 * and probably not markedly more effective in practice.
+		 *
+		 * On the leaf level, we check that the key is <= the highkey.
+		 * However, on non-leaf levels we check that the key is < the highkey,
+		 * because the high key is "just another separator" rather than a copy
+		 * of some existing key item; we expect it to be unique among all keys
+		 * on the same level.  (Suffix truncation will sometimes produce a
+		 * leaf highkey that is an untruncated copy of the lastleft item, but
+		 * never any other item, which necessitates weakening the leaf level
+		 * check to <=.)
+		 *
+		 * Full explanation for why a highkey is never truly a copy of another
+		 * item from the same level on internal levels:
+		 *
+		 * While the new left page's high key is copied from the first offset
+		 * on the right page during an internal page split, that's not the
+		 * full story.  In effect, internal pages are split in the middle of
+		 * the firstright tuple, not between the would-be lastleft and
+		 * firstright tuples: the firstright key ends up on the left side as
+		 * left's new highkey, and the firstright downlink ends up on the
+		 * right side as right's new "negative infinity" item.  The negative
+		 * infinity tuple is truncated to zero attributes, so we're only left
+		 * with the downlink.  In other words, the copying is just an
+		 * implementation detail of splitting in the middle of a (pivot)
+		 * tuple. (See also: "Notes About Data Representation" in the nbtree
+		 * README.)
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
+								  invariant_l_offset(state, skey, P_HIKEY)))
 		{
 			char	   *itid,
 					   *htid;
@@ -956,11 +1049,10 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1017,16 +1109,20 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
-			ScanKey		rightkey;
+			IndexTuple	righttup;
+			BTScanInsert rightkey;
 
 			/* Get item in next/right page */
-			rightkey = bt_right_page_check_scankey(state);
+			righttup = bt_right_page_check_tuple(state);
 
-			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+			/* Set up right item scankey */
+			if (righttup)
+				rightkey = _bt_mkscankey(state->rel, righttup, false);
+
+			if (righttup && !invariant_g_offset(state, rightkey, max))
 			{
 				/*
-				 * As explained at length in bt_right_page_check_scankey(),
+				 * As explained at length in bt_right_page_check_tuple(),
 				 * there is a known !readonly race that could account for
 				 * apparent violation of invariant, which we must check for
 				 * before actually proceeding with raising error.  Our canary
@@ -1069,7 +1165,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, skey, childblock);
 		}
 	}
 
@@ -1083,9 +1179,9 @@ bt_target_page_check(BtreeCheckState *state)
 }
 
 /*
- * Return a scankey for an item on page to right of current target (or the
+ * Return an index tuple for an item on page to right of current target (or the
  * first non-ignorable page), sufficient to check ordering invariant on last
- * item in current target page.  Returned scankey relies on local memory
+ * item in current target page.  Returned tuple relies on local memory
  * allocated for the child page, which caller cannot pfree().  Caller's memory
  * context should be reset between calls here.
  *
@@ -1098,8 +1194,8 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
-bt_right_page_check_scankey(BtreeCheckState *state)
+static IndexTuple
+bt_right_page_check_tuple(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
@@ -1287,11 +1383,10 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	}
 
 	/*
-	 * Return first real item scankey.  Note that this relies on right page
-	 * memory remaining allocated.
+	 * Return first real item.  Note that this relies on right page memory
+	 * remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	return (IndexTuple) PageGetItem(rightpage, rightitem);
 }
 
 /*
@@ -1304,8 +1399,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  * verification this way around is much more practical.
  */
 static void
-bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1354,7 +1449,8 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the
+	 * page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1403,15 +1499,29 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 	{
 		/*
 		 * Skip comparison of target page key against "negative infinity"
-		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * item, if any.  Checking it would indicate that it's not a strict
+		 * lower bound, but that's only because of the hard-coding for
+		 * negative infinity items within _bt_compare().
+		 *
+		 * If nbtree didn't truncate negative infinity tuples during internal
+		 * page splits then we'd expect child's negative infinity key to be
+		 * equal to the scankey/downlink from target/parent (it would be a
+		 * "low key" in this hypothetical scenario, and so it would still need
+		 * to be treated as a special case here).
+		 *
+		 * Negative infinity items can be thought of as a strict lower bound
+		 * that works transitively, with the last non-negative-infinity pivot
+		 * followed during a descent from the root as its "true" strict lower
+		 * bound.  Only a small number of negative infinity items are truly
+		 * negative infinity; those that are the first items of leftmost
+		 * internal pages.  In more general terms, a negative infinity item is
+		 * only negative infinity with respect to the subtree that the page is
+		 * at the root of.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_l_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1751,6 +1861,63 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	/*
+	 * pg_upgrade'd indexes may legally have equal sibling tuples.  Their
+	 * pivot tuples can never have key attributes truncated away.
+	 */
+	if (!key->heapkeyspace)
+		return invariant_leq_offset(state, key, upperbound);
+
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as
+	 * meaning that they're not participating in a search.  In practice, this
+	 * behavior is equivalent to an explicit negative infinity representation
+	 * within nbtree.  We care about the distinction between strict and
+	 * non-strict bounds, though, and so must consider truncated/negative
+	 * infinity attributes explicitly.
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		nonpivot = P_ISLEAF(topaque) && upperbound >= P_FIRSTDATAKEY(topaque);
+
+		/* Get number of keys + heap TID for item to the right */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup, nonpivot);
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && rheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1759,57 +1926,107 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
+invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
-	return cmp >= 0;
+	/*
+	 * pg_upgrade'd indexes may legally have equal sibling tuples.  Their
+	 * pivot tuples can never have key attributes truncated away.
+	 */
+	if (!key->heapkeyspace)
+		return cmp >= 0;
+
+	/*
+	 * No need to consider the possibility that scankey has attributes that we
+	 * need to force to be interpreted as negative infinity.  That could cause
+	 * us to miss the fact that the scankey is less than rather than equal to
+	 * its lower bound, but the index is corrupt either way.
+	 */
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e.  the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							 Page nontarget, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
-	return cmp <= 0;
+	/*
+	 * pg_upgrade'd indexes may legally have equal sibling tuples.  Their
+	 * pivot tuples can never have key attributes truncated away.
+	 */
+	if (!key->heapkeyspace)
+		return cmp <= 0;
+
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as
+	 * meaning that they're not participating in a search.  In practice, this
+	 * behavior is equivalent to an explicit negative infinity representation
+	 * within nbtree.  We care about the distinction between strict and
+	 * non-strict bounds, though, and so must consider truncated/negative
+	 * infinity attributes explicitly.
+	 */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+		BTPageOpaque copaque;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		nonpivot = P_ISLEAF(copaque) && upperbound >= P_FIRSTDATAKEY(copaque);
+
+		/* Get number of keys + heap TID for child/non-target item */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+		childheaptid = BTreeTupleGetHeapTIDCareful(state, child, nonpivot);
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && childheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -1965,3 +2182,28 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
+ * be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	BlockNumber targetblock = state->targetblock;
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 184ac62255..bee1f1c9d9 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -560,7 +560,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	 * Get values of extended metadata if available, use default values
 	 * otherwise.
 	 */
-	if (metad->btm_version == BTREE_VERSION)
+	if (metad->btm_version >= BTREE_META_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b..07c2dcd771 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d4..9920dbfd40 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/src/backend/access/common/indextuple.c b/src/backend/access/common/indextuple.c
index aa52a96259..7a18da2e97 100644
--- a/src/backend/access/common/indextuple.c
+++ b/src/backend/access/common/indextuple.c
@@ -475,7 +475,11 @@ index_truncate_tuple(TupleDesc sourceDescriptor, IndexTuple source,
 	bool		isnull[INDEX_MAX_KEYS];
 	IndexTuple	truncated;
 
-	Assert(leavenatts < sourceDescriptor->natts);
+	Assert(leavenatts <= sourceDescriptor->natts);
+
+	/* Easy case: no truncation actually required */
+	if (leavenatts == sourceDescriptor->natts)
+		return CopyIndexTuple(source);
 
 	/* Create temporary descriptor to scribble on */
 	truncdesc = palloc(TupleDescSize(sourceDescriptor));
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89..75cb1d1e22 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -28,37 +28,60 @@ right-link to find the new page containing the key range you're looking
 for.  This might need to be repeated, if the page has been split more than
 once.
 
+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute.  Logical duplicates are sorted in heap item
+pointer order.  We don't use btree keys to disambiguate downlinks from the
+internal pages during a page split, though: only one entry in the parent
+level will be pointing at the page we just split, so the link fields can be
+used to re-find downlinks in the parent via a linear search.  (This is
+actually a legacy of when heap TID was not treated as part of the keyspace,
+but it does no harm to keep things that way.)
+
+Lehman and Yao talk about pairs of "separator" keys and downlinks in
+internal pages rather than tuples or records.  We use the term "pivot"
+tuple to distinguish tuples that don't point to heap tuples and are used
+only for tree navigation.  Pivot tuples include all tuples on
+non-leaf pages and high keys on leaf pages.  Note that pivot tuples are
+only used to represent which part of the key space belongs on each page,
+and can have attribute values copied from non-pivot tuples that were
+deleted and killed by VACUUM some time ago.  A pivot tuple may contain a
+"separator" key and downlink, just a separator key (in practice the
+downlink will be garbage), or just a downlink.  We aren't always clear on
+which case applies, but it should be obvious from context.
+
+Lehman and Yao require that the key range for a subtree S is described by
+Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent page.
+A search where the scan key is equal to a pivot tuple in an upper tree
+level must descend to the left of that pivot to ensure it finds any equal
+keys.  Pivot tuples are always a _strict_ lower bound on items on their
+downlink page; the equal item(s) being searched for must therefore be to
+the left of that downlink page on the next level down.  (It's possible to
+arrange for internal page tuples to be strict lower bounds in all cases
+because their values come from leaf tuples, which are guaranteed unique by
+the use of heap TID as a tiebreaker.  We also make use of hard-coded
+negative infinity values in internal pages.  Rightmost pages don't have a
+high key, though they conceptually have a positive infinity high key).  A
+handy property of this design is that there is never any need to
+distinguish equality in the case where all attributes/keys are used in a
+scan from equality where only some prefix is used.
+
+In practice, exact equality with pivot tuples on internal pages is
+extremely rare when all attributes (including even the heap TID attribute)
+are used in a search.  This is due to suffix truncation: truncated
+attributes are treated as having the value negative infinity, and
+truncation almost always manages to at least truncate away the trailing
+heap TID attribute.  While Lehman and Yao don't have anything to say about
+suffix truncation, the design used by nbtree is perfectly complementary.
+The later section on suffix truncation will be helpful if it's unclear how
+the Lehman & Yao invariants work with a real world example involving
+suffix truncation.
+
 Differences to the Lehman & Yao algorithm
 -----------------------------------------
 
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
-
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
-
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
 among backends.  As a result, we do page-level read locking on btree
@@ -598,33 +621,53 @@ the order of multiple keys for a given column is unspecified.)  An
 insertion scankey uses the same array-of-ScanKey data structure, but the
 sk_func pointers point to btree comparison support functions (ie, 3-way
 comparators that return int4 values interpreted as <0, =0, >0).  In an
-insertion scankey there is exactly one entry per index column.  Insertion
-scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
-used to locate the starting point of a scan, as well as for locating the
-place to insert a new index tuple.  (Note: in the case of an insertion
-scankey built from a search scankey, there might be fewer keys than
-index columns, indicating that we have no constraints for the remaining
-index columns.)  After we have located the starting point of a scan, the
-original search scankey is consulted as each index entry is sequentially
-scanned to decide whether to return the entry and whether the scan can
-stop (see _bt_checkkeys()).
+insertion scankey there is exactly one entry per index column.  There is
+also other data about the rules used to locate where to begin the scan,
+such as whether or not the scan is a "nextkey" scan.  Insertion scankeys
+are built within the btree code (eg, by _bt_mkscankey()) and are used to
+locate the starting point of a scan, as well as for locating the place to
+insert a new index tuple.  (Note: in the case of an insertion scankey built
+from a search scankey, there might be fewer keys than index columns,
+indicating that we have no constraints for the remaining index columns.)
+After we have located the starting point of a scan, the original search
+scankey is consulted as each index entry is sequentially scanned to decide
+whether to return the entry and whether the scan can stop (see
+_bt_checkkeys()).
 
-We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+Notes about suffix truncation
+-----------------------------
+
+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split when the remaining attributes distinguish the
+last index tuple on the post-split left page as belonging on the left page,
+and the first index tuple on the post-split right page as belonging on the
+right page.  A truncated tuple logically retains all key attributes, though
+they implicitly have "negative infinity" as their value, and have no
+storage overhead.  Since the high key is subsequently reused as the
+downlink in the parent page for the new right page, suffix truncation makes
+pivot tuples short.  INCLUDE indexes are guaranteed to have non-key
+attributes truncated at the time of a leaf page split, but may also have
+some key attributes truncated away, based on the usual criteria for key
+attributes.  They are not a special case, since non-key attributes are
+merely payload to B-Tree searches.
+
+The goal of suffix truncation of key attributes is to improve index
+fan-out.  The technique was first described by Bayer and Unterauer (R.Bayer
+and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
+2, No. 1, March 1977, pp 11-26).  The Postgres implementation is loosely
+based on their paper.  Note that Postgres only implements what the paper
+refers to as simple prefix B-Trees.  Note also that the paper assumes that
+the tree has keys that consist of single strings that maintain the "prefix
+property", much like strings that are stored in a suffix tree (comparisons
+of earlier bytes must always be more significant than comparisons of later
+bytes, and, in general, the strings must compare in a way that doesn't
+break transitive consistency as they're split into pieces).  Suffix
+truncation in Postgres currently only works at the whole-attribute
+granularity, but it would be straightforward to invent opclass
+infrastructure that manufactures a smaller attribute value in the case of
+variable-length types, such as text.  An opclass support function could
+manufacture the shortest possible key value that still correctly separates
+each half of a leaf page split.
 
 Notes About Data Representation
 -------------------------------
@@ -637,20 +680,26 @@ don't need to renumber any existing pages when splitting the root.)
 
 The Postgres disk block data format (an array of items) doesn't fit
 Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
-so we have to play some games.
+so we have to play some games.  (Presumably things are explained this
+way because of internal page splits, which conceptually split at the
+middle of an existing pivot tuple -- the tuple's "separator" key goes on
+the left side of the split as the left side's new high key, while the
+tuple's pointer/downlink goes on the right side as the first/minus
+infinity downlink.)
 
 On a page that is not rightmost in its tree level, the "high key" is
 kept in the page's first item, and real data items start at item 2.
 The link portion of the "high key" item goes unused.  A page that is
-rightmost has no "high key", so data items start with the first item.
-Putting the high key at the left, rather than the right, may seem odd,
-but it avoids moving the high key as we add data items.
+rightmost has no "high key" (it's implicitly positive infinity), so
+data items start with the first item.  Putting the high key at the
+left, rather than the right, may seem odd, but it avoids moving the
+high key as we add data items.
 
 On a leaf page, the data items are simply links to (TIDs of) tuples
 in the relation being indexed, with the associated key values.
 
 On a non-leaf page, the data items are down-links to child pages with
-bounding keys.  The key in each data item is the *lower* bound for
+bounding keys.  The key in each data item is a strict lower bound for
 keys on that child page, so logically the key is to the left of that
 downlink.  The high key (if present) is the upper bound for the last
 downlink.  The first data item on each such page has no lower bound
@@ -658,4 +707,5 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 582e5b0652..402912db12 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -52,19 +52,19 @@ typedef struct
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
-static TransactionId _bt_check_unique(Relation rel, IndexTuple itup,
-				 Relation heapRel, Buffer buf, OffsetNumber offset,
-				 ScanKey itup_scankey,
-				 IndexUniqueCheck checkUnique, bool *is_unique,
-				 uint32 *speculativeToken);
-static void _bt_findinsertloc(Relation rel,
+static TransactionId _bt_check_unique(Relation rel, BTScanInsert itup_scankey,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
+				 OffsetNumber offset, IndexUniqueCheck checkUnique,
+				 bool *is_unique, uint32 *speculativeToken);
+static OffsetNumber _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_scankey,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool uniqueindex,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel);
+static bool _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -72,7 +72,7 @@ static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   bool split_only_page);
 static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
 		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
-		  IndexTuple newitem, bool newitemonleft);
+		  IndexTuple newitem, bool newitemonleft, bool truncate);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
@@ -84,8 +84,8 @@ static void _bt_checksplitloc(FindSplitData *state,
 				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey);
+static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey,
+			Page page, OffsetNumber offnum);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
@@ -111,18 +111,21 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel)
 {
 	bool		is_unique = false;
-	int			indnkeyatts;
-	ScanKey		itup_scankey;
+	BTScanInsert itup_scankey;
 	BTStack		stack = NULL;
 	Buffer		buf;
-	OffsetNumber offset;
+	Page		page;
+	BTPageOpaque lpageop;
 	bool		fastpath;
 
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	Assert(indnkeyatts != 0);
+	Assert(IndexRelationGetNumberOfKeyAttributes(rel) != 0);
 
 	/* we need an insertion scan key to do our search, so build one */
-	itup_scankey = _bt_mkscankey(rel, itup);
+	itup_scankey = _bt_mkscankey(rel, itup, false);
+top:
+	/* Cannot use real heap TID in unique case -- it'll be restored later */
+	if (itup_scankey->heapkeyspace && checkUnique != UNIQUE_CHECK_NO)
+		itup_scankey->scantid = _bt_lowest_scantid();
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -143,14 +146,10 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	 * other backend might be concurrently inserting into the page, thus
 	 * reducing our chances to finding an insertion place in this page.
 	 */
-top:
 	fastpath = false;
-	offset = InvalidOffsetNumber;
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
 		Size		itemsz;
-		Page		page;
-		BTPageOpaque lpageop;
 
 		/*
 		 * Conditionally acquire exclusive lock on the buffer before doing any
@@ -174,14 +173,17 @@ top:
 			/*
 			 * Check if the page is still the rightmost leaf page, has enough
 			 * free space to accommodate the new tuple, and the insertion scan
-			 * key is strictly greater than the first key on the page.
+			 * key is strictly greater than the first key on the page.  Note
+			 * that the sentinel low scantid temporarily set in itup_scankey
+			 * for the unique case can make this test fail even though
+			 * applying the optimization would actually be safe; that can
+			 * only happen when the page is full of duplicates.
 			 */
 			if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, itup_scankey, page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -220,8 +222,7 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, itup_scankey, &buf, BT_WRITE, NULL);
 	}
 
 	/*
@@ -231,12 +232,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tie-breaker attribute.  Any other would-be inserter
+	 * of the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -249,9 +251,24 @@ top:
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
+		OffsetNumber offset;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
-		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
+		page = BufferGetPage(buf);
+		lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+		/*
+		 * Arrange for the _bt_binsrch call made later by _bt_findinsertloc
+		 * to avoid repeating the work done by this initial _bt_binsrch call.
+		 * Clear the _bt_lowest_scantid-supplied scantid value first, though,
+		 * so that the itup_scankey-cached low and high bounds will enclose a
+		 * range of offsets in the event of multiple duplicates. (Our
+		 * _bt_binsrch call cannot be allowed to incorrectly enclose a single
+		 * offset: the offset of the first duplicate among many on the page.)
+		 */
+		itup_scankey->scantid = NULL;
+		itup_scankey->savebinsrch = true;
+		offset = _bt_binsrch(rel, itup_scankey, buf);
+		xwait = _bt_check_unique(rel, itup_scankey, itup, heapRel, buf, offset,
 								 checkUnique, &is_unique, &speculativeToken);
 
 		if (TransactionIdIsValid(xwait))
@@ -274,10 +291,16 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/* Uniqueness is established -- restore heap tid as scantid */
+		if (itup_scankey->heapkeyspace)
+			itup_scankey->scantid = &itup->t_tid;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		OffsetNumber insertoff;
+
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -288,10 +311,11 @@ top:
 		 * attributes are not considered part of the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
-		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+		/* do the insertion, possibly on a page to the right in unique case */
+		insertoff = _bt_findinsertloc(rel, itup_scankey, &buf,
+									  checkUnique != UNIQUE_CHECK_NO, itup,
+									  stack, heapRel);
+		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, insertoff, false);
 	}
 	else
 	{
@@ -302,7 +326,7 @@ top:
 	/* be tidy */
 	if (stack)
 		_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
+	pfree(itup_scankey);
 
 	return is_unique;
 }
@@ -327,13 +351,12 @@ top:
  * core code must redo the uniqueness check later.
  */
 static TransactionId
-_bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
-				 Buffer buf, OffsetNumber offset, ScanKey itup_scankey,
-				 IndexUniqueCheck checkUnique, bool *is_unique,
-				 uint32 *speculativeToken)
+_bt_check_unique(Relation rel, BTScanInsert itup_scankey,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
+				 OffsetNumber offset, IndexUniqueCheck checkUnique,
+				 bool *is_unique, uint32 *speculativeToken)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	SnapshotData SnapshotDirty;
 	OffsetNumber maxoff;
 	Page		page;
@@ -344,6 +367,10 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
 
+	/* Fast path for case where there are clearly no duplicates */
+	if (itup_scankey->low >= itup_scankey->high)
+		return InvalidTransactionId;
+
 	InitDirtySnapshot(SnapshotDirty);
 
 	page = BufferGetPage(buf);
@@ -392,7 +419,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 				 * in real comparison, but only for ordering/finding items on
 				 * pages. - vadim 03/24/97
 				 */
-				if (!_bt_isequal(itupdesc, page, offset, indnkeyatts, itup_scankey))
+				if (!_bt_isequal(itupdesc, itup_scankey, page, offset))
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
@@ -553,11 +580,23 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
-			/* If scankey == hikey we gotta check the next page too */
+			/*
+			 * If scankey <= hikey (leaving out the heap TID attribute), we
+			 * gotta check the next page too.
+			 *
+			 * We cannot get away with giving up without going to the next
+			 * page when true key values are all == hikey, because heap TID is
+			 * ignored when considering duplicates (caller is sure to not
+			 * provide a scantid in scankey).  We could get away with this in
+			 * a hypothetical world where unique indexes certainly never
+			 * contain physical duplicates, since heap TID would never be
+			 * treated as part of the keyspace --- not here, and not at any
+			 * other point.
+			 */
+			Assert(itup_scankey->scantid == NULL);
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			if (_bt_compare(rel, itup_scankey, page, P_HIKEY) > 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -599,52 +638,53 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 
 
 /*
- *	_bt_findinsertloc() -- Finds an insert location for a tuple
+ *	_bt_findinsertloc() -- Finds an insert location for a new tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
+ *		On entry, *bufptr contains the page that the new tuple unambiguously
+ *		belongs on.  This may not be quite right for callers that just called
+ *		_bt_check_unique(), though, since they won't have initially searched
+ *		using a scantid.  They'll have to insert into a page somewhere to the
+ *		right in rare cases where there are many physical duplicates in a
+ *		unique index, and their scantid directs us to some page full of
+ *		duplicates to the right, where the new tuple must go.  (Actually,
+ *		since !heapkeyspace pg_upgraded'd non-unique indexes never get a
+ *		scantid, they too may require that we move right.  We treat them
+ *		somewhat like unique indexes.)
  *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
+ *		_bt_check_unique() callers arrange for their insertion scan key to
+ *		save the progress of the last binary search performed.  No additional
+ *		binary search comparisons occur in the common case where there was no
+ *		existing duplicate tuple, though occasionally we still cannot reuse
+ *		their work (e.g. when the page had to be micro-vacuumed, or when we
+ *		had to move right).  Even when there are garbage duplicates, very few
+ *		binary search comparisons that aren't strictly necessary will be
+ *		performed.
  *
- *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
- *		page returned to the caller is exclusively locked instead.
+ *		The caller should hold an exclusive lock on *bufptr in all cases.  On
+ *		exit, *bufptr points to the chosen insert location.  If we have to
+ *		move right, the lock and pin on the original page will be released,
+ *		and the new page returned to the caller is exclusively locked
+ *		instead.  In any case, we return the offset that the caller should
+ *		use to insert into the buffer pointed to by *bufptr on return.
  *
- *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.  It has to happen here, since it may invalidate a
+ *		_bt_check_unique() caller's cached binary search work.
  */
-static void
+static OffsetNumber
 _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_scankey,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool uniqueindex,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel)
 {
 	Buffer		buf = *bufptr;
 	Page		page = BufferGetPage(buf);
+	bool		restorebinsrch = uniqueindex;
 	Size		itemsz;
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
 	OffsetNumber newitemoff;
-	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -652,77 +692,66 @@ _bt_findinsertloc(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itemsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
-	 */
+	/* Check 1/3 of a page restriction */
 	if (itemsz > BTMaxItemSize(page))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itemsz, BTMaxItemSize(page),
-						RelationGetRelationName(rel)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(heapRel,
-									RelationGetRelationName(rel))));
+		_bt_check_third_page(rel, heapRel, itup_scankey->heapkeyspace, page,
+							 newtup);
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
+	/*
+	 * We may have to walk right through leaf pages to find the one leaf page
+	 * that we must insert on to, though only when inserting into unique
+	 * indexes.  This is necessary because a scantid is not used by the
+	 * insertion scan key initially in the case of unique indexes -- a scantid
+	 * is only set after the absence of duplicates (whose heap tuples are not
+	 * dead or recently dead) has been established by _bt_check_unique().
+	 * Non-unique index insertions will break out of the loop immediately.
+	 *
+	 * (Actually, non-unique indexes may still need to grovel through leaf
+	 * pages full of duplicates with a pg_upgrade'd !heapkeyspace index.)
 	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	Assert(!itup_scankey->heapkeyspace || itup_scankey->scantid != NULL);
+	Assert(itup_scankey->heapkeyspace || itup_scankey->scantid == NULL);
+	for (;;)
 	{
 		Buffer		rbuf;
 		BlockNumber rblkno;
+		int			cmpval;
 
 		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
+		 * No need to check high key when inserting into a non-unique index --
+		 * _bt_search() already checked this when it checked if a move to the
+		 * right was required.  Insertion scankey's scantid would have been
+		 * filled out at the time.
 		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
+		if (itup_scankey->heapkeyspace && !uniqueindex)
 		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
-
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
+			Assert(P_RIGHTMOST(lpageop) ||
+				   _bt_compare(rel, itup_scankey, page, P_HIKEY) <= 0);
+			break;
 		}
 
-		/*
-		 * nope, so check conditions (b) and (c) enumerated above
-		 */
-		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+		if (P_RIGHTMOST(lpageop))
 			break;
+		cmpval = _bt_compare(rel, itup_scankey, page, P_HIKEY);
+		if (itup_scankey->heapkeyspace)
+		{
+			if (cmpval <= 0)
+				break;
+		}
+		else
+		{
+			/*
+			 * pg_upgrade'd !heapkeyspace index.
+			 *
+			 * May have to handle legacy case where there is a choice of which
+			 * page to place new tuple on, and we must balance space
+			 * utilization as best we can.
+			 */
+			if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
+													&restorebinsrch, itemsz))
+				break;
+		}
 
 		/*
 		 * step right to next non-dead page
@@ -731,6 +760,8 @@ _bt_findinsertloc(Relation rel,
 		 * page; else someone else's _bt_check_unique scan could fail to see
 		 * our insertion.  write locks on intermediate dead pages won't do
 		 * because we don't know when they will get de-linked from the tree.
+		 * (this is more aggressive than it needs to be for non-unique
+		 * !heapkeyspace indexes.)
 		 */
 		rbuf = InvalidBuffer;
 
@@ -745,7 +776,10 @@ _bt_findinsertloc(Relation rel,
 			 * If this page was incompletely split, finish the split now. We
 			 * do this while holding a lock on the left sibling, which is not
 			 * good because finishing the split could be a fairly lengthy
-			 * operation.  But this should happen very seldom.
+			 * operation.  But this should only happen when inserting into a
+			 * unique index that has more than an entire page for duplicates
+			 * of the value being inserted.  (!heapkeyspace non-unique indexes
+			 * are an exception, once again.)
 			 */
 			if (P_INCOMPLETE_SPLIT(lpageop))
 			{
@@ -764,27 +798,98 @@ _bt_findinsertloc(Relation rel,
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		restorebinsrch = false;
 	}
 
 	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
+	 * Perform micro-vacuuming of the page we're about to insert tuple on if
+	 * it looks like it has LP_DEAD items.  Only micro-vacuum when it might
+	 * forestall a page split, though.
 	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
-	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < itemsz)
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		restorebinsrch = false;
+	}
+
+	/* _bt_check_unique() callers often avoid binary search effort */
+	itup_scankey->restorebinsrch = restorebinsrch;
+	newitemoff = _bt_binsrch(rel, itup_scankey, buf);
+	Assert(!itup_scankey->restorebinsrch);
+	/* XXX: may use too many cycles to be a simple assertion */
+	Assert(!restorebinsrch ||
+		   newitemoff == _bt_binsrch(rel, itup_scankey, buf));
 
 	*bufptr = buf;
-	*offsetptr = newitemoff;
+	return newitemoff;
+}
+
+/*
+ *	_bt_useduplicatepage() -- Settle for this page of duplicates?
+ *
+ *		Prior to PostgreSQL 12/Btree version 4, heap TID was never treated
+ *		as a part of the keyspace.  If there were many tuples of the same
+ *		value spanning more than one page, a new tuple of that same value
+ *		could legally be placed on any one of the pages.  This function
+ *		handles the question of whether or not an insertion of a duplicate
+ *		into a pg_upgrade'd !heapkeyspace index should insert on the page
+ *		contained in buf when that's a legal choice.
+ *
+ *		Returns true if caller should proceed with insert on buf's page.
+ *		Otherwise, caller should move to page to the right.  Caller calls
+ *		Otherwise, caller should move to the page to the right.  Caller calls
+ *		another choice must be made.
+ */
+static bool
+_bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz)
+{
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque lpageop;
+
+	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	Assert(P_ISLEAF(lpageop));
+
+	/* Easy case -- there is space free on this page already */
+	if (PageGetFreeSpace(page) >= itemsz)
+		return true;
+
+	/*
+	 * Perform micro-vacuuming of the page if it looks like it has LP_DEAD
+	 * items
+	 */
+	if (P_HAS_GARBAGE(lpageop))
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		*restorebinsrch = false;
+		if (PageGetFreeSpace(page) >= itemsz)
+			return true;		/* OK, now we have enough space */
+	}
+
+	/*----------
+	 * If _bt_findinsertloc() would need to split this page to put the new
+	 * item on it, check whether we can put the tuple somewhere to the
+	 * right instead.
+	 *
+	 *	_bt_findinsertloc() keeps scanning right until it:
+	 *		(a) reaches the last page where the tuple can legally go.
+	 *	Or until we (in this function):
+	 *		(b) find a page with enough free space, or
+	 *		(c) get tired of searching.
+	 * (c) is not flippant; it is important because if there are many
+	 * pages' worth of equal keys, it's better to split one of the early
+	 * pages than to scan all the way to the end of the run of equal keys
+	 * on every insert.  We implement "get tired" as a random choice,
+	 * since stopping after scanning a fixed number of pages wouldn't work
+	 * well (we'd never reach the right-hand side of previously split
+	 * pages).  The probability of moving right is set at 0.99, which may
+	 * seem too high to change the behavior much, but it does an excellent
+	 * job of preventing O(N^2) behavior with many equal keys.
+	 *----------
+	 */
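+	/*
+	 * random() returns a value in [0, MAX_RANDOM_VALUE], so the test below
+	 * succeeds for roughly 1 in 100 calls: we settle for (and probably
+	 * split) the current page about 1% of the time, and keep moving right
+	 * the other 99% of the time.
+	 */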
+	return random() <= (MAX_RANDOM_VALUE / 100);
 }
 
 /*----------
@@ -833,6 +938,8 @@ _bt_insertonpg(Relation rel,
 	BTPageOpaque lpageop;
 	OffsetNumber firstright = InvalidOffsetNumber;
 	Size		itemsz;
+	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
+	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -840,12 +947,9 @@ _bt_insertonpg(Relation rel,
 	/* child buffer must be given iff inserting on an internal page */
 	Assert(P_ISLEAF(lpageop) == !BufferIsValid(cbuf));
 	/* tuple must have appropriate number of attributes */
-	Assert(!P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
-	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(BTreeTupleGetNAtts(itup, rel) > 0);
+	Assert(!P_ISLEAF(lpageop) || BTreeTupleGetNAtts(itup, rel) == indnatts);
+	Assert(P_ISLEAF(lpageop) || BTreeTupleGetNAtts(itup, rel) <= indnkeyatts);
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -867,6 +971,7 @@ _bt_insertonpg(Relation rel,
 	{
 		bool		is_root = P_ISROOT(lpageop);
 		bool		is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(lpageop);
+		bool		truncate;
 		bool		newitemonleft;
 		Buffer		rbuf;
 
@@ -893,9 +998,16 @@ _bt_insertonpg(Relation rel,
 									  newitemoff, itemsz,
 									  &newitemonleft);
 
+		/*
+		 * Perform truncation of the new high key for the left half of the
+		 * split when splitting a leaf page.  Don't do so with version 3
+		 * indexes unless the index has non-key attributes.
+		 */
+		truncate = P_ISLEAF(lpageop) &&
+			(_bt_heapkeyspace(rel) || indnatts != indnkeyatts);
 		/* split the buffer into left and right halves */
 		rbuf = _bt_split(rel, buf, cbuf, firstright,
-						 newitemoff, itemsz, itup, newitemonleft);
+						 newitemoff, itemsz, itup, newitemonleft, truncate);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -977,7 +1089,7 @@ _bt_insertonpg(Relation rel,
 		if (BufferIsValid(metabuf))
 		{
 			/* upgrade meta-page if needed */
-			if (metad->btm_version < BTREE_VERSION)
+			if (metad->btm_version < BTREE_META_VERSION)
 				_bt_upgrademetapage(metapg);
 			metad->btm_fastroot = itup_blkno;
 			metad->btm_fastlevel = lpageop->btpo.level;
@@ -1032,6 +1144,8 @@ _bt_insertonpg(Relation rel,
 
 			if (BufferIsValid(metabuf))
 			{
+				Assert(metad->btm_version >= BTREE_META_VERSION);
+				xlmeta.version = metad->btm_version;
 				xlmeta.root = metad->btm_root;
 				xlmeta.level = metad->btm_level;
 				xlmeta.fastroot = metad->btm_fastroot;
@@ -1097,7 +1211,10 @@ _bt_insertonpg(Relation rel,
  *		On entry, buf is the page to split, and is pinned and write-locked.
  *		firstright is the item index of the first item to be moved to the
  *		new right page.  newitemoff etc. tell us about the new item that
- *		must be inserted along with the data from the old page.
+ *		must be inserted along with the data from the old page.  truncate
+ *		tells us if the new high key should undergo suffix truncation.
+ *		(Version 4 pivot tuples always have an explicit representation of
+ *		the number of non-truncated attributes that remain.)
  *
  *		When splitting a non-leaf page, 'cbuf' is the left-sibling of the
  *		page we're inserting the downlink for.  This function will clear the
@@ -1109,7 +1226,7 @@ _bt_insertonpg(Relation rel,
 static Buffer
 _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
-		  bool newitemonleft)
+		  bool newitemonleft, bool truncate)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1132,8 +1249,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	OffsetNumber i;
 	bool		isleaf;
 	IndexTuple	lefthikey;
-	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/* Acquire a new page to split into */
 	rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1203,7 +1318,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <=
+			   IndexRelationGetNumberOfKeyAttributes(rel));
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1217,8 +1334,14 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 
 	/*
 	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
-	 * item at position firstright, or the incoming tuple.
+	 * to go into the new right page, or possibly a truncated version if this
+	 * is a leaf page split.  This might be either the existing data item at
+	 * position firstright, or the incoming tuple.
+	 *
+	 * Lehman and Yao use the last left item as the new high key for the left
+	 * page.  Despite appearances, the new high key is generated in a way
+	 * that's consistent with their approach.  See comments above
+	 * _bt_findsplitloc for an explanation.
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1236,25 +1359,60 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
-	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * Truncate nondistinguishing key attributes of the high key item before
+	 * inserting it on the left page.  This can only happen at the leaf level,
+	 * since in general all pivot tuple values originate from leaf level high
+	 * keys.  This isn't just about avoiding unnecessary work, though;
+	 * truncating unneeded key suffix attributes can only be performed at the
+	 * leaf level anyway.  This is because a pivot tuple in a grandparent page
+	 * must guide a search not only to the correct parent page, but also to
+	 * the correct leaf page.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (truncate)
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		IndexTuple	lastleft;
+
+		/*
+		 * Determine which tuple will become the last on the left page.  The
+		 * last left tuple and the first right tuple enclose the split point,
+		 * and are needed to determine how far truncation can go while still
+		 * leaving us with a high key that distinguishes the left side from
+		 * the right side.
+		 */
+		Assert(isleaf);
+		if (newitemonleft && newitemoff == firstright)
+		{
+			/* incoming tuple will become last on left page */
+			lastleft = newitem;
+		}
+		else
+		{
+			OffsetNumber lastleftoff;
+
+			/* item just before firstright will become last on left page */
+			lastleftoff = OffsetNumberPrev(firstright);
+			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		/*
+		 * Truncate first item on the right side to create a new high key for
+		 * the left side.  The high key must be strictly less than all tuples
+		 * on the right side of the split, but can be equal to the last item
+		 * on the left side of the split.
+		 */
+		Assert(lastleft != item);
+		lefthikey = _bt_truncate(rel, lastleft, item, false);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <=
+		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1447,7 +1605,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1476,22 +1633,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log the left page's new high key, which now must always be logged */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1509,9 +1654,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -1564,6 +1707,24 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
  * righthand page, plus a boolean indicating whether the new tuple goes on
  * the left or right page.  The bool is necessary to disambiguate the case
  * where firstright == newitemoff.
+ *
+ * The high key for the left page is formed using the first item on the
+ * right page, which may seem to be contrary to Lehman & Yao's approach of
+ * using the left page's last item as its new high key.  It isn't, though;
+ * suffix truncation will leave the left page's high key equal to the last
+ * item on the left page when two tuples with equal key values enclose the
+ * split point.  It's convenient to always express a split point as a
+ * firstright offset due to internal page splits, which leave us with a
+ * right half whose first item becomes a negative infinity item through
+ * truncation to 0 attributes.  In effect, internal page splits store
+ * firstright's "separator" key at the end of the left page (as left's new
+ * high key), and store its downlink at the start of the right page.  In
+ * other words, internal page splits conceptually split in the middle of the
+ * firstright tuple, not on either side of it.  Crucially, when splitting
+ * either a leaf page or an internal page, the new high key will be strictly
+ * less than the first item on the right page in all cases, despite the fact
+ * that we start with the assumption that firstright becomes the new high
+ * key.
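+ *
+ * As a purely illustrative leaf-level example in a two-attribute index:
+ * if the last item on the left half is ('apple', 5) and the first item
+ * on the right half is ('banana', 1), truncating the second attribute
+ * from the firstright item produces a high key of ('banana', -inf),
+ * which is greater than everything on the left but still strictly less
+ * than ('banana', 1) and everything else on the right.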
  */
 static OffsetNumber
 _bt_findsplitloc(Relation rel,
@@ -1874,7 +2035,7 @@ _bt_insert_parent(Relation rel,
 			_bt_relbuf(rel, pbuf);
 		}
 
-		/* get high key from left page == lower bound for new right page */
+		/* get high key from left, a strict lower bound for new right page */
 		ritem = (IndexTuple) PageGetItem(page,
 										 PageGetItemId(page, P_HIKEY));
 
@@ -2164,7 +2325,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	START_CRIT_SECTION();
 
 	/* upgrade metapage if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_META_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* set btree special data */
@@ -2199,7 +2360,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2232,6 +2394,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
 		XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_META_VERSION);
+		md.version = metad->btm_version;
 		md.root = rootblknum;
 		md.level = metad->btm_level;
 		md.fastroot = rootblknum;
@@ -2296,6 +2460,7 @@ _bt_pgaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -2311,28 +2476,25 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
-_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey)
+_bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey, Page page,
+			OffsetNumber offnum)
 {
 	IndexTuple	itup;
+	ScanKey		scankey;
 	int			i;
 
-	/* Better be comparing to a leaf item */
+	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
 
+	scankey = itup_scankey->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
-	for (i = 1; i <= keysz; i++)
+	for (i = 1; i <= itup_scankey->keysz; i++)
 	{
 		AttrNumber	attno;
 		Datum		datum;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 4082103fe2..171427c94f 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -34,6 +34,7 @@
 #include "utils/snapmgr.h"
 
 static void _bt_cachemetadata(Relation rel, BTMetaPageData *metad);
+static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 						 bool *rightsib_empty);
@@ -77,7 +78,9 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 }
 
 /*
- *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to the new.
+ *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to version
+ *	3, the last version that can be updated without broadly affecting on-disk
+ *	compatibility.  (A REINDEX is required to upgrade to version 4.)
  *
  *		This routine does purely in-memory image upgrade.  Caller is
  *		responsible for locking, WAL-logging etc.
@@ -93,11 +96,11 @@ _bt_upgrademetapage(Page page)
 
 	/* It must be really a meta page of upgradable version */
 	Assert(metaopaque->btpo_flags & BTP_META);
-	Assert(metad->btm_version < BTREE_VERSION);
+	Assert(metad->btm_version < BTREE_META_VERSION);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 
 	/* Set version number and fill extra fields added into version 3 */
-	metad->btm_version = BTREE_VERSION;
+	metad->btm_version = BTREE_META_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 
@@ -107,43 +110,79 @@ _bt_upgrademetapage(Page page)
 }
 
 /*
- * Cache metadata from meta page to rel->rd_amcache.
+ * Cache metadata from input meta page to rel->rd_amcache.
  */
 static void
-_bt_cachemetadata(Relation rel, BTMetaPageData *metad)
+_bt_cachemetadata(Relation rel, BTMetaPageData *input)
 {
+	BTMetaPageData *cached_metad;
+
 	/* We assume rel->rd_amcache was already freed by caller */
 	Assert(rel->rd_amcache == NULL);
 	rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
 										 sizeof(BTMetaPageData));
 
-	/*
-	 * Meta page should be of supported version (should be already checked by
-	 * caller).
-	 */
-	Assert(metad->btm_version >= BTREE_MIN_VERSION &&
-		   metad->btm_version <= BTREE_VERSION);
+	/* Meta page should be of supported version */
+	Assert(input->btm_version >= BTREE_MIN_VERSION &&
+		   input->btm_version <= BTREE_VERSION);
 
-	if (metad->btm_version == BTREE_VERSION)
+	cached_metad = (BTMetaPageData *) rel->rd_amcache;
+	if (input->btm_version >= BTREE_META_VERSION)
 	{
-		/* Last version of meta-data, no need to upgrade */
-		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		/* Version with compatible meta-data, no need to upgrade */
+		memcpy(cached_metad, input, sizeof(BTMetaPageData));
 	}
 	else
 	{
-		BTMetaPageData *cached_metad = (BTMetaPageData *) rel->rd_amcache;
-
 		/*
 		 * Upgrade meta-data: copy available information from meta-page and
 		 * fill new fields with default values.
+		 *
+		 * Note that we cannot upgrade to BTREE_VERSION/version 4 without a
+		 * REINDEX, since extensive on-disk changes are required.
 		 */
-		memcpy(rel->rd_amcache, metad, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
-		cached_metad->btm_version = BTREE_VERSION;
+		memcpy(cached_metad, input, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
+		cached_metad->btm_version = BTREE_META_VERSION;
 		cached_metad->btm_oldest_btpo_xact = InvalidTransactionId;
 		cached_metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	}
 }
 
+/*
+ * Get metadata from share-locked buffer containing metapage, while performing
+ * standard sanity checks.  Sanity checks here must match _bt_getroot().
+ */
+static BTMetaPageData *
+_bt_getmeta(Relation rel, Buffer metabuf)
+{
+	Page		metapg;
+	BTPageOpaque metaopaque;
+	BTMetaPageData *metad;
+
+	metapg = BufferGetPage(metabuf);
+	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
+	metad = BTPageGetMeta(metapg);
+
+	/* sanity-check the metapage */
+	if (!P_ISMETA(metaopaque) ||
+		metad->btm_magic != BTREE_MAGIC)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("index \"%s\" is not a btree",
+						RelationGetRelationName(rel))));
+
+	if (metad->btm_version < BTREE_MIN_VERSION ||
+		metad->btm_version > BTREE_VERSION)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("version mismatch in index \"%s\": file version %d, "
+						"current version %d, minimal supported version %d",
+						RelationGetRelationName(rel),
+						metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+
+	return metad;
+}
+
 /*
  *	_bt_update_meta_cleanup_info() -- Update cleanup-related information in
  *									  the metapage.
@@ -186,7 +225,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_META_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* update cleanup-related information */
@@ -202,6 +241,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_META_VERSION);
+		md.version = metad->btm_version;
 		md.root = metad->btm_root;
 		md.level = metad->btm_level;
 		md.fastroot = metad->btm_fastroot;
@@ -376,7 +417,7 @@ _bt_getroot(Relation rel, int access)
 		START_CRIT_SECTION();
 
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_META_VERSION)
 			_bt_upgrademetapage(metapg);
 
 		metad->btm_root = rootblkno;
@@ -400,6 +441,8 @@ _bt_getroot(Relation rel, int access)
 			XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
 			XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_META_VERSION);
+			md.version = metad->btm_version;
 			md.root = rootblkno;
 			md.level = 0;
 			md.fastroot = rootblkno;
@@ -595,37 +638,12 @@ _bt_getrootheight(Relation rel)
 {
 	BTMetaPageData *metad;
 
-	/*
-	 * We can get what we need from the cached metapage data.  If it's not
-	 * cached yet, load it.  Sanity checks here must match _bt_getroot().
-	 */
 	if (rel->rd_amcache == NULL)
 	{
 		Buffer		metabuf;
-		Page		metapg;
-		BTPageOpaque metaopaque;
 
 		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
-		metapg = BufferGetPage(metabuf);
-		metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-		metad = BTPageGetMeta(metapg);
-
-		/* sanity-check the metapage */
-		if (!P_ISMETA(metaopaque) ||
-			metad->btm_magic != BTREE_MAGIC)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("index \"%s\" is not a btree",
-							RelationGetRelationName(rel))));
-
-		if (metad->btm_version < BTREE_MIN_VERSION ||
-			metad->btm_version > BTREE_VERSION)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("version mismatch in index \"%s\": file version %d, "
-							"current version %d, minimal supported version %d",
-							RelationGetRelationName(rel),
-							metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+		metad = _bt_getmeta(rel, metabuf);
 
 		/*
 		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
@@ -642,19 +660,70 @@ _bt_getrootheight(Relation rel)
 		 * Cache the metapage data for next time
 		 */
 		_bt_cachemetadata(rel, metad);
-
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_META_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
 		_bt_relbuf(rel, metabuf);
 	}
 
+	/* Get cached metapage data */
 	metad = (BTMetaPageData *) rel->rd_amcache;
-	/* We shouldn't have cached it if any of these fail */
-	Assert(metad->btm_magic == BTREE_MAGIC);
-	Assert(metad->btm_version == BTREE_VERSION);
-	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
+/*
+ *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *
+ *		This is used to determine the rules that must be used to descend a
+ *		btree.  Version 4 indexes treat heap TID as a tie-breaker attribute.
+ *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
+ *		performance when inserting a new BTScanInsert-wise duplicate tuple
+ *		among many leaf pages already full of such duplicates.
+ */
+bool
+_bt_heapkeyspace(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			uint32		btm_version = metad->btm_version;
+
+			_bt_relbuf(rel, metabuf);
+			return btm_version > BTREE_META_VERSION;
+		}
+
+		/*
+		 * Cache the metapage data for next time
+		 */
+		_bt_cachemetadata(rel, metad);
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_META_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached metapage data */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+
+	return metad->btm_version > BTREE_META_VERSION;
+}
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -1370,7 +1439,7 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 */
 			if (!stack)
 			{
-				ScanKey		itup_scankey;
+				BTScanInsert itup_scankey;
 				ItemId		itemid;
 				IndexTuple	targetkey;
 				Buffer		lbuf;
@@ -1420,12 +1489,20 @@ _bt_pagedel(Relation rel, Buffer buf)
 				}
 
 				/* we need an insertion scan key for the search, so build one */
-				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
-				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
-				/* don't need a pin on the page */
+				itup_scankey = _bt_mkscankey(rel, targetkey, false);
+				/* get stack to leaf page by searching index */
+				stack = _bt_search(rel, itup_scankey, &lbuf, BT_READ, NULL);
+
+				/*
+				 * Search will reliably relocate same leaf page.
+				 *
+				 * (However, prior to version 4 the search is for the leftmost
+				 * leaf page containing this key, which is okay because we
+				 * will tiebreak on downlink block number.)
+				 */
+				Assert(!itup_scankey->heapkeyspace ||
+					   BufferGetBlockNumber(buf) == BufferGetBlockNumber(lbuf));
+				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
 				/*
@@ -1970,7 +2047,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	if (BufferIsValid(metabuf))
 	{
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_META_VERSION)
 			_bt_upgrademetapage(metapg);
 		metad->btm_fastroot = rightsib;
 		metad->btm_fastlevel = targetlevel;
@@ -2018,6 +2095,8 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 		{
 			XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_META_VERSION);
+			xlmeta.version = metad->btm_version;
 			xlmeta.root = metad->btm_root;
 			xlmeta.level = metad->btm_level;
 			xlmeta.fastroot = metad->btm_fastroot;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e8725fbbe1..9cf760ffa0 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -794,7 +794,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_META_VERSION)
 	{
 		/*
 		 * Do cleanup if metapage needs upgrade, because we don't have
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 16223d01ec..5d9cf856f8 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,6 +25,10 @@
 #include "utils/tqual.h"
 
 
+static inline int32 _bt_nonpivot_compare(Relation rel,
+					 BTScanInsert key,
+					 Page page,
+					 OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 			 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -38,6 +42,8 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
+/* Per-backend lowest-possible sentinel TID value */
+static ItemPointerData lowest;
 
 /*
  *	_bt_drop_lock_and_maybe_pin()
@@ -72,12 +78,9 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.  If key was built
+ * using a leaf page's high key, that same leaf page will be relocated.
  *
  * Return value is a stack of parent-page pointers.  *bufP is set to the
  * address of the leaf-page buffer, which is read-locked and pinned.
@@ -94,8 +97,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+		   Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -131,7 +134,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+		*bufP = _bt_moveright(rel, key, *bufP,
 							  (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
@@ -145,7 +148,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -158,8 +161,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link to disambiguate duplicate keys in the index, which is
+		 * required when dealing with pg_upgrade'd !heapkeyspace indexes.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -199,8 +202,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  true, stack_in, BT_WRITE, snapshot);
+		*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+							  snapshot);
 	}
 
 	return stack_in;
@@ -216,16 +219,16 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * or strictly to the right of it.
  *
  * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page.  If that entry
- * is strictly less than the scankey, or <= the scankey in the nextkey=true
+ * tree by examining the high key entry on the page.  If that entry is
+ * strictly less than the scankey, or <= the scankey in the key.nextkey=true
  * case, then we followed the wrong link and we need to move right.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index (see nbtree/README).
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key.  When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
  *
  * If forupdate is true, we will attempt to finish any incomplete splits
  * that we encounter.  This is required when locking a target page for an
@@ -242,10 +245,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  */
 Buffer
 _bt_moveright(Relation rel,
+			  BTScanInsert key,
 			  Buffer buf,
-			  int keysz,
-			  ScanKey scankey,
-			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
 			  int access,
@@ -258,11 +259,15 @@ _bt_moveright(Relation rel,
 	/*
 	 * When nextkey = false (normal case): if the scan key that brought us to
 	 * this page is > the high key stored on the page, then the page has split
-	 * and we need to move right.  (If the scan key is equal to the high key,
-	 * we might or might not need to move right; have to scan the page first
-	 * anyway.)
+	 * and we need to move right.  (pg_upgrade'd !heapkeyspace indexes could
+	 * have some duplicates to the right as well as the left, but that's
+	 * something that's only ever dealt with on the leaf level, after
+	 * _bt_search has found an initial leaf page.  Duplicate pivots on
+	 * internal pages are useless to all index scans, which was a flaw in the
+	 * old design.)
 	 *
 	 * When nextkey = true: move right if the scan key is >= page's high key.
+	 * (Note that key.scantid cannot be set in this case.)
 	 *
 	 * The page could even have split more than once, so scan as far as
 	 * needed.
@@ -270,7 +275,7 @@ _bt_moveright(Relation rel,
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
 	 */
-	cmpval = nextkey ? 0 : 1;
+	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
@@ -305,7 +310,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -325,13 +330,6 @@ _bt_moveright(Relation rel,
 /*
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
  * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
  * particular, this means it is possible to return a value 1 greater than the
@@ -347,37 +345,75 @@ _bt_moveright(Relation rel,
  *
  * This procedure is not responsible for walking right, it just examines
  * the given page.  _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
+ * on the buffer.  When key->savebinsrch is set, the mutable fields of the
+ * insertion scan key are modified, so that a subsequent call that sets
+ * key->restorebinsrch can reuse the low and high bound of the original
+ * binary search.  This lets the second binary search, performed on the
+ * first leaf page landed on by inserters that do unique enforcement, avoid
+ * doing any real comparisons in most cases.
+ * See _bt_findinsertloc() for further details.
  */
 OffsetNumber
 _bt_binsrch(Relation rel,
-			Buffer buf,
-			int keysz,
-			ScanKey scankey,
-			bool nextkey)
+			BTScanInsert key,
+			Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber low,
-				high;
+				high,
+				savehigh;
 	int32		result,
 				cmpval;
+	bool		isleaf;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	isleaf = P_ISLEAF(opaque);
 
-	low = P_FIRSTDATAKEY(opaque);
-	high = PageGetMaxOffsetNumber(page);
+	Assert(!(key->restorebinsrch && key->savebinsrch));
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+	/* Save state before scantid is set; restore state only once it is set */
+	Assert(!key->savebinsrch || key->scantid == NULL);
+	Assert(!key->heapkeyspace || !key->restorebinsrch || key->scantid != NULL);
+	Assert(P_ISLEAF(opaque) || (!key->restorebinsrch && !key->savebinsrch));
 
-	/*
-	 * If there are no keys on the page, return the first available slot. Note
-	 * this covers two cases: the page is really empty (no keys), or it
-	 * contains only a high key.  The latter case is possible after vacuuming.
-	 * This can never happen on an internal page, however, since they are
-	 * never empty (an internal page must have children).
-	 */
-	if (high < low)
-		return low;
+	if (!key->restorebinsrch)
+	{
+		low = P_FIRSTDATAKEY(opaque);
+		high = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If there are no keys on the page, return the first available slot.
+		 * Note this covers two cases: the page is really empty (no keys), or
+		 * it contains only a high key.  The latter case is possible after
+		 * vacuuming.  This can never happen on an internal page, however,
+		 * since they are never empty (an internal page must have children).
+		 */
+		if (unlikely(high < low))
+		{
+			if (key->savebinsrch)
+			{
+				key->low = low;
+				key->high = high;
+				key->savebinsrch = false;
+			}
+			return low;
+		}
+		high++;					/* establish the loop invariant for high */
+	}
+	else
+	{
+		/* Restore result of previous binary search against same page */
+		low = key->low;
+		high = key->high;
+		key->restorebinsrch = false;
+
+		/* Return the first slot, in line with original binary search */
+		if (high < low)
+			return low;
+	}
 
 	/*
 	 * Binary search to find the first key on the page >= scan key, or first
@@ -391,22 +427,40 @@ _bt_binsrch(Relation rel,
 	 *
 	 * We can fall out when high == low.
 	 */
-	high++;						/* establish the loop invariant for high */
-
-	cmpval = nextkey ? 0 : 1;	/* select comparison value */
 
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
+	savehigh = high;
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		if (!isleaf)
+			result = _bt_compare(rel, key, page, mid);
+		else
+			result = _bt_nonpivot_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
 		else
+		{
 			high = mid;
+
+			/*
+			 * high can only be reused by a more restrictive binary search when
+			 * it's known to be strictly greater than the original scankey.
+			 */
+			if (result != 0)
+				savehigh = high;
+		}
+	}
+
+	if (key->savebinsrch)
+	{
+		key->low = low;
+		key->high = savehigh;
+		key->savebinsrch = false;
 	}
 
 	/*
@@ -416,12 +470,13 @@ _bt_binsrch(Relation rel,
 	 * On a leaf page, we always return the first key >= scan key (resp. >
 	 * scan key), which could be the last slot + 1.
 	 */
-	if (P_ISLEAF(opaque))
+	if (isleaf)
 		return low;
 
 	/*
 	 * On a non-leaf page, return the last key < scan key (resp. <= scan key).
-	 * There must be one if _bt_compare() is playing by the rules.
+	 * There must be one if _bt_compare()/_bt_tuple_compare() is playing by
+	 * the rules.
 	 */
 	Assert(low > P_FIRSTDATAKEY(opaque));
 
@@ -431,21 +486,11 @@ _bt_binsrch(Relation rel,
 /*----------
  *	_bt_compare() -- Compare scankey to a particular tuple on the page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * Convenience wrapper for _bt_tuple_compare() callers that want to compare
+ * the tuple at a given offset on a particular page.
  *
- *	keysz: number of key conditions to be checked (might be less than the
- *		number of index columns!)
  *	page/offnum: location of btree item to be compared to.
  *
- *		This routine returns:
- *			<0 if scankey < tuple at offnum;
- *			 0 if scankey == tuple at offnum;
- *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
- *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
  * scankey.  The actual key value stored (if any, which there probably isn't)
@@ -456,26 +501,82 @@ _bt_binsrch(Relation rel,
  */
 int32
 _bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
+			BTScanInsert key,
 			Page page,
 			OffsetNumber offnum)
 {
-	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
-	int			i;
+	int			ntupatts;
 
 	Assert(_bt_check_natts(rel, page, offnum));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
 	 * --- see NOTE above.
+	 *
+	 * A minus infinity key has all attributes truncated away, so this test is
+	 * redundant with the minus infinity attribute tie-breaker.  However, the
+	 * number of attributes in minus infinity tuples was not explicitly
+	 * represented as 0 until PostgreSQL v11, so an explicit offnum test is
+	 * still required.
 	 */
 	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
+	return _bt_tuple_compare(rel, key, itup, ntupatts);
+}
+
+/*
+ * Optimized version of _bt_compare().  Only works on non-pivot tuples.
+ */
+static inline int32
+_bt_nonpivot_compare(Relation rel,
+					 BTScanInsert key,
+					 Page page,
+					 OffsetNumber offnum)
+{
+	IndexTuple	itup;
+
+	Assert(_bt_check_natts(rel, page, offnum));
+
+	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	Assert(BTreeTupleGetNAtts(itup, rel) ==
+		   IndexRelationGetNumberOfAttributes(rel));
+	return _bt_tuple_compare(rel, key, itup, key->keysz);
+}
+
+/*----------
+ *	_bt_tuple_compare() -- Compare scankey to a particular tuple.
+ *
+ * The passed scankey must be an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ *		This routine returns:
+ *			<0 if scankey < tuple;
+ *			 0 if scankey == tuple;
+ *			>0 if scankey > tuple.
+ *		NULLs in the keys are treated as sortable values.  Therefore
+ *		"equality" does not necessarily mean that the item should be
+ *		returned to the caller as a matching key!
+ *----------
+ */
+int32
+_bt_tuple_compare(Relation rel,
+				  BTScanInsert key,
+				  IndexTuple itup,
+				  int ntupatts)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	ItemPointer heapTid;
+	int			ncmpkey;
+	int			i;
+	ScanKey		scankey;
+
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(key->heapkeyspace || key->scantid == NULL);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -489,7 +590,9 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	ncmpkey = Min(ntupatts, key->keysz);
+	scankey = key->scankeys;
+	for (i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -540,8 +643,82 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * Use the number of attributes as a tie-breaker, in order to treat
+	 * truncated attributes in the index as minus infinity.
+	 */
+	if (key->keysz > ntupatts)
+		return 1;
+
+	/* If caller provided no heap TID tie-breaker for scan, they're equal */
+	if (key->scantid == NULL)
+		return 0;
+
+	/*
+	 * Although it isn't counted as an attribute by BTreeTupleGetNAtts(), heap
+	 * TID is an implicit final key attribute that ensures that all index
+	 * tuples have a distinct set of key attribute values.
+	 *
+	 * This is often truncated away in pivot tuples, which makes the attribute
+	 * value implicitly negative infinity.
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (heapTid == NULL)
+		return 1;
+
+	return ItemPointerCompare(key->scantid, heapTid);
+}
+
+/*
+ * _bt_lowest_scantid() -- Manufacture low heap TID.
+ *
+ *		Create a heap TID that's strictly less than any possible real heap
+ *		TID, as far as _bt_tuple_compare is concerned.  It is still treated
+ *		as greater than minus infinity.  The overall effect is that
+ *		_bt_search follows downlinks whose non-TID attribute(s) equal the
+ *		scankey's but whose heap TID attribute was truncated away, since
+ *		the scankey is greater than the downlink/pivot tuple as a whole.
+ *		(Obviously this can only be of use when a scankey has values for all
+ *		key attributes other than the heap TID tie-breaker attribute/scantid.)
+ *
+ * If we didn't do this then affected index scans would have to
+ * unnecessarily visit an extra page before moving right to the page they
+ * should have landed on from the parent in the first place.  There would
+ * even be a useless binary search on the left/first page, since a high key
+ * check won't have the search move right immediately (the high key will be
+ * identical to the downlink we should have followed in the parent, barring
+ * a concurrent page split).
+ *
+ * This is particularly important with unique index insertions, since "the
+ * first page the value could be on" has an exclusive buffer lock held while
+ * a subsequent page (usually the actual first page the value could be on)
+ * has a shared buffer lock held.  (There may also be heap buffer locks
+ * acquired during this process.)
+ *
+ * Note that implementing this by hard-coding the behavior into _bt_compare
+ * is unworkable, since that would break nextkey semantics in the common
+ * case where all non-TID key attributes have been provided.
+ */
+ItemPointer
+_bt_lowest_scantid(void)
+{
+	static ItemPointer low = NULL;
+
+	/*
+	 * A heap TID that's less than or equal to any possible real heap TID
+	 * would also work
+	 */
+	if (!low)
+	{
+		low = &lowest;
+
+		/* Lowest possible block is 0 */
+		ItemPointerSetBlockNumber(low, 0);
+		/* InvalidOffsetNumber less than any real offset */
+		ItemPointerSetOffsetNumber(low, InvalidOffsetNumber);
+	}
+
+	return low;
 }
 
 /*
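
[Aside, not part of the patch: the reason a (block 0, InvalidOffsetNumber)
sentinel works at all is just the ordering that ItemPointerCompare() imposes.
Real heap TIDs always have an offset of at least FirstOffsetNumber, so the
sentinel sorts strictly below every one of them, while a truncated-away heap
TID still compares as minus infinity in _bt_tuple_compare().  A minimal
sketch, assuming the usual storage/itemptr.h and storage/off.h definitions:

    ItemPointerData sentinel;
    ItemPointerData lowest_real;

    ItemPointerSetBlockNumber(&sentinel, 0);
    ItemPointerSetOffsetNumber(&sentinel, InvalidOffsetNumber);
    ItemPointerSet(&lowest_real, 0, FirstOffsetNumber);

    /* sentinel < smallest possible real heap TID */
    Assert(ItemPointerCompare(&sentinel, &lowest_real) < 0);
]
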
@@ -575,8 +752,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat;
 	bool		nextkey;
 	bool		goback;
+	BTScanInsertData key;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
-	ScanKeyData scankeys[INDEX_MAX_KEYS];
+	ScanKey		scankeys;
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -822,10 +1000,12 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
 	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built in the local
-	 * scankeys[] array, using the keys identified by startKeys[].
+	 * identified above.  The insertion scankey is built in the scankeys[]
+	 * array, using the keys identified by startKeys[].
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
+	scankeys = key.scankeys;
+
 	for (i = 0; i < keysCount; i++)
 	{
 		ScanKey		cur = startKeys[i];
@@ -1053,12 +1233,38 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
+	/*
+	 * Initialize insertion scankey.
+	 *
+	 * Manufacture sentinel scan tid that's less than any possible heap TID in
+	 * the index when that might allow us to avoid unnecessary moves right
+	 * while descending the tree.
+	 *
+	 * Never do this for any nextkey case, since that would make _bt_search()
+	 * incorrectly land on the leaf page with the second user-attribute-wise
+	 * duplicate tuple, rather than landing on the leaf page with the next
+	 * user-attribute-distinct key > scankey, which is the intended behavior.
+	 * We could invent a _bt_highest_scantid() to use in nextkey cases, but
+	 * that would never actually save any cycles during the descent of the
+	 * tree; "_bt_binsrch() + nextkey = true" already behaves as if all tuples
+	 * <= scankey (in terms of the attributes/keys actually supplied in the
+	 * scankey) are < scankey.
+	 */
+	key.heapkeyspace = _bt_heapkeyspace(rel);
+	key.savebinsrch = key.restorebinsrch = false;
+	key.low = key.high = InvalidOffsetNumber;
+	key.nextkey = nextkey;
+	key.keysz = keysCount;
+	key.scantid = NULL;
+	if (key.keysz >= IndexRelationGetNumberOfKeyAttributes(rel) &&
+		!key.nextkey && key.heapkeyspace)
+		key.scantid = _bt_lowest_scantid();
+
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, &key, &buf, BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
@@ -1087,7 +1293,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, &key, buf);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
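
[Aside, not part of the patch: the net comparison order that nbtsearch.c now
implements can be summarized in a few lines.  The helper below is purely
illustrative (the name is made up); it just restates what _bt_tuple_compare()
does once the per-attribute loop over the common prefix has produced attcmp:

    static int32
    tiebreak_sketch(int32 attcmp, bool tuple_has_fewer_atts,
                    ItemPointer scantid, ItemPointer tupleheaptid)
    {
        if (attcmp != 0)
            return attcmp;      /* user-visible key attributes decide */
        if (tuple_has_fewer_atts)
            return 1;           /* truncated attributes act as minus infinity */
        if (scantid == NULL)
            return 0;           /* caller supplied no heap TID tie-breaker */
        if (tupleheaptid == NULL)
            return 1;           /* pivot's truncated heap TID is minus infinity */
        return ItemPointerCompare(scantid, tupleheaptid);
    }
]
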
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..978fe5dfe3 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -743,6 +743,7 @@ _bt_sortaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -796,8 +797,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -814,27 +813,21 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	itupsz = MAXALIGN(itupsz);
 
 	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itupsz doesn't
-	 * include the ItemId.
+	 * Check whether the item can fit on a btree page at all.
 	 *
-	 * NOTE: similar code appears in _bt_insertonpg() to defend against
-	 * oversize items being inserted into an already-existing index. But
-	 * during creation of an index, we don't go through there.
+	 * Every newly built index will treat heap TID as part of the keyspace,
+	 * which imposes the requirement that new high keys must occasionally have
+	 * a heap TID appended within _bt_truncate().  That may leave a new pivot
+	 * tuple one MAXALIGN() quantum larger than the original first right tuple
+	 * it's derived from.  v4 deals with the problem by decreasing the limit
+	 * on the size of tuples inserted on the leaf level by the same small
+	 * amount.  Enforce the new v4+ limit on the leaf level, and the old limit
+	 * on internal levels, since pivot tuples may need to make use of the
+	 * spare MAXALIGN() quantum.  This should never fail on internal pages.
 	 */
 	if (itupsz > BTMaxItemSize(npage))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itupsz, BTMaxItemSize(npage),
-						RelationGetRelationName(wstate->index)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(wstate->heap,
-									RelationGetRelationName(wstate->index))));
+		_bt_check_third_page(wstate->index, wstate->heap,
+							 state->btps_level == 0, npage, itup);
 
 	/*
 	 * Check to see if page is "full".  It's definitely full if the item won't
@@ -880,24 +873,35 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
+			IndexTuple	lastleft;
 			IndexTuple	truncated;
 			Size		truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks
 			 * in internal pages are either negative infinity items, or get
 			 * their contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it more
+			 * likely that _bt_truncate() can truncate away more attributes,
+			 * whereas the split point passed to _bt_split() is chosen much
+			 * more delicately.  Suffix truncation is mostly useful because
+			 * it improves space utilization for workloads with random
+			 * insertions.  It doesn't seem worthwhile to add logic for
+			 * choosing a split point here for a benefit that is bound to be
+			 * much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
 			 * original high key, and add our own truncated high key at the
-			 * same offset.
+			 * same offset.  It's okay if the truncated tuple is slightly
+			 * larger due to containing a heap TID value, since pivot tuples
+			 * are treated as a special case by _bt_check_third_page().
 			 *
 			 * Note that the page layout won't be changed very much.  oitup is
 			 * already located at the physical beginning of tuple space, so we
@@ -905,7 +909,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_truncate(wstate->index, lastleft, oitup, true);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -924,8 +931,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -970,7 +978,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1029,8 +1037,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1115,7 +1124,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
 		}
 
-		_bt_freeskey(indexScanKey);
+		pfree(indexScanKey);
 
 		for (;;)
 		{
@@ -1127,6 +1136,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1134,7 +1145,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1151,6 +1161,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all
+				 * keys in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 205457ef99..c0d46e2beb 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,6 +49,8 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_leave_natts(Relation rel, IndexTuple lastleft,
+				IndexTuple firstright, bool build);
 
 
 /*
@@ -56,34 +58,62 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
+ *		When itup is a non-pivot tuple, the returned insertion scan key is
+ *		suitable for finding a place for it to go on the leaf level.  When
+ *		itup is a pivot tuple, the returned insertion scankey is suitable
+ *		for locating the leaf page with the pivot as its high key (there
+ *		must have been one like it at some point if the pivot tuple
+ *		actually came from the tree).
+ *
+ *		Note that we may occasionally have to share lock the metapage, in
+ *		order to determine whether or not the keys in the index are expected
+ *		to be unique (i.e. whether or not heap TID is treated as a tie-breaker
+ *		attribute).  Callers that cannot tolerate this can request that we
+ *		assume that this is a heapkeyspace index.
+ *
  *		The result is intended for use with _bt_compare().
  */
-ScanKey
-_bt_mkscankey(Relation rel, IndexTuple itup)
+BTScanInsert
+_bt_mkscankey(Relation rel, IndexTuple itup, bool assumeheapkeyspace)
 {
+	BTScanInsert res;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
+	int			tupnatts;
 	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts > 0);
+	Assert(tupnatts <= indnatts);
 
 	/*
-	 * We'll execute search using scan key constructed on key columns. Non-key
-	 * (INCLUDE index) columns are always omitted from scan keys.
+	 * We'll execute search using scan key constructed on key columns.
+	 * Truncated attributes and non-key attributes are omitted from the final
+	 * scan key.  There is no need to explicitly represent truncated
+	 * attributes as having negative infinity values, since _bt_binsrch()
+	 * already makes _bt_search() follow the last downlink strictly less than
+	 * scankey on internal pages.  In other words, there isn't a difference
+	 * between < and <= when descending a B-tree, and so negative infinity is
+	 * the implicit value of omitted columns (at least in any context where
+	 * it's sensible to think about placeholder values).
 	 */
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
+	res = palloc(offsetof(BTScanInsertData, scankeys) +
+				 sizeof(ScanKeyData) * indnkeyatts);
+	res->heapkeyspace = assumeheapkeyspace || _bt_heapkeyspace(rel);
+	res->savebinsrch = res->restorebinsrch = false;
+	res->low = res->high = InvalidOffsetNumber;
+	res->nextkey = false;
+	res->keysz = Min(indnkeyatts, tupnatts);
+	res->scantid = res->heapkeyspace ? BTreeTupleGetHeapTID(itup) : NULL;
+	skey = res->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
 		FmgrInfo   *procinfo;
@@ -96,7 +126,19 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Keys built from truncated attributes are defensively represented as
+		 * NULL values, though they should still not participate in
+		 * comparisons.
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -108,7 +150,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 									   arg);
 	}
 
-	return skey;
+	return res;
 }
 
 /*
@@ -159,15 +201,6 @@ _bt_mkscankey_nodata(Relation rel)
 	return skey;
 }
 
-/*
- * free a scan key made by either _bt_mkscankey or _bt_mkscankey_nodata.
- */
-void
-_bt_freeskey(ScanKey skey)
-{
-	pfree(skey);
-}
-
 /*
  * free a retracement stack made by _bt_search.
  */
@@ -2083,38 +2116,266 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  This is possible when there are
+ * attributes that follow an attribute in firstright that is not equal to the
+ * corresponding attribute in lastleft (equal according to insertion scan key
+ * semantics).
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that takes up more
+ * space than firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
  * Note that returned tuple's t_tid offset will hold the number of attributes
  * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * should only change truncated tuple's downlink.  Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare()/_bt_tuple_compare().
+ *
+ * In the worst case (when a heap TID is appended) the size of the returned
+ * tuple is the size of the first right tuple plus an additional MAXALIGN()
+ * quantum.  This guarantee is important, since callers need to stay under
+ * the 1/3 of a page restriction on tuple size.  If this routine is ever
+ * taught to truncate within an attribute/datum, it will need to avoid
+ * returning an enlarged tuple to caller when truncation + TOAST compression
+ * ends up enlarging the final datum.
+ *
+ * CREATE INDEX callers must pass build = true, in order to avoid metapage
+ * access.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			 bool build)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int16		natts = IndexRelationGetNumberOfAttributes(rel);
+	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			leavenatts;
+	IndexTuple	pivot;
+	ItemPointer pivotheaptid;
+	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples.  It's never okay to
+	 * truncate a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be left behind */
+	leavenatts = _bt_leave_natts(rel, lastleft, firstright, build);
 
-	return truncated;
+#ifdef DEBUG_NO_TRUNCATE
+	/* Artificially force truncation to always append heap TID */
+	leavenatts = nkeyatts + 1;
+#endif
+
+	if (leavenatts <= natts)
+	{
+		IndexTuple	tidpivot;
+
+		pivot = index_truncate_tuple(itupdesc, firstright, leavenatts);
+
+		/*
+		 * If there is a distinguishing key attribute within leavenatts, there
+		 * is no need to add an explicit heap TID attribute to new pivot.
+		 */
+		if (leavenatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, leavenatts);
+			return pivot;
+		}
+
+		/*
+		 * This must be an INCLUDE index where only non-key attributes could
+		 * be truncated away.  They are not considered part of the key space,
+		 * so it's still necessary to add a heap TID attribute to the new
+		 * pivot tuple.  Create an enlarged copy of the truncated firstright
+		 * tuple, with enough space at the end to fit a heap TID.
+		 */
+		Assert(natts != nkeyatts);
+		newsize = MAXALIGN(IndexTupleSize(pivot) + sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		/* cannot leak memory here */
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal, and
+		 * there are no non-key attributes that need to be truncated in
+		 * passing.  It's necessary to add a heap TID attribute to the new
+		 * pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = MAXALIGN(IndexTupleSize(firstright) + sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * We have to use heap TID as a unique-ifier in the new pivot tuple, since
+	 * no non-TID attribute in the right item readily distinguishes the right
+	 * side of the split from the left side.  Use enlarged space that holds a
+	 * copy of first right tuple; place a heap TID value within the extra
+	 * space that remains at the end.
+	 *
+	 * nbtree conceptualizes this case as an inability to truncate away any
+	 * attribute.  We must use an alternative representation of heap TID
+	 * within pivots because heap TID is only treated as an attribute within
+	 * nbtree (e.g., there is no explicit pg_attribute entry).
+	 *
+	 * Callers generally try to avoid choosing a split point that necessitates
+	 * that we do this.  Splits of pages that only involve a single distinct
+	 * value (or set of values) must end up here, though.
+	 */
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/*
+	 * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+	 * consider suffix truncation.  It seems like a good idea to follow that
+	 * example in cases where no truncation takes place -- use lastleft's heap
+	 * TID.  (This is also the closest value to negative infinity that's
+	 * legally usable.)
+	 */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  sizeof(ItemPointerData));
+	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+
+	/*
+	 * Lehman and Yao require that the downlink to the right page, which is to
+	 * be inserted into the parent page in the second phase of a page split,
+	 * be a strict lower bound on items on the right page, and a non-strict
+	 * upper bound for items on the left page.  Assert that heap TIDs follow
+	 * these invariants, since a heap TID value is apparently needed as a
+	 * tiebreaker.
+	 */
+#ifndef DEBUG_NO_TRUNCATE
+	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#else
+	/*
+	 * Those invariants aren't guaranteed to hold for lastleft + firstright
+	 * heap TID attribute values when they're considered here only because
+	 * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+	 * needed as a tiebreaker).  DEBUG_NO_TRUNCATE must therefore use a heap
+	 * TID value that always works as a strict lower bound for items to the
+	 * right.  In particular, it must avoid using firstright's leading key
+	 * attribute values along with lastleft's heap TID value when lastleft's
+	 * TID happens to be greater than firstright's TID.
+	 *
+	 * (We could just use all of lastleft instead, but that would complicate
+	 * caller's free space accounting, which makes the assumption that the new
+	 * pivot must be no larger than firstright plus a single MAXALIGN()
+	 * quantum.)
+	 */
+	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+
+	/*
+	 * Pivot heap TID should never be fully equal to firstright.  Note that
+	 * the pivot heap TID will still end up equal to lastleft's heap TID when
+	 * that's the only value that's legally usable.
+	 */
+	ItemPointerSetOffsetNumber(pivotheaptid,
+							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#endif
+
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
+
+/*
+ * _bt_leave_natts - how many key attributes to leave when truncating.
+ *
+ * Caller provides two tuples that enclose a split point.  CREATE INDEX
+ * callers must pass build = true so that we may avoid metapage access.  (This
+ * is okay because CREATE INDEX always creates an index on the latest btree
+ * version.)
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+				bool build)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			leavenatts;
+	ScanKey		scankey;
+	BTScanInsert key;
+
+	key = _bt_mkscankey(rel, firstright, build);
+
+	/*
+	 * Be consistent about the representation of BTREE_VERSION 3 tuples across
+	 * Postgres versions; don't allow new pivot tuples to have truncated key
+	 * attributes there.  This keeps things consistent and simple for
+	 * verification tools that have to handle multiple versions.
+	 */
+	if (!key->heapkeyspace)
+	{
+		Assert(nkeyatts != IndexRelationGetNumberOfAttributes(rel));
+		return nkeyatts;
+	}
+
+	Assert(key->keysz == nkeyatts);
+	scankey = key->scankeys;
+	leavenatts = 1;
+	for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+	{
+		Datum		datum1;
+		bool		isNull1,
+					isNull2;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		isNull2 = (scankey->sk_flags & SK_ISNULL) != 0;
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+											scankey->sk_collation,
+											datum1,
+											scankey->sk_argument)) != 0)
+			break;
+
+		leavenatts++;
+	}
+
+	/*
+	 * Make sure that an authoritative comparison that considers per-column
+	 * options like ASC/DESC/NULLS FIRST/NULLS LAST indicates that it's okay
+	 * to truncate firstright tuple up to leavenatts -- we expect to get a new
+	 * pivot that's strictly greater than lastleft when truncation can go
+	 * ahead.  (A truncated version of firstright is also bound to be strictly
+	 * less than firstright, since their attributes will be equal prior to one
+	 * or more truncated negative infinity attributes.)
+	 */
+	Assert(leavenatts > nkeyatts ||
+		   _bt_tuple_compare(rel, key, lastleft, leavenatts) > 0);
+
+	/* Can't leak memory here */
+	pfree(key);
+
+	return leavenatts;
 }
 
 /*
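
[Aside, not part of the patch: a worked example of what _bt_leave_natts()
decides, for a hypothetical two-column index on (country, city):

    lastleft = ('Canada', 'Toronto'), firstright = ('USA', 'Albany'):
        attribute 1 already differs      =>  leavenatts = 1;
        the new pivot is ('USA'), with 'Albany' truncated away.

    lastleft = ('USA', 'New Jersey'), firstright = ('USA', 'New York'):
        attribute 1 equal, 2 differs     =>  leavenatts = 2;
        the new pivot keeps both attributes, and no heap TID is appended.

    lastleft = ('USA', 'New York'), firstright = ('USA', 'New York'):
        all key attributes equal         =>  leavenatts = nkeyatts + 1;
        _bt_truncate() appends lastleft's heap TID to the new pivot.
]
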
@@ -2137,6 +2398,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2156,6 +2418,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
@@ -2165,7 +2428,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2176,7 +2439,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			Assert(!P_RIGHTMOST(opaque));
 
 			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2209,8 +2472,82 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Tuple contains only key attributes despite on is it page high
 			 * key or not
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 
 	}
 }
+
+/*
+ *
+ *  _bt_check_third_page() -- check whether tuple fits on a btree page at all.
+ *
+ * We actually need to be able to fit three items on every page, so restrict
+ * any one item to 1/3 the per-page available space.  Note that itemsz should
+ * not include the ItemId overhead.
+ *
+ * It might be useful to apply TOAST methods rather than throw an error here.
+ * Using out of line storage would break assumptions made by suffix truncation
+ * and by contrib/amcheck, though.
+ */
+void
+_bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
+					 Page page, IndexTuple newtup)
+{
+	Size		 itemsz;
+	BTPageOpaque opaque;
+
+	itemsz = MAXALIGN(IndexTupleSize(newtup));
+
+	/* Double check item size against limit */
+	if (itemsz <= BTMaxItemSize(page))
+		return;
+
+	/*
+	 * Tuple is probably too large to fit on page, but it's possible that the
+	 * index uses version 2 or version 3, or that page is an internal page, in
+	 * which case a slightly higher limit applies.
+	 */
+	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
+		return;
+
+	/*
+	 * Internal page insertions cannot fail here, because that would mean that
+	 * an earlier leaf level insertion that should have failed didn't
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (!P_ISLEAF(opaque))
+		elog(ERROR, "cannot insert oversized tuple of size %zu on internal page of index \"%s\"",
+			 itemsz, RelationGetRelationName(rel));
+
+	if (needheaptidspace)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
+						itemsz, BTREE_VERSION, BTMaxItemSize(page),
+						RelationGetRelationName(rel)),
+				 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+						   ItemPointerGetBlockNumber(&newtup->t_tid),
+						   ItemPointerGetOffsetNumber(&newtup->t_tid),
+						   RelationGetRelationName(heap)),
+				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+						 "Consider a function index of an MD5 hash of the value, "
+						 "or use full text indexing."),
+				 errtableconstraint(heap,
+									RelationGetRelationName(rel))));
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("index row size %zu exceeds btree version 3 maximum %zu for index \"%s\"",
+						itemsz, BTMaxItemSizeNoHeapTid(page),
+						RelationGetRelationName(rel)),
+				 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+						   ItemPointerGetBlockNumber(&newtup->t_tid),
+						   ItemPointerGetOffsetNumber(&newtup->t_tid),
+						   RelationGetRelationName(heap)),
+				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+						 "Consider a function index of an MD5 hash of the value, "
+						 "or use full text indexing."),
+				 errtableconstraint(heap,
+									RelationGetRelationName(rel))));
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 67a94cb80a..fe8f4fe2a7 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -103,7 +103,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 
 	md = BTPageGetMeta(metapg);
 	md->btm_magic = BTREE_MAGIC;
-	md->btm_version = BTREE_VERSION;
+	md->btm_version = xlrec->version;
 	md->btm_root = xlrec->root;
 	md->btm_level = xlrec->level;
 	md->btm_fastroot = xlrec->fastroot;
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -284,6 +268,8 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		OffsetNumber off;
 		IndexTuple	newitem = NULL;
 		Size		newitemsz = 0;
+		IndexTuple	left_hikey = NULL;
+		Size		left_hikeysz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 5c4457179d..667c906b2e 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index ee7fd83c02..5b27fda139 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -964,7 +964,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -1042,7 +1042,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -4057,9 +4057,10 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required for
+	 * btree indexes, since heap TID is treated as an implicit last key
+	 * attribute in order to ensure that all keys in the index are physically
+	 * unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
@@ -4076,6 +4077,9 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
@@ -4128,6 +4132,9 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ea495f1724..4f66ab5845 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -97,7 +97,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
 typedef struct BTMetaPageData
 {
 	uint32		btm_magic;		/* should contain BTREE_MAGIC */
-	uint32		btm_version;	/* should contain BTREE_VERSION */
+	uint32		btm_version;	/* should be >= BTREE_META_VERSION */
 	BlockNumber btm_root;		/* current root location */
 	uint32		btm_level;		/* tree level of the root page */
 	BlockNumber btm_fastroot;	/* current "fast" root location */
@@ -114,16 +114,27 @@ typedef struct BTMetaPageData
 
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
+#define BTREE_VERSION	4		/* current version number */
 #define BTREE_MIN_VERSION	2	/* minimal supported version number */
+#define BTREE_META_VERSION	3	/* minimal version with all meta fields */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_truncate() will need to enlarge
+ * the new pivot tuple it creates, to make space for a tie-breaker heap TID
+ * attribute, which we account for here.
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*sizeof(ItemPointerData)) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeNoHeapTid(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
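
[Aside, not part of the patch: sanity-checking the two limits with my own
arithmetic, assuming the default 8192-byte BLCKSZ, 8-byte MAXALIGN, and the
usual struct sizes (page header 24 bytes, ItemIdData 4, ItemPointerData 6,
BTPageOpaqueData 16):

    BTMaxItemSizeNoHeapTid = MAXALIGN_DOWN((8192 - MAXALIGN(24 + 3*4)
                                                 - MAXALIGN(16)) / 3)
                           = MAXALIGN_DOWN(8136 / 3) = 2712

    BTMaxItemSize          = MAXALIGN_DOWN((8192 - MAXALIGN(24 + 3*4 + 3*6)
                                                 - MAXALIGN(16)) / 3)
                           = MAXALIGN_DOWN(8120 / 3) = 2704

That is, the leaf-level limit shrinks by exactly one MAXALIGN() quantum,
which is the extra space _bt_truncate() may need for an appended heap TID.]
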
@@ -203,22 +214,25 @@ typedef struct BTMetaPageData
  * their item pointer offset field, since pivot tuples never need to store a
  * real offset (downlinks only need to store a block number).  The offset
  * field only stores the number of attributes when the INDEX_ALT_TID_MASK
- * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * bit is set, though that number doesn't include the trailing heap TID
+ * attribute sometimes stored in pivot tuples -- that's represented by the
+ * presence of BT_HEAP_TID_ATTR.  INDEX_ALT_TID_MASK is only used for pivot
+ * tuples at present, though it's possible that it will be used within
+ * non-pivot tuples in the future.  All pivot tuples must have
+ * INDEX_ALT_TID_MASK set as of BTREE_VERSION 4.
  *
  * The 12 least significant offset bits are used to represent the number of
- * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
- * for future use (BT_RESERVED_OFFSET_MASK bits). BT_N_KEYS_OFFSET_MASK should
- * be large enough to store any number <= INDEX_MAX_KEYS.
+ * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 status bits
+ * (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for future
+ * use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any number of
+ * attributes <= INDEX_MAX_KEYS.
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
+
+/* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +255,15 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +272,42 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
+ * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
+ * (since all pivot tuples must have it set as of BTREE_VERSION 4).  When non-pivot
+ * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
+ * probably also contain a heap TID at the end of the tuple.  We avoid
+ * assuming that a tuple with INDEX_ALT_TID_MASK set is necessarily a pivot
+ * tuple.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   sizeof(ItemPointerData)) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -319,6 +366,64 @@ typedef struct BTStackData
 
 typedef BTStackData *BTStack;
 
+/*
+ * BTScanInsert is the btree-private state needed to find an initial position
+ * for an indexscan, or to insert new tuples.  It is the structure used to
+ * descend a B-Tree using _bt_search().  For details on its mutable state, see
+ * _bt_binsrch and _bt_findinsertloc.
+ *
+ * heapkeyspace indicates if we expect all keys in the index to be unique by
+ * treating heap TID as a tie-breaker attribute (i.e. the index is
+ * BTREE_VERSION 4+).  scantid should never be set when index is not a
+ * heapkeyspace index.
+ *
+ * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
+ * locate the first item >= scankey.  When nextkey is true, they will locate
+ * the first item > scan key.
+ *
+ * scantid is the heap TID that is used as a final tie-breaker attribute,
+ * which may be set to NULL to indicate its absence.  When inserting new
+ * tuples, it must be set, since every tuple in the tree unambiguously belongs
+ * in one exact position, even when there are entries in the tree that are
+ * considered duplicates by external code.  Unique insertions set scantid only
+ * after unique checking indicates that it's safe to insert.  Despite the
+ * representational difference, scantid is just another insertion scankey to
+ * routines like _bt_search().
+ *
+ * keysz is the number of insertion scankeys present (scantid is counted
+ * separately).
+ *
+ * scankeys is an array of scan key entries for attributes that are compared
+ * before scantid (user-visible attributes).  Every attribute should have an
+ * entry during insertion, though not necessarily when a regular index scan
+ * uses an insertion scankey to find an initial leaf page.   The array is
+ * used as a flexible array member, though it's sized in a way that makes it
+ * possible to use stack allocations.  See nbtree/README for full details.
+ */
+
+typedef struct BTScanInsertData
+{
+	/*
+	 * Mutable state used by _bt_binsrch() to inexpensively repeat a binary
+	 * search on the leaf level when only scantid has changed.  Only used for
+	 * insertions where _bt_check_unique() is called.
+	 */
+	bool		savebinsrch;
+	bool		restorebinsrch;
+	OffsetNumber low;
+	OffsetNumber high;
+
+	/* State used to locate a position at the leaf level */
+	bool		heapkeyspace;
+	bool		nextkey;
+	ItemPointer scantid;		/* tiebreaker for scankeys */
+	int			keysz;			/* Size of scankeys */
+	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
+} BTScanInsertData;
+
+typedef BTScanInsertData *BTScanInsert;
+
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -541,6 +646,7 @@ extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
+extern bool _bt_heapkeyspace(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -559,15 +665,15 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
  * prototypes for functions in nbtsearch.c
  */
 extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
+		   BTScanInsert key,
 		   Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
-extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_tuple_compare(Relation rel, BTScanInsert key, IndexTuple itup,
+				  int ntupatts);
+extern ItemPointer _bt_lowest_scantid(void);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -576,9 +682,9 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 /*
  * prototypes for functions in nbtutils.c
  */
-extern ScanKey _bt_mkscankey(Relation rel, IndexTuple itup);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup,
+								  bool assumeheapkeyspace);
 extern ScanKey _bt_mkscankey_nodata(Relation rel);
-extern void _bt_freeskey(ScanKey skey);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -600,8 +706,11 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+			 IndexTuple firstright, bool build);
 extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
+extern void _bt_check_third_page(Relation rel, Relation heap,
+		   bool needheaptidspace, Page page, IndexTuple newtup);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 819373031c..06da0965f7 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,7 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50 and 0x60 are unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -47,6 +46,7 @@
  */
 typedef struct xl_btree_metadata
 {
+	uint32		version;
 	BlockNumber root;
 	uint32		level;
 	BlockNumber fastroot;
@@ -82,20 +82,16 @@ typedef struct xl_btree_insert
  *
  * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
  * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always explicitly log the left page high key value.
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * In the _R variants, the new item is one of the right page's tuples.  An
+ * IndexTuple representing the HIKEY of the left page follows.  We don't need
+ * this on leaf pages, because it's the same as the leftmost key in the new
+ * right page.
  *
  * Backup Blk 1: new right page
  *
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8e50f8ffbb..8d31110b87 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 -- doesn't work: grant still exists
 DROP USER regress_dep_user1;
 ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
-DETAIL:  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
+DETAIL:  privileges for table deptest1
 privileges for database regression
-privileges for table deptest1
+owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
 DROP OWNED BY regress_dep_user1;
 DROP USER regress_dep_user1;
 \set VERBOSITY terse
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index 0e32d5c427..ac41419c7b 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -187,9 +187,9 @@ ERROR:  event trigger "regress_event_trigger" does not exist
 -- should fail, regress_evt_user owns some objects
 drop role regress_evt_user;
 ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
-DETAIL:  owner of event trigger regress_event_trigger3
+DETAIL:  owner of user mapping for regress_evt_user on server useless_server
 owner of default privileges on new relations belonging to role regress_evt_user
-owner of user mapping for regress_evt_user on server useless_server
+owner of event trigger regress_event_trigger3
 -- cleanup before next test
 -- these are all OK; the second one should emit a NOTICE
 drop event trigger if exists regress_event_trigger2;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 4d82d3a7e8..9c763ec184 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
 RESET ROLE;
 DROP ROLE regress_test_indirect;                            -- ERROR
 ERROR:  role "regress_test_indirect" cannot be dropped because some objects depend on it
-DETAIL:  owner of server s1
-privileges for foreign-data wrapper foo
+DETAIL:  privileges for foreign-data wrapper foo
+owner of server s1
 \des+
                                                                                  List of foreign servers
  Name |           Owner           | Foreign-data wrapper |                   Access privileges                   |  Type  | Version |             FDW options              | Description 
@@ -1998,9 +1998,9 @@ DROP TABLE temp_parted;
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 ERROR:  role "regress_test_role" cannot be dropped because some objects depend on it
-DETAIL:  privileges for server s4
+DETAIL:  owner of user mapping for regress_test_role on server s6
 privileges for foreign-data wrapper foo
-owner of user mapping for regress_test_role on server s6
+privileges for server s4
 DROP SERVER t1 CASCADE;
 NOTICE:  drop cascades to user mapping for public on server t1
 DROP USER MAPPING FOR regress_test_role SERVER s6;
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index 1d12b01068..06fe44d39a 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3502,8 +3502,8 @@ SELECT refclassid::regclass, deptype
 SAVEPOINT q;
 DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
 ERROR:  role "regress_rls_eve" cannot be dropped because some objects depend on it
-DETAIL:  target of policy p on table tbl1
-privileges for table tbl1
+DETAIL:  privileges for table tbl1
+target of policy p on table tbl1
 ROLLBACK TO q;
 ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
 SAVEPOINT q;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9fe950b29d..08cf72d670 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -167,6 +167,8 @@ BTArrayKeyInfo
 BTBuildState
 BTCycleId
 BTIndexStat
+BTInsertionKey
+BTInsertionKeyData
 BTLeader
 BTMetaPageData
 BTOneVacInfo
@@ -2207,6 +2209,8 @@ SpecialJoinInfo
 SpinDelayStatus
 SplitInterval
 SplitLR
+SplitMode
+SplitPoint
 SplitVar
 SplitedPageLayout
 StackElem
-- 
2.17.1

Attachment: v9-0005-Add-high-key-continuescan-optimization.patch (application/x-patch)
From a4140e87691f235b9ac0d9755b214f98ea3b1b05 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 12 Nov 2018 13:11:21 -0800
Subject: [PATCH v9 5/7] Add high key "continuescan" optimization.

Teach B-Tree forward index scans to check the high key before moving to
the next page in the hopes of finding that it isn't actually necessary
to move to the next page.  We already opportunistically force a key
check of the last item on leaf pages, even when it's clear that it
cannot be returned to the scan due to being dead-to-all, for the same
reason.  Since forcing the last item to be key checked no longer makes
any difference in the case of forward scans, the existing extra key
check is now only used for backwards scans.  Like the existing check,
the new check won't always work out, but that seems like an acceptable
price to pay.

The new approach is more effective than just checking non-pivot tuples,
especially with composite indexes and non-unique indexes.  The high key
represents an upper bound on all values that can appear on the page,
which is often greater than whatever tuple happens to appear last at the
time of the check.  Also, suffix truncation's new logic for picking a
split point will often result in high keys that are relatively
dissimilar to the other (non-pivot) tuples on the page, and therefore
more likely to indicate that the scan need not proceed to the next page.

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.
---
 src/backend/access/nbtree/nbtsearch.c | 23 +++++++---
 src/backend/access/nbtree/nbtutils.c  | 60 +++++++++++++++++++++------
 2 files changed, 65 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 5d9cf856f8..c1a483e8d1 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1428,7 +1428,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	OffsetNumber maxoff;
 	int			itemIndex;
 	IndexTuple	itup;
-	bool		continuescan;
+	bool		continuescan = true;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1496,16 +1496,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 				_bt_saveitem(so, itemIndex, offnum, itup);
 				itemIndex++;
 			}
+			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
-			{
-				/* there can't be any more matches, so stop */
-				so->currPos.moreRight = false;
 				break;
-			}
 
 			offnum = OffsetNumberNext(offnum);
 		}
 
+		/*
+		 * Forward scans need not visit page to the right when high key
+		 * indicates no more matches will be found there.
+		 *
+		 * Checking the high key like this works out more often than you'd
+		 * think.  Leaf page splits pick a split point between the two most
+		 * dissimilar tuples within a range of acceptable split points.  There
+		 * is often natural locality around what ends up on each leaf page,
+		 * which is worth taking advantage of here.
+		 */
+		if (!P_RIGHTMOST(opaque) && continuescan)
+			(void) _bt_checkkeys(scan, page, P_HIKEY, dir, &continuescan);
+
+		if (!continuescan)
+			so->currPos.moreRight = false;
+
 		Assert(itemIndex <= MaxIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 2c1be82acb..a4964dc22c 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -48,7 +48,7 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
 static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
-					 IndexTuple tuple, TupleDesc tupdesc,
+					 IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
 static int _bt_leave_natts(Relation rel, IndexTuple lastleft,
 				IndexTuple firstright, bool build);
@@ -1398,7 +1398,10 @@ _bt_mark_scankey_required(ScanKey skey)
  * dir: direction we are scanning in
  * continuescan: output parameter (will be set correctly in all cases)
  *
- * Caller must hold pin and lock on the index page.
+ * Caller must hold pin and lock on the index page.  Caller can pass a high
+ * key offnum in the hopes of discovering that the scan need not continue on
+ * to a page to the right.  We don't currently bother limiting high key
+ * comparisons to SK_BT_REQFWD scan keys.
  */
 IndexTuple
 _bt_checkkeys(IndexScanDesc scan,
@@ -1408,6 +1411,7 @@ _bt_checkkeys(IndexScanDesc scan,
 	ItemId		iid = PageGetItemId(page, offnum);
 	bool		tuple_alive;
 	IndexTuple	tuple;
+	int			tupnatts;
 	TupleDesc	tupdesc;
 	BTScanOpaque so;
 	int			keysz;
@@ -1421,21 +1425,24 @@ _bt_checkkeys(IndexScanDesc scan,
 	 * killed tuple as not passing the qual.  Most of the time, it's a win to
 	 * not bother examining the tuple's index keys, but just return
 	 * immediately with continuescan = true to proceed to the next tuple.
-	 * However, if this is the last tuple on the page, we should check the
-	 * index keys to prevent uselessly advancing to the next page.
+	 * However, if this is the first tuple on the page, and we're doing a
+	 * backward scan, we should check the index keys to prevent uselessly
+	 * advancing to the page to the left.  This is similar to the high key
+	 * optimization used by forward scan callers.
 	 */
 	if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
 	{
-		/* return immediately if there are more tuples on the page */
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+		Assert(offnum != P_HIKEY || P_RIGHTMOST(opaque));
 		if (ScanDirectionIsForward(dir))
 		{
-			if (offnum < PageGetMaxOffsetNumber(page))
-				return NULL;
+			/* forward scan callers check high key instead */
+			return NULL;
 		}
 		else
 		{
-			BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
+			/* return immediately if there are more tuples on the page */
 			if (offnum > P_FIRSTDATAKEY(opaque))
 				return NULL;
 		}
@@ -1450,6 +1457,7 @@ _bt_checkkeys(IndexScanDesc scan,
 		tuple_alive = true;
 
 	tuple = (IndexTuple) PageGetItem(page, iid);
+	tupnatts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
 
 	tupdesc = RelationGetDescr(scan->indexRelation);
 	so = (BTScanOpaque) scan->opaque;
@@ -1461,11 +1469,24 @@ _bt_checkkeys(IndexScanDesc scan,
 		bool		isNull;
 		Datum		test;
 
-		Assert(key->sk_attno <= BTreeTupleGetNAtts(tuple, scan->indexRelation));
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (key->sk_attno > tupnatts)
+		{
+			Assert(offnum == P_HIKEY);
+			Assert(ScanDirectionIsForward(dir));
+			continue;
+		}
+
 		/* row-comparison keys need special processing */
 		if (key->sk_flags & SK_ROW_HEADER)
 		{
-			if (_bt_check_rowcompare(key, tuple, tupdesc, dir, continuescan))
+			if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+									 continuescan))
 				continue;
 			return NULL;
 		}
@@ -1596,8 +1617,8 @@ _bt_checkkeys(IndexScanDesc scan,
  * This is a subroutine for _bt_checkkeys, which see for more info.
  */
 static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
-					 ScanDirection dir, bool *continuescan)
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
 {
 	ScanKey		subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
 	int32		cmpresult = 0;
@@ -1614,6 +1635,19 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
 
 		Assert(subkey->sk_flags & SK_ROW_MEMBER);
 
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (subkey->sk_attno > tupnatts)
+		{
+			Assert(ScanDirectionIsForward(dir));
+			cmpresult = 0;
+			continue;
+		}
+
 		datum = index_getattr(tuple,
 							  subkey->sk_attno,
 							  tupdesc,
-- 
2.17.1

In reply to: Peter Geoghegan (#43)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Mon, Dec 3, 2018 at 7:10 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v9, which does things that way. There are no interesting
changes, though I have set things up so that a later patch in the
series can add "dynamic prefix truncation" -- I do not include any
such patch in v9, though. I'm going to start a new thread on that
topic, and include the patch there, since it's largely unrelated to
this work, and in any case still isn't in scope for Postgres 12 (the
patch is still experimental, for reasons that are of general
interest).

The dynamic prefix truncation thread that I started:

/messages/by-id/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
--
Peter Geoghegan

#45 Heikki Linnakangas <hlinnaka@iki.fi>
In reply to: Peter Geoghegan (#43)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 04/12/2018 05:10, Peter Geoghegan wrote:

Attached is v9, ...

I spent some time reviewing this. I skipped the first patch, to add a
column to pg_depend, and I got through patches 2, 3 and 4. Impressive
results, and the code looks sane.

I wrote a laundry list of little comments on minor things, suggested
rewordings of comments etc. I hope they're useful, but feel free to
ignore/override my opinions of any of those, as you see best.

But first, a few slightly bigger (medium-sized?) issues that caught my eye:

1. How about doing the BTScanInsertData refactoring as a separate
commit, first? It seems like a good thing for readability on its own,
and would slim the big main patch. (And make sure to credit Andrey for
that idea in the commit message.)

2. In the "Treat heap TID as part of the nbtree key space" patch:

*		Build an insertion scan key that contains comparison data from itup
*		as well as comparator routines appropriate to the key datatypes.
*
+ *		When itup is a non-pivot tuple, the returned insertion scan key is
+ *		suitable for finding a place for it to go on the leaf level.  When
+ *		itup is a pivot tuple, the returned insertion scankey is suitable
+ *		for locating the leaf page with the pivot as its high key (there
+ *		must have been one like it at some point if the pivot tuple
+ *		actually came from the tree).
+ *
+ *		Note that we may occasionally have to share lock the metapage, in
+ *		order to determine whether or not the keys in the index are expected
+ *		to be unique (i.e. whether or not heap TID is treated as a tie-breaker
+ *		attribute).  Callers that cannot tolerate this can request that we
+ *		assume that this is a heapkeyspace index.
+ *
*		The result is intended for use with _bt_compare().
*/
-ScanKey
-_bt_mkscankey(Relation rel, IndexTuple itup)
+BTScanInsert
+_bt_mkscankey(Relation rel, IndexTuple itup, bool assumeheapkeyspace)

This 'assumeheapkeyspace' flag feels awkward. What if the caller knows
that it is a v3 index? There's no way to tell _bt_mkscankey() that.
(There's no need for that, currently, but seems a bit weird.)

_bt_split() calls _bt_truncate(), which calls _bt_leave_natts(), which
calls _bt_mkscankey(). It's holding a lock on the page being split. Do
we risk deadlock by locking the metapage at the same time?

I don't have any great ideas on what to do about this, but it's awkward
as it is. Can we get away without the new argument? Could we somehow
arrange things so that rd_amcache would be guaranteed to already be set?

3. In the "Pick nbtree split points discerningly" patch

I find the different modes and the logic in _bt_findsplitloc() very hard
to understand. I've spent a while looking at it now, and I think I have
a vague understanding of what things it takes into consideration, but I
don't understand why it performs those multiple stages, what each stage
does, and how that leads to an overall strategy. I think a rewrite would
be in order, to make that more understandable. I'm not sure what exactly
it should look like, though.

If _bt_findsplitloc() has to fall back to the MANY_DUPLICATES or
SINGLE_VALUE modes, it has to redo a lot of the work that was done in
the DEFAULT mode already. That's probably not a big deal in practice,
performance-wise, but I feel that it's another hint that some
refactoring would be in order.

One idea on how to restructure that:

Make a single pass over all the offset numbers, considering a split at
that location. Like the current code does. For each offset, calculate a
"penalty" based on two factors:

* free space on each side
* the number of attributes in the pivot tuple, and whether it needs to
store the heap TID

Define the penalty function so that having to add a heap TID to the
pivot tuple is considered very expensive, more expensive than anything
else, and truncating away other attributes gives a reward of some size.

However, naively computing the penalty upfront for every offset would be
a bit wasteful. Instead, start from the middle of the page, and walk
"outwards" towards both ends, until you find a "good enough" penalty.

Or something like that...
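
In pseudo-C, the shape I have in mind is roughly this (all the names here
are made up for illustration; this is not meant to be the patch's actual
_bt_findsplitloc()):

typedef struct SplitPenaltyInput
{
	int			leftfree;		/* free space on left half after split here */
	int			rightfree;		/* free space on right half after split here */
	int			pivotnatts;		/* attrs the new high key would have to keep */
	bool		pivotneedstid;	/* would the new high key need a heap TID? */
} SplitPenaltyInput;

/* lower penalty is better */
static int
split_penalty(SplitPenaltyInput *cand)
{
	/* base cost: imbalance of free space between the two halves */
	int			penalty = abs(cand->leftfree - cand->rightfree);

	if (cand->pivotneedstid)
		penalty += 1000000;		/* appending a heap TID: worst possible case */
	else
		penalty -= 1000 * (INDEX_MAX_KEYS - cand->pivotnatts); /* reward truncation */

	return penalty;
}

You'd then just keep the offset with the lowest penalty seen so far, and
stop walking outwards as soon as the penalty is "good enough".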

Now, the laundry list of smaller items:

----- laundry list begins -----

1st commits commit message:

Make nbtree treat all index tuples as having a heap TID trailing key
attribute. Heap TID becomes a first class part of the key space on all
levels of the tree. Index searches can distinguish duplicates by heap
TID, at least in principle.

What do you mean by "at least in principle"?

Secondary index insertions will descend
straight to the leaf page that they'll insert on to (unless there is a
concurrent page split).

What is a "Secondary" index insertion?

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added. This will generally truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes. This can increase fan-out,
especially when there are several attributes in an index.

Suggestion: "when there are several attributes in an index" -> "in a
multi-column index"

+/*
+ * Convenience macro to get number of key attributes in tuple in low-context
+ * fashion
+ */
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
+

What is "low-context fashion"?

+ * scankeys is an array of scan key entries for attributes that are compared
+ * before scantid (user-visible attributes).  Every attribute should have an
+ * entry during insertion, though not necessarily when a regular index scan
+ * uses an insertion scankey to find an initial leaf page.

Suggestion: Reword to something like "During insertion, there must be a
scan key for every attribute, but when starting a regular index scan,
some can be omitted."

+typedef struct BTScanInsertData
+{
+	/*
+	 * Mutable state used by _bt_binsrch() to inexpensively repeat a binary
+	 * search on the leaf level when only scantid has changed.  Only used for
+	 * insertions where _bt_check_unique() is called.
+	 */
+	bool		savebinsrch;
+	bool		restorebinsrch;
+	OffsetNumber low;
+	OffsetNumber high;
+
+	/* State used to locate a position at the leaf level */
+	bool		heapkeyspace;
+	bool		nextkey;
+	ItemPointer scantid;		/* tiebreaker for scankeys */
+	int			keysz;			/* Size of scankeys */
+	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
+} BTScanInsertData;

It would feel more natural to me, to have the mutable state *after* the
other fields. Also, it'd feel less error-prone to have 'scantid' be
ItemPointerData, rather than a pointer to somewhere else. The
'heapkeyspace' name isn't very descriptive. I understand that it means
that the heap TID is part of the keyspace. Not sure what to suggest
instead, though.

+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute.  Logical duplicates are sorted in heap item
+pointer order.

Suggestion: "item pointer" -> TID, to use consistent terms.

We don't use btree keys to disambiguate downlinks from the
+internal pages during a page split, though: only one entry in the parent
+level will be pointing at the page we just split, so the link fields can be
+used to re-find downlinks in the parent via a linear search.  (This is
+actually a legacy of when heap TID was not treated as part of the keyspace,
+but it does no harm to keep things that way.)

I don't understand this paragraph.

+Lehman and Yao talk about pairs of "separator" keys and downlinks in
+internal pages rather than tuples or records.  We use the term "pivot"
+tuple to distinguish tuples which don't point to heap tuples, that are
+used only for tree navigation.  Pivot tuples include all tuples on
+non-leaf pages and high keys on leaf pages.

Suggestion: reword to "All tuples on non-leaf pages, and high keys on
leaf pages, are pivot tuples"

Note that pivot tuples are
+only used to represent which part of the key space belongs on each page,
+and can have attribute values copied from non-pivot tuples that were
+deleted and killed by VACUUM some time ago.  A pivot tuple may contain a
+"separator" key and downlink, just a separator key (in practice the
+downlink will be garbage), or just a downlink.

Rather than store garbage, set it to zeros?

+Lehman and Yao require that the key range for a subtree S is described by
+Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent page.
+A search where the scan key is equal to a pivot tuple in an upper tree
+level must descend to the left of that pivot to ensure it finds any equal
+keys.  Pivot tuples are always a _strict_ lower bound on items on their
+downlink page; the equal item(s) being searched for must therefore be to
+the left of that downlink page on the next level down.  (It's possible to
+arrange for internal page tuples to be strict lower bounds in all cases
+because their values come from leaf tuples, which are guaranteed unique by
+the use of heap TID as a tiebreaker.  We also make use of hard-coded
+negative infinity values in internal pages.  Rightmost pages don't have a
+high key, though they conceptually have a positive infinity high key).  A
+handy property of this design is that there is never any need to
+distinguish between equality in the case where all attributes/keys are used
+in a scan from equality where only some prefix is used.

"distringuish between ... from ..." doesn't sound like correct grammar.
Suggestion: "distinguish between ... and ...", or just "distinguish ...
from ...". Or rephrase the sentence some other way.

+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split when the remaining attributes distinguish the
+last index tuple on the post-split left page as belonging on the left page,
+and the first index tuple on the post-split right page as belonging on the
+right page.

That's a very long sentence.

* Since the truncated tuple is probably smaller than the
* original, it cannot just be copied in place (besides, we want
* to actually save space on the leaf page). We delete the
* original high key, and add our own truncated high key at the
* same offset. It's okay if the truncated tuple is slightly
* larger due to containing a heap TID value, since pivot tuples
* are treated as a special case by _bt_check_third_page().

By "treated as a special case", I assume that _bt_check_third_page()
always reserves some space for that? Maybe clarify that somehow.

_bt_truncate():

This is possible when there are
* attributes that follow an attribute in firstright that is not equal to the
* corresponding attribute in lastleft (equal according to insertion scan key
* semantics).

I can't comprehend that sentence. Simpler English, maybe add an example,
please.

/*
* _bt_leave_natts - how many key attributes to leave when truncating.
*
* Caller provides two tuples that enclose a split point. CREATE INDEX
* callers must pass build = true so that we may avoid metapage access. (This
* is okay because CREATE INDEX always creates an index on the latest btree
* version.)
*
* This can return a number of attributes that is one greater than the
* number of key attributes for the index relation. This indicates that the
* caller must use a heap TID as a unique-ifier in new pivot tuple.
*/
static int
_bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
bool build)

IMHO "keep" would sound better here than "leave".

+	if (needheaptidspace)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
+						itemsz, BTREE_VERSION, BTMaxItemSize(page),
+						RelationGetRelationName(rel)),
+				 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+						   ItemPointerGetBlockNumber(&newtup->t_tid),
+						   ItemPointerGetOffsetNumber(&newtup->t_tid),
+						   RelationGetRelationName(heap)),
+				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+						 "Consider a function index of an MD5 hash of the value, "
+						 "or use full text indexing."),
+				 errtableconstraint(heap,
+									RelationGetRelationName(rel))));
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("index row size %zu exceeds btree version 3 maximum %zu for index \"%s\"",
+						itemsz, BTMaxItemSizeNoHeapTid(page),
+						RelationGetRelationName(rel)),
+				 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+						   ItemPointerGetBlockNumber(&newtup->t_tid),
+						   ItemPointerGetOffsetNumber(&newtup->t_tid),
+						   RelationGetRelationName(heap)),
+				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+						 "Consider a function index of an MD5 hash of the value, "
+						 "or use full text indexing."),
+				 errtableconstraint(heap,
+									RelationGetRelationName(rel))));

Could restructure this to avoid having two almost identical strings to
translate.

#define BTREE_METAPAGE	0		/* first page is meta */
#define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
+#define BTREE_VERSION	4		/* current version number */
#define BTREE_MIN_VERSION	2	/* minimal supported version number */
+#define BTREE_META_VERSION	3	/* minimal version with all meta fields */

BTREE_META_VERSION is a strange name for version 3. I think this
deserves a more verbose comment, above these #defines, to list all the
versions and their differences.

v9-0003-Pick-nbtree-split-points-discerningly.patch commit message:

Add infrastructure to determine where the earliest difference appears
among a pair of tuples enclosing a candidate split point.

I don't understand this sentence.

_bt_findsplitloc() is also taught to care about the case where there are
many duplicates, making it hard to find a distinguishing split point.
_bt_findsplitloc() may even conclude that it isn't possible to avoid
filling a page entirely with duplicates, in which case it packs pages
full of duplicates very tightly.

Hmm. Is the assumption here that if a page is full of duplicates, there
will be no more insertions into that page? Why?

The number of cycles added is not very noticeable, which is important,
since _bt_findsplitloc() is run while an exclusive (leaf page) buffer
lock is held. We avoid using authoritative insertion scankey
comparisons, unlike suffix truncation proper.

What do you do instead, then? memcmp? (Reading the patch: yes.)
Suggestion: "We use a faster binary comparison, instead of proper
datatype-aware comparison, for speed".

Aside from performance, it would feel inappropriate to call user-defined
code while holding a buffer lock, anyway.
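
Something like this is the kind of thing I had in mind, just to illustrate
the idea (not claiming this is what _bt_leave_natts_fast() actually looks
like; datumIsEqual() does a plain binary comparison, no opclass code):

/*
 * Count how many leading key attributes of lastleft and firstright are
 * binary-equal, without invoking any user-defined comparison function.
 */
static int
first_difference_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright)
{
	TupleDesc	itupdesc = RelationGetDescr(rel);
	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
	int			attnum;

	for (attnum = 1; attnum <= nkeyatts; attnum++)
	{
		Form_pg_attribute att = TupleDescAttr(itupdesc, attnum - 1);
		Datum		datum1,
					datum2;
		bool		isnull1,
					isnull2;

		datum1 = index_getattr(lastleft, attnum, itupdesc, &isnull1);
		datum2 = index_getattr(firstright, attnum, itupdesc, &isnull2);

		if (isnull1 != isnull2)
			break;
		if (!isnull1 &&
			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
			break;
	}

	/* number of leading attributes that do NOT distinguish the tuples */
	return attnum - 1;
}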

+There is sophisticated criteria for choosing a leaf page split point.  The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split.  The
+choice of leaf split point can be thought of as a choice among points
+*between* items on the page to be split, at least if you pretend that the
+incoming tuple was placed on the page already, without provoking a split.

I'd leave out the ", without provoking a split" part. Or maybe reword to
"if you pretend that the incoming tuple fit and was placed on the page
already".

+Choosing the split point between two index tuples with differences that
+appear as early as possible results in truncating away as many suffix
+attributes as possible.

It took me a while to understand what the "appear as early as possible"
means here. It's talking about a multi-column index, and about finding a
difference in one of the leading key columns. Not, for example, about
finding a split point early in the page.

An array of acceptable candidate split points
+(points that balance free space on either side of the split sufficiently
+well) is assembled in a pass over the page to be split, sorted by delta.
+An optimal split point is chosen during a pass over the assembled array.
+There are often several split points that allow the maximum number of
+attributes to be truncated away -- we choose whichever one has the lowest
+free space delta.

Perhaps we should leave out these details in the README, and explain
this in the comments of the picksplit-function itself? In the README, I
think a more high-level description of what things are taken into
account when picking the split point, would be enough.

+Suffix truncation is primarily valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective.

Suggestion: reword to "... , but that isn't the only benefit" ?

There are cases where suffix truncation can
+leave a B-Tree significantly smaller in size than it would have otherwise
+been without actually making any pivot tuple smaller due to restrictions
+relating to alignment.

Suggestion: reword to "... smaller in size than it would otherwise be,
without ..."

and "without making any pivot tuple *physically* smaller, due to alignment".

This sentence is a bit of a cliffhanger: what are those cases, and how
is that possible?

The criteria for choosing a leaf page split point
+for suffix truncation is also predictive of future space utilization.

How so? What does this mean?

+Furthermore, even truncation that doesn't make pivot tuples smaller still
+prevents pivot tuples from being more restrictive than truly necessary in
+how they describe which values belong on which pages.

Ok, I guess these sentences resolve the cliffhanger I complained about.
But this still feels like magic. When you split a page, all of the
keyspace must belong on the left or the right page. Why does it make a
difference to space utilization, where exactly you split the key space?

+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page.  The split point that implies a downlink be inserted in
+the parent that's the smallest one available within an acceptable range of
+the fillfactor-wise optimal split point is chosen.  This idea also comes
+from the Prefix B-Tree paper.  This process has much in common with what
+happens at the leaf level to make suffix truncation effective.  The overall
+effect is that suffix truncation tends to produce smaller and less
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, less discriminating
+pivot tuples end up in the root page, delaying root page splits.

Ok, so this explains it further, I guess. I find this paragraph
difficult to understand, though. The important thing here is the idea
that some split points are more "discriminating" than others, but I
think it needs some further explanation. What makes a split point more
discriminating? Maybe add an example.

+Suffix truncation may make a pivot tuple *larger* than the non-pivot/leaf
+tuple that it's based on (the first item on the right page), since a heap
+TID must be appended when nothing else distinguishes each side of a leaf
+split.  Truncation cannot simply reuse the leaf level representation: we
+must append an additional attribute, rather than incorrectly leaving a heap
+TID in the generic IndexTuple item pointer field.  (The field is already
+used by pivot tuples to store their downlink, plus some additional
+metadata.)

That's not really the fault of suffix truncation as such, but the
process of turning a leaf tuple into a pivot tuple. It would happen even
if you didn't truncate anything.

I think this point, that we have to store the heap TID differently in
pivot tuples, would deserve a comment somewhere else, too. While reading
the patch, I didn't realize that that's what we're doing, until I read
this part of the README, even though I saw the new code to deal with
heap TIDs elsewhere in the code. Not sure where, maybe in
BTreeTupleGetHeapTID().
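
For the archives, my reading of the new representation (please correct me
if I've misread the macros):

/*
 * non-pivot (leaf) tuple:  t_tid points straight at a heap tuple, as today.
 *
 * pivot tuple:             t_tid's block number stores the downlink (if any),
 *                          and its offset stores the key attribute count plus
 *                          status bits.  When BT_HEAP_TID_ATTR is set, a full
 *                          ItemPointerData is appended at the very end of the
 *                          tuple, and that is what BTreeTupleGetHeapTID()
 *                          returns; for a pivot without one it returns NULL.
 */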

+Adding a heap TID attribute during a leaf page split should only occur when
+the page to be split is entirely full of duplicates (the new item must also
+be a duplicate).  The logic for selecting a split point goes to great
+lengths to avoid heap TIDs in pivots --- "many duplicates" mode almost
+always manages to pick a split point between two user-key-distinct tuples,
+accepting a completely lopsided split if it must.

This is the first mention of "many duplicates" mode. Maybe just say
"_bt_findsplitloc() almost always ..." or "The logic for selecting a
split point goes to great lengths to avoid heap TIDs in pivots, and
almost always manages to pick a split point between two
user-key-distinct tuples, accepting a completely lopsided split if it must."

Once appending a heap
+TID to a split's pivot becomes completely unavoidable, there is a fallback
+strategy --- "single value" mode is used, which makes page splits pack the
+new left half full by using a high fillfactor.  Single value mode leads to
+better overall space utilization when a large number of duplicates are the
+norm, and thereby also limits the total number of pivot tuples with an
+untruncated heap TID attribute.

This assumes that tuples are inserted in increasing TID order, right?
Seems like a valid assumption, no complaints there, but it's an
assumption nevertheless.

I'm not sure if this level of detail is worthwhile in the README. This
logic on deciding the split point is all within the _bt_findsplitloc()
function, so maybe put this explanation there. In the README, a more
high-level explanation of what things _bt_findsplitloc() considers,
should be enough.

_bt_findsplitloc(), and all its helper structs and subroutines, are
about 1000 lines of code now, and a big part of nbtinsert.c. Perhaps it
would be a good idea to move it to a whole new nbtsplitloc.c file? It's
a very isolated piece of code.

In the comment on _bt_leave_natts_fast():

+ * Testing has shown that an approach involving treating the tuple as a
+ * decomposed binary string would work almost as well as the approach taken
+ * here.  It would also be faster.  It might actually be necessary to go that
+ * way in the future, if suffix truncation is made sophisticated enough to
+ * truncate at a finer granularity (i.e. truncate within an attribute, rather
+ * than just truncating away whole attributes).  The current approach isn't
+ * markedly slower, since it works particularly well with the "perfect
+ * penalty" optimization (there are fewer, more expensive calls here).  It
+ * also works with INCLUDE indexes (indexes with non-key attributes) without
+ * any special effort.

That's an interesting tidbit, but I'd suggest just removing this comment
altogether. It's not really helping to understand the current
implementation.

v9-0005-Add-high-key-continuescan-optimization.patch commit message:

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.

.. but we're missing the other optimizations that make it more
effective, so it probably won't do much for v3 indexes. Does it make
them slower? It's probably acceptable, even if there's a tiny
regression, but I'm curious.

----- laundry list ends -----

- Heikki

In reply to: Heikki Linnakangas (#45)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Fri, Dec 28, 2018 at 10:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I spent some time reviewing this. I skipped the first patch, to add a
column to pg_depend, and I got through patches 2, 3 and 4. Impressive
results, and the code looks sane.

Thanks! I really appreciate your taking the time to do such a thorough review.

You were right to skip the first patch, because there is a fair chance
that it won't be used in the end. Tom is looking into the pg_depend
problem that I paper over with the first patch.

I wrote a laundry list of little comments on minor things, suggested
rewordings of comments etc. I hope they're useful, but feel free to
ignore/override my opinions of any of those, as you see best.

I think that that feedback is also useful, and I'll end up using 95%+
of it. Much of the information I'm trying to get across is very
subtle.

But first, a few slightly bigger (medium-sized?) issues that caught my eye:

1. How about doing the BTScanInsertData refactoring as a separate
commit, first? It seems like a good thing for readability on its own,
and would slim the big main patch. (And make sure to credit Andrey for
that idea in the commit message.)

Good idea. I'll do that.

This 'assumeheapkeyspace' flag feels awkward. What if the caller knows
that it is a v3 index? There's no way to tell _bt_mkscankey() that.
(There's no need for that, currently, but seems a bit weird.)

This is there for CREATE INDEX -- we cannot access the metapage during
an index build. We'll only be able to create new v4 indexes with the
patch applied, so we can assume that heap TID is part of the key space
safely.

_bt_split() calls _bt_truncate(), which calls _bt_leave_natts(), which
calls _bt_mkscankey(). It's holding a lock on the page being split. Do
we risk deadlock by locking the metapage at the same time?

I already had vague concerns along the same lines. I am also concerned
about index_getprocinfo() calls that happen in the same code path,
with a buffer lock held. (SP-GiST's doPickSplit() function can be
considered a kind of precedent that makes the second issue okay, I
suppose.)

See also: My later remarks on the use of "authoritative comparisons"
from this same e-mail.

I don't have any great ideas on what to do about this, but it's awkward
as it is. Can we get away without the new argument? Could we somehow
arrange things so that rd_amcache would be guaranteed to already be set?

These are probably safe in practice, but the way that we rely on them
being safe from a distance is a concern. Let me get back to you on
this.

3. In the "Pick nbtree split points discerningly" patch

I find the different modes and the logic in _bt_findsplitloc() very hard
to understand. I've spent a while looking at it now, and I think I have
a vague understanding of what things it takes into consideration, but I
don't understand why it performs those multiple stages, what each stage
does, and how that leads to an overall strategy. I think a rewrite would
be in order, to make that more understandable. I'm not sure what exactly
it should look like, though.

I've already refactored that a little bit for the upcoming v10. The
way _bt_findsplitloc() state is initially set up becomes slightly more
streamlined. It still works in the same way, though, so you'll
probably only think that the new version is a minor improvement.
(Actually, v10 focuses on making _bt_splitatnewitem() a bit less
magical, at least right now.)

If _bt_findsplitloc() has to fall back to the MANY_DUPLICATES or
SINGLE_VALUE modes, it has to redo a lot of the work that was done in
the DEFAULT mode already. That's probably not a big deal in practice,
performance-wise, but I feel that it's another hint that some
refactoring would be in order.

The logic within _bt_findsplitloc() has been very hard to refactor all
along. You're right that there is a fair amount of redundant-ish work
that the alternative modes (MANY_DUPLICATES + SINGLE_VALUE) perform.
The idea is to not burden the common DEFAULT case, and to keep the
control flow relatively simple.

I'm sure that if I was in your position I'd say something similar. It
is complicated in subtle ways, that looks like they might not matter,
but actually do. I am working off a fair variety of test cases, which
really came in handy. I remember thinking that I'd simplified it a
couple of times back in August or September, only to realize that I'd
regressed a case that I cared about. I eventually realized that I
needed to come up with a comprehensive though relatively fast test
suite, which seems essential for refactoring _bt_findsplitloc(), and
maybe even for fully understanding how _bt_findsplitloc() works.

Another complicating factor is that I have to worry about the number
of cycles used under a buffer lock (not just the impact on space
utilization).

With all of that said, I am willing to give it another try. You've
seen opportunities to refactor that I missed before now. More than
once.

One idea on how to restructure that:

Make a single pass over all the offset numbers, considering a split at
that location. Like the current code does. For each offset, calculate a
"penalty" based on two factors:

* free space on each side
* the number of attributes in the pivot tuple, and whether it needs to
store the heap TID

Define the penalty function so that having to add a heap TID to the
pivot tuple is considered very expensive, more expensive than anything
else, and truncating away other attributes gives a reward of some size.

As you go on to say, accessing the tuple to calculate a penalty like
this is expensive, and shouldn't be done exhaustively if at all
possible. We're only access item pointer information (that is, lp_len)
in the master branch's _bt_findsplitloc(), and that's all we do within
the patch until the point where we have a (usually quite small) array
of candidate split points, sorted by delta.

Doing a pass over the page to assemble an array of candidate splits,
and then doing a pass over the sorted array of splits with
tolerably-low left/right space deltas works pretty well. "Mixing" the
penalties together up front like that is something I considered, and
decided not to pursue -- it obscures relatively uncommon though
sometimes important large differences, that a single DEFAULT mode
style pass would probably miss. MANY_DUPLICATES mode is totally
exhaustive, because it's worth being totally exhaustive in the extreme
case where there are only a few distinct values, and it's still
possible to avoid a large grouping of values that spans more than one
page. But it's not worth being exhaustive like that most of the time.
That's the useful thing about having 2 alternative modes, that we
"escalate" to if and only if it seems necessary to. MANY_DUPLICATES
can be expensive, because no workload is likely to consistently use
it. Most will almost always use DEFAULT, some will use SINGLE_VALUE
quite a bit -- MANY_DUPLICATES is for when we're "in between" those
two, which seems unlikely to be the steady state.

Maybe we could just have MANY_DUPLICATES mode, and making SINGLE_VALUE
mode something that happens within a DEFAULT pass. It's probably not
worth it, though -- SINGLE_VALUE mode generally wants to split the
page in a way that makes the left page mostly full, and the right page
mostly empty. So eliminating SINGLE_VALUE mode would probably not
simplify the code.

However, naively computing the penalty upfront for every offset would be
a bit wasteful. Instead, start from the middle of the page, and walk
"outwards" towards both ends, until you find a "good enough" penalty.

You can't start at the middle of the page, though.

You have to start at the left (though you could probably start at the
right instead). This is because of page fragmentation -- it's not
correct to assume that the line pointer offset into tuple space on the
page (the firstright line pointer's lp_off for a candidate split point) tells
you anything about what the space delta will be after the split. You
have to exhaustively add up the free space before the line pointer
(the free space for all earlier line pointers) before seeing if the
line pointer works as a split point, since each previous line
pointer's tuple could be located anywhere in the original page's tuple
space (anywhere to the left or to the right of where it would be in
the simple/unfragmented case).
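
To make that concrete, the pass has to carry a running total as it walks
the line pointers, something like this fragment (simplified; here
consider_split() just stands in for the real space accounting, and page,
opaque and state are assumed to be set up already):

	OffsetNumber offnum;
	OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
	int			olddataitemstoleft = 0;

	for (offnum = P_FIRSTDATAKEY(opaque);
		 offnum <= maxoff;
		 offnum = OffsetNumberNext(offnum))
	{
		ItemId		itemid = PageGetItemId(page, offnum);
		Size		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);

		/*
		 * Only at this point do we know how much tuple space would end up on
		 * the left if we split in front of offnum -- lp_off tells us nothing,
		 * because tuple bodies can be in any physical order on the page.
		 */
		consider_split(state, offnum, olddataitemstoleft);

		olddataitemstoleft += itemsz;
	}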

1st commits commit message:

Make nbtree treat all index tuples as having a heap TID trailing key
attribute. Heap TID becomes a first class part of the key space on all
levels of the tree. Index searches can distinguish duplicates by heap
TID, at least in principle.

What do you mean by "at least in principle"?

I mean that we don't really do that currently, because we don't have
something like retail index tuple deletion. However, we do have, uh,
insertion, so I guess that this is just wrong. Will fix.

Secondary index insertions will descend
straight to the leaf page that they'll insert on to (unless there is a
concurrent page split).

What is a "Secondary" index insertion?

Secondary index is how I used to refer to a non-unique index, until I
realized that that was kind of wrong. (In fact, all indexes in
Postgres are secondary indexes, because we always use a heap, never a
clustered index.)

Will fix.

Suggestion: "when there are several attributes in an index" -> "in a
multi-column index"

I'll change it to say that.

+/*
+ * Convenience macro to get number of key attributes in tuple in low-context
+ * fashion
+ */
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+     Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
+

What is "low-context fashion"?

I mean that it works with non-pivot tuples in INCLUDE indexes without
special effort on the caller's part, while also fetching the number of
key attributes in any pivot tuple, where it might well be <
IndexRelationGetNumberOfKeyAttributes(). Maybe no comment is necessary
-- BTreeTupleGetNKeyAtts() is exactly what it sounds like to somebody
that already knows about BTreeTupleGetNAtts().

Suggestion: Reword to something like "During insertion, there must be a
scan key for every attribute, but when starting a regular index scan,
some can be omitted."

Will do.

It would feel more natural to me, to have the mutable state *after* the
other fields.

I fully agree, but I can't really change it. The struct
BTScanInsertData ends with a flexible array member, though it's sized
INDEX_MAX_KEYS because _bt_first() wants to allocate it on the stack
without special effort.

This was found to make a measurable difference with nested loop joins
-- I used to always allocate BTScanInsertData using palloc(), until I
found a regression. This nestloop join issue must be why commit
d961a568 removed an insertion scan key palloc() from _bt_first(), way
back in 2005. It seems like _bt_first() should remain free of
palloc()s, which it seems to actually manage to do, despite being so
hairy.

Also, it'd feel less error-prone to have 'scantid' be
ItemPointerData, rather than a pointer to somewhere else.

It's useful for me to be able to set it to NULL, though -- I'd need
another bool to represent the absence of a scantid if the field was
ItemPointerData (the absence could occur when _bt_mkscankey() is
passed a pivot tuple with its heap TID already truncated away, for
example). Besides, the raw scan keys themselves are very often
pointers to an attribute in some index tuple -- a tuple that the
caller needs to keep around for as long as the insertion scan key
needs to be used. Why not do the same thing with scantid? It is more
or less just another attribute, so it's really the same situation as
before.

The 'heapkeyspace' name isn't very descriptive. I understand that it means
that the heap TID is part of the keyspace. Not sure what to suggest
instead, though.

I already changed this once, based on a similar feeling. If you come
up with an even better name than "heapkeyspace", let me know. :-)

+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute.  Logical duplicates are sorted in heap item
+pointer order.

Suggestion: "item pointer" -> TID, to use consistent terms.

Will do.

We don't use btree keys to disambiguate downlinks from the
+internal pages during a page split, though: only one entry in the parent
+level will be pointing at the page we just split, so the link fields can be
+used to re-find downlinks in the parent via a linear search.  (This is
+actually a legacy of when heap TID was not treated as part of the keyspace,
+but it does no harm to keep things that way.)

I don't understand this paragraph.

I mean that we could now "go full Lehman and Yao" if we wanted to:
it's not necessary to even use the link field like this anymore. We
don't do that because of v3 indexes, but also because it doesn't
actually matter. The current way of re-finding downlinks would
probably even be better in a green field situation, in fact -- it's
just a bit harder to explain in a research paper.

Suggestion: reword to "All tuples on non-leaf pages, and high keys on
leaf pages, are pivot tuples"

Will do.

Note that pivot tuples are
+only used to represent which part of the key space belongs on each page,
+and can have attribute values copied from non-pivot tuples that were
+deleted and killed by VACUUM some time ago.  A pivot tuple may contain a
+"separator" key and downlink, just a separator key (in practice the
+downlink will be garbage), or just a downlink.

Rather than store garbage, set it to zeros?

There may be minor forensic value in keeping the item pointer block as
the heap block (but not the heap item pointer) within leaf high keys
(i.e. only changing it when it gets copied over for insertion into the
parent, and the block needs to point to the leaf child). I recall
discussing this with Alexander Korotkov shortly before the INCLUDE
patch went in. I'd rather keep it that way, rather than zeroing.

I could say "undefined" instead of "garbage", though. Not at all
attached to that wording.

"distringuish between ... from ..." doesn't sound like correct grammar.
Suggestion: "distinguish between ... and ...", or just "distinguish ...
from ...". Or rephrase the sentence some other way.

Yeah, I mangled the grammar. Which is kind of surprising, since I make
a very important point about why strict lower bounds are handy in that
sentence!

+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split when the remaining attributes distinguish the
+last index tuple on the post-split left page as belonging on the left page,
+and the first index tuple on the post-split right page as belonging on the
+right page.

That's a very long sentence.

Will restructure.

* Since the truncated tuple is probably smaller than the
* original, it cannot just be copied in place (besides, we want
* to actually save space on the leaf page). We delete the
* original high key, and add our own truncated high key at the
* same offset. It's okay if the truncated tuple is slightly
* larger due to containing a heap TID value, since pivot tuples
* are treated as a special case by _bt_check_third_page().

By "treated as a special case", I assume that _bt_check_third_page()
always reserves some space for that? Maybe clarify that somehow.

I'll just say that _bt_check_third_page() reserves space for it in the
next revision of the patch.

_bt_truncate():

This is possible when there are
* attributes that follow an attribute in firstright that is not equal to the
* corresponding attribute in lastleft (equal according to insertion scan key
* semantics).

I can't comprehend that sentence. Simpler English, maybe add an example,
please.

Okay.

static int
_bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
bool build)

IMHO "keep" would sound better here than "leave".

WFM.

Could restructure this to avoid having two almost identical strings to
translate.

I'll try.

#define BTREE_METAPAGE       0               /* first page is meta */
#define BTREE_MAGIC          0x053162        /* magic number of btree pages */
-#define BTREE_VERSION        3               /* current version number */
+#define BTREE_VERSION        4               /* current version number */
#define BTREE_MIN_VERSION    2       /* minimal supported version number */
+#define BTREE_META_VERSION   3       /* minimal version with all meta fields */

BTREE_META_VERSION is a strange name for version 3. I think this
deserves a more verbose comment, above these #defines, to list all the
versions and their differences.

Okay, but what would be better? I'm trying to convey that
BTREE_META_VERSION is the last version where upgrading was a simple
matter of changing the metapage, which can be performed on the fly.
The details of what were added to v3 (what nbtree stuff went into
Postgres 11) are not really interesting enough to have a descriptive
nbtree.h #define name. The metapage-only distinction is actually the
interesting distinction here (if I could do the upgrade on-the-fly,
there'd be no need for a v3 #define at all).
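
Maybe something along these lines, just to sketch the shape of it
(wording invented here, not final):

/*
 * Sketch of a version history comment (not final wording):
 *
 * Version 4: heap TID is a tiebreaker key attribute, so all entries are
 *            unique (this patch series).
 * Version 3: the immediate predecessor of version 4; it cannot be upgraded
 *            to version 4 on the fly, since the key space differs.
 * Version 2: differs from version 3 only in which metapage fields are
 *            present, so it can be upgraded on the fly.
 */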

v9-0003-Pick-nbtree-split-points-discerningly.patch commit message:

Add infrastructure to determine where the earliest difference appears
among a pair of tuples enclosing a candidate split point.

I don't understand this sentence.

A (candidate) split point is a point *between* two enclosing tuples on
the original page, provided you pretend that the new tuple that caused
the split is already on the original page. I probably don't need to be
(un)clear on that in the commit message, though. I think that I'll
probably end up committing 0002-* and 0003-* in one go anyway (though
not before doing the insertion scan key struct refactoring in a
separate commit, as you suggest).

_bt_findsplitloc() is also taught to care about the case where there are
many duplicates, making it hard to find a distinguishing split point.
_bt_findsplitloc() may even conclude that it isn't possible to avoid
filling a page entirely with duplicates, in which case it packs pages
full of duplicates very tightly.

Hmm. Is the assumption here that if a page is full of duplicates, there
will be no more insertions into that page? Why?

This is a really important point, that should probably have been in
your main feedback, rather than the laundry list. I was hoping you'd
comment on this more, in fact.

Imagine the extreme (and admittedly unrealistic) case first: We have a
page full of duplicates, all of which point to one heap page, and with
a gapless sequence of heap TID item pointers. It's literally
impossible to have another page split in this extreme case, because
VACUUM is guaranteed to kill the tuples in the leaf page before
anybody can insert next time (IOW, there has to be TID recycling
before an insertion into the leaf page is even possible).

Now, I've made the "fillfactor" 99, so I haven't actually assumed that
there will be *no* further insertions on the page. I'm almost assuming
that, but not quite. My thinking was that I should match the greedy
behavior that we already have to some degree, and continue to pack
leaf pages full of duplicates very tight. I am quite willing to
consider whether or not I'm still being too aggressive, all things
considered. If I made it 50:50, that would make indexes with
relatively few distinct values significantly larger than on master,
which would probably be deemed a regression. FWIW, I think that even
that regression in space utilization would be more than made up for in
other ways. The master branch _bt_findinsertloc() stuff is a disaster
with many duplicates for a bunch of reasons that are even more
important than the easy-to-measure bloat issue (FPIs, unnecessary
buffer lock contention... I could go on).

What value do you think works better than 99? 95? 90? I'm open minded
about this. I have my own ideas about why 99 works, but they're based
on intuitions that might fail to consider something important. The
current behavior with many duplicates is pretty awful, so we can at
least be sure that it isn't any worse than that.

What do you do instead, then? memcmp? (Reading the patch, yes.)
Suggestion: "We use a faster binary comparison, instead of proper
datatype-aware comparison, for speed".

WFM.

Aside from performance, it would feel inappropriate to call user-defined
code while holding a buffer lock, anyway.

But we do that all the time for this particular variety of user
defined code? I mean, we actually *have* to use the authoritative
comparisons at the last moment, once we actually make our mind up
about where to split -- nothing else is truly trustworthy. So, uh, we
actually do this "inappropriate" thing -- just not that much of it.
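
For reference, the fast path is roughly along these lines (a simplified
sketch, not the exact function from the patch):

/*
 * Sketch: return the number of attributes a new pivot tuple would have to
 * keep, by finding the first attribute of lastleft and firstright that is
 * not bitwise equal.  Uses datumIsEqual() rather than opclass comparators.
 */
static int
sketch_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
{
    TupleDesc   itupdesc = RelationGetDescr(rel);
    int         keysz = IndexRelationGetNumberOfKeyAttributes(rel);
    int         keepnatts = 1;
    int         attnum;

    for (attnum = 1; attnum <= keysz; attnum++)
    {
        Datum       datum1,
                    datum2;
        bool        isNull1,
                    isNull2;
        Form_pg_attribute att = TupleDescAttr(itupdesc, attnum - 1);

        datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
        datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);

        if (isNull1 != isNull2)
            break;
        if (!isNull1 &&
            !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
            break;
        keepnatts++;
    }

    return keepnatts;
}

The authoritative, opclass-aware version of the same loop only has to
run once, at the end, when we've actually settled on a split point.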

I'd leave out the ", without provoking a split" part. Or maybe reword to
"if you pretend that the incoming tuple fit and was placed on the page
already".

Okay.

It took me a while to understand what the "appear as early as possible"
means here. It's talking about a multi-column index, and about finding a
difference in one of the leading key columns. Not, for example, about
finding a split point early in the page.

This is probably a hold-over from when we didn't look at candidate
split point tuples an attribute at a time (months ago, it was
something pretty close to a raw memcmp()). Will fix.

Perhaps we should leave out these details in the README, and explain
this in the comments of the picksplit-function itself? In the README, I
think a more high-level description of what things are taken into
account when picking the split point, would be enough.

Agreed.

+Suffix truncation is primarily valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective.

Suggestion: reword to "... , but that isn't the only benefit" ?

WFM.

There are cases where suffix truncation can
+leave a B-Tree significantly smaller in size than it would have otherwise
+been without actually making any pivot tuple smaller due to restrictions
+relating to alignment.

Suggestion: reword to "... smaller in size than it would otherwise be,
without ..."

WFM.

and "without making any pivot tuple *physically* smaller, due to alignment".

WFM.

This sentence is a bit of a cliffhanger: what are those cases, and how
is that possible?

This is something you see with the TPC-C indexes, even without the new
split stuff. The TPC-C stock pk is about 45% smaller with that later
commit, but it's something like 6% or 7% smaller even without it (or
maybe it's the orderlines pk). And without ever managing to make a
pivot tuple physically smaller. This happens because truncating away
trailing attributes allows more stuff to go on the first right half of
a split. In more general terms: suffix truncation avoids committing
ourselves to rules about where values should go that are stricter than
truly necessary. On balance, this improves space utilization quite
noticeably, even without the special cases where really big
improvements are made.

If that still doesn't make sense, perhaps you should just try out the
TPC-C stuff without the new split patch, and see for yourself. The
easiest way to do that is to follow the procedure I describe here:

https://bitbucket.org/openscg/benchmarksql/issues/6/making-it-easier-to-recreate-postgres-tpc

(BenchmarkSQL is by far the best TPC-C implementation I've found that
works with Postgres, BTW. Yes, I also hate Java.)

Ok, I guess these sentences resolve the cliffhanger I complained about.
But this still feels like magic. When you split a page, all of the
keyspace must belong on the left or the right page. Why does it make a
difference to space utilization, where exactly you split the key space?

You have to think about the aggregate effect, rather than thinking
about a single split at a time. But, like I said, maybe the best thing
is to see the effect for yourself with TPC-C (while reverting the
split-at-new-item patch).

Ok, so this explains it further, I guess. I find this paragraph
difficult to understand, though. The important thing here is the idea
that some split points are more "discriminating" than others, but I
think it needs some further explanation. What makes a split point more
discriminating? Maybe add an example.

An understandable example seems really hard, even though the effect is
clear. Maybe I should just say *nothing* about the benefits when pivot
tuples don't actually shrink? I found it pretty interesting, and it
arguably makes the overall behavior easier to understand, but maybe
that isn't a good enough reason to keep the explanation.

This doesn't address your exact concern, but I think it might help:

Bayer's Prefix B-tree paper talks about the effect of being more
aggressive in finding a split point. You tend to be able to make an
index have more leaf pages but fewer internal pages as you get more
aggressive about split points. However, both internal pages and leaf
pages eventually become more numerous than they'd be with a reasonable
interval/level of aggression/discernment -- at some point, the saving
in internal page space no longer compensates for the extra downlinks
that the additional leaf pages require. Bayer ends up saying next to
nothing about how big the "split

BTW, somebody named Timothy L. Towns wrote the only analysis I've been
able to find on split interval for "simple prefix B-Trees" (suffix
truncation):

https://shareok.org/bitstream/handle/11244/16442/Thesis-1983-T747e.pdf?sequence=1

He is mostly talking about the classic case from Bayer's 77 paper,
where everything is a memcmp()-able string, which is probably what
some systems actually do. On the other hand, I care about attribute
granularity. Anyway, it's pretty clear that this Timothy L. Towns
fellow should have picked a better topic for his thesis, because he
fails to say anything practical about it. Unfortunately, a certain
amount of magic in this area is unavoidable.

+Suffix truncation may make a pivot tuple *larger* than the non-pivot/leaf
+tuple that it's based on (the first item on the right page), since a heap
+TID must be appended when nothing else distinguishes each side of a leaf
+split.  Truncation cannot simply reuse the leaf level representation: we
+must append an additional attribute, rather than incorrectly leaving a heap
+TID in the generic IndexTuple item pointer field.  (The field is already
+used by pivot tuples to store their downlink, plus some additional
+metadata.)

That's not really the fault of suffix truncation as such, but the
process of turning a leaf tuple into a pivot tuple. It would happen even
if you didn't truncate anything.

Fair. Will change.

I think this point, that we have to store the heap TID differently in
pivot tuples, would deserve a comment somewhere else, too. While reading
the patch, I didn't realize that that's what we're doing, until I read
this part of the README, even though I saw the new code to deal with
heap TIDs elsewhere in the code. Not sure where, maybe in
BTreeTupleGetHeapTID().

Okay.

This is the first mention of "many duplicates" mode. Maybe just say
"_bt_findsplitloc() almost always ..." or "The logic for selecting a
split point goes to great lengths to avoid heap TIDs in pivots, and
almost always manages to pick a split point between two
user-key-distinct tuples, accepting a completely lopsided split if it must."

Sure.

Once appending a heap
+TID to a split's pivot becomes completely unavoidable, there is a fallback
+strategy --- "single value" mode is used, which makes page splits pack the
+new left half full by using a high fillfactor.  Single value mode leads to
+better overall space utilization when a large number of duplicates are the
+norm, and thereby also limits the total number of pivot tuples with an
+untruncated heap TID attribute.

This assumes that tuples are inserted in increasing TID order, right?
Seems like a valid assumption, no complaints there, but it's an
assumption nevertheless.

I can be explicit about that. See also: my remarks above about
"fillfactor" with single value mode.

I'm not sure if this level of detail is worthwhile in the README. This
logic on deciding the split point is all within the _bt_findsplitloc()
function, so maybe put this explanation there. In the README, a more
high-level explanation of what things _bt_findsplitloc() considers,
should be enough.

Okay.

_bt_findsplitloc(), and all its helper structs and subroutines, are
about 1000 lines of code now, and big part of nbtinsert.c. Perhaps it
would be a good idea to move it to a whole new nbtsplitloc.c file? It's
a very isolated piece of code.

Good idea. I'll give that a go.

In the comment on _bt_leave_natts_fast():

That's an interesting tidbit, but I'd suggest just removing this comment
altogether. It's not really helping to understand the current
implementation.

Will do.

v9-0005-Add-high-key-continuescan-optimization.patch commit message:

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.

.. but we're missing the other optimizations that make it more
effective, so it probably won't do much for v3 indexes. Does it make
them slower? It's probably acceptable, even if there's a tiny
regression, but I'm curious.

But v3 indexes get the same _bt_findsplitloc() treatment as v4 indexes
-- the new-item-split stuff works almost as well for v3 indexes, and
the other _bt_findsplitloc() stuff doesn't seem to make much
difference. I'm not sure if that's the right thing to do (probably
doesn't matter very much). Now, to answer your question about v3
indexes + the continuescan optimization: I think that it probably will
help a bit, with or without the _bt_findsplitloc() changes. Much
harder to be sure whether it's worth it on balance, since that's
workload dependent. My sense is that it's a much smaller benefit much
of the time, but the cost is still pretty low. So why not just make it
version-generic, and keep things relatively uncluttered?

Once again, I greatly appreciate your excellent review!
--
Peter Geoghegan

#47 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#46)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 29/12/2018 01:04, Peter Geoghegan wrote:

However, naively computing the penalty upfront for every offset would be
a bit wasteful. Instead, start from the middle of the page, and walk
"outwards" towards both ends, until you find a "good enough" penalty.

You can't start at the middle of the page, though.

You have to start at the left (though you could probably start at the
right instead). This is because of page fragmentation -- it's not
correct to assume that the line pointer offset into tuple space on the
page (firstright line pointer lp_off for a candidate split point) tells
you anything about what the space delta will be after the split. You
have to exhaustively add up the free space before the line pointer
(the free space for all earlier line pointers) before seeing if the
line pointer works as a split point, since each previous line
pointer's tuple could be located anywhere in the original page's tuple
space (anywhere to the left or to the right of where it would be in
the simple/unfragmented case).

Right. You'll need to do the free space computations from left to right,
but once you have done that, you can compute the penalties in any order.

I'm envisioning that you have an array, with one element for each item
on the page (including the tuple we're inserting, which isn't really on
the page yet). In the first pass, you count up from left to right,
filling the array. Next, you compute the complete penalties, starting
from the middle, walking outwards.

That's not so different from what you're doing now, but I find it more
natural to explain the algorithm that way.
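
In very rough outline, something like this (names invented just to
illustrate; split_penalty() is a placeholder for however you cost a
candidate):

/* One element per candidate split point, new tuple included (sketch). */
typedef struct SplitCandidate
{
    OffsetNumber firstright;    /* first item on the post-split right page */
    int         leftfree;       /* left page free space, filled in pass 1 */
    int         rightfree;      /* right page free space, filled in pass 1 */
} SplitCandidate;

/*
 * Pass 2 (sketch): walk outwards from the middle element, returning the
 * first candidate whose penalty is good enough.  split_penalty() is a
 * placeholder; a real implementation would also weigh leftfree/rightfree.
 */
static int
sketch_pick_split(SplitCandidate *cand, int ncand, int goodenough)
{
    int         middle = ncand / 2;
    int         step;

    for (step = 0; step < ncand; step++)
    {
        int         lo = middle - step;
        int         hi = middle + step;

        if (lo >= 0 && split_penalty(&cand[lo]) <= goodenough)
            return lo;
        if (step > 0 && hi < ncand && split_penalty(&cand[hi]) <= goodenough)
            return hi;
    }

    return middle;              /* nothing was good enough; use the middle */
}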

- Heikki

In reply to: Heikki Linnakangas (#47)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Right. You'll need to do the free space computations from left to right,
but once you have done that, you can compute the penalties in any order.

I'm envisioning that you have an array, with one element for each item
on the page (including the tuple we're inserting, which isn't really on
the page yet). In the first pass, you count up from left to right,
filling the array. Next, you compute the complete penalties, starting
from the middle, walking outwards.

That's not so different from what you're doing now, but I find it more
natural to explain the algorithm that way.

Ah, right. I think I see what you mean now.

I like that this data structure explicitly has a place for the new
item, so you really do "pretend it's already on the page". Maybe
that's what you liked about it as well.

I'm a little concerned about the cost of maintaining the data
structure. This sounds workable, but we probably don't want to
allocate a buffer most of the time, or even hold on to the information
most of the time. The current design throws away potentially useful
information that it may later have to recreate, but even that has the
benefit of having little storage overhead in the common case.

Leave it with me. I'll need to think about this some more.

--
Peter Geoghegan

#49 Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Peter Geoghegan (#16)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Hi!

I'm starting to look at this patchset. Not ready to post a detailed
review, but I have a couple of questions.

On Wed, Sep 19, 2018 at 9:24 PM Peter Geoghegan <pg@bowt.ie> wrote:

I still haven't managed to add pg_upgrade support, but that's my next
step. I am more or less happy with the substance of the patch in v5,
and feel that I can now work backwards towards figuring out the best
way to deal with on-disk compatibility. It shouldn't be too hard --
most of the effort will involve coming up with a good test suite.

Yes, it shouldn't be too hard, but it seems like we have to keep two
branches of code for different handling of duplicates. Is that true?

+ * In the worst case (when a heap TID is appended) the size of the returned
+ * tuple is the size of the first right tuple plus an additional MAXALIGN()
+ * quantum.  This guarantee is important, since callers need to stay under
+ * the 1/3 of a page restriction on tuple size.  If this routine is ever
+ * taught to truncate within an attribute/datum, it will need to avoid
+ * returning an enlarged tuple to caller when truncation + TOAST compression
+ * ends up enlarging the final datum.

I didn't get the point of this paragraph. Might it happen that the
first right tuple is under the tuple size restriction, but the new
pivot tuple is beyond that restriction? If so, would we get an error
because of a too-long pivot tuple? If not, I think this needs to be
explained better.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In reply to: Alexander Korotkov (#49)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Hi Alexander,

On Fri, Jan 4, 2019 at 7:40 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

I'm starting to look at this patchset. Not ready to post a detailed
review, but I have a couple of questions.

Thanks for taking a look!

Yes, it shouldn't be too hard, but it seems like we have to keep two
branches of code for different handling of duplicates. Is that true?

Not really. If you take a look at v9, you'll see the approach I've
taken is to make insertion scan keys aware of which rules apply (the
"heapkeyspace" field controls this). I think that there are
about 5 "if" statements for that outside of amcheck. It's pretty
manageable.

I like to imagine that the existing code already has unique keys, but
nobody ever gets to look at the final attribute. It works that way
most of the time -- the only exception is insertion with user keys
that aren't unique already. Note that the way we move left on equal
pivot tuples, rather than right (rather than following the pivot's
downlink) wasn't invented by Postgres to deal with the lack of unique
keys. That's actually a part of the Lehman and Yao design itself.
Almost all of the special cases are optimizations rather than truly
necessary infrastructure.

I didn't get the point of this paragraph. Might it happen that the
first right tuple is under the tuple size restriction, but the new
pivot tuple is beyond that restriction? If so, would we get an error
because of a too-long pivot tuple? If not, I think this needs to be
explained better.

The v9 version of the function _bt_check_third_page() shows what it
means (comments on this will be improved in v10, too). The old limit
of 2712 bytes still applies to pivot tuples, while a new, lower limit
of 2704 bytes applies to non-pivot tuples. This difference is
necessary because an extra MAXALIGN() quantum could be needed to add a
heap TID to a pivot tuple during truncation in the worst case. To
users, the limit is 2704 bytes, because that's the limit that actually
needs to be enforced during insertion.

We never actually say "1/3 of a page means 2704 bytes" in the docs,
since the definition was always a bit fuzzy. There will need to be a
compatibility note in the release notes, though.
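
Spelled out, with the default 8KB block size and 64-bit MAXALIGN()
(macro names here are placeholders, not necessarily what the patch
uses):

/* Sketch of the two limits (BLCKSZ = 8192, MAXALIGN quantum = 8) */
#define SKETCH_MAX_PIVOT_SIZE      2712 /* old "1/3 of a page" limit */
#define SKETCH_MAX_NONPIVOT_SIZE \
    (SKETCH_MAX_PIVOT_SIZE - MAXALIGN(sizeof(ItemPointerData)))     /* 2704 */

The lower limit is what gets enforced against incoming leaf tuples, so
that truncation can always afford to append a heap TID to a new pivot
without ever exceeding the old 2712 byte limit.
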
--
Peter Geoghegan

In reply to: Peter Geoghegan (#48)
7 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Fri, Dec 28, 2018 at 3:32 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm envisioning that you have an array, with one element for each item
on the page (including the tuple we're inserting, which isn't really on
the page yet). In the first pass, you count up from left to right,
filling the array. Next, you compute the complete penalties, starting
from the middle, walking outwards.

Ah, right. I think I see what you mean now.

Leave it with me. I'll need to think about this some more.

Attached is v10 of the patch series, which has many changes based on
your feedback. However, I didn't end up refactoring _bt_findsplitloc()
in the way you described, because it seemed hard to balance all of the
concerns there. I still have an open mind on this question, and
recognize the merit in what you suggested. Perhaps it's possible to
reach a compromise here.

I did refactor the _bt_findsplitloc() stuff to make the division of
work clearer, though -- I think that you'll find that to be a clear
improvement, even though it's less than what you asked for. I also
moved all of the _bt_findsplitloc() stuff (old and new) into its own
.c file, nbtsplitloc.c, as you suggested.

Other significant changes
=========================

* Creates a new commit that changes routines like _bt_search() and
_bt_binsrch() to use a dedicated insertion scankey struct, per request
from Heikki.

* As I mentioned in passing, many other small changes to comments, the
nbtree README, and the commit messages based on your (Heikki's) first
round of review.

* v10 generalizes the previous _bt_lowest_scantid() logic for adding a
tie-breaker on equal pivot tuples during a descent of a B-Tree.

The new code works with any truncated attribute, not just a truncated
heap TID (I removed _bt_lowest_scantid() entirely). This also allowed
me to remove a couple of places that previously opted in to
_bt_lowest_scantid(), since the new approach can work without anybody
explicitly opting in. As a bonus, the new approach makes the patch
faster, since remaining queries where we unnecessarily follow an
equal-though-truncated downlink are fixed (it's usually only the heap
TID that's truncated when we can do this, but not always).

The idea behind this new generalized approach is to recognize that
minus infinity is an artificial/sentinel value that doesn't appear in
real keys (it only appears in pivot tuples). The majority of callers
(all callers aside from VACUUM's leaf page deletion code) can
therefore go to the right of a pivot that has all-equal attributes, if
and only if:

1. The pivot has at least one truncated/minus infinity attribute *and*

2. The number of attributes matches the scankey.

In other words, we tweak the comparison logic to add a new
tie-breaker. There is no change to the on-disk structures compared to
v9 of the patch -- I've only made index scans able to take advantage
of minus infinity values in *all* cases.

If this explanation is confusing to somebody less experienced with
nbtree than Heikki: consider the way we descend *between* the values
on internal pages, rather than expecting exact matches. _bt_binsrch()
behaves slightly differently when doing a binary search on an internal
page already: equality actually means "go left" when descending the
tree (though it doesn't work like that on leaf pages, where insertion
scankeys almost always search for a >= match). We want to "go right"
instead in cases where it's clear that tuples of interest to our scan
can only be in that child page (we're rarely searching for a minus
infinity value, since that doesn't appear in real tuples). (Note that
this optimization has nothing to do with "moving right" to recover
from concurrent page splits -- we would have relied on code like
_bt_findinsertloc() and _bt_readpage() to move right once we reach the
leaf level when we didn't have this optimization, but that code isn't
concerned with recovering from concurrent page splits.)
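
Put as pseudo-C, the new tie-breaker amounts to something like this (a
sketch of the rule only, with invented names -- not the actual
_bt_compare() change):

/*
 * Sketch only (invented names).  Every untruncated attribute of the pivot
 * tuple has already compared equal to the insertion scankey.  natts is the
 * number of untruncated attributes in the pivot, indnkeyatts the number of
 * key columns in the index.  VACUUM's leaf page deletion caller passes
 * leafdeletion = true, and keeps the old "equal means go left" behavior.
 */
static int32
sketch_truncated_tiebreak(int scankeysz, int natts, int indnkeyatts,
                          bool leafdeletion)
{
    if (natts == scankeysz && natts < indnkeyatts && !leafdeletion)
        return 1;               /* treat scankey as greater: go right */

    return 0;                   /* genuinely equal: go left, as before */
}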

Minor changes
=============

* Addresses Heikki's concerns about locking the metapage more
frequently in a general way. Comments are added to nbtpage.c, and
updated in a number of places that already talk about the same risk.

The master branch seems to be doing much the same thing in similar
situations already (e.g. during a root page split, when we need to
finish an interrupted page split but don't have a usable
parent/ancestor page stack). Importantly, the patch does not change
the dependency graph.

* Small changes to user docs where existing descriptions of things
seem to be made inaccurate by the patch.

Benchmarking
============

I have also recently been doing a lot of automated benchmarking. Here
are results of a BenchmarkSQL benchmark (plus various instrumentation)
as a bz2 archive:

https://drive.google.com/file/d/1RVJUzMtMNDi4USg0-Yo56LNcRItbFg1Q/view?usp=sharing

It completed on my home server last night, against v10 of the patch
series. Note that there were 4 runs for each case (master case +
public/patch case), with each run lasting 2 hours (so the benchmark
took over 8 hours once you include bulk loading time). There were 400
"warehouses" (this is similar to pgbench's scale factor), and 16
terminals/clients. This left the database 110GB+ in size on a server
with 32GB of memory and a fast consumer grade SSD. Autovacuum was
tuned to perform aggressive cleanup of bloat. All the settings used
are available in the bz2 archive (there are "settings" output files,
too).

Summary
-------

See the html "report" files for a quick visual indication of how the
tests progressed. BenchmarkSQL uses R to produce useful graphs, which
is quite convenient. (I have automated a lot of this with my own ugly
shellscript.)

We see a small but consistent increase in transaction throughput here,
as well as a small but consistent decrease in average latency for each
class of transaction. There is also a large and consistent decrease in
the on-disk size of indexes, especially if you just consider the
number of internal pages (diff the "balance" files to see what I
mean). Note that the performance is expected to degrade across runs,
since the database is populated once, at the start, and has more data
added over time; the important thing is that run n on master be
compared to run n on public/patch. Note also that I use my own fork of
BenchmarkSQL that does its CREATE INDEX before initial bulk loading,
not after [1]. It'll take longer to see problems on Postgres master if
the initial bulk load does CREATE INDEX after BenchmarkSQL workers
populate tables (we only need INSERTs to see significant index bloat).
Avoiding pristine indexes at the start of the benchmark makes the
problems on the master branch apparent sooner.

The benchmark results also include things like pg_statio* +
pg_stat_bgwriter view output (reset between test runs), which gives
some insight into what's going on. Checkpoints tend to write out a few
more dirty buffers with the patch, while there is a much larger drop
in the number of buffers written out by backends. There are probably
workloads where we'd see a much larger increase in transaction
throughput -- TPC-C happens to access index pages with significant
locality, and happens to be very write-heavy, especially compared to
the more modern (though less influential) TPC-E benchmark. Plus, the
TPC-C workload isn't at all helped by the fact that the patch will
never "get tired", even though that's the most notable improvement
overall.

[1]: https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations
--
Peter Geoghegan

Attachments:

v10-0002-Add-pg_depend-index-scan-tiebreaker-column.patch
From bb9527ca503edd591dc84e152079b24e1e7401b6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 13 Nov 2018 18:14:23 -0800
Subject: [PATCH v10 2/7] Add pg_depend index scan tiebreaker column.

findDependentObjects() and other code that scans pg_depend evolved to
depend on pg_depend_depender_index and pg_depend_reference_index scan
order.  This is clear from running the regression tests with
"ignore_system_indexes=on": much of the test output changes for
regression tests that happen to have DROP diagnostic messages.  More
importantly, a small number of those test failures involve alternative
messages that are objectively useless or even harmful.  These
regressions all involve internal dependencies of one form or another.
For example, we can go from a HINT that suggests dropping a partitioning
parent table's trigger in order to drop the child's trigger to a HINT
that suggests simply dropping the child table.  This HINT is technically
still correct from the point of view of findDependentObjects(), and yet
it's clearly not the intended user-visible behavior.

Make dependency.c take responsibility for its dependency on scan order
by commenting on it directly.  Ensure that the behavior of
findDependentObjects() is deterministic in the event of duplicates by
adding a new per-backend sequentially assigned number column to
pg_depend.  Both indexes now use this new column as a final trailing
attribute, effectively making it a tiebreaker wherever processing order
happens to matter.  The new column is a per-backend sequentially
assigned number.  New values are assigned in decreasing order.  That
produces the behavior that's already expected from nbtree scans in the
event of duplicate index entries (nbtree usually leaves duplicate index
entries in reverse insertion order at present).  A similar new column is
not needed for pg_shdepend because the aforementioned harmful changes
only occur with cases involving internal dependencies.

The overall effect is to stabilize the behavior of DROP diagnostic
messages, making it possible to avoid the "\set VERBOSITY=terse" hack
that has been used to paper over test instability in the past.  We may
wish to go through the regression tests and remove existing instances of
the "\set VERBOSITY=terse" hack (see also: commit 8e753726), but that's
left for later.

An upcoming patch to make nbtree store duplicate entries in a
well-defined order by treating heap TID as a tiebreaker tuple attribute
more or less flips the order that duplicates appear (the order will
change from usually descending to perfectly ascending).  That change in
the order of duplicates has undesirable side effects relating to
diagnostic messages generated by findDependentObjects() callers.  Those
problematic semantic changes are avoided by this pg_depend groundwork.

Note that adding the new column has no appreciable storage overhead.
pg_depend indexes are made no larger, at least on 64-bit platforms,
because values can fit in a hole that was previously unused due to
alignment -- both pg_depend_depender_index and pg_depend_reference_index
continue to have 24 byte IndexTuples.  There is also no change in the
on-disk size of pg_depend heap relations on 64-bit platforms, for the
same reason.  The MAXALIGN()'d size of pg_depend heap tuples remains 56
bytes (including tuple header overhead).

Discussion: https://postgr.es/m/CAH2-Wzkypv1R+teZrr71U23J578NnTBt2X8+Y=Odr4pOdW1rXg@mail.gmail.com
Discussion: https://postgr.es/m/11852.1501610262%40sss.pgh.pa.us
---
 doc/src/sgml/catalogs.sgml                | 17 ++++++++-
 src/backend/catalog/dependency.c          | 10 ++++++
 src/backend/catalog/pg_depend.c           | 11 +++++-
 src/bin/initdb/initdb.c                   | 44 +++++++++++------------
 src/include/catalog/indexing.h            |  4 +--
 src/include/catalog/pg_depend.h           |  1 +
 src/test/regress/expected/alter_table.out |  2 +-
 src/test/regress/expected/misc_sanity.out |  4 +--
 8 files changed, 64 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index af4d0625ea..be062dc8a2 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2937,6 +2937,20 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
       </entry>
      </row>
 
+     <row>
+      <entry><structfield>depcreate</structfield></entry>
+      <entry><type>int4</type></entry>
+      <entry></entry>
+      <entry>
+       A per-backend sequentially assigned number for this dependency
+       relationship.  Used as a tiebreaker in the event of multiple
+       internal dependency relationships of otherwise equal
+       precedence.  Identifiers are assigned in descending order to
+       ensure that the most recently entered dependency is the one
+       referenced by <literal>HINT</literal> fields.
+      </entry>
+     </row>
+
     </tbody>
    </tgroup>
   </table>
@@ -3054,7 +3068,8 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
        that the system itself depends on the referenced object, and so
        that object must never be deleted.  Entries of this type are
        created only by <command>initdb</command>.  The columns for the
-       dependent object contain zeroes.
+       dependent object and the <structfield>depcreate</structfield>
+       column contain zeroes.
       </para>
      </listitem>
     </varlistentry>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index dc679ed8b9..f4c87a88f8 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -532,6 +532,11 @@ findDependentObjects(const ObjectAddress *object,
 	else
 		nkeys = 2;
 
+	/*
+	 * Note that we rely on DependDependerIndexId scan order to make
+	 * diagnostic messages deterministic.  (e.g., objsubid = 0 entries will be
+	 * processed before other entries for the same dependent object.)
+	 */
 	scan = systable_beginscan(*depRel, DependDependerIndexId, true,
 							  NULL, nkeys, key);
 
@@ -727,6 +732,11 @@ findDependentObjects(const ObjectAddress *object,
 	else
 		nkeys = 2;
 
+	/*
+	 * Note that we rely on DependReferenceIndexId scan order to make
+	 * diagnostic messages deterministic.  (e.g., refobjsubid = 0 entries will
+	 * be processed before other entries for the same referenced object.)
+	 */
 	scan = systable_beginscan(*depRel, DependReferenceIndexId, true,
 							  NULL, nkeys, key);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index fde7e170be..9197dd0665 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -29,6 +29,8 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
+/* Per-backend pg_depend tiebreaker value */
+static int32 depcreate = PG_INT32_MAX;
 
 static bool isObjectPinned(const ObjectAddress *object, Relation rel);
 
@@ -93,7 +95,10 @@ recordMultipleDependencies(const ObjectAddress *depender,
 		{
 			/*
 			 * Record the Dependency.  Note we don't bother to check for
-			 * duplicate dependencies; there's no harm in them.
+			 * duplicate dependencies; there's no harm in them.  Note that
+			 * depcreate ensures deterministic processing among dependencies
+			 * of otherwise equal precedence (e.g., among multiple entries of
+			 * the same refclassid + refobjid + refobjsubid).
 			 */
 			values[Anum_pg_depend_classid - 1] = ObjectIdGetDatum(depender->classId);
 			values[Anum_pg_depend_objid - 1] = ObjectIdGetDatum(depender->objectId);
@@ -104,6 +109,7 @@ recordMultipleDependencies(const ObjectAddress *depender,
 			values[Anum_pg_depend_refobjsubid - 1] = Int32GetDatum(referenced->objectSubId);
 
 			values[Anum_pg_depend_deptype - 1] = CharGetDatum((char) behavior);
+			values[Anum_pg_depend_depcreate - 1] = Int32GetDatum(depcreate--);
 
 			tup = heap_form_tuple(dependDesc->rd_att, values, nulls);
 
@@ -114,6 +120,9 @@ recordMultipleDependencies(const ObjectAddress *depender,
 			CatalogTupleInsertWithInfo(dependDesc, tup, indstate);
 
 			heap_freetuple(tup);
+			/* avoid signed underflow */
+			if (depcreate == PG_INT32_MIN)
+				depcreate = PG_INT32_MAX;
 		}
 	}
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index e55ba668ce..5b2c7b2ccd 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1618,55 +1618,55 @@ setup_depend(FILE *cmdfd)
 		"DELETE FROM pg_shdepend;\n\n",
 		"VACUUM pg_shdepend;\n\n",
 
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_class;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_proc;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_type;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_cast;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_constraint;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_conversion;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_attrdef;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_language;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_operator;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_opclass;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_opfamily;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_am;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_amop;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_amproc;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_rewrite;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_trigger;\n\n",
 
 		/*
 		 * restriction here to avoid pinning the public namespace
 		 */
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_namespace "
 		"    WHERE nspname LIKE 'pg%';\n\n",
 
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_ts_parser;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_ts_dict;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_ts_template;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_ts_config;\n\n",
-		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p' "
+		"INSERT INTO pg_depend SELECT 0,0,0, tableoid,oid,0, 'p',0 "
 		" FROM pg_collation;\n\n",
 		"INSERT INTO pg_shdepend SELECT 0,0,0,0, tableoid,oid, 'p' "
 		" FROM pg_authid;\n\n",
diff --git a/src/include/catalog/indexing.h b/src/include/catalog/indexing.h
index 833fad1f6a..b1276ccf96 100644
--- a/src/include/catalog/indexing.h
+++ b/src/include/catalog/indexing.h
@@ -142,9 +142,9 @@ DECLARE_UNIQUE_INDEX(pg_database_datname_index, 2671, on pg_database using btree
 DECLARE_UNIQUE_INDEX(pg_database_oid_index, 2672, on pg_database using btree(oid oid_ops));
 #define DatabaseOidIndexId	2672
 
-DECLARE_INDEX(pg_depend_depender_index, 2673, on pg_depend using btree(classid oid_ops, objid oid_ops, objsubid int4_ops));
+DECLARE_INDEX(pg_depend_depender_index, 2673, on pg_depend using btree(classid oid_ops, objid oid_ops, objsubid int4_ops, depcreate int4_ops));
 #define DependDependerIndexId  2673
-DECLARE_INDEX(pg_depend_reference_index, 2674, on pg_depend using btree(refclassid oid_ops, refobjid oid_ops, refobjsubid int4_ops));
+DECLARE_INDEX(pg_depend_reference_index, 2674, on pg_depend using btree(refclassid oid_ops, refobjid oid_ops, refobjsubid int4_ops, depcreate int4_ops));
 #define DependReferenceIndexId	2674
 
 DECLARE_UNIQUE_INDEX(pg_description_o_c_o_index, 2675, on pg_description using btree(objoid oid_ops, classoid oid_ops, objsubid int4_ops));
diff --git a/src/include/catalog/pg_depend.h b/src/include/catalog/pg_depend.h
index f786445fb2..4ad1d7b33b 100644
--- a/src/include/catalog/pg_depend.h
+++ b/src/include/catalog/pg_depend.h
@@ -61,6 +61,7 @@ CATALOG(pg_depend,2608,DependRelationId)
 	 * field.  See DependencyType in catalog/dependency.h.
 	 */
 	char		deptype;		/* see codes in dependency.h */
+	int32		depcreate;		/* per-backend identifier; tiebreaker */
 } FormData_pg_depend;
 
 /* ----------------
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 27cf550396..7bb8ca9128 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -2583,10 +2583,10 @@ DETAIL:  drop cascades to table alter2.t1
 drop cascades to view alter2.v1
 drop cascades to function alter2.plus1(integer)
 drop cascades to type alter2.posint
-drop cascades to operator family alter2.ctype_hash_ops for access method hash
 drop cascades to type alter2.ctype
 drop cascades to function alter2.same(alter2.ctype,alter2.ctype)
 drop cascades to operator alter2.=(alter2.ctype,alter2.ctype)
+drop cascades to operator family alter2.ctype_hash_ops for access method hash
 drop cascades to conversion alter2.ascii_to_utf8
 drop cascades to text search parser alter2.prs
 drop cascades to text search configuration alter2.cfg
diff --git a/src/test/regress/expected/misc_sanity.out b/src/test/regress/expected/misc_sanity.out
index 8538173ff8..2f299e0adc 100644
--- a/src/test/regress/expected/misc_sanity.out
+++ b/src/test/regress/expected/misc_sanity.out
@@ -18,8 +18,8 @@ WHERE refclassid = 0 OR refobjid = 0 OR
       deptype NOT IN ('a', 'e', 'i', 'n', 'p') OR
       (deptype != 'p' AND (classid = 0 OR objid = 0)) OR
       (deptype = 'p' AND (classid != 0 OR objid != 0 OR objsubid != 0));
- classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype 
----------+-------+----------+------------+----------+-------------+---------
+ classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype | depcreate 
+---------+-------+----------+------------+----------+-------------+---------+-----------
 (0 rows)
 
 -- **************** pg_shdepend ****************
-- 
2.17.1

v10-0001-Refactor-nbtree-insertion-scankeys.patch
From a18079b7e9a109f65f1ea3b1a09ec7d3c523eaa8 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 29 Dec 2018 15:34:48 -0800
Subject: [PATCH v10 1/7] Refactor nbtree insertion scankeys.

Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.  (It also allows us to store mutable state in
the insertion scankey, which may one day be used to reduce the amount of
calls to support function 1 comparators during the initial descent of a
B-Tree index [1].)

Use the new struct to store mutable state about an in-progress binary
search, rather than having _bt_check_unique() callers cache
_bt_binsrch() effort in an ad-hoc manner.  This makes it easy to add a
new optimization: _bt_check_unique() calls now fall out immediately in
the common case where it's already clear that there couldn't possibly be
a duplicate.

Importantly, the new _bt_check_unique() scheme makes it a lot easier to
manage cached binary search effort within _bt_findinsertloc().  This
matters a lot to the upcoming patch to make nbtree tuples unique by
treating heap TID as a tie-breaker attribute, since it allows
_bt_findinsertloc() to sensibly deal with pre-pg_upgrade indexes and
unique indexes as special cases.

Based on a suggestion by Andrey Lepikhov.

[1] https://postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
---
 contrib/amcheck/verify_nbtree.c       |  83 ++++---
 src/backend/access/nbtree/nbtinsert.c | 300 +++++++++++++++-----------
 src/backend/access/nbtree/nbtpage.c   |  11 +-
 src/backend/access/nbtree/nbtsearch.c | 226 ++++++++++++-------
 src/backend/access/nbtree/nbtsort.c   |   2 +-
 src/backend/access/nbtree/nbtutils.c  |  39 ++--
 src/backend/utils/sort/tuplesort.c    |   4 +-
 src/include/access/nbtree.h           |  62 ++++--
 8 files changed, 431 insertions(+), 296 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index a8a0ec70e1..c7cdca3962 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -125,25 +125,23 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
-static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+static IndexTuple bt_right_page_check_tuple(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
 						  bool tupleIsAlive, void *checkstate);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
-static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+static inline bool invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
-					 OffsetNumber lowerbound);
+static inline bool invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound);
 static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
-							   OffsetNumber upperbound);
+							 BTScanInsert key,
+							 Page nontarget,
+							 OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
 
 /*
@@ -834,8 +832,8 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		BTScanInsert skey;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -902,7 +900,7 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
-		/* Build insertion scankey for current page offset */
+		/* Build insertion scankey for current page offset/tuple */
 		skey = _bt_mkscankey(state->rel, itup);
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
@@ -959,8 +957,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 * current item is less than or equal to next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_leq_offset(state, skey, OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1017,16 +1014,20 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
-			ScanKey		rightkey;
+			IndexTuple	righttup;
+			BTScanInsert rightkey;
 
 			/* Get item in next/right page */
-			rightkey = bt_right_page_check_scankey(state);
+			righttup = bt_right_page_check_tuple(state);
 
-			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+			/* Set up right item scankey */
+			if (righttup)
+				rightkey = _bt_mkscankey(state->rel, righttup);
+
+			if (righttup && !invariant_geq_offset(state, rightkey, max))
 			{
 				/*
-				 * As explained at length in bt_right_page_check_scankey(),
+				 * As explained at length in bt_right_page_check_tuple(),
 				 * there is a known !readonly race that could account for
 				 * apparent violation of invariant, which we must check for
 				 * before actually proceeding with raising error.  Our canary
@@ -1069,7 +1070,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, skey, childblock);
 		}
 	}
 
@@ -1083,9 +1084,9 @@ bt_target_page_check(BtreeCheckState *state)
 }
 
 /*
- * Return a scankey for an item on page to right of current target (or the
+ * Return an index tuple for an item on page to right of current target (or the
  * first non-ignorable page), sufficient to check ordering invariant on last
- * item in current target page.  Returned scankey relies on local memory
+ * item in current target page.  Returned tuple relies on local memory
  * allocated for the child page, which caller cannot pfree().  Caller's memory
  * context should be reset between calls here.
  *
@@ -1098,8 +1099,8 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
-bt_right_page_check_scankey(BtreeCheckState *state)
+static IndexTuple
+bt_right_page_check_tuple(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
@@ -1287,11 +1288,10 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	}
 
 	/*
-	 * Return first real item scankey.  Note that this relies on right page
-	 * memory remaining allocated.
+	 * Return first real item.  Note that this relies on right page memory
+	 * remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	return (IndexTuple) PageGetItem(rightpage, rightitem);
 }
 
 /*
@@ -1304,8 +1304,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  * verification this way around is much more practical.
  */
 static void
-bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1410,8 +1410,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1759,13 +1758,12 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
+invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
@@ -1778,13 +1776,12 @@ invariant_leq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
+invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
 	return cmp >= 0;
 }
@@ -1800,14 +1797,12 @@ invariant_geq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							 Page nontarget, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
 	return cmp <= 0;
 }
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 858df3b766..cbc07d316b 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -52,19 +52,19 @@ typedef struct
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
-static TransactionId _bt_check_unique(Relation rel, IndexTuple itup,
-				 Relation heapRel, Buffer buf, OffsetNumber offset,
-				 ScanKey itup_scankey,
-				 IndexUniqueCheck checkUnique, bool *is_unique,
-				 uint32 *speculativeToken);
-static void _bt_findinsertloc(Relation rel,
+static TransactionId _bt_check_unique(Relation rel, BTScanInsert itup_scankey,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
+				 OffsetNumber offset, IndexUniqueCheck checkUnique,
+				 bool *is_unique, uint32 *speculativeToken);
+static OffsetNumber _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_scankey,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool checkingunique,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel);
+static bool _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -84,8 +84,8 @@ static void _bt_checksplitloc(FindSplitData *state,
 				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey);
+static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey,
+			Page page, OffsetNumber offnum);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
@@ -111,15 +111,14 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel)
 {
 	bool		is_unique = false;
-	int			indnkeyatts;
-	ScanKey		itup_scankey;
+	BTScanInsert itup_scankey;
 	BTStack		stack = NULL;
 	Buffer		buf;
+	Page		page;
 	OffsetNumber offset;
+	BTPageOpaque lpageop;
 	bool		fastpath;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	Assert(indnkeyatts != 0);
+	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
@@ -149,8 +148,6 @@ top:
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
 		Size		itemsz;
-		Page		page;
-		BTPageOpaque lpageop;
 
 		/*
 		 * Conditionally acquire exclusive lock on the buffer before doing any
@@ -180,8 +177,7 @@ top:
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, itup_scankey, page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -220,8 +216,7 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, itup_scankey, &buf, BT_WRITE, NULL);
 	}
 
 	/*
@@ -245,13 +240,21 @@ top:
 	 * let the tuple in and return false for possibly non-unique, or true for
 	 * definitely unique.
 	 */
-	if (checkUnique != UNIQUE_CHECK_NO)
+	if (checkingunique)
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
-		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
+		page = BufferGetPage(buf);
+		lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+		/*
+		 * Arrange for the later _bt_findinsertloc call to _bt_binsrch to
+		 * avoid repeating the work done during this initial _bt_binsrch call
+		 */
+		itup_scankey->savebinsrch = true;
+		offset = _bt_binsrch(rel, itup_scankey, buf);
+		xwait = _bt_check_unique(rel, itup_scankey, itup, heapRel, buf, offset,
 								 checkUnique, &is_unique, &speculativeToken);
 
 		if (TransactionIdIsValid(xwait))
@@ -278,6 +281,8 @@ top:
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		OffsetNumber insertoff;
+
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -289,9 +294,9 @@ top:
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
 		/* do the insertion */
-		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+		insertoff = _bt_findinsertloc(rel, itup_scankey, &buf, checkingunique,
+									  itup, stack, heapRel);
+		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, insertoff, false);
 	}
 	else
 	{
@@ -302,7 +307,7 @@ top:
 	/* be tidy */
 	if (stack)
 		_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
+	pfree(itup_scankey);
 
 	return is_unique;
 }
@@ -327,13 +332,12 @@ top:
  * core code must redo the uniqueness check later.
  */
 static TransactionId
-_bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
-				 Buffer buf, OffsetNumber offset, ScanKey itup_scankey,
-				 IndexUniqueCheck checkUnique, bool *is_unique,
-				 uint32 *speculativeToken)
+_bt_check_unique(Relation rel, BTScanInsert itup_scankey,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
+				 OffsetNumber offset, IndexUniqueCheck checkUnique,
+				 bool *is_unique, uint32 *speculativeToken)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	SnapshotData SnapshotDirty;
 	OffsetNumber maxoff;
 	Page		page;
@@ -344,6 +348,10 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
 
+	/* _bt_binsrch() alone may determine that there are no duplicates */
+	if (itup_scankey->low >= itup_scankey->high)
+		goto notfound;
+
 	InitDirtySnapshot(SnapshotDirty);
 
 	page = BufferGetPage(buf);
@@ -392,7 +400,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 				 * in real comparison, but only for ordering/finding items on
 				 * pages. - vadim 03/24/97
 				 */
-				if (!_bt_isequal(itupdesc, page, offset, indnkeyatts, itup_scankey))
+				if (!_bt_isequal(itupdesc, itup_scankey, page, offset))
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
@@ -553,11 +561,14 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
+			int			highkeycmp;
+
 			/* If scankey == hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			highkeycmp = _bt_compare(rel, itup_scankey, page, P_HIKEY);
+			Assert(highkeycmp <= 0);
+			if (highkeycmp != 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -577,6 +588,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 		}
 	}
 
+notfound:
+
 	/*
 	 * If we are doing a recheck then we should have found the tuple we are
 	 * checking.  Otherwise there's something very wrong --- probably, the
@@ -599,7 +612,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 
 
 /*
- *	_bt_findinsertloc() -- Finds an insert location for a tuple
+ *	_bt_findinsertloc() -- Finds an insert location for a new tuple
  *
  *		If the new key is equal to one or more existing keys, we can
  *		legitimately place it anywhere in the series of equal keys --- in fact,
@@ -612,39 +625,40 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
  *		Once we have chosen the page to put the key on, we'll insert it before
  *		any existing equal keys because of the way _bt_binsrch() works.
  *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
+ *		_bt_check_unique() callers arrange for their insertion scan key to
+ *		save the progress of the last binary search performed.  No additional
+ *		binary search comparisons occur in the common case where there was no
+ *		existing duplicate tuple, though we may occasionally be unable to
+ *		reuse the saved work (e.g. when we microvacuum or move right).  Even
+ *		when there are garbage duplicates, very few binary search comparisons
+ *		that aren't strictly necessary will be performed.
  *
- *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
- *		page returned to the caller is exclusively locked instead.
+ *		The caller should hold an exclusive lock on *bufptr in all cases.  On
+ *		exit, *bufptr points to the buffer chosen for the insertion.  If
+ *		we have to move right, the lock and pin on the original page will be
+ *		released, and the new page returned to the caller is exclusively
+ *		locked instead.  In any case, we return the offset that the caller
+ *		should use to insert into the buffer pointed to by bufptr on return.
  *
- *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.  It has to happen here, since it may invalidate a
+ *		_bt_check_unique() caller's cached binary search work.
  */
-static void
+static OffsetNumber
 _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_scankey,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool checkingunique,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel)
 {
 	Buffer		buf = *bufptr;
 	Page		page = BufferGetPage(buf);
+	bool		restorebinsrch = checkingunique;
 	Size		itemsz;
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
 	OffsetNumber newitemoff;
-	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -673,55 +687,24 @@ _bt_findinsertloc(Relation rel,
 				 errtableconstraint(heapRel,
 									RelationGetRelationName(rel))));
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
-	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	for (;;)
 	{
 		Buffer		rbuf;
 		BlockNumber rblkno;
+		int			cmpval;
+
+		if (P_RIGHTMOST(lpageop))
+			break;
+		cmpval = _bt_compare(rel, itup_scankey, page, P_HIKEY);
 
 		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
+		 * We may have to handle the case where there is a choice of which
+		 * page to place the new tuple on, and we must balance space
+		 * utilization as best we can.
 		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
-		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
-
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
-		}
-
-		/*
-		 * nope, so check conditions (b) and (c) enumerated above
-		 */
-		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
+												&restorebinsrch, itemsz))
 			break;
 
 		/*
@@ -764,27 +747,92 @@ _bt_findinsertloc(Relation rel,
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		restorebinsrch = false;
 	}
 
 	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
+	 * Perform microvacuuming of the page we're about to insert the tuple on,
+	 * if it looks like it has LP_DEAD items.  Only microvacuum when that is
+	 * likely to forestall a page split, though.
 	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
-	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < itemsz)
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		restorebinsrch = false;
+	}
+
+	/* _bt_check_unique() callers often avoid binary search effort */
+	itup_scankey->restorebinsrch = restorebinsrch;
+	newitemoff = _bt_binsrch(rel, itup_scankey, buf);
+	Assert(!itup_scankey->restorebinsrch);
+	/* XXX: may use too many cycles to be a simple assertion */
+	Assert(!restorebinsrch ||
+		   newitemoff == _bt_binsrch(rel, itup_scankey, buf));
 
 	*bufptr = buf;
-	*offsetptr = newitemoff;
+	return newitemoff;
+}
+
+/*
+ *	_bt_useduplicatepage() -- Settle for this page of duplicates?
+ *
+ *		This function decides whether an insertion of a duplicate into a
+ *		pg_upgrade'd !heapkeyspace index should go on the page contained
+ *		in buf when a choice must be made.  Preemptive microvacuuming is
+ *		performed here when that could allow the caller to insert onto
+ *		the page in buf.
+ *
+ *		Returns true if caller should proceed with insert on buf's page.
+ *		Otherwise, caller should move on to the page to the right (caller
+ *		must always be able to still move right following call here).
+ */
+static bool
+_bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz)
+{
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque lpageop;
+
+	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	Assert(P_ISLEAF(lpageop) && !P_RIGHTMOST(lpageop));
+
+	/* Easy case -- there is space free on this page already */
+	if (PageGetFreeSpace(page) >= itemsz)
+		return true;
+
+	if (P_HAS_GARBAGE(lpageop))
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		*restorebinsrch = false;
+		if (PageGetFreeSpace(page) >= itemsz)
+			return true;		/* OK, now we have enough space */
+	}
+
+	/*----------
+	 * It's now clear that our _bt_findinsertloc() caller will need to split
+	 * the page if it is to insert the new item onto it.  The choice to move
+	 * right to the next page remains open to it, but we should not search
+	 * for free space exhaustively when there are many pages to look through.
+	 *
+	 *	_bt_findinsertloc() keeps scanning right until it:
+	 *		(a) reaches the last page where the tuple can legally go
+	 *	Or until we:
+	 *		(b) find a page with enough free space, or
+	 *		(c) get tired of searching.
+	 * (c) is not flippant; it is important because if there are many
+	 * pages' worth of equal keys, it's better to split one of the early
+	 * pages than to scan all the way to the end of the run of equal keys
+	 * on every insert.  We implement "get tired" as a random choice,
+	 * since stopping after scanning a fixed number of pages wouldn't work
+	 * well (we'd never reach the right-hand side of previously split
+	 * pages).  The probability of moving right is set at 0.99, which may
+	 * seem too high to change the behavior much, but it does an excellent
+	 * job of preventing O(N^2) behavior with many equal keys.
+	 *----------
+	 */
+	return random() <= (MAX_RANDOM_VALUE / 100);
 }
 
 /*----------
@@ -1190,8 +1238,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	 * If the page we're splitting is not the rightmost page at its level in
 	 * the tree, then the first entry on the page is the high key for the
 	 * page.  We need to copy that to the right half.  Otherwise (meaning the
-	 * rightmost page case), all the items on the right half will be user
-	 * data.
+	 * rightmost page case), all the items on the right half will be user data
+	 * (there is no existing high key that needs to be relocated to the new
+	 * right page).
 	 */
 	rightoff = P_HIKEY;
 
@@ -2312,24 +2361,21 @@ _bt_pgaddtup(Page page,
  * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
-_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey)
+_bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey, Page page,
+			OffsetNumber offnum)
 {
 	IndexTuple	itup;
+	ScanKey		scankey;
 	int			i;
 
-	/* Better be comparing to a leaf item */
+	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
 
+	scankey = itup_scankey->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
-	for (i = 1; i <= keysz; i++)
+	for (i = 1; i <= itup_scankey->keysz; i++)
 	{
 		AttrNumber	attno;
 		Datum		datum;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 1d72fe5408..d0cf73718f 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1370,7 +1370,7 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 */
 			if (!stack)
 			{
-				ScanKey		itup_scankey;
+				BTScanInsert itup_scankey;
 				ItemId		itemid;
 				IndexTuple	targetkey;
 				Buffer		lbuf;
@@ -1421,11 +1421,10 @@ _bt_pagedel(Relation rel, Buffer buf)
 
 				/* we need an insertion scan key for the search, so build one */
 				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
-				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
-				/* don't need a pin on the page */
+				/* get stack to leaf page by searching index */
+				stack = _bt_search(rel, itup_scankey, &lbuf, BT_READ, NULL);
+
+				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
 				/*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index b5244aa213..9e44e88190 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -72,12 +72,9 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.  If the key was
+ * built using a leaf page's high key, that leaf page will be relocated.
  *
  * Return value is a stack of parent-page pointers.  *bufP is set to the
  * address of the leaf-page buffer, which is read-locked and pinned.
@@ -94,8 +91,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+		   Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -131,7 +128,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+		*bufP = _bt_moveright(rel, key, *bufP,
 							  (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
@@ -145,7 +142,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -199,8 +196,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  true, stack_in, BT_WRITE, snapshot);
+		*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+							  snapshot);
 	}
 
 	return stack_in;
@@ -216,16 +213,17 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * or strictly to the right of it.
  *
  * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page.  If that entry
- * is strictly less than the scankey, or <= the scankey in the nextkey=true
- * case, then we followed the wrong link and we need to move right.
+ * tree by examining the high key entry on the page.  If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index (see nbtree/README).
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key.  When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
  *
  * If forupdate is true, we will attempt to finish any incomplete splits
  * that we encounter.  This is required when locking a target page for an
@@ -242,10 +240,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  */
 Buffer
 _bt_moveright(Relation rel,
+			  BTScanInsert key,
 			  Buffer buf,
-			  int keysz,
-			  ScanKey scankey,
-			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
 			  int access,
@@ -270,7 +266,7 @@ _bt_moveright(Relation rel,
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
 	 */
-	cmpval = nextkey ? 0 : 1;
+	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
@@ -305,7 +301,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -325,13 +321,6 @@ _bt_moveright(Relation rel,
 /*
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
  * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
  * particular, this means it is possible to return a value 1 greater than the
@@ -347,37 +336,70 @@ _bt_moveright(Relation rel,
  *
  * This procedure is not responsible for walking right, it just examines
  * the given page.  _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
+ * on the buffer.  When itup_scankey.savebinsrch is set, it modifies
+ * mutable fields of the insertion scan key, so that a subsequent call
+ * where caller sets itup_scankey.restorebinsrch can reuse the low and
+ * high bounds of the original binary search.  This lets the second binary
+ * search performed on the first leaf page landed on by inserters that do
+ * unique enforcement avoid doing any real comparisons in most cases.
+ * See _bt_findinsertloc() for further details.
  */
 OffsetNumber
 _bt_binsrch(Relation rel,
-			Buffer buf,
-			int keysz,
-			ScanKey scankey,
-			bool nextkey)
+			BTScanInsert key,
+			Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber low,
-				high;
+				high,
+				savehigh;
 	int32		result,
 				cmpval;
+	bool		isleaf;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	isleaf = P_ISLEAF(opaque);
 
-	low = P_FIRSTDATAKEY(opaque);
-	high = PageGetMaxOffsetNumber(page);
+	Assert(!(key->restorebinsrch && key->savebinsrch));
+	Assert(P_ISLEAF(opaque) || (!key->restorebinsrch && !key->savebinsrch));
 
-	/*
-	 * If there are no keys on the page, return the first available slot. Note
-	 * this covers two cases: the page is really empty (no keys), or it
-	 * contains only a high key.  The latter case is possible after vacuuming.
-	 * This can never happen on an internal page, however, since they are
-	 * never empty (an internal page must have children).
-	 */
-	if (high < low)
-		return low;
+	if (!key->restorebinsrch)
+	{
+		low = P_FIRSTDATAKEY(opaque);
+		high = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If there are no keys on the page, return the first available slot.
+		 * Note this covers two cases: the page is really empty (no keys), or
+		 * it contains only a high key.  The latter case is possible after
+		 * vacuuming.  This can never happen on an internal page, however,
+		 * since they are never empty (an internal page must have children).
+		 */
+		if (unlikely(high < low))
+		{
+			if (key->savebinsrch)
+			{
+				key->low = low;
+				key->high = high;
+				key->savebinsrch = false;
+			}
+			return low;
+		}
+		high++;					/* establish the loop invariant for high */
+	}
+	else
+	{
+		/* Restore result of previous binary search against same page */
+		low = key->low;
+		high = key->high;
+		key->restorebinsrch = false;
+
+		/* Return the first slot, in line with original binary search */
+		if (high < low)
+			return low;
+	}
 
 	/*
 	 * Binary search to find the first key on the page >= scan key, or first
@@ -391,22 +413,37 @@ _bt_binsrch(Relation rel,
 	 *
 	 * We can fall out when high == low.
 	 */
-	high++;						/* establish the loop invariant for high */
-
-	cmpval = nextkey ? 0 : 1;	/* select comparison value */
 
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
+	savehigh = high;
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
 		else
+		{
 			high = mid;
+
+			/*
+			 * high can only be reused by more restrictive binary search when
+			 * high can only be reused by a more restrictive binary search when
+			 * its item is known to be strictly greater than the original scankey
+			if (result != 0)
+				savehigh = high;
+		}
+	}
+
+	if (key->savebinsrch)
+	{
+		key->low = low;
+		key->high = savehigh;
+		key->savebinsrch = false;
 	}
 
 	/*
@@ -416,12 +453,13 @@ _bt_binsrch(Relation rel,
 	 * On a leaf page, we always return the first key >= scan key (resp. >
 	 * scan key), which could be the last slot + 1.
 	 */
-	if (P_ISLEAF(opaque))
+	if (isleaf)
 		return low;
 
 	/*
 	 * On a non-leaf page, return the last key < scan key (resp. <= scan key).
-	 * There must be one if _bt_compare() is playing by the rules.
+	 * There must be one if _bt_compare()/_bt_tuple_compare() is playing by
+	 * the rules.
 	 */
 	Assert(low > P_FIRSTDATAKEY(opaque));
 
@@ -431,21 +469,11 @@ _bt_binsrch(Relation rel,
 /*----------
  *	_bt_compare() -- Compare scankey to a particular tuple on the page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * Convenience wrapper for _bt_tuple_compare() callers that want to compare
+ * the key against an item at some offset on a particular page.
  *
- *	keysz: number of key conditions to be checked (might be less than the
- *		number of index columns!)
  *	page/offnum: location of btree item to be compared to.
  *
- *		This routine returns:
- *			<0 if scankey < tuple at offnum;
- *			 0 if scankey == tuple at offnum;
- *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
- *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
  * scankey.  The actual key value stored (if any, which there probably isn't)
@@ -456,15 +484,12 @@ _bt_binsrch(Relation rel,
  */
 int32
 _bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
+			BTScanInsert key,
 			Page page,
 			OffsetNumber offnum)
 {
-	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
-	int			i;
 
 	Assert(_bt_check_natts(rel, page, offnum));
 
@@ -476,6 +501,35 @@ _bt_compare(Relation rel,
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	return _bt_tuple_compare(rel, key, itup, key->keysz);
+}
+
+/*----------
+ *	_bt_tuple_compare() -- Compare scankey to a particular tuple.
+ *
+ * The passed scankey must be an insertion-type scankey (see nbtree/README),
+ * but it can omit the rightmost column(s) of the index.
+ *
+ *		This routine returns:
+ *			<0 if scankey < tuple;
+ *			 0 if scankey == tuple;
+ *			>0 if scankey > tuple.
+ *		NULLs in the keys are treated as sortable values.  Therefore
+ *		"equality" does not necessarily mean that the item should be
+ *		returned to the caller as a matching key!
+ *----------
+ */
+int32
+_bt_tuple_compare(Relation rel,
+				  BTScanInsert key,
+				  IndexTuple itup,
+				  int ncmpkey)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			i;
+	ScanKey		scankey;
+
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -489,7 +543,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	scankey = key->scankeys;
+	for (i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -575,8 +630,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat;
 	bool		nextkey;
 	bool		goback;
+	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
-	ScanKeyData scankeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -822,8 +877,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
 	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built in the local
-	 * scankeys[] array, using the keys identified by startKeys[].
+	 * identified above.  The insertion scankey is built using the keys
+	 * identified by startKeys[].  (Remaining insertion scankey fields are
+	 * initialized after initial-positioning strategy is finalized.)
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
 	for (i = 0; i < keysCount; i++)
@@ -851,7 +907,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				_bt_parallel_done(scan);
 				return false;
 			}
-			memcpy(scankeys + i, subkey, sizeof(ScanKeyData));
+			memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
 
 			/*
 			 * If the row comparison is the last positioning key we accepted,
@@ -883,7 +939,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					if (subkey->sk_flags & SK_ISNULL)
 						break;	/* can't use null keys */
 					Assert(keysCount < INDEX_MAX_KEYS);
-					memcpy(scankeys + keysCount, subkey, sizeof(ScanKeyData));
+					memcpy(inskey.scankeys + keysCount, subkey,
+						   sizeof(ScanKeyData));
 					keysCount++;
 					if (subkey->sk_flags & SK_ROW_END)
 					{
@@ -929,7 +986,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				FmgrInfo   *procinfo;
 
 				procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
-				ScanKeyEntryInitializeWithInfo(scankeys + i,
+				ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
 											   cur->sk_flags,
 											   cur->sk_attno,
 											   InvalidStrategy,
@@ -950,7 +1007,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
 						 BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
 						 cur->sk_attno, RelationGetRelationName(rel));
-				ScanKeyEntryInitialize(scankeys + i,
+				ScanKeyEntryInitialize(inskey.scankeys + i,
 									   cur->sk_flags,
 									   cur->sk_attno,
 									   InvalidStrategy,
@@ -1053,12 +1110,17 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
+	/* Initialize remaining insertion scankey fields */
+	inskey.savebinsrch = inskey.restorebinsrch = false;
+	inskey.low = inskey.high = InvalidOffsetNumber;
+	inskey.nextkey = nextkey;
+	inskey.keysz = keysCount;
+
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
@@ -1087,7 +1149,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, &inskey, buf);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d9b9229ab7..933fb4dfe7 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1115,7 +1115,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
 		}
 
-		_bt_freeskey(indexScanKey);
+		pfree(indexScanKey);
 
 		for (;;)
 		{
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 2c05fb5e45..69d67fb428 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -58,32 +58,38 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *
  *		The result is intended for use with _bt_compare().
  */
-ScanKey
+BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
+	BTScanInsert inskey;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
 	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
+	int			tupnatts;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
 	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts > 0);
+	Assert(tupnatts <= indnatts);
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
 	 * (INCLUDE index) columns are always omitted from scan keys.
 	 */
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
+	inskey = palloc(offsetof(BTScanInsertData, scankeys) +
+					sizeof(ScanKeyData) * indnkeyatts);
+	inskey->savebinsrch = inskey->restorebinsrch = false;
+	inskey->low = inskey->high = InvalidOffsetNumber;
+	inskey->nextkey = false;
+	inskey->keysz = Min(indnkeyatts, tupnatts);
+	skey = inskey->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
 		FmgrInfo   *procinfo;
@@ -108,19 +114,17 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 									   arg);
 	}
 
-	return skey;
+	return inskey;
 }
 
 /*
  * _bt_mkscankey_nodata
- *		Build an insertion scan key that contains 3-way comparator routines
+ *		Build a raw insertion scan key that contains 3-way comparator routines
  *		appropriate to the key datatypes, but no comparison data.  The
  *		comparison data ultimately used must match the key datatypes.
  *
- *		The result cannot be used with _bt_compare(), unless comparison
- *		data is first stored into the key entries.  Currently this
- *		routine is only called by nbtsort.c and tuplesort.c, which have
- *		their own comparison routines.
+ *		Currently this routine is only called by nbtsort.c and tuplesort.c,
+ *		which have their own comparison routines.
  */
 ScanKey
 _bt_mkscankey_nodata(Relation rel)
@@ -159,15 +163,6 @@ _bt_mkscankey_nodata(Relation rel)
 	return skey;
 }
 
-/*
- * free a scan key made by either _bt_mkscankey or _bt_mkscankey_nodata.
- */
-void
-_bt_freeskey(ScanKey skey)
-{
-	pfree(skey);
-}
-
 /*
  * free a retracement stack made by _bt_search.
  */
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7b10fd2974..489eee095e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -964,7 +964,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -1042,7 +1042,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4fb92d60a1..c8fd036c9e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -319,6 +319,47 @@ typedef struct BTStackData
 
 typedef BTStackData *BTStack;
 
+/*
+ * BTScanInsert is the btree-private state needed to find an initial position
+ * for an indexscan, or to insert new tuples -- an "insertion scankey" (not to
+ * be confused with a search scankey).  It's used to descend a B-Tree using
+ * _bt_search.  For details on its mutable state, see _bt_binsrch and
+ * _bt_findinsertloc.
+ *
+ * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
+ * locate the first item >= scankey.  When nextkey is true, they will locate
+ * the first item > scan key.
+ *
+ * keysz is the number of insertion scankeys present.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared.
+ * During insertion, there must be a scan key for every attribute, but when
+ * starting a regular index scan some can be omitted.  The array is used as a
+ * flexible array member, though it's sized in a way that makes it possible to
+ * use stack allocations.  See nbtree/README for full details.
+ */
+
+typedef struct BTScanInsertData
+{
+	/*
+	 * Mutable state used by _bt_binsrch to inexpensively repeat a binary
+	 * search on the leaf level.  Only used for insertions where
+	 * _bt_check_unique is called.
+	 */
+	bool		savebinsrch;
+	bool		restorebinsrch;
+	OffsetNumber low;
+	OffsetNumber high;
+
+	/* State used to locate a position at the leaf level */
+	bool		nextkey;
+	int			keysz;			/* Size of scankeys */
+	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
+} BTScanInsertData;
+
+typedef BTScanInsertData *BTScanInsert;
+
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -558,16 +599,14 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
 /*
  * prototypes for functions in nbtsearch.c
  */
-extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
-extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
+		   int access, Snapshot snapshot);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_tuple_compare(Relation rel, BTScanInsert key, IndexTuple itup,
+				  int ncmpkey);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -576,9 +615,8 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 /*
  * prototypes for functions in nbtutils.c
  */
-extern ScanKey _bt_mkscankey(Relation rel, IndexTuple itup);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
 extern ScanKey _bt_mkscankey_nodata(Relation rel);
-extern void _bt_freeskey(ScanKey skey);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-- 
2.17.1

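For anyone skimming the previous patch without the tree handy, here is a tiny
standalone sketch (not PostgreSQL code; the array, the CachedBounds struct,
and the cached_bsearch name are all made up for illustration) of the
save/restore idea behind the new savebinsrch/restorebinsrch flags: the first
binary search stashes the bounds it converged on, and a second search over
the same, unchanged array restarts from those bounds, so a repeat search for
the same key costs no further comparisons.  The real patch saves a slightly
wider upper bound (the first offset known to be strictly greater than the
scan key) and throws the cache away whenever the page might have changed,
e.g. after microvacuuming or after moving right.

#include <stdio.h>

typedef struct
{
	int		low;	/* first candidate slot from the saved search */
	int		high;	/* first slot known to be past the key */
	int		valid;	/* do low/high hold a saved search? */
} CachedBounds;

/*
 * Return the index of the first element >= key.  When "restore" is set and
 * the cache holds bounds from an earlier search of the same (unchanged)
 * array, start from those bounds instead of [0, nitems).
 */
static int
cached_bsearch(const int *arr, int nitems, int key, CachedBounds *cache,
			   int restore)
{
	int		low;
	int		high;

	if (restore && cache->valid)
	{
		low = cache->low;
		high = cache->high;
	}
	else
	{
		low = 0;
		high = nitems;
	}

	while (high > low)
	{
		int		mid = low + (high - low) / 2;

		if (arr[mid] < key)
			low = mid + 1;
		else
			high = mid;
	}

	/* Save the converged bounds so a later search can pick up here */
	cache->low = low;
	cache->high = high;
	cache->valid = 1;

	return low;
}

int
main(void)
{
	int		arr[] = {10, 20, 20, 20, 30, 40};
	CachedBounds cache = {0, 0, 0};

	/* First search pays the full O(log n) cost and saves its bounds */
	int		first = cached_bsearch(arr, 6, 20, &cache, 0);

	/* Second search restores the bounds and needs no comparisons at all */
	int		second = cached_bsearch(arr, 6, 20, &cache, 1);

	printf("first = %d, second = %d\n", first, second);	/* both are 1 */
	return 0;
}
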
Attachment: v10-0004-Pick-nbtree-split-points-discerningly.patch (application/x-patch)
From c9b663d4f6f61089625012f90c45b9f951a83cea Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 15:51:53 -0700
Subject: [PATCH v10 4/7] Pick nbtree split points discerningly.

Add infrastructure to weigh how effective suffix truncation will be when
choosing a split point.  This should not noticeably affect the balance
of free space within each half of the split, while still making suffix
truncation truncate away significantly more attributes on average.

The logic for choosing a split point is also taught to care about the
case where there are many duplicates, making it hard to find a
distinguishing split point.  It may even conclude that the page being
split is already full of logical duplicates, in which case it packs the
left half very tightly, while leaving the right half mostly empty.  Our
assumption is that logical duplicates will almost always be inserted in
ascending heap TID order.  This strategy leaves most of the free space
on the half of the split that will likely be where future logical
duplicates of the same value need to be placed.

The number of cycles added is not very noticeable, which is important,
since the decision of where to split a page is made while an exclusive
buffer lock is held.  We avoid using authoritative insertion scankey
comparisons to save cycles, unlike suffix truncation proper.  We use a
faster binary comparison instead.

This patch is required to credibly assess anything about the performance
of the patch series.  Applying the patches up to and including this
patch in the series is sufficient to see much better space utilization
and space reuse with cases where many duplicates are inserted.  (Cases
resulting in searches for free space among many pages full of
duplicates, where the search inevitably "gets tired" on the master
branch [1]).

[1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
---
 src/backend/access/nbtree/Makefile      |   2 +-
 src/backend/access/nbtree/README        |  47 +-
 src/backend/access/nbtree/nbtinsert.c   | 295 +--------
 src/backend/access/nbtree/nbtsplitloc.c | 822 ++++++++++++++++++++++++
 src/backend/access/nbtree/nbtutils.c    |  67 ++
 src/include/access/nbtree.h             |  29 +-
 6 files changed, 968 insertions(+), 294 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtsplitloc.c

diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bbb21d235c..9aab9cf64a 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = nbtcompare.o nbtinsert.o nbtpage.o nbtree.o nbtsearch.o \
-       nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
+       nbtsplitloc.o nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index cb9ed61599..4fb06fa6e2 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -165,9 +165,9 @@ Lehman and Yao assume fixed-size keys, but we must deal with
 variable-size keys.  Therefore there is not a fixed maximum number of
 keys per page; we just stuff in as many as will fit.  When we split a
 page, we try to equalize the number of bytes, not items, assigned to
-each of the resulting pages.  Note we must include the incoming item in
-this calculation, otherwise it is possible to find that the incoming
-item doesn't fit on the split page where it needs to go!
+pages (though suffix truncation is also considered).  Note we must include
+the incoming item in this calculation, otherwise it is possible to find
+that the incoming item doesn't fit on the split page where it needs to go!
 
 The Deletion Algorithm
 ----------------------
@@ -669,6 +669,47 @@ variable-length types, such as text.  An opclass support function could
 manufacture the shortest possible key value that still correctly separates
 each half of a leaf page split.
 
+There are sophisticated criteria for choosing a leaf page split point.  The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split.  The
+choice of leaf split point can be thought of as a choice among points
+*between* items on the page to be split, at least if you pretend that the
+incoming tuple was placed on the page already (you have to pretend because
+there won't actually be enough space for it on the page).  Choosing the
+split point between two index tuples where the first non-equal attribute
+appears as early as possible results in truncating away as many suffix
+attributes as possible.  Evenly balancing space among each half of the
+split is usually the first concern, but even small adjustments in the
+precise split point can allow truncation to be far more effective.
+
+Suffix truncation is primarily valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective.  Even truncation that doesn't make pivot tuples
+smaller due to alignment still prevents pivot tuples from being more
+restrictive than truly necessary in how they describe which values belong
+on which pages.
+
+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page.  Among the split points within an acceptable range of the
+fillfactor-wise optimal split point, the one that implies the smallest
+downlink to insert in the parent is chosen.  This idea also comes from the
+Prefix B-Tree paper.  This process has much in common with what happens at
+the leaf level to make suffix truncation effective.  The overall
+effect is that suffix truncation tends to produce smaller and less
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, less discriminating
+pivot tuples end up in the root page, delaying root page splits.
+
+Logical duplicates are given special consideration.  The logic for
+selecting a split point goes to great lengths to avoid having duplicates
+span more than one page, and almost always manages to pick a split point
+between two user-key-distinct tuples, accepting a completely lopsided split
+if it must.  When a page that's already full of duplicates must be split,
+the fallback strategy assumes that duplicates are mostly inserted in
+ascending heap TID order.  The page is split in a way that leaves the left
+half of the page mostly full, and the right half of the page mostly empty.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 04c6023cba..0741ba455c 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -29,26 +29,6 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
-typedef struct
-{
-	/* context data for _bt_checksplitloc */
-	Size		newitemsz;		/* size of new item to be inserted */
-	int			fillfactor;		/* needed when splitting rightmost page */
-	bool		is_leaf;		/* T if splitting a leaf page */
-	bool		is_rightmost;	/* T if splitting a rightmost page */
-	OffsetNumber newitemoff;	/* where the new item is to be inserted */
-	int			leftspace;		/* space available for items on left page */
-	int			rightspace;		/* space available for items on right page */
-	int			olddataitemstotal;	/* space taken by old items */
-
-	bool		have_split;		/* found a valid split? */
-
-	/* these fields valid only if have_split is true */
-	bool		newitemonleft;	/* new item on left or right of best split */
-	OffsetNumber firstright;	/* best split point */
-	int			best_delta;		/* best size delta so far */
-} FindSplitData;
-
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -75,13 +55,6 @@ static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
 		  IndexTuple newitem, bool newitemonleft, bool truncate);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
-static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft);
-static void _bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright, bool newitemonleft,
-				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey,
@@ -879,8 +852,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
  *
  *		This recursive procedure does the following things:
  *
- *			+  if necessary, splits the target page (making sure that the
- *			   split is equitable as far as post-insert free space goes).
+ *			+  if necessary, splits the target page.
  *			+  inserts the tuple.
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
@@ -974,7 +946,7 @@ _bt_insertonpg(Relation rel,
 
 		/* Choose the split point */
 		firstright = _bt_findsplitloc(rel, page,
-									  newitemoff, itemsz,
+									  newitemoff, itemsz, itup,
 									  &newitemonleft);
 
 		/*
@@ -1318,6 +1290,11 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	 * to go into the new right page, or possibly a truncated version if this
 	 * is a leaf page split.  This might be either the existing data item at
 	 * position firstright, or the incoming tuple.
+	 *
+	 * Lehman and Yao use the last left item as the new high key for the left
+	 * page.  Despite appearances, the new high key is generated in a way
+	 * that's consistent with their approach.  See comments above
+	 * _bt_findsplitloc for an explanation.
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1659,264 +1636,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	return rbuf;
 }
 
-/*
- *	_bt_findsplitloc() -- find an appropriate place to split a page.
- *
- * The idea here is to equalize the free space that will be on each split
- * page, *after accounting for the inserted tuple*.  (If we fail to account
- * for it, we might find ourselves with too little room on the page that
- * it needs to go into!)
- *
- * If the page is the rightmost page on its level, we instead try to arrange
- * to leave the left split page fillfactor% full.  In this way, when we are
- * inserting successively increasing keys (consider sequences, timestamps,
- * etc) we will end up with a tree whose pages are about fillfactor% full,
- * instead of the 50% full result that we'd get without this special case.
- * This is the same as nbtsort.c produces for a newly-created tree.  Note
- * that leaf and nonleaf pages use different fillfactors.
- *
- * We are passed the intended insert position of the new tuple, expressed as
- * the offsetnumber of the tuple it must go in front of.  (This could be
- * maxoff+1 if the tuple is to go at the end.)
- *
- * We return the index of the first existing tuple that should go on the
- * righthand page, plus a boolean indicating whether the new tuple goes on
- * the left or right page.  The bool is necessary to disambiguate the case
- * where firstright == newitemoff.
- */
-static OffsetNumber
-_bt_findsplitloc(Relation rel,
-				 Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft)
-{
-	BTPageOpaque opaque;
-	OffsetNumber offnum;
-	OffsetNumber maxoff;
-	ItemId		itemid;
-	FindSplitData state;
-	int			leftspace,
-				rightspace,
-				goodenough,
-				olddataitemstotal,
-				olddataitemstoleft;
-	bool		goodenoughfound;
-
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
-	newitemsz += sizeof(ItemIdData);
-
-	/* Total free space available on a btree page, after fixed overhead */
-	leftspace = rightspace =
-		PageGetPageSize(page) - SizeOfPageHeaderData -
-		MAXALIGN(sizeof(BTPageOpaqueData));
-
-	/* The right page will have the same high key as the old page */
-	if (!P_RIGHTMOST(opaque))
-	{
-		itemid = PageGetItemId(page, P_HIKEY);
-		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
-							 sizeof(ItemIdData));
-	}
-
-	/* Count up total space in data items without actually scanning 'em */
-	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
-
-	state.newitemsz = newitemsz;
-	state.is_leaf = P_ISLEAF(opaque);
-	state.is_rightmost = P_RIGHTMOST(opaque);
-	state.have_split = false;
-	if (state.is_leaf)
-		state.fillfactor = RelationGetFillFactor(rel,
-												 BTREE_DEFAULT_FILLFACTOR);
-	else
-		state.fillfactor = BTREE_NONLEAF_FILLFACTOR;
-	state.newitemonleft = false;	/* these just to keep compiler quiet */
-	state.firstright = 0;
-	state.best_delta = 0;
-	state.leftspace = leftspace;
-	state.rightspace = rightspace;
-	state.olddataitemstotal = olddataitemstotal;
-	state.newitemoff = newitemoff;
-
-	/*
-	 * Finding the best possible split would require checking all the possible
-	 * split points, because of the high-key and left-key special cases.
-	 * That's probably more work than it's worth; instead, stop as soon as we
-	 * find a "good-enough" split, where good-enough is defined as an
-	 * imbalance in free space of no more than pagesize/16 (arbitrary...) This
-	 * should let us stop near the middle on most pages, instead of plowing to
-	 * the end.
-	 */
-	goodenough = leftspace / 16;
-
-	/*
-	 * Scan through the data items and calculate space usage for a split at
-	 * each possible position.
-	 */
-	olddataitemstoleft = 0;
-	goodenoughfound = false;
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	for (offnum = P_FIRSTDATAKEY(opaque);
-		 offnum <= maxoff;
-		 offnum = OffsetNumberNext(offnum))
-	{
-		Size		itemsz;
-
-		itemid = PageGetItemId(page, offnum);
-		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
-
-		/*
-		 * Will the new item go to left or right of split?
-		 */
-		if (offnum > newitemoff)
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-		else if (offnum < newitemoff)
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		else
-		{
-			/* need to try it both ways! */
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		}
-
-		/* Abort scan once we find a good-enough choice */
-		if (state.have_split && state.best_delta <= goodenough)
-		{
-			goodenoughfound = true;
-			break;
-		}
-
-		olddataitemstoleft += itemsz;
-	}
-
-	/*
-	 * If the new item goes as the last item, check for splitting so that all
-	 * the old items go to the left page and the new item goes to the right
-	 * page.
-	 */
-	if (newitemoff > maxoff && !goodenoughfound)
-		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
-
-	/*
-	 * I believe it is not possible to fail to find a feasible split, but just
-	 * in case ...
-	 */
-	if (!state.have_split)
-		elog(ERROR, "could not find a feasible split point for index \"%s\"",
-			 RelationGetRelationName(rel));
-
-	*newitemonleft = state.newitemonleft;
-	return state.firstright;
-}
-
-/*
- * Subroutine to analyze a particular possible split choice (ie, firstright
- * and newitemonleft settings), and record the best split so far in *state.
- *
- * firstoldonright is the offset of the first item on the original page
- * that goes to the right page, and firstoldonrightsz is the size of that
- * tuple. firstoldonright can be > max offset, which means that all the old
- * items go to the left page and only the new item goes to the right page.
- * In that case, firstoldonrightsz is not used.
- *
- * olddataitemstoleft is the total size of all old items to the left of
- * firstoldonright.
- */
-static void
-_bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright,
-				  bool newitemonleft,
-				  int olddataitemstoleft,
-				  Size firstoldonrightsz)
-{
-	int			leftfree,
-				rightfree;
-	Size		firstrightitemsz;
-	bool		newitemisfirstonright;
-
-	/* Is the new item going to be the first item on the right page? */
-	newitemisfirstonright = (firstoldonright == state->newitemoff
-							 && !newitemonleft);
-
-	if (newitemisfirstonright)
-		firstrightitemsz = state->newitemsz;
-	else
-		firstrightitemsz = firstoldonrightsz;
-
-	/* Account for all the old tuples */
-	leftfree = state->leftspace - olddataitemstoleft;
-	rightfree = state->rightspace -
-		(state->olddataitemstotal - olddataitemstoleft);
-
-	/*
-	 * The first item on the right page becomes the high key of the left page;
-	 * therefore it counts against left space as well as right space. When
-	 * index has included attributes, then those attributes of left page high
-	 * key will be truncated leaving that page with slightly more free space.
-	 * However, that shouldn't affect our ability to find valid split
-	 * location, because anyway split location should exists even without high
-	 * key truncation.
-	 */
-	leftfree -= firstrightitemsz;
-
-	/* account for the new item */
-	if (newitemonleft)
-		leftfree -= (int) state->newitemsz;
-	else
-		rightfree -= (int) state->newitemsz;
-
-	/*
-	 * If we are not on the leaf level, we will be able to discard the key
-	 * data from the first item that winds up on the right page.
-	 */
-	if (!state->is_leaf)
-		rightfree += (int) firstrightitemsz -
-			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
-
-	/*
-	 * If feasible split point, remember best delta.
-	 */
-	if (leftfree >= 0 && rightfree >= 0)
-	{
-		int			delta;
-
-		if (state->is_rightmost)
-		{
-			/*
-			 * If splitting a rightmost page, try to put (100-fillfactor)% of
-			 * free space on left page. See comments for _bt_findsplitloc.
-			 */
-			delta = (state->fillfactor * leftfree)
-				- ((100 - state->fillfactor) * rightfree);
-		}
-		else
-		{
-			/* Otherwise, aim for equal free space on both sides */
-			delta = leftfree - rightfree;
-		}
-
-		if (delta < 0)
-			delta = -delta;
-		if (!state->have_split || delta < state->best_delta)
-		{
-			state->have_split = true;
-			state->newitemonleft = newitemonleft;
-			state->firstright = firstoldonright;
-			state->best_delta = delta;
-		}
-	}
-}
-
 /*
  * _bt_insert_parent() -- Insert downlink into parent after a page split.
  *
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
new file mode 100644
index 0000000000..86cde0206c
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -0,0 +1,822 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsplitloc.c
+ *	  Choose split point code for Postgres btree implementation.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtsplitloc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "storage/lmgr.h"
+
+/* _bt_dofindsplitloc limits on suffix truncation split interval */
+#define MAX_LEAF_SPLIT_POINTS		9
+#define MAX_INTERNAL_SPLIT_POINTS	3
+
+typedef enum
+{
+	/* strategy to use for a call to FindSplitData */
+	SPLIT_DEFAULT,				/* give some weight to truncation */
+	SPLIT_MANY_DUPLICATES,		/* find minimally distinguishing point */
+	SPLIT_SINGLE_VALUE			/* leave left page almost full */
+} SplitMode;
+
+typedef struct
+{
+	/* FindSplitData candidate split */
+	int			delta;			/* size delta */
+	bool		newitemonleft;	/* new item on left or right of split */
+	OffsetNumber firstright;	/* split point */
+} SplitPoint;
+
+typedef struct
+{
+	/* context data for _bt_checksplitloc */
+	SplitMode	mode;			/* strategy for deciding split point */
+	Size		newitemsz;		/* size of new item to be inserted */
+	double		propfullonleft; /* want propfullonleft * leftfree on left */
+	int			goodenough;		/* good enough left/right space delta */
+	bool		is_leaf;		/* T if splitting a leaf page */
+	bool		is_weighted;	/* T if propfullonleft used by split */
+	OffsetNumber newitemoff;	/* where the new item is to be inserted */
+	int			leftspace;		/* space available for items on left page */
+	int			rightspace;		/* space available for items on right page */
+	int			olddataitemstotal;	/* space taken by old items */
+
+	int			maxsplits;		/* Maximum number of split points to record */
+	int			nsplits;		/* Current number of split points recorded */
+	SplitPoint *splits;			/* Candidate split points, sorted by delta */
+} FindSplitData;
+
+static OffsetNumber _bt_dofindsplitloc(Relation rel, Page page,
+				   SplitMode mode, OffsetNumber newitemoff,
+				   Size newitemsz, IndexTuple newitem, bool *newitemonleft);
+static int _bt_checksplitloc(FindSplitData *state,
+				  OffsetNumber firstoldonright, bool newitemonleft,
+				  int dataitemstoleft, Size firstoldonrightsz);
+static OffsetNumber _bt_bestsplitloc(Relation rel, Page page,
+				 FindSplitData *state,
+				 int perfectpenalty,
+				 OffsetNumber newitemoff,
+				 IndexTuple newitem, bool *newitemonleft);
+static int _bt_perfect_penalty(Relation rel, Page page, FindSplitData *state,
+					OffsetNumber newitemoff, IndexTuple newitem,
+					SplitMode *secondmode);
+static int _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split, bool is_leaf);
+
+
+/*
+ *	_bt_findsplitloc() -- find an appropriate place to split a page.
+ *
+ * The main goal here is to equalize the free space that will be on each
+ * split page, *after accounting for the inserted tuple*.  (If we fail to
+ * account for it, we might find ourselves with too little room on the page
+ * that it needs to go into!)
+ *
+ * We are passed the intended insert position of the new tuple, expressed as
+ * the offsetnumber of the tuple it must go in front of.  (This could be
+ * maxoff+1 if the tuple is to go at the end.)
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating whether the new tuple goes on
+ * the left or right page.  The bool is necessary to disambiguate the case
+ * where firstright == newitemoff.
+ *
+ * The high key for the left page is formed using the first item on the
+ * right page, which may seem to be contrary to Lehman & Yao's approach of
+ * using the left page's last item as its new high key.  It isn't, though;
+ * suffix truncation will leave the left page's high key equal to the last
+ * item on the left page when two tuples with equal key values enclose the
+ * split point.  It's convenient to always express a split point as a
+ * firstright offset due to internal page splits, which leave us with a
+ * right half whose first item becomes a negative infinity item through
+ * truncation to 0 attributes.  In effect, internal page splits store
+ * firstright's "separator" key at the end of the left page (as left's new
+ * high key), and store its downlink at the start of the right page.  In
+ * other words, internal page splits conceptually split in the middle of the
+ * firstright tuple, not on either side of it.  Crucially, when splitting
+ * either a leaf page or an internal page, the new high key will be strictly
+ * less than the first item on the right page in all cases, despite the fact
+ * that we start with the assumption that firstright becomes the new high
+ * key.
+ */
+OffsetNumber
+_bt_findsplitloc(Relation rel,
+				 Page page,
+				 OffsetNumber newitemoff,
+				 Size newitemsz,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	/* Initial call always uses SPLIT_DEFAULT */
+	return _bt_dofindsplitloc(rel, page, SPLIT_DEFAULT, newitemoff, newitemsz,
+							  newitem, newitemonleft);
+}
+
+/*
+ *	_bt_dofindsplitloc() -- guts of find split location code.
+ *
+ * We give some weight to suffix truncation in deciding a split point
+ * on leaf pages.  We try to select a point where a distinguishing attribute
+ * appears earlier in the new high key for the left side of the split, in
+ * order to maximize the number of trailing attributes that can be truncated
+ * away.  Initially, only candidate split points that imply an acceptable
+ * balance of free space on each side are considered.  This is even useful
+ * with pages that only have a single (non-TID) attribute, since it's
+ * helpful to avoid appending an explicit heap TID attribute to the new
+ * pivot tuple (high key/downlink) when it cannot actually be truncated.
+ * Note that it is always assumed that caller goes on to perform truncation,
+ * even with pg_upgrade'd indexes where that isn't actually the case.  There
+ * is still a modest benefit to choosing a split location while weighing
+ * suffix truncation: the resulting (untruncated) pivot tuples are
+ * nevertheless more predictive of future space utilization.
+ *
+ * We do all we can to avoid having to append a heap TID in the new high
+ * key.  We may have to call ourselves recursively in many duplicates mode.
+ * This happens when a heap TID would otherwise be appended, but the page
+ * isn't completely full of logical duplicates (there may be as few as two
+ * distinct values).  Many duplicates mode has no hard requirements for
+ * space utilization, though it still keeps the use of space balanced as a
+ * non-binding secondary goal.  This significantly improves fan-out in
+ * practice, at least with most affected workloads.
+ *
+ * If the page is the rightmost page on its level, we instead try to arrange
+ * to leave the left split page fillfactor% full.  In this way, when we are
+ * inserting successively increasing keys (consider sequences, timestamps,
+ * etc) we will end up with a tree whose pages are about fillfactor% full,
+ * instead of the 50% full result that we'd get without this special case.
+ * This is the same as nbtsort.c produces for a newly-created tree.  Note
+ * that leaf and nonleaf pages use different fillfactors.
+ *
+ * When called recursively in single value mode we try to arrange to leave
+ * the left split page even more full than in the fillfactor% rightmost page
+ * case.  This maximizes space utilization in cases where tuples with the
+ * same attribute values span across many pages.  Newly inserted duplicates
+ * will tend to have higher heap TID values, so we'll end up splitting to
+ * the right in the manner of ascending insertions of monotonically
+ * increasing values.  See nbtree/README for more information about suffix
+ * truncation, and how a split point is chosen.
+ */
+static OffsetNumber
+_bt_dofindsplitloc(Relation rel,
+				   Page page,
+				   SplitMode mode,
+				   OffsetNumber newitemoff,
+				   Size newitemsz,
+				   IndexTuple newitem,
+				   bool *newitemonleft)
+{
+	BTPageOpaque opaque;
+	OffsetNumber offnum;
+	OffsetNumber maxoff;
+	ItemId		itemid;
+	FindSplitData state;
+	int			leftspace,
+				rightspace,
+				olddataitemstotal,
+				olddataitemstoleft,
+				perfectpenalty,
+				leaffillfactor;
+	bool		goodenoughfound;
+	SplitPoint	splits[MAX_LEAF_SPLIT_POINTS];
+	SplitMode	secondmode;
+	OffsetNumber finalfirstright;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Total free space available on a btree page, after fixed overhead */
+	leftspace = rightspace =
+		PageGetPageSize(page) - SizeOfPageHeaderData -
+		MAXALIGN(sizeof(BTPageOpaqueData));
+
+	/* The right page will have the same high key as the old page */
+	if (!P_RIGHTMOST(opaque))
+	{
+		itemid = PageGetItemId(page, P_HIKEY);
+		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
+							 sizeof(ItemIdData));
+	}
+
+	/* Count up total space in data items without actually scanning 'em */
+	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
+	leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	state.mode = mode;
+	state.newitemsz = newitemsz + sizeof(ItemIdData);
+	state.is_leaf = P_ISLEAF(opaque);
+	state.leftspace = leftspace;
+	state.rightspace = rightspace;
+	state.olddataitemstotal = olddataitemstotal;
+	state.newitemoff = newitemoff;
+	state.splits = splits;
+	state.nsplits = 0;
+	if (!state.is_leaf)
+	{
+		Assert(state.mode == SPLIT_DEFAULT);
+
+		/* propfullonleft only used on rightmost page */
+		state.propfullonleft = BTREE_NONLEAF_FILLFACTOR / 100.0;
+		state.is_weighted = P_RIGHTMOST(opaque);
+		/* See is_leaf default mode remarks on maxsplits */
+		state.maxsplits = MAX_INTERNAL_SPLIT_POINTS;
+	}
+	else if (state.mode == SPLIT_DEFAULT)
+	{
+		if (P_RIGHTMOST(opaque))
+		{
+			/*
+			 * Rightmost page splits are always weighted.  Extreme contention
+			 * on the rightmost page is relatively common, so we treat it as a
+			 * special case.
+			 */
+			state.propfullonleft = leaffillfactor / 100.0;
+			state.is_weighted = true;
+		}
+		else
+		{
+			/* propfullonleft won't be used, but be tidy */
+			state.propfullonleft = 0.50;
+			state.is_weighted = false;
+		}
+
+		/*
+		 * Set an initial limit on the split interval/number of candidate
+		 * split points as appropriate.  The "Prefix B-Trees" paper refers to
+		 * this as sigma l for leaf splits and sigma b for internal ("branch")
+		 * splits.  It's hard to provide a theoretical justification for the
+		 * size of the split interval, though it's clear that a small split
+		 * interval improves space utilization.
+		 */
+		state.maxsplits = Min(Max(3, maxoff * 0.05), MAX_LEAF_SPLIT_POINTS);
+	}
+	else if (state.mode == SPLIT_MANY_DUPLICATES)
+	{
+		state.propfullonleft = leaffillfactor / 100.0;
+		state.is_weighted = P_RIGHTMOST(opaque);
+		state.maxsplits = maxoff + 2;
+		state.splits = palloc(sizeof(SplitPoint) * state.maxsplits);
+	}
+	else
+	{
+		Assert(state.mode == SPLIT_SINGLE_VALUE);
+
+		state.propfullonleft = BTREE_SINGLEVAL_FILLFACTOR / 100.0;
+		state.is_weighted = true;
+		state.maxsplits = 1;
+	}
+
+	/*
+	 * Finding the best possible split would require checking all the possible
+	 * split points, because of the high-key and left-key special cases.
+	 * That's probably more work than it's worth outside of many duplicates
+	 * mode; instead, stop as soon as we find sufficiently-many "good-enough"
+	 * splits, where good-enough is defined as an imbalance in free space of
+	 * no more than pagesize/16 (arbitrary...) This should let us stop near
+	 * the middle on most pages, instead of plowing to the end.  Many
+	 * duplicates mode must consider all possible choices, and so does not use
+	 * this threshold for anything.
+	 */
+	state.goodenough = leftspace / 16;
+
+	/*
+	 * Scan through the data items and calculate space usage for a split at
+	 * each possible position.
+	 */
+	olddataitemstoleft = 0;
+	goodenoughfound = false;
+
+	for (offnum = P_FIRSTDATAKEY(opaque);
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		Size		itemsz;
+		int			delta;
+
+		itemid = PageGetItemId(page, offnum);
+		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
+
+		/*
+		 * Will the new item go to left or right of split?
+		 */
+		if (offnum > newitemoff)
+			delta = _bt_checksplitloc(&state, offnum, true,
+									  olddataitemstoleft, itemsz);
+
+		else if (offnum < newitemoff)
+			delta = _bt_checksplitloc(&state, offnum, false,
+									  olddataitemstoleft, itemsz);
+		else
+		{
+			/* need to try it both ways! */
+			(void) _bt_checksplitloc(&state, offnum, true,
+									 olddataitemstoleft, itemsz);
+
+			delta = _bt_checksplitloc(&state, offnum, false,
+									  olddataitemstoleft, itemsz);
+		}
+
+		/* Record when good-enough choice found */
+		if (state.nsplits > 0 && state.splits[0].delta <= state.goodenough)
+			goodenoughfound = true;
+
+		/*
+		 * Abort the scan once we've found a good-enough choice and have
+		 * reached the point where we stop finding new good-enough choices.
+		 * Don't do this
+		 * in many duplicates mode, though, since that must be almost
+		 * completely exhaustive.
+		 */
+		if (goodenoughfound && state.mode != SPLIT_MANY_DUPLICATES &&
+			delta > state.goodenough)
+			break;
+
+		olddataitemstoleft += itemsz;
+	}
+
+	/*
+	 * If the new item goes as the last item, check for splitting so that all
+	 * the old items go to the left page and the new item goes to the right
+	 * page.
+	 */
+	if (newitemoff > maxoff && !goodenoughfound)
+		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
+
+	/*
+	 * I believe it is not possible to fail to find a feasible split, but just
+	 * in case ...
+	 */
+	if (state.nsplits == 0)
+		elog(ERROR, "could not find a feasible split point for index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	/*
+	 * Search among acceptable split points for the entry with the lowest
+	 * penalty.  See _bt_split_penalty() for the definition of penalty.  The
+	 * goal here is to choose a split point whose new high key is amenable to
+	 * being made smaller by suffix truncation, or is already small.
+	 *
+	 * First find lowest possible penalty among acceptable split points -- the
+	 * "perfect" penalty.  The perfect penalty often saves _bt_bestsplitloc()
+	 * additional work around calculating penalties.  This is also a
+	 * convenient point to determine if a second pass over page is required.
+	 */
+	perfectpenalty = _bt_perfect_penalty(rel, page, &state, newitemoff,
+										 newitem, &secondmode);
+
+	/* Perform second pass over page when _bt_perfect_penalty() tells us to */
+	if (secondmode != SPLIT_DEFAULT)
+		return _bt_dofindsplitloc(rel, page, secondmode, newitemoff,
+								  newitemsz, newitem, newitemonleft);
+
+	/*
+	 * Search among acceptable split points for the entry that has the lowest
+	 * penalty, and thus maximizes fan-out.  Sets *newitemonleft for us.
+	 */
+	finalfirstright = _bt_bestsplitloc(rel, page, &state, perfectpenalty,
+									   newitemoff, newitem, newitemonleft);
+	/* Be tidy */
+	if (state.splits != splits)
+		pfree(state.splits);
+
+	return finalfirstright;
+}
+
+/*
+ * Subroutine to analyze a particular possible split choice (ie, firstright
+ * and newitemonleft settings), and record it in *state when it's among the
+ * best candidate split points seen so far.
+ *
+ * firstoldonright is the offset of the first item on the original page
+ * that goes to the right page, and firstoldonrightsz is the size of that
+ * tuple. firstoldonright can be > max offset, which means that all the old
+ * items go to the left page and only the new item goes to the right page.
+ * In that case, firstoldonrightsz is not used.
+ *
+ * olddataitemstoleft is the total size of all old items to the left of
+ * firstoldonright.
+ *
+ * Returns the delta between the space that will be left free on the left
+ * and right sides of the split.
+ */
+static int
+_bt_checksplitloc(FindSplitData *state,
+				  OffsetNumber firstoldonright,
+				  bool newitemonleft,
+				  int olddataitemstoleft,
+				  Size firstoldonrightsz)
+{
+	int			leftfree,
+				rightfree;
+	Size		firstrightitemsz;
+	bool		newitemisfirstonright;
+
+	/* Is the new item going to be the first item on the right page? */
+	newitemisfirstonright = (firstoldonright == state->newitemoff
+							 && !newitemonleft);
+
+	if (newitemisfirstonright)
+		firstrightitemsz = state->newitemsz;
+	else
+		firstrightitemsz = firstoldonrightsz;
+
+	/* Account for all the old tuples */
+	leftfree = state->leftspace - olddataitemstoleft;
+	rightfree = state->rightspace -
+		(state->olddataitemstotal - olddataitemstoleft);
+
+	/*
+	 * The first item on the right page becomes the high key of the left page;
+	 * therefore it counts against left space as well as right space (we
+	 * cannot assume that suffix truncation will make it any smaller).  When
+	 * the index has included attributes, those attributes of the left page's
+	 * high key will be truncated, leaving that page with slightly more free
+	 * space.  However, that shouldn't affect our ability to find a valid
+	 * split location, since we err in the direction of being pessimistic
+	 * about free space on the left half.  Besides, even when suffix
+	 * truncation of non-TID attributes occurs, there often won't be an
+	 * entire MAXALIGN() quantum in pivot space savings.
+	 *
+	 * If we are splitting a leaf page, assume that suffix truncation cannot
+	 * avoid adding a heap TID to the left half's new high key.  In practice
+	 * the new high key will often be smaller and will rarely be larger, but
+	 * conservatively assume the worst case.
+	 */
+	if (state->is_leaf)
+		leftfree -= (int) (firstrightitemsz +
+						   MAXALIGN(sizeof(ItemPointerData)));
+	else
+		leftfree -= (int) firstrightitemsz;
+
+	/* account for the new item */
+	if (newitemonleft)
+		leftfree -= (int) state->newitemsz;
+	else
+		rightfree -= (int) state->newitemsz;
+
+	/*
+	 * If we are not on the leaf level, we will be able to discard the key
+	 * data from the first item that winds up on the right page.
+	 */
+	if (!state->is_leaf)
+		rightfree += (int) firstrightitemsz -
+			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
+
+	/*
+	 * If this is a feasible split point with a lower delta than that of the
+	 * most marginal split point so far, or we haven't yet run out of space
+	 * for split points, remember it.
+	 */
+	if (leftfree >= 0 && rightfree >= 0)
+	{
+		int			delta;
+
+		if (state->is_weighted)
+			delta = state->propfullonleft * leftfree -
+				(1.0 - state->propfullonleft) * rightfree;
+		else
+			delta = leftfree - rightfree;
+
+		if (delta < 0)
+			delta = -delta;
+
+		/*
+		 * Optimization: Don't recognize differences among marginal split
+		 * points that are unlikely to end up being used anyway.
+		 *
+		 * We cannot do this in many duplicates mode, because that hurts cases
+		 * where there are a small number of available distinguishing split
+		 * points, and consistently picking the least worst choice among them
+		 * matters. (e.g., a non-unique index whose leaf pages each contain a
+		 * small number of distinct values, with each value duplicated a
+		 * uniform number of times.)
+		 */
+		if (delta > state->goodenough && state->mode != SPLIT_MANY_DUPLICATES)
+			delta = state->goodenough + 1;
+		if (state->nsplits < state->maxsplits ||
+			delta < state->splits[state->nsplits - 1].delta)
+		{
+			SplitPoint	newsplit;
+			int			j;
+
+			newsplit.delta = delta;
+			newsplit.newitemonleft = newitemonleft;
+			newsplit.firstright = firstoldonright;
+
+			/*
+			 * Make space at the end of the state array for new candidate
+			 * split point if we haven't already reached the maximum number of
+			 * split points.
+			 */
+			if (state->nsplits < state->maxsplits)
+				state->nsplits++;
+
+			/*
+			 * Insert the new candidate split point into the array, keeping
+			 * it sorted by delta.  The array entry that gets overwritten is
+			 * either a still-uninitialized (garbage) entry, or the most
+			 * marginal real entry when we already have as many split points
+			 * as we're willing to consider.
+			 */
+			for (j = state->nsplits - 1;
+				 j > 0 && state->splits[j - 1].delta > newsplit.delta;
+				 j--)
+			{
+				state->splits[j] = state->splits[j - 1];
+			}
+			state->splits[j] = newsplit;
+		}
+
+		return delta;
+	}
+
+	return INT_MAX;
+}
+
+/*
+ * Subroutine to find the "best" split point among an array of acceptable
+ * candidate split points (those that split without an excessively high
+ * delta between the space left free on the left and right halves).  The "best"
+ * split point is the split point with the lowest penalty, which is an
+ * abstract idea whose definition varies depending on whether we're splitting
+ * at the leaf level, or an internal level.  See _bt_split_penalty() for the
+ * definition.
+ *
+ * "perfectpenalty" is assumed to be the lowest possible penalty among
+ * candidate split points.  This allows us to return early without wasting
+ * cycles on calculating the first differing attribute for all candidate
+ * splits when that clearly cannot improve our choice.  This optimization is
+ * important for several common cases, including insertion into a primary key
+ * index on an auto-incremented or monotonically increasing integer column.
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating whether the new item goes on
+ * the left side of the split point.
+ */
+static OffsetNumber
+_bt_bestsplitloc(Relation rel,
+				 Page page,
+				 FindSplitData *state,
+				 int perfectpenalty,
+				 OffsetNumber newitemoff,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	int			bestpenalty,
+				lowsplit;
+
+	/*
+	 * No point in calculating penalty when there's only one choice.  Note
+	 * that single value mode always has one choice.
+	 */
+	if (state->nsplits == 1)
+	{
+		*newitemonleft = state->splits[0].newitemonleft;
+		return state->splits[0].firstright;
+	}
+
+	Assert(state->mode == SPLIT_DEFAULT ||
+		   state->mode == SPLIT_MANY_DUPLICATES);
+	bestpenalty = INT_MAX;
+	lowsplit = 0;
+	for (int i = lowsplit; i < state->nsplits; i++)
+	{
+		int			penalty;
+
+		penalty = _bt_split_penalty(rel, page, newitemoff, newitem,
+									state->splits + i, state->is_leaf);
+
+		if (penalty <= perfectpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+			break;
+		}
+
+		if (penalty < bestpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+		}
+	}
+
+	*newitemonleft = state->splits[lowsplit].newitemonleft;
+	return state->splits[lowsplit].firstright;
+}
+
+/*
+ * Subroutine to find the lowest possible penalty for any acceptable candidate
+ * split point.  This may be lower than any real penalty for any of the
+ * candidate split points, in which case the optimization is ineffective.
+ * Split penalties are generally discrete rather than continuous, so an
+ * actually-obtainable penalty is common.
+ *
+ * This is also a convenient point to decide to either finish splitting
+ * the page using the default strategy, or, alternatively, to do a second pass
+ * over page using a different strategy.  (This only happens with leaf pages.)
+ */
+static int
+_bt_perfect_penalty(Relation rel, Page page, FindSplitData *state,
+					OffsetNumber newitemoff, IndexTuple newitem,
+					SplitMode *secondmode)
+{
+	ItemId		itemid;
+	OffsetNumber center;
+	IndexTuple	leftmost,
+				rightmost;
+	int			perfectpenalty;
+	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+	/* Assume that a second pass over page won't be required for now */
+	*secondmode = SPLIT_DEFAULT;
+
+	/*
+	 * There are far fewer candidate split points when splitting an internal
+	 * page, so we can afford to be exhaustive.  Only give up when the pivot
+	 * that will be inserted into the parent is as small as possible.
+	 */
+	if (!state->is_leaf)
+		return MAXALIGN(sizeof(IndexTupleData) + 1);
+
+	/*
+	 * During a many duplicates pass over page, we settle for a "perfect"
+	 * split point that merely avoids appending a heap TID in new pivot.
+	 * Appending a heap TID is harmful enough to fan-out that it's worth
+	 * avoiding at all costs, but it doesn't make sense to go to those lengths
+	 * to also be able to truncate an extra, earlier attribute.
+	 *
+	 * Single value mode splits only occur when appending a heap TID was
+	 * already deemed necessary.  Don't waste any more cycles trying to avoid
+	 * it.
+	 */
+	if (state->mode == SPLIT_MANY_DUPLICATES)
+		return indnkeyatts;
+	else if (state->mode == SPLIT_SINGLE_VALUE)
+		return indnkeyatts + 1;
+
+	/*
+	 * Complicated though common case -- leaf page default mode split.
+	 *
+	 * Iterate from the end of split array to the start, in search of the
+	 * firstright-wise leftmost and rightmost entries among acceptable split
+	 * points.  The split point with the lowest delta is at the start of the
+	 * array.  It is deemed to be the split point whose firstright offset is
+	 * at the center.  Split points with firstright offsets at both the left
+	 * and right extremes among acceptable split points will be found at the
+	 * end of caller's array.
+	 */
+	leftmost = NULL;
+	rightmost = NULL;
+	center = state->splits[0].firstright;
+
+	/*
+	 * Leaf split points can be thought of as points _between_ tuples on the
+	 * original unsplit page image, at least if you pretend that the incoming
+	 * tuple is already on the page to be split (imagine that the original
+	 * unsplit page actually had enough space to fit the incoming tuple).  The
+	 * rightmost tuple is the tuple that is immediately to the right of a
+	 * split point that is itself rightmost.  Likewise, the leftmost tuple is
+	 * the tuple to the left of the leftmost split point.
+	 *
+	 * When there are very few candidates, no sensible comparison can be made
+	 * here, resulting in caller selecting lowest delta/the center split point
+	 * by default.  Typically, leftmost and rightmost tuples will be located
+	 * almost immediately.
+	 */
+	perfectpenalty = indnkeyatts;
+	for (int j = state->nsplits - 1; j > 1; j--)
+	{
+		SplitPoint *split = state->splits + j;
+
+		if (!leftmost && split->firstright <= center)
+		{
+			if (split->newitemonleft && newitemoff == split->firstright)
+				leftmost = newitem;
+			else
+			{
+				itemid = PageGetItemId(page,
+									   OffsetNumberPrev(split->firstright));
+				leftmost = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+
+		if (!rightmost && split->firstright >= center)
+		{
+			if (!split->newitemonleft && newitemoff == split->firstright)
+				rightmost = newitem;
+			else
+			{
+				itemid = PageGetItemId(page, split->firstright);
+				rightmost = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+
+		if (leftmost && rightmost)
+		{
+			Assert(leftmost != rightmost);
+			perfectpenalty = _bt_keep_natts_fast(rel, leftmost, rightmost);
+			break;
+		}
+	}
+
+	/*
+	 * Work out which type of second pass the caller should perform, if any,
+	 * when even the "perfect" penalty fails to avoid appending a heap TID to
+	 * the new pivot tuple.
+	 */
+	if (perfectpenalty > indnkeyatts)
+	{
+		BTPageOpaque opaque;
+		OffsetNumber maxoff;
+		int			origpagepenalty;
+
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If page has many duplicates but is not entirely full of duplicates,
+		 * a many duplicates mode pass will be performed.  If page is entirely
+		 * full of duplicates and it appears that the duplicates have been
+		 * inserted in sequential order (i.e. heap TID order), a single value
+		 * mode pass will be performed.
+		 *
+		 * Deliberately ignore new item here, since a split that leaves only
+		 * one item on either page is often deemed unworkable by
+		 * _bt_checksplitloc().
+		 */
+		itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+		leftmost = (IndexTuple) PageGetItem(page, itemid);
+		itemid = PageGetItemId(page, maxoff);
+		rightmost = (IndexTuple) PageGetItem(page, itemid);
+		origpagepenalty = _bt_keep_natts_fast(rel, leftmost, rightmost);
+
+		if (origpagepenalty <= indnkeyatts)
+			*secondmode = SPLIT_MANY_DUPLICATES;
+		else if (newitemoff > maxoff)
+			*secondmode = SPLIT_SINGLE_VALUE;
+
+		/*
+		 * Have caller continue with original default mode split when new
+		 * duplicate item would not go at the end of the page.  Out-of-order
+		 * duplicate insertions predict further inserts towards the
+		 * left/middle of the original page's keyspace.  Evenly sharing space
+		 * among each half of the split avoids pathological performance.
+		 */
+	}
+
+	return perfectpenalty;
+}
+
+/*
+ * Subroutine to find penalty for caller's candidate split point.
+ *
+ * On leaf pages, penalty is the attribute number that distinguishes each side
+ * of a split.  It's the last attribute that needs to be included in the new
+ * high key for the left page.  It can be greater than the number of key
+ * attributes in cases where a heap TID will need to be appended during
+ * truncation.
+ *
+ * On internal pages, penalty is simply the size of the first item on the
+ * right half of the split (excluding ItemId overhead) which becomes the new
+ * high key for the left page.  Internal page splits always use default mode.
+ */
+static int
+_bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split, bool is_leaf)
+{
+	ItemId		itemid;
+	IndexTuple	firstright;
+	IndexTuple	lastleft;
+
+	if (!split->newitemonleft && newitemoff == split->firstright)
+		firstright = newitem;
+	else
+	{
+		itemid = PageGetItemId(page, split->firstright);
+		firstright = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	if (!is_leaf)
+		return IndexTupleSize(firstright);
+
+	if (split->newitemonleft && newitemoff == split->firstright)
+		lastleft = newitem;
+	else
+	{
+		OffsetNumber lastleftoff;
+
+		lastleftoff = OffsetNumberPrev(split->firstright);
+		itemid = PageGetItemId(page, lastleftoff);
+		lastleft = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	Assert(lastleft != firstright);
+	return _bt_keep_natts_fast(rel, lastleft, firstright);
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 23d75ad604..43f7bfcb44 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -22,6 +22,7 @@
 #include "access/relscan.h"
 #include "miscadmin.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -2390,6 +2391,72 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	return keepnatts;
 }
 
+/*
+ * _bt_keep_natts_fast - fast, approximate variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location.  A naive bitwise approach to datum comparisons is used to
+ * save cycles.  This is inherently approximate, but usually provides the same
+ * answer as the authoritative approach that _bt_keep_natts takes, since the
+ * vast majority of types in Postgres cannot be equal according to any
+ * available opclass unless they're bitwise equal.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+
+	/*
+	 * Using authoritative comparisons makes no difference in almost all
+	 * cases.  However, there are a small number of shipped opclasses where
+	 * there might occasionally be an inconsistency between the answers given
+	 * by this function and _bt_keep_natts().  This includes numeric_ops,
+	 * since display scale might vary among logically equal datums.
+	 * Case-insensitive collations may also be interesting.
+	 *
+	 * This is assumed to be okay, since there is no risk that inequality will
+	 * look like equality.  Suffix truncation may be less effective than it
+	 * could be in these narrow cases, but it should be impossible for caller
+	 * to spuriously perform a second pass to find a split location, where
+	 * evenly splitting the page is given secondary importance.
+	 */
+#ifdef AUTHORITATIVE_COMPARE_TEST
+	return _bt_keep_natts(rel, lastleft, firstright, false);
+#endif
+
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= keysz; attnum++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		att = TupleDescAttr(itupdesc, attnum - 1);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
+}
+
 /*
  *  _bt_check_natts() -- Verify tuple has expected number of attributes.
  *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d0eb68e3d2..adcb40e308 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -168,11 +168,15 @@ typedef struct BTMetaPageData
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
  * a rightmost page; when splitting non-rightmost pages we try to
- * divide the data equally.
+ * divide the data equally.  When splitting a page that's entirely
+ * filled with a single value (duplicates), the usual leaf-page
+ * fillfactor is overridden by a higher single-value fillfactor,
+ * which is applied regardless of whether the page is a rightmost
+ * page.
  */
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#define BTREE_SINGLEVAL_FILLFACTOR	99
 
 /*
  *	In general, the btree code tries to localize its knowledge about
@@ -418,7 +422,19 @@ typedef BTStackData *BTStack;
  * optimization cannot be used, since a leaf page is relocated using its high
  * key).  This optimization allows us to get the full benefit of suffix
  * truncation, particularly with indexes where each distinct set of user
- * attribute keys appear in at least a few duplicate entries.
+ * attribute keys appear in at least a few duplicate entries.  The split point
+ * location logic goes to great lengths to make groups of duplicates all
+ * appear together on a single leaf page after the split; that way,
+ * subsequent searches avoid unnecessarily reading and processing the left
+ * sibling of the leaf page where matching entries can first appear.  Some
+ * later insertion scankey attribute could break the would-be tie with a
+ * truncated/minus infinity attribute, but when that doesn't happen this
+ * optimization breaks the would-be tie instead.  This optimization is even
+ * effective with unique index insertion, where a scantid value is not used
+ * until we reach the leaf level.  It might be necessary to visit multiple
+ * leaf pages during unique checking, but only in the rare case where more
+ * than a single leaf page can store duplicates (concurrent page splits are
+ * another possible reason).
  *
  * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
  * locate the first item >= scankey.  When nextkey is true, they will locate
@@ -679,6 +695,13 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 
+/*
+ * prototypes for functions in nbtsplitloc.c
+ */
+extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
+				 OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+				 bool *newitemonleft);
+
 /*
  * prototypes for functions in nbtpage.c
  */
@@ -748,6 +771,8 @@ extern bool btproperty(Oid index_oid, int attno,
 		   bool *res, bool *isnull);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 			 IndexTuple firstright, bool build);
+extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
+					IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 					OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
-- 
2.17.1
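
To make the candidate bookkeeping in _bt_checksplitloc() easier to follow,
here is a minimal standalone sketch of the bounded, delta-sorted array that
it maintains.  This is an illustration only, not code from the patch; the
names (SplitCandidate, record_candidate) and the numbers in main() are
invented, and the goodenough clamping is omitted.

#include <stdio.h>

#define MAX_SPLIT_POINTS 9

typedef struct
{
	int			delta;			/* left/right free space imbalance */
	int			firstright;		/* offset of first tuple on right half */
} SplitCandidate;

/*
 * Record one candidate split point, keeping the array sorted by delta
 * (ascending) and capped at MAX_SPLIT_POINTS entries.  Once the array is
 * full, the most marginal (highest delta) candidate is displaced.
 */
static void
record_candidate(SplitCandidate *splits, int *nsplits, int delta,
				 int firstright)
{
	int			j;

	if (*nsplits == MAX_SPLIT_POINTS && delta >= splits[*nsplits - 1].delta)
		return;					/* worse than every retained candidate */

	if (*nsplits < MAX_SPLIT_POINTS)
		(*nsplits)++;			/* grow into the final, uninitialized slot */

	/* insertion sort step: shift worse entries right, then place new one */
	for (j = *nsplits - 1; j > 0 && splits[j - 1].delta > delta; j--)
		splits[j] = splits[j - 1];
	splits[j].delta = delta;
	splits[j].firstright = firstright;
}

int
main(void)
{
	SplitCandidate splits[MAX_SPLIT_POINTS];
	int			nsplits = 0;
	int			deltas[] = {900, 120, 40, 770, 15, 300, 95, 610, 18, 7, 450};
	int			ncand = (int) (sizeof(deltas) / sizeof(deltas[0]));

	for (int i = 0; i < ncand; i++)
		record_candidate(splits, &nsplits, deltas[i], i + 1);

	for (int i = 0; i < nsplits; i++)
		printf("firstright=%d delta=%d\n",
			   splits[i].firstright, splits[i].delta);

	return 0;
}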

Attachment: v10-0005-Add-split-at-new-tuple-page-split-optimization.patch (application/x-patch)
From e9135038d429fe8dc47db746e9fa04627fcd604c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 16:48:08 -0700
Subject: [PATCH v10 5/7] Add split-at-new-tuple page split optimization.

Add additional heuristics to the algorithm for locating an optimal split
location.  New logic identifies localized monotonically increasing
values by recognizing cases where a newly inserted tuple has a heap TID
that's slightly greater than that of the existing tuple to the immediate
left, but isn't just a duplicate.  It can greatly help space utilization
to split between two groups of localized monotonically increasing
values.

Without this patch, affected cases will reliably leave leaf pages no
more than about 50% full.  50/50 page splits are only appropriate with a
pattern of truly random insertions.  The optimization is very similar to
the long established fillfactor optimization used during rightmost page
splits, where we usually leave the new left side of the split 90% full.
Split-at-new-tuple page splits target essentially the same case. The
splits targeted are those at the rightmost point of a localized grouping
of values, rather than those at the rightmost point of the entire key
space.

This enhancement is very effective at avoiding index bloat when initial
bulk INSERTs for the TPC-C benchmark are run, and throughout the TPC-C
benchmark.  The TPC-C issue has been independently observed and reported
on [1].  Evidently, the primary keys for all of the largest indexes in
the TPC-C schema are populated through localized, monotonically
increasing values:

Master
======

order_line_pkey: 774 MB
stock_pkey: 181 MB
idx_customer_name: 107 MB
oorder_pkey: 78 MB
customer_pkey: 75 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 60 MB
new_order_pkey: 22 MB
item_pkey: 2216 kB
district_pkey: 40 kB
warehouse_pkey: 24 kB

Patch series, up to and including this commit
=============================================

order_line_pkey: 451 MB
stock_pkey: 114 MB
idx_customer_name: 105 MB
oorder_pkey: 45 MB
customer_pkey: 48 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 61 MB
new_order_pkey: 13 MB
item_pkey: 2216 kB
district_pkey: 40 kB
warehouse_pkey: 24 kB

Without this patch, but with all previous patches in the series, a much
more modest reduction in the volume of bloat occurs when the same test
case is run.  There is a reduction in the size of the largest index (the
order line primary key) of ~5% of its original size, whereas we see a
reduction of ~42% here.

The problem can easily be recreated by bulk loading using benchmarkSQL
(a fair use TPC-C implementation) while avoiding building indexes with
CREATE INDEX [2].  Note that the patch series generally has less of an
advantage over master if the indexes are initially built with CREATE
INDEX (use my fork of BenchmarkSQL [3] to run a TPC-C benchmark while
avoiding having CREATE INDEX mask the problems on the master branch).

[1] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c
[2] https://bitbucket.org/openscg/benchmarksql/issues/6/making-it-easier-to-recreate-postgres-tpc
[3] https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations
---
 src/backend/access/nbtree/nbtsplitloc.c | 164 ++++++++++++++++++++++++
 1 file changed, 164 insertions(+)

diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 86cde0206c..c707e2f4c6 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -62,6 +62,9 @@ static OffsetNumber _bt_dofindsplitloc(Relation rel, Page page,
 static int _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright, bool newitemonleft,
 				  int dataitemstoleft, Size firstoldonrightsz);
+static bool _bt_splitatnewitem(Relation rel, Page page, int leaffillfactor,
+				   OffsetNumber newitemoff, IndexTuple newitem,
+				   double *propfullonleft);
 static OffsetNumber _bt_bestsplitloc(Relation rel, Page page,
 				 FindSplitData *state,
 				 int perfectpenalty,
@@ -72,6 +75,7 @@ static int _bt_perfect_penalty(Relation rel, Page page, FindSplitData *state,
 					SplitMode *secondmode);
 static int _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
 				  IndexTuple newitem, SplitPoint *split, bool is_leaf);
+static bool _bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid);
 
 
 /*
@@ -243,6 +247,12 @@ _bt_dofindsplitloc(Relation rel,
 			state.propfullonleft = leaffillfactor / 100.0;
 			state.is_weighted = true;
 		}
+		else if (_bt_splitatnewitem(rel, page, leaffillfactor, newitemoff,
+									newitem, &state.propfullonleft))
+		{
+			/* propfullonleft was set for us */
+			state.is_weighted = true;
+		}
 		else
 		{
 			/* propfullonleft won't be used, but be tidy */
@@ -540,6 +550,132 @@ _bt_checksplitloc(FindSplitData *state,
 	return INT_MAX;
 }
 
+/*
+ * Subroutine to determine whether or not the page should be split at the
+ * point that the new/incoming item would have been inserted, leaving the
+ * incoming tuple as the last tuple on the new left page.  When the new item
+ * is at the first or last offset, a fillfactor is applied so that space
+ * utilization is comparable to the traditional rightmost split case.
+ *
+ * This routine is primarily concerned with composite indexes that consist
+ * of one or more leading columns that describe some grouping, plus a
+ * trailing, monotonically increasing column.  This usage pattern is
+ * prevalent in many real world applications.  Consider the example of a
+ * composite index on (supplier_id, invoice_id), where there are a small,
+ * nearly-fixed number of suppliers, and invoice_id is a monotonically
+ * increasing identifier (it doesn't matter whether or not suppliers are
+ * assigned invoice_id values from the same counter, or their own counter).
+ * Without this optimization, approximately 50% of space in leaf pages will
+ * be wasted by unweighted/50:50 page splits.  With this optimization, space
+ * utilization will be close to optimal.  There may be excessive amounts of
+ * free space remaining on right pages where only one supplier is
+ * represented if the supplier has few distinct invoice_id values, but that
+ * problem should be self-limiting.
+ *
+ * Secondarily, DESC-ordered insertions are recognized here, though not for
+ * single attribute indexes, where explicitly using DESC ordering doesn't
+ * make sense.  It seems worthwhile to try to get rightmost style space
+ * utilization for cases like explicitly-DESC date columns.
+ *
+ * Caller uses propfullonleft rather than using the new item offset directly
+ * because not all offsets will be deemed legal as split points.
+ */
+static bool
+_bt_splitatnewitem(Relation rel, Page page, int leaffillfactor,
+				   OffsetNumber newitemoff, IndexTuple newitem,
+				   double *propfullonleft)
+{
+	OffsetNumber maxoff;
+	int16		nkeyatts;
+	ItemId		itemid;
+	IndexTuple	tup;
+	Size		tupspace;
+	Size		hikeysize;
+	int			keepnatts;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Proceed only when items on page look fairly short */
+	if (maxoff < MaxIndexTuplesPerPage / 2)
+		return false;
+
+	nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+	/* Single key indexes not considered here */
+	if (nkeyatts == 1)
+		return false;
+
+	/*
+	 * Assume that the optimization won't be useful unless tuples are of
+	 * uniform size, with the exception of the high key, which may already
+	 * have undergone truncation.
+	 */
+	Assert(!P_RIGHTMOST((BTPageOpaque) PageGetSpecialPointer(page)));
+	tupspace = ((PageHeader) page)->pd_special - ((PageHeader) page)->pd_upper;
+	itemid = PageGetItemId(page, P_HIKEY);
+	hikeysize = ItemIdGetLength(itemid);
+	if (IndexTupleSize(newitem) * (maxoff - 1) != tupspace - hikeysize)
+		return false;
+
+	/*
+	 * When heap TIDs appear in DESC order, consider a left-heavy split.
+	 *
+	 * Accept a left-heavy split when the new item, which will be inserted
+	 * at the first data offset, has a TID adjacent to that of the extant
+	 * item at that position.  This is considered equivalent to a rightmost
+	 * split, so apply a flipped-around fillfactor.
+	 */
+	if (newitemoff == P_FIRSTKEY)
+	{
+		itemid = PageGetItemId(page, P_FIRSTKEY);
+		tup = (IndexTuple) PageGetItem(page, itemid);
+		keepnatts = _bt_keep_natts_fast(rel, tup, newitem);
+
+		if (keepnatts > 1 && keepnatts <= nkeyatts)
+		{
+			*propfullonleft = (double) Max(100 - leaffillfactor,
+										   BTREE_MIN_FILLFACTOR) / 100.0;
+			return true;
+		}
+
+		return false;
+	}
+
+	/*
+	 * At least the first attribute must be equal, but the new item cannot be
+	 * a simple duplicate of the item to its immediate left
+	 */
+	if (newitemoff > maxoff)
+	{
+		itemid = PageGetItemId(page, maxoff);
+		tup = (IndexTuple) PageGetItem(page, itemid);
+		keepnatts = _bt_keep_natts_fast(rel, tup, newitem);
+
+		if (keepnatts > 1 && keepnatts <= nkeyatts)
+		{
+			*propfullonleft = (double) leaffillfactor / 100.0;
+			return true;
+		}
+
+		return false;
+	}
+
+	/* When item isn't first or last on origpage, consider heap TID too */
+	itemid = PageGetItemId(page, OffsetNumberPrev(newitemoff));
+	tup = (IndexTuple) PageGetItem(page, itemid);
+	if (!_bt_adjacenthtid(&tup->t_tid, &newitem->t_tid))
+		return false;
+	keepnatts = _bt_keep_natts_fast(rel, tup, newitem);
+
+	if (keepnatts > 1 && keepnatts <= nkeyatts)
+	{
+		*propfullonleft = (double) newitemoff / (((double) maxoff + 1));
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Subroutine to find the "best" split point among an array of acceptable
  * candidate split points that split without there being an excessively high
@@ -820,3 +956,31 @@ _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
 	Assert(lastleft != firstright);
 	return _bt_keep_natts_fast(rel, lastleft, firstright);
 }
+
+/*
+ * Subroutine for determining if two heap TIDs are "adjacent".
+ *
+ * Adjacent means that the high TID is very likely to have been inserted into
+ * the heap relation immediately after the low TID, probably by the same
+ * transaction.
+ */
+static bool
+_bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid)
+{
+	BlockNumber lowblk,
+				highblk;
+
+	lowblk = ItemPointerGetBlockNumber(lowhtid);
+	highblk = ItemPointerGetBlockNumber(highhtid);
+
+	/* Make optimistic assumption of adjacency when heap blocks match */
+	if (lowblk == highblk)
+		return true;
+
+	/* When heap block one up, second offset should be FirstOffsetNumber */
+	if (lowblk + 1 == highblk &&
+		ItemPointerGetOffsetNumber(highhtid) == FirstOffsetNumber)
+		return true;
+
+	return false;
+}
-- 
2.17.1

Attachment: v10-0003-Treat-heap-TID-as-part-of-the-nbtree-key-space.patch (application/x-patch)
From 5c0b5feb20e15a74276fa1fedbdb389616ce4930 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v10 3/7] Treat heap TID as part of the nbtree key space.

Make nbtree treat all index tuples as having a heap TID trailing key
attribute.  Heap TID becomes a first class part of the key space on all
levels of the tree.  Index searches can distinguish duplicates by heap
TID.  Non-unique index insertions will descend straight to the leaf page
that they'll insert on to (unless there is a concurrent page split).
This general approach has numerous benefits for performance, and is
prerequisite to teaching VACUUM to perform "retail index tuple
deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will generally truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tie-breaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing the newer/more strict invariants with
version 4 indexes.

We no longer allow a search for free space among multiple pages full of
duplicates to "get tired", except when needed to preserve compatibility
with earlier versions.  This has significant benefits for free space
management in secondary indexes on low cardinality attributes.  However,
without the next commit in the patch series (without having "single
value" mode and "many duplicates" mode within _bt_findsplitloc()), these
cases will be significantly regressed, since they'll naively perform
50:50 splits without there being any hope of reusing space left free on
the left half of the split.

Note that this commit reduces the size of new tuples by a single
MAXALIGN() quantum.  The documented definition of "1/3 of a page" is
already inexact, so it seems unnecessary to revise it.  However, there
should be a compatibility note in the v12 release notes.  The new
definition is 2704 bytes on 64-bit systems, down from 2712 bytes.
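
The on-disk version of an index can be confirmed with contrib/pageinspect,
and the new invariants can be exercised with contrib/amcheck.  For example,
against the index from pageinspect's own regression test (any btree index
name could be substituted):

    SELECT * FROM bt_metap('test1_a_idx');
    SELECT bt_index_check('test1_a_idx', true);
    SELECT bt_index_parent_check('test1_a_idx', true);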
---
 contrib/amcheck/verify_nbtree.c              | 304 ++++++++++--
 contrib/pageinspect/btreefuncs.c             |   2 +-
 contrib/pageinspect/expected/btree.out       |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out |  10 +-
 doc/src/sgml/indices.sgml                    |  24 +-
 src/backend/access/common/indextuple.c       |   6 +-
 src/backend/access/nbtree/README             | 164 ++++---
 src/backend/access/nbtree/nbtinsert.c        | 291 +++++++-----
 src/backend/access/nbtree/nbtpage.c          | 205 +++++---
 src/backend/access/nbtree/nbtree.c           |   2 +-
 src/backend/access/nbtree/nbtsearch.c        | 103 ++++-
 src/backend/access/nbtree/nbtsort.c          |  88 ++--
 src/backend/access/nbtree/nbtutils.c         | 463 +++++++++++++++++--
 src/backend/access/nbtree/nbtxlog.c          |  43 +-
 src/backend/access/rmgrdesc/nbtdesc.c        |   8 -
 src/backend/utils/sort/tuplesort.c           |  13 +-
 src/include/access/nbtree.h                  | 166 +++++--
 src/include/access/nbtxlog.h                 |  20 +-
 src/test/regress/expected/dependency.out     |   4 +-
 src/test/regress/expected/event_trigger.out  |   4 +-
 src/test/regress/expected/foreign_data.out   |   8 +-
 src/test/regress/expected/rowsecurity.out    |   4 +-
 src/tools/pgindent/typedefs.list             |   4 +
 23 files changed, 1502 insertions(+), 436 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index c7cdca3962..b5e2709c88 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -44,6 +44,8 @@ PG_MODULE_MAGIC;
  * block per level, which is bound by the range of BlockNumber:
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
 
 /*
  * State associated with verifying a B-Tree index
@@ -65,6 +67,8 @@ typedef struct BtreeCheckState
 	/* B-Tree Index Relation and associated heap relation */
 	Relation	rel;
 	Relation	heaprel;
+	/* rel is heapkeyspace index? */
+	bool		heapkeyspace;
 	/* ShareLock held on heap/index, rather than AccessShareLock? */
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
@@ -121,7 +125,7 @@ static void bt_index_check_internal(Oid indrelid, bool parentcheck,
 						bool heapallindexed);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -134,15 +138,19 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  bool tupleIsAlive, void *checkstate);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
+static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound);
 static inline bool invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
+static inline bool invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
 				   OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
 							 BTScanInsert key,
 							 Page nontarget,
 							 OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+							IndexTuple itup, bool nonpivot);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -199,6 +207,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	Oid			heapid;
 	Relation	indrel;
 	Relation	heaprel;
+	bool		heapkeyspace;
 	LOCKMODE	lockmode;
 
 	if (parentcheck)
@@ -249,7 +258,9 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	btree_index_checkable(indrel);
 
 	/* Check index, possibly against table it is an index on */
-	bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
+	heapkeyspace = _bt_heapkeyspace(indrel);
+	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
+						 heapallindexed);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -319,8 +330,8 @@ btree_index_checkable(Relation rel)
  * parent/child check cannot be affected.)
  */
 static void
-bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
-					 bool heapallindexed)
+bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
+					 bool readonly, bool heapallindexed)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -341,6 +352,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
 	state = palloc0(sizeof(BtreeCheckState));
 	state->rel = rel;
 	state->heaprel = heaprel;
+	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
 
@@ -801,7 +813,8 @@ bt_target_page_check(BtreeCheckState *state)
 	 * doesn't contain a high key, so nothing to check
 	 */
 	if (!P_RIGHTMOST(topaque) &&
-		!_bt_check_natts(state->rel, state->target, P_HIKEY))
+		!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+						 P_HIKEY))
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
@@ -834,6 +847,7 @@ bt_target_page_check(BtreeCheckState *state)
 		IndexTuple	itup;
 		size_t		tupsize;
 		BTScanInsert skey;
+		bool		lowersizelimit;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -860,7 +874,8 @@ bt_target_page_check(BtreeCheckState *state)
 					 errhint("This could be a torn page problem.")));
 
 		/* Check the number of index tuple attributes */
-		if (!_bt_check_natts(state->rel, state->target, offset))
+		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+							 offset))
 		{
 			char	   *itid,
 					   *htid;
@@ -901,7 +916,58 @@ bt_target_page_check(BtreeCheckState *state)
 			continue;
 
 		/* Build insertion scankey for current page offset/tuple */
-		skey = _bt_mkscankey(state->rel, itup);
+		skey = _bt_mkscankey(state->rel, itup, false);
+
+		/*
+		 * Make sure tuple size does not exceed the relevant BTREE_VERSION
+		 * specific limit.
+		 *
+		 * BTREE_VERSION 4 (which introduced heapkeyspace rules) requisitioned
+		 * a MAXALIGN() quantum of space from BTMaxItemSize() in order to
+		 * ensure that suffix truncation always has enough space to add an
+		 * explicit heap TID back to a tuple -- we pessimistically assume that
+		 * every newly inserted tuple will eventually need to have a heap TID
+		 * appended during a future leaf page split, when the tuple becomes
+		 * the basis of the new high key (pivot tuple) for the leaf page.
+		 *
+		 * Since a MAXALIGN() quantum is reserved for that purpose, we must
+		 * not enforce the slightly lower limit when the extra quantum has
+		 * been used as intended.  In other words, there is only a
+		 * cross-version difference in the limit on tuple size within leaf
+		 * pages.
+		 *
+		 * Still, we're particular about the details within BTREE_VERSION 4
+		 * internal pages.  Pivot tuples may only use the extra quantum for
+		 * its designated purpose.  Enforce the lower limit for pivot tuples
+		 * when an explicit heap TID isn't actually present. (In all other
+		 * cases suffix truncation is guaranteed to generate a pivot tuple
+		 * that's no larger than the first right tuple provided to it by its
+		 * caller.)
+		 */
+		lowersizelimit = skey->heapkeyspace &&
+			(P_ISLEAF(topaque) || BTreeTupleGetHeapTID(itup) == NULL);
+		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
+					   BTMaxItemSizeNoHeapTid(state->target)))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
+							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index row size %zu exceeds maximum for index \"%s\"",
+							tupsize, RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap" : "index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -926,9 +992,35 @@ bt_target_page_check(BtreeCheckState *state)
 		 * grandparents (as well as great-grandparents, and so on).  We don't
 		 * go to those lengths because that would be prohibitively expensive,
 		 * and probably not markedly more effective in practice.
+		 *
+		 * On the leaf level, we check that the key is <= the highkey.
+		 * However, on non-leaf levels we check that the key is < the highkey,
+		 * because the high key is "just another separator" rather than a copy
+		 * of some existing key item; we expect it to be unique among all keys
+		 * on the same level.  (Suffix truncation will sometimes produce a
+		 * leaf highkey that is an untruncated copy of the lastleft item, but
+		 * never any other item, which necessitates weakening the leaf level
+		 * check to <=.)
+		 *
+		 * Full explanation for why a highkey is never truly a copy of another
+		 * item from the same level on internal levels:
+		 *
+		 * While the new left page's high key is copied from the first offset
+		 * on the right page during an internal page split, that's not the
+		 * full story.  In effect, internal pages are split in the middle of
+		 * the firstright tuple, not between the would-be lastleft and
+		 * firstright tuples: the firstright key ends up on the left side as
+		 * left's new highkey, and the firstright downlink ends up on the
+		 * right side as right's new "negative infinity" item.  The negative
+		 * infinity tuple is truncated to zero attributes, so we're only left
+		 * with the downlink.  In other words, the copying is just an
+		 * implementation detail of splitting in the middle of a (pivot)
+		 * tuple. (See also: "Notes About Data Representation" in the nbtree
+		 * README.)
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
+			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
 			char	   *itid,
 					   *htid;
@@ -954,10 +1046,10 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey, OffsetNumberNext(offset)))
+			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1022,9 +1114,9 @@ bt_target_page_check(BtreeCheckState *state)
 
 			/* Set up right item scankey */
 			if (righttup)
-				rightkey = _bt_mkscankey(state->rel, righttup);
+				rightkey = _bt_mkscankey(state->rel, righttup, false);
 
-			if (righttup && !invariant_geq_offset(state, rightkey, max))
+			if (righttup && !invariant_g_offset(state, rightkey, max))
 			{
 				/*
 				 * As explained at length in bt_right_page_check_tuple(),
@@ -1354,7 +1446,8 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the
+	 * page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1403,14 +1496,29 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 	{
 		/*
 		 * Skip comparison of target page key against "negative infinity"
-		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * item, if any.  Checking it would indicate that it's not a strict
+		 * lower bound, but that's only because of the hard-coding for
+		 * negative infinity items within _bt_compare().
+		 *
+		 * If nbtree didn't truncate negative infinity tuples during internal
+		 * page splits then we'd expect child's negative infinity key to be
+		 * equal to the scankey/downlink from target/parent (it would be a
+		 * "low key" in this hypothetical scenario, and so it would still need
+		 * to be treated as a special case here).
+		 *
+		 * Negative infinity items can be thought of as a strict lower bound
+		 * that works transitively, with the last non-negative-infinity pivot
+		 * followed during a descent from the root as its "true" strict lower
+		 * bound.  Only a small number of negative infinity items are truly
+		 * negative infinity; those that are the first items of leftmost
+		 * internal pages.  In more general terms, a negative infinity item is
+		 * only negative infinity with respect to the subtree that the page is
+		 * at the root of.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
+		if (!invariant_l_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1750,6 +1858,63 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	/*
+	 * pg_upgrade'd indexes may legally have equal sibling tuples.  Their
+	 * pivot tuples can never have key attributes truncated away.
+	 */
+	if (!key->heapkeyspace)
+		return invariant_leq_offset(state, key, upperbound);
+
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as
+	 * meaning that they're not participating in a search.  In practice, this
+	 * behavior is equivalent to an explicit negative infinity representation
+	 * within nbtree.  We care about the distinction between strict and
+	 * non-strict bounds, though, and so must consider truncated/negative
+	 * infinity attributes explicitly.
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		nonpivot = P_ISLEAF(topaque) && upperbound >= P_FIRSTDATAKEY(topaque);
+
+		/* Get number of keys + heap TID for item to the right */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup, nonpivot);
+
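+		/*
+		 * Example: if the scankey covers both key attributes of a two-column
+		 * index and also carries a scantid, while the equal item to the right
+		 * has both key attributes but a truncated (negative infinity) heap
+		 * TID, then the scankey sorts after the upper bound and the invariant
+		 * is violated.
+		 */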
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && rheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1769,42 +1934,96 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound)
 {
 	int32		cmp;
 
 	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
-	return cmp >= 0;
+	/*
+	 * pg_upgrade'd indexes may legally have equal sibling tuples.  Their
+	 * pivot tuples can never have key attributes truncated away.
+	 */
+	if (!key->heapkeyspace)
+		return cmp >= 0;
+
+	/*
+	 * No need to consider the possibility that scankey has attributes that we
+	 * need to force to be interpreted as negative infinity.  That could cause
+	 * us to miss the fact that the scankey is less than rather than equal to
+	 * its lower bound, but the index is corrupt either way.
+	 */
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e.  the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
 							 Page nontarget, OffsetNumber upperbound)
 {
 	int32		cmp;
 
 	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
-	return cmp <= 0;
+	/*
+	 * pg_upgrade'd indexes may legally have equal sibling tuples.  Their
+	 * pivot tuples can never have key attributes truncated away.
+	 */
+	if (!key->heapkeyspace)
+		return cmp <= 0;
+
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as
+	 * meaning that they're not participating in a search.  In practice, this
+	 * behavior is equivalent to an explicit negative infinity representation
+	 * within nbtree.  We care about the distinction between strict and
+	 * non-strict bounds, though, and so must consider truncated/negative
+	 * infinity attributes explicitly.
+	 */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+		BTPageOpaque copaque;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		nonpivot = P_ISLEAF(copaque) && upperbound >= P_FIRSTDATAKEY(copaque);
+
+		/* Get number of keys + heap TID for child/non-target item */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+		childheaptid = BTreeTupleGetHeapTIDCareful(state, child, nonpivot);
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && childheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -1960,3 +2179,28 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
+ * be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	BlockNumber targetblock = state->targetblock;
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 184ac62255..251be13b65 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -560,7 +560,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	 * Get values of extended metadata if available, use default values
 	 * otherwise.
 	 */
-	if (metad->btm_version == BTREE_VERSION)
+	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b..07c2dcd771 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d4..9920dbfd40 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f427b312..21c978503a 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -504,8 +504,9 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
 
   <para>
    By default, B-tree indexes store their entries in ascending order
-   with nulls last.  This means that a forward scan of an index on
-   column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
+   with nulls last (table TID is treated as a tiebreaker column among
+   otherwise equal entries).  This means that a forward scan of an
+   index on column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
    (or more verbosely, <literal>ORDER BY x ASC NULLS LAST</literal>).  The
    index can also be scanned backward, producing output satisfying
    <literal>ORDER BY x DESC</literal>
@@ -1162,10 +1163,21 @@ CREATE INDEX tab_x_y ON tab(x, y);
    the extra columns are trailing columns; making them be leading columns is
    unwise for the reasons explained in <xref linkend="indexes-multicolumn"/>.
    However, this method doesn't support the case where you want the index to
-   enforce uniqueness on the key column(s).  Also, explicitly marking
-   non-searchable columns as <literal>INCLUDE</literal> columns makes the
-   index slightly smaller, because such columns need not be stored in upper
-   B-tree levels.
+   enforce uniqueness on the key column(s).
+  </para>
+
+  <para>
+   <firstterm>Suffix truncation</firstterm> always removes non-key
+   columns from upper B-Tree levels.  As payload columns, they are
+   never used to guide index scans.  The truncation process also
+   removes one or more trailing key column(s) when the remaining
+   prefix of key column(s) happens to be sufficient to describe tuples
+   on the lowest B-Tree level.  In practice, covering indexes without
+   an <literal>INCLUDE</literal> clause often avoid storing columns
+   that are effectively payload in the upper levels.  However,
+   explicitly defining payload columns as non-key columns
+   <emphasis>reliably</emphasis> keeps the tuples in upper levels
+   small.
   </para>
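+
+  <para>
+   For example, in a hypothetical index declared as <literal>(x, y)
+   INCLUDE (z)</literal>, the upper B-Tree levels never store
+   <literal>z</literal>, and may store only <literal>x</literal> when that
+   prefix is enough to separate the underlying leaf pages.
+  </para>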
 
   <para>
diff --git a/src/backend/access/common/indextuple.c b/src/backend/access/common/indextuple.c
index bc0c614f3b..a1c3a6f705 100644
--- a/src/backend/access/common/indextuple.c
+++ b/src/backend/access/common/indextuple.c
@@ -475,7 +475,11 @@ index_truncate_tuple(TupleDesc sourceDescriptor, IndexTuple source,
 	bool		isnull[INDEX_MAX_KEYS];
 	IndexTuple	truncated;
 
-	Assert(leavenatts < sourceDescriptor->natts);
+	Assert(leavenatts <= sourceDescriptor->natts);
+
+	/* Easy case: no truncation actually required */
+	if (leavenatts == sourceDescriptor->natts)
+		return CopyIndexTuple(source);
 
 	/* Create temporary descriptor to scribble on */
 	truncdesc = palloc(TupleDescSize(sourceDescriptor));
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89..cb9ed61599 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -28,37 +28,50 @@ right-link to find the new page containing the key range you're looking
 for.  This might need to be repeated, if the page has been split more than
 once.
 
+Lehman and Yao talk about pairs of "separator" keys and downlinks in
+internal pages rather than tuples or records.  We use the term "pivot"
+tuple to refer to tuples which don't point to heap tuples, that are used
+only for tree navigation.  All tuples on non-leaf pages and high keys on
+leaf pages are pivot tuples.  Since pivot tuples are only used to represent
+which part of the key space belongs on each page, they can have attribute
+values copied from non-pivot tuples that were deleted and killed by VACUUM
+some time ago.  A pivot tuple may contain a "separator" key and downlink,
+just a separator key (i.e. the downlink value is implicitly undefined), or
+just a downlink (i.e. all attributes are truncated away).  We aren't always
+clear on which case applies, but it should be obvious from context.
+
+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute.  Logical duplicates are sorted in heap TID
+order.  This is necessary because Lehman and Yao also require that the key
+range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are
+the adjacent keys in the parent page (Ki must be _strictly_ less than v,
+which can be assured by having reliably unique keys).
+
+A search where the key is equal to a pivot tuple in an upper tree level
+must descend to the left of that pivot to ensure it finds any equal keys.
+The equal item(s) being searched for must therefore be to the left of that
+downlink page on the next level down.  A handy property of this design is
+that a scan where all attributes/keys are used behaves just the same as a
+scan where only some prefix of attributes are used; equality never needs to
+be treated as a special case.
+
+In practice, exact equality with pivot tuples on internal pages is
+extremely rare when all attributes (including even the heap TID attribute)
+are used in a search.  This is due to suffix truncation: truncated
+attributes are treated as having the value negative infinity, and
+truncation almost always manages to at least truncate away the trailing
+heap TID attribute.  While Lehman and Yao don't have anything to say about
+suffix truncation, the design used by nbtree is perfectly complementary.
+The later section on suffix truncation will be helpful if it's unclear how
+the Lehman & Yao invariants work with a real world example involving
+suffix truncation.
+
 Differences to the Lehman & Yao algorithm
 -----------------------------------------
 
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
-
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
-
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
 among backends.  As a result, we do page-level read locking on btree
@@ -111,6 +124,16 @@ it is necessary to lock the next page before releasing the current one.
 This is safe when moving right or up, but not when moving left or down
 (else we'd create the possibility of deadlocks).
 
+We don't use btree keys to re-find downlinks from parent pages when
+inserting a new downlink in parent during page splits.  Only one entry
+in the parent level will be pointing at the page we just split, so the
+link fields can be used to re-find downlinks in the parent via a
+linear search.  We don't need to remember key values during the
+initial descent; remembering index block numbers instead works just as
+well in practice.  This is the only approach that works for indexes
+initialized by btree versions predating the use of heap TID as a
+tiebreaker attribute.
+
 Lehman and Yao fail to discuss what must happen when the root page
 becomes full and must be split.  Our implementation is to split the
 root in the same way that any other page would be split, then construct
@@ -598,33 +621,53 @@ the order of multiple keys for a given column is unspecified.)  An
 insertion scankey uses the same array-of-ScanKey data structure, but the
 sk_func pointers point to btree comparison support functions (ie, 3-way
 comparators that return int4 values interpreted as <0, =0, >0).  In an
-insertion scankey there is exactly one entry per index column.  Insertion
-scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
-used to locate the starting point of a scan, as well as for locating the
-place to insert a new index tuple.  (Note: in the case of an insertion
-scankey built from a search scankey, there might be fewer keys than
-index columns, indicating that we have no constraints for the remaining
-index columns.)  After we have located the starting point of a scan, the
-original search scankey is consulted as each index entry is sequentially
-scanned to decide whether to return the entry and whether the scan can
-stop (see _bt_checkkeys()).
+insertion scankey there is exactly one entry per index column.  There is
+also other data about the rules used to locate where to begin the scan,
+such as whether or not the scan is a "nextkey" scan.  Insertion scankeys
+are built within the btree code (eg, by _bt_mkscankey()) and are used to
+locate the starting point of a scan, as well as for locating the place to
+insert a new index tuple.  (Note: in the case of an insertion scankey built
+from a search scankey, there might be fewer keys than index columns,
+indicating that we have no constraints for the remaining index columns.)
+After we have located the starting point of a scan, the original search
+scankey is consulted as each index entry is sequentially scanned to decide
+whether to return the entry and whether the scan can stop (see
+_bt_checkkeys()).
 
-We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+Notes about suffix truncation
+-----------------------------
+
+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split.  The remaining attributes must distinguish
+the last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page.  A truncated tuple logically retains all key attributes,
+though they implicitly have "negative infinity" as their value, and have no
+storage overhead.  Since the high key is subsequently reused as the
+downlink in the parent page for the new right page, suffix truncation makes
+pivot tuples short.  INCLUDE indexes are guaranteed to have non-key
+attributes truncated at the time of a leaf page split, but may also have
+some key attributes truncated away, based on the usual criteria for key
+attributes.  They are not a special case, since non-key attributes are
+merely payload to B-Tree searches.
+
+The goal of suffix truncation of key attributes is to improve index
+fan-out.  The technique was first described by Bayer and Unterauer (R.Bayer
+and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
+2, No. 1, March 1977, pp 11-26).  The Postgres implementation is loosely
+based on their paper.  Note that Postgres only implements what the paper
+refers to as simple prefix B-Trees.  Note also that the paper assumes that
+the tree has keys that consist of single strings that maintain the "prefix
+property", much like strings that are stored in a suffix tree (comparisons
+of earlier bytes must always be more significant than comparisons of later
+bytes, and, in general, the strings must compare in a way that doesn't
+break transitive consistency as they're split into pieces).  Suffix
+truncation in Postgres currently only works at the whole-attribute
+granularity, but it would be straightforward to invent opclass
+infrastructure that manufactures a smaller attribute value in the case of
+variable-length types, such as text.  An opclass support function could
+manufacture the shortest possible key value that still correctly separates
+each half of a leaf page split.
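+
+An illustrative example (attribute values invented): in an index on
+(last_name, first_name), a leaf page split whose last tuple on the left is
+("Smith", "Alice", <heap TID>) and whose first tuple on the right is
+("Turner", "Bob", <heap TID>) only needs the new high key ("Turner", -inf,
+-inf), where -inf denotes a truncated (negative infinity) attribute; the
+first attribute alone distinguishes the two halves, so first_name and the
+heap TID attribute are both truncated away.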
 
 Notes About Data Representation
 -------------------------------
@@ -637,20 +680,26 @@ don't need to renumber any existing pages when splitting the root.)
 
 The Postgres disk block data format (an array of items) doesn't fit
 Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
-so we have to play some games.
+so we have to play some games.  (Presumably things are explained this
+way because of internal page splits, which conceptually split at the
+middle of an existing pivot tuple -- the tuple's "separator" key goes on
+the left side of the split as the left side's new high key, while the
+tuple's pointer/downlink goes on the right side as the first/minus
+infinity downlink.)
 
 On a page that is not rightmost in its tree level, the "high key" is
 kept in the page's first item, and real data items start at item 2.
 The link portion of the "high key" item goes unused.  A page that is
-rightmost has no "high key", so data items start with the first item.
-Putting the high key at the left, rather than the right, may seem odd,
-but it avoids moving the high key as we add data items.
+rightmost has no "high key" (it's implicitly positive infinity), so
+data items start with the first item.  Putting the high key at the
+left, rather than the right, may seem odd, but it avoids moving the
+high key as we add data items.
 
 On a leaf page, the data items are simply links to (TIDs of) tuples
 in the relation being indexed, with the associated key values.
 
 On a non-leaf page, the data items are down-links to child pages with
-bounding keys.  The key in each data item is the *lower* bound for
+bounding keys.  The key in each data item is a strict lower bound for
 keys on that child page, so logically the key is to the left of that
 downlink.  The high key (if present) is the upper bound for the last
 downlink.  The first data item on each such page has no lower bound
@@ -658,4 +707,5 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index cbc07d316b..04c6023cba 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -72,7 +72,7 @@ static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   bool split_only_page);
 static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
 		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
-		  IndexTuple newitem, bool newitemonleft);
+		  IndexTuple newitem, bool newitemonleft, bool truncate);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
@@ -115,13 +115,16 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	BTStack		stack = NULL;
 	Buffer		buf;
 	Page		page;
-	OffsetNumber offset;
 	BTPageOpaque lpageop;
 	bool		fastpath;
 	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);
 
 	/* we need an insertion scan key to do our search, so build one */
-	itup_scankey = _bt_mkscankey(rel, itup);
+	itup_scankey = _bt_mkscankey(rel, itup, false);
+top:
+	/* Cannot use real heap TID in unique case -- it'll be restored later */
+	if (itup_scankey->heapkeyspace && checkingunique)
+		itup_scankey->scantid = NULL;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -142,9 +145,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	 * other backend might be concurrently inserting into the page, thus
 	 * reducing our chances to finding an insertion place in this page.
 	 */
-top:
 	fastpath = false;
-	offset = InvalidOffsetNumber;
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
 		Size		itemsz;
@@ -226,12 +227,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tie-breaker attribute.  Any other would-be inserter
+	 * of the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -244,6 +246,7 @@ top:
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
+		OffsetNumber offset;
 
 		page = BufferGetPage(buf);
 		lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -277,6 +280,10 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/* Uniqueness is established -- restore heap tid as scantid */
+		if (itup_scankey->heapkeyspace)
+			itup_scankey->scantid = &itup->t_tid;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
@@ -293,7 +300,7 @@ top:
 		 * attributes are not considered part of the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
+		/* do the insertion, possibly on a page to the right in unique case */
 		insertoff = _bt_findinsertloc(rel, itup_scankey, &buf, checkingunique,
 									  itup, stack, heapRel);
 		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, insertoff, false);
@@ -346,6 +353,7 @@ _bt_check_unique(Relation rel, BTScanInsert itup_scankey,
 	bool		found = false;
 
 	/* Assume unique until we find a duplicate */
+	Assert(itup_scankey->scantid == NULL);
 	*is_unique = true;
 
 	/* _bt_binsrch() alone may determine that there are no duplicates */
@@ -567,6 +575,7 @@ _bt_check_unique(Relation rel, BTScanInsert itup_scankey,
 			if (P_RIGHTMOST(opaque))
 				break;
 			highkeycmp = _bt_compare(rel, itup_scankey, page, P_HIKEY);
+			/* scantid-less scankey should be <= hikey */
 			Assert(highkeycmp <= 0);
 			if (highkeycmp != 0)
 				break;
@@ -614,16 +623,16 @@ notfound:
 /*
  *	_bt_findinsertloc() -- Finds an insert location for a new tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
+ *		On entry, *bufptr contains the page that the new tuple unambiguously
+ *		belongs on.  This may not be quite right for callers that just called
+ *		_bt_check_unique(), though, since they won't have initially searched
+ *		using a scantid.  They'll have to insert into a page somewhere to the
+ *		right in rare cases where there are many physical duplicates in a
+ *		unique index, and their scantid directs us to some page full of
+ *		duplicates to the right, where the new tuple must go.  (Actually,
+ *		since !heapkeyspace pg_upgrade'd non-unique indexes never get a
+ *		scantid, they too may require that we move right.  We treat them
+ *		somewhat like unique indexes.)
  *
  *		_bt_check_unique() callers arrange for their insertion scan key to
  *		save the progress of the last binary search performed.  No additional
@@ -666,46 +675,66 @@ _bt_findinsertloc(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itemsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
-	 */
+	/* Check 1/3 of a page restriction */
 	if (itemsz > BTMaxItemSize(page))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itemsz, BTMaxItemSize(page),
-						RelationGetRelationName(rel)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(heapRel,
-									RelationGetRelationName(rel))));
+		_bt_check_third_page(rel, heapRel, itup_scankey->heapkeyspace, page,
+							 newtup);
 
+	/*
+	 * We may have to walk right through leaf pages to find the one leaf page
+	 * that we must insert on to, though only when inserting into unique
+	 * indexes.  This is necessary because a scantid is not used by the
+	 * insertion scan key initially in the case of unique indexes -- a scantid
+	 * is only set after the absence of duplicates (whose heap tuples are not
+	 * dead or recently dead) has been established by _bt_check_unique().
+	 * Non-unique index insertions will break out of the loop immediately.
+	 *
+	 * (Actually, non-unique indexes may still need to grovel through leaf
+	 * pages full of duplicates with a pg_upgrade'd !heapkeyspace index.)
+	 */
 	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	Assert(!itup_scankey->heapkeyspace || itup_scankey->scantid != NULL);
+	Assert(itup_scankey->heapkeyspace || itup_scankey->scantid == NULL);
 	for (;;)
 	{
 		Buffer		rbuf;
 		BlockNumber rblkno;
 		int			cmpval;
 
+		/*
+		 * No need to check high key when inserting into a non-unique index --
+		 * _bt_search() already checked this when it checked if a move to the
+		 * right was required.  Insertion scankey's scantid would have been
+		 * filled out at the time.
+		 */
+		if (itup_scankey->heapkeyspace && !checkingunique)
+		{
+			Assert(P_RIGHTMOST(lpageop) ||
+				   _bt_compare(rel, itup_scankey, page, P_HIKEY) <= 0);
+			break;
+		}
+
 		if (P_RIGHTMOST(lpageop))
 			break;
 		cmpval = _bt_compare(rel, itup_scankey, page, P_HIKEY);
-
-		/*
-		 * May have to handle case where there is a choice of which page to
-		 * place new tuple on, and we must balance space utilization as best
-		 * we can.
-		 */
-		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
-												&restorebinsrch, itemsz))
-			break;
+		if (itup_scankey->heapkeyspace)
+		{
+			if (cmpval <= 0)
+				break;
+		}
+		else
+		{
+			/*
+			 * pg_upgrade'd !heapkeyspace index.
+			 *
+			 * May have to handle legacy case where there is a choice of which
+			 * page to place new tuple on, and we must balance space
+			 * utilization as best we can.
+			 */
+			if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
+													&restorebinsrch, itemsz))
+				break;
+		}
 
 		/*
 		 * step right to next non-dead page
@@ -714,6 +743,8 @@ _bt_findinsertloc(Relation rel,
 		 * page; else someone else's _bt_check_unique scan could fail to see
 		 * our insertion.  write locks on intermediate dead pages won't do
 		 * because we don't know when they will get de-linked from the tree.
+		 * (this is more aggressive than it needs to be for non-unique
+		 * !heapkeyspace indexes.)
 		 */
 		rbuf = InvalidBuffer;
 
@@ -728,7 +759,10 @@ _bt_findinsertloc(Relation rel,
 			 * If this page was incompletely split, finish the split now. We
 			 * do this while holding a lock on the left sibling, which is not
 			 * good because finishing the split could be a fairly lengthy
-			 * operation.  But this should happen very seldom.
+			 * operation.  But this should only happen when inserting into a
+			 * unique index that has more than an entire page for duplicates
+			 * of the value being inserted.  (!heapkeyspace non-unique indexes
+			 * are an exception, once again.)
 			 */
 			if (P_INCOMPLETE_SPLIT(lpageop))
 			{
@@ -777,6 +811,11 @@ _bt_findinsertloc(Relation rel,
 /*
  *	_bt_useduplicatepage() -- Settle for this page of duplicates?
  *
+ *		Prior to PostgreSQL 12/Btree version 4, heap TID was never treated
+ *		as a part of the keyspace.  If there were many tuples of the same
+ *		value spanning more than one leaf page, a new tuple of that same
+ *		value could legally be placed on any one of the pages.
+ *
  *		This function handles the question of whether or not an insertion
  *		of a duplicate into a pg_upgrade'd !heapkeyspace index should
  *		insert on the page contained in buf when a choice must be made.
@@ -878,6 +917,8 @@ _bt_insertonpg(Relation rel,
 	BTPageOpaque lpageop;
 	OffsetNumber firstright = InvalidOffsetNumber;
 	Size		itemsz;
+	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
+	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -885,12 +926,9 @@ _bt_insertonpg(Relation rel,
 	/* child buffer must be given iff inserting on an internal page */
 	Assert(P_ISLEAF(lpageop) == !BufferIsValid(cbuf));
 	/* tuple must have appropriate number of attributes */
-	Assert(!P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
-	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(BTreeTupleGetNAtts(itup, rel) > 0);
+	Assert(!P_ISLEAF(lpageop) || BTreeTupleGetNAtts(itup, rel) == indnatts);
+	Assert(P_ISLEAF(lpageop) || BTreeTupleGetNAtts(itup, rel) <= indnkeyatts);
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -912,6 +950,7 @@ _bt_insertonpg(Relation rel,
 	{
 		bool		is_root = P_ISROOT(lpageop);
 		bool		is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(lpageop);
+		bool		truncate;
 		bool		newitemonleft;
 		Buffer		rbuf;
 
@@ -938,9 +977,16 @@ _bt_insertonpg(Relation rel,
 									  newitemoff, itemsz,
 									  &newitemonleft);
 
+		/*
+		 * Perform truncation of the new high key for the left half of the
+		 * split when splitting a leaf page.  Don't do so with version 3
+		 * indexes unless the index has non-key attributes.
+		 */
+		truncate = P_ISLEAF(lpageop) &&
+			(_bt_heapkeyspace(rel) || indnatts != indnkeyatts);
 		/* split the buffer into left and right halves */
 		rbuf = _bt_split(rel, buf, cbuf, firstright,
-						 newitemoff, itemsz, itup, newitemonleft);
+						 newitemoff, itemsz, itup, newitemonleft, truncate);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -980,7 +1026,8 @@ _bt_insertonpg(Relation rel,
 		 * only one on its tree level, but was not the root, it may have been
 		 * the "fast root".  We need to ensure that the fast root link points
 		 * at or above the current page.  We can safely acquire a lock on the
-		 * metapage here --- see comments for _bt_newroot().
+		 * metapage here --- see comments for _bt_heapkeyspace() and
+		 * _bt_newroot().
 		 */
 		if (split_only_page)
 		{
@@ -1022,7 +1069,7 @@ _bt_insertonpg(Relation rel,
 		if (BufferIsValid(metabuf))
 		{
 			/* upgrade meta-page if needed */
-			if (metad->btm_version < BTREE_VERSION)
+			if (metad->btm_version < BTREE_NOVAC_VERSION)
 				_bt_upgrademetapage(metapg);
 			metad->btm_fastroot = itup_blkno;
 			metad->btm_fastlevel = lpageop->btpo.level;
@@ -1077,6 +1124,8 @@ _bt_insertonpg(Relation rel,
 
 			if (BufferIsValid(metabuf))
 			{
+				Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+				xlmeta.version = metad->btm_version;
 				xlmeta.root = metad->btm_root;
 				xlmeta.level = metad->btm_level;
 				xlmeta.fastroot = metad->btm_fastroot;
@@ -1142,7 +1191,10 @@ _bt_insertonpg(Relation rel,
  *		On entry, buf is the page to split, and is pinned and write-locked.
  *		firstright is the item index of the first item to be moved to the
  *		new right page.  newitemoff etc. tell us about the new item that
- *		must be inserted along with the data from the old page.
+ *		must be inserted along with the data from the old page.  truncate
+ *		tells us if the new high key should undergo suffix truncation.
+ *		(Version 4 pivot tuples always have an explicit representation of
+ *		the number of non-truncated attributes that remain.)
  *
  *		When splitting a non-leaf page, 'cbuf' is the left-sibling of the
  *		page we're inserting the downlink for.  This function will clear the
@@ -1154,7 +1206,7 @@ _bt_insertonpg(Relation rel,
 static Buffer
 _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
-		  bool newitemonleft)
+		  bool newitemonleft, bool truncate)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1177,8 +1229,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	OffsetNumber i;
 	bool		isleaf;
 	IndexTuple	lefthikey;
-	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/* Acquire a new page to split into */
 	rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1249,7 +1299,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <=
+			   IndexRelationGetNumberOfKeyAttributes(rel));
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1263,8 +1315,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 
 	/*
 	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
-	 * item at position firstright, or the incoming tuple.
+	 * to go into the new right page, or possibly a truncated version if this
+	 * is a leaf page split.  This might be either the existing data item at
+	 * position firstright, or the incoming tuple.
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1282,25 +1335,60 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
-	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * Truncate nondistinguishing key attributes of the high key item before
+	 * inserting it on the left page.  This can only happen at the leaf level,
+	 * since in general all pivot tuple values originate from leaf level high
+	 * keys.  This isn't just about avoiding unnecessary work, though;
+	 * truncating unneeded key suffix attributes can only be performed at the
+	 * leaf level anyway.  This is because a pivot tuple in a grandparent page
+	 * must guide a search not only to the correct parent page, but also to
+	 * the correct leaf page.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (truncate)
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		IndexTuple	lastleft;
+
+		/*
+		 * Determine which tuple will become the last on the left page.  The
+		 * last left tuple and the first right tuple enclose the split point,
+		 * and are needed to determine how far truncation can go while still
+		 * leaving us with a high key that distinguishes the left side from
+		 * the right side.
+		 */
+		Assert(isleaf);
+		if (newitemonleft && newitemoff == firstright)
+		{
+			/* incoming tuple will become last on left page */
+			lastleft = newitem;
+		}
+		else
+		{
+			OffsetNumber lastleftoff;
+
+			/* item just before firstright will become last on left page */
+			lastleftoff = OffsetNumberPrev(firstright);
+			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		/*
+		 * Truncate first item on the right side to create a new high key for
+		 * the left side.  The high key must be strictly less than all tuples
+		 * on the right side of the split, but can be equal to the last item
+		 * on the left side of the split.
+		 */
+		Assert(lastleft != item);
+		lefthikey = _bt_truncate(rel, lastleft, item, false);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <=
+		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1493,7 +1581,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1522,22 +1609,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log left page.  We must also log the left page's high key. */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1555,9 +1630,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -1920,7 +1993,7 @@ _bt_insert_parent(Relation rel,
 			_bt_relbuf(rel, pbuf);
 		}
 
-		/* get high key from left page == lower bound for new right page */
+		/* get high key from left, a strict lower bound for new right page */
 		ritem = (IndexTuple) PageGetItem(page,
 										 PageGetItemId(page, P_HIKEY));
 
@@ -2137,11 +2210,9 @@ _bt_getstackbuf(Relation rel, BTStack stack, int access)
  *		We've just split the old root page and need to create a new one.
  *		In order to do this, we add a new root page to the file, then lock
  *		the metadata page and update it.  This is guaranteed to be deadlock-
- *		free, because all readers release their locks on the metadata page
- *		before trying to lock the root, and all writers lock the root before
- *		trying to lock the metadata page.  We have a write lock on the old
- *		root page, so we have not introduced any cycles into the waits-for
- *		graph.
+ *		free, for the same reason the frequent calls to _bt_heapkeyspace()
+ *		are guaranteed safe.  We have a write lock on the old root page, so
+ *		we have not introduced any cycles into the waits-for graph.
  *
  *		On entry, lbuf (the old root) and rbuf (its new peer) are write-
  *		locked. On exit, a new root page exists with entries for the
@@ -2210,7 +2281,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	START_CRIT_SECTION();
 
 	/* upgrade metapage if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* set btree special data */
@@ -2245,7 +2316,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2278,6 +2350,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
 		XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = rootblknum;
 		md.level = metad->btm_level;
 		md.fastroot = rootblknum;
@@ -2342,6 +2416,7 @@ _bt_pgaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -2357,8 +2432,8 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey, Page page,
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index d0cf73718f..b0b58850a4 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -34,6 +34,7 @@
 #include "utils/snapmgr.h"
 
 static void _bt_cachemetadata(Relation rel, BTMetaPageData *metad);
+static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 						 bool *rightsib_empty);
@@ -77,7 +78,9 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 }
 
 /*
- *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to the new.
+ *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to version
+ *	3, the last version that can be upgraded to on-the-fly without broadly
+ *	affecting on-disk compatibility.  (A REINDEX is required to upgrade to
+ *	version 4.)
  *
  *		This routine does purely in-memory image upgrade.  Caller is
  *		responsible for locking, WAL-logging etc.
@@ -93,11 +96,11 @@ _bt_upgrademetapage(Page page)
 
 	/* It must be really a meta page of upgradable version */
 	Assert(metaopaque->btpo_flags & BTP_META);
-	Assert(metad->btm_version < BTREE_VERSION);
+	Assert(metad->btm_version < BTREE_NOVAC_VERSION);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 
 	/* Set version number and fill extra fields added into version 3 */
-	metad->btm_version = BTREE_VERSION;
+	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 
@@ -107,43 +110,79 @@ _bt_upgrademetapage(Page page)
 }
 
 /*
- * Cache metadata from meta page to rel->rd_amcache.
+ * Cache metadata from input meta page to rel->rd_amcache.
  */
 static void
-_bt_cachemetadata(Relation rel, BTMetaPageData *metad)
+_bt_cachemetadata(Relation rel, BTMetaPageData *input)
 {
+	BTMetaPageData *cached_metad;
+
 	/* We assume rel->rd_amcache was already freed by caller */
 	Assert(rel->rd_amcache == NULL);
 	rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
 										 sizeof(BTMetaPageData));
 
-	/*
-	 * Meta page should be of supported version (should be already checked by
-	 * caller).
-	 */
-	Assert(metad->btm_version >= BTREE_MIN_VERSION &&
-		   metad->btm_version <= BTREE_VERSION);
+	/* Meta page should be of supported version */
+	Assert(input->btm_version >= BTREE_MIN_VERSION &&
+		   input->btm_version <= BTREE_VERSION);
 
-	if (metad->btm_version == BTREE_VERSION)
+	cached_metad = (BTMetaPageData *) rel->rd_amcache;
+	if (input->btm_version >= BTREE_NOVAC_VERSION)
 	{
-		/* Last version of meta-data, no need to upgrade */
-		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		/* Version with compatible meta-data, no need to upgrade */
+		memcpy(cached_metad, input, sizeof(BTMetaPageData));
 	}
 	else
 	{
-		BTMetaPageData *cached_metad = (BTMetaPageData *) rel->rd_amcache;
-
 		/*
 		 * Upgrade meta-data: copy available information from meta-page and
 		 * fill new fields with default values.
+		 *
+		 * Note that we cannot upgrade to version 4+ without a REINDEX, since
+		 * extensive on-disk changes are required.
 		 */
-		memcpy(rel->rd_amcache, metad, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
-		cached_metad->btm_version = BTREE_VERSION;
+		memcpy(cached_metad, input, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
+		cached_metad->btm_version = BTREE_NOVAC_VERSION;
 		cached_metad->btm_oldest_btpo_xact = InvalidTransactionId;
 		cached_metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	}
 }
 
+/*
+ * Get metadata from share-locked buffer containing metapage, while performing
+ * standard sanity checks.  Sanity checks here must match _bt_getroot().
+ */
+static BTMetaPageData *
+_bt_getmeta(Relation rel, Buffer metabuf)
+{
+	Page		metapg;
+	BTPageOpaque metaopaque;
+	BTMetaPageData *metad;
+
+	metapg = BufferGetPage(metabuf);
+	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
+	metad = BTPageGetMeta(metapg);
+
+	/* sanity-check the metapage */
+	if (!P_ISMETA(metaopaque) ||
+		metad->btm_magic != BTREE_MAGIC)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("index \"%s\" is not a btree",
+						RelationGetRelationName(rel))));
+
+	if (metad->btm_version < BTREE_MIN_VERSION ||
+		metad->btm_version > BTREE_VERSION)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("version mismatch in index \"%s\": file version %d, "
+						"current version %d, minimal supported version %d",
+						RelationGetRelationName(rel),
+						metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+
+	return metad;
+}
+
 /*
  *	_bt_update_meta_cleanup_info() -- Update cleanup-related information in
  *									  the metapage.
@@ -167,7 +206,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	metad = BTPageGetMeta(metapg);
 
 	/* outdated version of metapage always needs rewrite */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		needsRewrite = true;
 	else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
 			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
@@ -186,7 +225,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* update cleanup-related information */
@@ -202,6 +241,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = metad->btm_root;
 		md.level = metad->btm_level;
 		md.fastroot = metad->btm_fastroot;
@@ -376,7 +417,7 @@ _bt_getroot(Relation rel, int access)
 		START_CRIT_SECTION();
 
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 
 		metad->btm_root = rootblkno;
@@ -400,6 +441,8 @@ _bt_getroot(Relation rel, int access)
 			XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
 			XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			md.version = metad->btm_version;
 			md.root = rootblkno;
 			md.level = 0;
 			md.fastroot = rootblkno;
@@ -492,7 +535,8 @@ _bt_getroot(Relation rel, int access)
  * from whatever non-root page we were at.  If we ever do need to lock the
  * one true root page, we could loop here, re-reading the metapage on each
  * failure.  (Note that it wouldn't do to hold the lock on the metapage while
- * moving to the root --- that'd deadlock against any concurrent root split.)
+ * moving to the root --- that'd deadlock against certain concurrent calls to
+ * _bt_heapkeyspace(), or any concurrent root page split.)
  */
 Buffer
 _bt_gettrueroot(Relation rel)
@@ -595,37 +639,12 @@ _bt_getrootheight(Relation rel)
 {
 	BTMetaPageData *metad;
 
-	/*
-	 * We can get what we need from the cached metapage data.  If it's not
-	 * cached yet, load it.  Sanity checks here must match _bt_getroot().
-	 */
 	if (rel->rd_amcache == NULL)
 	{
 		Buffer		metabuf;
-		Page		metapg;
-		BTPageOpaque metaopaque;
 
 		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
-		metapg = BufferGetPage(metabuf);
-		metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-		metad = BTPageGetMeta(metapg);
-
-		/* sanity-check the metapage */
-		if (!P_ISMETA(metaopaque) ||
-			metad->btm_magic != BTREE_MAGIC)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("index \"%s\" is not a btree",
-							RelationGetRelationName(rel))));
-
-		if (metad->btm_version < BTREE_MIN_VERSION ||
-			metad->btm_version > BTREE_VERSION)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("version mismatch in index \"%s\": file version %d, "
-							"current version %d, minimal supported version %d",
-							RelationGetRelationName(rel),
-							metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+		metad = _bt_getmeta(rel, metabuf);
 
 		/*
 		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
@@ -642,19 +661,80 @@ _bt_getrootheight(Relation rel)
 		 * Cache the metapage data for next time
 		 */
 		_bt_cachemetadata(rel, metad);
-
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
 		_bt_relbuf(rel, metabuf);
 	}
 
+	/* Get cached page */
 	metad = (BTMetaPageData *) rel->rd_amcache;
-	/* We shouldn't have cached it if any of these fail */
-	Assert(metad->btm_magic == BTREE_MAGIC);
-	Assert(metad->btm_version == BTREE_VERSION);
-	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
+/*
+ *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *
+ *		This is used to determine the rules that must be used to descend a
+ *		btree.  Version 4 indexes treat heap TID as a tie-breaker attribute.
+ *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
+ *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
+ *		performance when the tuple being inserted is a duplicate (in
+ *		BTScanInsert terms) of values that already fill many leaf pages.
+ *		Calling here with locks on other pages in the index is guaranteed to
+ *		be deadlock-free, because all readers release their locks on the
+ *		metadata page before trying to lock any other page, and all writers
+ *		lock other pages before trying to lock the metadata page
+ *		(_bt_getbuf() may be called with a buffer lock on the metapage held
+ *		to allocate a new root page, but _bt_getbuf() is careful about
+ *		deadlocks when recycling a page from the FSM).  It is natural to
+ *		buffer lock the metapage last and release its buffer lock first,
+ *		since nobody insists on reliably reaching the current true root.
+ */
+bool
+_bt_heapkeyspace(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			uint32		btm_version = metad->btm_version;
+
+			_bt_relbuf(rel, metabuf);
+			return btm_version > BTREE_NOVAC_VERSION;
+		}
+
+		/*
+		 * Cache the metapage data for next time
+		 */
+		_bt_cachemetadata(rel, metad);
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached page */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+
+	return metad->btm_version > BTREE_NOVAC_VERSION;
+}
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -1420,10 +1500,21 @@ _bt_pagedel(Relation rel, Buffer buf)
 				}
 
 				/* we need an insertion scan key for the search, so build one */
-				itup_scankey = _bt_mkscankey(rel, targetkey);
+				itup_scankey = _bt_mkscankey(rel, targetkey, false);
+				/* high key may have minus infinity (truncated) attributes */
+				itup_scankey->minusinfkey = true;
 				/* get stack to leaf page by searching index */
 				stack = _bt_search(rel, itup_scankey, &lbuf, BT_READ, NULL);
 
+				/*
+				 * Search will reliably relocate the same leaf page.
+				 *
+				 * (However, prior to version 4 the search is for the leftmost
+				 * leaf page containing this key, which is okay because we
+				 * will tiebreak on downlink block number.)
+				 */
+				Assert(!itup_scankey->heapkeyspace ||
+					   BufferGetBlockNumber(buf) == BufferGetBlockNumber(lbuf));
 				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
@@ -1890,7 +1981,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	 * half-dead, even though it theoretically could occur.
 	 *
 	 * We can safely acquire a lock on the metapage here --- see comments for
-	 * _bt_newroot().
+	 * _bt_heapkeyspace() and _bt_newroot().
 	 */
 	if (leftsib == P_NONE && rightsib_is_rightmost)
 	{
@@ -1969,7 +2060,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	if (BufferIsValid(metabuf))
 	{
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 		metad->btm_fastroot = rightsib;
 		metad->btm_fastlevel = targetlevel;
@@ -2017,6 +2108,8 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 		{
 			XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			xlmeta.version = metad->btm_version;
 			xlmeta.root = metad->btm_root;
 			xlmeta.level = metad->btm_level;
 			xlmeta.fastroot = metad->btm_fastroot;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..ec2edae850 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -794,7 +794,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 	{
 		/*
 		 * Do cleanup if metapage needs upgrade, because we don't have
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 9e44e88190..701115f5b9 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,6 +25,10 @@
 #include "utils/tqual.h"
 
 
+static inline int32 _bt_nonpivot_compare(Relation rel,
+					 BTScanInsert key,
+					 Page page,
+					 OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 			 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -155,8 +159,8 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link to disambiguate duplicate keys in the index, which is
+		 * required when dealing with pg_upgrade'd !heapkeyspace indexes.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -254,11 +258,15 @@ _bt_moveright(Relation rel,
 	/*
 	 * When nextkey = false (normal case): if the scan key that brought us to
 	 * this page is > the high key stored on the page, then the page has split
-	 * and we need to move right.  (If the scan key is equal to the high key,
-	 * we might or might not need to move right; have to scan the page first
-	 * anyway.)
+	 * and we need to move right.  (pg_upgrade'd !heapkeyspace indexes could
+	 * have some duplicates to the right as well as the left, but that's
+	 * something that's only ever dealt with on the leaf level, after
+	 * _bt_search has found an initial leaf page.  Duplicate pivots on
+	 * internal pages are useless to all index scans, which was a flaw in the
+	 * old design.)
 	 *
 	 * When nextkey = true: move right if the scan key is >= page's high key.
+	 * (Note that key.scantid cannot be set in this case.)
 	 *
 	 * The page could even have split more than once, so scan as far as
 	 * needed.
@@ -363,6 +371,11 @@ _bt_binsrch(Relation rel,
 	isleaf = P_ISLEAF(opaque);
 
 	Assert(!(key->restorebinsrch && key->savebinsrch));
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+	/* Restore binary search state when scantid is available */
+	Assert(!key->savebinsrch || key->scantid == NULL);
+	Assert(!key->heapkeyspace || !key->restorebinsrch || key->scantid != NULL);
 	Assert(P_ISLEAF(opaque) || (!key->restorebinsrch && !key->savebinsrch));
 
 	if (!key->restorebinsrch)
@@ -422,7 +435,10 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, key, page, mid);
+		if (!isleaf)
+			result = _bt_compare(rel, key, page, mid);
+		else
+			result = _bt_nonpivot_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -490,17 +506,44 @@ _bt_compare(Relation rel,
 {
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			ntupatts;
 
-	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
 	 * --- see NOTE above.
+	 *
+	 * A minus infinity key has all attributes truncated away, so this test is
+	 * redundant with the minus infinity attribute tie-breaker.  However, the
+	 * number of attributes in minus infinity tuples was not explicitly
+	 * represented as 0 until PostgreSQL v11, so an explicit offnum test is
+	 * still required.
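+	 * (For example, a minus infinity item written by a pre-v11 server may
+	 * not have INDEX_ALT_TID_MASK set at all, in which case
+	 * BTreeTupleGetNAtts() reports the index's full attribute count even
+	 * though no attributes are physically present.)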
 	 */
 	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
+	return _bt_tuple_compare(rel, key, itup, ntupatts);
+}
+
+/*
+ * Optimized version of _bt_compare().  Only works on non-pivot tuples.
+ */
+static inline int32
+_bt_nonpivot_compare(Relation rel,
+					 BTScanInsert key,
+					 Page page,
+					 OffsetNumber offnum)
+{
+	IndexTuple	itup;
+
+	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
+
+	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	Assert(BTreeTupleGetNAtts(itup, rel) ==
+		   IndexRelationGetNumberOfAttributes(rel));
 	return _bt_tuple_compare(rel, key, itup, key->keysz);
 }
 
@@ -523,13 +566,17 @@ int32
 _bt_tuple_compare(Relation rel,
 				  BTScanInsert key,
 				  IndexTuple itup,
-				  int ncmpkey)
+				  int ntupatts)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
+	ItemPointer heapTid;
+	int			ncmpkey;
 	int			i;
 	ScanKey		scankey;
 
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(key->heapkeyspace || key->scantid == NULL);
+	Assert(!key->minusinfkey || key->heapkeyspace);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -543,6 +590,7 @@ _bt_tuple_compare(Relation rel,
 	 * _bt_first).
 	 */
 
+	ncmpkey = Min(ntupatts, key->keysz);
 	scankey = key->scankeys;
 	for (i = 1; i <= ncmpkey; i++)
 	{
@@ -595,8 +643,40 @@ _bt_tuple_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * Use the number of attributes as a tie-breaker, in order to treat
+	 * truncated attributes in the index as minus infinity.
+	 */
+	if (key->keysz > ntupatts)
+		return 1;
+
+	/* If caller provided no heap TID tie-breaker for scan, they're equal */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (key->scantid == NULL)
+	{
+		/*
+		 * May be able to apply the "avoid minus infinity search" optimization
+		 * with truncated pivot tuples
+		 */
+		if ((itup->t_info & INDEX_ALT_TID_MASK) != 0 && !key->minusinfkey &&
+			heapTid == NULL && key->keysz == ntupatts)
+			return 1;
+
+		return 0;
+	}
+
+	/*
+	 * Although it isn't counted as an attribute by BTreeTupleGetNAtts(), heap
+	 * TID is an implicit final key attribute that ensures that all index
+	 * tuples have a distinct set of key attribute values.
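+	 * For example, in a single column index the non-pivot tuples
+	 * (42, '(10,3)') and (42, '(10,4)'), say, occupy distinct positions in
+	 * the keyspace, with the former sorting lower; an insertion scan key
+	 * with scantid = '(10,4)' compares greater than the first tuple and
+	 * equal to the second.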
+	 *
+	 * This is often truncated away in pivot tuples, which makes the attribute
+	 * value implicitly negative infinity.
+	 */
+	if (heapTid == NULL)
+		return 1;
+
+	return ItemPointerCompare(key->scantid, heapTid);
 }
 
 /*
@@ -1113,7 +1193,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/* Initialize remaining insertion scankey fields */
 	inskey.savebinsrch = inskey.restorebinsrch = false;
 	inskey.low = inskey.high = InvalidOffsetNumber;
+	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	inskey.minusinfkey = !inskey.heapkeyspace;
 	inskey.nextkey = nextkey;
+	inskey.scantid = NULL;
 	inskey.keysz = keysCount;
 
 	/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 933fb4dfe7..87b549a96b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -743,6 +743,7 @@ _bt_sortaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -796,8 +797,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -814,27 +813,21 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	itupsz = MAXALIGN(itupsz);
 
 	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itupsz doesn't
-	 * include the ItemId.
+	 * Check whether the item can fit on a btree page at all.
 	 *
-	 * NOTE: similar code appears in _bt_insertonpg() to defend against
-	 * oversize items being inserted into an already-existing index. But
-	 * during creation of an index, we don't go through there.
+	 * Every newly built index will treat heap TID as part of the keyspace,
+	 * which imposes the requirement that new high keys must occasionally have
+	 * a heap TID appended within _bt_truncate().  That may leave a new pivot
+	 * tuple one MAXALIGN() quantum larger than the original first right tuple
+	 * it's derived from.  v4 deals with the problem by decreasing the limit
+	 * on the size of tuples inserted on the leaf level by the same small
+	 * amount.  Enforce the new v4+ limit on the leaf level, and the old limit
+	 * on internal levels, since pivot tuples may need to make use of the
+	 * spare MAXALIGN() quantum.  This should never fail on internal pages.
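+	 * (With the default 8KB block size this works out to a limit of roughly
+	 * 2712 bytes on internal levels versus roughly 2704 bytes at the leaf
+	 * level, the exact figures depending on MAXALIGN().)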
 	 */
 	if (itupsz > BTMaxItemSize(npage))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itupsz, BTMaxItemSize(npage),
-						RelationGetRelationName(wstate->index)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(wstate->heap,
-									RelationGetRelationName(wstate->index))));
+		_bt_check_third_page(wstate->index, wstate->heap,
+							 state->btps_level == 0, npage, itup);
 
 	/*
 	 * Check to see if page is "full".  It's definitely full if the item won't
@@ -880,24 +873,35 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
+			IndexTuple	lastleft;
 			IndexTuple	truncated;
 			Size		truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks
 			 * in internal pages are either negative infinity items, or get
 			 * their contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it more
+			 * likely that _bt_truncate() can truncate away more attributes,
+			 * whereas the split point passed to _bt_split() is chosen much
+			 * more delicately.  Suffix truncation is mostly useful because it
+			 * improves space utilization for workloads with random
+			 * insertions.  It doesn't seem worthwhile to add logic for
+			 * choosing a split point here for a benefit that is bound to be
+			 * much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
 			 * original high key, and add our own truncated high key at the
-			 * same offset.
+			 * same offset.  It's okay if the truncated tuple is slightly
+			 * larger due to containing a heap TID value, since this case is
+			 * known to _bt_check_third_page(), which reserves space.
 			 *
 			 * Note that the page layout won't be changed very much.  oitup is
 			 * already located at the physical beginning of tuple space, so we
@@ -905,7 +909,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_truncate(wstate->index, lastleft, oitup, true);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -924,8 +931,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -970,7 +978,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1029,8 +1037,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1127,6 +1136,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1134,7 +1145,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1151,6 +1161,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all
+				 * keys in the index are physically unique.
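+				 * (ItemPointerCompare() orders TIDs by block number and then
+				 * by offset, so e.g. '(10,3)' sorts before '(10,4)', which in
+				 * turn sorts before '(11,1)'.)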
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 69d67fb428..23d75ad604 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,6 +49,8 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+			   IndexTuple firstright, bool build);
 
 
 /*
@@ -56,10 +58,25 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
+ *		When itup is a non-pivot tuple, the returned insertion scan key is
+ *		suitable for finding a place for it to go on the leaf level.  When
+ *		itup is a pivot tuple, the returned insertion scankey is suitable
+ *		for locating the leaf page with the pivot as its high key (there
+ *		must have been one like it at some point if the pivot tuple
+ *		actually came from the tree).
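+ *		For example, a pivot tuple from an index on (a, b, c) whose second
+ *		and third attributes were truncated away yields an insertion scan
+ *		key with keysz = 1 and scantid = NULL; the truncated attributes
+ *		are simply omitted rather than being represented as explicit minus
+ *		infinity values.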
+ *
+ *		Note that we may occasionally have to share lock the metapage, in
+ *		order to determine whether or not the keys in the index are
+ *		expected to be unique (i.e. a "heapkeyspace" index).  Callers that
+ *		are building a new index cannot let us access the non-existent
+ *		metapage.  This is okay because we can safely assume that the
+ *		index is on the latest btree version, which must be a
+ *		"heapkeyspace" version.
+ *
  *		The result is intended for use with _bt_compare().
  */
 BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
+_bt_mkscankey(Relation rel, IndexTuple itup, bool build)
 {
 	BTScanInsert inskey;
 	ScanKey		skey;
@@ -80,15 +97,37 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	Assert(tupnatts <= indnatts);
 
 	/*
-	 * We'll execute search using scan key constructed on key columns. Non-key
-	 * (INCLUDE index) columns are always omitted from scan keys.
+	 * We'll execute search using scan key constructed on key columns.
+	 * Truncated attributes and non-key attributes are omitted from the final
+	 * scan key.
 	 */
 	inskey = palloc(offsetof(BTScanInsertData, scankeys) +
 					sizeof(ScanKeyData) * indnkeyatts);
+	inskey->heapkeyspace = build || _bt_heapkeyspace(rel);
+
+	/*
+	 * Only heapkeyspace indexes support the "no minus infinity key"
+	 * optimization.  !heapkeyspace indexes don't actually have minus infinity
+	 * attributes, but setting minusinfkey for them allows us to avoid
+	 * checking heapkeyspace separately (the explicit representation of the
+	 * number of key attributes in v3 indexes shouldn't confuse the
+	 * tie-breaker logic).
+	 *
+	 * There is never a need to explicitly represent truncated attributes as
+	 * having minus infinity values.  The only caller that may truly need to
+	 * search for negative infinity is the page deletion code.  It is
+	 * sufficient to omit trailing truncated attributes from the scankey
+	 * returned to that caller because caller relies on the fact that there
+	 * cannot be duplicate high keys in heapkeyspace indexes.  Caller also
+	 * opts out of the "no minus infinity key" optimization, so search moves
+	 * left on scankey-equal downlink in parent, allowing VACUUM caller to
+	 * reliably relocate leaf page undergoing deletion.
+	 */
+	inskey->minusinfkey = !inskey->heapkeyspace;
 	inskey->savebinsrch = inskey->restorebinsrch = false;
 	inskey->low = inskey->high = InvalidOffsetNumber;
 	inskey->nextkey = false;
 	inskey->keysz = Min(indnkeyatts, tupnatts);
+	inskey->scantid = inskey->heapkeyspace ? BTreeTupleGetHeapTID(itup) : NULL;
 	skey = inskey->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -102,7 +141,19 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Keys built from truncated attributes are defensively represented as
+		 * NULL values, though they should still not participate in
+		 * comparisons.
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -2078,38 +2129,265 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  We only need to keep firstright's
+ * attributes up to and including the first attribute whose value differs
+ * from the corresponding lastleft value.
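+ *
+ * For example, if the last left tuple at a leaf split is ('USA', 'New
+ * Jersey', 'Newark') and the first right tuple is ('USA', 'New York',
+ * 'Albany'), the returned pivot only needs its first two attributes --
+ * ('USA', 'New York') -- since the second attribute already distinguishes
+ * the two halves; the third attribute is truncated away, and no heap TID
+ * needs to be added.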
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that takes up more
+ * space than firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
  * Note that returned tuple's t_tid offset will hold the number of attributes
  * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * should only change truncated tuple's downlink.  Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare()/_bt_tuple_compare().
+ *
+ * In the worst case (when a heap TID is appended) the size of the returned
+ * tuple is the size of the first right tuple plus an additional MAXALIGN()
+ * quantum.  This guarantee is important, since callers need to stay under
+ * the 1/3 of a page restriction on tuple size.  If this routine is ever
+ * taught to truncate within an attribute/datum, it will need to avoid
+ * returning an enlarged tuple to caller when truncation + TOAST compression
+ * ends up enlarging the final datum.
+ *
+ * CREATE INDEX callers must pass build = true, in order to avoid metapage
+ * access.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			 bool build)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int16		natts = IndexRelationGetNumberOfAttributes(rel);
+	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+	IndexTuple	pivot;
+	ItemPointer pivotheaptid;
+	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples.  It's never okay to
+	 * truncate a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be kept in truncated tuple */
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, build);
 
-	return truncated;
+#ifdef DEBUG_NO_TRUNCATE
+	/* Artificially force truncation to always append heap TID */
+	keepnatts = nkeyatts + 1;
+#endif
+
+	if (keepnatts <= natts)
+	{
+		IndexTuple	tidpivot;
+
+		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+
+		/*
+		 * If there is a distinguishing key attribute within keepnatts, there
+		 * is no need to add an explicit heap TID attribute to new pivot.
+		 */
+		if (keepnatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			return pivot;
+		}
+
+		/*
+		 * This must be an INCLUDE index where only non-key attributes could
+		 * be truncated away.  They are not considered part of the key space,
+		 * so it's still necessary to add a heap TID attribute to the new
+		 * pivot tuple.  Create an enlarged copy of the truncated firstright
+		 * tuple, with room at the end for the heap TID.
+		 */
+		Assert(natts != nkeyatts);
+		newsize = MAXALIGN(IndexTupleSize(pivot) + sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		/* cannot leak memory here */
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal, and
+		 * there are no non-key attributes that need to be truncated in
+		 * passing.  It's necessary to add a heap TID attribute to the new
+		 * pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = MAXALIGN(IndexTupleSize(firstright) + sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * We have to use heap TID as a unique-ifier in the new pivot tuple, since
+	 * no non-TID attribute in the right item readily distinguishes the right
+	 * side of the split from the left side.  Use enlarged space that holds a
+	 * copy of first right tuple; place a heap TID value within the extra
+	 * space that remains at the end.
+	 *
+	 * nbtree conceptualizes this case as an inability to truncate away any
+	 * attribute.  We must use an alternative representation of heap TID
+	 * within pivots because heap TID is only treated as an attribute within
+	 * nbtree (e.g., there is no explicit pg_attribute entry).
+	 *
+	 * Callers generally try to avoid choosing a split point that necessitates
+	 * that we do this.  Splits of pages that only involve a single distinct
+	 * value (or set of values) must end up here, though.
+	 */
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/*
+	 * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+	 * consider suffix truncation.  It seems like a good idea to follow that
+	 * example in cases where no truncation takes place -- use lastleft's heap
+	 * TID.  (This is also the closest value to negative infinity that's
+	 * legally usable.)
+	 */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  sizeof(ItemPointerData));
+	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+
+	/*
+	 * Lehman and Yao require that the downlink to the right page, which is to
+	 * be inserted into the parent page in the second phase of a page split,
+	 * be a strict lower bound on items on the right page, and a non-strict
+	 * upper bound for items on the left page.  Assert that heap TIDs follow
+	 * these invariants, since a heap TID value is apparently needed as a
+	 * tiebreaker.
+	 */
+#ifndef DEBUG_NO_TRUNCATE
+	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#else
+
+	/*
+	 * Those invariants aren't guaranteed to hold for lastleft + firstright
+	 * heap TID attribute values when they're considered here only because
+	 * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+	 * needed as a tiebreaker).  DEBUG_NO_TRUNCATE must therefore use a heap
+	 * TID value that always works as a strict lower bound for items to the
+	 * right.  In particular, it must avoid using firstright's leading key
+	 * attribute values along with lastleft's heap TID value when lastleft's
+	 * TID happens to be greater than firstright's TID.
+	 *
+	 * (We could just use all of lastleft instead, but that would complicate
+	 * caller's free space accounting, which makes the assumption that the new
+	 * pivot must be no larger than firstright plus a single MAXALIGN()
+	 * quantum.)
+	 */
+	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+
+	/*
+	 * Pivot heap TID should never be fully equal to firstright.  Note that
+	 * the pivot heap TID will still end up equal to lastleft's heap TID when
+	 * that's the only value that's legally usable.
+	 */
+	ItemPointerSetOffsetNumber(pivotheaptid,
+							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#endif
+
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point.  CREATE INDEX
+ * callers must pass build = true so that we may avoid metapage access.  (This
+ * is okay because CREATE INDEX always creates an index on the latest btree
+ * version.)
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
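+ *
+ * For example, with two key attributes, lastleft (1, 5) and firstright
+ * (2, 7) give keepnatts = 1 (the second attribute can be truncated away),
+ * whereas lastleft and firstright both being (1, 7) gives keepnatts = 3,
+ * telling caller that a heap TID will have to be appended to the new pivot.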
+ */
+static int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			   bool build)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keepnatts;
+	ScanKey		scankey;
+	BTScanInsert key;
+
+	key = _bt_mkscankey(rel, firstright, build);
+
+	/*
+	 * Be consistent about the representation of BTREE_VERSION 3 tuples across
+	 * Postgres versions; don't allow new pivot tuples to have truncated key
+	 * attributes there.  This keeps things consistent and simple for
+	 * verification tools that have to handle multiple versions.
+	 */
+	if (!key->heapkeyspace)
+	{
+		Assert(nkeyatts != IndexRelationGetNumberOfAttributes(rel));
+		return nkeyatts;
+	}
+
+	Assert(key->keysz == nkeyatts);
+	scankey = key->scankeys;
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+	{
+		Datum		datum1;
+		bool		isNull1,
+					isNull2;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		isNull2 = (scankey->sk_flags & SK_ISNULL) != 0;
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+											scankey->sk_collation,
+											datum1,
+											scankey->sk_argument)) != 0)
+			break;
+
+		keepnatts++;
+	}
+
+	/*
+	 * Make sure that an authoritative comparison that considers per-column
+	 * options like ASC/DESC/NULLS FIRST/NULLS LAST indicates that it's okay
+	 * to truncate firstright tuple up to keepnatts -- we expect to get a new
+	 * pivot that's strictly greater than lastleft when truncation can go
+	 * ahead.  (A truncated version of firstright is also bound to be strictly
+	 * less than firstright, since their attributes will be equal prior to one
+	 * or more truncated negative infinity attributes.)
+	 */
+	Assert(keepnatts == nkeyatts + 1 ||
+		   _bt_tuple_compare(rel, key, lastleft, keepnatts) > 0);
+
+	/* Can't leak memory here */
+	pfree(key);
+
+	return keepnatts;
 }
 
 /*
@@ -2123,15 +2401,17 @@ _bt_nonkey_truncate(Relation rel, IndexTuple itup)
  * preferred to calling here.  That's usually more convenient, and is always
  * more explicit.  Call here instead when offnum's tuple may be a negative
  * infinity tuple that uses the pre-v11 on-disk representation, or when a low
- * context check is appropriate.
+ * context check is appropriate.  This routine is as strict as possible about
+ * what is expected on each version of btree.
  */
 bool
-_bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
+_bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 {
 	int16		natts = IndexRelationGetNumberOfAttributes(rel);
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2151,16 +2431,26 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Leaf tuples that are not the page high key (non-pivot tuples)
-			 * should never be truncated
+			 * Non-pivot tuples currently never use alternative heap TID
+			 * representation -- even those within heapkeyspace indexes
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+				return false;
+
+			/*
+			 * Leaf tuples that are not the page high key (non-pivot tuples)
+			 * should never be truncated.  (Note that tupnatts must have been
+			 * inferred, rather than coming from an explicit on-disk
+			 * representation.)
+			 */
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2170,8 +2460,16 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 */
 			Assert(!P_RIGHTMOST(opaque));
 
-			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			/*
+			 * !heapkeyspace high key tuple contains only key attributes.
+			 * Note that tupnatts will only have been explicitly represented
+			 * in !heapkeyspace indexes that happen to have non-key
+			 * attributes.
+			 */
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2183,7 +2481,11 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * its high key) is its negative infinity tuple.  Negative
 			 * infinity tuples are always truncated to zero attributes.  They
 			 * are a particular kind of pivot tuple.
-			 *
+			 */
+			if (heapkeyspace)
+				return tupnatts == 0;
+
+			/*
 			 * The number of attributes won't be explicitly represented if the
 			 * negative infinity tuple was generated during a page split that
 			 * occurred with a version of Postgres before v11.  There must be
@@ -2194,18 +2496,109 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == 0 ||
+			return tupnatts == 0 ||
 				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
 		{
 			/*
-			 * Tuple contains only key attributes despite on is it page high
-			 * key or not
+			 * !heapkeyspace downlink tuple with separator key contains only
+			 * key attributes.  Note that tupnatts will only have been
+			 * explicitly represented in !heapkeyspace indexes that happen to
+			 * have non-key attributes.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 
 	}
+
+	/* Handle heapkeyspace pivot tuples (excluding minus infinity items) */
+	Assert(heapkeyspace);
+
+	/*
+	 * Explicit representation of the number of attributes is mandatory with
+	 * heapkeyspace index pivot tuples, regardless of whether or not there are
+	 * non-key attributes.
+	 */
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+
+	/*
+	 * Heap TID is a tie-breaker key attribute, so it cannot be untruncated
+	 * when any other key attribute is truncated
+	 */
+	if (BTreeTupleGetHeapTID(itup) != NULL && tupnatts != nkeyatts)
+		return false;
+
+	/*
+	 * Pivot tuple must have at least one untruncated key attribute (minus
+	 * infinity pivot tuples are the only exception).  Pivot tuples can never
+	 * represent that there is a value present for a key attribute that
+	 * exceeds pg_index.indnkeyatts for the index.
+	 */
+	return tupnatts > 0 && tupnatts <= nkeyatts;
+}
+
+/*
+ *
+ *  _bt_check_third_page() -- check whether tuple fits on a btree page at all.
+ *
+ * We actually need to be able to fit three items on every page, so restrict
+ * any one item to 1/3 the per-page available space.  Note that itemsz should
+ * not include the ItemId overhead.
+ *
+ * It might be useful to apply TOAST methods rather than throw an error here.
+ * Using out of line storage would break assumptions made by suffix truncation
+ * and by contrib/amcheck, though.
+ */
+void
+_bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
+					 Page page, IndexTuple newtup)
+{
+	Size		itemsz;
+	BTPageOpaque opaque;
+
+	itemsz = MAXALIGN(IndexTupleSize(newtup));
+
+	/* Double check item size against limit */
+	if (itemsz <= BTMaxItemSize(page))
+		return;
+
+	/*
+	 * Tuple is probably too large to fit on page, but it's possible that the
+	 * index uses version 2 or version 3, or that page is an internal page, in
+	 * which case a slightly higher limit applies.
+	 */
+	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
+		return;
+
+	/*
+	 * Internal page insertions cannot fail here, because that would mean that
+	 * an earlier leaf level insertion that should have failed didn't
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (!P_ISLEAF(opaque))
+		elog(ERROR, "cannot insert oversized tuple of size %zu on internal page of index \"%s\"",
+			 itemsz, RelationGetRelationName(rel));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+			 errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
+					itemsz,
+					needheaptidspace ? BTREE_VERSION : BTREE_NOVAC_VERSION,
+					needheaptidspace ? BTMaxItemSize(page) :
+					BTMaxItemSizeNoHeapTid(page),
+					RelationGetRelationName(rel)),
+			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+					   ItemPointerGetBlockNumber(&newtup->t_tid),
+					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   RelationGetRelationName(heap)),
+			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+					 "Consider a function index of an MD5 hash of the value, "
+					 "or use full text indexing."),
+			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index b0666b42df..876ff0c40f 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -103,7 +103,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 
 	md = BTPageGetMeta(metapg);
 	md->btm_magic = BTREE_MAGIC;
-	md->btm_version = BTREE_VERSION;
+	md->btm_version = xlrec->version;
 	md->btm_root = xlrec->root;
 	md->btm_level = xlrec->level;
 	md->btm_fastroot = xlrec->fastroot;
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -284,6 +268,8 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		OffsetNumber off;
 		IndexTuple	newitem = NULL;
 		Size		newitemsz = 0;
+		IndexTuple	left_hikey = NULL;
+		Size		left_hikeysz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 8d5c6ae0ab..fcac0cd8a9 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 489eee095e..cf603d6944 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4057,9 +4057,10 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required for
+	 * btree indexes, since heap TID is treated as an implicit last key
+	 * attribute in order to ensure that all keys in the index are physically
+	 * unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
@@ -4076,6 +4077,9 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
@@ -4128,6 +4132,9 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c8fd036c9e..d0eb68e3d2 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -112,18 +112,53 @@ typedef struct BTMetaPageData
 #define BTPageGetMeta(p) \
 	((BTMetaPageData *) PageGetContents(p))
 
+/*
+ * Btree version 4 (used by indexes initialized by PostgreSQL v12) made
+ * general changes to the on-disk representation to add support for
+ * heapkeyspace semantics, necessitating a REINDEX to get heapkeyspace
+ * semantics in pg_upgrade scenarios.  We continue to offer support for
+ * BTREE_MIN_VERSION in order to support upgrades from PostgreSQL versions
+ * up to and including v10 to v12+ without requiring a REINDEX.
+ * Similarly, we continue to offer support for BTREE_NOVAC_VERSION to
+ * support upgrades from v11 to v12+ without requiring a REINDEX.
+ *
+ * We maintain PostgreSQL v11's ability to upgrade from BTREE_MIN_VERSION
+ * to BTREE_NOVAC_VERSION automatically.  v11's "no vacuuming" enhancement
+ * (the ability to skip full index scans during vacuuming) only requires
+ * two new metapage fields, which makes it possible to upgrade at any
+ * point that the metapage must be updated anyway (e.g. during a root page
+ * split).  Note also that there happened to be no changes in metapage
+ * layout for btree version 4.  All current metapage fields should have
+ * valid values set when a metapage WAL record is replayed.
+ *
+ * It's convenient to consider the "no vacuuming" enhancement (metapage
+ * layout compatibility) separately from heapkeyspace semantics, since
+ * each issue affects different areas.  This is just a convention; in
+ * practice a heapkeyspace index is always also a "no vacuuming" index.
+ */
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
+#define BTREE_VERSION	4		/* current version number */
 #define BTREE_MIN_VERSION	2	/* minimal supported version number */
+#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_truncate() will need to enlarge
+ * a heap index tuple to make space for a tie-breaker heap TID
+ * attribute, which we account for here.
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*sizeof(ItemPointerData)) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeNoHeapTid(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
@@ -203,22 +238,25 @@ typedef struct BTMetaPageData
  * their item pointer offset field, since pivot tuples never need to store a
  * real offset (downlinks only need to store a block number).  The offset
  * field only stores the number of attributes when the INDEX_ALT_TID_MASK
- * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * bit is set, though that number doesn't include the trailing heap TID
+ * attribute sometimes stored in pivot tuples -- that's represented by the
+ * presence of BT_HEAP_TID_ATTR.  INDEX_ALT_TID_MASK is only used for pivot
+ * tuples at present, though it's possible that it will be used within
+ * non-pivot tuples in the future.  All pivot tuples must have
+ * INDEX_ALT_TID_MASK set as of BTREE_VERSION 4.
  *
  * The 12 least significant offset bits are used to represent the number of
- * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
- * for future use (BT_RESERVED_OFFSET_MASK bits). BT_N_KEYS_OFFSET_MASK should
- * be large enough to store any number <= INDEX_MAX_KEYS.
+ * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 status bits
+ * (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for future
+ * use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any number of
+ * attributes <= INDEX_MAX_KEYS.
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
+
+/* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +279,16 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.  Note also that the number of key attributes must be
+ * explicitly represented in heapkeyspace pivot tuples.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +297,46 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
+ * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
+ * (since all pivot tuples must as of BTREE_VERSION 4).  When non-pivot
+ * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
+ * probably also contain a heap TID at the end of the tuple.  We currently
+ * assume that a tuple with INDEX_ALT_TID_MASK set is a pivot tuple within
+ * heapkeyspace indexes (and that a tuple without it set must be a non-pivot
+ * tuple), but it might also be used by non-pivot tuples in the future.
+ * pg_upgrade'd !heapkeyspace indexes only set INDEX_ALT_TID_MASK in pivot
+ * tuples that actually originated with the truncation of one or more
+ * attributes.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   sizeof(ItemPointerData)) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -326,25 +402,53 @@ typedef BTStackData *BTStack;
  * _bt_search.  For details on its mutable state, see _bt_binsrch and
  * _bt_findinsertloc.
  *
+ * heapkeyspace indicates if we expect all keys in the index to be unique by
+ * treating heap TID as a tie-breaker attribute (i.e. the index is
+ * BTREE_VERSION 4+).  scantid should never be set when index is not a
+ * heapkeyspace index.
+ *
+ * minusinfkey controls an optimization used by heapkeyspace indexes.  When
+ * minusinfkey is false (the usual case), _bt_tuple_compare will consider a
+ * scankey greater than a pivot tuple where all explicitly represented
+ * attributes are equal to the scankey, provided that the pivot tuple has at
+ * least one attribute truncated away (this is often just the heap TID
+ * attribute).  We exploit the fact that minus infinity is a value that only
+ * appears in pivot tuples (to make suffix truncation work), and is therefore
+ * not interesting (page deletion by VACUUM is the one case where the
+ * optimization cannot be used, since a leaf page is relocated using its high
+ * key).  This optimization allows us to get the full benefit of suffix
+ * truncation, particularly with indexes where each distinct set of user
+ * attribute keys appear in at least a few duplicate entries.
+ *
  * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
  * locate the first item >= scankey.  When nextkey is true, they will locate
  * the first item > scan key.
  *
- * keysz is the number of insertion scankeys present.
+ * scantid is the heap TID that is used as a final tie-breaker attribute,
+ * which may be set to NULL to indicate its absence.  When inserting new
+ * tuples, it must be set, since every tuple in the tree unambiguously belongs
+ * in one exact position, even when there are entries in the tree that are
+ * considered duplicates by external code.  Unique insertions set scantid only
+ * after unique checking indicates that it's safe to insert.  Despite the
+ * representational difference, scantid is just another insertion scankey to
+ * routines like _bt_search.
  *
- * scankeys is an array of scan key entries for attributes that are compared.
- * During insertion, there must be a scan key for every attribute, but when
- * starting a regular index scan some can be omitted.  The array is used as a
- * flexible array member, though it's sized in a way that makes it possible to
- * use stack allocations.  See nbtree/README for full details.
+ * keysz is the number of insertion scankeys present, not including scantid.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared
+ * before scantid (user-visible attributes).  During insertion, there must be
+ * a scan key for every attribute, but when starting a regular index scan some
+ * can be omitted.  The array is used as a flexible array member, though it's
+ * sized in a way that makes it possible to use stack allocations.  See
+ * nbtree/README for full details.
  */
 
 typedef struct BTScanInsertData
 {
 	/*
 	 * Mutable state used by _bt_binsrch to inexpensively repeat a binary
-	 * search on the leaf level.  Only used for insertions where
-	 * _bt_check_unique is called.
+	 * search on the leaf level when only scantid has changed.  Only used for
+	 * insertions where _bt_check_unique is called.
 	 */
 	bool		savebinsrch;
 	bool		restorebinsrch;
@@ -352,7 +456,10 @@ typedef struct BTScanInsertData
 	OffsetNumber high;
 
 	/* State used to locate a position at the leaf level */
+	bool		heapkeyspace;
+	bool		minusinfkey;
 	bool		nextkey;
+	ItemPointer scantid;		/* tiebreaker for scankeys */
 	int			keysz;			/* Size of scankeys */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
 } BTScanInsertData;
@@ -582,6 +689,7 @@ extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
+extern bool _bt_heapkeyspace(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -606,7 +714,7 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
 extern OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern int32 _bt_tuple_compare(Relation rel, BTScanInsert key, IndexTuple itup,
-				  int ncmpkey);
+				  int ntupatts);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -615,7 +723,7 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 /*
  * prototypes for functions in nbtutils.c
  */
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup, bool build);
 extern ScanKey _bt_mkscankey_nodata(Relation rel);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
@@ -638,8 +746,12 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
-extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+			 IndexTuple firstright, bool build);
+extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
+					OffsetNumber offnum);
+extern void _bt_check_third_page(Relation rel, Relation heap,
+					 bool needheaptidspace, Page page, IndexTuple newtup);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index a605851c98..a4cbdff283 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,7 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50 and 0x60 are unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -47,6 +46,7 @@
  */
 typedef struct xl_btree_metadata
 {
+	uint32		version;
 	BlockNumber root;
 	uint32		level;
 	BlockNumber fastroot;
@@ -82,20 +82,16 @@ typedef struct xl_btree_insert
  *
  * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
  * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always explicitly log the left page high key value.
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * In the _R variants, the new item is one of the right page's tuples.  An
+ * IndexTuple representing the HIKEY of the left page always follows, since
+ * suffix truncation means that it is no longer necessarily the same as the
+ * leftmost key in the new right page.
  *
  * Backup Blk 1: new right page
  *
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8e50f8ffbb..8d31110b87 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 -- doesn't work: grant still exists
 DROP USER regress_dep_user1;
 ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
-DETAIL:  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
+DETAIL:  privileges for table deptest1
 privileges for database regression
-privileges for table deptest1
+owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
 DROP OWNED BY regress_dep_user1;
 DROP USER regress_dep_user1;
 \set VERBOSITY terse
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index 0e32d5c427..ac41419c7b 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -187,9 +187,9 @@ ERROR:  event trigger "regress_event_trigger" does not exist
 -- should fail, regress_evt_user owns some objects
 drop role regress_evt_user;
 ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
-DETAIL:  owner of event trigger regress_event_trigger3
+DETAIL:  owner of user mapping for regress_evt_user on server useless_server
 owner of default privileges on new relations belonging to role regress_evt_user
-owner of user mapping for regress_evt_user on server useless_server
+owner of event trigger regress_event_trigger3
 -- cleanup before next test
 -- these are all OK; the second one should emit a NOTICE
 drop event trigger if exists regress_event_trigger2;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 4d82d3a7e8..9c763ec184 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
 RESET ROLE;
 DROP ROLE regress_test_indirect;                            -- ERROR
 ERROR:  role "regress_test_indirect" cannot be dropped because some objects depend on it
-DETAIL:  owner of server s1
-privileges for foreign-data wrapper foo
+DETAIL:  privileges for foreign-data wrapper foo
+owner of server s1
 \des+
                                                                                  List of foreign servers
  Name |           Owner           | Foreign-data wrapper |                   Access privileges                   |  Type  | Version |             FDW options              | Description 
@@ -1998,9 +1998,9 @@ DROP TABLE temp_parted;
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 ERROR:  role "regress_test_role" cannot be dropped because some objects depend on it
-DETAIL:  privileges for server s4
+DETAIL:  owner of user mapping for regress_test_role on server s6
 privileges for foreign-data wrapper foo
-owner of user mapping for regress_test_role on server s6
+privileges for server s4
 DROP SERVER t1 CASCADE;
 NOTICE:  drop cascades to user mapping for public on server t1
 DROP USER MAPPING FOR regress_test_role SERVER s6;
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index 1d12b01068..06fe44d39a 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3502,8 +3502,8 @@ SELECT refclassid::regclass, deptype
 SAVEPOINT q;
 DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
 ERROR:  role "regress_rls_eve" cannot be dropped because some objects depend on it
-DETAIL:  target of policy p on table tbl1
-privileges for table tbl1
+DETAIL:  privileges for table tbl1
+target of policy p on table tbl1
 ROLLBACK TO q;
 ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
 SAVEPOINT q;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9fe950b29d..08cf72d670 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -167,6 +167,8 @@ BTArrayKeyInfo
 BTBuildState
 BTCycleId
 BTIndexStat
+BTInsertionKey
+BTInsertionKeyData
 BTLeader
 BTMetaPageData
 BTOneVacInfo
@@ -2207,6 +2209,8 @@ SpecialJoinInfo
 SpinDelayStatus
 SplitInterval
 SplitLR
+SplitMode
+SplitPoint
 SplitVar
 SplitedPageLayout
 StackElem
-- 
2.17.1

v10-0007-DEBUG-Add-pageinspect-instrumentation.patch (application/x-patch)
From f4cb08cf2e7cff946d270c0479f957cbb9a22f4b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v10 7/7] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 68 +++++++++++++++----
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 ++++++
 3 files changed, 79 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 251be13b65..ab60a8942e 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_am.h"
@@ -242,6 +243,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -253,9 +255,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[7];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -264,6 +266,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer htid;
+	BTPageOpaque opaque;
 
 	id = PageGetItemId(page, offset);
 
@@ -282,16 +286,53 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = natts;
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		indnkeyatts = rel->rd_index->indnkeyatts;
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (P_ISLEAF(opaque) && offset >= P_FIRSTDATAKEY(opaque))
+		htid = &itup->t_tid;
+	else if (_bt_heapkeyspace(rel))
+		htid = BTreeTupleGetHeapTID(itup);
+	else
+		htid = NULL;
+
+	if (htid)
+		values[j] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(htid),
+							 ItemPointerGetOffsetNumberNoCheck(htid));
+	else
+		values[j] = NULL;
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -365,11 +406,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -396,12 +437,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -481,7 +523,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

v10-0006-Add-high-key-continuescan-optimization.patch (application/x-patch)
From 5d1bf02d8d92ba3b76de069c5c11ec04dfa6e204 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 12 Nov 2018 13:11:21 -0800
Subject: [PATCH v10 6/7] Add high key "continuescan" optimization.

Teach B-Tree forward index scans to check the high key before moving to
the next page in the hopes of finding that it isn't actually necessary
to move to the next page.  We already opportunistically force a key
check of the last item on leaf pages, even when it's clear that it
cannot be returned to the scan due to being dead-to-all, for the same
reason.  Since forcing the last item to be key checked no longer makes
any difference in the case of forward scans, the existing extra key
check is now only used for backwards scans.  Like the existing check,
the new check won't always work out, but that seems like an acceptable
price to pay.

The new approach is more effective than just checking non-pivot tuples,
especially with composite indexes and non-unique indexes.  The high key
represents an upper bound on all values that can appear on the page,
which is often greater than whatever tuple happens to appear last at the
time of the check.  Also, suffix truncation's new logic for picking a
split point will often result in high keys that are relatively
dissimilar to the other (non-pivot) tuples on the page, and therefore
more likely to indicate that the scan need not proceed to the next page.

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.
---
 src/backend/access/nbtree/nbtsearch.c | 23 +++++++---
 src/backend/access/nbtree/nbtutils.c  | 60 +++++++++++++++++++++------
 2 files changed, 65 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 701115f5b9..ea2db7f818 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1367,7 +1367,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	OffsetNumber maxoff;
 	int			itemIndex;
 	IndexTuple	itup;
-	bool		continuescan;
+	bool		continuescan = true;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1435,16 +1435,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 				_bt_saveitem(so, itemIndex, offnum, itup);
 				itemIndex++;
 			}
+			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
-			{
-				/* there can't be any more matches, so stop */
-				so->currPos.moreRight = false;
 				break;
-			}
 
 			offnum = OffsetNumberNext(offnum);
 		}
 
+		/*
+		 * Forward scans need not visit the page to the right when the high
+		 * key indicates that no more matches will be found there.
+		 *
+		 * Checking the high key like this works out more often than you'd
+		 * think.  Leaf page splits pick a split point between the two most
+		 * dissimilar tuples within a range of acceptable split points.  There
+		 * is often natural locality around what ends up on each leaf page,
+		 * which is worth taking advantage of here.
+		 */
+		if (!P_RIGHTMOST(opaque) && continuescan)
+			(void) _bt_checkkeys(scan, page, P_HIKEY, dir, &continuescan);
+
+		if (!continuescan)
+			so->currPos.moreRight = false;
+
 		Assert(itemIndex <= MaxIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 43f7bfcb44..7839be62a5 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -48,7 +48,7 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
 static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
-					 IndexTuple tuple, TupleDesc tupdesc,
+					 IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
 static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
 			   IndexTuple firstright, bool build);
@@ -1411,7 +1411,10 @@ _bt_mark_scankey_required(ScanKey skey)
  * dir: direction we are scanning in
  * continuescan: output parameter (will be set correctly in all cases)
  *
- * Caller must hold pin and lock on the index page.
+ * Caller must hold pin and lock on the index page.  Caller can pass a high
+ * key offnum in the hopes of discovering that the scan need not continue on
+ * to a page to the right.  We don't currently bother limiting high key
+ * comparisons to SK_BT_REQFWD scan keys.
  */
 IndexTuple
 _bt_checkkeys(IndexScanDesc scan,
@@ -1421,6 +1424,7 @@ _bt_checkkeys(IndexScanDesc scan,
 	ItemId		iid = PageGetItemId(page, offnum);
 	bool		tuple_alive;
 	IndexTuple	tuple;
+	int			tupnatts;
 	TupleDesc	tupdesc;
 	BTScanOpaque so;
 	int			keysz;
@@ -1434,21 +1438,24 @@ _bt_checkkeys(IndexScanDesc scan,
 	 * killed tuple as not passing the qual.  Most of the time, it's a win to
 	 * not bother examining the tuple's index keys, but just return
 	 * immediately with continuescan = true to proceed to the next tuple.
-	 * However, if this is the last tuple on the page, we should check the
-	 * index keys to prevent uselessly advancing to the next page.
+	 * However, if this is the first tuple on the page, and we're doing a
+	 * backward scan, we should check the index keys to prevent uselessly
+	 * advancing to the page to the left.  This is similar to the high key
+	 * optimization used by forward scan callers.
 	 */
 	if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
 	{
-		/* return immediately if there are more tuples on the page */
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+		Assert(offnum != P_HIKEY || P_RIGHTMOST(opaque));
 		if (ScanDirectionIsForward(dir))
 		{
-			if (offnum < PageGetMaxOffsetNumber(page))
-				return NULL;
+			/* forward scan callers check high key instead */
+			return NULL;
 		}
 		else
 		{
-			BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
+			/* return immediately if there are more tuples on the page */
 			if (offnum > P_FIRSTDATAKEY(opaque))
 				return NULL;
 		}
@@ -1463,6 +1470,7 @@ _bt_checkkeys(IndexScanDesc scan,
 		tuple_alive = true;
 
 	tuple = (IndexTuple) PageGetItem(page, iid);
+	tupnatts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
 
 	tupdesc = RelationGetDescr(scan->indexRelation);
 	so = (BTScanOpaque) scan->opaque;
@@ -1474,11 +1482,24 @@ _bt_checkkeys(IndexScanDesc scan,
 		bool		isNull;
 		Datum		test;
 
-		Assert(key->sk_attno <= BTreeTupleGetNAtts(tuple, scan->indexRelation));
+		/*
+		 * Assume that a truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (key->sk_attno > tupnatts)
+		{
+			Assert(offnum == P_HIKEY);
+			Assert(ScanDirectionIsForward(dir));
+			continue;
+		}
+
 		/* row-comparison keys need special processing */
 		if (key->sk_flags & SK_ROW_HEADER)
 		{
-			if (_bt_check_rowcompare(key, tuple, tupdesc, dir, continuescan))
+			if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+									 continuescan))
 				continue;
 			return NULL;
 		}
@@ -1609,8 +1630,8 @@ _bt_checkkeys(IndexScanDesc scan,
  * This is a subroutine for _bt_checkkeys, which see for more info.
  */
 static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
-					 ScanDirection dir, bool *continuescan)
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
 {
 	ScanKey		subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
 	int32		cmpresult = 0;
@@ -1627,6 +1648,19 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
 
 		Assert(subkey->sk_flags & SK_ROW_MEMBER);
 
+		/*
+		 * Assume that a truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (subkey->sk_attno > tupnatts)
+		{
+			Assert(ScanDirectionIsForward(dir));
+			cmpresult = 0;
+			continue;
+		}
+
 		datum = index_getattr(tuple,
 							  subkey->sk_attno,
 							  tupdesc,
-- 
2.17.1

In reply to: Peter Geoghegan (#51)
7 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Jan 8, 2019 at 4:47 PM Peter Geoghegan <pg@bowt.ie> wrote:

> Attached is v10 of the patch series, which has many changes based on
> your feedback. However, I didn't end up refactoring _bt_findsplitloc()
> in the way you described, because it seemed hard to balance all of the
> concerns there. I still have an open mind on this question, and
> recognize the merit in what you suggested. Perhaps it's possible to
> reach a compromise here.
>
> * Addresses Heikki's concerns about locking the metapage more
> frequently in a general way. Comments are added to nbtpage.c, and
> updated in a number of places that already talk about the same risk.

Attached is v11 of the patch, which removes the comments mentioned
here, and instead finds a way to not do new things with buffer locks.

Changes
=======

* We simply avoid holding buffer locks while accessing the metapage.
(Of course, the old root page split stuff still does this -- it has
worked that way forever.)

* We also avoid calling index_getprocinfo() with any buffer lock held.
We'll still call support function 1 with a buffer lock held to
truncate, but that's not new -- *any* insertion will do that.

For some reason I got stuck on the idea that we need to use a
scankey's own values within _bt_truncate()/_bt_keep_natts() by
generating a new insertion scankey every time. We now simply ignore
those values, and call the comparator with pairs of tuples that each
come from the page directly. Usually, we'll just reuse the insertion
scankey that we were using for the insertion anyway (we no longer
build our own scankey for truncation). Other times, we'll build an
empty insertion scankey (one that has the function pointer and so on,
but no values). The only downside is that I cannot have an assertion
that calls _bt_compare() to make sure we truncated correctly there and
then, since a dedicated insertion scankey is no longer conveniently
available.

I feel rather silly for not having gone this way from the beginning,
because the new approach is quite obviously simpler and safer.
nbtsort.c now gets a reusable, valueless insertion scankey that it
uses for both truncation and for setting up a merge of the two spools
for unique index builds. This approach allows me to remove
_bt_mkscankey_nodata() altogether -- callers build a "nodata"
insertion scankey with empty values by passing _bt_mkscankey() a NULL
tuple. That's equivalent to having an insertion scankey built from an
all-attributes-truncated tuple. IOW, the patch now makes the "nodata"
stuff a degenerate case of building a scankey from a
truncated-attributes tuple. tuplesort.c also uses the new BTScanInsert
struct. There is no longer any case where there is an insertion
scankey that isn't accessed using the BTScanInsert struct.
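
To make the intended call pattern concrete, here is a rough sketch of
what an nbtsort.c-style caller might do (a sketch only, not code lifted
from the v11 patch; it follows the v10 prototype _bt_mkscankey(Relation
rel, IndexTuple itup, bool build) shown in the nbtree.h diff above, and
the exact v11 signature may differ -- indexRel and inskey are just
placeholder names):

    /*
     * Sketch only: build one reusable "valueless" insertion scankey by
     * passing a NULL tuple.  Comparator and collation setup still
     * happens, but no attribute values are filled in.
     */
    BTScanInsert inskey = _bt_mkscankey(indexRel, NULL, true);

    /* ... use inskey to merge the two spools for a unique index build ... */
    /* ... reuse the same inskey when truncating new pivot tuples ... */

    pfree(inskey);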

* No more pg_depend tie-breaker column commit. That was an ugly hack,
that I'm glad to be rid of -- many thanks to Tom for working through a
number of test instability issues that affected the patch. I do still
need to paper over one remaining regression test issue/bug that the
patch happens to unmask, pending Tom fixing it directly [1]. This
papering-over is broken out into its own commit
("v11-0002-Paper-over-DEPENDENCY_INTERNAL_AUTO-bug-failures.patch"). I
expect that Tom will fix the bug before too long, at which point the
temporary workaround can just be reverted from your local tree.

Outlook
=======

I feel that this version is pretty close to being commitable, since
everything about the design is settled. It completely avoids saying
anything new about buffer locking protocols, LWLock deadlock safety,
etc. VACUUM and crash recovery are also unchanged, so subtle bugs
should at least not be too hard to reproduce when observed once. It's
pretty complementary code: the new logic for picking a split point
builds a list of candidate split points using the old technique, with
a second pass to choose the best one for suffix truncation among the
accumulated list. Hard to see how that could introduce an invalid
split point choice.

I take the risk of introducing new corruption bugs very seriously.
contrib/amcheck now verifies all aspects of the new on-disk
representation. The stricter Lehman & Yao style invariant ("the
subtree S is described by Ki < v <= Ki + 1 ...") allows amcheck to be
stricter in what it will accept (e.g., heap TID needs to be in order
among logical duplicates, we always expect to see a representation of
the number of pivot tuple attributes, and we expect the high key to be
strictly greater than items on internal pages).
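
As a toy illustration of the first of those checks (heap TID order
among logical duplicates) -- this is not amcheck's real code, just a
self-contained model where a single int key plus a (block, offset)
pair stand in for a real index tuple and its heap TID:

    #include <stdbool.h>

    /* Toy stand-in for a leaf index tuple: one key attribute plus a heap TID */
    typedef struct ToyTuple
    {
        int         key;        /* user-visible key attribute */
        unsigned    block;      /* heap TID block number */
        unsigned    offset;     /* heap TID offset number */
    } ToyTuple;

    /*
     * Adjacent leaf items must be in key order; when the user-visible keys
     * are equal ("logical duplicates"), the tie-breaker heap TID must
     * strictly increase, since heap TID is now part of the key space.
     */
    static bool
    adjacent_pair_ok(const ToyTuple *cur, const ToyTuple *next)
    {
        if (cur->key != next->key)
            return cur->key < next->key;
        if (cur->block != next->block)
            return cur->block < next->block;
        return cur->offset < next->offset;
    }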

Review
======

It would be very helpful if a reviewer such as Heikki or Alexander
could take a look at the patch once more. I suggest that they look at
the following points in the patch:

* The minusinfkey stuff, which is explained within _bt_compare(), and
within nbtree.h header comments. Page deletion by VACUUM is the only
_bt_search() caller that sets minusinfkey to true (though older
versions of btree and amcheck also set minusinfkey to true). A toy
model of this comparison rule appears just after this list of review
points.

* Does the value of BTREE_SINGLEVAL_FILLFACTOR make sense? Am I being
a little too aggressive there, possibly hurting workloads where HOT
pruning occurs periodically? Sane duplicate handling is the most
compelling improvement that the patch makes, but I may still have been
a bit too aggressive in packing pages full of duplicates so tightly. I
figured that was the closest thing to the previous behavior that is
still reasonable.

* Do the _bt_splitatnewitem() criteria for deciding if we should
split at the point where the new tuple is positioned miss some
subtlety? It's important that we don't split at the new item when we
shouldn't -- or that we at least hardly ever do -- the behavior should
be *self-limiting* (a rough illustration of what I mean appears after
this list). This was tested using BenchmarkSQL/TPC-C [2] -- TPC-C has
a workload where this particular enhancement makes indexes a lot
smaller.

* There was also testing of index bloat following bulk insertions,
based on my own custom test suite. Data and indexes were taken from
TPC-C tables, TPC-H tables, TPC-E tables, UK land registry data [3],
and the Mouse Genome Database Project (Postgres schema + indexes) [4].
This takes almost an hour to run on my development machine, though the
most important tests finish in less than 5 minutes. I can provide
access to all or some of these tests, if reviewers are interested and
are willing to download several gigabytes of sample data that I'll
provide privately.
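
To illustrate the kind of self-limiting behavior the
_bt_splitatnewitem() point above is about, here is an invented sketch.
The name and the specific conditions are made up for this email -- the
actual criteria in the patch are stricter and look at more page-level
context -- so this only shows the general idea:

    /* Invented illustration -- not the patch's actual _bt_splitatnewitem() */
    static bool
    maybe_split_at_new_item(Page page, OffsetNumber newitemoff)
    {
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

        /* only ever consider leaf pages */
        if (!P_ISLEAF(opaque))
            return false;

        /*
         * Only consider splitting at the new item when it lands at the very
         * end of the page -- the pattern produced by localized,
         * ever-increasing insertions (e.g. TPC-C order lines).  A wrong
         * guess costs one lopsided split, and later inserts on the resulting
         * pages won't keep satisfying this test, which is what makes the
         * behavior self-limiting.
         */
        if (newitemoff <= maxoff)
            return false;

        return true;
    }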

[1]: https://postgr.es/m/19220.1547767251@sss.pgh.pa.us
[2]: https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations
[3]: https://wiki.postgresql.org/wiki/Sample_Databases
[4]: http://www.informatics.jax.org/software.shtml
--
Peter Geoghegan

Attachments:

v11-0002-Paper-over-DEPENDENCY_INTERNAL_AUTO-bug-failures.patch
From 56f5b91e356736390be76859b52590a4de199c17 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 21 Jan 2019 12:43:55 -0800
Subject: [PATCH v11 2/7] Paper-over DEPENDENCY_INTERNAL_AUTO bug failures.

The heap key space commit (which follows this commit) causes existing
regression tests to fail by unmasking an existing bug relating to the
creation of multiple DEPENDENCY_INTERNAL_AUTO entries [1] by table
partitioning.  Paper over those failures as a short-term workaround to
keep the tests passing.  It's expected that a direct fix for the
underlying issue will be committed shortly, making the workaround
unnecessary.

[1] https://postgr.es/m/19220.1547767251@sss.pgh.pa.us
---
 src/test/regress/expected/indexing.out | 12 ++++++------
 src/test/regress/expected/triggers.out | 12 ++++++------
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 118f2c78df..f685e96b24 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -155,8 +155,8 @@ create table idxpart (a int) partition by range (a);
 create index on idxpart (a);
 create table idxpart1 partition of idxpart for values from (0) to (10);
 drop index idxpart1_a_idx;	-- no way
-ERROR:  cannot drop index idxpart1_a_idx because index idxpart_a_idx requires it
-HINT:  You can drop index idxpart_a_idx instead.
+ERROR:  cannot drop index idxpart1_a_idx because column a of table idxpart1 requires it
+HINT:  You can drop column a of table idxpart1 instead.
 drop index idxpart_a_idx;	-- both indexes go away
 select relname, relkind from pg_class
   where relname like 'idxpart%' order by relname;
@@ -1005,11 +1005,11 @@ select indrelid::regclass, indexrelid::regclass, inhparent::regclass, indisvalid
 (3 rows)
 
 drop index idxpart0_pkey;								-- fail
-ERROR:  cannot drop index idxpart0_pkey because index idxpart_pkey requires it
-HINT:  You can drop index idxpart_pkey instead.
+ERROR:  cannot drop index idxpart0_pkey because constraint idxpart0_pkey on table idxpart0 requires it
+HINT:  You can drop constraint idxpart0_pkey on table idxpart0 instead.
 drop index idxpart1_pkey;								-- fail
-ERROR:  cannot drop index idxpart1_pkey because index idxpart_pkey requires it
-HINT:  You can drop index idxpart_pkey instead.
+ERROR:  cannot drop index idxpart1_pkey because constraint idxpart1_pkey on table idxpart1 requires it
+HINT:  You can drop constraint idxpart1_pkey on table idxpart1 instead.
 alter table idxpart0 drop constraint idxpart0_pkey;		-- fail
 ERROR:  cannot drop inherited constraint "idxpart0_pkey" of relation "idxpart0"
 alter table idxpart1 drop constraint idxpart1_pkey;		-- fail
diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
index f561de9222..92fbd27da6 100644
--- a/src/test/regress/expected/triggers.out
+++ b/src/test/regress/expected/triggers.out
@@ -1888,14 +1888,14 @@ select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
 (4 rows)
 
 drop trigger trg1 on trigpart1;	-- fail
-ERROR:  cannot drop trigger trg1 on table trigpart1 because trigger trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart1 because table trigpart1 requires it
+HINT:  You can drop table trigpart1 instead.
 drop trigger trg1 on trigpart2;	-- fail
-ERROR:  cannot drop trigger trg1 on table trigpart2 because trigger trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart2 because table trigpart2 requires it
+HINT:  You can drop table trigpart2 instead.
 drop trigger trg1 on trigpart3;	-- fail
-ERROR:  cannot drop trigger trg1 on table trigpart3 because trigger trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart3 because table trigpart3 requires it
+HINT:  You can drop table trigpart3 instead.
 drop table trigpart2;			-- ok, trigger should be gone in that partition
 select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
   where tgrelid::regclass::text like 'trigpart%' order by tgrelid::regclass::text;
-- 
2.17.1

v11-0001-Refactor-nbtree-insertion-scankeys.patch
From bfe212c605184a940c936526e857eb9f1d0552e9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 29 Dec 2018 15:34:48 -0800
Subject: [PATCH v11 1/7] Refactor nbtree insertion scankeys.

Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.  (It also allows us to store mutable state in
the insertion scankey, which may one day be used to reduce the amount of
calls to support function 1 comparators during the initial descent of a
B-Tree index [1].)

Use the new struct to store mutable state about an in-progress binary
search, rather than having _bt_check_unique() callers cache
_bt_binsrch() effort in an ad-hoc manner.  This makes it easy to add a
new optimization: _bt_check_unique() calls now fall out immediately in
the common case where it's already clear that there couldn't possibly be
a duplicate.

Importantly, the new _bt_check_unique() scheme makes it a lot easier to
manage cached binary search effort within _bt_findinsertloc().  This
matters a lot to the upcoming patch to make nbtree tuples unique by
treating heap TID as a tie-breaker attribute, since it allows
_bt_findinsertloc() to sensibly deal with pre-pg_upgrade indexes and
unique indexes as special cases.

Based on a suggestion by Andrey Lepikhov.

[1] https://postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
---
 contrib/amcheck/verify_nbtree.c       |  52 ++---
 src/backend/access/nbtree/nbtinsert.c | 317 +++++++++++++++-----------
 src/backend/access/nbtree/nbtpage.c   |  11 +-
 src/backend/access/nbtree/nbtsearch.c | 186 +++++++++------
 src/backend/access/nbtree/nbtsort.c   |   8 +-
 src/backend/access/nbtree/nbtutils.c  |  96 +++-----
 src/backend/utils/sort/tuplesort.c    |  16 +-
 src/include/access/nbtree.h           |  61 +++--
 8 files changed, 418 insertions(+), 329 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 1c7466c815..9ec023fae3 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -126,9 +126,9 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
-static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+static BTScanInsert bt_right_page_check_scankey(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
@@ -136,14 +136,14 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber upperbound);
 static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber lowerbound);
 static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
+							   BTScanInsert key,
+							   Page nontarget,
 							   OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
 
@@ -835,8 +835,8 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		BTScanInsert skey;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -1018,7 +1018,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
-			ScanKey		rightkey;
+			BTScanInsert	rightkey;
 
 			/* Get item in next/right page */
 			rightkey = bt_right_page_check_scankey(state);
@@ -1070,7 +1070,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, skey, childblock);
 		}
 	}
 
@@ -1099,11 +1099,12 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
+static BTScanInsert
 bt_right_page_check_scankey(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
+	IndexTuple	firstitup;
 	BlockNumber targetnext;
 	Page		rightpage;
 	OffsetNumber nline;
@@ -1291,8 +1292,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * Return first real item scankey.  Note that this relies on right page
 	 * memory remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
+	return _bt_mkscankey(state->rel, firstitup);
 }
 
 /*
@@ -1305,8 +1306,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  * verification this way around is much more practical.
  */
 static void
-bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1411,8 +1412,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1760,13 +1760,12 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
+invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
@@ -1779,13 +1778,12 @@ invariant_leq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
+invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
 	return cmp >= 0;
 }
@@ -1801,14 +1799,12 @@ invariant_geq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							   Page nontarget, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
 	return cmp <= 0;
 }
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 5c2b8034f5..458f8cb296 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -51,19 +51,19 @@ typedef struct
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
-static TransactionId _bt_check_unique(Relation rel, IndexTuple itup,
-				 Relation heapRel, Buffer buf, OffsetNumber offset,
-				 ScanKey itup_scankey,
-				 IndexUniqueCheck checkUnique, bool *is_unique,
-				 uint32 *speculativeToken);
-static void _bt_findinsertloc(Relation rel,
+static TransactionId _bt_check_unique(Relation rel, BTScanInsert itup_scankey,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
+				 OffsetNumber offset, IndexUniqueCheck checkUnique,
+				 bool *is_unique, uint32 *speculativeToken);
+static OffsetNumber _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_scankey,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool checkingunique,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel);
+static bool _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -83,8 +83,8 @@ static void _bt_checksplitloc(FindSplitData *state,
 				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey);
+static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey,
+			Page page, OffsetNumber offnum);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
@@ -110,15 +110,14 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel)
 {
 	bool		is_unique = false;
-	int			indnkeyatts;
-	ScanKey		itup_scankey;
+	BTScanInsert itup_scankey;
 	BTStack		stack = NULL;
 	Buffer		buf;
+	Page		page;
 	OffsetNumber offset;
+	BTPageOpaque lpageop;
 	bool		fastpath;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	Assert(indnkeyatts != 0);
+	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
@@ -148,8 +147,6 @@ top:
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
 		Size		itemsz;
-		Page		page;
-		BTPageOpaque lpageop;
 
 		/*
 		 * Conditionally acquire exclusive lock on the buffer before doing any
@@ -179,8 +176,7 @@ top:
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, itup_scankey, page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -219,8 +215,7 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, itup_scankey, &buf, BT_WRITE, NULL);
 	}
 
 	/*
@@ -244,13 +239,21 @@ top:
 	 * let the tuple in and return false for possibly non-unique, or true for
 	 * definitely unique.
 	 */
-	if (checkUnique != UNIQUE_CHECK_NO)
+	if (checkingunique)
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
-		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
+		page = BufferGetPage(buf);
+		lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+		/*
+		 * Arrange for the later _bt_findinsertloc call to _bt_binsrch to
+		 * avoid repeating the work done during this initial _bt_binsrch call
+		 */
+		itup_scankey->savebinsrch = true;
+		offset = _bt_binsrch(rel, itup_scankey, buf);
+		xwait = _bt_check_unique(rel, itup_scankey, itup, heapRel, buf, offset,
 								 checkUnique, &is_unique, &speculativeToken);
 
 		if (TransactionIdIsValid(xwait))
@@ -277,6 +280,8 @@ top:
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		OffsetNumber insertoff;
+
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -288,9 +293,9 @@ top:
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
 		/* do the insertion */
-		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+		insertoff = _bt_findinsertloc(rel, itup_scankey, &buf, checkingunique,
+									  itup, stack, heapRel);
+		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, insertoff, false);
 	}
 	else
 	{
@@ -301,7 +306,7 @@ top:
 	/* be tidy */
 	if (stack)
 		_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
+	pfree(itup_scankey);
 
 	return is_unique;
 }
@@ -326,13 +331,12 @@ top:
  * core code must redo the uniqueness check later.
  */
 static TransactionId
-_bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
-				 Buffer buf, OffsetNumber offset, ScanKey itup_scankey,
-				 IndexUniqueCheck checkUnique, bool *is_unique,
-				 uint32 *speculativeToken)
+_bt_check_unique(Relation rel, BTScanInsert itup_scankey,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
+				 OffsetNumber offset, IndexUniqueCheck checkUnique,
+				 bool *is_unique, uint32 *speculativeToken)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	SnapshotData SnapshotDirty;
 	OffsetNumber maxoff;
 	Page		page;
@@ -343,12 +347,30 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 	/* Assume unique until we find a duplicate */
 	*is_unique = true;
 
-	InitDirtySnapshot(SnapshotDirty);
-
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
+	/*
+	 * Fastpath: _bt_binsrch() alone may determine that there are no
+	 * duplicates.
+	 *
+	 * Fastpath won't be taken in the case where there are no items on the
+	 * page (perhaps only a highkey) -- we may have to check the right sibling
+	 * (low > high).  It also won't be taken in the case where there are items
+	 * on the page, but a strict upper bound was not established by
+	 * _bt_binsrch() -- another case where we may have to check the right
+	 * sibling.
+	 */
+	Assert(itup_scankey->low == offset);
+	if (itup_scankey->low == itup_scankey->stricthigh && offset <= maxoff)
+	{
+		Assert(itup_scankey->low >= P_FIRSTDATAKEY(opaque));
+		goto notfound;
+	}
+
+	InitDirtySnapshot(SnapshotDirty);
+
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
@@ -391,7 +413,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 				 * in real comparison, but only for ordering/finding items on
 				 * pages. - vadim 03/24/97
 				 */
-				if (!_bt_isequal(itupdesc, page, offset, indnkeyatts, itup_scankey))
+				if (!_bt_isequal(itupdesc, itup_scankey, page, offset))
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
@@ -552,11 +574,14 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
+			int			highkeycmp;
+
 			/* If scankey == hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			highkeycmp = _bt_compare(rel, itup_scankey, page, P_HIKEY);
+			Assert(highkeycmp <= 0);
+			if (highkeycmp != 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -576,6 +601,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 		}
 	}
 
+notfound:
+
 	/*
 	 * If we are doing a recheck then we should have found the tuple we are
 	 * checking.  Otherwise there's something very wrong --- probably, the
@@ -598,7 +625,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 
 
 /*
- *	_bt_findinsertloc() -- Finds an insert location for a tuple
+ *	_bt_findinsertloc() -- Finds an insert location for a new tuple
  *
  *		If the new key is equal to one or more existing keys, we can
  *		legitimately place it anywhere in the series of equal keys --- in fact,
@@ -611,39 +638,40 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
  *		Once we have chosen the page to put the key on, we'll insert it before
  *		any existing equal keys because of the way _bt_binsrch() works.
  *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
+ *		_bt_check_unique() callers arrange for their insertion scan key to
+ *		save the progress of the last binary search performed.  No additional
+ *		binary search comparisons occur in the common case where there was no
+ *		existing duplicate tuple, though we may occasionally still not be able
+ *		to reuse their work for our own reasons.  Even when there are garbage
+ *		duplicates, very few binary search comparisons will be performed
+ *		without being strictly necessary.
  *
- *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
- *		page returned to the caller is exclusively locked instead.
+ *		The caller should hold an exclusive lock on *bufptr in all cases.  On
+ *		exit,  bufptr points to the chosen insert location in all cases.  If
+ *		we have to move right, the lock and pin on the original page will be
+ *		released, and the new page returned to the caller is exclusively
+ *		locked instead.  In any case, we return the offset that caller should
+ *		use to insert into the buffer pointed to by bufptr on return.
  *
- *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.  It has to happen here, since it may invalidate a
+ *		_bt_check_unique() caller's cached binary search work.
  */
-static void
+static OffsetNumber
 _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_scankey,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool checkingunique,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel)
 {
 	Buffer		buf = *bufptr;
 	Page		page = BufferGetPage(buf);
+	bool		restorebinsrch = checkingunique;
 	Size		itemsz;
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
 	OffsetNumber newitemoff;
-	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -672,55 +700,24 @@ _bt_findinsertloc(Relation rel,
 				 errtableconstraint(heapRel,
 									RelationGetRelationName(rel))));
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
-	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	for (;;)
 	{
 		Buffer		rbuf;
 		BlockNumber rblkno;
+		int			cmpval;
+
+		if (P_RIGHTMOST(lpageop))
+			break;
+		cmpval = _bt_compare(rel, itup_scankey, page, P_HIKEY);
 
 		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
+		 * May have to handle case where there is a choice of which page to
+		 * place new tuple on, and we must balance space utilization as best
+		 * we can.
 		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
-		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
-
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
-		}
-
-		/*
-		 * nope, so check conditions (b) and (c) enumerated above
-		 */
-		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
+												&restorebinsrch, itemsz))
 			break;
 
 		/*
@@ -763,27 +760,91 @@ _bt_findinsertloc(Relation rel,
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		restorebinsrch = false;
 	}
 
 	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
+	 * Perform microvacuuming of the page we're about to insert tuple on if it
+	 * looks like it has LP_DEAD items.  Only microvacuum when it's likely to
+	 * forestall a page split, though.
 	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
-	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < itemsz)
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		restorebinsrch = false;
+	}
+
+	/* _bt_check_unique() callers often avoid binary search effort */
+	itup_scankey->restorebinsrch = restorebinsrch;
+	newitemoff = _bt_binsrch(rel, itup_scankey, buf);
+	Assert(!itup_scankey->restorebinsrch);
+	Assert(!restorebinsrch ||
+		   newitemoff == _bt_binsrch(rel, itup_scankey, buf));
 
 	*bufptr = buf;
-	*offsetptr = newitemoff;
+	return newitemoff;
+}
+
+/*
+ *	_bt_useduplicatepage() -- Settle for this page of duplicates?
+ *
+ *		This function handles the question of whether or not an insertion
+ *		of a duplicate into a pg_upgrade'd !heapkeyspace index should
+ *		insert on the page contained in buf when a choice must be made.
+ *		Preemptive microvacuuming is performed here when that could allow
+ *		caller to insert on to the page in buf.
+ *
+ *		Returns true if caller should proceed with insert on buf's page.
+ *		Otherwise, caller should move on to the page to the right (caller
+ *		must always be able to still move right following call here).
+ */
+static bool
+_bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz)
+{
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque lpageop;
+
+	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	Assert(P_ISLEAF(lpageop) && !P_RIGHTMOST(lpageop));
+
+	/* Easy case -- there is space free on this page already */
+	if (PageGetFreeSpace(page) >= itemsz)
+		return true;
+
+	if (P_HAS_GARBAGE(lpageop))
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		*restorebinsrch = false;
+		if (PageGetFreeSpace(page) >= itemsz)
+			return true;		/* OK, now we have enough space */
+	}
+
+	/*----------
+	 * It's now clear that _bt_findinsertloc() caller will need to split
+	 * the page if it is to insert new item on to it.  The choice to move
+	 * right to the next page remains open to it, but we should not search
+	 * for free space exhaustively when there are many pages to look through.
+	 *
+	 *	_bt_findinsertloc() keeps scanning right until it:
+	 *		(a) reaches the last page where the tuple can legally go
+	 *	Or until we:
+	 *		(b) find a page with enough free space, or
+	 *		(c) get tired of searching.
+	 * (c) is not flippant; it is important because if there are many
+	 * pages' worth of equal keys, it's better to split one of the early
+	 * pages than to scan all the way to the end of the run of equal keys
+	 * on every insert.  We implement "get tired" as a random choice,
+	 * since stopping after scanning a fixed number of pages wouldn't work
+	 * well (we'd never reach the right-hand side of previously split
+	 * pages).  The probability of moving right is set at 0.99, which may
+	 * seem too high to change the behavior much, but it does an excellent
+	 * job of preventing O(N^2) behavior with many equal keys.
+	 *----------
+	 */
+	return random() <= (MAX_RANDOM_VALUE / 100);
 }
 
 /*----------
@@ -1189,8 +1250,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	 * If the page we're splitting is not the rightmost page at its level in
 	 * the tree, then the first entry on the page is the high key for the
 	 * page.  We need to copy that to the right half.  Otherwise (meaning the
-	 * rightmost page case), all the items on the right half will be user
-	 * data.
+	 * rightmost page case), all the items on the right half will be user data
+	 * (there is no existing high key that needs to be relocated to the new
+	 * right page).
 	 */
 	rightoff = P_HIKEY;
 
@@ -2311,24 +2373,21 @@ _bt_pgaddtup(Page page,
  * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
-_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey)
+_bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey, Page page,
+			OffsetNumber offnum)
 {
 	IndexTuple	itup;
+	ScanKey		scankey;
 	int			i;
 
-	/* Better be comparing to a leaf item */
+	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
 
+	scankey = itup_scankey->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
-	for (i = 1; i <= keysz; i++)
+	for (i = 1; i <= itup_scankey->keysz; i++)
 	{
 		AttrNumber	attno;
 		Datum		datum;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 1d72fe5408..d0cf73718f 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1370,7 +1370,7 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 */
 			if (!stack)
 			{
-				ScanKey		itup_scankey;
+				BTScanInsert itup_scankey;
 				ItemId		itemid;
 				IndexTuple	targetkey;
 				Buffer		lbuf;
@@ -1421,11 +1421,10 @@ _bt_pagedel(Relation rel, Buffer buf)
 
 				/* we need an insertion scan key for the search, so build one */
 				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
-				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
-				/* don't need a pin on the page */
+				/* get stack to leaf page by searching index */
+				stack = _bt_search(rel, itup_scankey, &lbuf, BT_READ, NULL);
+
+				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
 				/*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 92832237a8..d169d80b8a 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -71,13 +71,9 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
  * but it can omit the rightmost column(s) of the index.
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * Return value is a stack of parent-page pointers.  *bufP is set to the
  * address of the leaf-page buffer, which is read-locked and pinned.
  * No locks are held on the parent pages, however!
@@ -93,8 +89,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+		   Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -130,7 +126,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+		*bufP = _bt_moveright(rel, key, *bufP,
 							  (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
@@ -144,7 +140,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -198,8 +194,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  true, stack_in, BT_WRITE, snapshot);
+		*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+							  snapshot);
 	}
 
 	return stack_in;
@@ -215,16 +211,17 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * or strictly to the right of it.
  *
  * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page.  If that entry
- * is strictly less than the scankey, or <= the scankey in the nextkey=true
- * case, then we followed the wrong link and we need to move right.
+ * tree by examining the high key entry on the page.  If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key.  When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
  *
  * If forupdate is true, we will attempt to finish any incomplete splits
  * that we encounter.  This is required when locking a target page for an
@@ -241,10 +238,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  */
 Buffer
 _bt_moveright(Relation rel,
+			  BTScanInsert key,
 			  Buffer buf,
-			  int keysz,
-			  ScanKey scankey,
-			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
 			  int access,
@@ -269,7 +264,7 @@ _bt_moveright(Relation rel,
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
 	 */
-	cmpval = nextkey ? 0 : 1;
+	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
@@ -304,7 +299,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -324,13 +319,6 @@ _bt_moveright(Relation rel,
 /*
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
  * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
  * particular, this means it is possible to return a value 1 greater than the
@@ -346,37 +334,71 @@ _bt_moveright(Relation rel,
  *
  * This procedure is not responsible for walking right, it just examines
  * the given page.  _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
+ * on the buffer.  When itup_scankey.savebinsrch is set, modifies
+ * mutable fields of insertion scan key, so that a subsequent call where
+ * caller sets itup_scankey.savebinsrch can reuse the low and high bound
+ * of original binary search.  This makes the second binary search
+ * performed on the first leaf page landed on by inserters that do
+ * unique enforcement avoid doing any real comparisons in most cases.
+ * See _bt_findinsertloc() for further details.
  */
 OffsetNumber
 _bt_binsrch(Relation rel,
-			Buffer buf,
-			int keysz,
-			ScanKey scankey,
-			bool nextkey)
+			BTScanInsert key,
+			Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber low,
-				high;
+				high,
+				stricthigh;
 	int32		result,
 				cmpval;
+	bool		isleaf;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	isleaf = P_ISLEAF(opaque);
 
-	low = P_FIRSTDATAKEY(opaque);
-	high = PageGetMaxOffsetNumber(page);
+	Assert(!(key->restorebinsrch && key->savebinsrch));
+	Assert(P_ISLEAF(opaque) || (!key->restorebinsrch && !key->savebinsrch));
 
-	/*
-	 * If there are no keys on the page, return the first available slot. Note
-	 * this covers two cases: the page is really empty (no keys), or it
-	 * contains only a high key.  The latter case is possible after vacuuming.
-	 * This can never happen on an internal page, however, since they are
-	 * never empty (an internal page must have children).
-	 */
-	if (high < low)
-		return low;
+	if (!key->restorebinsrch)
+	{
+		low = P_FIRSTDATAKEY(opaque);
+		high = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If there are no keys on the page, return the first available slot.
+		 * Note this covers two cases: the page is really empty (no keys), or
+		 * it contains only a high key.  The latter case is possible after
+		 * vacuuming.  This can never happen on an internal page, however,
+		 * since they are never empty (an internal page must have children).
+		 */
+		if (unlikely(high < low))
+		{
+			if (key->savebinsrch)
+			{
+				/* Special case: strict high is not valid */
+				key->low = low;
+				key->stricthigh = high;
+				key->savebinsrch = false;
+			}
+			return low;
+		}
+		high++;					/* establish the loop invariant for high */
+	}
+	else
+	{
+		/* Restore result of previous binary search against same page */
+		low = key->low;
+		high = key->stricthigh;
+		key->restorebinsrch = false;
+
+		/* Return the first slot, in line with original binary search */
+		if (high < low)
+			return low;
+	}
 
 	/*
 	 * Binary search to find the first key on the page >= scan key, or first
@@ -390,22 +412,37 @@ _bt_binsrch(Relation rel,
 	 *
 	 * We can fall out when high == low.
 	 */
-	high++;						/* establish the loop invariant for high */
-
-	cmpval = nextkey ? 0 : 1;	/* select comparison value */
 
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
+	stricthigh = high;
 	while (high > low)
 	{
 		OffsetNumber mid = low + ((high - low) / 2);
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
 		else
+		{
 			high = mid;
+
+			/*
+			 * high can only be reused by more restrictive binary search when
+			 * it's known to be strictly greater than the original scankey
+			 */
+			if (result != 0)
+				stricthigh = high;
+		}
+	}
+
+	if (key->savebinsrch)
+	{
+		key->low = low;
+		key->stricthigh = stricthigh;
+		key->savebinsrch = false;
 	}
 
 	/*
@@ -415,7 +452,7 @@ _bt_binsrch(Relation rel,
 	 * On a leaf page, we always return the first key >= scan key (resp. >
 	 * scan key), which could be the last slot + 1.
 	 */
-	if (P_ISLEAF(opaque))
+	if (isleaf)
 		return low;
 
 	/*
@@ -428,13 +465,8 @@ _bt_binsrch(Relation rel,
 }
 
 /*----------
- *	_bt_compare() -- Compare scankey to a particular tuple on the page.
+ *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- *	keysz: number of key conditions to be checked (might be less than the
- *		number of index columns!)
  *	page/offnum: location of btree item to be compared to.
  *
  *		This routine returns:
@@ -455,17 +487,17 @@ _bt_binsrch(Relation rel,
  */
 int32
 _bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
+			BTScanInsert key,
 			Page page,
 			OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
-	int			i;
+	ScanKey		scankey;
 
 	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
@@ -488,7 +520,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	scankey = key->scankeys;
+	for (int i = 1; i <= key->keysz; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -574,8 +607,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat;
 	bool		nextkey;
 	bool		goback;
+	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
-	ScanKeyData scankeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -821,8 +854,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
 	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built in the local
-	 * scankeys[] array, using the keys identified by startKeys[].
+	 * identified above.  The insertion scankey is built using the keys
+	 * identified by startKeys[].  (Remaining insertion scankey fields are
+	 * initialized after initial-positioning strategy is finalized.)
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
 	for (i = 0; i < keysCount; i++)
@@ -850,7 +884,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				_bt_parallel_done(scan);
 				return false;
 			}
-			memcpy(scankeys + i, subkey, sizeof(ScanKeyData));
+			memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
 
 			/*
 			 * If the row comparison is the last positioning key we accepted,
@@ -882,7 +916,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					if (subkey->sk_flags & SK_ISNULL)
 						break;	/* can't use null keys */
 					Assert(keysCount < INDEX_MAX_KEYS);
-					memcpy(scankeys + keysCount, subkey, sizeof(ScanKeyData));
+					memcpy(inskey.scankeys + keysCount, subkey,
+						   sizeof(ScanKeyData));
 					keysCount++;
 					if (subkey->sk_flags & SK_ROW_END)
 					{
@@ -928,7 +963,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				FmgrInfo   *procinfo;
 
 				procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
-				ScanKeyEntryInitializeWithInfo(scankeys + i,
+				ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
 											   cur->sk_flags,
 											   cur->sk_attno,
 											   InvalidStrategy,
@@ -949,7 +984,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
 						 BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
 						 cur->sk_attno, RelationGetRelationName(rel));
-				ScanKeyEntryInitialize(scankeys + i,
+				ScanKeyEntryInitialize(inskey.scankeys + i,
 									   cur->sk_flags,
 									   cur->sk_attno,
 									   InvalidStrategy,
@@ -1052,12 +1087,17 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
+	/* Initialize remaining insertion scankey fields */
+	inskey.savebinsrch = inskey.restorebinsrch = false;
+	inskey.low = inskey.stricthigh = InvalidOffsetNumber;
+	inskey.nextkey = nextkey;
+	inskey.keysz = keysCount;
+
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
@@ -1086,7 +1126,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, &inskey, buf);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e1186..759859c302 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -254,6 +254,7 @@ typedef struct BTWriteState
 {
 	Relation	heap;
 	Relation	index;
+	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
@@ -531,6 +532,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
+	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 
 	/*
 	 * We need to log index creation in WAL iff WAL archiving/streaming is
@@ -1076,7 +1078,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
-	ScanKey		indexScanKey = NULL;
 	SortSupport sortKeys;
 
 	if (merge)
@@ -1089,7 +1090,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		/* the preparation of merge */
 		itup = tuplesort_getindextuple(btspool->sortstate, true);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-		indexScanKey = _bt_mkscankey_nodata(wstate->index);
 
 		/* Prepare SortSupport data for each column */
 		sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
@@ -1097,7 +1097,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		for (i = 0; i < keysz; i++)
 		{
 			SortSupport sortKey = sortKeys + i;
-			ScanKey		scanKey = indexScanKey + i;
+			ScanKey		scanKey = wstate->inskey->scankeys + i;
 			int16		strategy;
 
 			sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1116,8 +1116,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
 		}
 
-		_bt_freeskey(indexScanKey);
-
 		for (;;)
 		{
 			load1 = true;		/* load BTSpool next ? */
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 2c05fb5e45..eb40b94070 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -56,34 +56,41 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().
+ *		Result is intended for use with _bt_compare().  Callers that don't
+ *		need to fill out the insertion scankey arguments (e.g. they use an own
+ *		ad-hoc comparison routine) can pass a NULL index tuple.
  */
-ScanKey
+BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
+	BTScanInsert inskey;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
 	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
+	int			tupnatts;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
 	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
+	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts <= indnatts);
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
 	 * (INCLUDE index) columns are always omitted from scan keys.
 	 */
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
+	inskey = palloc(offsetof(BTScanInsertData, scankeys) +
+					sizeof(ScanKeyData) * indnkeyatts);
+	inskey->savebinsrch = inskey->restorebinsrch = false;
+	inskey->low = inskey->stricthigh = InvalidOffsetNumber;
+	inskey->nextkey = false;
+	inskey->keysz = Min(indnkeyatts, tupnatts);
+	skey = inskey->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
 		FmgrInfo   *procinfo;
@@ -96,7 +103,19 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Key arguments built when caller provides no tuple are defensively
+		 * represented as NULL values, though they should still not
+		 * participate in comparisons.
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -108,64 +127,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 									   arg);
 	}
 
-	return skey;
-}
-
-/*
- * _bt_mkscankey_nodata
- *		Build an insertion scan key that contains 3-way comparator routines
- *		appropriate to the key datatypes, but no comparison data.  The
- *		comparison data ultimately used must match the key datatypes.
- *
- *		The result cannot be used with _bt_compare(), unless comparison
- *		data is first stored into the key entries.  Currently this
- *		routine is only called by nbtsort.c and tuplesort.c, which have
- *		their own comparison routines.
- */
-ScanKey
-_bt_mkscankey_nodata(Relation rel)
-{
-	ScanKey		skey;
-	int			indnkeyatts;
-	int16	   *indoption;
-	int			i;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	indoption = rel->rd_indoption;
-
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
-	for (i = 0; i < indnkeyatts; i++)
-	{
-		FmgrInfo   *procinfo;
-		int			flags;
-
-		/*
-		 * We can use the cached (default) support procs since no cross-type
-		 * comparison can be needed.
-		 */
-		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		flags = SK_ISNULL | (indoption[i] << SK_BT_INDOPTION_SHIFT);
-		ScanKeyEntryInitializeWithInfo(&skey[i],
-									   flags,
-									   (AttrNumber) (i + 1),
-									   InvalidStrategy,
-									   InvalidOid,
-									   rel->rd_indcollation[i],
-									   procinfo,
-									   (Datum) 0);
-	}
-
-	return skey;
-}
-
-/*
- * free a scan key made by either _bt_mkscankey or _bt_mkscankey_nodata.
- */
-void
-_bt_freeskey(ScanKey skey)
-{
-	pfree(skey);
+	return inskey;
 }
 
 /*
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7b10fd2974..f97a82ae7b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -884,7 +884,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -919,7 +919,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	if (state->indexInfo->ii_Expressions != NULL)
 	{
@@ -945,7 +945,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -964,7 +964,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -981,7 +981,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -1014,7 +1014,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->indexRel = indexRel;
 	state->enforceUnique = enforceUnique;
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	/* Prepare SortSupport data for each column */
 	state->sortKeys = (SortSupport) palloc0(state->nKeys *
@@ -1023,7 +1023,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1042,7 +1042,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
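(An aside on the tuplesort hunks above, since the contract is easy to miss:
when _bt_mkscankey() is given a NULL tuple, every entry in the returned key
is flagged SK_ISNULL with a zeroed argument, so the result is only good for
its per-column metadata until comparison data is stored into it.  A rough
sketch of that contract -- mine, not code from the patch; the function name
is made up and the usual nbtree includes are assumed:)

static void
inspect_nodata_key(Relation indexRel)
{
	BTScanInsert key = _bt_mkscankey(indexRel, NULL);
	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
	int			i;

	for (i = 0; i < indnkeyatts; i++)
	{
		ScanKey		entry = &key->scankeys[i];

		/* No comparison data yet -- not usable with _bt_compare() */
		Assert(entry->sk_flags & SK_ISNULL);
		Assert(entry->sk_argument == (Datum) 0);

		/*
		 * Callers like tuplesort only look at sk_attno, sk_collation, and
		 * the SK_BT_DESC/SK_BT_NULLS_FIRST bits in sk_flags.
		 */
	}

	pfree(key);
}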
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4fb92d60a1..dc2eafb566 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -319,6 +319,47 @@ typedef struct BTStackData
 
 typedef BTStackData *BTStack;
 
+/*
+ * BTScanInsert is the btree-private state needed to find an initial position
+ * for an indexscan, or to insert new tuples -- an "insertion scankey" (not to
+ * be confused with a search scankey).  It's used to descend a B-Tree using
+ * _bt_search.  For details on its mutable state, see _bt_binsrch and
+ * _bt_findinsertloc.
+ *
+ * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
+ * locate the first item >= scankey.  When nextkey is true, they will locate
+ * the first item > scan key.
+ *
+ * keysz is the number of insertion scankeys present.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared.
+ * During insertion, there must be a scan key for every attribute, but when
+ * starting a regular index scan some can be omitted.  The array is used as a
+ * flexible array member, though it's sized in a way that makes it possible to
+ * use stack allocations.  See nbtree/README for full details.
+ */
+
+typedef struct BTScanInsertData
+{
+	/*
+	 * Mutable state used by _bt_binsrch to inexpensively repeat a binary
+	 * search on the leaf level.  Only used for insertions where
+	 * _bt_check_unique is called.
+	 */
+	bool		savebinsrch;
+	bool		restorebinsrch;
+	OffsetNumber low;
+	OffsetNumber stricthigh;
+
+	/* State used to locate a position at the leaf level */
+	bool		nextkey;
+	int			keysz;			/* Size of scankeys */
+	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
+} BTScanInsertData;
+
+typedef BTScanInsertData *BTScanInsert;
+
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -558,16 +599,12 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
 /*
  * prototypes for functions in nbtsearch.c
  */
-extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
-extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
+		   int access, Snapshot snapshot);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -576,9 +613,7 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 /*
  * prototypes for functions in nbtutils.c
  */
-extern ScanKey _bt_mkscankey(Relation rel, IndexTuple itup);
-extern ScanKey _bt_mkscankey_nodata(Relation rel);
-extern void _bt_freeskey(ScanKey skey);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-- 
2.17.1
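To restate the net interface change before moving on to the next patch in
the series: everything that used to pass around (keysz, scankey, nextkey)
separately now hands a single BTScanInsert to _bt_search(), _bt_binsrch(),
and _bt_compare().  A minimal usage sketch (mine, not taken from the patch;
the function name is invented, and locking details plus the empty-index
case are glossed over):

static OffsetNumber
toy_locate(Relation rel, IndexTuple itup, Snapshot snapshot)
{
	BTScanInsert key;
	BTStack		stack;
	Buffer		buf;
	OffsetNumber offnum;

	/* Build insertion scankey from an index tuple (NULL builds the "no data" form) */
	key = _bt_mkscankey(rel, itup);

	/* Descend to the leaf page whose key range covers the scankey */
	stack = _bt_search(rel, key, &buf, BT_READ, snapshot);

	/* Binary search within that leaf page (read lock held on buf) */
	offnum = _bt_binsrch(rel, key, buf);

	_bt_relbuf(rel, buf);
	_bt_freestack(stack);
	pfree(key);

	return offnum;
}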

Attachment: v11-0003-Treat-heap-TID-as-part-of-the-nbtree-key-space.patch (application/x-patch)
From 17ca50019f27328e0bab7615468e201f8491dcab Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v11 3/7] Treat heap TID as part of the nbtree key space.

Make nbtree treat all index tuples as having a heap TID trailing key
attribute.  Heap TID becomes a first class part of the key space on all
levels of the tree.  Index searches can distinguish duplicates by heap
TID.  Non-unique index insertions will descend straight to the leaf page
that they'll insert on to (unless there is a concurrent page split).
This general approach has numerous benefits for performance, and is
prerequisite to teaching VACUUM to perform "retail index tuple
deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will generally truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tie-breaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing the newer/more strict invariants with
version 4 indexes.

We no longer allow a search for free space among multiple pages full of
duplicates to "get tired", except when needed to preserve compatibility
with earlier versions.  This has significant benefits for free space
management in indexes with few distinct sets of non-heap-TID column
values.  However, without the next commit in the patch series (without
having "single value" mode and "many duplicates" mode within
_bt_dofindsplitloc()), these cases will be significantly regressed,
since they'll naively perform 50:50 splits without there being any hope
of reusing space left free on the left half of the split.

The maximum allowed size of new tuples is reduced by a single MAXALIGN()
quantum.  The user-facing definition of the "1/3 of a page" restriction
is already imprecise, and so does not need to be revised.  However,
there should be a compatibility note in the v12 release notes.  The new
maximum allowed size is 2704 bytes on 64-bit systems, down from 2712
bytes.
---
 contrib/amcheck/verify_nbtree.c              | 342 ++++++++++++--
 contrib/pageinspect/btreefuncs.c             |   2 +-
 contrib/pageinspect/expected/btree.out       |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out |  10 +-
 doc/src/sgml/indices.sgml                    |  24 +-
 src/backend/access/common/indextuple.c       |   6 +-
 src/backend/access/nbtree/README             | 160 ++++---
 src/backend/access/nbtree/nbtinsert.c        | 270 +++++++-----
 src/backend/access/nbtree/nbtpage.c          | 198 ++++++---
 src/backend/access/nbtree/nbtree.c           |   2 +-
 src/backend/access/nbtree/nbtsearch.c        | 105 ++++-
 src/backend/access/nbtree/nbtsort.c          |  89 ++--
 src/backend/access/nbtree/nbtutils.c         | 440 +++++++++++++++++--
 src/backend/access/nbtree/nbtxlog.c          |  43 +-
 src/backend/access/rmgrdesc/nbtdesc.c        |   8 -
 src/backend/utils/sort/tuplesort.c           |  13 +-
 src/include/access/nbtree.h                  | 162 +++++--
 src/include/access/nbtxlog.h                 |  20 +-
 src/test/regress/expected/dependency.out     |   4 +-
 src/test/regress/expected/event_trigger.out  |   4 +-
 src/test/regress/expected/foreign_data.out   |   8 +-
 src/test/regress/expected/rowsecurity.out    |   4 +-
 22 files changed, 1475 insertions(+), 441 deletions(-)
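Before getting into the per-file changes: the way I think about the new
comparison rule is that heap TID behaves like one more trailing key column.
A toy distillation of the tie-break (mine, not the patch's _bt_compare(),
and it ignores the "minusinfkey" special case that only amcheck relies on):

static int32
toy_tid_tiebreak(int32 userattcmp, ItemPointer scantid, ItemPointer heaptid)
{
	/* All user-visible key attributes are compared first */
	if (userattcmp != 0)
		return userattcmp;

	/* A scankey without a scantid places no constraint on heap TID */
	if (scantid == NULL)
		return 0;

	/* A pivot tuple's truncated heap TID behaves as "minus infinity" */
	if (heaptid == NULL)
		return 1;

	return ItemPointerCompare(scantid, heaptid);
}

That last rule is what lets an insertion with a fully filled-out scankey
(scantid set) get past pivot tuples whose heap TID was truncated away, and
descend straight to the one leaf page the new tuple belongs on.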

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 9ec023fae3..6e058754f0 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -45,6 +45,8 @@ PG_MODULE_MAGIC;
  * block per level, which is bound by the range of BlockNumber:
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
 
 /*
  * State associated with verifying a B-Tree index
@@ -66,6 +68,8 @@ typedef struct BtreeCheckState
 	/* B-Tree Index Relation and associated heap relation */
 	Relation	rel;
 	Relation	heaprel;
+	/* rel is heapkeyspace index? */
+	bool		heapkeyspace;
 	/* ShareLock held on heap/index, rather than AccessShareLock? */
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
@@ -122,7 +126,7 @@ static void bt_index_check_internal(Oid indrelid, bool parentcheck,
 						bool heapallindexed);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -135,17 +139,22 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  bool tupleIsAlive, void *checkstate);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
+static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
 					 BTScanInsert key,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 BTScanInsert key,
-					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   BTScanInsert key,
-							   Page nontarget,
-							   OffsetNumber upperbound);
+static inline bool invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							 BTScanInsert key,
+							 Page nontarget,
+							 OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline BTScanInsert _bt_mkscankey_minusinfkey(Relation rel,
+													 IndexTuple itup);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+							IndexTuple itup, bool nonpivot);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -202,6 +211,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	Oid			heapid;
 	Relation	indrel;
 	Relation	heaprel;
+	bool		heapkeyspace;
 	LOCKMODE	lockmode;
 
 	if (parentcheck)
@@ -252,7 +262,9 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	btree_index_checkable(indrel);
 
 	/* Check index, possibly against table it is an index on */
-	bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
+	heapkeyspace = _bt_heapkeyspace(indrel);
+	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
+						 heapallindexed);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -322,8 +334,8 @@ btree_index_checkable(Relation rel)
  * parent/child check cannot be affected.)
  */
 static void
-bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
-					 bool heapallindexed)
+bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
+					 bool readonly, bool heapallindexed)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -344,6 +356,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
 	state = palloc0(sizeof(BtreeCheckState));
 	state->rel = rel;
 	state->heaprel = heaprel;
+	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
 
@@ -804,7 +817,8 @@ bt_target_page_check(BtreeCheckState *state)
 	 * doesn't contain a high key, so nothing to check
 	 */
 	if (!P_RIGHTMOST(topaque) &&
-		!_bt_check_natts(state->rel, state->target, P_HIKEY))
+		!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+						 P_HIKEY))
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
@@ -837,6 +851,7 @@ bt_target_page_check(BtreeCheckState *state)
 		IndexTuple	itup;
 		size_t		tupsize;
 		BTScanInsert skey;
+		bool		lowersizelimit;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -863,7 +878,8 @@ bt_target_page_check(BtreeCheckState *state)
 					 errhint("This could be a torn page problem.")));
 
 		/* Check the number of index tuple attributes */
-		if (!_bt_check_natts(state->rel, state->target, offset))
+		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+							 offset))
 		{
 			char	   *itid,
 					   *htid;
@@ -904,7 +920,58 @@ bt_target_page_check(BtreeCheckState *state)
 			continue;
 
 		/* Build insertion scankey for current page offset */
-		skey = _bt_mkscankey(state->rel, itup);
+		skey = _bt_mkscankey_minusinfkey(state->rel, itup);
+
+		/*
+		 * Make sure tuple size does not exceed the relevant BTREE_VERSION
+		 * specific limit.
+		 *
+		 * BTREE_VERSION 4 (which introduced heapkeyspace rules) requisitioned
+		 * a MAXALIGN() quantum of space from BTMaxItemSize() in order to
+		 * ensure that suffix truncation always has enough space to add an
+		 * explicit heap TID back to a tuple -- we pessimistically assume that
+		 * every newly inserted tuple will eventually need to have a heap TID
+		 * appended during a future leaf page split, when the tuple becomes
+		 * the basis of the new high key (pivot tuple) for the leaf page.
+		 *
+		 * Since a MAXALIGN() quantum is reserved for that purpose, we must
+		 * not enforce the slightly lower limit when the extra quantum has
+		 * been used as intended.  In other words, there is only a
+		 * cross-version difference in the limit on tuple size within leaf
+		 * pages.
+		 *
+		 * Still, we're particular about the details within BTREE_VERSION 4
+		 * internal pages.  Pivot tuples may only use the extra quantum for
+		 * its designated purpose.  Enforce the lower limit for pivot tuples
+		 * when an explicit heap TID isn't actually present. (In all other
+		 * cases suffix truncation is guaranteed to generate a pivot tuple
+		 * that's no larger than the first right tuple provided to it by its
+		 * caller.)
+		 */
+		lowersizelimit = skey->heapkeyspace &&
+			(P_ISLEAF(topaque) || BTreeTupleGetHeapTID(itup) == NULL);
+		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
+					   BTMaxItemSizeNoHeapTid(state->target)))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
+							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index row size %zu exceeds maximum for index \"%s\"",
+							tupsize, RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap" : "index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -929,9 +996,35 @@ bt_target_page_check(BtreeCheckState *state)
 		 * grandparents (as well as great-grandparents, and so on).  We don't
 		 * go to those lengths because that would be prohibitively expensive,
 		 * and probably not markedly more effective in practice.
+		 *
+		 * On the leaf level, we check that the key is <= the highkey.
+		 * However, on non-leaf levels we check that the key is < the highkey,
+		 * because the high key is "just another separator" rather than a copy
+		 * of some existing key item; we expect it to be unique among all keys
+		 * on the same level.  (Suffix truncation will sometimes produce a
+		 * leaf highkey that is an untruncated copy of the lastleft item, but
+		 * never any other item, which necessitates weakening the leaf level
+		 * check to <=.)
+		 *
+		 * Full explanation for why a highkey is never truly a copy of another
+		 * item from the same level on internal levels:
+		 *
+		 * While the new left page's high key is copied from the first offset
+		 * on the right page during an internal page split, that's not the
+		 * full story.  In effect, internal pages are split in the middle of
+		 * the firstright tuple, not between the would-be lastleft and
+		 * firstright tuples: the firstright key ends up on the left side as
+		 * left's new highkey, and the firstright downlink ends up on the
+		 * right side as right's new "negative infinity" item.  The negative
+		 * infinity tuple is truncated to zero attributes, so we're only left
+		 * with the downlink.  In other words, the copying is just an
+		 * implementation detail of splitting in the middle of a (pivot)
+		 * tuple. (See also: "Notes About Data Representation" in the nbtree
+		 * README.)
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
+			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
 			char	   *itid,
 					   *htid;
@@ -957,11 +1050,10 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1024,7 +1116,7 @@ bt_target_page_check(BtreeCheckState *state)
 			rightkey = bt_right_page_check_scankey(state);
 
 			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+				!invariant_g_offset(state, rightkey, max))
 			{
 				/*
 				 * As explained at length in bt_right_page_check_scankey(),
@@ -1293,7 +1385,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * memory remaining allocated.
 	 */
 	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
-	return _bt_mkscankey(state->rel, firstitup);
+	return _bt_mkscankey_minusinfkey(state->rel, firstitup);
 }
 
 /*
@@ -1356,7 +1448,8 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the
+	 * page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1405,14 +1498,29 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 	{
 		/*
 		 * Skip comparison of target page key against "negative infinity"
-		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * item, if any.  Checking it would indicate that it's not a strict
+		 * lower bound, but that's only because of the hard-coding for
+		 * negative infinity items within _bt_compare().
+		 *
+		 * If nbtree didn't truncate negative infinity tuples during internal
+		 * page splits then we'd expect child's negative infinity key to be
+		 * equal to the scankey/downlink from target/parent (it would be a
+		 * "low key" in this hypothetical scenario, and so it would still need
+		 * to be treated as a special case here).
+		 *
+		 * Negative infinity items can be thought of as a strict lower bound
+		 * that works transitively, with the last non-negative-infinity pivot
+		 * followed during a descent from the root as its "true" strict lower
+		 * bound.  Only a small number of negative infinity items are truly
+		 * negative infinity; those that are the first items of leftmost
+		 * internal pages.  In more general terms, a negative infinity item is
+		 * only negative infinity with respect to the subtree that the page is
+		 * at the root of.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
+		if (!invariant_l_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1752,6 +1860,66 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return invariant_leq_offset(state, key, upperbound);
+
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+
+	/*
+	 * _bt_compare() is capable of determining that a scankey with a
+	 * filled-out attribute is greater than pivot tuples where the comparison
+	 * is resolved at a truncated attribute (value of attribute in pivot is
+	 * minus infinity).  It is even capable of determining that a "minus
+	 * infinity value" from a "minusinfkey" scankey is equal to a pivot's
+	 * truncated attribute.  However, it is not capable of determining that a
+	 * scankey ("minusinfkey" or otherwise) is _less than_ a tuple on the
+	 * basis of a comparison resolved at _scankey_ minus infinity attribute.
+	 *
+	 * Somebody could teach _bt_compare() to handle this on its own, but that
+	 * would add overhead to index scans.  Complete an extra step to make it
+	 * work here instead.
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		nonpivot = P_ISLEAF(topaque) && upperbound >= P_FIRSTDATAKEY(topaque);
+
+		/* Get number of keys + heap TID for item to the right */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup, nonpivot);
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && rheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1771,42 +1939,84 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound)
 {
 	int32		cmp;
 
 	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
-	return cmp >= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp >= 0;
+
+	/*
+	 * No need to consider the possibility that scankey has attributes that we
+	 * need to force to be interpreted as negative infinity.  _bt_compare() is
+	 * able to determine that scankey is greater than negative infinity.  The
+	 * distinction between "==" and "<" isn't interesting here, since
+	 * corruption is indicated either way.
+	 */
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e.  the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
-							   Page nontarget, OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							 Page nontarget, OffsetNumber upperbound)
 {
 	int32		cmp;
 
 	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
-	return cmp <= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp <= 0;
+
+	/* See invariant_l_offset() for an explanation of this extra step */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+		BTPageOpaque copaque;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		nonpivot = P_ISLEAF(copaque) && upperbound >= P_FIRSTDATAKEY(copaque);
+
+		/* Get number of keys + heap TID for child/non-target item */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+		childheaptid = BTreeTupleGetHeapTIDCareful(state, child, nonpivot);
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && childheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -1962,3 +2172,61 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * _bt_mkscankey() wrapper that automatically sets insertion scankey to have
+ * minus infinity values for truncated attributes from itup (when itup is a
+ * pivot tuple with one or more truncated attributes).
+ *
+ * In a non-corrupt heapkeyspace index, all pivot tuples on a level have
+ * unique keys, so the !minusinfkey optimization correctly guides scans that
+ * aren't interested in relocating a leaf page using leaf page's high key
+ * (i.e. optimization can safely be used by the vast majority of all
+ * _bt_search() calls).  nbtree verification should always use "minusinfkey"
+ * semantics, though, because the !minusinfkey optimization might mask a
+ * problem in a corrupt index.
+ *
+ * For example, invariant_g_offset() might miss a cross-page invariant failure
+ * on an internal level if the scankey built from the first item on the
+ * target's right sibling page happened to be equal to (not greater than) the
+ * last item on target page.  The !minusinfkey tie-breaker might otherwise
+ * cause amcheck to conclude that the scankey is greater, missing index
+ * corruption.  It's unlikely that the same problem would not be caught some
+ * other way, but the !minusinfkey optimization has no upside for amcheck, so
+ * it seems sensible to always avoid it.
+ */
+static inline BTScanInsert
+_bt_mkscankey_minusinfkey(Relation rel, IndexTuple itup)
+{
+	BTScanInsert skey;
+
+	skey = _bt_mkscankey(rel, itup);
+	skey->minusinfkey = true;
+
+	return skey;
+}
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
+ * be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	BlockNumber targetblock = state->targetblock;
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
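To make the cmp == 0 "extra step" in invariant_l_offset() and
invariant_l_nontarget_offset() concrete, a made-up example for an index on
(a, b): suppose the scankey was built from a truncated pivot ('cherry') and
so has keysz = 1, while the item at upperbound is an untruncated pivot
('cherry', 7).  _bt_compare() reports equality only because it ran out of
scankey attributes; the scankey's missing b is really minus infinity, so it
is strictly less than the item -- that's the keysz < uppnkeyatts case.
When both sides have the full set of key attributes, the scankey can only
be strictly less if it lacks a scantid while the tuple to its right still
carries a heap TID, which is the other branch.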
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index bfa0c04c2f..8d27c9b0f6 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -561,7 +561,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	 * Get values of extended metadata if available, use default values
 	 * otherwise.
 	 */
-	if (metad->btm_version == BTREE_VERSION)
+	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b..07c2dcd771 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d4..9920dbfd40 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f427b312..21c978503a 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -504,8 +504,9 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
 
   <para>
    By default, B-tree indexes store their entries in ascending order
-   with nulls last.  This means that a forward scan of an index on
-   column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
+   with nulls last (table TID is treated as a tiebreaker column among
+   otherwise equal entries).  This means that a forward scan of an
+   index on column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
    (or more verbosely, <literal>ORDER BY x ASC NULLS LAST</literal>).  The
    index can also be scanned backward, producing output satisfying
    <literal>ORDER BY x DESC</literal>
@@ -1162,10 +1163,21 @@ CREATE INDEX tab_x_y ON tab(x, y);
    the extra columns are trailing columns; making them be leading columns is
    unwise for the reasons explained in <xref linkend="indexes-multicolumn"/>.
    However, this method doesn't support the case where you want the index to
-   enforce uniqueness on the key column(s).  Also, explicitly marking
-   non-searchable columns as <literal>INCLUDE</literal> columns makes the
-   index slightly smaller, because such columns need not be stored in upper
-   B-tree levels.
+   enforce uniqueness on the key column(s).
+  </para>
+
+  <para>
+   <firstterm>Suffix truncation</firstterm> always removes non-key
+   columns from upper B-Tree levels.  As payload columns, they are
+   never used to guide index scans.  The truncation process also
+   removes one or more trailing key column(s) when the remaining
+   prefix of key column(s) happens to be sufficient to describe tuples
+   on the lowest B-Tree level.  In practice, covering indexes without
+   an <literal>INCLUDE</literal> clause often avoid storing columns
+   that are effectively payload in the upper levels.  However,
+   explicitly defining payload columns as non-key columns
+   <emphasis>reliably</emphasis> keeps the tuples in upper levels
+   small.
   </para>
 
   <para>
diff --git a/src/backend/access/common/indextuple.c b/src/backend/access/common/indextuple.c
index 6a22b17203..53e43ce80e 100644
--- a/src/backend/access/common/indextuple.c
+++ b/src/backend/access/common/indextuple.c
@@ -475,7 +475,11 @@ index_truncate_tuple(TupleDesc sourceDescriptor, IndexTuple source,
 	bool		isnull[INDEX_MAX_KEYS];
 	IndexTuple	truncated;
 
-	Assert(leavenatts < sourceDescriptor->natts);
+	Assert(leavenatts <= sourceDescriptor->natts);
+
+	/* Easy case: no truncation actually required */
+	if (leavenatts == sourceDescriptor->natts)
+		return CopyIndexTuple(source);
 
 	/* Create temporary descriptor to scribble on */
 	truncdesc = palloc(TupleDescSize(sourceDescriptor));
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89..be9bf61d47 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -28,37 +28,50 @@ right-link to find the new page containing the key range you're looking
 for.  This might need to be repeated, if the page has been split more than
 once.
 
+Lehman and Yao talk about pairs of "separator" keys and downlinks in
+internal pages rather than tuples or records.  We use the term "pivot"
+tuple to refer to tuples which don't point to heap tuples, that are used
+only for tree navigation.  All tuples on non-leaf pages and high keys on
+leaf pages are pivot tuples.  Since pivot tuples are only used to represent
+which part of the key space belongs on each page, they can have attribute
+values copied from non-pivot tuples that were deleted and killed by VACUUM
+some time ago.  A pivot tuple may contain a "separator" key and downlink,
+just a separator key (i.e. the downlink value is implicitly undefined), or
+just a downlink (i.e. all attributes are truncated away).  We aren't always
+clear on which case applies, but it should be obvious from context.
+
+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute.  Logical duplicates are sorted in heap TID
+order.  This is necessary because Lehman and Yao also require that the key
+range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are
+the adjacent keys in the parent page (Ki must be _strictly_ less than v,
+which can be assured by having reliably unique keys).
+
+A search where the key is equal to a pivot tuple in an upper tree level
+must descend to the left of that pivot to ensure it finds any equal keys.
+The equal item(s) being searched for must therefore be to the left of that
+downlink page on the next level down.  A handy property of this design is
+that a scan where all attributes/keys are used behaves just the same as a
+scan where only some prefix of attributes are used; equality never needs to
+be treated as a special case.
+
+In practice, exact equality with pivot tuples on internal pages is
+extremely rare when all attributes (including even the heap TID attribute)
+are used in a search.  This is due to suffix truncation: truncated
+attributes are treated as having the value negative infinity, and
+truncation almost always manages to at least truncate away the trailing
+heap TID attribute.  While Lehman and Yao don't have anything to say about
+suffix truncation, the design used by nbtree is perfectly complementary.
+The later section on suffix truncation will be helpful if it's unclear how
+the Lehman & Yao invariants work with a real world example involving
+suffix truncation.
+
 Differences to the Lehman & Yao algorithm
 -----------------------------------------
 
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
-
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
-
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
 among backends.  As a result, we do page-level read locking on btree
@@ -194,9 +207,7 @@ be prepared for the possibility that the item it wants is to the left of
 the recorded position (but it can't have moved left out of the recorded
 page).  Since we hold a lock on the lower page (per L&Y) until we have
 re-found the parent item that links to it, we can be assured that the
-parent item does still exist and can't have been deleted.  Also, because
-we are matching downlink page numbers and not data keys, we don't have any
-problem with possibly misidentifying the parent item.
+parent item does still exist and can't have been deleted.
 
 Page Deletion
 -------------
@@ -595,36 +606,56 @@ scankey point to comparison functions that return boolean, such as int4lt.
 There might be more than one scankey entry for a given index column, or
 none at all.  (We require the keys to appear in index column order, but
 the order of multiple keys for a given column is unspecified.)  An
-insertion scankey uses the same array-of-ScanKey data structure, but the
+insertion scankey uses a similar array-of-ScanKey data structure, but the
 sk_func pointers point to btree comparison support functions (ie, 3-way
 comparators that return int4 values interpreted as <0, =0, >0).  In an
-insertion scankey there is exactly one entry per index column.  Insertion
-scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
-used to locate the starting point of a scan, as well as for locating the
-place to insert a new index tuple.  (Note: in the case of an insertion
-scankey built from a search scankey, there might be fewer keys than
-index columns, indicating that we have no constraints for the remaining
-index columns.)  After we have located the starting point of a scan, the
-original search scankey is consulted as each index entry is sequentially
-scanned to decide whether to return the entry and whether the scan can
-stop (see _bt_checkkeys()).
+insertion scankey there is at most one entry per index column.  There is
+also other data about the rules used to locate where to begin the scan,
+such as whether or not the scan is a "nextkey" scan.  Insertion scankeys
+are built within the btree code (eg, by _bt_mkscankey()) and are used to
+locate the starting point of a scan, as well as for locating the place to
+insert a new index tuple.  (Note: in the case of an insertion scankey built
+from a search scankey or built from a truncated pivot tuple, there might be
+fewer keys than index columns, indicating that we have no constraints for
+the remaining index columns.) After we have located the starting point of a
+scan, the original search scankey is consulted as each index entry is
+sequentially scanned to decide whether to return the entry and whether the
+scan can stop (see _bt_checkkeys()).
 
-We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+Notes about suffix truncation
+-----------------------------
+
+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split.  The remaining attributes must distinguish
+the last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page.  A truncated tuple logically retains all key attributes,
+though they implicitly have "negative infinity" as their value, and have no
+storage overhead.  Since the high key is subsequently reused as the
+downlink in the parent page for the new right page, suffix truncation makes
+pivot tuples short.  INCLUDE indexes are guaranteed to have non-key
+attributes truncated at the time of a leaf page split, but may also have
+some key attributes truncated away, based on the usual criteria for key
+attributes.  They are not a special case, since non-key attributes are
+merely payload to B-Tree searches.
+
+The goal of suffix truncation of key attributes is to improve index
+fan-out.  The technique was first described by Bayer and Unterauer (R.Bayer
+and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
+2, No. 1, March 1977, pp 11-26).  The Postgres implementation is loosely
+based on their paper.  Note that Postgres only implements what the paper
+refers to as simple prefix B-Trees.  Note also that the paper assumes that
+the tree has keys that consist of single strings that maintain the "prefix
+property", much like strings that are stored in a suffix tree (comparisons
+of earlier bytes must always be more significant than comparisons of later
+bytes, and, in general, the strings must compare in a way that doesn't
+break transitive consistency as they're split into pieces).  Suffix
+truncation in Postgres currently only works at the whole-attribute
+granularity, but it would be straightforward to invent opclass
+infrastructure that manufactures a smaller attribute value in the case of
+variable-length types, such as text.  An opclass support function could
+manufacture the shortest possible key value that still correctly separates
+each half of a leaf page split.
 
 Notes About Data Representation
 -------------------------------
@@ -637,20 +668,26 @@ don't need to renumber any existing pages when splitting the root.)
 
 The Postgres disk block data format (an array of items) doesn't fit
 Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
-so we have to play some games.
+so we have to play some games.  (Presumably things are explained this
+way because of internal page splits, which conceptually split at the
+middle of an existing pivot tuple -- the tuple's "separator" key goes on
+the left side of the split as the left side's new high key, while the
+tuple's pointer/downlink goes on the right side as the first/minus
+infinity downlink.)
 
 On a page that is not rightmost in its tree level, the "high key" is
 kept in the page's first item, and real data items start at item 2.
 The link portion of the "high key" item goes unused.  A page that is
-rightmost has no "high key", so data items start with the first item.
-Putting the high key at the left, rather than the right, may seem odd,
-but it avoids moving the high key as we add data items.
+rightmost has no "high key" (it's implicitly positive infinity), so
+data items start with the first item.  Putting the high key at the
+left, rather than the right, may seem odd, but it avoids moving the
+high key as we add data items.
 
 On a leaf page, the data items are simply links to (TIDs of) tuples
 in the relation being indexed, with the associated key values.
 
 On a non-leaf page, the data items are down-links to child pages with
-bounding keys.  The key in each data item is the *lower* bound for
+bounding keys.  The key in each data item is a strict lower bound for
 keys on that child page, so logically the key is to the left of that
 downlink.  The high key (if present) is the upper bound for the last
 downlink.  The first data item on each such page has no lower bound
@@ -658,4 +695,5 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
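Since the README changes reference a real world example involving suffix
truncation, here is the kind of thing I have in mind (values invented):
take an index on (a, b), and a leaf page split whose last item on the left
half is ('pear', 5) with heap TID (12,40), and whose first item on the
right half is ('plum', 1) with heap TID (12,41).  The first attribute
already separates the halves, so the new high key -- later reused as the
downlink -- only needs ('plum'); b and the heap TID are truncated away, and
logically read as minus infinity.  If the split instead falls between two
tuples that are both ('plum', 1), no user attribute separates the halves,
so nothing can be truncated, and the new high key ends up as an untruncated
copy of the lastleft item, heap TID included (the case that forces
amcheck's leaf level check to use <= rather than <).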
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 458f8cb296..9018e5fe53 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -64,14 +64,16 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
 				  Relation heapRel);
 static bool _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
 					 bool *restorebinsrch, Size itemsz);
-static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
+static void _bt_insertonpg(Relation rel, BTScanInsert itup_scankey,
+			   Buffer buf,
+			   Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
 			   bool split_only_page);
-static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
-		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
-		  IndexTuple newitem, bool newitemonleft);
+static Buffer _bt_split(Relation rel, BTScanInsert itup_scankey, Buffer buf,
+		  Buffer cbuf, OffsetNumber firstright, OffsetNumber newitemoff,
+		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
@@ -114,13 +116,16 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	BTStack		stack = NULL;
 	Buffer		buf;
 	Page		page;
-	OffsetNumber offset;
 	BTPageOpaque lpageop;
 	bool		fastpath;
 	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
+top:
+	/* Cannot use real heap TID in unique case -- it'll be restored later */
+	if (itup_scankey->heapkeyspace && checkingunique)
+		itup_scankey->scantid = NULL;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -141,9 +146,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	 * other backend might be concurrently inserting into the page, thus
 	 * reducing our chances to finding an insertion place in this page.
 	 */
-top:
 	fastpath = false;
-	offset = InvalidOffsetNumber;
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
 		Size		itemsz;
@@ -225,12 +228,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tie-breaker attribute.  Any other would-be inserter
+	 * of the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -243,6 +247,7 @@ top:
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
+		OffsetNumber offset;
 
 		page = BufferGetPage(buf);
 		lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -276,6 +281,10 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/* Uniqueness is established -- restore heap tid as scantid */
+		if (itup_scankey->heapkeyspace)
+			itup_scankey->scantid = &itup->t_tid;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
@@ -292,10 +301,11 @@ top:
 		 * attributes are not considered part of the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
+		/* do the insertion, possibly on a page to the right in unique case */
 		insertoff = _bt_findinsertloc(rel, itup_scankey, &buf, checkingunique,
 									  itup, stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, insertoff, false);
+		_bt_insertonpg(rel, itup_scankey, buf, InvalidBuffer, stack, itup,
+					   insertoff, false);
 	}
 	else
 	{
@@ -345,6 +355,7 @@ _bt_check_unique(Relation rel, BTScanInsert itup_scankey,
 	bool		found = false;
 
 	/* Assume unique until we find a duplicate */
+	Assert(itup_scankey->scantid == NULL);
 	*is_unique = true;
 
 	page = BufferGetPage(buf);
@@ -580,6 +591,7 @@ _bt_check_unique(Relation rel, BTScanInsert itup_scankey,
 			if (P_RIGHTMOST(opaque))
 				break;
 			highkeycmp = _bt_compare(rel, itup_scankey, page, P_HIKEY);
+			/* scantid-less scankey should be <= hikey */
 			Assert(highkeycmp <= 0);
 			if (highkeycmp != 0)
 				break;
@@ -627,16 +639,16 @@ notfound:
 /*
  *	_bt_findinsertloc() -- Finds an insert location for a new tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
+ *		On entry, *bufptr contains the page that the new tuple unambiguously
+ *		belongs on.  This may not be quite right for callers that just called
+ *		_bt_check_unique(), though, since they won't have initially searched
+ *		using a scantid.  They'll have to insert into a page somewhere to the
+ *		right in rare cases where there are many physical duplicates in a
+ *		unique index, and their scantid directs us to some page full of
+ *		duplicates to the right, where the new tuple must go.  (Actually,
+ *		since !heapkeyspace pg_upgrade'd non-unique indexes never get a
+ *		scantid, they too may require that we move right.  We treat them
+ *		somewhat like unique indexes.)
  *
  *		_bt_check_unique() callers arrange for their insertion scan key to
  *		save the progress of the last binary search performed.  No additional
@@ -679,46 +691,66 @@ _bt_findinsertloc(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itemsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
-	 */
+	/* Check 1/3 of a page restriction */
 	if (itemsz > BTMaxItemSize(page))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itemsz, BTMaxItemSize(page),
-						RelationGetRelationName(rel)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(heapRel,
-									RelationGetRelationName(rel))));
+		_bt_check_third_page(rel, heapRel, itup_scankey->heapkeyspace, page,
+							 newtup);
 
+	/*
+	 * We may have to walk right through leaf pages to find the one leaf page
+	 * that we must insert on to, though only when inserting into unique
+	 * indexes.  This is necessary because a scantid is not used by the
+	 * insertion scan key initially in the case of unique indexes -- a scantid
+	 * is only set after the absence of duplicates (whose heap tuples are not
+	 * dead or recently dead) has been established by _bt_check_unique().
+	 * Non-unique index insertions will break out of the loop immediately.
+	 *
+	 * (Actually, non-unique indexes may still need to grovel through leaf
+	 * pages full of duplicates with a pg_upgrade'd !heapkeyspace index.)
+	 */
 	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	Assert(!itup_scankey->heapkeyspace || itup_scankey->scantid != NULL);
+	Assert(itup_scankey->heapkeyspace || itup_scankey->scantid == NULL);
 	for (;;)
 	{
 		Buffer		rbuf;
 		BlockNumber rblkno;
 		int			cmpval;
 
+		/*
+		 * No need to check high key when inserting into a non-unique index --
+		 * _bt_search() already checked this when it checked if a move to the
+		 * right was required.  Insertion scankey's scantid would have been
+		 * filled out at the time.
+		 */
+		if (itup_scankey->heapkeyspace && !checkingunique)
+		{
+			Assert(P_RIGHTMOST(lpageop) ||
+				   _bt_compare(rel, itup_scankey, page, P_HIKEY) <= 0);
+			break;
+		}
+
 		if (P_RIGHTMOST(lpageop))
 			break;
 		cmpval = _bt_compare(rel, itup_scankey, page, P_HIKEY);
-
-		/*
-		 * May have to handle case where there is a choice of which page to
-		 * place new tuple on, and we must balance space utilization as best
-		 * we can.
-		 */
-		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
-												&restorebinsrch, itemsz))
-			break;
+		if (itup_scankey->heapkeyspace)
+		{
+			if (cmpval <= 0)
+				break;
+		}
+		else
+		{
+			/*
+			 * pg_upgrade'd !heapkeyspace index.
+			 *
+			 * May have to handle legacy case where there is a choice of which
+			 * page to place new tuple on, and we must balance space
+			 * utilization as best we can.
+			 */
+			if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
+													&restorebinsrch, itemsz))
+				break;
+		}
 
 		/*
 		 * step right to next non-dead page
@@ -727,6 +759,8 @@ _bt_findinsertloc(Relation rel,
 		 * page; else someone else's _bt_check_unique scan could fail to see
 		 * our insertion.  write locks on intermediate dead pages won't do
 		 * because we don't know when they will get de-linked from the tree.
+		 * (this is more aggressive than it needs to be for non-unique
+		 * !heapkeyspace indexes.)
 		 */
 		rbuf = InvalidBuffer;
 
@@ -741,7 +775,10 @@ _bt_findinsertloc(Relation rel,
 			 * If this page was incompletely split, finish the split now. We
 			 * do this while holding a lock on the left sibling, which is not
 			 * good because finishing the split could be a fairly lengthy
-			 * operation.  But this should happen very seldom.
+			 * operation.  But this should only happen when inserting into a
+			 * unique index that has more than an entire page for duplicates
+			 * of the value being inserted.  (!heapkeyspace non-unique indexes
+			 * are an exception, once again.)
 			 */
 			if (P_INCOMPLETE_SPLIT(lpageop))
 			{
@@ -789,6 +826,11 @@ _bt_findinsertloc(Relation rel,
 /*
  *	_bt_useduplicatepage() -- Settle for this page of duplicates?
  *
+ *		Prior to PostgreSQL 12/Btree version 4, heap TID was never treated
+ *		as a part of the keyspace.  If there were many tuples of the same
+ *		value spanning more than one leaf page, a new tuple of that same
+ *		value could legally be placed on any one of the pages.
+ *
  *		This function handles the question of whether or not an insertion
  *		of a duplicate into a pg_upgrade'd !heapkeyspace index should
  *		insert on the page contained in buf when a choice must be made.
@@ -879,6 +921,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
  */
 static void
 _bt_insertonpg(Relation rel,
+			   BTScanInsert itup_scankey,
 			   Buffer buf,
 			   Buffer cbuf,
 			   BTStack stack,
@@ -901,7 +944,7 @@ _bt_insertonpg(Relation rel,
 		   BTreeTupleGetNAtts(itup, rel) ==
 		   IndexRelationGetNumberOfAttributes(rel));
 	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
+		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/* The caller should've finished any incomplete splits already. */
@@ -951,7 +994,7 @@ _bt_insertonpg(Relation rel,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, buf, cbuf, firstright,
+		rbuf = _bt_split(rel, itup_scankey, buf, cbuf, firstright,
 						 newitemoff, itemsz, itup, newitemonleft);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
@@ -1034,7 +1077,7 @@ _bt_insertonpg(Relation rel,
 		if (BufferIsValid(metabuf))
 		{
 			/* upgrade meta-page if needed */
-			if (metad->btm_version < BTREE_VERSION)
+			if (metad->btm_version < BTREE_NOVAC_VERSION)
 				_bt_upgrademetapage(metapg);
 			metad->btm_fastroot = itup_blkno;
 			metad->btm_fastlevel = lpageop->btpo.level;
@@ -1089,6 +1132,8 @@ _bt_insertonpg(Relation rel,
 
 			if (BufferIsValid(metabuf))
 			{
+				Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+				xlmeta.version = metad->btm_version;
 				xlmeta.root = metad->btm_root;
 				xlmeta.level = metad->btm_level;
 				xlmeta.fastroot = metad->btm_fastroot;
@@ -1155,6 +1200,7 @@ _bt_insertonpg(Relation rel,
  *		firstright is the item index of the first item to be moved to the
  *		new right page.  newitemoff etc. tell us about the new item that
  *		must be inserted along with the data from the old page.
+ *		itup_scankey is used for suffix truncation in leaf pages.
  *
  *		When splitting a non-leaf page, 'cbuf' is the left-sibling of the
  *		page we're inserting the downlink for.  This function will clear the
@@ -1164,9 +1210,9 @@ _bt_insertonpg(Relation rel,
  *		The pin and lock on buf are maintained.
  */
 static Buffer
-_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
-		  bool newitemonleft)
+_bt_split(Relation rel, BTScanInsert itup_scankey, Buffer buf, Buffer cbuf,
+		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
+		  IndexTuple newitem, bool newitemonleft)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1261,7 +1307,8 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1275,8 +1322,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 
 	/*
 	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
-	 * item at position firstright, or the incoming tuple.
+	 * to go into the new right page, or possibly a truncated version if this
+	 * is a leaf page split.  This might be either the existing data item at
+	 * position firstright, or the incoming tuple.
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1294,25 +1342,58 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
-	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * Truncate nondistinguishing key attributes of the high key item before
+	 * inserting it on the left page.  This can only happen at the leaf level,
+	 * since in general all pivot tuple values originate from leaf level high
+	 * keys.  This isn't just about avoiding unnecessary work, though;
+	 * truncating unneeded key suffix attributes can only be performed at the
+	 * leaf level anyway.  This is because a pivot tuple in a grandparent page
+	 * must guide a search not only to the correct parent page, but also to
+	 * the correct leaf page.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (isleaf && (itup_scankey->heapkeyspace || indnatts != indnkeyatts))
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		IndexTuple	lastleft;
+
+		/*
+		 * Determine which tuple will become the last on the left page.  The
+		 * last left tuple and the first right tuple enclose the split point,
+		 * and are needed to determine how far truncation can go while still
+		 * leaving us with a high key that distinguishes the left side from
+		 * the right side.
+		 */
+		if (newitemonleft && newitemoff == firstright)
+		{
+			/* incoming tuple will become last on left page */
+			lastleft = newitem;
+		}
+		else
+		{
+			OffsetNumber lastleftoff;
+
+			/* item just before firstright will become last on left page */
+			lastleftoff = OffsetNumberPrev(firstright);
+			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		/*
+		 * Truncate first item on the right side to create a new high key for
+		 * the left side.  The high key must be strictly less than all tuples
+		 * on the right side of the split, but can be equal to the last item
+		 * on the left side of the split.
+		 */
+		Assert(lastleft != item);
+		lefthikey = _bt_truncate(rel, lastleft, item, itup_scankey);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1505,7 +1586,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1534,22 +1614,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log left page.  We must also log the left page's high key. */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1567,9 +1635,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -1932,7 +1998,7 @@ _bt_insert_parent(Relation rel,
 			_bt_relbuf(rel, pbuf);
 		}
 
-		/* get high key from left page == lower bound for new right page */
+		/* get high key from left, a strict lower bound for new right page */
 		ritem = (IndexTuple) PageGetItem(page,
 										 PageGetItemId(page, P_HIKEY));
 
@@ -1962,7 +2028,7 @@ _bt_insert_parent(Relation rel,
 				 RelationGetRelationName(rel), bknum, rbknum);
 
 		/* Recursively update the parent */
-		_bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
+		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
 					   new_item, stack->bts_offset + 1,
 					   is_only);
 
@@ -2222,7 +2288,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	START_CRIT_SECTION();
 
 	/* upgrade metapage if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* set btree special data */
@@ -2257,7 +2323,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2290,6 +2357,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
 		XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = rootblknum;
 		md.level = metad->btm_level;
 		md.fastroot = rootblknum;
@@ -2354,6 +2423,7 @@ _bt_pgaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -2369,8 +2439,8 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey, Page page,
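
The suffix truncation performed during the leaf page split above follows one rule:
keep firstright's attributes up to and including the first attribute that differs
from lastleft, and fall back to appending a heap TID when no key attribute differs.
A minimal sketch of that rule over plain int arrays (illustrative only; it is not
the patch's _bt_keep_natts(), which works with datums and insertion scankey
comparators):

#include <stdio.h>

/*
 * Return how many leading attributes a new pivot must keep.  A result of
 * nkeyatts + 1 means no key attribute distinguishes lastleft from
 * firstright, so the caller has to append a heap TID tie-breaker instead,
 * as _bt_truncate() does in the patch.
 */
static int
keep_natts_sketch(const int *lastleft, const int *firstright, int nkeyatts)
{
	int		keep = 1;

	for (int attnum = 0; attnum < nkeyatts; attnum++, keep++)
	{
		if (lastleft[attnum] != firstright[attnum])
			break;
	}
	return keep;
}

int
main(void)
{
	int		lastleft[] = {7, 4};
	int		firstright[] = {7, 9};
	int		samevals[] = {7, 4};

	/* second attribute distinguishes the halves: keep 2, no heap TID */
	printf("%d\n", keep_natts_sketch(lastleft, firstright, 2));
	/* nothing distinguishes them: 3 == nkeyatts + 1, heap TID required */
	printf("%d\n", keep_natts_sketch(lastleft, samevals, 2));
	return 0;
}
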
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index d0cf73718f..c578a412f5 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -34,6 +34,7 @@
 #include "utils/snapmgr.h"
 
 static void _bt_cachemetadata(Relation rel, BTMetaPageData *metad);
+static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 						 bool *rightsib_empty);
@@ -77,7 +78,9 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 }
 
 /*
- *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to the new.
+ *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to version
+ *		3, the last version that can be updated without broadly affecting
+ *		on-disk compatibility.  (A REINDEX is required to upgrade to v4.)
  *
  *		This routine does purely in-memory image upgrade.  Caller is
  *		responsible for locking, WAL-logging etc.
@@ -93,11 +96,11 @@ _bt_upgrademetapage(Page page)
 
 	/* It must be really a meta page of upgradable version */
 	Assert(metaopaque->btpo_flags & BTP_META);
-	Assert(metad->btm_version < BTREE_VERSION);
+	Assert(metad->btm_version < BTREE_NOVAC_VERSION);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 
 	/* Set version number and fill extra fields added into version 3 */
-	metad->btm_version = BTREE_VERSION;
+	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 
@@ -107,43 +110,79 @@ _bt_upgrademetapage(Page page)
 }
 
 /*
- * Cache metadata from meta page to rel->rd_amcache.
+ * Cache metadata from input meta page to rel->rd_amcache.
  */
 static void
-_bt_cachemetadata(Relation rel, BTMetaPageData *metad)
+_bt_cachemetadata(Relation rel, BTMetaPageData *input)
 {
+	BTMetaPageData *cached_metad;
+
 	/* We assume rel->rd_amcache was already freed by caller */
 	Assert(rel->rd_amcache == NULL);
 	rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
 										 sizeof(BTMetaPageData));
 
-	/*
-	 * Meta page should be of supported version (should be already checked by
-	 * caller).
-	 */
-	Assert(metad->btm_version >= BTREE_MIN_VERSION &&
-		   metad->btm_version <= BTREE_VERSION);
+	/* Meta page should be of supported version */
+	Assert(input->btm_version >= BTREE_MIN_VERSION &&
+		   input->btm_version <= BTREE_VERSION);
 
-	if (metad->btm_version == BTREE_VERSION)
+	cached_metad = (BTMetaPageData *) rel->rd_amcache;
+	if (input->btm_version >= BTREE_NOVAC_VERSION)
 	{
-		/* Last version of meta-data, no need to upgrade */
-		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		/* Version with compatible meta-data, no need to upgrade */
+		memcpy(cached_metad, input, sizeof(BTMetaPageData));
 	}
 	else
 	{
-		BTMetaPageData *cached_metad = (BTMetaPageData *) rel->rd_amcache;
-
 		/*
 		 * Upgrade meta-data: copy available information from meta-page and
 		 * fill new fields with default values.
+		 *
+		 * Note that we cannot upgrade to version 4+ without a REINDEX, since
+		 * extensive on-disk changes are required.
 		 */
-		memcpy(rel->rd_amcache, metad, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
-		cached_metad->btm_version = BTREE_VERSION;
+		memcpy(cached_metad, input, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
+		cached_metad->btm_version = BTREE_NOVAC_VERSION;
 		cached_metad->btm_oldest_btpo_xact = InvalidTransactionId;
 		cached_metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	}
 }
 
+/*
+ * Get metadata from share-locked buffer containing metapage, while performing
+ * standard sanity checks.  Sanity checks here must match _bt_getroot().
+ */
+static BTMetaPageData *
+_bt_getmeta(Relation rel, Buffer metabuf)
+{
+	Page		metapg;
+	BTPageOpaque metaopaque;
+	BTMetaPageData *metad;
+
+	metapg = BufferGetPage(metabuf);
+	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
+	metad = BTPageGetMeta(metapg);
+
+	/* sanity-check the metapage */
+	if (!P_ISMETA(metaopaque) ||
+		metad->btm_magic != BTREE_MAGIC)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("index \"%s\" is not a btree",
+						RelationGetRelationName(rel))));
+
+	if (metad->btm_version < BTREE_MIN_VERSION ||
+		metad->btm_version > BTREE_VERSION)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("version mismatch in index \"%s\": file version %d, "
+						"current version %d, minimal supported version %d",
+						RelationGetRelationName(rel),
+						metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+
+	return metad;
+}
+
 /*
  *	_bt_update_meta_cleanup_info() -- Update cleanup-related information in
  *									  the metapage.
@@ -167,7 +206,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	metad = BTPageGetMeta(metapg);
 
 	/* outdated version of metapage always needs rewrite */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		needsRewrite = true;
 	else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
 			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
@@ -186,7 +225,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* update cleanup-related information */
@@ -202,6 +241,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = metad->btm_root;
 		md.level = metad->btm_level;
 		md.fastroot = metad->btm_fastroot;
@@ -376,7 +417,7 @@ _bt_getroot(Relation rel, int access)
 		START_CRIT_SECTION();
 
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 
 		metad->btm_root = rootblkno;
@@ -400,6 +441,8 @@ _bt_getroot(Relation rel, int access)
 			XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
 			XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			md.version = metad->btm_version;
 			md.root = rootblkno;
 			md.level = 0;
 			md.fastroot = rootblkno;
@@ -595,37 +638,12 @@ _bt_getrootheight(Relation rel)
 {
 	BTMetaPageData *metad;
 
-	/*
-	 * We can get what we need from the cached metapage data.  If it's not
-	 * cached yet, load it.  Sanity checks here must match _bt_getroot().
-	 */
 	if (rel->rd_amcache == NULL)
 	{
 		Buffer		metabuf;
-		Page		metapg;
-		BTPageOpaque metaopaque;
 
 		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
-		metapg = BufferGetPage(metabuf);
-		metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-		metad = BTPageGetMeta(metapg);
-
-		/* sanity-check the metapage */
-		if (!P_ISMETA(metaopaque) ||
-			metad->btm_magic != BTREE_MAGIC)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("index \"%s\" is not a btree",
-							RelationGetRelationName(rel))));
-
-		if (metad->btm_version < BTREE_MIN_VERSION ||
-			metad->btm_version > BTREE_VERSION)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("version mismatch in index \"%s\": file version %d, "
-							"current version %d, minimal supported version %d",
-							RelationGetRelationName(rel),
-							metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+		metad = _bt_getmeta(rel, metabuf);
 
 		/*
 		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
@@ -642,19 +660,70 @@ _bt_getrootheight(Relation rel)
 		 * Cache the metapage data for next time
 		 */
 		_bt_cachemetadata(rel, metad);
-
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
 		_bt_relbuf(rel, metabuf);
 	}
 
+	/* Get cached metapage data */
 	metad = (BTMetaPageData *) rel->rd_amcache;
-	/* We shouldn't have cached it if any of these fail */
-	Assert(metad->btm_magic == BTREE_MAGIC);
-	Assert(metad->btm_version == BTREE_VERSION);
-	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
+/*
+ *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *
+ *		This is used to determine the rules that must be used to descend a
+ *		btree.  Version 4 indexes treat heap TID as a tie-breaker attribute.
+ *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
+ *		performance when inserting a new BTScanInsert-wise duplicate tuple
+ *		among many leaf pages already full of such duplicates.
+ */
+bool
+_bt_heapkeyspace(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			uint32		btm_version = metad->btm_version;
+
+			_bt_relbuf(rel, metabuf);
+			return btm_version > BTREE_NOVAC_VERSION;
+		}
+
+		/*
+		 * Cache the metapage data for next time
+		 */
+		_bt_cachemetadata(rel, metad);
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached metapage data */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+
+	return metad->btm_version > BTREE_NOVAC_VERSION;
+}
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -1123,10 +1192,12 @@ _bt_is_page_halfdead(Relation rel, BlockNumber blk)
  * right sibling.
  *
  * "child" is the leaf page we wish to delete, and "stack" is a search stack
- * leading to it (approximately).  Note that we will update the stack
- * entry(s) to reflect current downlink positions --- this is harmless and
- * indeed saves later search effort in _bt_pagedel.  The caller should
- * initialize *target and *rightsib to the leaf page and its right sibling.
+ * leading to it (it starts out as approximate in !heapkeyspace indexes, and
+ * needs to be re-checked in case it became stale in all cases).  Note that we
+ * will update the stack entry(s) to reflect current downlink positions ---
+ * this is harmless and indeed saves later search effort in _bt_pagedel.  The
+ * caller should initialize *target and *rightsib to the leaf page and its
+ * right sibling.
  *
  * Note: it's OK to release page locks on any internal pages between the leaf
  * and *topparent, because a safe deletion can't become unsafe due to
@@ -1421,9 +1492,20 @@ _bt_pagedel(Relation rel, Buffer buf)
 
 				/* we need an insertion scan key for the search, so build one */
 				itup_scankey = _bt_mkscankey(rel, targetkey);
+				/* absent attributes need explicit minus infinity values */
+				itup_scankey->minusinfkey = true;
 				/* get stack to leaf page by searching index */
 				stack = _bt_search(rel, itup_scankey, &lbuf, BT_READ, NULL);
 
+				/*
+				 * Search will reliably relocate same leaf page.
+				 *
+				 * (However, prior to version 4 the search is for the leftmost
+				 * leaf page containing this key, which is okay because we
+				 * will tiebreak on downlink block number.)
+				 */
+				Assert(!itup_scankey->heapkeyspace ||
+					   BufferGetBlockNumber(buf) == BufferGetBlockNumber(lbuf));
 				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
@@ -1969,7 +2051,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	if (BufferIsValid(metabuf))
 	{
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 		metad->btm_fastroot = rightsib;
 		metad->btm_fastlevel = targetlevel;
@@ -2017,6 +2099,8 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 		{
 			XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			xlmeta.version = metad->btm_version;
 			xlmeta.root = metad->btm_root;
 			xlmeta.level = metad->btm_level;
 			xlmeta.fastroot = metad->btm_fastroot;
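
To summarize what _bt_heapkeyspace() is deciding above: one metapage read, cached in
the relation, and the stored version alone determines whether heap TID participates
in comparisons.  A deliberately simplified sketch, with a global standing in for
rel->rd_amcache and a made-up constant mirroring BTREE_NOVAC_VERSION:

#include <stdbool.h>
#include <stdio.h>

#define SKETCH_NOVAC_VERSION	3	/* stand-in for BTREE_NOVAC_VERSION */

/* stand-in for rel->rd_amcache; 0 means "metapage not read yet" */
static int	cached_version = 0;

static bool
heapkeyspace_sketch(int version_on_disk)
{
	if (cached_version == 0)
		cached_version = version_on_disk;	/* read metapage once, then cache */

	/* only version 4+ indexes treat heap TID as a key attribute */
	return cached_version > SKETCH_NOVAC_VERSION;
}

int
main(void)
{
	printf("%d\n", (int) heapkeyspace_sketch(4));	/* 1: heapkeyspace rules */
	printf("%d\n", (int) heapkeyspace_sketch(3));	/* still 1: answer is cached */
	return 0;
}
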
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..ec2edae850 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -794,7 +794,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 	{
 		/*
 		 * Do cleanup if metapage needs upgrade, because we don't have
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d169d80b8a..f134321717 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -153,8 +153,12 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link during the second half of a page split -- if the caller
+		 * ends up splitting the child, it usually inserts a new pivot tuple
+		 * for the child's new right sibling immediately after the original
+		 * bts_offset position recorded here.  The downlink block will be
+		 * needed to check whether bts_offset is still the position of this
+		 * same pivot tuple.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -252,11 +256,15 @@ _bt_moveright(Relation rel,
 	/*
 	 * When nextkey = false (normal case): if the scan key that brought us to
 	 * this page is > the high key stored on the page, then the page has split
-	 * and we need to move right.  (If the scan key is equal to the high key,
-	 * we might or might not need to move right; have to scan the page first
-	 * anyway.)
+	 * and we need to move right.  (pg_upgrade'd !heapkeyspace indexes could
+	 * have some duplicates to the right as well as the left, but that's
+	 * something that's only ever dealt with on the leaf level, after
+	 * _bt_search has found an initial leaf page.  Duplicate pivots on
+	 * internal pages are useless to all index scans, which was a flaw in the
+	 * old design.)
 	 *
 	 * When nextkey = true: move right if the scan key is >= page's high key.
+	 * (Note that key.scantid cannot be set in this case.)
 	 *
 	 * The page could even have split more than once, so scan as far as
 	 * needed.
@@ -360,8 +368,13 @@ _bt_binsrch(Relation rel,
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	isleaf = P_ISLEAF(opaque);
 
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+	/* Restore binary search state after scantid filled-in by caller */
 	Assert(!(key->restorebinsrch && key->savebinsrch));
 	Assert(P_ISLEAF(opaque) || (!key->restorebinsrch && !key->savebinsrch));
+	Assert(!key->savebinsrch || key->scantid == NULL);
+	Assert(!key->heapkeyspace || !key->restorebinsrch || key->scantid != NULL);
 
 	if (!key->restorebinsrch)
 	{
@@ -494,19 +507,31 @@ _bt_compare(Relation rel,
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	ItemPointer heapTid;
 	ScanKey		scankey;
+	int			ncmpkey;
+	int			ntupatts;
 
-	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(key->heapkeyspace || key->scantid == NULL);
+	Assert(key->minusinfkey || key->heapkeyspace);
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
 	 * --- see NOTE above.
+	 *
+	 * A minus infinity key has all attributes truncated away, so this test is
+	 * redundant with the minus infinity attribute tie-breaker.  However, the
+	 * number of attributes in minus infinity tuples was not explicitly
+	 * represented as 0 until PostgreSQL v11, so an explicit offnum test is
+	 * still required.
 	 */
 	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -520,8 +545,10 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
+	ncmpkey = Min(ntupatts, key->keysz);
+	Assert(key->heapkeyspace || ncmpkey == key->keysz);
 	scankey = key->scankeys;
-	for (int i = 1; i <= key->keysz; i++)
+	for (int i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -572,8 +599,65 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * All non-truncated attributes (other than heap TID) were found to be
+	 * equal.  Treat truncated attributes as minus infinity when scankey has a
+	 * key attribute value that would otherwise be compared directly.
+	 *
+	 * Note: it doesn't matter if ntupatts includes non-key attributes;
+	 * scankey won't, so explicitly excluding non-key attributes isn't
+	 * necessary.
+	 */
+	if (key->keysz > ntupatts)
+		return 1;
+
+	/*
+	 * Use the heap TID attribute and scantid to try to break the tie.  The
+	 * rules are the same as any other key attribute -- only the
+	 * representation differs.  (This is also a convenient point to check if
+	 * the !minusinfkey optimization can be used.)
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (key->scantid == NULL)
+	{
+		/*
+		 * Most searches (all !minusinfkey searches) are not interested in
+		 * keys where minus infinity is explicitly represented, since that's a
+		 * sentinel value that never appears in non-pivot tuples.  It is safe
+		 * for these searches to have their scankey considered greater than a
+		 * truncated pivot tuple iff the scankey has equal values for
+		 * attributes up to and including the least significant untruncated
+		 * attribute in the pivot tuple.  The only would-be "match" that will
+		 * be "missed" is a single leaf page's high key (the leaf page whose
+		 * high key the affected pivot tuple's values originate from).
+		 *
+		 * This optimization prevents an extra leaf page visit when the
+		 * insertion scankey would otherwise be equal.  If this tiebreaker
+		 * wasn't performed, code like _bt_readpage() and _bt_readnextpage()
+		 * would often end up moving right having found no matches on the leaf
+		 * page that their search lands on initially.
+		 *
+		 * Note: the heap TID part of this test ensures that scankey is being
+		 * compared to a pivot tuple with one or more truncated key attributes
+		 * (often though not necessarily just the heap TID attribute).
+		 */
+		if (!key->minusinfkey && key->keysz == ntupatts && heapTid == NULL)
+			return 1;
+
+		/* All provided scankey arguments found to be equal */
+		return 0;
+	}
+
+	/*
+	 * Treat truncated heap TID as minus infinity, since scankey has a key
+	 * attribute value (scantid) that would otherwise be compared directly
+	 */
+	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+	if (heapTid == NULL)
+		return 1;
+
+	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+	return ItemPointerCompare(key->scantid, heapTid);
 }
 
 /*
@@ -1090,7 +1174,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/* Initialize remaining insertion scankey fields */
 	inskey.savebinsrch = inskey.restorebinsrch = false;
 	inskey.low = inskey.stricthigh = InvalidOffsetNumber;
+	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	inskey.minusinfkey = !inskey.heapkeyspace;
 	inskey.nextkey = nextkey;
+	inskey.scantid = NULL;
 	inskey.keysz = keysCount;
 
 	/*
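
The tie-break rules added to _bt_compare() above reduce to a short comparison routine
once everything else is stripped away: truncated suffix attributes behave like minus
infinity, and the heap TID is just one more attribute at the end.  A sketch over int
arrays, ignoring NULLs, collations, and the !minusinfkey refinement (a single long
stands in for the (block, offset) heap TID; none of this is the patch's actual code):

#include <stdio.h>

static int
compare_sketch(const int *skey, int keysz, const long *scantid,
			   const int *tup, int ntupatts, const long *tuptid)
{
	int		ncmp = keysz < ntupatts ? keysz : ntupatts;

	for (int i = 0; i < ncmp; i++)
	{
		if (skey[i] != tup[i])
			return skey[i] < tup[i] ? -1 : 1;
	}
	/* scankey still has values left: truncated attributes are -inf */
	if (keysz > ntupatts)
		return 1;
	/* no scantid: all attributes that could be compared were equal */
	if (scantid == NULL)
		return 0;
	/* truncated heap TID is -inf as well */
	if (tuptid == NULL)
		return 1;
	return (*scantid > *tuptid) - (*scantid < *tuptid);
}

int
main(void)
{
	int		skey[] = {5, 8};
	int		pivot[] = {5};		/* second attribute truncated away */
	long	scantid = 31;

	/* scankey > pivot whose second attribute was truncated */
	printf("%d\n", compare_sketch(skey, 2, NULL, pivot, 1, NULL));
	/* scankey > untruncated-equal pivot once scantid comes into play */
	printf("%d\n", compare_sketch(skey, 2, &scantid, skey, 2, NULL));
	return 0;
}
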
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 759859c302..5217047f3d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -746,6 +746,7 @@ _bt_sortaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -799,8 +800,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -817,27 +816,21 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	itupsz = MAXALIGN(itupsz);
 
 	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itupsz doesn't
-	 * include the ItemId.
+	 * Check whether the item can fit on a btree page at all.
 	 *
-	 * NOTE: similar code appears in _bt_insertonpg() to defend against
-	 * oversize items being inserted into an already-existing index. But
-	 * during creation of an index, we don't go through there.
+	 * Every newly built index will treat heap TID as part of the keyspace,
+	 * which imposes the requirement that new high keys must occasionally have
+	 * a heap TID appended within _bt_truncate().  That may leave a new pivot
+	 * tuple one MAXALIGN() quantum larger than the original first right tuple
+	 * it's derived from.  v4 deals with the problem by decreasing the limit
+	 * on the size of tuples inserted on the leaf level by the same small
+	 * amount.  Enforce the new v4+ limit on the leaf level, and the old limit
+	 * on internal levels, since pivot tuples may need to make use of the
+	 * spare MAXALIGN() quantum.  This should never fail on internal pages.
 	 */
 	if (itupsz > BTMaxItemSize(npage))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itupsz, BTMaxItemSize(npage),
-						RelationGetRelationName(wstate->index)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(wstate->heap,
-									RelationGetRelationName(wstate->index))));
+		_bt_check_third_page(wstate->index, wstate->heap,
+							 state->btps_level == 0, npage, itup);
 
 	/*
 	 * Check to see if page is "full".  It's definitely full if the item won't
@@ -883,24 +876,35 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
+			IndexTuple	lastleft;
 			IndexTuple	truncated;
 			Size		truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks
 			 * in internal pages are either negative infinity items, or get
 			 * their contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it more
+			 * likely that _bt_truncate() can truncate away more attributes,
+			 * whereas the split point passed to _bt_split() is chosen much
+			 * more delicately.  Suffix truncation is mostly useful because it
+			 * improves space utilization for workloads with random
+			 * insertions.  It doesn't seem worthwhile to add logic for
+			 * choosing a split point here for a benefit that is bound to be
+			 * much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
 			 * original high key, and add our own truncated high key at the
-			 * same offset.
+			 * same offset.  It's okay if the truncated tuple is slightly
+			 * larger due to containing a heap TID value, since this case is
+			 * known to _bt_check_third_page(), which reserves space.
 			 *
 			 * Note that the page layout won't be changed very much.  oitup is
 			 * already located at the physical beginning of tuple space, so we
@@ -908,7 +912,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_truncate(wstate->index, lastleft, oitup,
+									 wstate->inskey);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -927,8 +935,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -973,7 +982,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1032,8 +1041,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1126,6 +1136,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1133,7 +1145,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1150,6 +1161,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all
+				 * keys in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
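
The spool-merge change above makes heap TID the final sort key, so that every entry in
a newly built index is physically unique.  A standalone qsort() sketch of that ordering
(the struct and values are made up for illustration):

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	int				key;	/* stands in for the user-visible attributes */
	unsigned		blk;	/* heap TID block number */
	unsigned short	off;	/* heap TID offset number */
} SpoolEntrySketch;

static int
entry_cmp(const void *a, const void *b)
{
	const SpoolEntrySketch *x = a;
	const SpoolEntrySketch *y = b;

	if (x->key != y->key)
		return x->key < y->key ? -1 : 1;
	/* equal keys: fall back to heap TID, so there is exactly one slot per TID */
	if (x->blk != y->blk)
		return x->blk < y->blk ? -1 : 1;
	return (int) x->off - (int) y->off;
}

int
main(void)
{
	SpoolEntrySketch entries[] = {{7, 2, 5}, {7, 1, 9}, {3, 8, 1}};

	qsort(entries, 3, sizeof(SpoolEntrySketch), entry_cmp);
	for (int i = 0; i < 3; i++)
		printf("(%d, %u, %u)\n", entries[i].key, entries[i].blk,
			   (unsigned) entries[i].off);
	return 0;
}
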
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index eb40b94070..5dbe833850 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,6 +49,8 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+			   IndexTuple firstright, BTScanInsert itup_scankey);
 
 
 /*
@@ -56,9 +58,25 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		Result is intended for use with _bt_compare().  Callers that don't
- *		need to fill out the insertion scankey arguments (e.g. they use an own
- *		ad-hoc comparison routine) can pass a NULL index tuple.
+ *		When itup is a non-pivot tuple, the returned insertion scan key is
+ *		suitable for finding a place for it to go on the leaf level.  Pivot
+ *		tuples can be used to relocate the leaf page with a matching high
+ *		key, but then the caller needs to set the scan key's minusinfkey
+ *		field.  This can be thought of as explicitly representing that
+ *		absent attributes in the scan key have minus infinity values.
+ *
+ *		Result is intended for use with _bt_compare() and _bt_truncate().
+ *		Callers that don't need to fill out the insertion scankey arguments
+ *		(e.g. they use their own ad-hoc comparison routine, or only need a
+ *		scankey for _bt_truncate()) can pass a NULL index tuple.  The
+ *		scankey will be initialized as if an "all truncated" pivot tuple
+ *		was passed instead.
+ *
+ *		Note that we may occasionally have to share lock the metapage to
+ *		determine whether or not the keys in the index are expected to be
+ *		unique (i.e. if this is a "heapkeyspace" index).  We assume a
+ *		heapkeyspace index when caller passes a NULL tuple, allowing index
+ *		build callers to avoid accessing the non-existent metapage.
  */
 BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
@@ -81,15 +99,38 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	Assert(tupnatts <= indnatts);
 
 	/*
-	 * We'll execute search using scan key constructed on key columns. Non-key
-	 * (INCLUDE index) columns are always omitted from scan keys.
+	 * We'll execute search using scan key constructed on key columns.
+	 * Truncated attributes and non-key attributes are omitted from the final
+	 * scan key.
 	 */
 	inskey = palloc(offsetof(BTScanInsertData, scankeys) +
 					sizeof(ScanKeyData) * indnkeyatts);
+	inskey->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+
+	/*
+	 * Only heapkeyspace indexes support the "no minus infinity keys"
+	 * optimization.  !heapkeyspace indexes don't actually have minus infinity
+	 * attributes, but this allows us to avoid checking heapkeyspace
+	 * separately (explicit representation of number of key attributes in v3
+	 * indexes shouldn't confuse tie-breaker logic).
+	 *
+	 * There is never a need to explicitly represent truncated attributes as
+	 * having minus infinity values.  The only caller that may truly need to
+	 * search for negative infinity is the page deletion code.  It is
+	 * sufficient to omit trailing truncated attributes from the scankey
+	 * returned to that caller because caller relies on the fact that there
+	 * cannot be duplicate high keys in heapkeyspace indexes.  Caller also
+	 * opts out of the "no minus infinity key" optimization, so search moves
+	 * left on scankey-equal downlink in parent, allowing VACUUM caller to
+	 * reliably relocate leaf page undergoing deletion.
+	 */
+	inskey->minusinfkey = !inskey->heapkeyspace;
 	inskey->savebinsrch = inskey->restorebinsrch = false;
 	inskey->low = inskey->stricthigh = InvalidOffsetNumber;
 	inskey->nextkey = false;
 	inskey->keysz = Min(indnkeyatts, tupnatts);
+	inskey->scantid = inskey->heapkeyspace && itup ?
+		BTreeTupleGetHeapTID(itup) : NULL;
 	skey = inskey->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -105,9 +146,9 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
 
 		/*
-		 * Key arguments built when caller provides no tuple are defensively
-		 * represented as NULL values, though they should still not
-		 * participate in comparisons.
+		 * Key arguments built from truncated attributes (or when caller
+		 * provides no tuple) are defensively represented as NULL values,
+		 * though they should still not participate in comparisons.
 		 */
 		if (i < tupnatts)
 			arg = index_getattr(itup, i + 1, itupdesc, &null);
@@ -2045,38 +2086,245 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  We only need to keep firstright's
+ * attributes up to and including the first one that differs from lastleft.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that takes up more
+ * space than firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
  * Note that returned tuple's t_tid offset will hold the number of attributes
  * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * should only change truncated tuple's downlink.  Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID is appended) the size of the returned
+ * tuple is the size of the first right tuple plus an additional MAXALIGN()
+ * quantum.  This guarantee is important, since callers need to stay under
+ * the 1/3 of a page restriction on tuple size.  If this routine is ever
+ * taught to truncate within an attribute/datum, it will need to avoid
+ * returning an enlarged tuple to caller when truncation + TOAST compression
+ * ends up enlarging the final datum.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			 BTScanInsert itup_scankey)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int16		natts = IndexRelationGetNumberOfAttributes(rel);
+	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+	IndexTuple	pivot;
+	ItemPointer pivotheaptid;
+	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples.  It's never okay to
+	 * truncate a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be kept in truncated tuple */
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_scankey);
 
-	return truncated;
+#ifdef DEBUG_NO_TRUNCATE
+	/* Artificially force truncation to always append heap TID */
+	keepnatts = nkeyatts + 1;
+#endif
+
+	if (keepnatts <= natts)
+	{
+		IndexTuple	tidpivot;
+
+		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+
+		/*
+		 * If there is a distinguishing key attribute within keepnatts, there
+		 * is no need to add an explicit heap TID attribute to new pivot.
+		 */
+		if (keepnatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			return pivot;
+		}
+
+		/*
+		 * This must be an INCLUDE index where only non-key attributes could
+		 * be truncated away.  They are not considered part of the key space,
+		 * so it's still necessary to add a heap TID attribute to the new
+		 * pivot tuple.  Create an enlarged copy of our truncated firstright
+		 * tuple, with room at the end for the heap TID.
+		 */
+		Assert(natts != nkeyatts);
+		newsize = MAXALIGN(IndexTupleSize(pivot) + sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		/* cannot leak memory here */
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal, and
+		 * there are no non-key attributes that need to be truncated in
+		 * passing.  It's necessary to add a heap TID attribute to the new
+		 * pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = MAXALIGN(IndexTupleSize(firstright) + sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * We have to use heap TID as a unique-ifier in the new pivot tuple, since
+	 * no non-TID key attribute in the right item readily distinguishes the
+	 * right side of the split from the left side.  Use enlarged space that
+	 * holds a copy of first right tuple; place a heap TID value within the
+	 * extra space that remains at the end.
+	 *
+	 * nbtree conceptualizes this case as an inability to truncate away any
+	 * key attribute.  We must use an alternative representation of heap TID
+	 * within pivots because heap TID is only treated as an attribute within
+	 * nbtree (e.g., there is no explicit pg_attribute entry).
+	 *
+	 * Callers generally try to avoid choosing a split point that necessitates
+	 * that we do this.  Splits of pages that only involve a single distinct
+	 * value (or set of values) must end up here, though.
+	 */
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/*
+	 * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+	 * consider suffix truncation.  It seems like a good idea to follow that
+	 * example in cases where no truncation takes place -- use lastleft's heap
+	 * TID.  (This is also the closest value to negative infinity that's
+	 * legally usable.)
+	 */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  sizeof(ItemPointerData));
+	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+
+	/*
+	 * Lehman and Yao require that the downlink to the right page, which is to
+	 * be inserted into the parent page in the second phase of a page split, be
+	 * a strict lower bound on items on the right page, and a non-strict upper
+	 * bound for items on the left page.  Assert that heap TIDs follow these
+	 * invariants, since a heap TID value is apparently needed as a
+	 * tiebreaker.
+	 */
+#ifndef DEBUG_NO_TRUNCATE
+	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#else
+
+	/*
+	 * Those invariants aren't guaranteed to hold for lastleft + firstright
+	 * heap TID attribute values when they're considered here only because
+	 * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+	 * needed as a tiebreaker).  DEBUG_NO_TRUNCATE must therefore use a heap
+	 * TID value that always works as a strict lower bound for items to the
+	 * right.  In particular, it must avoid using firstright's leading key
+	 * attribute values along with lastleft's heap TID value when lastleft's
+	 * TID happens to be greater than firstright's TID.
+	 *
+	 * (We could just use all of lastleft instead, but that would complicate
+	 * caller's free space accounting, which makes the assumption that the new
+	 * pivot must be no larger than firstright plus a single MAXALIGN()
+	 * quantum.)
+	 */
+	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+
+	/*
+	 * The pivot heap TID should never be fully equal to firstright's heap
+	 * TID.  Note that the pivot heap TID will still end up equal to
+	 * lastleft's heap TID when that's the only value that's legally usable.
+	 */
+	ItemPointerSetOffsetNumber(pivotheaptid,
+							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#endif
+
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point.  Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in the new pivot tuple.
+ */
+static int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			   BTScanInsert itup_scankey)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keepnatts;
+	ScanKey		scankey;
+
+	/*
+	 * Be consistent about the representation of BTREE_VERSION 3 tuples across
+	 * Postgres versions; don't allow new pivot tuples to have truncated key
+	 * attributes there.  This keeps things consistent and simple for
+	 * verification tools that have to handle multiple versions.
+	 */
+	if (!itup_scankey->heapkeyspace)
+	{
+		Assert(nkeyatts != IndexRelationGetNumberOfAttributes(rel));
+		return nkeyatts;
+	}
+
+	scankey = itup_scankey->scankeys;
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+											scankey->sk_collation,
+											datum1,
+											datum2)) != 0)
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
 }
 
 /*
@@ -2090,15 +2338,17 @@ _bt_nonkey_truncate(Relation rel, IndexTuple itup)
  * preferred to calling here.  That's usually more convenient, and is always
  * more explicit.  Call here instead when offnum's tuple may be a negative
  * infinity tuple that uses the pre-v11 on-disk representation, or when a low
- * context check is appropriate.
+ * context check is appropriate.  This routine is as strict as possible about
+ * what is expected on each version of btree.
  */
 bool
-_bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
+_bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 {
 	int16		natts = IndexRelationGetNumberOfAttributes(rel);
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2118,16 +2368,26 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Leaf tuples that are not the page high key (non-pivot tuples)
-			 * should never be truncated
+			 * Non-pivot tuples currently never use alternative heap TID
+			 * representation -- even those within heapkeyspace indexes
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+				return false;
+
+			/*
+			 * Leaf tuples that are not the page high key (non-pivot tuples)
+			 * should never be truncated.  (Note that tupnatts must have been
+			 * inferred, rather than coming from an explicit on-disk
+			 * representation.)
+			 */
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2137,8 +2397,15 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 */
 			Assert(!P_RIGHTMOST(opaque));
 
-			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			/*
+			 * !heapkeyspace high key tuple contains only key attributes. Note
+			 * that tupnatts will only have been explicitly represented in
+			 * !heapkeyspace indexes that happen to have non-key attributes.
+			 */
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2150,7 +2417,11 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * its high key) is its negative infinity tuple.  Negative
 			 * infinity tuples are always truncated to zero attributes.  They
 			 * are a particular kind of pivot tuple.
-			 *
+			 */
+			if (heapkeyspace)
+				return tupnatts == 0;
+
+			/*
 			 * The number of attributes won't be explicitly represented if the
 			 * negative infinity tuple was generated during a page split that
 			 * occurred with a version of Postgres before v11.  There must be
@@ -2161,18 +2432,109 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == 0 ||
+			return tupnatts == 0 ||
 				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
 		{
 			/*
-			 * Tuple contains only key attributes despite on is it page high
-			 * key or not
+			 * !heapkeyspace downlink tuple with separator key contains only
+			 * key attributes.  Note that tupnatts will only have been
+			 * explicitly represented in !heapkeyspace indexes that happen to
+			 * have non-key attributes.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 
 	}
+
+	/* Handle heapkeyspace pivot tuples (excluding minus infinity items) */
+	Assert(heapkeyspace);
+
+	/*
+	 * Explicit representation of the number of attributes is mandatory with
+	 * heapkeyspace index pivot tuples, regardless of whether or not there are
+	 * non-key attributes.
+	 */
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+
+	/*
+	 * Heap TID is a tie-breaker key attribute, so it cannot be untruncated
+	 * when any other key attribute is truncated
+	 */
+	if (BTreeTupleGetHeapTID(itup) != NULL && tupnatts != nkeyatts)
+		return false;
+
+	/*
+	 * Pivot tuple must have at least one untruncated key attribute (minus
+	 * infinity pivot tuples are the only exception).  Pivot tuples can never
+	 * represent that there is a value present for a key attribute that
+	 * exceeds pg_index.indnkeyatts for the index.
+	 */
+	return tupnatts > 0 && tupnatts <= nkeyatts;
+}
+
+/*
+ * _bt_check_third_page() -- check whether tuple fits on a btree page at all.
+ *
+ * We actually need to be able to fit three items on every page, so restrict
+ * any one item to 1/3 the per-page available space.  Note that itemsz should
+ * not include the ItemId overhead.
+ *
+ * It might be useful to apply TOAST methods rather than throw an error here.
+ * Using out of line storage would break assumptions made by suffix truncation
+ * and by contrib/amcheck, though.
+ */
+void
+_bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
+					 Page page, IndexTuple newtup)
+{
+	Size		itemsz;
+	BTPageOpaque opaque;
+
+	itemsz = MAXALIGN(IndexTupleSize(newtup));
+
+	/* Double check item size against limit */
+	if (itemsz <= BTMaxItemSize(page))
+		return;
+
+	/*
+	 * The tuple is probably too large to fit on the page, but it's possible
+	 * that the index uses version 2 or version 3, or that the page is an
+	 * internal page, in which case a slightly higher limit applies.
+	 */
+	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
+		return;
+
+	/*
+	 * Internal page insertions cannot fail here, because that would mean that
+	 * an earlier leaf level insertion that should have failed didn't
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (!P_ISLEAF(opaque))
+		elog(ERROR, "cannot insert oversized tuple of size %zu on internal page of index \"%s\"",
+			 itemsz, RelationGetRelationName(rel));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+			 errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
+					itemsz,
+					needheaptidspace ? BTREE_VERSION : BTREE_NOVAC_VERSION,
+					needheaptidspace ? BTMaxItemSize(page) :
+					BTMaxItemSizeNoHeapTid(page),
+					RelationGetRelationName(rel)),
+			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+					   ItemPointerGetBlockNumber(&newtup->t_tid),
+					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   RelationGetRelationName(heap)),
+			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+					 "Consider a function index of an MD5 hash of the value, "
+					 "or use full text indexing."),
+			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index b0666b42df..876ff0c40f 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -103,7 +103,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 
 	md = BTPageGetMeta(metapg);
 	md->btm_magic = BTREE_MAGIC;
-	md->btm_version = BTREE_VERSION;
+	md->btm_version = xlrec->version;
 	md->btm_root = xlrec->root;
 	md->btm_level = xlrec->level;
 	md->btm_fastroot = xlrec->fastroot;
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -284,6 +268,8 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		OffsetNumber off;
 		IndexTuple	newitem = NULL;
 		Size		newitemsz = 0;
+		IndexTuple	left_hikey = NULL;
+		Size		left_hikeysz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 8d5c6ae0ab..fcac0cd8a9 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index f97a82ae7b..5b7637883e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4057,9 +4057,10 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required for
+	 * btree indexes, since heap TID is treated as an implicit last key
+	 * attribute in order to ensure that all keys in the index are physically
+	 * unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
@@ -4076,6 +4077,9 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
@@ -4128,6 +4132,9 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index dc2eafb566..a9a5f9bdfc 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -112,18 +112,53 @@ typedef struct BTMetaPageData
 #define BTPageGetMeta(p) \
 	((BTMetaPageData *) PageGetContents(p))
 
+/*
+ * Btree version 4 (used by indexes initialized by PostgreSQL v12) made
+ * general changes to the on-disk representation to add support for
+ * heapkeyspace semantics, necessitating a REINDEX to get heapkeyspace
+ * semantics in pg_upgrade scenarios.  We continue to offer support for
+ * BTREE_MIN_VERSION in order to support upgrades from PostgreSQL versions
+ * up to and including v10 to v12+ without requiring a REINDEX.
+ * Similarly, we continue to offer support for BTREE_NOVAC_VERSION to
+ * support upgrades from v11 to v12+ without requiring a REINDEX.
+ *
+ * We maintain PostgreSQL v11's ability to upgrade from BTREE_MIN_VERSION
+ * to BTREE_NOVAC_VERSION automatically.  v11's "no vacuuming" enhancement
+ * (the ability to skip full index scans during vacuuming) only requires
+ * two new metapage fields, which makes it possible to upgrade at any
+ * point that the metapage must be updated anyway (e.g. during a root page
+ * split).  Note also that there happened to be no changes in metapage
+ * layout for btree version 4.  All current metapage fields should have
+ * valid values set when a metapage WAL record is replayed.
+ *
+ * It's convenient to consider the "no vacuuming" enhancement (metapage
+ * layout compatibility) separately from heapkeyspace semantics, since
+ * each issue affects different areas.  This is just a convention; in
+ * practice a heapkeyspace index is always also a "no vacuuming" index.
+ */
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
+#define BTREE_VERSION	4		/* current version number */
 #define BTREE_MIN_VERSION	2	/* minimal supported version number */
+#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_truncate() will need to enlarge
+ * a leaf index tuple to make space for a tie-breaker heap TID
+ * attribute, which we account for here.
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*sizeof(ItemPointerData)) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeNoHeapTid(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
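
For concreteness, the arithmetic behind the two limits works out as follows,
assuming the default 8192 byte BLCKSZ, 8 byte MAXALIGN, a 24 byte page
header, 4 byte ItemIds, 6 byte ItemPointers, and a 16 byte BTPageOpaqueData
(these are my numbers, not something the patch itself spells out):

  BTMaxItemSizeNoHeapTid: (8192 - MAXALIGN(24 + 3*4) - 16) / 3
                          = (8192 - 40 - 16) / 3 = 2712
  BTMaxItemSize:          (8192 - MAXALIGN(24 + 3*4 + 3*6) - 16) / 3
                          = (8192 - 56 - 16) / 3 = 2706, MAXALIGN_DOWN -> 2704

So reserving room for three tie-breaker heap TIDs costs 8 bytes of per-item
headroom on a standard page, which is why _bt_check_third_page() reports a
different maximum for version 4 and pre-version 4 indexes.
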
@@ -203,22 +238,25 @@ typedef struct BTMetaPageData
  * their item pointer offset field, since pivot tuples never need to store a
  * real offset (downlinks only need to store a block number).  The offset
  * field only stores the number of attributes when the INDEX_ALT_TID_MASK
- * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * bit is set, though that number doesn't include the trailing heap TID
+ * attribute sometimes stored in pivot tuples -- that's represented by the
+ * presence of BT_HEAP_TID_ATTR.  INDEX_ALT_TID_MASK is only used for pivot
+ * tuples at present, though it's possible that it will be used within
+ * non-pivot tuples in the future.  All pivot tuples must have
+ * INDEX_ALT_TID_MASK set as of BTREE_VERSION 4.
  *
  * The 12 least significant offset bits are used to represent the number of
- * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
- * for future use (BT_RESERVED_OFFSET_MASK bits). BT_N_KEYS_OFFSET_MASK should
- * be large enough to store any number <= INDEX_MAX_KEYS.
+ * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 status bits
+ * (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for future
+ * use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any number of
+ * attributes <= INDEX_MAX_KEYS.
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
+
+/* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +279,16 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.  Note also that the number of key attributes must be
+ * explicitly represented in heapkeyspace pivot tuples.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +297,46 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
+ * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
+ * (as they must, as of BTREE_VERSION 4).  When non-pivot
+ * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
+ * probably also contain a heap TID at the end of the tuple.  We currently
+ * assume that a tuple with INDEX_ALT_TID_MASK set is a pivot tuple within
+ * heapkeyspace indexes (and that a tuple without it set must be a non-pivot
+ * tuple), but it might also be used by non-pivot tuples in the future.
+ * pg_upgrade'd !heapkeyspace indexes only set INDEX_ALT_TID_MASK in pivot
+ * tuples that actually originated with the truncation of one or more
+ * attributes.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   sizeof(ItemPointerData)) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
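
As a quick illustration of the offset bit layout described above, the
following standalone snippet (not backend code) packs an attribute count and
the heap TID flag into a 16-bit offset value and reads them back.  The mask
values match this hunk; the use of a bare uint16_t instead of a real
ItemPointer, and the omission of INDEX_ALT_TID_MASK (which lives in the
tuple header, not the offset), are simplifications for demonstration only.

#include <stdint.h>
#include <stdio.h>

#define BT_RESERVED_OFFSET_MASK	0xF000
#define BT_N_KEYS_OFFSET_MASK	0x0FFF
#define BT_HEAP_TID_ATTR		0x1000

int
main(void)
{
	uint16_t	offset = 0;

	/* "truncate" to 2 key attributes and note a trailing heap TID */
	offset = (offset & ~BT_N_KEYS_OFFSET_MASK) | 2;
	offset |= BT_HEAP_TID_ATTR;

	printf("natts = %u, has heap TID = %s\n",
		   (unsigned) (offset & BT_N_KEYS_OFFSET_MASK),
		   (offset & BT_HEAP_TID_ATTR) ? "yes" : "no");
	return 0;
}

Running it prints "natts = 2, has heap TID = yes".
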
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -326,25 +402,53 @@ typedef BTStackData *BTStack;
  * _bt_search.  For details on its mutable state, see _bt_binsrch and
  * _bt_findinsertloc.
  *
+ * heapkeyspace indicates if we expect all keys in the index to be unique by
+ * treating heap TID as a tie-breaker attribute (i.e. the index is
+ * BTREE_VERSION 4+).  scantid should never be set when index is not a
+ * heapkeyspace index.
+ *
+ * minusinfkey controls an optimization used by heapkeyspace indexes.  When
+ * minusinfkey is false (the usual case), _bt_compare will consider a
+ * scankey greater than a pivot tuple where all explicitly represented
+ * attributes are equal to the scankey, provided that the pivot tuple has at
+ * least one attribute truncated away (this is often just the heap TID
+ * attribute).  We exploit the fact that minus infinity is a value that only
+ * appears in pivot tuples (to make suffix truncation work), and is therefore
+ * not interesting (page deletion by VACUUM is the one case where the
+ * optimization cannot be used, since a leaf page is relocated using its high
+ * key).  This optimization allows us to get the full benefit of suffix
+ * truncation, particularly with indexes where each distinct set of user
+ * attribute keys appears in at least a few duplicate entries.
+ *
  * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
  * locate the first item >= scankey.  When nextkey is true, they will locate
  * the first item > scan key.
  *
- * keysz is the number of insertion scankeys present.
+ * scantid is the heap TID that is used as a final tie-breaker attribute,
+ * which may be set to NULL to indicate its absence.  When inserting new
+ * tuples, it must be set, since every tuple in the tree unambiguously belongs
+ * in one exact position, even when there are entries in the tree that are
+ * considered duplicates by external code.  Unique insertions set scantid only
+ * after unique checking indicates that it's safe to insert.  Despite the
+ * representational difference, scantid is just another insertion scankey to
+ * routines like _bt_search.
  *
- * scankeys is an array of scan key entries for attributes that are compared.
- * During insertion, there must be a scan key for every attribute, but when
- * starting a regular index scan some can be omitted.  The array is used as a
- * flexible array member, though it's sized in a way that makes it possible to
- * use stack allocations.  See nbtree/README for full details.
+ * keysz is the number of insertion scankeys present, not including scantid.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared
+ * before scantid (user-visible attributes).  During insertion, there must be
+ * a scan key for every attribute, but when starting a regular index scan some
+ * can be omitted.  The array is used as a flexible array member, though it's
+ * sized in a way that makes it possible to use stack allocations.  See
+ * nbtree/README for full details.
  */
 
 typedef struct BTScanInsertData
 {
 	/*
 	 * Mutable state used by _bt_binsrch to inexpensively repeat a binary
-	 * search on the leaf level.  Only used for insertions where
-	 * _bt_check_unique is called.
+	 * search on the leaf level when only scantid has changed.  Only used for
+	 * insertions where _bt_check_unique is called.
 	 */
 	bool		savebinsrch;
 	bool		restorebinsrch;
@@ -352,7 +456,10 @@ typedef struct BTScanInsertData
 	OffsetNumber stricthigh;
 
 	/* State used to locate a position at the leaf level */
+	bool		heapkeyspace;
+	bool		minusinfkey;
 	bool		nextkey;
+	ItemPointer scantid;		/* tiebreaker for scankeys */
 	int			keysz;			/* Size of scankeys */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
 } BTScanInsertData;
@@ -582,6 +689,7 @@ extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
+extern bool _bt_heapkeyspace(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -635,8 +743,12 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
-extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+			 IndexTuple firstright, BTScanInsert itup_scankey);
+extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
+				OffsetNumber offnum);
+extern void _bt_check_third_page(Relation rel, Relation heap,
+					 bool needheaptidspace, Page page, IndexTuple newtup);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index a605851c98..a4cbdff283 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,7 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50 and 0x60 are unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -47,6 +46,7 @@
  */
 typedef struct xl_btree_metadata
 {
+	uint32		version;
 	BlockNumber root;
 	uint32		level;
 	BlockNumber fastroot;
@@ -82,20 +82,16 @@ typedef struct xl_btree_insert
  *
  * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
  * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always explicitly log the left page high key value.
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * In the _R variants, the new item is one of the right page's tuples.  An
+ * IndexTuple representing the new high key of the left page always follows,
+ * even on leaf pages, since suffix truncation can make it differ from the
+ * leftmost key on the new right page.
  *
  * Backup Blk 1: new right page
  *
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8e50f8ffbb..8d31110b87 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 -- doesn't work: grant still exists
 DROP USER regress_dep_user1;
 ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
-DETAIL:  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
+DETAIL:  privileges for table deptest1
 privileges for database regression
-privileges for table deptest1
+owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
 DROP OWNED BY regress_dep_user1;
 DROP USER regress_dep_user1;
 \set VERBOSITY terse
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index 0e32d5c427..ac41419c7b 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -187,9 +187,9 @@ ERROR:  event trigger "regress_event_trigger" does not exist
 -- should fail, regress_evt_user owns some objects
 drop role regress_evt_user;
 ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
-DETAIL:  owner of event trigger regress_event_trigger3
+DETAIL:  owner of user mapping for regress_evt_user on server useless_server
 owner of default privileges on new relations belonging to role regress_evt_user
-owner of user mapping for regress_evt_user on server useless_server
+owner of event trigger regress_event_trigger3
 -- cleanup before next test
 -- these are all OK; the second one should emit a NOTICE
 drop event trigger if exists regress_event_trigger2;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 4d82d3a7e8..9c763ec184 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
 RESET ROLE;
 DROP ROLE regress_test_indirect;                            -- ERROR
 ERROR:  role "regress_test_indirect" cannot be dropped because some objects depend on it
-DETAIL:  owner of server s1
-privileges for foreign-data wrapper foo
+DETAIL:  privileges for foreign-data wrapper foo
+owner of server s1
 \des+
                                                                                  List of foreign servers
  Name |           Owner           | Foreign-data wrapper |                   Access privileges                   |  Type  | Version |             FDW options              | Description 
@@ -1998,9 +1998,9 @@ DROP TABLE temp_parted;
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 ERROR:  role "regress_test_role" cannot be dropped because some objects depend on it
-DETAIL:  privileges for server s4
+DETAIL:  owner of user mapping for regress_test_role on server s6
 privileges for foreign-data wrapper foo
-owner of user mapping for regress_test_role on server s6
+privileges for server s4
 DROP SERVER t1 CASCADE;
 NOTICE:  drop cascades to user mapping for public on server t1
 DROP USER MAPPING FOR regress_test_role SERVER s6;
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index 1d12b01068..06fe44d39a 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3502,8 +3502,8 @@ SELECT refclassid::regclass, deptype
 SAVEPOINT q;
 DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
 ERROR:  role "regress_rls_eve" cannot be dropped because some objects depend on it
-DETAIL:  target of policy p on table tbl1
-privileges for table tbl1
+DETAIL:  privileges for table tbl1
+target of policy p on table tbl1
 ROLLBACK TO q;
 ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
 SAVEPOINT q;
-- 
2.17.1

v11-0004-Pick-nbtree-split-points-discerningly.patch (application/x-patch)
From ea81031c9f0f0e9466a632a1ac1eae5b75485318 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 15:51:53 -0700
Subject: [PATCH v11 4/7] Pick nbtree split points discerningly.

Add infrastructure to weigh how effective suffix truncation will be when
choosing a split point.  This should not noticeably affect the balance
of free space within each half of the split, while still making suffix
truncation truncate away significantly more attributes on average.

The logic for choosing a split point is also taught to care about the
case where there are many duplicates, making it hard to find a
distinguishing split point.  It may even conclude that the page being
split is already full of logical duplicates, in which case it packs the
left half very tightly, while leaving the right half mostly empty.  Our
assumption is that logical duplicates will almost always be inserted in
ascending heap TID order.  This strategy leaves most of the free space
on the half of the split that will likely be where future logical
duplicates of the same value need to be placed.

The number of cycles added is not very noticeable.  This is important
because deciding on a split point takes place while at least one
exclusive buffer lock is held.  Unlike suffix truncation proper, we avoid
authoritative insertion scankey comparisons here to save cycles, and use a
faster binary comparison instead.

This patch is required to credibly assess anything about the performance
of the patch series.  Applying the patches up to and including this
patch in the series is sufficient to see much better space utilization
and space reuse with cases where many duplicates are inserted.  (Cases
resulting in searches for free space among many pages full of
duplicates, where the search inevitably "gets tired" on the master
branch [1]).

[1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
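
As a purely illustrative aside, the selection rule is easiest to picture as
"among the roughly balanced candidate split points, take the least
restrictive one".  Below is a minimal standalone sketch of that rule,
assuming the caller has already computed a free-space delta and a
truncation "penalty" (the number of leading attributes needed to tell the
tuples enclosing the split apart) for each candidate.  The names
(SplitCandidate, choose_split), the tolerance, and the inputs are all
hypothetical; this is not the patch's _bt_findsplitloc/_bt_checksplitloc
code, which derives deltas and penalties from the actual page items.

#include <stdio.h>

typedef struct SplitCandidate
{
	int		firstright;	/* offset of first tuple that goes to the right */
	int		delta;		/* abs(leftfree - rightfree) for this split */
	int		penalty;	/* attrs needed to distinguish lastleft/firstright */
} SplitCandidate;

/* Pick the lowest-penalty candidate among the "good enough" deltas */
static int
choose_split(const SplitCandidate *cand, int ncand, int tolerance)
{
	int		bestdelta = cand[0].delta;
	int		winner = 0;

	for (int i = 1; i < ncand; i++)
		if (cand[i].delta < bestdelta)
			bestdelta = cand[i].delta;

	for (int i = 0; i < ncand; i++)
	{
		if (cand[i].delta > bestdelta + tolerance)
			continue;		/* too lopsided to consider */
		if (cand[i].penalty < cand[winner].penalty ||
			cand[winner].delta > bestdelta + tolerance)
			winner = i;
	}
	return winner;
}

int
main(void)
{
	SplitCandidate cand[] = {
		{40, 48, 3},		/* perfectly balanced, needs 3 attributes */
		{41, 96, 1},		/* slightly lopsided, attribute 1 suffices */
		{45, 512, 1},		/* far too lopsided */
	};
	int		winner = choose_split(cand, 3, 200);

	printf("split before offset %d (delta=%d, penalty=%d)\n",
		   cand[winner].firstright, cand[winner].delta,
		   cand[winner].penalty);
	return 0;
}

This prints "split before offset 41 (delta=96, penalty=1)": the perfectly
balanced candidate loses to a slightly lopsided one whose enclosing tuples
already differ on their first attribute.
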
---
 src/backend/access/nbtree/Makefile      |   2 +-
 src/backend/access/nbtree/README        |  47 +-
 src/backend/access/nbtree/nbtinsert.c   | 295 +--------
 src/backend/access/nbtree/nbtsplitloc.c | 822 ++++++++++++++++++++++++
 src/backend/access/nbtree/nbtutils.c    |  49 ++
 src/include/access/nbtree.h             |  29 +-
 6 files changed, 950 insertions(+), 294 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtsplitloc.c

diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bbb21d235c..9aab9cf64a 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = nbtcompare.o nbtinsert.o nbtpage.o nbtree.o nbtsearch.o \
-       nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
+       nbtsplitloc.o nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index be9bf61d47..cdd68b6f75 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -155,9 +155,9 @@ Lehman and Yao assume fixed-size keys, but we must deal with
 variable-size keys.  Therefore there is not a fixed maximum number of
 keys per page; we just stuff in as many as will fit.  When we split a
 page, we try to equalize the number of bytes, not items, assigned to
-each of the resulting pages.  Note we must include the incoming item in
-this calculation, otherwise it is possible to find that the incoming
-item doesn't fit on the split page where it needs to go!
+pages (though suffix truncation is also considered).  Note we must include
+the incoming item in this calculation, otherwise it is possible to find
+that the incoming item doesn't fit on the split page where it needs to go!
 
 The Deletion Algorithm
 ----------------------
@@ -657,6 +657,47 @@ variable-length types, such as text.  An opclass support function could
 manufacture the shortest possible key value that still correctly separates
 each half of a leaf page split.
 
+There are sophisticated criteria for choosing a leaf page split point.  The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split.  The
+choice of leaf split point can be thought of as a choice among points
+*between* items on the page to be split, at least if you pretend that the
+incoming tuple was placed on the page already (you have to pretend because
+there won't actually be enough space for it on the page).  Choosing the
+split point between two index tuples where the first non-equal attribute
+appears as early as possible results in truncating away as many suffix
+attributes as possible.  Evenly balancing space among each half of the
+split is usually the first concern, but even small adjustments in the
+precise split point can allow truncation to be far more effective.
+
+Suffix truncation is primarily valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective.  Even truncation that doesn't make pivot tuples
+smaller due to alignment still prevents pivot tuples from being more
+restrictive than truly necessary in how they describe which values belong
+on which pages.
+
+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page.  We choose the split point whose downlink (the tuple that
+must be inserted in the parent) is the smallest one available within an
+acceptable range of the fillfactor-wise optimal split point.  This idea
+also comes from the Prefix B-Tree paper.  The process has much in common
+with what happens at the leaf level to make suffix truncation effective.
+The overall effect is that suffix truncation tends to produce smaller and less
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, less discriminating
+pivot tuples end up in the root page, delaying root page splits.
+
+Logical duplicates are given special consideration.  The logic for
+selecting a split point goes to great lengths to avoid having duplicates
+span more than one page, and almost always manages to pick a split point
+between two user-key-distinct tuples, accepting a completely lopsided split
+if it must.  When a page that's already full of duplicates must be split,
+the fallback strategy assumes that duplicates are mostly inserted in
+ascending heap TID order.  The page is split in a way that leaves the left
+half of the page mostly full, and the right half of the page mostly empty.
+
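
A rough standalone sketch of that fallback, assuming equally sized
duplicates and an illustrative 95% left-page target (the patch's real
target and space accounting differ, and item headers, the high key and the
incoming tuple are all ignored here):

#include <stdio.h>

/* Return the offset of the first item that should go to the right page */
static int
single_value_split(const int *itemsz, int nitems, int pagespace)
{
	int		target = pagespace * 95 / 100;
	int		leftsz = 0;

	for (int off = 0; off < nitems; off++)
	{
		if (leftsz + itemsz[off] > target)
			return off;
		leftsz += itemsz[off];
	}
	return nitems;		/* everything fits on the left */
}

int
main(void)
{
	int		itemsz[200];

	for (int i = 0; i < 200; i++)
		itemsz[i] = 40;		/* 200 equal-sized duplicates */

	/* pretend there are 8000 usable bytes on the page */
	printf("firstright offset = %d of 200\n",
		   single_value_split(itemsz, 200, 8000));
	return 0;
}

With these made-up numbers the split lands before item 190, leaving the
left page about 95% full and the right page nearly empty, ready for future
ascending-TID insertions of the same value.
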
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 9018e5fe53..37162c449e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,26 +28,6 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
-typedef struct
-{
-	/* context data for _bt_checksplitloc */
-	Size		newitemsz;		/* size of new item to be inserted */
-	int			fillfactor;		/* needed when splitting rightmost page */
-	bool		is_leaf;		/* T if splitting a leaf page */
-	bool		is_rightmost;	/* T if splitting a rightmost page */
-	OffsetNumber newitemoff;	/* where the new item is to be inserted */
-	int			leftspace;		/* space available for items on left page */
-	int			rightspace;		/* space available for items on right page */
-	int			olddataitemstotal;	/* space taken by old items */
-
-	bool		have_split;		/* found a valid split? */
-
-	/* these fields valid only if have_split is true */
-	bool		newitemonleft;	/* new item on left or right of best split */
-	OffsetNumber firstright;	/* best split point */
-	int			best_delta;		/* best size delta so far */
-} FindSplitData;
-
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -76,13 +56,6 @@ static Buffer _bt_split(Relation rel, BTScanInsert itup_scankey, Buffer buf,
 		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
-static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft);
-static void _bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright, bool newitemonleft,
-				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_scankey,
@@ -894,8 +867,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
  *
  *		This recursive procedure does the following things:
  *
- *			+  if necessary, splits the target page (making sure that the
- *			   split is equitable as far as post-insert free space goes).
+ *			+  if necessary, splits the target page.
  *			+  inserts the tuple.
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
@@ -990,7 +962,7 @@ _bt_insertonpg(Relation rel,
 
 		/* Choose the split point */
 		firstright = _bt_findsplitloc(rel, page,
-									  newitemoff, itemsz,
+									  newitemoff, itemsz, itup,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
@@ -1325,6 +1297,11 @@ _bt_split(Relation rel, BTScanInsert itup_scankey, Buffer buf, Buffer cbuf,
 	 * to go into the new right page, or possibly a truncated version if this
 	 * is a leaf page split.  This might be either the existing data item at
 	 * position firstright, or the incoming tuple.
+	 *
+	 * Lehman and Yao use the last left item as the new high key for the left
+	 * page.  Despite appearances, the new high key is generated in a way
+	 * that's consistent with their approach.  See comments above
+	 * _bt_findsplitloc for an explanation.
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1664,264 +1641,6 @@ _bt_split(Relation rel, BTScanInsert itup_scankey, Buffer buf, Buffer cbuf,
 	return rbuf;
 }
 
-/*
- *	_bt_findsplitloc() -- find an appropriate place to split a page.
- *
- * The idea here is to equalize the free space that will be on each split
- * page, *after accounting for the inserted tuple*.  (If we fail to account
- * for it, we might find ourselves with too little room on the page that
- * it needs to go into!)
- *
- * If the page is the rightmost page on its level, we instead try to arrange
- * to leave the left split page fillfactor% full.  In this way, when we are
- * inserting successively increasing keys (consider sequences, timestamps,
- * etc) we will end up with a tree whose pages are about fillfactor% full,
- * instead of the 50% full result that we'd get without this special case.
- * This is the same as nbtsort.c produces for a newly-created tree.  Note
- * that leaf and nonleaf pages use different fillfactors.
- *
- * We are passed the intended insert position of the new tuple, expressed as
- * the offsetnumber of the tuple it must go in front of.  (This could be
- * maxoff+1 if the tuple is to go at the end.)
- *
- * We return the index of the first existing tuple that should go on the
- * righthand page, plus a boolean indicating whether the new tuple goes on
- * the left or right page.  The bool is necessary to disambiguate the case
- * where firstright == newitemoff.
- */
-static OffsetNumber
-_bt_findsplitloc(Relation rel,
-				 Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft)
-{
-	BTPageOpaque opaque;
-	OffsetNumber offnum;
-	OffsetNumber maxoff;
-	ItemId		itemid;
-	FindSplitData state;
-	int			leftspace,
-				rightspace,
-				goodenough,
-				olddataitemstotal,
-				olddataitemstoleft;
-	bool		goodenoughfound;
-
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
-	newitemsz += sizeof(ItemIdData);
-
-	/* Total free space available on a btree page, after fixed overhead */
-	leftspace = rightspace =
-		PageGetPageSize(page) - SizeOfPageHeaderData -
-		MAXALIGN(sizeof(BTPageOpaqueData));
-
-	/* The right page will have the same high key as the old page */
-	if (!P_RIGHTMOST(opaque))
-	{
-		itemid = PageGetItemId(page, P_HIKEY);
-		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
-							 sizeof(ItemIdData));
-	}
-
-	/* Count up total space in data items without actually scanning 'em */
-	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
-
-	state.newitemsz = newitemsz;
-	state.is_leaf = P_ISLEAF(opaque);
-	state.is_rightmost = P_RIGHTMOST(opaque);
-	state.have_split = false;
-	if (state.is_leaf)
-		state.fillfactor = RelationGetFillFactor(rel,
-												 BTREE_DEFAULT_FILLFACTOR);
-	else
-		state.fillfactor = BTREE_NONLEAF_FILLFACTOR;
-	state.newitemonleft = false;	/* these just to keep compiler quiet */
-	state.firstright = 0;
-	state.best_delta = 0;
-	state.leftspace = leftspace;
-	state.rightspace = rightspace;
-	state.olddataitemstotal = olddataitemstotal;
-	state.newitemoff = newitemoff;
-
-	/*
-	 * Finding the best possible split would require checking all the possible
-	 * split points, because of the high-key and left-key special cases.
-	 * That's probably more work than it's worth; instead, stop as soon as we
-	 * find a "good-enough" split, where good-enough is defined as an
-	 * imbalance in free space of no more than pagesize/16 (arbitrary...) This
-	 * should let us stop near the middle on most pages, instead of plowing to
-	 * the end.
-	 */
-	goodenough = leftspace / 16;
-
-	/*
-	 * Scan through the data items and calculate space usage for a split at
-	 * each possible position.
-	 */
-	olddataitemstoleft = 0;
-	goodenoughfound = false;
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	for (offnum = P_FIRSTDATAKEY(opaque);
-		 offnum <= maxoff;
-		 offnum = OffsetNumberNext(offnum))
-	{
-		Size		itemsz;
-
-		itemid = PageGetItemId(page, offnum);
-		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
-
-		/*
-		 * Will the new item go to left or right of split?
-		 */
-		if (offnum > newitemoff)
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-		else if (offnum < newitemoff)
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		else
-		{
-			/* need to try it both ways! */
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		}
-
-		/* Abort scan once we find a good-enough choice */
-		if (state.have_split && state.best_delta <= goodenough)
-		{
-			goodenoughfound = true;
-			break;
-		}
-
-		olddataitemstoleft += itemsz;
-	}
-
-	/*
-	 * If the new item goes as the last item, check for splitting so that all
-	 * the old items go to the left page and the new item goes to the right
-	 * page.
-	 */
-	if (newitemoff > maxoff && !goodenoughfound)
-		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
-
-	/*
-	 * I believe it is not possible to fail to find a feasible split, but just
-	 * in case ...
-	 */
-	if (!state.have_split)
-		elog(ERROR, "could not find a feasible split point for index \"%s\"",
-			 RelationGetRelationName(rel));
-
-	*newitemonleft = state.newitemonleft;
-	return state.firstright;
-}
-
-/*
- * Subroutine to analyze a particular possible split choice (ie, firstright
- * and newitemonleft settings), and record the best split so far in *state.
- *
- * firstoldonright is the offset of the first item on the original page
- * that goes to the right page, and firstoldonrightsz is the size of that
- * tuple. firstoldonright can be > max offset, which means that all the old
- * items go to the left page and only the new item goes to the right page.
- * In that case, firstoldonrightsz is not used.
- *
- * olddataitemstoleft is the total size of all old items to the left of
- * firstoldonright.
- */
-static void
-_bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright,
-				  bool newitemonleft,
-				  int olddataitemstoleft,
-				  Size firstoldonrightsz)
-{
-	int			leftfree,
-				rightfree;
-	Size		firstrightitemsz;
-	bool		newitemisfirstonright;
-
-	/* Is the new item going to be the first item on the right page? */
-	newitemisfirstonright = (firstoldonright == state->newitemoff
-							 && !newitemonleft);
-
-	if (newitemisfirstonright)
-		firstrightitemsz = state->newitemsz;
-	else
-		firstrightitemsz = firstoldonrightsz;
-
-	/* Account for all the old tuples */
-	leftfree = state->leftspace - olddataitemstoleft;
-	rightfree = state->rightspace -
-		(state->olddataitemstotal - olddataitemstoleft);
-
-	/*
-	 * The first item on the right page becomes the high key of the left page;
-	 * therefore it counts against left space as well as right space. When
-	 * index has included attributes, then those attributes of left page high
-	 * key will be truncated leaving that page with slightly more free space.
-	 * However, that shouldn't affect our ability to find valid split
-	 * location, because anyway split location should exists even without high
-	 * key truncation.
-	 */
-	leftfree -= firstrightitemsz;
-
-	/* account for the new item */
-	if (newitemonleft)
-		leftfree -= (int) state->newitemsz;
-	else
-		rightfree -= (int) state->newitemsz;
-
-	/*
-	 * If we are not on the leaf level, we will be able to discard the key
-	 * data from the first item that winds up on the right page.
-	 */
-	if (!state->is_leaf)
-		rightfree += (int) firstrightitemsz -
-			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
-
-	/*
-	 * If feasible split point, remember best delta.
-	 */
-	if (leftfree >= 0 && rightfree >= 0)
-	{
-		int			delta;
-
-		if (state->is_rightmost)
-		{
-			/*
-			 * If splitting a rightmost page, try to put (100-fillfactor)% of
-			 * free space on left page. See comments for _bt_findsplitloc.
-			 */
-			delta = (state->fillfactor * leftfree)
-				- ((100 - state->fillfactor) * rightfree);
-		}
-		else
-		{
-			/* Otherwise, aim for equal free space on both sides */
-			delta = leftfree - rightfree;
-		}
-
-		if (delta < 0)
-			delta = -delta;
-		if (!state->have_split || delta < state->best_delta)
-		{
-			state->have_split = true;
-			state->newitemonleft = newitemonleft;
-			state->firstright = firstoldonright;
-			state->best_delta = delta;
-		}
-	}
-}
-
 /*
  * _bt_insert_parent() -- Insert downlink into parent after a page split.
  *
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
new file mode 100644
index 0000000000..7f337bac55
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -0,0 +1,822 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsplitloc.c
+ *	  Choose split point code for Postgres btree implementation.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtsplitloc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "storage/lmgr.h"
+
+/* _bt_dofindsplitloc limits on suffix truncation split interval */
+#define MAX_LEAF_SPLIT_POINTS		9
+#define MAX_INTERNAL_SPLIT_POINTS	3
+
+typedef enum
+{
+	/* strategy to use for a call to FindSplitData */
+	SPLIT_DEFAULT,				/* give some weight to truncation */
+	SPLIT_MANY_DUPLICATES,		/* find minimally distinguishing point */
+	SPLIT_SINGLE_VALUE			/* leave left page almost full */
+} SplitMode;
+
+typedef struct
+{
+	/* FindSplitData candidate split */
+	int			delta;			/* size delta */
+	bool		newitemonleft;	/* new item on left or right of split */
+	OffsetNumber firstright;	/* split point */
+} SplitPoint;
+
+typedef struct
+{
+	/* context data for _bt_checksplitloc */
+	SplitMode	mode;			/* strategy for deciding split point */
+	Size		newitemsz;		/* size of new item to be inserted */
+	double		propfullonleft; /* want propfullonleft * leftfree on left */
+	int			goodenough;		/* good enough left/right space delta */
+	bool		is_leaf;		/* T if splitting a leaf page */
+	bool		is_weighted;	/* T if propfullonleft used by split */
+	OffsetNumber newitemoff;	/* where the new item is to be inserted */
+	int			leftspace;		/* space available for items on left page */
+	int			rightspace;		/* space available for items on right page */
+	int			olddataitemstotal;	/* space taken by old items */
+
+	int			maxsplits;		/* Maximum number of splits */
+	int			nsplits;		/* Current number of splits */
+	SplitPoint *splits;			/* Sorted by delta */
+} FindSplitData;
+
+static OffsetNumber _bt_dofindsplitloc(Relation rel, Page page,
+				   SplitMode mode, OffsetNumber newitemoff,
+				   Size newitemsz, IndexTuple newitem, bool *newitemonleft);
+static int _bt_checksplitloc(FindSplitData *state,
+				  OffsetNumber firstoldonright, bool newitemonleft,
+				  int dataitemstoleft, Size firstoldonrightsz);
+static OffsetNumber _bt_bestsplitloc(Relation rel, Page page,
+				 FindSplitData *state,
+				 int perfectpenalty,
+				 OffsetNumber newitemoff,
+				 IndexTuple newitem, bool *newitemonleft);
+static int _bt_perfect_penalty(Relation rel, Page page, FindSplitData *state,
+					OffsetNumber newitemoff, IndexTuple newitem,
+					SplitMode *secondmode);
+static int _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split, bool is_leaf);
+
+
+/*
+ *	_bt_findsplitloc() -- find an appropriate place to split a page.
+ *
+ * The main goal here is to equalize the free space that will be on each
+ * split page, *after accounting for the inserted tuple*.  (If we fail to
+ * account for it, we might find ourselves with too little room on the page
+ * that it needs to go into!)
+ *
+ * We are passed the intended insert position of the new tuple, expressed as
+ * the offsetnumber of the tuple it must go in front of.  (This could be
+ * maxoff+1 if the tuple is to go at the end.)
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating whether the new tuple goes on
+ * the left or right page.  The bool is necessary to disambiguate the case
+ * where firstright == newitemoff.
+ *
+ * The high key for the left page is formed using the first item on the
+ * right page, which may seem to be contrary to Lehman & Yao's approach of
+ * using the left page's last item as its new high key.  It isn't, though;
+ * suffix truncation will leave the left page's high key equal to the last
+ * item on the left page when two tuples with equal key values enclose the
+ * split point.  It's convenient to always express a split point as a
+ * firstright offset due to internal page splits, which leave us with a
+ * right half whose first item becomes a negative infinity item through
+ * truncation to 0 attributes.  In effect, internal page splits store
+ * firstright's "separator" key at the end of the left page (as left's new
+ * high key), and store its downlink at the start of the right page.  In
+ * other words, internal page splits conceptually split in the middle of the
+ * firstright tuple, not on either side of it.  Crucially, when splitting
+ * either a leaf page or an internal page, the new high key will be strictly
+ * less than the first item on the right page in all cases, despite the fact
+ * that we start with the assumption that firstright becomes the new high
+ * key.
+ */
+OffsetNumber
+_bt_findsplitloc(Relation rel,
+				 Page page,
+				 OffsetNumber newitemoff,
+				 Size newitemsz,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	/* Initial call always uses SPLIT_DEFAULT */
+	return _bt_dofindsplitloc(rel, page, SPLIT_DEFAULT, newitemoff, newitemsz,
+							  newitem, newitemonleft);
+}
+
+/*
+ *	_bt_dofindsplitloc() -- guts of find split location code.
+ *
+ * We give some weight to suffix truncation in deciding a split point
+ * on leaf pages.  We try to select a point where a distinguishing attribute
+ * appears earlier in the new high key for the left side of the split, in
+ * order to maximize the number of trailing attributes that can be truncated
+ * away.  Initially, only candidate split points that imply an acceptable
+ * balance of free space on each side are considered.  This is even useful
+ * with pages that only have a single (non-TID) attribute, since it's
+ * helpful to avoid appending an explicit heap TID attribute to the new
+ * pivot tuple (high key/downlink) when it cannot actually be truncated.
+ * Note that it is always assumed that caller goes on to perform truncation,
+ * even with pg_upgrade'd indexes where that isn't actually the case.  There
+ * is still a modest benefit to choosing a split location while weighing
+ * suffix truncation: the resulting (untruncated) pivot tuples are
+ * nevertheless more predictive of future space utilization.
+ *
+ * We do all we can to avoid having to append a heap TID in the new high
+ * key.  We may have to call ourselves recursively in many duplicates mode.
+ * This happens when a heap TID would otherwise be appended, but the page
+ * isn't completely full of logical duplicates (there may be as few as two
+ * distinct values).  Many duplicates mode has no hard requirements for
+ * space utilization, though it still keeps the use of space balanced as a
+ * non-binding secondary goal.  This significantly improves fan-out in
+ * practice, at least with most affected workloads.
+ *
+ * If the page is the rightmost page on its level, we instead try to arrange
+ * to leave the left split page fillfactor% full.  In this way, when we are
+ * inserting successively increasing keys (consider sequences, timestamps,
+ * etc) we will end up with a tree whose pages are about fillfactor% full,
+ * instead of the 50% full result that we'd get without this special case.
+ * This is the same as nbtsort.c produces for a newly-created tree.  Note
+ * that leaf and nonleaf pages use different fillfactors.
+ *
+ * When called recursively in single value mode we try to arrange to leave
+ * the left split page even more full than in the fillfactor% rightmost page
+ * case.  This maximizes space utilization in cases where tuples with the
+ * same attribute values span across many pages.  Newly inserted duplicates
+ * will tend to have higher heap TID values, so we'll end up splitting to
+ * the right in the manner of ascending insertions of monotonically
+ * increasing values.  See nbtree/README for more information about suffix
+ * truncation, and how a split point is chosen.
+ */
+static OffsetNumber
+_bt_dofindsplitloc(Relation rel,
+				   Page page,
+				   SplitMode mode,
+				   OffsetNumber newitemoff,
+				   Size newitemsz,
+				   IndexTuple newitem,
+				   bool *newitemonleft)
+{
+	BTPageOpaque opaque;
+	OffsetNumber offnum;
+	OffsetNumber maxoff;
+	ItemId		itemid;
+	FindSplitData state;
+	int			leftspace,
+				rightspace,
+				olddataitemstotal,
+				olddataitemstoleft,
+				perfectpenalty,
+				leaffillfactor;
+	bool		goodenoughfound;
+	SplitPoint	splits[MAX_LEAF_SPLIT_POINTS];
+	SplitMode	secondmode;
+	OffsetNumber finalfirstright;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Total free space available on a btree page, after fixed overhead */
+	leftspace = rightspace =
+		PageGetPageSize(page) - SizeOfPageHeaderData -
+		MAXALIGN(sizeof(BTPageOpaqueData));
+
+	/* The right page will have the same high key as the old page */
+	if (!P_RIGHTMOST(opaque))
+	{
+		itemid = PageGetItemId(page, P_HIKEY);
+		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
+							 sizeof(ItemIdData));
+	}
+
+	/* Count up total space in data items without actually scanning 'em */
+	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
+	leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	state.mode = mode;
+	state.newitemsz = newitemsz + sizeof(ItemIdData);
+	state.is_leaf = P_ISLEAF(opaque);
+	state.leftspace = leftspace;
+	state.rightspace = rightspace;
+	state.olddataitemstotal = olddataitemstotal;
+	state.newitemoff = newitemoff;
+	state.splits = splits;
+	state.nsplits = 0;
+	if (!state.is_leaf)
+	{
+		Assert(state.mode == SPLIT_DEFAULT);
+
+		/* propfullonleft only used on rightmost page */
+		state.propfullonleft = BTREE_NONLEAF_FILLFACTOR / 100.0;
+		state.is_weighted = P_RIGHTMOST(opaque);
+		/* See is_leaf default mode remarks on maxsplits */
+		state.maxsplits = MAX_INTERNAL_SPLIT_POINTS;
+	}
+	else if (state.mode == SPLIT_DEFAULT)
+	{
+		if (P_RIGHTMOST(opaque))
+		{
+			/*
+			 * Rightmost page splits are always weighted.  Extreme contention
+			 * on the rightmost page is relatively common, so we treat it as a
+			 * special case.
+			 */
+			state.propfullonleft = leaffillfactor / 100.0;
+			state.is_weighted = true;
+		}
+		else
+		{
+			/* propfullonleft won't be used, but be tidy */
+			state.propfullonleft = 0.50;
+			state.is_weighted = false;
+		}
+
+		/*
+		 * Set an initial limit on the split interval/number of candidate
+		 * split points as appropriate.  The "Prefix B-Trees" paper refers to
+		 * this as sigma l for leaf splits and sigma b for internal ("branch")
+		 * splits.  It's hard to provide a theoretical justification for the
+		 * size of the split interval, though it's clear that a small split
+		 * interval improves space utilization.
+		 */
+		state.maxsplits = Min(Max(3, maxoff * 0.05), MAX_LEAF_SPLIT_POINTS);
+	}
+	else if (state.mode == SPLIT_MANY_DUPLICATES)
+	{
+		state.propfullonleft = leaffillfactor / 100.0;
+		state.is_weighted = P_RIGHTMOST(opaque);
+		state.maxsplits = maxoff + 2;
+		state.splits = palloc(sizeof(SplitPoint) * state.maxsplits);
+	}
+	else
+	{
+		Assert(state.mode == SPLIT_SINGLE_VALUE);
+
+		state.propfullonleft = BTREE_SINGLEVAL_FILLFACTOR / 100.0;
+		state.is_weighted = true;
+		state.maxsplits = 1;
+	}
+
+	/*
+	 * Finding the best possible split would require checking all the possible
+	 * split points, because of the high-key and left-key special cases.
+	 * That's probably more work than it's worth outside of many duplicates
+	 * mode; instead, stop as soon as we find sufficiently-many "good-enough"
+	 * splits, where good-enough is defined as an imbalance in free space of
+	 * no more than pagesize/16 (arbitrary...) This should let us stop near
+	 * the middle on most pages, instead of plowing to the end.  Many
+	 * duplicates mode must consider all possible choices, and so does not use
+	 * this threshold for anything.
+	 */
+	state.goodenough = leftspace / 16;
+
+	/*
+	 * Scan through the data items and calculate space usage for a split at
+	 * each possible position.
+	 */
+	olddataitemstoleft = 0;
+	goodenoughfound = false;
+
+	for (offnum = P_FIRSTDATAKEY(opaque);
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		Size		itemsz;
+		int			delta;
+
+		itemid = PageGetItemId(page, offnum);
+		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
+
+		/*
+		 * Will the new item go to left or right of split?
+		 */
+		if (offnum > newitemoff)
+			delta = _bt_checksplitloc(&state, offnum, true,
+									  olddataitemstoleft, itemsz);
+
+		else if (offnum < newitemoff)
+			delta = _bt_checksplitloc(&state, offnum, false,
+									  olddataitemstoleft, itemsz);
+		else
+		{
+			/* need to try it both ways! */
+			(void) _bt_checksplitloc(&state, offnum, true,
+									 olddataitemstoleft, itemsz);
+
+			delta = _bt_checksplitloc(&state, offnum, false,
+									  olddataitemstoleft, itemsz);
+		}
+
+		/* Record when good-enough choice found */
+		if (state.nsplits > 0 && state.splits[0].delta <= state.goodenough)
+			goodenoughfound = true;
+
+		/*
+		 * Abort scan once we've found a good-enough choice, and reach the
+		 * point where we stop finding new good-enough choices.  Don't do this
+		 * in many duplicates mode, though, since that must be almost
+		 * completely exhaustive.
+		 */
+		if (goodenoughfound && state.mode != SPLIT_MANY_DUPLICATES &&
+			delta > state.goodenough)
+			break;
+
+		olddataitemstoleft += itemsz;
+	}
+
+	/*
+	 * If the new item goes as the last item, check for splitting so that all
+	 * the old items go to the left page and the new item goes to the right
+	 * page.
+	 */
+	if (newitemoff > maxoff && !goodenoughfound)
+		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
+
+	/*
+	 * I believe it is not possible to fail to find a feasible split, but just
+	 * in case ...
+	 */
+	if (state.nsplits == 0)
+		elog(ERROR, "could not find a feasible split point for index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	/*
+	 * Search among acceptable split points for the entry with the lowest
+	 * penalty.  See _bt_split_penalty() for the definition of penalty.  The
+	 * goal here is to choose a split point whose new high key is amenable to
+	 * being made smaller by suffix truncation, or is already small.
+	 *
+	 * First find lowest possible penalty among acceptable split points -- the
+	 * "perfect" penalty.  The perfect penalty often saves _bt_bestsplitloc()
+	 * additional work around calculating penalties.  This is also a
+	 * convenient point to determine if a second pass over page is required.
+	 */
+	perfectpenalty = _bt_perfect_penalty(rel, page, &state, newitemoff,
+										 newitem, &secondmode);
+
+	/* Perform second pass over page when _bt_perfect_penalty() tells us to */
+	if (secondmode != SPLIT_DEFAULT)
+		return _bt_dofindsplitloc(rel, page, secondmode, newitemoff,
+								  newitemsz, newitem, newitemonleft);
+
+	/*
+	 * Search among acceptable split points for the entry that has the lowest
+	 * penalty, and thus maximizes fan-out.  Sets *newitemonleft for us.
+	 */
+	finalfirstright = _bt_bestsplitloc(rel, page, &state, perfectpenalty,
+									   newitemoff, newitem, newitemonleft);
+	/* Be tidy */
+	if (state.splits != splits)
+		pfree(state.splits);
+
+	return finalfirstright;
+}
+
+/*
+ * Subroutine to analyze a particular possible split choice (ie, firstright
+ * and newitemonleft settings), and record the best split so far in *state.
+ *
+ * firstoldonright is the offset of the first item on the original page
+ * that goes to the right page, and firstoldonrightsz is the size of that
+ * tuple. firstoldonright can be > max offset, which means that all the old
+ * items go to the left page and only the new item goes to the right page.
+ * In that case, firstoldonrightsz is not used.
+ *
+ * olddataitemstoleft is the total size of all old items to the left of
+ * firstoldonright.
+ *
+ * Returns delta between space that will be left free on left and right side
+ * of split.
+ */
+static int
+_bt_checksplitloc(FindSplitData *state,
+				  OffsetNumber firstoldonright,
+				  bool newitemonleft,
+				  int olddataitemstoleft,
+				  Size firstoldonrightsz)
+{
+	int			leftfree,
+				rightfree;
+	Size		firstrightitemsz;
+	bool		newitemisfirstonright;
+
+	/* Is the new item going to be the first item on the right page? */
+	newitemisfirstonright = (firstoldonright == state->newitemoff
+							 && !newitemonleft);
+
+	if (newitemisfirstonright)
+		firstrightitemsz = state->newitemsz;
+	else
+		firstrightitemsz = firstoldonrightsz;
+
+	/* Account for all the old tuples */
+	leftfree = state->leftspace - olddataitemstoleft;
+	rightfree = state->rightspace -
+		(state->olddataitemstotal - olddataitemstoleft);
+
+	/*
+	 * The first item on the right page becomes the high key of the left page;
+	 * therefore it counts against left space as well as right space (we
+	 * cannot assume that suffix truncation will make it any smaller).  When
+	 * index has included attributes, then those attributes of left page high
+	 * key will be truncated leaving that page with slightly more free space.
+	 * However, that shouldn't affect our ability to find valid split
+	 * location, since we err in the direction of being pessimistic about free
+	 * space on the left half.  Besides, even when suffix truncation of
+	 * non-TID attributes occurs, there often won't be an entire MAXALIGN()
+	 * quantum in pivot space savings.
+	 *
+	 * If we are on the leaf level, assume that suffix truncation cannot avoid
+	 * adding a heap TID to the left half's new high key when splitting at the
+	 * leaf level.  In practice the new high key will often be smaller and
+	 * will rarely be larger, but conservatively assume the worst case.
+	 */
+	if (state->is_leaf)
+		leftfree -= (int) (firstrightitemsz +
+						   MAXALIGN(sizeof(ItemPointerData)));
+	else
+		leftfree -= (int) firstrightitemsz;
+
+	/* account for the new item */
+	if (newitemonleft)
+		leftfree -= (int) state->newitemsz;
+	else
+		rightfree -= (int) state->newitemsz;
+
+	/*
+	 * If we are not on the leaf level, we will be able to discard the key
+	 * data from the first item that winds up on the right page.
+	 */
+	if (!state->is_leaf)
+		rightfree += (int) firstrightitemsz -
+			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
+
+	/*
+	 * If this is a feasible split point, remember it whenever we still have
+	 * room for more candidate split points, or when its delta is lower than
+	 * that of the most marginal split point recorded so far.
+	 */
+	if (leftfree >= 0 && rightfree >= 0)
+	{
+		int			delta;
+
+		if (state->is_weighted)
+			delta = state->propfullonleft * leftfree -
+				(1.0 - state->propfullonleft) * rightfree;
+		else
+			delta = leftfree - rightfree;
+
+		if (delta < 0)
+			delta = -delta;
+
+		/*
+		 * Optimization: Don't recognize differences among marginal split
+		 * points that are unlikely to end up being used anyway.
+		 *
+		 * We cannot do this in many duplicates mode, because that hurts cases
+		 * where there are a small number of available distinguishing split
+		 * points, and consistently picking the least worst choice among them
+		 * matters. (e.g., a non-unique index whose leaf pages each contain a
+		 * small number of distinct values, with each value duplicated a
+		 * uniform number of times.)
+		 */
+		if (delta > state->goodenough && state->mode != SPLIT_MANY_DUPLICATES)
+			delta = state->goodenough + 1;
+		if (state->nsplits < state->maxsplits ||
+			delta < state->splits[state->nsplits - 1].delta)
+		{
+			SplitPoint	newsplit;
+			int			j;
+
+			newsplit.delta = delta;
+			newsplit.newitemonleft = newitemonleft;
+			newsplit.firstright = firstoldonright;
+
+			/*
+			 * Make space at the end of the state array for new candidate
+			 * split point if we haven't already reached the maximum number of
+			 * split points.
+			 */
+			if (state->nsplits < state->maxsplits)
+				state->nsplits++;
+
+			/*
+			 * Insert the new split point into the array, keeping the array
+			 * sorted by delta.  The entry displaced from the final slot is
+			 * either a still-uninitialized entry, or the most marginal real
+			 * entry when we already have as many split points as we're
+			 * willing to consider.
+			 */
+			for (j = state->nsplits - 1;
+				 j > 0 && state->splits[j - 1].delta > newsplit.delta;
+				 j--)
+			{
+				state->splits[j] = state->splits[j - 1];
+			}
+			state->splits[j] = newsplit;
+		}
+
+		return delta;
+	}
+
+	return INT_MAX;
+}
+
+/*
+ * Subroutine to find the "best" split point among an array of acceptable
+ * candidate split points -- those that split the page without leaving an
+ * excessively large imbalance in the space left free on the left and right
+ * halves.  The "best"
+ * split point is the split point with the lowest penalty, which is an
+ * abstract idea whose definition varies depending on whether we're splitting
+ * at the leaf level, or an internal level.  See _bt_split_penalty() for the
+ * definition.
+ *
+ * "perfectpenalty" is assumed to be the lowest possible penalty among
+ * candidate split points.  This allows us to return early without wasting
+ * cycles on calculating the first differing attribute for all candidate
+ * splits when that clearly cannot improve our choice.  This optimization is
+ * important for several common cases, including insertion into a primary key
+ * index on an auto-incremented or monotonically increasing integer column.
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating if new item is on left of split
+ * point.
+ */
+static OffsetNumber
+_bt_bestsplitloc(Relation rel,
+				 Page page,
+				 FindSplitData *state,
+				 int perfectpenalty,
+				 OffsetNumber newitemoff,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	int			bestpenalty,
+				lowsplit;
+
+	/*
+	 * No point in calculating penalty when there's only one choice.  Note
+	 * that single value mode always has one choice.
+	 */
+	if (state->nsplits == 1)
+	{
+		*newitemonleft = state->splits[0].newitemonleft;
+		return state->splits[0].firstright;
+	}
+
+	Assert(state->mode == SPLIT_DEFAULT ||
+		   state->mode == SPLIT_MANY_DUPLICATES);
+	bestpenalty = INT_MAX;
+	lowsplit = 0;
+	for (int i = lowsplit; i < state->nsplits; i++)
+	{
+		int			penalty;
+
+		penalty = _bt_split_penalty(rel, page, newitemoff, newitem,
+									state->splits + i, state->is_leaf);
+
+		if (penalty <= perfectpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+			break;
+		}
+
+		if (penalty < bestpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+		}
+	}
+
+	*newitemonleft = state->splits[lowsplit].newitemonleft;
+	return state->splits[lowsplit].firstright;
+}
+
+/*
+ * Subroutine to find the lowest possible penalty for any acceptable candidate
+ * split point.  This may be lower than any real penalty for any of the
+ * candidate split points, in which case the optimization is ineffective.
+ * Split penalties are discrete rather than continuous, so an
+ * actually-obtainable penalty is common.
+ *
+ * This is also a convenient point to decide whether to finish splitting the
+ * page using the default strategy, or to do a second pass over the page
+ * using a different strategy.  (This only happens with leaf pages.)
+ */
+static int
+_bt_perfect_penalty(Relation rel, Page page, FindSplitData *state,
+					OffsetNumber newitemoff, IndexTuple newitem,
+					SplitMode *secondmode)
+{
+	ItemId		itemid;
+	OffsetNumber center;
+	IndexTuple	leftmost,
+				rightmost;
+	int			perfectpenalty;
+	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+	/* Assume that a second pass over page won't be required for now */
+	*secondmode = SPLIT_DEFAULT;
+
+	/*
+	 * There are far fewer candidate split points when splitting an internal
+	 * page, so we can afford to be exhaustive.  Only give up early when the
+	 * pivot that will be inserted into the parent is already as small as
+	 * possible.
+	 */
+	if (!state->is_leaf)
+		return MAXALIGN(sizeof(IndexTupleData) + 1);
+
+	/*
+	 * During a many duplicates pass over the page, we settle for a "perfect"
+	 * split point that merely avoids appending a heap TID in new pivot.
+	 * Appending a heap TID is harmful enough to fan-out that it's worth
+	 * avoiding at all costs, but it doesn't make sense to go to those lengths
+	 * to also be able to truncate an extra, earlier attribute.
+	 *
+	 * Single value mode splits only occur when appending a heap TID was
+	 * already deemed necessary.  Don't waste any more cycles trying to avoid
+	 * that outcome.
+	 */
+	if (state->mode == SPLIT_MANY_DUPLICATES)
+		return indnkeyatts;
+	else if (state->mode == SPLIT_SINGLE_VALUE)
+		return indnkeyatts + 1;
+
+	/*
+	 * Complicated though common case -- leaf page default mode split.
+	 *
+	 * Iterate from the end of split array to the start, in search of the
+	 * firstright-wise leftmost and rightmost entries among acceptable split
+	 * points.  The split point with the lowest delta is at the start of the
+	 * array.  It is deemed to be the split point whose firstright offset is
+	 * at the center.  Split points with firstright offsets at both the left
+	 * and right extremes among acceptable split points will be found at the
+	 * end of caller's array.
+	 */
+	leftmost = NULL;
+	rightmost = NULL;
+	center = state->splits[0].firstright;
+
+	/*
+	 * Leaf split points can be thought of as points _between_ tuples on the
+	 * original unsplit page image, at least if you pretend that the incoming
+	 * tuple is already on the page to be split (imagine that the original
+	 * unsplit page actually had enough space to fit the incoming tuple).  The
+	 * rightmost tuple is the tuple that is immediately to the right of a
+	 * split point that is itself rightmost.  Likewise, the leftmost tuple is
+	 * the tuple to the left of the leftmost split point.
+	 *
+	 * When there are very few candidates, no sensible comparison can be made
+	 * here, resulting in caller selecting lowest delta/the center split point
+	 * by default.  Typically, leftmost and rightmost tuples will be located
+	 * almost immediately.
+	 */
+	perfectpenalty = indnkeyatts;
+	for (int j = state->nsplits - 1; j > 1; j--)
+	{
+		SplitPoint *split = state->splits + j;
+
+		if (!leftmost && split->firstright <= center)
+		{
+			if (split->newitemonleft && newitemoff == split->firstright)
+				leftmost = newitem;
+			else
+			{
+				itemid = PageGetItemId(page,
+									   OffsetNumberPrev(split->firstright));
+				leftmost = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+
+		if (!rightmost && split->firstright >= center)
+		{
+			if (!split->newitemonleft && newitemoff == split->firstright)
+				rightmost = newitem;
+			else
+			{
+				itemid = PageGetItemId(page, split->firstright);
+				rightmost = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+
+		if (leftmost && rightmost)
+		{
+			Assert(leftmost != rightmost);
+			perfectpenalty = _bt_keep_natts_fast(rel, leftmost, rightmost);
+			break;
+		}
+	}
+
+	/*
+	 * Work out which type of second pass, if any, caller should perform,
+	 * when even their "perfect" penalty fails to avoid appending a heap TID
+	 * to the new pivot tuple.
+	 */
+	if (perfectpenalty > indnkeyatts)
+	{
+		BTPageOpaque opaque;
+		OffsetNumber maxoff;
+		int			origpagepenalty;
+
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If page has many duplicates but is not entirely full of duplicates,
+		 * a many duplicates mode pass will be performed.  If page is entirely
+		 * full of duplicates and it appears that the duplicates have been
+		 * inserted in sequential order (i.e. heap TID order), a single value
+		 * mode pass will be performed.
+		 *
+		 * Deliberately ignore new item here, since a split that leaves only
+		 * one item on either page is often deemed unworkable by
+		 * _bt_checksplitloc().
+		 */
+		itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+		leftmost = (IndexTuple) PageGetItem(page, itemid);
+		itemid = PageGetItemId(page, maxoff);
+		rightmost = (IndexTuple) PageGetItem(page, itemid);
+		origpagepenalty = _bt_keep_natts_fast(rel, leftmost, rightmost);
+
+		if (origpagepenalty <= indnkeyatts)
+			*secondmode = SPLIT_MANY_DUPLICATES;
+		else if (newitemoff > maxoff)
+			*secondmode = SPLIT_SINGLE_VALUE;
+
+		/*
+		 * Have caller continue with original default mode split when new
+		 * duplicate item would not go at the end of the page.  Out-of-order
+		 * duplicate insertions predict further inserts towards the
+		 * left/middle of the original page's keyspace.  Evenly sharing space
+		 * among each half of the split avoids pathological performance.
+		 */
+	}
+
+	return perfectpenalty;
+}
+
+/*
+ * Subroutine to find penalty for caller's candidate split point.
+ *
+ * On leaf pages, penalty is the attribute number that distinguishes each side
+ * of a split.  It's the last attribute that needs to be included in new high
+ * key for left page.  It can be greater than the number of key attributes in
+ * cases where a heap TID will need to be appended during truncation.
+ *
+ * On internal pages, penalty is simply the size of the first item on the
+ * right half of the split (excluding ItemId overhead) which becomes the new
+ * high key for the left page.
+ */
+static int
+_bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split, bool is_leaf)
+{
+	ItemId		itemid;
+	IndexTuple	firstright;
+	IndexTuple	lastleft;
+
+	if (!split->newitemonleft && newitemoff == split->firstright)
+		firstright = newitem;
+	else
+	{
+		itemid = PageGetItemId(page, split->firstright);
+		firstright = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	if (!is_leaf)
+		return IndexTupleSize(firstright);
+
+	if (split->newitemonleft && newitemoff == split->firstright)
+		lastleft = newitem;
+	else
+	{
+		OffsetNumber lastleftoff;
+
+		lastleftoff = OffsetNumberPrev(split->firstright);
+		itemid = PageGetItemId(page, lastleftoff);
+		lastleft = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	Assert(lastleft != firstright);
+	return _bt_keep_natts_fast(rel, lastleft, firstright);
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5dbe833850..5b2d152d66 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -22,6 +22,7 @@
 #include "access/relscan.h"
 #include "miscadmin.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -2327,6 +2328,54 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	return keepnatts;
 }
 
+/*
+ * _bt_keep_natts_fast - fast, approximate variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location.  A naive bitwise approach to datum comparisons is used to
+ * save cycles.  This is inherently approximate, but usually provides the same
+ * answer as the authoritative approach that _bt_keep_natts takes, since the
+ * vast majority of types in Postgres cannot be equal according to any
+ * available opclass unless they're bitwise equal.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in the new pivot tuple.
+ */
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= keysz; attnum++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		att = TupleDescAttr(itupdesc, attnum - 1);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
+}
+
 /*
  *  _bt_check_natts() -- Verify tuple has expected number of attributes.
  *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a9a5f9bdfc..4428109cae 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -168,11 +168,15 @@ typedef struct BTMetaPageData
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
  * a rightmost page; when splitting non-rightmost pages we try to
- * divide the data equally.
+ * divide the data equally.  When splitting a page that's entirely
+ * filled with a single value (duplicates), the leaf-page
+ * fillfactor is overridden, and is applied regardless of whether
+ * the page is a rightmost page.
  */
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#define BTREE_SINGLEVAL_FILLFACTOR	99
 
 /*
  *	In general, the btree code tries to localize its knowledge about
@@ -418,7 +422,19 @@ typedef BTStackData *BTStack;
  * optimization cannot be used, since a leaf page is relocated using its high
  * key).  This optimization allows us to get the full benefit of suffix
  * truncation, particularly with indexes where each distinct set of user
- * attribute keys appear in at least a few duplicate entries.
+ * attribute keys appears in at least a few duplicate entries.  The split point
+ * location logic goes to great lengths to make groups of duplicates all
+ * appear together on a single leaf page after the split; subsequent searches
+ * should avoid unnecessarily reading and processing the sibling that's to the
+ * left of the leaf page that matching entries can first appear on.  Some
+ * later insertion scankey attribute could break the would-be tie with a
+ * truncated/minus infinity attribute, but when that doesn't happen this
+ * optimization breaks the would-be tie instead.  This optimization is even
+ * effective with unique index insertion, where a scantid value is not used
+ * until we reach the leaf level.  It might be necessary to visit multiple
+ * leaf pages during unique checking, but only in the rare case where more
+ * than a single leaf page can store duplicates (concurrent page splits are
+ * another possible reason).
  *
  * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
  * locate the first item >= scankey.  When nextkey is true, they will locate
@@ -679,6 +695,13 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 
+/*
+ * prototypes for functions in nbtsplitloc.c
+ */
+extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
+				 OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+				 bool *newitemonleft);
+
 /*
  * prototypes for functions in nbtpage.c
  */
@@ -745,6 +768,8 @@ extern bool btproperty(Oid index_oid, int attno,
 		   bool *res, bool *isnull);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 			 IndexTuple firstright, BTScanInsert itup_scankey);
+extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
+					IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 				OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
-- 
2.17.1
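
To make the split point selection logic above a little easier to review at a
glance, here is a minimal standalone sketch of the idea behind
_bt_split_penalty()/_bt_keep_natts_fast().  The toy page layout, the names,
and the use of plain integer keys are all illustrative only, and the sketch
deliberately ignores the free space balancing that _bt_checksplitloc()
enforces -- it only shows the "penalty" half of the decision:

"""
#include <stdio.h>

#define NKEYATTS 2

/*
 * Count how many leading attributes a pivot tuple would need in order to
 * distinguish the last tuple on the left half from the first tuple on the
 * right half -- the same idea as _bt_keep_natts_fast(), but with plain
 * integer keys instead of index tuples.
 */
static int
keep_natts(const int *lastleft, const int *firstright)
{
	int			keepnatts = 1;

	for (int attnum = 0; attnum < NKEYATTS; attnum++)
	{
		if (lastleft[attnum] != firstright[attnum])
			break;
		keepnatts++;
	}

	return keepnatts;
}

int
main(void)
{
	/* Toy leaf page of two-attribute keys, already in sorted order */
	int			page[][NKEYATTS] = {
		{1, 10}, {1, 11}, {1, 12}, {2, 5}, {2, 6}, {3, 1}
	};
	int			nitems = sizeof(page) / sizeof(page[0]);
	int			bestsplit = -1;
	int			bestpenalty = NKEYATTS + 1;

	/*
	 * Try every split point that leaves at least one item on each side (the
	 * real code only considers a small window of candidates whose free space
	 * delta is acceptable), and pick the one whose new high key needs the
	 * fewest attributes.
	 */
	for (int firstright = 1; firstright < nitems; firstright++)
	{
		int			penalty = keep_natts(page[firstright - 1],
										 page[firstright]);

		if (penalty < bestpenalty)
		{
			bestpenalty = penalty;
			bestsplit = firstright;
		}
	}

	printf("split before item %d, penalty %d\n", bestsplit, bestpenalty);
	return 0;
}
"""

The sketch prints "split before item 3, penalty 1": splitting between (1,12)
and (2,5) lets the new high key get away with a single attribute, which is
the property that the real code tries to buy whenever it can do so without
unbalancing free space too much.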

Attachment: v11-0007-DEBUG-Add-pageinspect-instrumentation.patch (application/x-patch)
From 1cf4aec8bcb9952c1af465977f01de495caada54 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v11 7/7] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 68 +++++++++++++++----
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 ++++++
 3 files changed, 79 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..95c81c0808 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -254,9 +256,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[7];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -265,6 +267,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer htid;
+	BTPageOpaque opaque;
 
 	id = PageGetItemId(page, offset);
 
@@ -283,16 +287,53 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = natts;
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		indnkeyatts = rel->rd_index->indnkeyatts;
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (P_ISLEAF(opaque) && offset >= P_FIRSTDATAKEY(opaque))
+		htid = &itup->t_tid;
+	else if (_bt_heapkeyspace(rel))
+		htid = BTreeTupleGetHeapTID(itup);
+	else
+		htid = NULL;
+
+	if (htid)
+		values[j] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(htid),
+							 ItemPointerGetOffsetNumberNoCheck(htid));
+	else
+		values[j] = NULL;
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -366,11 +407,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -397,12 +438,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -482,7 +524,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

Attachment: v11-0006-Add-high-key-continuescan-optimization.patch (application/x-patch)
From 3639cdfb4b5a0f6e453197326aa576806742dfab Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 12 Nov 2018 13:11:21 -0800
Subject: [PATCH v11 6/7] Add high key "continuescan" optimization.

Teach B-Tree forward index scans to check the high key before moving to
the next page in the hopes of finding that it isn't actually necessary
to move to the next page.  We already opportunistically force a key
check of the last item on leaf pages, even when it's clear that it
cannot be returned to the scan due to being dead-to-all, for the same
reason.  Since forcing the last item to be key checked no longer makes
any difference in the case of forward scans, the existing extra key
check is now only used for backwards scans.  Like the existing check,
the new check won't always work out, but that seems like an acceptable
price to pay.

The new approach is more effective than just checking non-pivot tuples,
especially with composite indexes and non-unique indexes.  The high key
represents an upper bound on all values that can appear on the page,
which is often greater than whatever tuple happens to appear last at the
time of the check.  Also, suffix truncation's new logic for picking a
split point will often result in high keys that are relatively
dissimilar to the other (non-pivot) tuples on the page, and therefore
more likely to indicate that the scan need not proceed to the next page.

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.
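
For illustration only, the shape of the new check is roughly as follows.
The toy types and names below are mine, not code from this patch; the real
implementation simply reuses _bt_checkkeys() against the tuple at P_HIKEY:

"""
#include <stdbool.h>
#include <stdio.h>

/*
 * Toy leaf page: sorted integer keys plus a high key that is an upper bound
 * on every key that can ever appear on the page.
 */
typedef struct
{
	int			keys[8];
	int			nkeys;
	int			highkey;
	bool		rightmost;		/* rightmost pages have no high key */
} ToyLeafPage;

/*
 * Forward scan over one page with an upper-bound qual (key <= scanmax).
 * Returns whether the scan must continue on to the right sibling.  The new
 * twist: once every item has been processed with continuescan still true,
 * also check the high key; if even the high key fails the qual, no tuple on
 * any page to the right can possibly match, so don't bother visiting it.
 */
static bool
read_page(const ToyLeafPage *page, int scanmax)
{
	for (int i = 0; i < page->nkeys; i++)
	{
		if (page->keys[i] > scanmax)
			return false;		/* qual failed mid-page; certainly done */
		printf("match: %d\n", page->keys[i]);
	}

	if (!page->rightmost && page->highkey > scanmax)
		return false;			/* high key check saves a visit rightwards */

	return true;
}

int
main(void)
{
	ToyLeafPage page = {{10, 12, 14}, 3, 20, false};

	/* every item matches, but the high key proves the right sibling won't */
	bool		moreright = read_page(&page, 15);

	printf("visit right sibling? %s\n", moreright ? "yes" : "no");
	return 0;
}
"""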
---
 src/backend/access/nbtree/nbtsearch.c | 24 +++++++--
 src/backend/access/nbtree/nbtutils.c  | 70 +++++++++++++++++++--------
 2 files changed, 68 insertions(+), 26 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index f134321717..680da7e403 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1348,7 +1348,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	OffsetNumber maxoff;
 	int			itemIndex;
 	IndexTuple	itup;
-	bool		continuescan;
+	bool		continuescan = true;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1416,16 +1416,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 				_bt_saveitem(so, itemIndex, offnum, itup);
 				itemIndex++;
 			}
+			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
-			{
-				/* there can't be any more matches, so stop */
-				so->currPos.moreRight = false;
 				break;
-			}
 
 			offnum = OffsetNumberNext(offnum);
 		}
 
+		/*
+		 * Forward scans need not visit page to the right when high key
+		 * indicates no more matches will be found there.
+		 *
+		 * Checking the high key like this works out more often than you might
+		 * think.  Leaf page splits pick a split point between the two most
+		 * dissimilar tuples (this is weighed against the need to evenly share
+		 * free space).  Leaf pages with high key attribute values that can
+		 * only appear on non-pivot tuples on the right sibling page are
+		 * common.
+		 */
+		if (continuescan && !P_RIGHTMOST(opaque))
+			_bt_checkkeys(scan, page, P_HIKEY, dir, &continuescan);
+
+		if (!continuescan)
+			so->currPos.moreRight = false;
+
 		Assert(itemIndex <= MaxIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5b2d152d66..6948aa983f 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -48,7 +48,7 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
 static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
-					 IndexTuple tuple, TupleDesc tupdesc,
+					 IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
 static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
 			   IndexTuple firstright, BTScanInsert itup_scankey);
@@ -1364,11 +1364,14 @@ _bt_mark_scankey_required(ScanKey skey)
  *
  * scan: index scan descriptor (containing a search-type scankey)
  * page: buffer page containing index tuple
- * offnum: offset number of index tuple (must be a valid item!)
+ * offnum: offset number of index tuple (must be hikey or a valid item!)
  * dir: direction we are scanning in
  * continuescan: output parameter (will be set correctly in all cases)
  *
- * Caller must hold pin and lock on the index page.
+ * Caller must hold pin and lock on the index page.  Caller can pass a high
+ * key offnum in the hopes of discovering that the scan need not continue on
+ * to a page to the right.  We don't currently bother limiting high key
+ * comparisons to SK_BT_REQFWD scan keys.
  */
 IndexTuple
 _bt_checkkeys(IndexScanDesc scan,
@@ -1378,6 +1381,7 @@ _bt_checkkeys(IndexScanDesc scan,
 	ItemId		iid = PageGetItemId(page, offnum);
 	bool		tuple_alive;
 	IndexTuple	tuple;
+	int			tupnatts;
 	TupleDesc	tupdesc;
 	BTScanOpaque so;
 	int			keysz;
@@ -1391,24 +1395,21 @@ _bt_checkkeys(IndexScanDesc scan,
 	 * killed tuple as not passing the qual.  Most of the time, it's a win to
 	 * not bother examining the tuple's index keys, but just return
 	 * immediately with continuescan = true to proceed to the next tuple.
-	 * However, if this is the last tuple on the page, we should check the
-	 * index keys to prevent uselessly advancing to the next page.
+	 * However, if this is the first tuple on the page, and we're doing a
+	 * backward scan, we should check the index keys to prevent uselessly
+	 * advancing to the page to the left.  This is similar to the high key
+	 * optimization used by forward scan callers.
 	 */
 	if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
 	{
-		/* return immediately if there are more tuples on the page */
-		if (ScanDirectionIsForward(dir))
-		{
-			if (offnum < PageGetMaxOffsetNumber(page))
-				return NULL;
-		}
-		else
-		{
-			BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-			if (offnum > P_FIRSTDATAKEY(opaque))
-				return NULL;
-		}
+		/* forward scan callers check high key instead */
+		Assert(offnum >= P_FIRSTDATAKEY(opaque));
+		if (ScanDirectionIsForward(dir))
+			return NULL;
+		else if (offnum > P_FIRSTDATAKEY(opaque))
+			return NULL;
 
 		/*
 		 * OK, we want to check the keys so we can set continuescan correctly,
@@ -1420,6 +1421,7 @@ _bt_checkkeys(IndexScanDesc scan,
 		tuple_alive = true;
 
 	tuple = (IndexTuple) PageGetItem(page, iid);
+	tupnatts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
 
 	tupdesc = RelationGetDescr(scan->indexRelation);
 	so = (BTScanOpaque) scan->opaque;
@@ -1431,11 +1433,24 @@ _bt_checkkeys(IndexScanDesc scan,
 		bool		isNull;
 		Datum		test;
 
-		Assert(key->sk_attno <= BTreeTupleGetNAtts(tuple, scan->indexRelation));
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (key->sk_attno > tupnatts)
+		{
+			Assert(offnum == P_HIKEY);
+			Assert(ScanDirectionIsForward(dir));
+			continue;
+		}
+
 		/* row-comparison keys need special processing */
 		if (key->sk_flags & SK_ROW_HEADER)
 		{
-			if (_bt_check_rowcompare(key, tuple, tupdesc, dir, continuescan))
+			if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+									 continuescan))
 				continue;
 			return NULL;
 		}
@@ -1566,8 +1581,8 @@ _bt_checkkeys(IndexScanDesc scan,
  * This is a subroutine for _bt_checkkeys, which see for more info.
  */
 static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
-					 ScanDirection dir, bool *continuescan)
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
 {
 	ScanKey		subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
 	int32		cmpresult = 0;
@@ -1584,6 +1599,19 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
 
 		Assert(subkey->sk_flags & SK_ROW_MEMBER);
 
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (subkey->sk_attno > tupnatts)
+		{
+			Assert(ScanDirectionIsForward(dir));
+			cmpresult = 0;
+			continue;
+		}
+
 		datum = index_getattr(tuple,
 							  subkey->sk_attno,
 							  tupdesc,
-- 
2.17.1

Attachment: v11-0005-Add-split-at-new-tuple-page-split-optimization.patch (application/x-patch)
From 7a0dd644a1b26cc3c80469aea0e2c4edc3d86f8a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 16:48:08 -0700
Subject: [PATCH v11 5/7] Add split-at-new-tuple page split optimization.

Add additional heuristics to the algorithm for locating an optimal split
location.  New logic identifies localized monotonically increasing
values by recognizing cases where a newly inserted tuple has a heap TID
that's slightly greater than that of the existing tuple to the immediate
left, but isn't just a duplicate.  It can greatly help space utilization
to split between two groups of localized monotonically increasing
values.

Without this patch, affected cases will reliably leave leaf pages no
more than about 50% full.  50/50 page splits are only appropriate with a
pattern of truly random insertions.  The optimization is very similar to
the long established fillfactor optimization used during rightmost page
splits, where we usually leave the new left side of the split 90% full.
Split-at-new-tuple page splits target essentially the same case. The
splits targeted are those at the rightmost point of a localized grouping
of values, rather than those at the rightmost point of the entire key
space.

This enhancement is very effective at avoiding index bloat both during the
initial bulk INSERTs for the TPC-C benchmark, and over the course of the
benchmark run itself.  The TPC-C issue has been independently observed and reported
on [1].  Evidently, the primary keys for all of the largest indexes in
the TPC-C schema are populated through localized, monotonically
increasing values:

Master
======

order_line_pkey: 774 MB
stock_pkey: 181 MB
idx_customer_name: 107 MB
oorder_pkey: 78 MB
customer_pkey: 75 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 60 MB
new_order_pkey: 22 MB
item_pkey: 2216 kB
district_pkey: 40 kB
warehouse_pkey: 24 kB

Patch series, up to and including this commit
=============================================

order_line_pkey: 451 MB
stock_pkey: 114 MB
idx_customer_name: 105 MB
oorder_pkey: 45 MB
customer_pkey: 48 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 61 MB
new_order_pkey: 13 MB
item_pkey: 2216 kB
district_pkey: 40 kB
warehouse_pkey: 24 kB

Without this patch, but with all previous patches in the series, a much
more modest reduction in the volume of bloat occurs when the same test
case is run.  There is a reduction in the size of the largest index (the
order line primary key) of ~5% of its original size, whereas we see a
reduction of ~42% here.

The problem can easily be recreated by bulk loading using benchmarkSQL
(a fair use TPC-C implementation) while avoiding building indexes with
CREATE INDEX [2].  Note that the patch series generally has less of an
advantage over master if the indexes are initially built with CREATE
INDEX (use my fork of BenchmarkSQL [3] to run a TPC-C benchmark while
avoiding having CREATE INDEX mask the problems on the master branch).

[1] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c
[2] https://bitbucket.org/openscg/benchmarksql/issues/6/making-it-easier-to-recreate-postgres-tpc
[3] https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations
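
As a rough sketch of the heuristic (toy types and names of my own, and it
omits the uniform tuple size and heap TID adjacency checks that the real
_bt_splitatnewitem() also applies):

"""
#include <stdbool.h>
#include <stdio.h>

/*
 * Toy composite key matching the commit message's example.  The real code
 * compares an arbitrary number of attributes using _bt_keep_natts_fast();
 * two hard-coded attributes keep the sketch short.
 */
typedef struct
{
	int			supplier_id;
	int			invoice_id;
} ToyKey;

/*
 * Decide how full to leave the left page when an insertion lands after the
 * current last item on a non-rightmost leaf page.  If the new item shares
 * its leading attribute with the last existing item but isn't a complete
 * duplicate (the next invoice_id for an existing supplier_id), split at the
 * new item and leave the left page about leaffillfactor% full, as in a
 * rightmost page split.  Otherwise fall back to an even 50:50 split.
 */
static double
proportion_on_left(ToyKey lastitem, ToyKey newitem, int leaffillfactor)
{
	bool		samegroup = (lastitem.supplier_id == newitem.supplier_id);
	bool		duplicate = samegroup &&
		(lastitem.invoice_id == newitem.invoice_id);

	if (samegroup && !duplicate)
		return leaffillfactor / 100.0;	/* weighted, e.g. 0.90 */

	return 0.50;				/* default unweighted split */
}

int
main(void)
{
	ToyKey		lastitem = {7, 10341};
	ToyKey		ascending = {7, 10342};
	ToyKey		unrelated = {8, 1};

	printf("next invoice, same supplier: %.0f%% on left\n",
		   proportion_on_left(lastitem, ascending, 90) * 100);
	printf("first invoice, new supplier: %.0f%% on left\n",
		   proportion_on_left(lastitem, unrelated, 90) * 100);
	return 0;
}
"""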
---
 src/backend/access/nbtree/nbtsplitloc.c | 184 ++++++++++++++++++++++++
 1 file changed, 184 insertions(+)

diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 7f337bac55..3edf97bfeb 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -62,6 +62,9 @@ static OffsetNumber _bt_dofindsplitloc(Relation rel, Page page,
 static int _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright, bool newitemonleft,
 				  int dataitemstoleft, Size firstoldonrightsz);
+static bool _bt_splitatnewitem(Relation rel, Page page, int leaffillfactor,
+				   OffsetNumber newitemoff, IndexTuple newitem,
+				   double *propfullonleft);
 static OffsetNumber _bt_bestsplitloc(Relation rel, Page page,
 				 FindSplitData *state,
 				 int perfectpenalty,
@@ -72,6 +75,7 @@ static int _bt_perfect_penalty(Relation rel, Page page, FindSplitData *state,
 					SplitMode *secondmode);
 static int _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
 				  IndexTuple newitem, SplitPoint *split, bool is_leaf);
+static bool _bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid);
 
 
 /*
@@ -243,6 +247,12 @@ _bt_dofindsplitloc(Relation rel,
 			state.propfullonleft = leaffillfactor / 100.0;
 			state.is_weighted = true;
 		}
+		else if (_bt_splitatnewitem(rel, page, leaffillfactor, newitemoff,
+									newitem, &state.propfullonleft))
+		{
+			/* propfullonleft was set for us */
+			state.is_weighted = true;
+		}
 		else
 		{
 			/* propfullonleft won't be used, but be tidy */
@@ -540,6 +550,152 @@ _bt_checksplitloc(FindSplitData *state,
 	return INT_MAX;
 }
 
+/*
+ * Subroutine to determine whether or not the page should be split at the
+ * point that the new/incoming item would have been inserted, leaving the
+ * incoming tuple as the last tuple on the new left page.  When the new item
+ * is at the first or last offset, a fillfactor is applied so that space
+ * utilization is comparable to the traditional rightmost split case.
+ *
+ * This routine targets splits in composite indexes that consist of one or
+ * more leading columns that describe some grouping, plus a trailing column
+ * with ascending (or descending) values.  This pattern is prevalent in real
+ * world applications.  Consider the example of a composite index on
+ * (supplier_id, invoice_id), where there are a small, nearly-fixed number of
+ * suppliers, and invoice_id is an identifier assigned in ascending order (it
+ * doesn't matter whether or not suppliers are assigned invoice_id values from
+ * the same counter, or their own counter).  Without this optimization,
+ * approximately 50% of space in leaf pages will be wasted by unweighted/50:50
+ * page splits.  With this optimization, space utilization will be close to
+ * that of a similar index where all tuple insertions modify the current
+ * rightmost leaf page in the index.
+ *
+ * This optimization may leave extra free space remaining on the rightmost
+ * page of a "most significant column" grouping of tuples if that grouping
+ * never ends up having future insertions that use the free space.  Testing
+ * has shown the effect to be self-limiting; a future grouping that becomes
+ * the "nearest on the right" grouping of the affected grouping usually puts
+ * the extra free space to good use instead.
+ *
+ * Caller uses propfullonleft rather than using the new item offset directly
+ * because not all offsets will be deemed legal as split points.  This also
+ * allows us to apply leaf fillfactor in the common case where the new
+ * insertion is after the last offset (or at the first offset).
+ */
+static bool
+_bt_splitatnewitem(Relation rel, Page page, int leaffillfactor,
+				   OffsetNumber newitemoff, IndexTuple newitem,
+				   double *propfullonleft)
+{
+	OffsetNumber maxoff;
+	int16		nkeyatts;
+	ItemId		itemid;
+	IndexTuple	tup;
+	Size		tupspace;
+	Size		hikeysize;
+	int			keepnatts;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Proceed only when items on page look fairly short */
+	if (maxoff < MaxIndexTuplesPerPage / 2)
+		return false;
+
+	nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+	/* Single key indexes not considered here */
+	if (nkeyatts == 1)
+		return false;
+
+	/*
+	 * Avoid applying optimization when tuples are not all of uniform size,
+	 * with the exception of the high key (existing high key may be smaller
+	 * due to truncation).  Surmise that page has equisized tuples when page
+	 * layout is consistent with having maxoff-1 non-pivot tuples that are all
+	 * the same size as the newly inserted tuple.
+	 */
+	tupspace = ((PageHeader) page)->pd_special - ((PageHeader) page)->pd_upper;
+	Assert(!P_RIGHTMOST((BTPageOpaque) PageGetSpecialPointer(page)));
+	itemid = PageGetItemId(page, P_HIKEY);
+	hikeysize = ItemIdGetLength(itemid);
+	if (IndexTupleSize(newitem) * (maxoff - 1) != tupspace - hikeysize)
+		return false;
+
+	/*
+	 * At least the first attribute's value must be equal to the corresponding
+	 * value in antecedent tuple to apply optimization.  New item cannot be a
+	 * duplicate, either.
+	 */
+	if (newitemoff == P_FIRSTKEY)
+	{
+		/* Try to infer descending insertion pattern */
+		itemid = PageGetItemId(page, P_FIRSTKEY);
+		tup = (IndexTuple) PageGetItem(page, itemid);
+		keepnatts = _bt_keep_natts_fast(rel, tup, newitem);
+
+		if (keepnatts > 1 && keepnatts <= nkeyatts)
+		{
+			*propfullonleft = (double) Max(100 - leaffillfactor,
+										   BTREE_MIN_FILLFACTOR) / 100.0;
+			return true;
+		}
+
+		return false;
+	}
+	else if (newitemoff > maxoff)
+	{
+		/* Try to infer ascending insertion pattern */
+		itemid = PageGetItemId(page, maxoff);
+		tup = (IndexTuple) PageGetItem(page, itemid);
+		keepnatts = _bt_keep_natts_fast(rel, tup, newitem);
+
+		if (keepnatts > 1 && keepnatts <= nkeyatts)
+		{
+			*propfullonleft = (double) leaffillfactor / 100.0;
+			return true;
+		}
+
+		return false;
+	}
+
+	/*
+	 * When item isn't first or last on page, try to infer ascending insertion
+	 * pattern.  We try to split at the precise point of the insertion here,
+	 * rather than applying leaf fillfactor.
+	 *
+	 * "Low cardinality leading column, high cardinality suffix column"
+	 * indexes with a random insertion pattern (e.g. an index on '(country_id,
+	 * event_uuid)') may sometimes end up having the optimization applied
+	 * instead of getting a 50:50 (unweighted) page split.  This is
+	 * suboptimal.
+	 *
+	 * We're willing to accept that outcome when an incoming/new tuple is
+	 * either to the left or to the right of all existing items on the page,
+	 * since that's expected for less than 1% of all page splits that occur in
+	 * the index's lifetime (assuming default BLCKSZ).  More care must be
+	 * taken here, where we consider splits involving the new item being
+	 * inserted at neither edge of the page: we proceed only when new item's
+	 * heap TID is "adjacent" to the heap TID of the existing tuple to the
+	 * immediate left of the offset for the new item.  Heap TID adjacency
+	 * strongly suggests that the item just to the left was inserted very
+	 * recently.
+	 */
+	itemid = PageGetItemId(page, OffsetNumberPrev(newitemoff));
+	tup = (IndexTuple) PageGetItem(page, itemid);
+	if (!_bt_adjacenthtid(&tup->t_tid, &newitem->t_tid))
+		return false;
+	/* Also check the usual conditions */
+	keepnatts = _bt_keep_natts_fast(rel, tup, newitem);
+
+	if (keepnatts > 1 && keepnatts <= nkeyatts)
+	{
+		*propfullonleft = (double) newitemoff / (((double) maxoff + 1));
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Subroutine to find the "best" split point among an array of acceptable
  * candidate split points that split without there being an excessively high
@@ -820,3 +976,31 @@ _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
 	Assert(lastleft != firstright);
 	return _bt_keep_natts_fast(rel, lastleft, firstright);
 }
+
+/*
+ * Subroutine for determining if two heap TIDS are "adjacent".
+ *
+ * Adjacent means that the high TID is very likely to have been inserted into
+ * heap relation immediately after the low TID, probably by the same
+ * transaction.
+ */
+static bool
+_bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid)
+{
+	BlockNumber lowblk,
+				highblk;
+
+	lowblk = ItemPointerGetBlockNumber(lowhtid);
+	highblk = ItemPointerGetBlockNumber(highhtid);
+
+	/* Make optimistic assumption of adjacency when heap blocks match */
+	if (lowblk == highblk)
+		return true;
+
+	/* When heap block one up, second offset should be FirstOffsetNumber */
+	if (lowblk + 1 == highblk &&
+		ItemPointerGetOffsetNumber(highhtid) == FirstOffsetNumber)
+		return true;
+
+	return false;
+}
-- 
2.17.1

#53 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#51)
2 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 09/01/2019 02:47, Peter Geoghegan wrote:

On Fri, Dec 28, 2018 at 3:32 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm envisioning that you have an array, with one element for each item
on the page (including the tuple we're inserting, which isn't really on
the page yet). In the first pass, you count up from left to right,
filling the array. Next, you compute the complete penalties, starting
from the middle, walking outwards.

Ah, right. I think I see what you mean now.

Leave it with me. I'll need to think about this some more.

Attached is v10 of the patch series, which has many changes based on
your feedback. However, I didn't end up refactoring _bt_findsplitloc()
in the way you described, because it seemed hard to balance all of the
concerns there. I still have an open mind on this question, and
recognize the merit in what you suggested. Perhaps it's possible to
reach a compromise here.

I spent some time first trying to understand the current algorithm, and
then rewriting it in a way that I find easier to understand. I came up
with the attached. I think it optimizes for the same goals as your
patch, but the approach is quite different. At a very high level, I
believe the goals can be described as:

1. Find out how much suffix truncation is possible, i.e. how many key
columns can be truncated away, in the best case, among all possible ways
to split the page.

2. Among all the splits that achieve that optimum suffix truncation,
find the one with smallest "delta".

For performance reasons, it doesn't actually do it in that order. It's
more like this:

1. First, scan all split positions, recording the 'leftfree' and
'rightfree' at every valid split position. The array of possible splits
is kept in order by offset number. (This scans through all items, but
the math is simple, so it's pretty fast)

2. Compute the optimum suffix truncation, by comparing the leftmost and
rightmost keys, among all the possible split positions.

3. Split the array of possible splits in half, and process both halves
recursively. The recursive process "zooms in" to the place where we'd
expect to find the best candidate, but will ultimately scan through all
split candidates, if no "good enough" match is found.
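
To make that concrete, here is a loose, self-contained sketch of those
passes -- made-up types and helpers rather than the attached
nbtsplitloc.c, with an int array standing in for a tuple's key columns,
and a simple outward walk from the middle candidate in place of the
recursive zoom-in:

#include <limits.h>
#include <stdlib.h>

#define NKEYATTS 3

typedef struct SplitCand
{
    int     firstright;     /* index of first tuple that would go right */
    int     leftfree;       /* bytes free on left page after this split */
    int     rightfree;      /* bytes free on right page after this split */
} SplitCand;

/* leading key columns needed to tell lastleft and firstright apart */
static int
keep_natts(const int *lastleft, const int *firstright)
{
    int     keep = 1;

    for (int i = 0; i < NKEYATTS; i++)
    {
        if (lastleft[i] != firstright[i])
            break;
        keep++;
    }
    return keep;
}

/*
 * tuples[] holds the page's keys in sorted order; cands[] holds the legal
 * split points recorded by pass 1, in page order, each with firstright >= 1.
 * ncands must be > 0.  Returns the index of the chosen candidate.
 */
static int
choose_split(int tuples[][NKEYATTS], const SplitCand *cands, int ncands)
{
    /* pass 2: best truncation achievable anywhere among the candidates */
    int     perfect = keep_natts(tuples[cands[0].firstright - 1],
                                 tuples[cands[ncands - 1].firstright]);
    int     best = -1;
    int     bestdelta = INT_MAX;

    /* pass 3: walk outward from the middle candidate */
    for (int step = 0; step < 2 * ncands; step++)
    {
        int     i = ncands / 2 + ((step & 1) ? (step + 1) / 2 : -(step / 2));
        int     delta;

        if (i < 0 || i >= ncands)
            continue;
        delta = abs(cands[i].leftfree - cands[i].rightfree);
        if (delta < bestdelta)
        {
            bestdelta = delta;
            best = i;
        }
        /* "good enough": this split achieves the optimum truncation */
        if (keep_natts(tuples[cands[i].firstright - 1],
                       tuples[cands[i].firstright]) == perfect)
            return i;
    }
    return best;            /* no candidate was perfect; lowest delta seen */
}

The real code obviously also has to deal with page space accounting, the
new item, fillfactor and so on; the sketch only shows the shape of the
search.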

I've only been testing this on leaf splits. I didn't understand how the
penalty worked for internal pages in your patch. In this version, the
same algorithm is used for leaf and internal pages. I'm sure this still
has bugs in it, and could use some polishing, but I think this will be
a more readable way of doing it.

What have you been using to test this? I wrote the attached little test
extension, to explore what _bt_findsplitloc() decides in different
scenarios. It's pretty rough, but that's what I've been using while
hacking on this. It prints output like this:

postgres=# select test_split();
NOTICE: test 1:
left 2/358: 1 0
left 358/358: 1 356
right 1/ 51: 1 357
right 51/ 51: 1 407 SPLIT TUPLE
split ratio: 12/87

NOTICE: test 2:
left 2/358: 0 0
left 358/358: 356 356
right 1/ 51: 357 357
right 51/ 51: 407 407 SPLIT TUPLE
split ratio: 12/87

NOTICE: test 3:
left 2/358: 0 0
left 320/358: 10 10 SPLIT TUPLE
left 358/358: 48 48
right 1/ 51: 49 49
right 51/ 51: 99 99
split ratio: 12/87

NOTICE: test 4:
left 2/309: 1 100
left 309/309: 1 407 SPLIT TUPLE
right 1/100: 2 0
right 100/100: 2 99
split ratio: 24/75

Each test consists of creating a temp table with one index, and
inserting rows in a certain pattern, until the root page splits. It then
prints the first and last tuples on both pages, after the split, as well
as the tuple that caused the split. I don't know if this is useful to
anyone but myself, but I thought I'd share it.

- Heikki

Attachments:

nbtsplitloc.c (text/x-csrc)
btree-split-test.tar.gz (application/gzip)
#54 Peter Geoghegan
In reply to: Heikki Linnakangas (#53)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I spent some time first trying to understand the current algorithm, and
then rewriting it in a way that I find easier to understand. I came up
with the attached. I think it optimizes for the same goals as your
patch, but the approach is quite different. At a very high level, I
believe the goals can be described as:

1. Find out how much suffix truncation is possible, i.e. how many key
columns can be truncated away, in the best case, among all possible ways
to split the page.

2. Among all the splits that achieve that optimum suffix truncation,
find the one with smallest "delta".

Thanks for going to the trouble of implementing what you have in mind!

I agree that the code that I wrote within nbtsplitloc.c is very
subtle, and I do think that I have further work to do to make its
design clearer. I think that this high level description of the goals
of the algorithm is inaccurate in subtle but important ways, though.
Hopefully there will be a way of making it more understandable that
preserves certain important characteristics. If you had my test
cases/data that would probably help you a lot (more on that later).

The algorithm I came up with almost always does these two things in
the opposite order, with each considered in clearly separate phases.
There are good reasons for this. We start with the same criteria as
the old algorithm. We assemble a small array of candidate split
points, rather than one split point, but otherwise it's almost exactly
the same (the array is sorted by delta). Then, at the very end, we
iterate through the small array to find the best choice for suffix
truncation. IOW, we only consider suffix truncation as a *secondary*
goal. The delta is still by far the most important thing 99%+ of the
time. I assume it's fairly rare to not have two distinct tuples within
9 or so tuples of the delta-wise optimal split position -- 99% is
probably a low estimate, at least in OLTP apps, or within unique
indexes. I see that you do something with a "good enough" delta that
seems like it also makes delta the most important thing, but that
doesn't appear to be, uh, good enough. ;-)

Now, it's true that my approach does occasionally work in a way close
to what you describe above -- it does this when we give up on default
mode and check "how much suffix truncation is possible?" exhaustively,
for every possible candidate split point. "Many duplicates" mode kicks
in when we need to be aggressive about suffix truncation. Even then,
the exact goals are different to what you have in mind in subtle but
important ways. While "truncating away the heap TID" isn't really a
special case in other places, it is a special case for my version of
nbtsplitloc.c, which more or less avoids it at all costs. Truncating
away heap TID is more important than truncating away any other
attribute by a *huge* margin. Many duplicates mode *only* specifically
cares about truncating the final TID attribute. That is the only thing
that is ever treated as more important than delta, though even there
we don't forget about delta entirely. That is, we assume that the
"perfect penalty" is nkeyatts when in many duplicates mode, because we
don't care about suffix truncation beyond heap TID truncation by then.
So, if we find 5 split points out of 250 in the final array that avoid
appending heap TID, we use the earliest/lowest delta out of those 5.
We're not going to try to maximize the number of *additional*
attributes that get truncated, because that can make the leaf pages
unbalanced in an *unbounded* way. None of these 5 split points are
"good enough", but the distinction between their deltas still matters
a lot. We strongly prefer a split with a *mediocre* delta to a split
with a *terrible* delta -- a bigger high key is the least of our
worries here. (I made similar mistakes myself months ago, BTW.)
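
To make the ordering of concerns concrete, here is a toy sketch of the
final pass over the small candidate array -- made-up types and helpers,
not the real nbtsplitloc.c code:

#include <stdbool.h>

typedef struct Candidate
{
    int     delta;          /* space imbalance; array kept sorted on this */
    int     keepnatts;      /* key attrs the new pivot would have to keep */
    bool    needs_heap_tid; /* would the new pivot need a heap TID? */
} Candidate;

/*
 * cands[] is the small array of acceptable split points, already sorted by
 * delta (best delta first).  Caller guarantees ncands > 0.  Returns the
 * index of the chosen candidate.
 */
static int
pick_split(const Candidate *cands, int ncands, bool many_duplicates)
{
    int     best = 0;

    if (many_duplicates)
    {
        /*
         * Only heap TID truncation matters here: take the lowest-delta
         * candidate that avoids appending a heap TID to the new pivot,
         * and fall back to the lowest delta overall if there is none.
         */
        for (int i = 0; i < ncands; i++)
        {
            if (!cands[i].needs_heap_tid)
                return i;
        }
        return 0;
    }

    /*
     * Default mode: delta already constrained the candidates, so suffix
     * truncation is only a tie-breaker -- prefer the candidate whose
     * pivot keeps the fewest attributes.
     */
    for (int i = 1; i < ncands; i++)
    {
        if (cands[i].keepnatts < cands[best].keepnatts)
            best = i;
    }
    return best;
}

The point is that delta has constrained the candidates before truncation
is ever considered, and that many duplicates mode only ever asks one
question about truncation.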

Your version of the algorithm makes a test case of mine (UK land
registry test case [1]) go from having an index that's 1101 MB with my
version of the algorithm/patch and 1329 MB on the master branch to an
index that's 3030 MB in size. I think that this happens because it
effectively fails to give any consideration to delta at all at certain
points. On leaf pages with lots of unique keys, your algorithm does
about as well as mine because all possible split points look equally
good suffix-truncation-wise, plus you have the "good enough" test, so
delta isn't ignored. I think that your algorithm also works well when
there are many duplicates but only one non-TID index column, since the
heap TID truncation versus other truncation issue does not arise. The
test case I used is an index on "(county, city, locality)", though --
low cardinality, but more than a single column. That's a *combination*
of two separate considerations that seem to get conflated. I don't
think that you can avoid doing "a second pass" in some sense, because
these really are separate considerations.

There is an important middle-ground that your algorithm fails to
handle with this test case. You end up maximizing the number of
attributes that are truncated when you shouldn't -- leaf page splits
are totally unbalanced much of the time. Pivot tuples are smaller on
average, but are also far far more numerous, because there are more
leaf page splits as a result of earlier leaf page splits being
unbalanced. If instead you treated heap TID truncation as the only
thing that you were willing to go to huge lengths to prevent, then
unbalanced splits become *self-limiting*. The next split will probably
end up being a single value mode split, which packs pages full of
duplicates as tightly as possible.

Splits should "degrade gracefully" from default mode to many
duplicates mode to single value mode in cases where the number of
distinct values is constant (or almost constant), but the total number
of tuples grows over time.

I've only been testing this on leaf splits. I didn't understand how the
penalty worked for internal pages in your patch. In this version, the
same algorithm is used for leaf and internal pages.

The approach that I use for internal pages is only slightly different
to what we've always done -- I split very near the delta-wise optimal
point, with a slight preference for a tuple that happens to be
smaller. And, there is no way in which the delta-optimal point can be
different to what it would have been on master with internal pages
(they only use default mode). I don't think it's appropriate to use
the same algorithm for leaf and internal page splits at all. We cannot
perform suffix truncation on internal pages.

What have you been using to test this? I wrote the attached little test
extension, to explore what _bt_findsplitloc() decides in different
scenarios.

I've specifically tested the _bt_findsplitloc() stuff using a couple
of different techniques. Primarily, I've been using lots of real world
data and TPC benchmark test data, with expected/test output generated
by a contrib/pageinspect query that determines the exact number of
leaf blocks and internal page blocks from each index in a test
database. Just bash and SQL. I'm happy to share that with you, if
you're able to accept a couple of gigabytes worth of dumps that are
needed to make the scripts work. Details:

pg@bat:~/hdd/sample-data$ ll land_registry.custom.dump
-rw------- 1 pg pg 1.1G Mar 3 2018 land_registry.custom.dump
pg@bat:~/hdd/sample-data$ ll tpcc_2018-07-20_unlogged.dump
-rw-rw-r-- 1 pg pg 1.8G Jul 20 2018 tpcc_2018-07-20_unlogged.dump

(The only other components for these "fast" tests are simple bash scripts.)

I think that you'd find it a lot easier to work with me on these
issues if you at least had these tests -- my understanding of the
problems was shaped by the tests. I strongly recommend that you try
out my UK land registry test and the TPC-C test as a way of
understanding the design I've used for _bt_findsplitloc(). It
shouldn't be that inconvenient to get it over to you. I have several
more tests besides these two, but they're much more cumbersome and
much less valuable. I have a script that I can run in 5 minutes that
probably catches all the regressions. The long running stuff, like my
TPC-E test case (the stuff that I won't bother sending) hasn't caught
any regressions that the fast tests didn't catch as well.

Separately, I also have a .gdbinit function that looks like this:

define dump_page
dump binary memory /tmp/gdb_postgres_page.dump $arg0 ($arg0 + 8192)
echo Invoking pg_hexedit + wxHexEditor on page...\n
! ~/code/pg_hexedit/pg_hexedit -n 1 /tmp/gdb_postgres_page.dump > /tmp/gdb_postgres_page.dump.tags
! ~/code/wxHexEditor/wxHexEditor /tmp/gdb_postgres_page.dump &> /dev/null
end

This allows me to see an arbitrary page from an interactive gdb
session using my pg_hexedit tool. I can simply "dump_page page" from
most functions in the nbtree source code. At various points I found it
useful to add optimistic assertions to the split point choosing
routines that failed. I could then see why they failed by using gdb
with the resulting core dump. I could look at the page image using
pg_hexedit/wxHexEditor from there. This allowed me to understand one
or two corner cases. For example, this is how I figured out the exact
details at the end of _bt_perfect_penalty(), when it looks like we're
about to go into a second pass of the page.

It's pretty rough, but that's what I've been using while
hacking on this. It prints output like this:

Cool! I did have something that would LOG the new high key in an easy
to interpret way at one point, which was a little like this.

[1]: /messages/by-id/CAH2-Wzn5XbCzk6u0GL+uPnCp1tbrp2pJHJ=3bYT4yQ0_zzHxmw@mail.gmail.com
--
Peter Geoghegan

#55 Peter Geoghegan
In reply to: Peter Geoghegan (#54)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Mon, Jan 28, 2019 at 1:41 PM Peter Geoghegan <pg@bowt.ie> wrote:

Thanks for going to the trouble of implementing what you have in mind!

I agree that the code that I wrote within nbtsplitloc.c is very
subtle, and I do think that I have further work to do to make its
design clearer. I think that this high level description of the goals
of the algorithm is inaccurate in subtle but important ways, though.
Hopefully there will be a way of making it more understandable that
preserves certain important characteristics.

Heikki and I had the opportunity to talk about this recently. We found
an easy way forward. I believe that the nbtsplitloc.c algorithm itself
is fine -- the code will need to be refactored, though.

nbtsplitloc.c can be refactored to assemble a list of legal split
points up front, before deciding which one to go with in a separate
pass (using one of two "alternative modes", as before). I now
understand that Heikki simply wants to separate the questions of "Is
this candidate split point legal?" from "Is this known-legal candidate
split point good/ideal based on my current criteria?". This seems like
a worthwhile goal to me. Heikki accepts the need for multiple
modes/passes, provided recursion isn't used in the implementation.

It's clear to me that the algorithm should start off trying to split
towards the middle of the page (or towards the end in the rightmost
case), while possibly making a small compromise on the exact split
point to maximize the effectiveness of suffix truncation. We must
change strategy entirely if and only if the middle of the page (or
wherever we'd like to split initially) is found to be completely full
of duplicates -- that's where the need for a second pass comes in.
This should almost never happen in most applications. Even when it
happens, we only care about not splitting inside a group of
duplicates. That's not the same thing as caring about maximizing the
number of attributes truncated away. Those two things seem similar,
but are actually very different.
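
As a rough illustration of that rule (made-up helpers, not actual patch
code): because the page is sorted, deciding whether a second pass is
needed only requires looking at the two tuples at the edges of the region
where a default-mode split could land:

#include <stdbool.h>

#define NKEYATTS 2

/* true when two tuples' keys are equal on every key column */
static bool
keys_equal(const int *a, const int *b)
{
    for (int i = 0; i < NKEYATTS; i++)
    {
        if (a[i] != b[i])
            return false;
    }
    return true;
}

/*
 * "lo" and "hi" index the tuples just left of the leftmost acceptable
 * default-mode split point and at the rightmost acceptable one.  Since the
 * page is sorted, the two extremes being fully equal means every tuple in
 * between is a duplicate too, and no split in that region can avoid
 * appending a heap TID to the new pivot -- so a second pass is needed.
 */
static bool
need_second_pass(int tuples[][NKEYATTS], int lo, int hi)
{
    return keys_equal(tuples[lo], tuples[hi]);
}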

It might have sounded like Heikki and I disagreed on the design of the
algorithm at a high level, or what its goals ought to be. That is not
the case, though. (At least not so far.)

--
Peter Geoghegan

#56 Peter Geoghegan
In reply to: Peter Geoghegan (#55)
7 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Feb 5, 2019 at 4:50 PM Peter Geoghegan <pg@bowt.ie> wrote:

Heikki and I had the opportunity to talk about this recently. We found
an easy way forward. I believe that the nbtsplitloc.c algorithm itself
is fine -- the code will need to be refactored, though.

Attached v12 does not include this change, though I have every
intention of doing the refactoring described for v13. The
nbtsplitloc.c/split algorithm refactoring would necessitate
revalidating the patch's performance, though, which didn't seem worth
blocking on. Besides, there was bit rot that needed to be fixed.

Notable improvements in v12:

* No more papering-over regression test differences caused by
pg_depend issues, thanks to recent work by Tom (today's commit
1d92a0c9).

* I simplified the code added to _bt_binsrch() to deal with saving and
restoring binary search bounds for _bt_check_unique()-caller
insertions (this is from first/"Refactor nbtree insertion scankeys"
patch). I also improved matters within _bt_check_unique() itself: the
early "break" there (based on reaching the known strict upper bound
from cache binary search) works in terms of the existing
_bt_check_unique() loop invariant.

This even allowed me to add a new assertion that makes sure that
breaking out of the loop early is correct -- we call _bt_isequal() for
the next item on assert-enabled builds when we break having reached the
strict upper bound established by the initial binary search. In other words,
_bt_check_unique() ends up doing the same number of _bt_isequal()
calls as it did on the master branch, provided assertions are enabled.
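
As a stripped-down illustration of the early break (plain arrays instead
of the actual _bt_binsrch()/_bt_check_unique() code, and a made-up
visit_possible_conflicts() helper):

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Visit every key that might conflict with "target".  keys[] is sorted,
 * and "low" and "strict_high" come from an earlier binary search:
 * keys[i] < target holds for i < low, and keys[i] > target holds for
 * i >= strict_high.  The scan can therefore stop at strict_high instead
 * of rechecking keys until a greater one is found; the assert re-verifies
 * that claim on debug builds, mirroring the new assertion described above.
 */
static void
visit_possible_conflicts(const int *keys, size_t nkeys,
                         size_t low, size_t strict_high, int target,
                         bool (*visit)(size_t index))
{
    for (size_t i = low; ; i++)
    {
        if (i >= strict_high)
        {
            /* stopping early is only correct if the next key differs */
            assert(i >= nkeys || keys[i] != target);
            break;
        }
        if (!visit(i))      /* caller may stop early, e.g. on a conflict */
            break;
    }
}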

* I've restored regression test coverage that the patch previously
inadvertently took away. Suffix truncation made deliberately-tall
B-Tree indexes from the regression tests much shorter, making the
tests fail to test the code paths the tests originally targeted. I
needed to find ways to "defeat" suffix truncation so I still ended up
with a fairly tall tree that hit various code paths.

I think that we went from having 8 levels in btree_tall_idx (i.e.
ridiculously many) to having only a single root page when I first
caught the problem! Now btree_tall_idx only has 3 levels, which is all
we really need. Even multi-level page deletion didn't have any
coverage in previous versions. I used gcov to specifically verify that
we have good multi-level page deletion coverage. I also used gcov to
make sure that we have coverage of the v11 "cache rightmost block"
optimization, since I noticed that that was missing (though present on
the master branch) -- that's actually all that the btree_tall_idx test
covers in the patch, since multi-level page deletion is covered by a
covering-indexes-era test. Finally, I made sure that we have coverage
of fast root splits. In general, I preserved the original intent
behind the existing tests, all of which I was fairly familiar with
from previous projects.

* I've added a new "relocate" bt_index_parent_check()/amcheck option,
broken out in a separate commit. This new option makes verification
relocate each and every leaf page tuple, starting from the root each
time. This means that there will be at least one piece of code that
specifically relies on "every tuple should have a unique key" from the
start, which seems like a good idea.

This enhancement to amcheck allows me to detect various forms of
corruption that no other existing verification option would catch. In
particular, I can catch various very subtle "cross-cousin
inconsistencies" that require that we verify a page using its
grandparent rather than its parent [1] (existing checks catch some but
not all "cousin problem" corruption). Simply put, this amcheck
enhancement allows me to detect corruption of the least significant
byte in a key value in the root page -- that kind of corruption will
cause index scans to miss only a small number of tuples at the leaf
level. Maybe this scenario isn't realistic, but I'd rather not take
any chances.

* I rethought the "single value mode" fillfactor, which I've been
suspicious of for a while now. It's now 96, down from 99.

Micro-benchmarks involving concurrent sessions inserting into a low
cardinality index led me to the conclusion that 99 was aggressively
high. It was not that hard to get excessive page splits with these
microbenchmarks, since insertions with monotonically increasing heap
TIDs arrived a bit out of order with a lot of concurrency. 99 worked a
bit better than 96 with only one session, but significantly worse with
concurrent sessions. I still think that it's a good idea to be more
aggressive than default leaf fillfactor, but reducing "single value
mode" fillfactor to 90 (or whatever the user set general leaf
fillfactor to) wouldn't be so bad.

[1]: http://subs.emis.de/LNI/Proceedings/Proceedings144/32.pdf
--
Peter Geoghegan

Attachments:

v12-0007-DEBUG-Add-pageinspect-instrumentation.patch (application/x-patch)
From 67da20eeeed84097523e30c9046fe80ee90b41b8 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v12 7/7] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 68 +++++++++++++++----
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 ++++++
 3 files changed, 79 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..95c81c0808 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -254,9 +256,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[7];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -265,6 +267,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer htid;
+	BTPageOpaque opaque;
 
 	id = PageGetItemId(page, offset);
 
@@ -283,16 +287,53 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = natts;
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		indnkeyatts = rel->rd_index->indnkeyatts;
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (P_ISLEAF(opaque) && offset >= P_FIRSTDATAKEY(opaque))
+		htid = &itup->t_tid;
+	else if (_bt_heapkeyspace(rel))
+		htid = BTreeTupleGetHeapTID(itup);
+	else
+		htid = NULL;
+
+	if (htid)
+		values[j] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(htid),
+							 ItemPointerGetOffsetNumberNoCheck(htid));
+	else
+		values[j] = NULL;
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -366,11 +407,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -397,12 +438,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -482,7 +524,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

v12-0006-Allow-tuples-to-be-relocated-from-root-by-amchec.patch (application/x-patch)
From db4331c53dd903d3686411519ad190380a215778 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 31 Jan 2019 17:40:00 -0800
Subject: [PATCH v12 6/7] Allow tuples to be relocated from root by amcheck.

Teach contrib/amcheck's bt_index_parent_check() function to take
advantage of the uniqueness property of heapkeyspace indexes in support
of a new verification option: non-pivot tuples (non-highkey tuples on
the leaf level) can optionally be relocated using a new search that
starts from the root page.

The new "relocate" verification option is exhaustive, and can therefore
make a call to bt_index_parent_check() take a lot longer.  Relocating
tuples during verification is intended as an option for backend
developers, since the corruption scenarios that it alone is uniquely
capable of detecting seem fairly far-fetched.  For example, "relocate"
verification is generally the only way of detecting corruption of the
least significant byte of a key from a pivot tuple in the root page,
since only a few tuples on a cousin leaf page are liable to "be
overlooked" by index scans.
---
 contrib/amcheck/Makefile                 |   2 +-
 contrib/amcheck/amcheck--1.1--1.2.sql    |  19 +++
 contrib/amcheck/amcheck.control          |   2 +-
 contrib/amcheck/expected/check_btree.out |   5 +-
 contrib/amcheck/sql/check_btree.sql      |   5 +-
 contrib/amcheck/verify_nbtree.c          | 157 +++++++++++++++++++++--
 doc/src/sgml/amcheck.sgml                |   7 +-
 7 files changed, 181 insertions(+), 16 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.1--1.2.sql

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index c5764b544f..dcec3b8520 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big	= amcheck
 OBJS		= verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
-DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
+DATA = amcheck--1.1--1.2.sql amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.1--1.2.sql b/contrib/amcheck/amcheck--1.1--1.2.sql
new file mode 100644
index 0000000000..de7b657f2f
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.1--1.2.sql
@@ -0,0 +1,19 @@
+/* contrib/amcheck/amcheck--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.2'" to load this file. \quit
+
+-- In order to avoid issues with dependencies when updating amcheck to 1.2,
+-- create new, overloaded version of the 1.1 function signature
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass,
+    heapallindexed boolean, relocate boolean)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want this to be available to public
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 469048403d..c6e310046d 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
 # amcheck extension
 comment = 'functions for verifying relation integrity'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/amcheck'
 relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 1e6079ddd2..687fde8fce 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -126,7 +126,8 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 (1 row)
 
 --
--- Test for multilevel page deletion/downlink present checks
+-- Test for multilevel page deletion/downlink present checks, and relocate
+-- checks
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
@@ -137,7 +138,7 @@ VACUUM delete_test_table;
 -- root"
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
-SELECT bt_index_parent_check('delete_test_table_pkey', true);
+SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
  bt_index_parent_check 
 -----------------------
  
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index 3f1e0d17ef..d33d3e6682 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -78,7 +78,8 @@ INSERT INTO bttest_multi SELECT i, i%2  FROM generate_series(1, 100000) as i;
 SELECT bt_index_parent_check('bttest_multi_idx', true);
 
 --
--- Test for multilevel page deletion/downlink present checks
+-- Test for multilevel page deletion/downlink present checks, and relocate
+-- checks
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
@@ -89,7 +90,7 @@ VACUUM delete_test_table;
 -- root"
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
-SELECT bt_index_parent_check('delete_test_table_pkey', true);
+SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
 
 --
 -- BUG #15597: must not assume consistent input toasting state when forming
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index a7d060b3ec..151c6d5fdb 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -74,6 +74,8 @@ typedef struct BtreeCheckState
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
 	bool		heapallindexed;
+	/* Also relocating non-pivot tuples? */
+	bool		relocate;
 	/* Per-page context */
 	MemoryContext targetcontext;
 	/* Buffer access strategy */
@@ -123,10 +125,11 @@ PG_FUNCTION_INFO_V1(bt_index_check);
 PG_FUNCTION_INFO_V1(bt_index_parent_check);
 
 static void bt_index_check_internal(Oid indrelid, bool parentcheck,
-						bool heapallindexed);
+						bool heapallindexed, bool relocate);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool heapkeyspace, bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed,
+					 bool relocate);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -139,6 +142,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 						   IndexTuple itup);
+static bool bt_relocate_from_root(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
@@ -176,7 +180,7 @@ bt_index_check(PG_FUNCTION_ARGS)
 	if (PG_NARGS() == 2)
 		heapallindexed = PG_GETARG_BOOL(1);
 
-	bt_index_check_internal(indrelid, false, heapallindexed);
+	bt_index_check_internal(indrelid, false, heapallindexed, false);
 
 	PG_RETURN_VOID();
 }
@@ -195,11 +199,14 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
 {
 	Oid			indrelid = PG_GETARG_OID(0);
 	bool		heapallindexed = false;
+	bool		relocate = false;
 
-	if (PG_NARGS() == 2)
+	if (PG_NARGS() >= 2)
 		heapallindexed = PG_GETARG_BOOL(1);
+	if (PG_NARGS() == 3)
+		relocate = PG_GETARG_BOOL(2);
 
-	bt_index_check_internal(indrelid, true, heapallindexed);
+	bt_index_check_internal(indrelid, true, heapallindexed, relocate);
 
 	PG_RETURN_VOID();
 }
@@ -208,7 +215,8 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
  * Helper for bt_index_[parent_]check, coordinating the bulk of the work.
  */
 static void
-bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
+						bool relocate)
 {
 	Oid			heapid;
 	Relation	indrel;
@@ -266,7 +274,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	/* Check index, possibly against table it is an index on */
 	heapkeyspace = _bt_heapkeyspace(indrel);
 	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
-						 heapallindexed);
+						 heapallindexed, relocate);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -337,7 +345,7 @@ btree_index_checkable(Relation rel)
  */
 static void
 bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
-					 bool readonly, bool heapallindexed)
+					 bool readonly, bool heapallindexed, bool relocate)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -361,6 +369,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
+	state->relocate = relocate;
 
 	if (state->heapallindexed)
 	{
@@ -429,6 +438,14 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		}
 	}
 
+	Assert(!state->relocate || state->readonly);
+	if (state->relocate && !state->heapkeyspace)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("index \"%s\" does not support relocating tuples",
+						RelationGetRelationName(rel)),
+				 errhint("Only indexes initialized on PostgreSQL 12 support relocation verification.")));
+
 	/* Create context for page */
 	state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
 												 "amcheck context",
@@ -921,6 +938,32 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
+		/*
+		 * Readonly callers may optionally relocate non-pivot tuples for
+		 * heapkeyspace indexes.  A new search starting from the root
+		 * relocates every current entry in turn.
+		 */
+		if (state->relocate && P_ISLEAF(topaque) &&
+			!bt_relocate_from_root(state, itup))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("could not relocate tuple in index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+										itid, htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_minusinfkey(state->rel, itup);
 
@@ -1525,6 +1568,9 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 		 * internal pages.  In more general terms, a negative infinity item is
 		 * only negative infinity with respect to the subtree that the page is
 		 * at the root of.
+		 *
+		 * See also: bt_relocate_from_root(), which can even detect transitive
+		 * inconsistencies on cousin leaf pages.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
@@ -1925,6 +1971,101 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Search for itup in index, starting from fast root page.  itup must be a
+ * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
+ * we rely on having fully unique keys to relocate itup without visiting more
+ * than one page on each level, barring an interrupted page split, where we
+ * may have to move right.  (A concurrent page split is impossible because
+ * caller must be readonly caller.)
+ *
+ * This routine can detect very subtle transitive consistency issues across
+ * more than one level of the tree.  Leaf pages all have a high key (even the
+ * rightmost page has a conceptual positive infinity high key), but not a low
+ * key.  Their downlink in parent is a lower bound, which along with the high
+ * key is almost enough to detect every possible inconsistency.  A downlink
+ * separator key value won't always be available from parent, though, because
+ * the first items of internal pages are negative infinity items, truncated
+ * down to zero attributes during internal page splits.  While it's true that
+ * bt_downlink_check() and the high key check can detect most imaginable key
+ * space problems, there are remaining problems it won't detect with non-pivot
+ * tuples in cousin leaf pages.  Starting a search from the root for every
+ * existing leaf tuple detects small inconsistencies in upper levels of the
+ * tree that cannot be detected any other way.  (Besides all this, it's
+ * probably a useful testing strategy to exhaustively verify that all
+ * non-pivot tuples can be relocated in the index using the same code paths as
+ * those used by index scans.)
+ *
+ * Alternative nbtree design that could be used perform "cousin verification"
+ * inexpensively/transitively (may make current approach clearer):
+ *
+ * A cousin leaf page has a lower bound that comes from its grandparent page
+ * rather than its parent page, as already discussed (note also that a "second
+ * cousin" leaf page gets its lower bound from its great-grandparent, and so
+ * on).  Any pivot tuple in the root page after the first tuple (which is an
+ * "absolute" negative infinity tuple, since its leftmost on the level) should
+ * separate every leaf page into <= and > pages.  Even with the current
+ * design, there should be an unbroken seam of identical-to-root-pivot high
+ * key separator values at the right edge of the <= pages, all the way down to
+ * (and including) the leaf level.  Recall that page deletion won't delete the
+ * rightmost child of a leaf page unless that page is the only child, and it
+ * is deleting the parent page as well.
+ *
+ * If we didn't truncate the item at first/negative infinity offset to zero
+ * attributes during internal page splits then there would also be an unbroken
+ * seam of identical-to-root-pivot "low key" separator values on the left edge
+ * of the pages that are > the separator value; this alternative design would
+ * allow us to verify the same invariants directly, without ever having to
+ * cross more than one level of the tree (we'd still have to cross one level
+ * because leaf pages would still not store a low key directly, and we'd still
+ * need bitwise-equality cross checks of downlink separator in parent against
+ * the low keys in their non-leaf children).
+ */
+static bool
+bt_relocate_from_root(BtreeCheckState *state, IndexTuple itup)
+{
+	BTScanInsert key;
+	BTStack		stack;
+	Buffer		lbuf;
+	bool		exists;
+
+	/* No need to use bt_mkscankey_minusinfkey() here */
+	key = _bt_mkscankey(state->rel, itup);
+	Assert(key->heapkeyspace && key->scantid != NULL);
+
+	/*
+	 * Search from root.
+	 *
+	 * Ideally, we would arrange to only move right within _bt_search() when
+	 * an interrupted page split is detected (i.e. when the incomplete split
+	 * bit is found to be set), but for now we accept the possibility that
+	 * that could conceal certain remaining inconsistencies.
+	 */
+	Assert(state->readonly && state->relocate);
+	exists = false;
+	stack = _bt_search(state->rel, key, &lbuf, BT_READ, NULL);
+
+	if (BufferIsValid(lbuf))
+	{
+		OffsetNumber offnum;
+		Page		page;
+
+		/* Get matching tuple on leaf page */
+		offnum = _bt_binsrch(state->rel, key, lbuf);
+		/* Compare first >= matching item on leaf page, if any */
+		page = BufferGetPage(lbuf);
+		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			_bt_compare(state->rel, key, page, offnum) == 0)
+			exists = true;
+		_bt_relbuf(state->rel, lbuf);
+	}
+
+	_bt_freestack(stack);
+	pfree(key);
+
+	return exists;
+}
+
 /*
  * Is particular offset within page (whose special state is passed by caller)
  * the page negative-infinity item?
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 8bb60d5c2d..c638456638 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -112,7 +112,7 @@ ORDER BY c.relpages DESC LIMIT 10;
 
    <varlistentry>
     <term>
-     <function>bt_index_parent_check(index regclass, heapallindexed boolean) returns void</function>
+     <function>bt_index_parent_check(index regclass, heapallindexed boolean, relocate boolean) returns void</function>
      <indexterm>
       <primary>bt_index_parent_check</primary>
      </indexterm>
@@ -126,7 +126,10 @@ ORDER BY c.relpages DESC LIMIT 10;
       argument is <literal>true</literal>, the function verifies the
       presence of all heap tuples that should be found within the
       index, and that there are no missing downlinks in the index
-      structure.  The checks that can be performed by
+      structure.  When the optional <parameter>relocate</parameter>
+      argument is <literal>true</literal>, verification relocates
+      tuples on the leaf level by performing a new search from the
+      root page.  The checks that can be performed by
       <function>bt_index_parent_check</function> are a superset of the
       checks that can be performed by <function>bt_index_check</function>.
       <function>bt_index_parent_check</function> can be thought of as
-- 
2.17.1

v12-0004-Add-split-after-new-tuple-optimization.patch (application/x-patch)
From ed66dc19791033ee5c3a530c60e537d5f137699c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 16:48:08 -0700
Subject: [PATCH v12 4/7] Add "split after new tuple" optimization.

Add additional heuristics to the algorithm for locating an optimal split
location.  New logic identifies localized monotonically increasing
values.  This can greatly help space utilization to split between two
groups of localized monotonically increasing values.

Without this patch, affected cases will reliably leave leaf pages no
more than about 50% full.  50/50 page splits are only appropriate with a
pattern of truly random insertions.  The optimization is very similar to
the long established fillfactor optimization used during rightmost page
splits, where we usually leave the new left side of the split 90% full.
Split-at-new-tuple page splits target essentially the same case. The
splits targeted are those at the rightmost point of a localized grouping
of values, rather than those at the rightmost point of the entire key
space.

This enhancement is very effective at avoiding index bloat when initial
bulk INSERTs for the TPC-C benchmark are run, and throughout the TPC-C
benchmark.  Localized monotonically increasing insertion patterns are
presumed to be fairly common in real-world applications.

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.
---
 src/backend/access/nbtree/nbtsplitloc.c | 174 ++++++++++++++++++++++++
 1 file changed, 174 insertions(+)

diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 015591eb87..396737f4bd 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -61,6 +61,9 @@ static OffsetNumber _bt_dofindsplitloc(Relation rel, Page page,
 static int _bt_checksplitloc(FindSplitData *state,
 				  OffsetNumber firstoldonright, bool newitemonleft,
 				  int dataitemstoleft, Size firstoldonrightsz);
+static bool _bt_splitafternewitemoff(Relation rel, Page page,
+						 int leaffillfactor, OffsetNumber newitemoff,
+						 IndexTuple newitem, double *propfullonleft);
 static OffsetNumber _bt_bestsplitloc(Relation rel, Page page,
 				 FindSplitData *state,
 				 int perfectpenalty,
@@ -71,6 +74,7 @@ static int _bt_perfect_penalty(Relation rel, Page page, SplitMode mode,
 					IndexTuple newitem, SplitMode *secondmode);
 static int _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
 				  IndexTuple newitem, SplitPoint *split, bool is_leaf);
+static bool _bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid);
 
 
 /*
@@ -254,6 +258,12 @@ _bt_dofindsplitloc(Relation rel,
 			state.propfullonleft = leaffillfactor / 100.0;
 			state.is_weighted = true;
 		}
+		else if (_bt_splitafternewitemoff(rel, page, leaffillfactor, newitemoff,
+										  newitem, &state.propfullonleft))
+		{
+			/* propfullonleft was set for us */
+			state.is_weighted = true;
+		}
 		else
 		{
 			/* propfullonleft won't be used, but be tidy */
@@ -555,6 +565,142 @@ _bt_checksplitloc(FindSplitData *state,
 	return INT_MAX;
 }
 
+/*
+ * Subroutine to determine whether or not the page should be split immediately
+ * after the would-be original page offset for the new/incoming tuple.  When
+ * the optimization is applied, the new/incoming tuple becomes the last tuple
+ * on the new left page.  (Actually, newitemoff > maxoff is usually the case
+ * that leads to applying the optimization in practice, so applying leaf
+ * fillfactor in the style of a rightmost leaf page split is the most common
+ * outcome.)
+ *
+ * This routine targets splits in composite indexes that consist of one or
+ * more leading columns that describe some grouping, plus a trailing column
+ * with ascending values.  This pattern is prevalent in many real world
+ * applications.  Consider the example of a composite index on (supplier_id,
+ * invoice_id), where there are a small, nearly-fixed number of suppliers, and
+ * invoice_id is an identifier assigned in ascending order (it doesn't matter
+ * whether invoice_id values are assigned from one shared counter or from a
+ * counter per supplier).  Without this optimization, approximately
+ * 50% of space in leaf pages will be wasted by unweighted/50:50 page splits.
+ * With this optimization, space utilization will be close to that of a
+ * similar index where all tuple insertions modify the current rightmost leaf
+ * page in the index.
+ *
+ * This optimization may leave extra free space remaining on the rightmost
+ * page of a "most significant column" grouping of tuples if that grouping
+ * never ends up having future insertions that use the free space.  Testing
+ * has shown the effect to be self-limiting; a future grouping that becomes
+ * the "nearest on the right" grouping of the affected grouping usually puts
+ * the extra free space to good use instead.
+ *
+ * Caller uses propfullonleft rather than using the new item offset directly
+ * because not all offsets will be deemed legal as split points.  This also
+ * allows us to apply leaf fillfactor in the common case where the new
+ * insertion is after the last offset.
+ */
+static bool
+_bt_splitafternewitemoff(Relation rel, Page page, int leaffillfactor,
+						 OffsetNumber newitemoff, IndexTuple newitem,
+						 double *propfullonleft)
+{
+	OffsetNumber maxoff;
+	int16		nkeyatts;
+	ItemId		itemid;
+	IndexTuple	tup;
+	Size		tupspace;
+	Size		hikeysize;
+	int			keepnatts;
+
+	Assert(!P_RIGHTMOST((BTPageOpaque) PageGetSpecialPointer(page)));
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Proceed only when items on page look fairly short */
+	if (maxoff < MaxIndexTuplesPerPage / 2)
+		return false;
+
+	nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+	/* Single key indexes not considered here */
+	if (nkeyatts == 1)
+		return false;
+
+	/* Ascending insertion pattern never inferred when new item is first */
+	if (newitemoff == P_FIRSTKEY)
+		return false;
+
+	/*
+	 * Avoid applying optimization when tuples are not all of uniform size,
+	 * with the exception of the high key (existing high key may be smaller
+	 * due to truncation).  Surmise that page has equisized tuples when page
+	 * layout is consistent with having maxoff-1 non-pivot tuples that are all
+	 * the same size as the newly inserted tuple.
+	 */
+	tupspace = ((PageHeader) page)->pd_special - ((PageHeader) page)->pd_upper;
+	itemid = PageGetItemId(page, P_HIKEY);
+	hikeysize = ItemIdGetLength(itemid);
+	if (IndexTupleSize(newitem) * (maxoff - 1) != tupspace - hikeysize)
+		return false;
+
+	/*
+	 * At least the first attribute's value must be equal to the corresponding
+	 * value in previous tuple to apply optimization.  New item cannot be a
+	 * duplicate, either.
+	 */
+	if (newitemoff > maxoff)
+	{
+		/* Try to infer ascending insertion pattern */
+		itemid = PageGetItemId(page, maxoff);
+		tup = (IndexTuple) PageGetItem(page, itemid);
+		keepnatts = _bt_keep_natts_fast(rel, tup, newitem);
+
+		if (keepnatts > 1 && keepnatts <= nkeyatts)
+		{
+			*propfullonleft = (double) leaffillfactor / 100.0;
+			return true;
+		}
+
+		return false;
+	}
+
+	/*
+	 * When item isn't last (or first) on page, try to infer ascending
+	 * insertion pattern.  We try to split at the precise point of the
+	 * insertion here, rather than applying leaf fillfactor.
+	 *
+	 * "Low cardinality leading column, high cardinality suffix column"
+	 * indexes with a random insertion pattern (e.g. an index on '(country_id,
+	 * event_uuid)') may sometimes end up having the optimization applied
+	 * instead of getting a 50:50 (unweighted) page split.  This is
+	 * suboptimal.
+	 *
+	 * We're willing to accept that outcome when an incoming/new tuple is to
+	 * the right of all existing items on the page, since that's expected for
+	 * about one half of 1% of all page splits that occur in the index's
+	 * lifetime (assuming default BLCKSZ) with random insertions.  More care
+	 * must be taken here, where we consider splits involving the new item
+	 * being inserted at neither edge of the page: we proceed only when new
+	 * item's heap TID is "adjacent" to the heap TID of the existing tuple to
+	 * the immediate left of the offset for the new item.  Heap TID adjacency
+	 * strongly suggests that the item just to the left was inserted very
+	 * recently.
+	 */
+	itemid = PageGetItemId(page, OffsetNumberPrev(newitemoff));
+	tup = (IndexTuple) PageGetItem(page, itemid);
+	if (!_bt_adjacenthtid(&tup->t_tid, &newitem->t_tid))
+		return false;
+	/* Check same conditions as rightmost item case, too */
+	keepnatts = _bt_keep_natts_fast(rel, tup, newitem);
+
+	if (keepnatts > 1 && keepnatts <= nkeyatts)
+	{
+		*propfullonleft = (double) newitemoff / (((double) maxoff + 1));
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Subroutine to find the "best" split point among an array of acceptable
  * candidate split points that split without there being an excessively high
@@ -853,3 +999,31 @@ _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
 	Assert(lastleft != firstright);
 	return _bt_keep_natts_fast(rel, lastleft, firstright);
 }
+
+/*
+ * Subroutine for determining if two heap TIDs are "adjacent".
+ *
+ * Adjacent means that the high TID is very likely to have been inserted into
+ * heap relation immediately after the low TID, probably by the same
+ * transaction.
+ */
+static bool
+_bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid)
+{
+	BlockNumber lowblk,
+				highblk;
+
+	lowblk = ItemPointerGetBlockNumber(lowhtid);
+	highblk = ItemPointerGetBlockNumber(highhtid);
+
+	/* Make optimistic assumption of adjacency when heap blocks match */
+	if (lowblk == highblk)
+		return true;
+
+	/* When heap block is one up, high TID offset must be FirstOffsetNumber */
+	if (lowblk + 1 == highblk &&
+		ItemPointerGetOffsetNumber(highhtid) == FirstOffsetNumber)
+		return true;
+
+	return false;
+}
-- 
2.17.1

Attachment: v12-0005-Add-high-key-continuescan-optimization.patch (application/x-patch)
From e8b103c022ac51b172818c19c696443e083cccdb Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 12 Nov 2018 13:11:21 -0800
Subject: [PATCH v12 5/7] Add high key "continuescan" optimization.

Teach B-Tree forward index scans to check the high key before moving to
the next page in the hopes of finding that it isn't actually necessary
to move to the next page.  We already opportunistically force a key
check of the last item on leaf pages, even when it's clear that it
cannot be returned to the scan due to being dead-to-all, for the same
reason.  Since forcing the last item to be key checked no longer makes
any difference in the case of forward scans, the existing extra key
check is now only used for backwards scans.  Like the existing check,
the new check won't always work out, but that seems like an acceptable
price to pay.

The new approach is more effective than just checking non-pivot tuples,
especially with composite indexes and non-unique indexes.  The high key
represents an upper bound on all values that can appear on the page,
which is often greater than whatever tuple happens to appear last at the
time of the check.  Also, suffix truncation's new logic for picking a
split point will often result in high keys that are relatively
dissimilar to the other (non-pivot) tuples on the page, and therefore
more likely to indicate that the scan need not proceed to the next page.

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.
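
As a rough sketch of the idea (invented names, simplified to a forward scan
with a single "indexcol <= constant" qual; the real code just passes the
high key's offset to _bt_checkkeys()), the scan only needs to step right
when the page high key itself still satisfies the qual, since every tuple
on the right sibling is >= the high key:

#include <stdbool.h>

bool
sketch_visit_right_sibling(int scan_upper_bound, int page_high_key,
						   bool page_is_rightmost)
{
	if (page_is_rightmost)
		return false;			/* no right sibling exists */

	/*
	 * All tuples on the right sibling are >= the high key, so further
	 * matches are only possible when the high key passes the qual.
	 */
	return page_high_key <= scan_upper_bound;
}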
---
 src/backend/access/nbtree/nbtsearch.c | 23 +++++++--
 src/backend/access/nbtree/nbtutils.c  | 70 +++++++++++++++++++--------
 2 files changed, 68 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 54a4c64304..5e0a33383a 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1364,6 +1364,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
 	}
 
+	continuescan = true;		/* default assumption */
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1412,16 +1413,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 				_bt_saveitem(so, itemIndex, offnum, itup);
 				itemIndex++;
 			}
+			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
-			{
-				/* there can't be any more matches, so stop */
-				so->currPos.moreRight = false;
 				break;
-			}
 
 			offnum = OffsetNumberNext(offnum);
 		}
 
+		/*
+		 * Forward scans need not visit page to the right when high key
+		 * indicates no more matches will be found there.
+		 *
+		 * Checking the high key like this works out more often than you might
+		 * think.  Leaf page splits pick a split point between the two most
+		 * dissimilar tuples (this is weighed against the need to evenly share
+		 * free space).  Leaf pages with high key attribute values that can
+		 * only appear on non-pivot tuples on the right sibling page are
+		 * common.
+		 */
+		if (continuescan && !P_RIGHTMOST(opaque))
+			_bt_checkkeys(scan, page, P_HIKEY, dir, &continuescan);
+
+		if (!continuescan)
+			so->currPos.moreRight = false;
+
 		Assert(itemIndex <= MaxIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 146de1b2e4..7c795c6bb6 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -48,7 +48,7 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
 static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
-					 IndexTuple tuple, TupleDesc tupdesc,
+					 IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
 static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
 			   IndexTuple firstright, BTScanInsert itup_key);
@@ -1362,11 +1362,14 @@ _bt_mark_scankey_required(ScanKey skey)
  *
  * scan: index scan descriptor (containing a search-type scankey)
  * page: buffer page containing index tuple
- * offnum: offset number of index tuple (must be a valid item!)
+ * offnum: offset number of index tuple (must be hikey or a valid item!)
  * dir: direction we are scanning in
  * continuescan: output parameter (will be set correctly in all cases)
  *
- * Caller must hold pin and lock on the index page.
+ * Caller must hold pin and lock on the index page.  Caller can pass a high
+ * key offnum in the hopes of discovering that the scan need not continue on
+ * to a page to the right.  We don't currently bother limiting high key
+ * comparisons to SK_BT_REQFWD scan keys.
  */
 IndexTuple
 _bt_checkkeys(IndexScanDesc scan,
@@ -1376,6 +1379,7 @@ _bt_checkkeys(IndexScanDesc scan,
 	ItemId		iid = PageGetItemId(page, offnum);
 	bool		tuple_alive;
 	IndexTuple	tuple;
+	int			tupnatts;
 	TupleDesc	tupdesc;
 	BTScanOpaque so;
 	int			keysz;
@@ -1389,24 +1393,21 @@ _bt_checkkeys(IndexScanDesc scan,
 	 * killed tuple as not passing the qual.  Most of the time, it's a win to
 	 * not bother examining the tuple's index keys, but just return
 	 * immediately with continuescan = true to proceed to the next tuple.
-	 * However, if this is the last tuple on the page, we should check the
-	 * index keys to prevent uselessly advancing to the next page.
+	 * However, if this is the first tuple on the page, and we're doing a
+	 * backward scan, we should check the index keys to prevent uselessly
+	 * advancing to the page to the left.  This is similar to the high key
+	 * optimization used by forward scan callers.
 	 */
 	if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
 	{
-		/* return immediately if there are more tuples on the page */
-		if (ScanDirectionIsForward(dir))
-		{
-			if (offnum < PageGetMaxOffsetNumber(page))
-				return NULL;
-		}
-		else
-		{
-			BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-			if (offnum > P_FIRSTDATAKEY(opaque))
-				return NULL;
-		}
+		/* forward scan callers check high key instead */
+		Assert(offnum >= P_FIRSTDATAKEY(opaque));
+		if (ScanDirectionIsForward(dir))
+			return NULL;
+		else if (offnum > P_FIRSTDATAKEY(opaque))
+			return NULL;
 
 		/*
 		 * OK, we want to check the keys so we can set continuescan correctly,
@@ -1418,6 +1419,7 @@ _bt_checkkeys(IndexScanDesc scan,
 		tuple_alive = true;
 
 	tuple = (IndexTuple) PageGetItem(page, iid);
+	tupnatts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
 
 	tupdesc = RelationGetDescr(scan->indexRelation);
 	so = (BTScanOpaque) scan->opaque;
@@ -1429,11 +1431,24 @@ _bt_checkkeys(IndexScanDesc scan,
 		bool		isNull;
 		Datum		test;
 
-		Assert(key->sk_attno <= BTreeTupleGetNAtts(tuple, scan->indexRelation));
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (key->sk_attno > tupnatts)
+		{
+			Assert(offnum == P_HIKEY);
+			Assert(ScanDirectionIsForward(dir));
+			continue;
+		}
+
 		/* row-comparison keys need special processing */
 		if (key->sk_flags & SK_ROW_HEADER)
 		{
-			if (_bt_check_rowcompare(key, tuple, tupdesc, dir, continuescan))
+			if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+									 continuescan))
 				continue;
 			return NULL;
 		}
@@ -1564,8 +1579,8 @@ _bt_checkkeys(IndexScanDesc scan,
  * This is a subroutine for _bt_checkkeys, which see for more info.
  */
 static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
-					 ScanDirection dir, bool *continuescan)
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
 {
 	ScanKey		subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
 	int32		cmpresult = 0;
@@ -1582,6 +1597,19 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
 
 		Assert(subkey->sk_flags & SK_ROW_MEMBER);
 
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (subkey->sk_attno > tupnatts)
+		{
+			Assert(ScanDirectionIsForward(dir));
+			cmpresult = 0;
+			continue;
+		}
+
 		datum = index_getattr(tuple,
 							  subkey->sk_attno,
 							  tupdesc,
-- 
2.17.1

Attachment: v12-0003-Pick-nbtree-split-points-discerningly.patch (application/x-patch)
From a2ec7b69d41012f191c374de2c20639c99d8a00f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 15:51:53 -0700
Subject: [PATCH v12 3/7] Pick nbtree split points discerningly.

Add infrastructure to weigh how effective suffix truncation will be when
choosing a split point.  This should not noticeably affect the balance
of free space within each half of the split, while still making suffix
truncation truncate away significantly more attributes on average.

The logic for choosing a split point is also taught to care about the
case where there are many duplicates, making it hard to find a
distinguishing split point.  It may even conclude that the page being
split is already full of logical duplicates, in which case it packs the
left half very tightly, while leaving the right half mostly empty.  Our
assumption is that logical duplicates will almost always be inserted in
ascending heap TID order.  This strategy leaves most of the free space
on the half of the split that will likely be where future logical
duplicates of the same value need to be placed.

The number of cycles added is not very noticeable.  This is important
because deciding on a split point takes place while at least one
exclusive buffer lock is held.  We avoid using authoritative insertion
scankey comparisons to save cycles, unlike suffix truncation proper.  We
use a faster binary comparison instead.

Note that even pre-pg_upgrade'd v3 indexes make use of these
optimizations.
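
To give a feel for the "penalty" that gets minimized at the leaf level: it
boils down to the number of leading key attributes shared by the two tuples
that enclose a candidate split point, plus one.  Here is a rough sketch,
with plain integers standing in for arbitrary datums and invented names
(the real routine, _bt_keep_natts_fast(), uses cheap binary comparisons of
the datums rather than authoritative opclass comparisons):

int
sketch_split_penalty(const int *lastleft, const int *firstright, int nkeyatts)
{
	int			keepnatts = 1;

	for (int attnum = 0; attnum < nkeyatts; attnum++)
	{
		if (lastleft[attnum] != firstright[attnum])
			break;				/* first distinguishing attribute found */
		keepnatts++;
	}

	/* nkeyatts + 1 means even the heap TID would be needed in the new pivot */
	return keepnatts;
}

A lower penalty means that more trailing attributes can be truncated away
from the new high key, so among candidate split points with an acceptable
free space delta, the one with the lowest penalty wins.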
---
 src/backend/access/nbtree/Makefile      |   2 +-
 src/backend/access/nbtree/README        |  47 +-
 src/backend/access/nbtree/nbtinsert.c   | 299 +--------
 src/backend/access/nbtree/nbtsplitloc.c | 855 ++++++++++++++++++++++++
 src/backend/access/nbtree/nbtutils.c    |  49 ++
 src/include/access/nbtree.h             |  15 +-
 6 files changed, 973 insertions(+), 294 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtsplitloc.c

diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bbb21d235c..9aab9cf64a 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = nbtcompare.o nbtinsert.o nbtpage.o nbtree.o nbtsearch.o \
-       nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
+       nbtsplitloc.o nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index be9bf61d47..cdd68b6f75 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -155,9 +155,9 @@ Lehman and Yao assume fixed-size keys, but we must deal with
 variable-size keys.  Therefore there is not a fixed maximum number of
 keys per page; we just stuff in as many as will fit.  When we split a
 page, we try to equalize the number of bytes, not items, assigned to
-each of the resulting pages.  Note we must include the incoming item in
-this calculation, otherwise it is possible to find that the incoming
-item doesn't fit on the split page where it needs to go!
+pages (though suffix truncation is also considered).  Note we must include
+the incoming item in this calculation, otherwise it is possible to find
+that the incoming item doesn't fit on the split page where it needs to go!
 
 The Deletion Algorithm
 ----------------------
@@ -657,6 +657,47 @@ variable-length types, such as text.  An opclass support function could
 manufacture the shortest possible key value that still correctly separates
 each half of a leaf page split.
 
+There are sophisticated criteria for choosing a leaf page split point.  The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split.  The
+choice of leaf split point can be thought of as a choice among points
+*between* items on the page to be split, at least if you pretend that the
+incoming tuple was placed on the page already (you have to pretend because
+there won't actually be enough space for it on the page).  Choosing the
+split point between two index tuples where the first non-equal attribute
+appears as early as possible results in truncating away as many suffix
+attributes as possible.  Evenly balancing space among each half of the
+split is usually the first concern, but even small adjustments in the
+precise split point can allow truncation to be far more effective.
+
+Suffix truncation is primarily valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective.  Even truncation that doesn't make pivot tuples
+smaller due to alignment still prevents pivot tuples from being more
+restrictive than truly necessary in how they describe which values belong
+on which pages.
+
+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page.  We choose the split point whose implied downlink (the
+tuple inserted into the parent) is the smallest one available within an
+acceptable range around the fillfactor-wise optimal split point.  This idea
+also comes from the Prefix B-Tree paper.  The process is similar to what
+happens at the leaf level to make suffix truncation effective.  The overall
+effect is that suffix truncation tends to produce smaller and less
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, less discriminating
+pivot tuples end up in the root page, delaying root page splits.
+
+Logical duplicates are given special consideration.  The logic for
+selecting a split point goes to great lengths to avoid having duplicates
+span more than one page, and almost always manages to pick a split point
+between two user-key-distinct tuples, accepting a completely lopsided split
+if it must.  When a page that's already full of duplicates must be split,
+the fallback strategy assumes that duplicates are mostly inserted in
+ascending heap TID order.  The page is split in a way that leaves the left
+half of the page mostly full, and the right half of the page mostly empty.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 7d481f0ff2..a444619091 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,26 +28,6 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
-typedef struct
-{
-	/* context data for _bt_checksplitloc */
-	Size		newitemsz;		/* size of new item to be inserted */
-	int			fillfactor;		/* needed when splitting rightmost page */
-	bool		is_leaf;		/* T if splitting a leaf page */
-	bool		is_rightmost;	/* T if splitting a rightmost page */
-	OffsetNumber newitemoff;	/* where the new item is to be inserted */
-	int			leftspace;		/* space available for items on left page */
-	int			rightspace;		/* space available for items on right page */
-	int			olddataitemstotal;	/* space taken by old items */
-
-	bool		have_split;		/* found a valid split? */
-
-	/* these fields valid only if have_split is true */
-	bool		newitemonleft;	/* new item on left or right of best split */
-	OffsetNumber firstright;	/* best split point */
-	int			best_delta;		/* best size delta so far */
-} FindSplitData;
-
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -76,13 +56,6 @@ static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
-static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft);
-static void _bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright, bool newitemonleft,
-				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key,
@@ -324,7 +297,9 @@ top:
  * Sets state in itup_key sufficient for later _bt_findinsertloc() call to
  * reuse most of the work of our initial binary search to find conflicting
  * tuples.  This won't be usable if caller's tuple is determined to not belong
- * on buf following scantid being filled-in.
+ * on buf following scantid being filled-in, but that should be very rare in
+ * practice, since the logic for choosing a leaf split point works hard to
+ * avoid splitting within a group of duplicates.
  *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
@@ -913,8 +888,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
  *
  *		This recursive procedure does the following things:
  *
- *			+  if necessary, splits the target page (making sure that the
- *			   split is equitable as far as post-insert free space goes).
+ *			+  if necessary, splits the target page.
  *			+  inserts the tuple.
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
@@ -1009,7 +983,7 @@ _bt_insertonpg(Relation rel,
 
 		/* Choose the split point */
 		firstright = _bt_findsplitloc(rel, page,
-									  newitemoff, itemsz,
+									  newitemoff, itemsz, itup,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
@@ -1345,6 +1319,11 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * to go into the new right page, or possibly a truncated version if this
 	 * is a leaf page split.  This might be either the existing data item at
 	 * position firstright, or the incoming tuple.
+	 *
+	 * Lehman and Yao use the last left item as the new high key for the left
+	 * page (on leaf level).  Despite appearances, the new high key is
+	 * generated in a way that's consistent with their approach.  See comments
+	 * above _bt_findsplitloc for an explanation.
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1684,264 +1663,6 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	return rbuf;
 }
 
-/*
- *	_bt_findsplitloc() -- find an appropriate place to split a page.
- *
- * The idea here is to equalize the free space that will be on each split
- * page, *after accounting for the inserted tuple*.  (If we fail to account
- * for it, we might find ourselves with too little room on the page that
- * it needs to go into!)
- *
- * If the page is the rightmost page on its level, we instead try to arrange
- * to leave the left split page fillfactor% full.  In this way, when we are
- * inserting successively increasing keys (consider sequences, timestamps,
- * etc) we will end up with a tree whose pages are about fillfactor% full,
- * instead of the 50% full result that we'd get without this special case.
- * This is the same as nbtsort.c produces for a newly-created tree.  Note
- * that leaf and nonleaf pages use different fillfactors.
- *
- * We are passed the intended insert position of the new tuple, expressed as
- * the offsetnumber of the tuple it must go in front of.  (This could be
- * maxoff+1 if the tuple is to go at the end.)
- *
- * We return the index of the first existing tuple that should go on the
- * righthand page, plus a boolean indicating whether the new tuple goes on
- * the left or right page.  The bool is necessary to disambiguate the case
- * where firstright == newitemoff.
- */
-static OffsetNumber
-_bt_findsplitloc(Relation rel,
-				 Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft)
-{
-	BTPageOpaque opaque;
-	OffsetNumber offnum;
-	OffsetNumber maxoff;
-	ItemId		itemid;
-	FindSplitData state;
-	int			leftspace,
-				rightspace,
-				goodenough,
-				olddataitemstotal,
-				olddataitemstoleft;
-	bool		goodenoughfound;
-
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
-	newitemsz += sizeof(ItemIdData);
-
-	/* Total free space available on a btree page, after fixed overhead */
-	leftspace = rightspace =
-		PageGetPageSize(page) - SizeOfPageHeaderData -
-		MAXALIGN(sizeof(BTPageOpaqueData));
-
-	/* The right page will have the same high key as the old page */
-	if (!P_RIGHTMOST(opaque))
-	{
-		itemid = PageGetItemId(page, P_HIKEY);
-		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
-							 sizeof(ItemIdData));
-	}
-
-	/* Count up total space in data items without actually scanning 'em */
-	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
-
-	state.newitemsz = newitemsz;
-	state.is_leaf = P_ISLEAF(opaque);
-	state.is_rightmost = P_RIGHTMOST(opaque);
-	state.have_split = false;
-	if (state.is_leaf)
-		state.fillfactor = RelationGetFillFactor(rel,
-												 BTREE_DEFAULT_FILLFACTOR);
-	else
-		state.fillfactor = BTREE_NONLEAF_FILLFACTOR;
-	state.newitemonleft = false;	/* these just to keep compiler quiet */
-	state.firstright = 0;
-	state.best_delta = 0;
-	state.leftspace = leftspace;
-	state.rightspace = rightspace;
-	state.olddataitemstotal = olddataitemstotal;
-	state.newitemoff = newitemoff;
-
-	/*
-	 * Finding the best possible split would require checking all the possible
-	 * split points, because of the high-key and left-key special cases.
-	 * That's probably more work than it's worth; instead, stop as soon as we
-	 * find a "good-enough" split, where good-enough is defined as an
-	 * imbalance in free space of no more than pagesize/16 (arbitrary...) This
-	 * should let us stop near the middle on most pages, instead of plowing to
-	 * the end.
-	 */
-	goodenough = leftspace / 16;
-
-	/*
-	 * Scan through the data items and calculate space usage for a split at
-	 * each possible position.
-	 */
-	olddataitemstoleft = 0;
-	goodenoughfound = false;
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	for (offnum = P_FIRSTDATAKEY(opaque);
-		 offnum <= maxoff;
-		 offnum = OffsetNumberNext(offnum))
-	{
-		Size		itemsz;
-
-		itemid = PageGetItemId(page, offnum);
-		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
-
-		/*
-		 * Will the new item go to left or right of split?
-		 */
-		if (offnum > newitemoff)
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-		else if (offnum < newitemoff)
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		else
-		{
-			/* need to try it both ways! */
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		}
-
-		/* Abort scan once we find a good-enough choice */
-		if (state.have_split && state.best_delta <= goodenough)
-		{
-			goodenoughfound = true;
-			break;
-		}
-
-		olddataitemstoleft += itemsz;
-	}
-
-	/*
-	 * If the new item goes as the last item, check for splitting so that all
-	 * the old items go to the left page and the new item goes to the right
-	 * page.
-	 */
-	if (newitemoff > maxoff && !goodenoughfound)
-		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
-
-	/*
-	 * I believe it is not possible to fail to find a feasible split, but just
-	 * in case ...
-	 */
-	if (!state.have_split)
-		elog(ERROR, "could not find a feasible split point for index \"%s\"",
-			 RelationGetRelationName(rel));
-
-	*newitemonleft = state.newitemonleft;
-	return state.firstright;
-}
-
-/*
- * Subroutine to analyze a particular possible split choice (ie, firstright
- * and newitemonleft settings), and record the best split so far in *state.
- *
- * firstoldonright is the offset of the first item on the original page
- * that goes to the right page, and firstoldonrightsz is the size of that
- * tuple. firstoldonright can be > max offset, which means that all the old
- * items go to the left page and only the new item goes to the right page.
- * In that case, firstoldonrightsz is not used.
- *
- * olddataitemstoleft is the total size of all old items to the left of
- * firstoldonright.
- */
-static void
-_bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright,
-				  bool newitemonleft,
-				  int olddataitemstoleft,
-				  Size firstoldonrightsz)
-{
-	int			leftfree,
-				rightfree;
-	Size		firstrightitemsz;
-	bool		newitemisfirstonright;
-
-	/* Is the new item going to be the first item on the right page? */
-	newitemisfirstonright = (firstoldonright == state->newitemoff
-							 && !newitemonleft);
-
-	if (newitemisfirstonright)
-		firstrightitemsz = state->newitemsz;
-	else
-		firstrightitemsz = firstoldonrightsz;
-
-	/* Account for all the old tuples */
-	leftfree = state->leftspace - olddataitemstoleft;
-	rightfree = state->rightspace -
-		(state->olddataitemstotal - olddataitemstoleft);
-
-	/*
-	 * The first item on the right page becomes the high key of the left page;
-	 * therefore it counts against left space as well as right space. When
-	 * index has included attributes, then those attributes of left page high
-	 * key will be truncated leaving that page with slightly more free space.
-	 * However, that shouldn't affect our ability to find valid split
-	 * location, because anyway split location should exists even without high
-	 * key truncation.
-	 */
-	leftfree -= firstrightitemsz;
-
-	/* account for the new item */
-	if (newitemonleft)
-		leftfree -= (int) state->newitemsz;
-	else
-		rightfree -= (int) state->newitemsz;
-
-	/*
-	 * If we are not on the leaf level, we will be able to discard the key
-	 * data from the first item that winds up on the right page.
-	 */
-	if (!state->is_leaf)
-		rightfree += (int) firstrightitemsz -
-			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
-
-	/*
-	 * If feasible split point, remember best delta.
-	 */
-	if (leftfree >= 0 && rightfree >= 0)
-	{
-		int			delta;
-
-		if (state->is_rightmost)
-		{
-			/*
-			 * If splitting a rightmost page, try to put (100-fillfactor)% of
-			 * free space on left page. See comments for _bt_findsplitloc.
-			 */
-			delta = (state->fillfactor * leftfree)
-				- ((100 - state->fillfactor) * rightfree);
-		}
-		else
-		{
-			/* Otherwise, aim for equal free space on both sides */
-			delta = leftfree - rightfree;
-		}
-
-		if (delta < 0)
-			delta = -delta;
-		if (!state->have_split || delta < state->best_delta)
-		{
-			state->have_split = true;
-			state->newitemonleft = newitemonleft;
-			state->firstright = firstoldonright;
-			state->best_delta = delta;
-		}
-	}
-}
-
 /*
  * _bt_insert_parent() -- Insert downlink into parent after a page split.
  *
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
new file mode 100644
index 0000000000..015591eb87
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -0,0 +1,855 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsplitloc.c
+ *	  Choose split point code for Postgres btree implementation.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtsplitloc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "storage/lmgr.h"
+
+/* _bt_dofindsplitloc limits on suffix truncation split interval */
+#define MAX_LEAF_SPLIT_POINTS		9
+#define MAX_INTERNAL_SPLIT_POINTS	3
+
+typedef enum
+{
+	/* strategy to use for a call to FindSplitData */
+	SPLIT_DEFAULT,				/* give some weight to truncation */
+	SPLIT_MANY_DUPLICATES,		/* find minimally distinguishing point */
+	SPLIT_SINGLE_VALUE			/* leave left page almost full */
+} SplitMode;
+
+typedef struct
+{
+	/* FindSplitData candidate split */
+	int			delta;			/* size delta */
+	bool		newitemonleft;	/* new item on left or right of split */
+	OffsetNumber firstright;	/* split point */
+} SplitPoint;
+
+typedef struct
+{
+	/* context data for _bt_checksplitloc */
+	Size		newitemsz;		/* size of new item to be inserted */
+	double		propfullonleft; /* want propfullonleft * leftfree on left */
+	int			gooddelta;		/* "good" left/right space delta cut-off */
+	bool		is_leaf;		/* T if splitting a leaf page */
+	bool		is_weighted;	/* T if propfullonleft used by split */
+	OffsetNumber newitemoff;	/* where the new item is to be inserted */
+	int			leftspace;		/* space available for items on left page */
+	int			rightspace;		/* space available for items on right page */
+	int			olddataitemstotal;	/* space taken by old items */
+
+	int			maxsplits;		/* Maximum number of splits */
+	int			nsplits;		/* Current number of splits */
+	SplitPoint *splits;			/* Sorted by delta */
+} FindSplitData;
+
+static OffsetNumber _bt_dofindsplitloc(Relation rel, Page page,
+				   SplitMode mode, OffsetNumber newitemoff,
+				   Size newitemsz, IndexTuple newitem, bool *newitemonleft);
+static int _bt_checksplitloc(FindSplitData *state,
+				  OffsetNumber firstoldonright, bool newitemonleft,
+				  int dataitemstoleft, Size firstoldonrightsz);
+static OffsetNumber _bt_bestsplitloc(Relation rel, Page page,
+				 FindSplitData *state,
+				 int perfectpenalty,
+				 OffsetNumber newitemoff,
+				 IndexTuple newitem, bool *newitemonleft);
+static int _bt_perfect_penalty(Relation rel, Page page, SplitMode mode,
+					FindSplitData *state, OffsetNumber newitemoff,
+					IndexTuple newitem, SplitMode *secondmode);
+static int _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split, bool is_leaf);
+
+
+/*
+ *	_bt_findsplitloc() -- find an appropriate place to split a page.
+ *
+ * The main goal here is to equalize the free space that will be on each
+ * split page, *after accounting for the inserted tuple*.  (If we fail to
+ * account for it, we might find ourselves with too little room on the page
+ * that it needs to go into!)
+ *
+ * We are passed the intended insert position of the new tuple, expressed as
+ * the offsetnumber of the tuple it must go in front of (this could be
+ * maxoff+1 if the tuple is to go at the end).  The new tuple itself is also
+ * passed, since it's needed to give some weight to how effective suffix
+ * truncation will be.  The implementation picks the split point that
+ * maximizes the effectiveness of suffix truncation from a small list of
+ * alternative candidate split points that leave each side of the split with
+ * about the same share of free space.  Suffix truncation is secondary to
+ * equalizing free space, except in cases with large numbers of duplicates.
+ * Note that it is always assumed that caller goes on to perform truncation,
+ * even with pg_upgrade'd indexes where that isn't actually the case
+ * (!heapkeyspace indexes).
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating whether the new tuple goes on
+ * the left or right page.  The bool is necessary to disambiguate the case
+ * where firstright == newitemoff.
+ *
+ * The high key for the left page is formed using the first item on the
+ * right page, which may seem to be contrary to Lehman & Yao's approach of
+ * using the left page's last item as its new high key on the leaf level.
+ * It isn't, though: suffix truncation will leave the left page's high key
+ * fully equal to the last item on the left page when two tuples with equal
+ * key values (excluding heap TID) enclose the split point.  It isn't
+ * necessary for a new leaf high key to be equal to the last item on the
+ * left for the L&Y "subtree" invariant to hold.  It's sufficient to make
+ * sure that the new leaf high key is strictly less than the first item on
+ * the right leaf page, and greater than the last item on the left page.
+ * When suffix truncation isn't possible, L&Y's exact approach to leaf
+ * splits is taken (actually, a tuple with all the keys from firstright but
+ * the heap TID from lastleft is formed, so as to not introduce a special
+ * case).
+ *
+ * Starting with the first right item minimizes the divergence between leaf
+ * and internal splits when checking if a candidate split point is legal.
+ * It is also inherently necessary for suffix truncation, since truncation
+ * is a subtractive process that specifically requires lastleft and
+ * firstright inputs.
+ */
+OffsetNumber
+_bt_findsplitloc(Relation rel,
+				 Page page,
+				 OffsetNumber newitemoff,
+				 Size newitemsz,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	/* Initial call always uses SPLIT_DEFAULT */
+	return _bt_dofindsplitloc(rel, page, SPLIT_DEFAULT, newitemoff, newitemsz,
+							  newitem, newitemonleft);
+}
+
+/*
+ *	_bt_dofindsplitloc() -- guts of find split location code.
+ *
+ * We're always initially called in default mode, which is primarily
+ * concerned with equalizing available free space in each half of the split.
+ * However, a recursive invocation of _bt_dofindsplitloc() will follow in
+ * cases with a large number of duplicates around the space-optimal split
+ * point.
+ *
+ * We give some weight to suffix truncation in deciding a split point
+ * on leaf pages.  We try to select a point where a distinguishing attribute
+ * appears earlier in the new high key for the left side of the split, in
+ * order to maximize the number of trailing attributes that can be truncated
+ * away.  Initially, only candidate split points that imply an acceptable
+ * balance of free space on each side are considered.  This is even useful
+ * with pages that only have a single (non-TID) attribute, since it's
+ * helpful to avoid appending an explicit heap TID attribute to the new
+ * pivot tuple (high key/downlink) when it cannot actually be truncated.
+ *
+ * We do all we can to avoid having to append a heap TID in the new high
+ * key.  We may have to call ourselves recursively in many duplicates mode.
+ * This happens when a heap TID would otherwise be appended, but the page
+ * isn't completely full of logical duplicates (there may be a few as two
+ * distinct values).  Many duplicates mode has no hard requirements for
+ * space utilization, though it still keeps the use of space balanced as a
+ * non-binding secondary goal.  This significantly improves fan-out in
+ * practice, at least with most affected workloads.
+ *
+ * If the page is the rightmost page on its level, we instead try to arrange
+ * to leave the left split page fillfactor% full.  In this way, when we are
+ * inserting successively increasing keys (consider sequences, timestamps,
+ * etc) we will end up with a tree whose pages are about fillfactor% full,
+ * instead of the 50% full result that we'd get without this special case.
+ * This is the same as nbtsort.c produces for a newly-created tree.  Note
+ * that leaf and nonleaf pages use different fillfactors.
+ *
+ * When called recursively in single value mode we try to arrange to leave
+ * the left split page even more full than in the fillfactor% rightmost page
+ * case.  This maximizes space utilization in cases where tuples with the
+ * same attribute values span across many pages.  Newly inserted duplicates
+ * will tend to have higher heap TID values, so we'll end up splitting to
+ * the right in the manner of ascending insertions of monotonically
+ * increasing values.  See nbtree/README for more information about suffix
+ * truncation, and how a split point is chosen.
+ */
+static OffsetNumber
+_bt_dofindsplitloc(Relation rel,
+				   Page page,
+				   SplitMode mode,
+				   OffsetNumber newitemoff,
+				   Size newitemsz,
+				   IndexTuple newitem,
+				   bool *newitemonleft)
+{
+	BTPageOpaque opaque;
+	OffsetNumber offnum;
+	OffsetNumber maxoff;
+	ItemId		itemid;
+	FindSplitData state;
+	int			leftspace,
+				rightspace,
+				olddataitemstotal,
+				olddataitemstoleft,
+				perfectpenalty,
+				leaffillfactor;
+	bool		gooddeltafound;
+	SplitPoint	splits[MAX_LEAF_SPLIT_POINTS];
+	SplitMode	secondmode;
+	OffsetNumber finalfirstright;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Total free space available on a btree page, after fixed overhead */
+	leftspace = rightspace =
+		PageGetPageSize(page) - SizeOfPageHeaderData -
+		MAXALIGN(sizeof(BTPageOpaqueData));
+
+	/* The right page will have the same high key as the old page */
+	if (!P_RIGHTMOST(opaque))
+	{
+		itemid = PageGetItemId(page, P_HIKEY);
+		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
+							 sizeof(ItemIdData));
+	}
+
+	/* Count up total space in data items without actually scanning 'em */
+	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
+	leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	state.newitemsz = newitemsz + sizeof(ItemIdData);
+	state.is_leaf = P_ISLEAF(opaque);
+	state.leftspace = leftspace;
+	state.rightspace = rightspace;
+	state.olddataitemstotal = olddataitemstotal;
+	state.newitemoff = newitemoff;
+	state.splits = splits;
+	state.nsplits = 0;
+	if (!state.is_leaf)
+	{
+		Assert(mode == SPLIT_DEFAULT);
+
+		/* propfullonleft only used on rightmost page */
+		state.propfullonleft = BTREE_NONLEAF_FILLFACTOR / 100.0;
+		state.is_weighted = P_RIGHTMOST(opaque);
+		/* See is_leaf default mode remarks on maxsplits */
+		state.maxsplits = MAX_INTERNAL_SPLIT_POINTS;
+	}
+	else if (mode == SPLIT_DEFAULT)
+	{
+		if (P_RIGHTMOST(opaque))
+		{
+			/*
+			 * Rightmost page splits are always weighted.  Extreme contention
+			 * on the rightmost page is relatively common, so we treat it as a
+			 * special case.
+			 */
+			state.propfullonleft = leaffillfactor / 100.0;
+			state.is_weighted = true;
+		}
+		else
+		{
+			/* propfullonleft won't be used, but be tidy */
+			state.propfullonleft = 0.50;
+			state.is_weighted = false;
+		}
+
+		/*
+		 * Set an initial limit on the split interval/number of candidate
+		 * split points as appropriate.  The "Prefix B-Trees" paper refers to
+		 * this as sigma l for leaf splits and sigma b for internal ("branch")
+		 * splits.  It's hard to provide a theoretical justification for the
+		 * size of the split interval, though it's clear that a small split
+		 * interval improves space utilization.
+		 */
+		state.maxsplits = Min(Max(3, maxoff * 0.05), MAX_LEAF_SPLIT_POINTS);
+	}
+	else if (mode == SPLIT_MANY_DUPLICATES)
+	{
+		state.propfullonleft = leaffillfactor / 100.0;
+		state.is_weighted = P_RIGHTMOST(opaque);
+		state.maxsplits = maxoff + 2;
+		state.splits = palloc(sizeof(SplitPoint) * state.maxsplits);
+	}
+	else
+	{
+		Assert(mode == SPLIT_SINGLE_VALUE);
+
+		state.propfullonleft = BTREE_SINGLEVAL_FILLFACTOR / 100.0;
+		state.is_weighted = true;
+		state.maxsplits = 1;
+	}
+
+	/*
+	 * Finding the best possible split would require checking all the possible
+	 * split points, because of the high-key and left-key special cases.
+	 * That's probably more work than it's worth outside of many duplicates
+	 * mode; instead, stop as soon as we get past the "good" splits, where good
+	 * is defined as an imbalance in free space of no more than pagesize/16
+	 * (arbitrary...).  This should let us stop just past the middle on most
+	 * pages, instead of plowing to the end.  Many duplicates mode must
+	 * consider all possible choices, and so does not use this threshold for
+	 * anything (every delta is sufficiently good to be considered by many
+	 * duplicates mode).
+	 *
+	 * Note: Weighted candidate splits have weighted delta values that make
+	 * more splits appear to be "good".  A weighted search with a
+	 * propfullonleft of 0.5 is not quite identical to unweighted case.  It
+	 * will have delta values for candidate split points that are half those
+	 * of the corresponding candidate splits points for an unweighted search
+	 * of the same page, and so will consider more split points before
+	 * determining that remaining splits are no good, and falling out of the
+	 * loop.  It's very likely that a weighted split will need to go to the
+	 * end of the page anyway, though.
+	 */
+	if (mode != SPLIT_MANY_DUPLICATES)
+		state.gooddelta = leftspace / 16;
+	else
+		state.gooddelta = INT_MAX;
+
+	/*
+	 * Scan through the data items and calculate space usage for a split at
+	 * each possible position
+	 */
+	olddataitemstoleft = 0;
+	gooddeltafound = false;
+
+	for (offnum = P_FIRSTDATAKEY(opaque);
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		Size		itemsz;
+		int			delta;
+
+		itemid = PageGetItemId(page, offnum);
+		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
+
+		/*
+		 * Will the new item go to left or right of split?
+		 */
+		if (offnum > newitemoff)
+			delta = _bt_checksplitloc(&state, offnum, true,
+									  olddataitemstoleft, itemsz);
+		else if (offnum < newitemoff)
+			delta = _bt_checksplitloc(&state, offnum, false,
+									  olddataitemstoleft, itemsz);
+		else
+		{
+			/* need to try it both ways! */
+			(void) _bt_checksplitloc(&state, offnum, true,
+									 olddataitemstoleft, itemsz);
+
+			delta = _bt_checksplitloc(&state, offnum, false,
+									  olddataitemstoleft, itemsz);
+		}
+
+		/* Record when good choice found */
+		if (state.nsplits > 0 && state.splits[0].delta <= state.gooddelta)
+			gooddeltafound = true;
+
+		/*
+		 * Abort scan once we've found at least one "good" choice, provided
+		 * we've reached the point where remaining candidates don't look good.
+		 */
+		if (gooddeltafound && delta > state.gooddelta)
+			break;
+
+		olddataitemstoleft += itemsz;
+	}
+
+	/*
+	 * If the new item goes as the last item, check for splitting so that all
+	 * the old items go to the left page and the new item goes to the right
+	 * page.
+	 */
+	if (newitemoff > maxoff && !gooddeltafound)
+		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
+
+	/*
+	 * I believe it is not possible to fail to find a feasible split, but just
+	 * in case ...
+	 */
+	if (state.nsplits == 0)
+		elog(ERROR, "could not find a feasible split point for index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	/*
+	 * Search among acceptable split points for the entry with the lowest
+	 * penalty.  See _bt_split_penalty() for the definition of penalty.  The
+	 * goal here is to choose a split point whose new high key is amenable to
+	 * being made smaller by suffix truncation, or is already small.
+	 *
+	 * First find lowest possible penalty among acceptable split points -- the
+	 * "perfect" penalty.  The perfect penalty often saves _bt_bestsplitloc()
+	 * additional work around calculating penalties.  This is also a
+	 * convenient point to determine if a second pass over page is required.
+	 */
+	perfectpenalty = _bt_perfect_penalty(rel, page, mode, &state, newitemoff,
+										 newitem, &secondmode);
+
+	/* Perform second pass over page when _bt_perfect_penalty() tells us to */
+	if (secondmode != SPLIT_DEFAULT)
+		return _bt_dofindsplitloc(rel, page, secondmode, newitemoff,
+								  newitemsz, newitem, newitemonleft);
+
+	/*
+	 * Search among acceptable split points for the entry that has the lowest
+	 * penalty, and thus maximizes fan-out.  Sets *newitemonleft for us.
+	 */
+	finalfirstright = _bt_bestsplitloc(rel, page, &state, perfectpenalty,
+									   newitemoff, newitem, newitemonleft);
+	/* Be tidy */
+	if (state.splits != splits)
+		pfree(state.splits);
+
+	return finalfirstright;
+}
+
+/*
+ * Subroutine to analyze a particular possible split choice (ie, firstright
+ * and newitemonleft settings), and record the best split so far in *state.
+ *
+ * firstoldonright is the offset of the first item on the original page
+ * that goes to the right page, and firstoldonrightsz is the size of that
+ * tuple. firstoldonright can be > max offset, which means that all the old
+ * items go to the left page and only the new item goes to the right page.
+ * In that case, firstoldonrightsz is not used.
+ *
+ * olddataitemstoleft is the total size of all old items to the left of
+ * firstoldonright.
+ *
+ * Returns delta between space that will be left free on left and right side
+ * of split.
+ */
+static int
+_bt_checksplitloc(FindSplitData *state,
+				  OffsetNumber firstoldonright,
+				  bool newitemonleft,
+				  int olddataitemstoleft,
+				  Size firstoldonrightsz)
+{
+	int			leftfree,
+				rightfree;
+	Size		firstrightitemsz;
+	bool		newitemisfirstonright;
+
+	/* Is the new item going to be the first item on the right page? */
+	newitemisfirstonright = (firstoldonright == state->newitemoff
+							 && !newitemonleft);
+
+	if (newitemisfirstonright)
+		firstrightitemsz = state->newitemsz;
+	else
+		firstrightitemsz = firstoldonrightsz;
+
+	/* Account for all the old tuples */
+	leftfree = state->leftspace - olddataitemstoleft;
+	rightfree = state->rightspace -
+		(state->olddataitemstotal - olddataitemstoleft);
+
+	/*
+	 * The first item on the right page becomes the high key of the left page;
+	 * therefore it counts against left space as well as right space (we
+	 * cannot assume that suffix truncation will make it any smaller).  When
+	 * index has included attributes, then those attributes of left page high
+	 * key will be truncated leaving that page with slightly more free space.
+	 * However, that shouldn't affect our ability to find valid split
+	 * location, since we err in the direction of being pessimistic about free
+	 * space on the left half.  Besides, even when suffix truncation of
+	 * non-TID attributes occurs, the new high key often won't even be a
+	 * single MAXALIGN() quantum smaller than the firstright tuple it's based
+	 * on.
+	 *
+	 * If we are on the leaf level, assume that suffix truncation cannot avoid
+	 * adding a heap TID to the left half's new high key when splitting at the
+	 * leaf level.  In practice the new high key will often be smaller and
+	 * will rarely be larger, but conservatively assume the worst case.
+	 */
+	if (state->is_leaf)
+		leftfree -= (int) (firstrightitemsz +
+						   MAXALIGN(sizeof(ItemPointerData)));
+	else
+		leftfree -= (int) firstrightitemsz;
+
+	/* account for the new item */
+	if (newitemonleft)
+		leftfree -= (int) state->newitemsz;
+	else
+		rightfree -= (int) state->newitemsz;
+
+	/*
+	 * If we are not on the leaf level, we will be able to discard the key
+	 * data from the first item that winds up on the right page.
+	 */
+	if (!state->is_leaf)
+		rightfree += (int) firstrightitemsz -
+			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
+
+	/*
+	 * If this is a feasible split point whose delta is lower than that of the
+	 * most marginal split point recorded so far, or we haven't yet run out of
+	 * space for split points, remember it.
+	 */
+	if (leftfree >= 0 && rightfree >= 0)
+	{
+		int			delta;
+
+		if (state->is_weighted)
+			delta = state->propfullonleft * leftfree -
+				(1.0 - state->propfullonleft) * rightfree;
+		else
+			delta = leftfree - rightfree;
+
+		if (delta < 0)
+			delta = -delta;
+
+		/*
+		 * Optimization: Don't recognize differences among marginal split
+		 * points that are unlikely to end up being used anyway
+		 */
+		if (delta > state->gooddelta)
+			delta = state->gooddelta + 1;
+		if (state->nsplits < state->maxsplits ||
+			delta < state->splits[state->nsplits - 1].delta)
+		{
+			SplitPoint	newsplit;
+			int			j;
+
+			newsplit.delta = delta;
+			newsplit.newitemonleft = newitemonleft;
+			newsplit.firstright = firstoldonright;
+
+			/*
+			 * Make space at the end of the state array for new candidate
+			 * split point if we haven't already reached the maximum number of
+			 * split points.
+			 */
+			if (state->nsplits < state->maxsplits)
+				state->nsplits++;
+
+			/*
+			 * Insert the new split point, keeping the array sorted by delta.
+			 * The final (displaced) item is either a still-uninitialized
+			 * garbage entry, or the most marginal real entry when we already
+			 * have as many split points as we're willing to consider.
+			 */
+			for (j = state->nsplits - 1;
+				 j > 0 && state->splits[j - 1].delta > newsplit.delta;
+				 j--)
+			{
+				state->splits[j] = state->splits[j - 1];
+			}
+			state->splits[j] = newsplit;
+		}
+
+		return delta;
+	}
+
+	return INT_MAX;
+}
+
+/*
+ * Subroutine to find the "best" split point among an array of acceptable
+ * candidate split points that split without there being an excessively high
+ * delta between the space left free on the left and right halves.  The "best"
+ * split point is the split point with the lowest penalty, which is an
+ * abstract idea whose definition varies depending on whether we're splitting
+ * at the leaf level, or an internal level.  See _bt_split_penalty() for the
+ * definition.
+ *
+ * "perfectpenalty" is assumed to be the lowest possible penalty among
+ * candidate split points.  This allows us to return early without wasting
+ * cycles on calculating the first differing attribute for all candidate
+ * splits when that clearly cannot improve our choice.  This optimization is
+ * important for several common cases, including insertion into a primary key
+ * index on an auto-incremented or monotonically increasing integer column.
+ *
+ * We return the offset of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating whether the new item goes to the
+ * left of the split point.
+ */
+static OffsetNumber
+_bt_bestsplitloc(Relation rel,
+				 Page page,
+				 FindSplitData *state,
+				 int perfectpenalty,
+				 OffsetNumber newitemoff,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	int			bestpenalty,
+				lowsplit;
+
+	/*
+	 * No point in calculating penalty when there's only one choice.  Note
+	 * that single value mode always has one choice.
+	 */
+	if (state->nsplits == 1)
+	{
+		*newitemonleft = state->splits[0].newitemonleft;
+		return state->splits[0].firstright;
+	}
+
+	bestpenalty = INT_MAX;
+	lowsplit = 0;
+	for (int i = lowsplit; i < state->nsplits; i++)
+	{
+		int			penalty;
+
+		penalty = _bt_split_penalty(rel, page, newitemoff, newitem,
+									state->splits + i, state->is_leaf);
+
+		if (penalty <= perfectpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+			break;
+		}
+
+		if (penalty < bestpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+		}
+	}
+
+	*newitemonleft = state->splits[lowsplit].newitemonleft;
+	return state->splits[lowsplit].firstright;
+}
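The early-exit logic above boils down to a few lines; a minimal standalone sketch (invented names, not part of the patch):

/*
 * Pick the lowest-penalty candidate from an array already sorted by delta,
 * returning as soon as a candidate reaches the known lower bound
 * ("perfect" penalty), since no later candidate can do better.
 */
static int
toy_pick_best(const int *penalties, int ncandidates, int perfectpenalty)
{
	int		best = 0;

	for (int i = 0; i < ncandidates; i++)
	{
		if (penalties[i] <= perfectpenalty)
			return i;			/* can't beat the lower bound */
		if (penalties[i] < penalties[best])
			best = i;
	}

	return best;
}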
+
+/*
+ * Subroutine to find the lowest possible penalty for any acceptable candidate
+ * split point.  This may be lower than any real penalty for any of the
+ * candidate split points, in which case the optimization is ineffective.
+ * Split penalties are discrete rather than continuous, so an
+ * actually-obtainable penalty is common.
+ *
+ * This is also a convenient point to decide whether to finish splitting the
+ * page using the default strategy, or to instead do a second pass over the
+ * page using a different strategy.  (This only happens with leaf pages.)
+ */
+static int
+_bt_perfect_penalty(Relation rel, Page page, SplitMode mode,
+					FindSplitData *state, OffsetNumber newitemoff,
+					IndexTuple newitem, SplitMode *secondmode)
+{
+	ItemId		itemid;
+	OffsetNumber center;
+	IndexTuple	leftmost,
+				rightmost;
+	int			perfectpenalty;
+	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+	/* Assume that a second pass over page won't be required for now */
+	*secondmode = SPLIT_DEFAULT;
+
+	/*
+	 * There is a much smaller number of candidate split points when
+	 * splitting an internal page, so we can afford to be exhaustive.  Only
+	 * give up when the pivot that will be inserted into the parent is as
+	 * small as possible.
+	 */
+	if (!state->is_leaf)
+		return MAXALIGN(sizeof(IndexTupleData) + 1);
+
+	/*
+	 * During a many duplicates pass over the page, we settle for a "perfect"
+	 * split point that merely avoids appending a heap TID to the new pivot.
+	 * Appending a heap TID is harmful enough to fan-out that it's worth
+	 * avoiding at all costs, but it doesn't make sense to go to those lengths
+	 * to also be able to truncate an extra, earlier attribute.
+	 *
+	 * Single value mode splits only occur when appending a heap TID was
+	 * already deemed necessary.  Don't waste any more cycles trying to avoid
+	 * that outcome.
+	 */
+	if (mode == SPLIT_MANY_DUPLICATES)
+		return indnkeyatts;
+	else if (mode == SPLIT_SINGLE_VALUE)
+		return indnkeyatts + 1;
+
+	/*
+	 * Complicated though common case -- leaf page default mode split.
+	 *
+	 * Iterate from the end of the split array to the start, in search of the
+	 * firstright-wise leftmost and rightmost entries among acceptable split
+	 * points.  The split point with the lowest delta is at the start of the
+	 * array.  It is deemed to be the split point whose firstright offset is
+	 * at the center.  Split points with firstright offsets at both the left
+	 * and right extremes among acceptable split points will be found at the
+	 * end of caller's array.
+	 */
+	leftmost = NULL;
+	rightmost = NULL;
+	center = state->splits[0].firstright;
+
+	/*
+	 * Leaf split points can be thought of as points _between_ tuples on the
+	 * original unsplit page image, at least if you pretend that the incoming
+	 * tuple is already on the page to be split (imagine that the original
+	 * unsplit page actually had enough space to fit the incoming tuple).  The
+	 * rightmost tuple is the tuple that is immediately to the right of a
+	 * split point that is itself rightmost.  Likewise, the leftmost tuple is
+	 * the tuple to the left of the leftmost split point.
+	 *
+	 * When there are very few candidates, no sensible comparison can be made
+	 * here, resulting in the caller selecting the lowest-delta/center split
+	 * point by default.  Typically, the leftmost and rightmost tuples will be
+	 * located almost immediately.
+	 */
+	perfectpenalty = indnkeyatts;
+	for (int j = state->nsplits - 1; j > 1; j--)
+	{
+		SplitPoint *split = state->splits + j;
+
+		if (!leftmost && split->firstright <= center)
+		{
+			if (split->newitemonleft && newitemoff == split->firstright)
+				leftmost = newitem;
+			else
+			{
+				itemid = PageGetItemId(page,
+									   OffsetNumberPrev(split->firstright));
+				leftmost = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+
+		if (!rightmost && split->firstright >= center)
+		{
+			if (!split->newitemonleft && newitemoff == split->firstright)
+				rightmost = newitem;
+			else
+			{
+				itemid = PageGetItemId(page, split->firstright);
+				rightmost = (IndexTuple) PageGetItem(page, itemid);
+			}
+		}
+
+		if (leftmost && rightmost)
+		{
+			Assert(leftmost != rightmost);
+			perfectpenalty = _bt_keep_natts_fast(rel, leftmost, rightmost);
+			break;
+		}
+	}
+
+	/*
+	 * Work out which type of second pass the caller should perform, if any,
+	 * when even their "perfect" penalty fails to avoid appending a heap TID
+	 * to the new pivot tuple.
+	 */
+	if (perfectpenalty > indnkeyatts)
+	{
+		BTPageOpaque opaque;
+		OffsetNumber maxoff;
+		int			origpagepenalty;
+
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If the page has many duplicates but is not entirely full of them, a
+		 * many duplicates mode pass will be performed.  If the page is
+		 * entirely full of duplicates and it appears that the duplicates have
+		 * been inserted in sequential order (i.e. heap TID order), a single
+		 * value mode pass will be performed.
+		 *
+		 * Deliberately ignore the new item here, since a split that leaves
+		 * only one item on either page is often deemed unworkable by
+		 * _bt_checksplitloc().
+		 */
+		itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+		leftmost = (IndexTuple) PageGetItem(page, itemid);
+		itemid = PageGetItemId(page, maxoff);
+		rightmost = (IndexTuple) PageGetItem(page, itemid);
+		origpagepenalty = _bt_keep_natts_fast(rel, leftmost, rightmost);
+
+		if (origpagepenalty <= indnkeyatts)
+			*secondmode = SPLIT_MANY_DUPLICATES;
+		else if (P_RIGHTMOST(opaque))
+			*secondmode = SPLIT_SINGLE_VALUE;
+		else
+		{
+			itemid = PageGetItemId(page, P_HIKEY);
+			if (ItemIdGetLength(itemid) !=
+				IndexTupleSize(newitem) + MAXALIGN(sizeof(ItemPointerData)))
+				*secondmode = SPLIT_SINGLE_VALUE;
+			else
+			{
+				IndexTuple	hikey;
+
+				hikey = (IndexTuple) PageGetItem(page, itemid);
+				origpagepenalty = _bt_keep_natts_fast(rel, hikey, newitem);
+				if (origpagepenalty <= indnkeyatts)
+					*secondmode = SPLIT_SINGLE_VALUE;
+			}
+		}
+
+		/*
+		 * Have the caller continue with the original default mode split when
+		 * the new item does not appear to be a duplicate that's being
+		 * inserted into the rightmost page that its duplicates can be found
+		 * on (the page located by a scan that omits scantid).  Evenly sharing
+		 * space between the two halves of the split avoids pathological
+		 * performance.
+		 *
+		 * Note that single value mode should generally still be used when
+		 * duplicate insertions have heap TIDs that are slightly out of order,
+		 * which is probably just due to concurrency.
+		 */
+	}
+
+	return perfectpenalty;
+}
+
+/*
+ * Subroutine to find penalty for caller's candidate split point.
+ *
+ * On leaf pages, the penalty is the attribute number that distinguishes each
+ * side of a split.  It's the last attribute that needs to be included in the
+ * new high key for the left page.  It can be greater than the number of key
+ * attributes in cases where a heap TID will need to be appended during
+ * truncation.
+ *
+ * On internal pages, the penalty is simply the size of the first item on the
+ * right half of the split (excluding ItemId overhead), which becomes the new
+ * high key for the left page.
+ */
+static int
+_bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split, bool is_leaf)
+{
+	ItemId		itemid;
+	IndexTuple	firstright;
+	IndexTuple	lastleft;
+
+	if (!split->newitemonleft && newitemoff == split->firstright)
+		firstright = newitem;
+	else
+	{
+		itemid = PageGetItemId(page, split->firstright);
+		firstright = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	if (!is_leaf)
+		return IndexTupleSize(firstright);
+
+	if (split->newitemonleft && newitemoff == split->firstright)
+		lastleft = newitem;
+	else
+	{
+		OffsetNumber lastleftoff;
+
+		lastleftoff = OffsetNumberPrev(split->firstright);
+		itemid = PageGetItemId(page, lastleftoff);
+		lastleft = (IndexTuple) PageGetItem(page, itemid);
+	}
+
+	Assert(lastleft != firstright);
+	return _bt_keep_natts_fast(rel, lastleft, firstright);
+}
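For example (a hypothetical two-attribute leaf page, not taken from the patch): splitting between ('apple', 5) and ('apple', 9) has penalty 2, splitting between ('apple', 9) and ('banana', 1) has penalty 1, and splitting between two fully equal tuples has penalty nkeyatts + 1, meaning a heap TID must be appended.  A toy sketch of the leaf-level calculation over integer attribute arrays:

/*
 * Return the 1-based position of the first attribute that differs between
 * lastleft and firstright, or nkeyatts + 1 when the tuples are fully equal
 * (in which case only a heap TID can distinguish the two halves).
 */
static int
toy_leaf_penalty(const int *lastleft, const int *firstright, int nkeyatts)
{
	int		attnum = 1;

	while (attnum <= nkeyatts &&
		   lastleft[attnum - 1] == firstright[attnum - 1])
		attnum++;

	return attnum;
}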
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 15090b26d2..146de1b2e4 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -22,6 +22,7 @@
 #include "access/relscan.h"
 #include "miscadmin.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -2318,6 +2319,54 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	return keepnatts;
 }
 
+/*
+ * _bt_keep_natts_fast - fast, approximate variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location.  A naive bitwise approach to datum comparisons is used to
+ * save cycles.  This is inherently approximate, but usually provides the same
+ * answer as the authoritative approach that _bt_keep_natts takes, since the
+ * vast majority of types in Postgres cannot be equal according to any
+ * available opclass unless they're bitwise equal.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in the new pivot tuple.
+ */
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= keysz; attnum++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		att = TupleDescAttr(itupdesc, attnum - 1);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
+}
+
 /*
  *  _bt_check_natts() -- Verify tuple has expected number of attributes.
  *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index e7293bbaec..83298120b0 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -168,11 +168,15 @@ typedef struct BTMetaPageData
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
  * a rightmost page; when splitting non-rightmost pages we try to
- * divide the data equally.
+ * divide the data equally.  When splitting a page that's entirely
+ * filled with a single value (duplicates), the effective leaf-page
+ * fillfactor is 96%, regardless of whether the page is a rightmost
+ * page.
  */
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#define BTREE_SINGLEVAL_FILLFACTOR	96
 
 /*
  *	In general, the btree code tries to localize its knowledge about
@@ -681,6 +685,13 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 
+/*
+ * prototypes for functions in nbtsplitloc.c
+ */
+extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
+				 OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+				 bool *newitemonleft);
+
 /*
  * prototypes for functions in nbtpage.c
  */
@@ -747,6 +758,8 @@ extern bool btproperty(Oid index_oid, int attno,
 		   bool *res, bool *isnull);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 			 IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
+					IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 				OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
-- 
2.17.1

v12-0002-Treat-heap-TID-as-part-of-the-nbtree-key-space.patch (application/x-patch)
From f9ca88e28d2275397e00f5e67a7dc74e308f05fc Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v12 2/7] Treat heap TID as part of the nbtree key space.

Make nbtree treat all index tuples as having a heap TID trailing key
attribute.  Heap TID becomes a first class part of the key space on all
levels of the tree.  Index searches can distinguish duplicates by heap
TID.  Non-unique index insertions will descend straight to the leaf page
that they'll insert onto (unless there is a concurrent page split).
This general approach has numerous benefits for performance, and is
prerequisite to teaching VACUUM to perform "retail index tuple
deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tie-breaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing the newer/more strict invariants with
version 4 indexes.

A later patch will enhance the logic used by nbtree to pick a split
point.  This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at.  Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d item pointer in a new
high key during leaf page splits.  The user-facing definition of the
"1/3 of a page" restriction is already imprecise, and so does not need
to be revised.  However, there should be a compatibility note in the v12
release notes.  The new maximum allowed size is 2704 bytes on 64-bit
systems, down from 2712 bytes.
---
 contrib/amcheck/expected/check_btree.out     |   5 +-
 contrib/amcheck/sql/check_btree.sql          |   5 +-
 contrib/amcheck/verify_nbtree.c              | 340 +++++++++++++--
 contrib/pageinspect/btreefuncs.c             |   2 +-
 contrib/pageinspect/expected/btree.out       |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out |  10 +-
 doc/src/sgml/indices.sgml                    |  24 +-
 src/backend/access/common/indextuple.c       |   6 +-
 src/backend/access/nbtree/README             | 160 ++++---
 src/backend/access/nbtree/nbtinsert.c        | 308 +++++++------
 src/backend/access/nbtree/nbtpage.c          | 198 ++++++---
 src/backend/access/nbtree/nbtree.c           |   2 +-
 src/backend/access/nbtree/nbtsearch.c        | 103 ++++-
 src/backend/access/nbtree/nbtsort.c          |  91 ++--
 src/backend/access/nbtree/nbtutils.c         | 433 +++++++++++++++++--
 src/backend/access/nbtree/nbtxlog.c          |  43 +-
 src/backend/access/rmgrdesc/nbtdesc.c        |   8 -
 src/backend/utils/sort/tuplesort.c           |  13 +-
 src/include/access/nbtree.h                  | 164 +++++--
 src/include/access/nbtxlog.h                 |  20 +-
 src/test/regress/expected/btree_index.out    |  34 +-
 src/test/regress/expected/create_index.out   |  13 +-
 src/test/regress/expected/dependency.out     |   4 +-
 src/test/regress/expected/event_trigger.out  |   4 +-
 src/test/regress/expected/foreign_data.out   |   8 +-
 src/test/regress/expected/rowsecurity.out    |   4 +-
 src/test/regress/sql/btree_index.sql         |  37 +-
 src/test/regress/sql/create_index.sql        |  14 +-
 28 files changed, 1545 insertions(+), 510 deletions(-)

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index ef5c9e1a1c..1e6079ddd2 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -130,9 +130,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true);
  bt_index_parent_check 
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index 0ad1631476..3f1e0d17ef 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -82,9 +82,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true);
 
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 053ac9d192..a7d060b3ec 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -45,6 +45,8 @@ PG_MODULE_MAGIC;
  * block per level, which is bound by the range of BlockNumber:
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
 
 /*
  * State associated with verifying a B-Tree index
@@ -66,6 +68,8 @@ typedef struct BtreeCheckState
 	/* B-Tree Index Relation and associated heap relation */
 	Relation	rel;
 	Relation	heaprel;
+	/* rel is heapkeyspace index? */
+	bool		heapkeyspace;
 	/* ShareLock held on heap/index, rather than AccessShareLock? */
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
@@ -122,7 +126,7 @@ static void bt_index_check_internal(Oid indrelid, bool parentcheck,
 						bool heapallindexed);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -137,17 +141,22 @@ static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 						   IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
+static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
 					 BTScanInsert key,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 BTScanInsert key,
-					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   BTScanInsert key,
-							   Page nontarget,
-							   OffsetNumber upperbound);
+static inline bool invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							 BTScanInsert key,
+							 Page nontarget,
+							 OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline BTScanInsert bt_mkscankey_minusinfkey(Relation rel,
+													IndexTuple itup);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+							IndexTuple itup, bool nonpivot);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -204,6 +213,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	Oid			heapid;
 	Relation	indrel;
 	Relation	heaprel;
+	bool		heapkeyspace;
 	LOCKMODE	lockmode;
 
 	if (parentcheck)
@@ -254,7 +264,9 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	btree_index_checkable(indrel);
 
 	/* Check index, possibly against table it is an index on */
-	bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
+	heapkeyspace = _bt_heapkeyspace(indrel);
+	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
+						 heapallindexed);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -324,8 +336,8 @@ btree_index_checkable(Relation rel)
  * parent/child check cannot be affected.)
  */
 static void
-bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
-					 bool heapallindexed)
+bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
+					 bool readonly, bool heapallindexed)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -346,6 +358,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
 	state = palloc0(sizeof(BtreeCheckState));
 	state->rel = rel;
 	state->heaprel = heaprel;
+	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
 
@@ -806,7 +819,8 @@ bt_target_page_check(BtreeCheckState *state)
 	 * doesn't contain a high key, so nothing to check
 	 */
 	if (!P_RIGHTMOST(topaque) &&
-		!_bt_check_natts(state->rel, state->target, P_HIKEY))
+		!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+						 P_HIKEY))
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
@@ -839,6 +853,7 @@ bt_target_page_check(BtreeCheckState *state)
 		IndexTuple	itup;
 		size_t		tupsize;
 		BTScanInsert skey;
+		bool		lowersizelimit;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -865,7 +880,8 @@ bt_target_page_check(BtreeCheckState *state)
 					 errhint("This could be a torn page problem.")));
 
 		/* Check the number of index tuple attributes */
-		if (!_bt_check_natts(state->rel, state->target, offset))
+		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+							 offset))
 		{
 			char	   *itid,
 					   *htid;
@@ -906,7 +922,56 @@ bt_target_page_check(BtreeCheckState *state)
 			continue;
 
 		/* Build insertion scankey for current page offset */
-		skey = _bt_mkscankey(state->rel, itup);
+		skey = bt_mkscankey_minusinfkey(state->rel, itup);
+
+		/*
+		 * Make sure tuple size does not exceed the relevant BTREE_VERSION
+		 * specific limit.
+		 *
+		 * BTREE_VERSION 4 (which introduced heapkeyspace rules) requisitioned
+		 * a small amount of space from BTMaxItemSize() in order to ensure
+		 * that suffix truncation always has enough space to add an explicit
+		 * heap TID back to a tuple -- we pessimistically assume that every
+		 * newly inserted tuple will eventually need to have a heap TID
+		 * appended during a future leaf page split, when the tuple becomes
+		 * the basis of the new high key (pivot tuple) for the leaf page.
+		 *
+		 * Since the reclaimed space is reserved for that purpose, we must not
+		 * enforce the slightly lower limit when the extra space has been used
+		 * as intended.  In other words, there is only a cross-version
+		 * difference in the limit on tuple size within leaf pages.
+		 *
+		 * Still, we're particular about the details within BTREE_VERSION 4
+		 * internal pages.  Pivot tuples may only use the extra space for its
+		 * designated purpose.  Enforce the lower limit for pivot tuples when
+		 * an explicit heap TID isn't actually present. (In all other cases
+		 * suffix truncation is guaranteed to generate a pivot tuple that's no
+		 * larger than the first right tuple provided to it by its caller.)
+		 */
+		lowersizelimit = skey->heapkeyspace &&
+			(P_ISLEAF(topaque) || BTreeTupleGetHeapTID(itup) == NULL);
+		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
+					   BTMaxItemSizeNoHeapTid(state->target)))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
+							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index row size %zu exceeds maximum for index \"%s\"",
+							tupsize, RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap" : "index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -940,9 +1005,35 @@ bt_target_page_check(BtreeCheckState *state)
 		 * grandparents (as well as great-grandparents, and so on).  We don't
 		 * go to those lengths because that would be prohibitively expensive,
 		 * and probably not markedly more effective in practice.
+		 *
+		 * On the leaf level, we check that the key is <= the highkey.
+		 * However, on non-leaf levels we check that the key is < the highkey,
+		 * because the high key is "just another separator" rather than a copy
+		 * of some existing key item; we expect it to be unique among all keys
+		 * on the same level.  (Suffix truncation will sometimes produce a
+		 * leaf highkey that is an untruncated copy of the lastleft item, but
+		 * never any other item, which necessitates weakening the leaf level
+		 * check to <=.)
+		 *
+		 * Full explanation for why a highkey is never truly a copy of another
+		 * item from the same level on internal levels:
+		 *
+		 * While the new left page's high key is copied from the first offset
+		 * on the right page during an internal page split, that's not the
+		 * full story.  In effect, internal pages are split in the middle of
+		 * the firstright tuple, not between the would-be lastleft and
+		 * firstright tuples: the firstright key ends up on the left side as
+		 * left's new highkey, and the firstright downlink ends up on the
+		 * right side as right's new "negative infinity" item.  The negative
+		 * infinity tuple is truncated to zero attributes, so we're only left
+		 * with the downlink.  In other words, the copying is just an
+		 * implementation detail of splitting in the middle of a (pivot)
+		 * tuple. (See also: "Notes About Data Representation" in the nbtree
+		 * README.)
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
+			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
 			char	   *itid,
 					   *htid;
@@ -968,11 +1059,10 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1035,7 +1125,7 @@ bt_target_page_check(BtreeCheckState *state)
 			rightkey = bt_right_page_check_scankey(state);
 
 			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+				!invariant_g_offset(state, rightkey, max))
 			{
 				/*
 				 * As explained at length in bt_right_page_check_scankey(),
@@ -1304,7 +1394,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * memory remaining allocated.
 	 */
 	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
-	return _bt_mkscankey(state->rel, firstitup);
+	return bt_mkscankey_minusinfkey(state->rel, firstitup);
 }
 
 /*
@@ -1367,7 +1457,8 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the
+	 * page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1416,14 +1507,29 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 	{
 		/*
 		 * Skip comparison of target page key against "negative infinity"
-		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * item, if any.  Checking it would indicate that it's not a strict
+		 * lower bound, but that's only because of the hard-coding for
+		 * negative infinity items within _bt_compare().
+		 *
+		 * If nbtree didn't truncate negative infinity tuples during internal
+		 * page splits then we'd expect child's negative infinity key to be
+		 * equal to the scankey/downlink from target/parent (it would be a
+		 * "low key" in this hypothetical scenario, and so it would still need
+		 * to be treated as a special case here).
+		 *
+		 * Negative infinity items can be thought of as a strict lower bound
+		 * that works transitively, with the last non-negative-infinity pivot
+		 * followed during a descent from the root as its "true" strict lower
+		 * bound.  Only a small number of negative infinity items are truly
+		 * negative infinity: those that are the first items of leftmost
+		 * internal pages.  In more general terms, a negative infinity item is
+		 * only negative infinity with respect to the subtree that the page is
+		 * at the root of.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
+		if (!invariant_l_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1855,6 +1961,66 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return invariant_leq_offset(state, key, upperbound);
+
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+
+	/*
+	 * _bt_compare() is capable of determining that a scankey with a
+	 * filled-out attribute is greater than pivot tuples where the comparison
+	 * is resolved at a truncated attribute (value of attribute in pivot is
+	 * minus infinity).  It is even capable of determining that a "minus
+	 * infinity value" from a "minusinfkey" scankey is equal to a pivot's
+	 * truncated attribute.  However, it is not capable of determining that a
+	 * scankey ("minusinfkey" or otherwise) is _less than_ a tuple on the
+	 * basis of a comparison resolved at _scankey_ minus infinity attribute.
+	 *
+	 * Somebody could teach _bt_compare() to handle this on its own, but that
+	 * would add overhead to index scans.  Complete an extra step to make it
+	 * work here instead.
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		nonpivot = P_ISLEAF(topaque) && upperbound >= P_FIRSTDATAKEY(topaque);
+
+		/* Get number of keys + heap TID for item to the right */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup, nonpivot);
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && rheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1874,42 +2040,84 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound)
 {
 	int32		cmp;
 
 	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
-	return cmp >= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp >= 0;
+
+	/*
+	 * No need to consider the possibility that scankey has attributes that we
+	 * need to force to be interpreted as negative infinity.  _bt_compare() is
+	 * able to determine that scankey is greater than negative infinity.  The
+	 * distinction between "==" and "<" isn't interesting here, since
+	 * corruption is indicated either way.
+	 */
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e.  the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
-							   Page nontarget, OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							 Page nontarget, OffsetNumber upperbound)
 {
 	int32		cmp;
 
 	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
-	return cmp <= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp <= 0;
+
+	/* See invariant_l_offset() for an explanation of this extra step */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+		BTPageOpaque copaque;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		nonpivot = P_ISLEAF(copaque) && upperbound >= P_FIRSTDATAKEY(copaque);
+
+		/* Get number of keys + heap TID for child/non-target item */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+		childheaptid = BTreeTupleGetHeapTIDCareful(state, child, nonpivot);
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && childheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -2065,3 +2273,61 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * _bt_mkscankey() wrapper that automatically sets insertion scankey to have
+ * minus infinity values for truncated attributes from itup (when itup is a
+ * pivot tuple with one or more truncated attributes).
+ *
+ * In a non-corrupt heapkeyspace index, all pivot tuples on a level have
+ * unique keys, so the !minusinfkey optimization correctly guides scans that
+ * aren't interested in relocating a leaf page using leaf page's high key
+ * (i.e. optimization can safely be used by the vast majority of all
+ * _bt_search() calls).  nbtree verification should always use "minusinfkey"
+ * semantics, though, because the !minusinfkey optimization might mask a
+ * problem in a corrupt index.
+ *
+ * For example, invariant_g_offset() might miss a cross-page invariant failure
+ * on an internal level if the scankey built from the first item on the
+ * target's right sibling page happened to be equal to (not greater than) the
+ * last item on target page.  The !minusinfkey tie-breaker might otherwise
+ * cause amcheck to conclude that the scankey is greater, missing index
+ * corruption.  It's unlikely that the same problem would not be caught some
+ * other way, but the !minusinfkey optimization has no upside for amcheck, so
+ * it seems sensible to always avoid it.
+ */
+static inline BTScanInsert
+bt_mkscankey_minusinfkey(Relation rel, IndexTuple itup)
+{
+	BTScanInsert skey;
+
+	skey = _bt_mkscankey(rel, itup);
+	skey->minusinfkey = true;
+
+	return skey;
+}
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets the caller enforce that a heap TID
+ * must be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	BlockNumber targetblock = state->targetblock;
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index bfa0c04c2f..8d27c9b0f6 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -561,7 +561,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	 * Get values of extended metadata if available, use default values
 	 * otherwise.
 	 */
-	if (metad->btm_version == BTREE_VERSION)
+	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b..07c2dcd771 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d4..9920dbfd40 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f427b312..21c978503a 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -504,8 +504,9 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
 
   <para>
    By default, B-tree indexes store their entries in ascending order
-   with nulls last.  This means that a forward scan of an index on
-   column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
+   with nulls last (table TID is treated as a tiebreaker column among
+   otherwise equal entries).  This means that a forward scan of an
+   index on column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
    (or more verbosely, <literal>ORDER BY x ASC NULLS LAST</literal>).  The
    index can also be scanned backward, producing output satisfying
    <literal>ORDER BY x DESC</literal>
@@ -1162,10 +1163,21 @@ CREATE INDEX tab_x_y ON tab(x, y);
    the extra columns are trailing columns; making them be leading columns is
    unwise for the reasons explained in <xref linkend="indexes-multicolumn"/>.
    However, this method doesn't support the case where you want the index to
-   enforce uniqueness on the key column(s).  Also, explicitly marking
-   non-searchable columns as <literal>INCLUDE</literal> columns makes the
-   index slightly smaller, because such columns need not be stored in upper
-   B-tree levels.
+   enforce uniqueness on the key column(s).
+  </para>
+
+  <para>
+   <firstterm>Suffix truncation</firstterm> always removes non-key
+   columns from upper B-Tree levels.  As payload columns, they are
+   never used to guide index scans.  The truncation process also
+   removes one or more trailing key column(s) when the remaining
+   prefix of key column(s) happens to be sufficient to describe tuples
+   on the lowest B-Tree level.  In practice, covering indexes without
+   an <literal>INCLUDE</literal> clause often avoid storing columns
+   that are effectively payload in the upper levels.  However,
+   explicitly defining payload columns as non-key columns
+   <emphasis>reliably</emphasis> keeps the tuples in upper levels
+   small.
   </para>
 
   <para>
diff --git a/src/backend/access/common/indextuple.c b/src/backend/access/common/indextuple.c
index 6a22b17203..53e43ce80e 100644
--- a/src/backend/access/common/indextuple.c
+++ b/src/backend/access/common/indextuple.c
@@ -475,7 +475,11 @@ index_truncate_tuple(TupleDesc sourceDescriptor, IndexTuple source,
 	bool		isnull[INDEX_MAX_KEYS];
 	IndexTuple	truncated;
 
-	Assert(leavenatts < sourceDescriptor->natts);
+	Assert(leavenatts <= sourceDescriptor->natts);
+
+	/* Easy case: no truncation actually required */
+	if (leavenatts == sourceDescriptor->natts)
+		return CopyIndexTuple(source);
 
 	/* Create temporary descriptor to scribble on */
 	truncdesc = palloc(TupleDescSize(sourceDescriptor));
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89..be9bf61d47 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -28,37 +28,50 @@ right-link to find the new page containing the key range you're looking
 for.  This might need to be repeated, if the page has been split more than
 once.
 
+Lehman and Yao talk about pairs of "separator" keys and downlinks in
+internal pages rather than tuples or records.  We use the term "pivot"
+tuple to refer to tuples which don't point to heap tuples, that are used
+only for tree navigation.  All tuples on non-leaf pages and high keys on
+leaf pages are pivot tuples.  Since pivot tuples are only used to represent
+which part of the key space belongs on each page, they can have attribute
+values copied from non-pivot tuples that were deleted and killed by VACUUM
+some time ago.  A pivot tuple may contain a "separator" key and downlink,
+just a separator key (i.e. the downlink value is implicitly undefined), or
+just a downlink (i.e. all attributes are truncated away).  We aren't always
+clear on which case applies, but it should be obvious from context.
+
+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute.  Logical duplicates are sorted in heap TID
+order.  This is necessary because Lehman and Yao also require that the key
+range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are
+the adjacent keys in the parent page (Ki must be _strictly_ less than v,
+which can be assured by having reliably unique keys).
+
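A minimal sketch of the tiebreaker rule (illustration only, not text or code from the patch; ItemPointerCompare() is the existing generic TID comparator from storage/itemptr.h):

/*
 * Conceptually, a heapkeyspace comparison orders tuples by their user
 * attributes first, falling back on heap TID only when those are equal.
 */
static int
toy_heapkeyspace_compare(int user_attr_cmp, ItemPointer tid1, ItemPointer tid2)
{
	if (user_attr_cmp != 0)
		return user_attr_cmp;				/* user keys decide it */

	return ItemPointerCompare(tid1, tid2);	/* heap TID breaks the tie */
}
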
+A search where the key is equal to a pivot tuple in an upper tree level
+must descend to the left of that pivot to ensure it finds any equal keys.
+The equal item(s) being searched for must therefore be to the left of that
+downlink page on the next level down.  A handy property of this design is
+that a scan where all attributes/keys are used behaves just the same as a
+scan where only some prefix of attributes are used; equality never needs to
+be treated as a special case.
+
+In practice, exact equality with pivot tuples on internal pages is
+extremely rare when all attributes (including even the heap TID attribute)
+are used in a search.  This is due to suffix truncation: truncated
+attributes are treated as having the value negative infinity, and
+truncation almost always manages to at least truncate away the trailing
+heap TID attribute.  While Lehman and Yao don't have anything to say about
+suffix truncation, the design used by nbtree is perfectly complementary.
+The later section on suffix truncation will be helpful if it's unclear how
+the Lehman & Yao invariants work with a real world example involving
+suffix truncation.
+
 Differences to the Lehman & Yao algorithm
 -----------------------------------------
 
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
-
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
-
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
 among backends.  As a result, we do page-level read locking on btree
@@ -194,9 +207,7 @@ be prepared for the possibility that the item it wants is to the left of
 the recorded position (but it can't have moved left out of the recorded
 page).  Since we hold a lock on the lower page (per L&Y) until we have
 re-found the parent item that links to it, we can be assured that the
-parent item does still exist and can't have been deleted.  Also, because
-we are matching downlink page numbers and not data keys, we don't have any
-problem with possibly misidentifying the parent item.
+parent item does still exist and can't have been deleted.
 
 Page Deletion
 -------------
@@ -595,36 +606,56 @@ scankey point to comparison functions that return boolean, such as int4lt.
 There might be more than one scankey entry for a given index column, or
 none at all.  (We require the keys to appear in index column order, but
 the order of multiple keys for a given column is unspecified.)  An
-insertion scankey uses the same array-of-ScanKey data structure, but the
+insertion scankey uses a similar array-of-ScanKey data structure, but the
 sk_func pointers point to btree comparison support functions (ie, 3-way
 comparators that return int4 values interpreted as <0, =0, >0).  In an
-insertion scankey there is exactly one entry per index column.  Insertion
-scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
-used to locate the starting point of a scan, as well as for locating the
-place to insert a new index tuple.  (Note: in the case of an insertion
-scankey built from a search scankey, there might be fewer keys than
-index columns, indicating that we have no constraints for the remaining
-index columns.)  After we have located the starting point of a scan, the
-original search scankey is consulted as each index entry is sequentially
-scanned to decide whether to return the entry and whether the scan can
-stop (see _bt_checkkeys()).
+insertion scankey there is at most one entry per index column.  There is
+also other data about the rules used to locate where to begin the scan,
+such as whether or not the scan is a "nextkey" scan.  Insertion scankeys
+are built within the btree code (eg, by _bt_mkscankey()) and are used to
+locate the starting point of a scan, as well as for locating the place to
+insert a new index tuple.  (Note: in the case of an insertion scankey built
+from a search scankey or built from a truncated pivot tuple, there might be
+fewer keys than index columns, indicating that we have no constraints for
+the remaining index columns.) After we have located the starting point of a
+scan, the original search scankey is consulted as each index entry is
+sequentially scanned to decide whether to return the entry and whether the
+scan can stop (see _bt_checkkeys()).
 
-We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+Notes about suffix truncation
+-----------------------------
+
+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split.  The remaining attributes must distinguish
+the last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page.  A truncated tuple logically retains all key attributes,
+though the truncated ones implicitly have "negative infinity" as their value
+and take up no storage.  Since the high key is subsequently reused as the
+downlink in the parent page for the new right page, suffix truncation makes
+pivot tuples short.  INCLUDE indexes are guaranteed to have non-key
+attributes truncated at the time of a leaf page split, but may also have
+some key attributes truncated away, based on the usual criteria for key
+attributes.  They are not a special case, since non-key attributes are
+merely payload to B-Tree searches.
+
+The goal of suffix truncation of key attributes is to improve index
+fan-out.  The technique was first described by Bayer and Unterauer (R.Bayer
+and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
+2, No. 1, March 1977, pp 11-26).  The Postgres implementation is loosely
+based on their paper.  Note that Postgres only implements what the paper
+refers to as simple prefix B-Trees.  Note also that the paper assumes that
+the tree has keys that consist of single strings that maintain the "prefix
+property", much like strings that are stored in a suffix tree (comparisons
+of earlier bytes must always be more significant than comparisons of later
+bytes, and, in general, the strings must compare in a way that doesn't
+break transitive consistency as they're split into pieces).  Suffix
+truncation in Postgres currently only works at the whole-attribute
+granularity, but it would be straightforward to invent opclass
+infrastructure that manufactures a smaller attribute value in the case of
+variable-length types, such as text.  An opclass support function could
+manufacture the shortest possible key value that still correctly separates
+each half of a leaf page split.
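
As a rough illustration of what such an opclass support function might look
like, here is a hypothetical sketch for byte-ordered strings (plain memcmp
ordering; real text comparison goes through collations, so this is not how
production code would do it).  It returns the length of the shortest prefix
of the first right tuple's value that still sorts strictly after the last
left tuple's value:

    #include <stdio.h>
    #include <string.h>

    /*
     * Shortest prefix of "firstright" that still compares strictly greater
     * than "lastleft", assuming byte-wise (memcmp) ordering and that
     * lastleft < firstright.
     */
    static size_t
    separator_len(const char *lastleft, const char *firstright)
    {
        size_t  llen = strlen(lastleft);
        size_t  rlen = strlen(firstright);

        for (size_t n = 1; n <= rlen; n++)
        {
            size_t  common = (n < llen) ? n : llen;
            int     c = memcmp(firstright, lastleft, common);

            if (c > 0 || (c == 0 && n > llen))
                return n;       /* prefix of length n already separates */
        }
        return rlen;            /* unreachable when lastleft < firstright */
    }

    int
    main(void)
    {
        const char *lastleft = "Buenos Aires";
        const char *firstright = "Burundi";
        size_t      n = separator_len(lastleft, firstright);

        printf("separator key: \"%.*s\"\n", (int) n, firstright);  /* "Bur" */
        return 0;
    }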
 
 Notes About Data Representation
 -------------------------------
@@ -637,20 +668,26 @@ don't need to renumber any existing pages when splitting the root.)
 
 The Postgres disk block data format (an array of items) doesn't fit
 Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
-so we have to play some games.
+so we have to play some games.  (Presumably things are explained this
+way because of internal page splits, which conceptually split at the
+middle of an existing pivot tuple -- the tuple's "separator" key goes on
+the left side of the split as the left side's new high key, while the
+tuple's pointer/downlink goes on the right side as the first/minus
+infinity downlink.)
 
 On a page that is not rightmost in its tree level, the "high key" is
 kept in the page's first item, and real data items start at item 2.
 The link portion of the "high key" item goes unused.  A page that is
-rightmost has no "high key", so data items start with the first item.
-Putting the high key at the left, rather than the right, may seem odd,
-but it avoids moving the high key as we add data items.
+rightmost has no "high key" (it's implicitly positive infinity), so
+data items start with the first item.  Putting the high key at the
+left, rather than the right, may seem odd, but it avoids moving the
+high key as we add data items.
 
 On a leaf page, the data items are simply links to (TIDs of) tuples
 in the relation being indexed, with the associated key values.
 
 On a non-leaf page, the data items are down-links to child pages with
-bounding keys.  The key in each data item is the *lower* bound for
+bounding keys.  The key in each data item is a strict lower bound for
 keys on that child page, so logically the key is to the left of that
 downlink.  The high key (if present) is the upper bound for the last
 downlink.  The first data item on each such page has no lower bound
@@ -658,4 +695,5 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
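
To spell out how a search chooses a downlink under these rules, a simplified
standalone sketch follows (integer separator keys, a linear scan in place of
the real binary search, and none of the nextkey or truncated-attribute
subtleties): descend into the last child whose separator is strictly less
than the scan key, with the leftmost child acting as the minus infinity case.

    #include <stdio.h>

    #define NCHILDREN 4

    /*
     * sep[i] is the separator stored with downlink i; it is a strict lower
     * bound for keys in child i.  sep[0] is never consulted ("minus infinity").
     */
    static const int sep[NCHILDREN] = {0 /* -inf, unused */, 10, 20, 30};

    /* pick the child to descend into: last downlink whose separator < key */
    static int
    choose_child(int scankey)
    {
        for (int i = NCHILDREN - 1; i > 0; i--)
        {
            if (sep[i] < scankey)
                return i;
        }
        return 0;               /* leftmost child: no lower bound */
    }

    int
    main(void)
    {
        printf("key  5 -> child %d\n", choose_child(5));    /* 0 */
        printf("key 10 -> child %d\n", choose_child(10));   /* 0: bound is strict */
        printf("key 11 -> child %d\n", choose_child(11));   /* 1 */
        printf("key 99 -> child %d\n", choose_child(99));   /* 3 */
        return 0;
    }
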
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b3fbba276d..7d481f0ff2 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -64,14 +64,16 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
 				  Relation heapRel);
 static bool _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
 					 bool *restorebinsrch, Size itemsz);
-static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
+static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
+			   Buffer buf,
+			   Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
 			   bool split_only_page);
-static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
-		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
-		  IndexTuple newitem, bool newitemonleft);
+static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
+		  Buffer cbuf, OffsetNumber firstright, OffsetNumber newitemoff,
+		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
@@ -120,6 +122,9 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_key = _bt_mkscankey(rel, itup);
+	/* No scantid until uniqueness established in checkingunique case */
+	if (checkingunique && itup_key->heapkeyspace)
+		itup_key->scantid = NULL;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -225,12 +230,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tie-breaker attribute.  Any other would-be inserter
+	 * of the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -267,6 +273,10 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/* Uniqueness is established -- restore heap tid as scantid */
+		if (itup_key->heapkeyspace)
+			itup_key->scantid = &itup->t_tid;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
@@ -275,12 +285,12 @@ top:
 
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
-		 * an index tuple insert conflicts with an existing lock.  Since the
-		 * actual location of the insert is hard to predict because of the
-		 * random search used to prevent O(N^2) performance when there are
-		 * many duplicate entries, we can just use the "first valid" page.
-		 * This reasoning also applies to INCLUDE indexes, whose extra
-		 * attributes are not considered part of the key space.
+		 * an index tuple insert conflicts with an existing lock.  The actual
+		 * location of the insert is unsettled in the checkingunique case
+		 * because scantid was not filled in initially, but it's okay to use
+		 * the "first valid" page instead.  This reasoning also applies to
+		 * INCLUDE indexes, whose extra attributes are not considered part of
+		 * the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
 
@@ -291,8 +301,8 @@ top:
 		 */
 		newitemoff = _bt_findinsertloc(rel, itup_key, &buf, checkingunique,
 									   itup, stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, newitemoff,
-					   false);
+		_bt_insertonpg(rel, itup_key, buf, InvalidBuffer, stack, itup,
+					   newitemoff, false);
 	}
 	else
 	{
@@ -313,7 +323,8 @@ top:
  *
  * Sets state in itup_key sufficient for later _bt_findinsertloc() call to
  * reuse most of the work of our initial binary search to find conflicting
- * tuples.
+ * tuples.  This state won't be usable if the caller's tuple turns out not to
+ * belong on buf once scantid has been filled in.
  *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
@@ -362,6 +373,7 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
 	Assert(itup_key->low == offset);
+	Assert(itup_key->scantid == NULL);
 	for (;;)
 	{
 		ItemId		curitemid;
@@ -399,16 +411,14 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 			/*
 			 * We can skip items that are marked killed.
 			 *
-			 * Formerly, we applied _bt_isequal() before checking the kill
-			 * flag, so as to fall out of the item loop as soon as possible.
-			 * However, in the presence of heavy update activity an index may
-			 * contain many killed items with the same key; running
-			 * _bt_isequal() on each killed item gets expensive. Furthermore
-			 * it is likely that the non-killed version of each key appears
-			 * first, so that we didn't actually get to exit any sooner
-			 * anyway. So now we just advance over killed items as quickly as
-			 * we can. We only apply _bt_isequal() when we get to a non-killed
-			 * item.
+			 * In the presence of heavy update activity an index may contain
+			 * many killed items with the same key; running _bt_isequal() on
+			 * each killed item gets expensive. Just advance over killed items
+			 * as quickly as we can. We only apply _bt_isequal() when we get
+			 * to a non-killed item. Even those comparisons could be avoided
+			 * (in the common case where there is only one page to visit) by
+			 * reusing bounds, but just skipping dead items is sufficiently
+			 * effective.
 			 */
 			if (!ItemIdIsDead(curitemid))
 			{
@@ -633,16 +643,16 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 /*
  *	_bt_findinsertloc() -- Finds an insert location for a tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
+ *		On entry, *bufptr contains the page that the new tuple unambiguously
+ *		belongs on.  This may not be quite right for callers that just called
+ *		_bt_check_unique(), though, since they won't have initially searched
+ *		using a scantid.  They'll have to insert into a page somewhere to the
+ *		right in rare cases where there are many physical duplicates in a
+ *		unique index, and their scantid directs us to some page full of
+ *		duplicates to the right, where the new tuple must go.  (Actually,
+ *		since !heapkeyspace pg_upgrade'd non-unique indexes never get a
+ *		scantid, they too may require that we move right.  We treat them
+ *		somewhat like unique indexes.)
  *
  *		_bt_check_unique() callers arrange for their insertion scan key to
  *		save the progress of the last binary search performed.  No additional
@@ -685,28 +695,26 @@ _bt_findinsertloc(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itemsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
-	 */
-	if (itemsz > BTMaxItemSize(page))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itemsz, BTMaxItemSize(page),
-						RelationGetRelationName(rel)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(heapRel,
-									RelationGetRelationName(rel))));
+	/* Check 1/3 of a page restriction */
+	if (unlikely(itemsz > BTMaxItemSize(page)))
+		_bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+							 newtup);
 
+	/*
+	 * We may have to walk right through leaf pages to find the one leaf page
+	 * that we must insert on to, though only when inserting into unique
+	 * indexes.  This is necessary because a scantid is not used by the
+	 * insertion scan key initially in the case of unique indexes -- a scantid
+	 * is only set after the absence of duplicates (whose heap tuples are not
+	 * dead or recently dead) has been established by _bt_check_unique().
+	 * Non-unique index insertions will break out of the loop immediately.
+	 *
+	 * (Actually, non-unique indexes may still need to grovel through leaf
+	 * pages full of duplicates with a pg_upgrade'd !heapkeyspace index.)
+	 */
 	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+	Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
 	for (;;)
 	{
 		int			cmpval;
@@ -714,6 +722,13 @@ _bt_findinsertloc(Relation rel,
 		BlockNumber rblkno;
 
 		/*
+		 * Fastpaths that avoid an extra high key check.
+		 *
+		 * There is no need to check the high key when inserting into a
+		 * non-unique index; _bt_search() already checked this when it decided
+		 * whether a move right was needed to reach the leaf page.  The
+		 * insertion scankey's scantid was already filled in at that point.
+		 *
 		 * The checkingunique (restorebinsrch) case may well have established
 		 * bounds within _bt_check_unique()'s binary search that preclude the
 		 * need for a further high key check.  This fastpath isn't used when
@@ -721,22 +736,33 @@ _bt_findinsertloc(Relation rel,
 		 * when it looks like the new item belongs last on the page, but it
 		 * might go on a later page instead.
 		 */
-		if (restorebinsrch && itup_key->low <= itup_key->stricthigh &&
-			itup_key->stricthigh <= PageGetMaxOffsetNumber(page))
+		if (!checkingunique && itup_key->heapkeyspace)
+			break;
+		else if (restorebinsrch && itup_key->low <= itup_key->stricthigh &&
+				 itup_key->stricthigh <= PageGetMaxOffsetNumber(page))
 			break;
 
 		if (P_RIGHTMOST(lpageop))
 			break;
 		cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
-
-		/*
-		 * May have to handle case where there is a choice of which page to
-		 * place new tuple on, and we must balance space utilization as best
-		 * we can.
-		 */
-		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
-												&restorebinsrch, itemsz))
-			break;
+		if (itup_key->heapkeyspace)
+		{
+			if (cmpval <= 0)
+				break;
+		}
+		else
+		{
+			/*
+			 * pg_upgrade'd !heapkeyspace index.
+			 *
+			 * May have to handle legacy case where there is a choice of which
+			 * page to place new tuple on, and we must balance space
+			 * utilization as best we can.
+			 */
+			if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
+													&restorebinsrch, itemsz))
+				break;
+		}
 
 		/*
 		 * step right to next non-dead page
@@ -745,6 +771,8 @@ _bt_findinsertloc(Relation rel,
 		 * page; else someone else's _bt_check_unique scan could fail to see
 		 * our insertion.  write locks on intermediate dead pages won't do
 		 * because we don't know when they will get de-linked from the tree.
+		 * (This is more aggressive than it needs to be for non-unique
+		 * !heapkeyspace indexes.)
 		 */
 		rbuf = InvalidBuffer;
 
@@ -759,7 +787,10 @@ _bt_findinsertloc(Relation rel,
 			 * If this page was incompletely split, finish the split now. We
 			 * do this while holding a lock on the left sibling, which is not
 			 * good because finishing the split could be a fairly lengthy
-			 * operation.  But this should happen very seldom.
+			 * operation.  But this should only happen when inserting into a
+			 * unique index that has more than an entire page of duplicates
+			 * of the value being inserted.  (!heapkeyspace non-unique indexes
+			 * are an exception, once again.)
 			 */
 			if (P_INCOMPLETE_SPLIT(lpageop))
 			{
@@ -814,6 +845,11 @@ _bt_findinsertloc(Relation rel,
 /*
  *	_bt_useduplicatepage() -- Settle for this page of duplicates?
  *
+ *		Prior to PostgreSQL 12/Btree version 4, heap TID was never treated
+ *		as a part of the keyspace.  If there were many tuples of the same
+ *		value spanning more than one leaf page, a new tuple of that same
+ *		value could legally be placed on any one of the pages.
+ *
  *		This function handles the question of whether or not an insertion
  *		of a duplicate into a pg_upgrade'd !heapkeyspace index should
  *		insert on the page contained in buf when a choice must be made.
@@ -904,6 +940,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
  */
 static void
 _bt_insertonpg(Relation rel,
+			   BTScanInsert itup_key,
 			   Buffer buf,
 			   Buffer cbuf,
 			   BTStack stack,
@@ -926,7 +963,7 @@ _bt_insertonpg(Relation rel,
 		   BTreeTupleGetNAtts(itup, rel) ==
 		   IndexRelationGetNumberOfAttributes(rel));
 	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
+		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/* The caller should've finished any incomplete splits already. */
@@ -976,8 +1013,8 @@ _bt_insertonpg(Relation rel,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, buf, cbuf, firstright,
-						 newitemoff, itemsz, itup, newitemonleft);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, firstright, newitemoff,
+						 itemsz, itup, newitemonleft);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1059,7 +1096,7 @@ _bt_insertonpg(Relation rel,
 		if (BufferIsValid(metabuf))
 		{
 			/* upgrade meta-page if needed */
-			if (metad->btm_version < BTREE_VERSION)
+			if (metad->btm_version < BTREE_NOVAC_VERSION)
 				_bt_upgrademetapage(metapg);
 			metad->btm_fastroot = itup_blkno;
 			metad->btm_fastlevel = lpageop->btpo.level;
@@ -1114,6 +1151,8 @@ _bt_insertonpg(Relation rel,
 
 			if (BufferIsValid(metabuf))
 			{
+				Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+				xlmeta.version = metad->btm_version;
 				xlmeta.root = metad->btm_root;
 				xlmeta.level = metad->btm_level;
 				xlmeta.fastroot = metad->btm_fastroot;
@@ -1181,17 +1220,19 @@ _bt_insertonpg(Relation rel,
  *		new right page.  newitemoff etc. tell us about the new item that
  *		must be inserted along with the data from the old page.
  *
- *		When splitting a non-leaf page, 'cbuf' is the left-sibling of the
- *		page we're inserting the downlink for.  This function will clear the
- *		INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *		itup_key is used for suffix truncation on leaf pages (internal
+ *		page callers pass NULL).  When splitting a non-leaf page, 'cbuf'
+ *		is the left-sibling of the page we're inserting the downlink for.
+ *		This function will clear the INCOMPLETE_SPLIT flag on it, and
+ *		release the buffer.
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
-_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
-		  bool newitemonleft)
+_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
+		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
+		  IndexTuple newitem, bool newitemonleft)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1286,7 +1327,8 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1300,8 +1342,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 
 	/*
 	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
-	 * item at position firstright, or the incoming tuple.
+	 * to go into the new right page, or possibly a truncated version if this
+	 * is a leaf page split.  This might be either the existing data item at
+	 * position firstright, or the incoming tuple.
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1319,25 +1362,58 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
-	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * Truncate nondistinguishing key attributes of the high key item before
+	 * inserting it on the left page.  This can only happen at the leaf level,
+	 * since in general all pivot tuple values originate from leaf level high
+	 * keys.  This isn't just about avoiding unnecessary work, though;
+	 * truncating unneeded key suffix attributes can only be performed at the
+	 * leaf level anyway.  This is because a pivot tuple in a grandparent page
+	 * must guide a search not only to the correct parent page, but also to
+	 * the correct leaf page.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (isleaf && (itup_key->heapkeyspace || indnatts != indnkeyatts))
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		IndexTuple	lastleft;
+
+		/*
+		 * Determine which tuple will become the last on the left page.  The
+		 * last left tuple and the first right tuple enclose the split point,
+		 * and are needed to determine how far truncation can go while still
+		 * leaving us with a high key that distinguishes the left side from
+		 * the right side.
+		 */
+		if (newitemonleft && newitemoff == firstright)
+		{
+			/* incoming tuple will become last on left page */
+			lastleft = newitem;
+		}
+		else
+		{
+			OffsetNumber lastleftoff;
+
+			/* item just before firstright will become last on left page */
+			lastleftoff = OffsetNumberPrev(firstright);
+			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		/*
+		 * Truncate first item on the right side to create a new high key for
+		 * the left side.  The high key must be strictly less than all tuples
+		 * on the right side of the split, but can be equal to the last item
+		 * on the left side of the split within leaf pages.
+		 */
+		Assert(lastleft != item);
+		lefthikey = _bt_truncate(rel, lastleft, item, itup_key);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
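
The amount of truncation possible is determined entirely by where lastleft
and firstright first differ.  A simplified standalone sketch of that
calculation (integer arrays stand in for IndexTuples, and plain equality
stands in for the insertion scankey's comparators); a result greater than
the number of key attributes signals that a heap TID tie-breaker attribute
is needed as well:

    #include <stdio.h>

    #define NKEYATTS 3

    /*
     * Number of key attributes the new high key must keep: every attribute
     * up to and including the first one where lastleft and firstright
     * differ.  A result of NKEYATTS + 1 means all key attributes are equal,
     * so a heap TID tie-breaker attribute is needed as well.
     */
    static int
    keep_natts(const int *lastleft, const int *firstright)
    {
        int     keep = 1;

        for (int attnum = 0; attnum < NKEYATTS; attnum++)
        {
            if (lastleft[attnum] != firstright[attnum])
                return keep;
            keep++;
        }
        return NKEYATTS + 1;
    }

    int
    main(void)
    {
        int     lastleft[NKEYATTS] = {7, 42, 5};
        int     firstright[NKEYATTS] = {7, 50, 1};

        /* attributes after the first distinguishing one can be truncated */
        printf("keep %d of %d key attributes\n",
               keep_natts(lastleft, firstright), NKEYATTS);  /* keep 2 of 3 */
        return 0;
    }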
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1530,7 +1606,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1559,22 +1634,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log left page.  We must also log the left page's high key. */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1592,9 +1655,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -1957,7 +2018,7 @@ _bt_insert_parent(Relation rel,
 			_bt_relbuf(rel, pbuf);
 		}
 
-		/* get high key from left page == lower bound for new right page */
+		/* get high key from left, a strict lower bound for new right page */
 		ritem = (IndexTuple) PageGetItem(page,
 										 PageGetItemId(page, P_HIKEY));
 
@@ -1987,7 +2048,7 @@ _bt_insert_parent(Relation rel,
 				 RelationGetRelationName(rel), bknum, rbknum);
 
 		/* Recursively update the parent */
-		_bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
+		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
 					   new_item, stack->bts_offset + 1,
 					   is_only);
 
@@ -2247,7 +2308,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	START_CRIT_SECTION();
 
 	/* upgrade metapage if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* set btree special data */
@@ -2282,7 +2343,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2315,6 +2377,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
 		XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = rootblknum;
 		md.level = metad->btm_level;
 		md.fastroot = rootblknum;
@@ -2379,6 +2443,7 @@ _bt_pgaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -2394,8 +2459,8 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
@@ -2408,6 +2473,7 @@ _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
 	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
 	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(itup_key->scantid == NULL);
 
 	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 7569a37b2e..9f54c28628 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -34,6 +34,7 @@
 #include "utils/snapmgr.h"
 
 static void _bt_cachemetadata(Relation rel, BTMetaPageData *metad);
+static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 						 bool *rightsib_empty);
@@ -77,7 +78,9 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 }
 
 /*
- *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to the new.
+ *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to version
+ *		3, the last version that can be updated without broadly affecting
+ *		on-disk compatibility.  (A REINDEX is required to upgrade to v4.)
  *
  *		This routine does purely in-memory image upgrade.  Caller is
  *		responsible for locking, WAL-logging etc.
@@ -93,11 +96,11 @@ _bt_upgrademetapage(Page page)
 
 	/* It must be really a meta page of upgradable version */
 	Assert(metaopaque->btpo_flags & BTP_META);
-	Assert(metad->btm_version < BTREE_VERSION);
+	Assert(metad->btm_version < BTREE_NOVAC_VERSION);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 
 	/* Set version number and fill extra fields added into version 3 */
-	metad->btm_version = BTREE_VERSION;
+	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 
@@ -107,43 +110,79 @@ _bt_upgrademetapage(Page page)
 }
 
 /*
- * Cache metadata from meta page to rel->rd_amcache.
+ * Cache metadata from input meta page to rel->rd_amcache.
  */
 static void
-_bt_cachemetadata(Relation rel, BTMetaPageData *metad)
+_bt_cachemetadata(Relation rel, BTMetaPageData *input)
 {
+	BTMetaPageData *cached_metad;
+
 	/* We assume rel->rd_amcache was already freed by caller */
 	Assert(rel->rd_amcache == NULL);
 	rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
 										 sizeof(BTMetaPageData));
 
-	/*
-	 * Meta page should be of supported version (should be already checked by
-	 * caller).
-	 */
-	Assert(metad->btm_version >= BTREE_MIN_VERSION &&
-		   metad->btm_version <= BTREE_VERSION);
+	/* Meta page should be of supported version */
+	Assert(input->btm_version >= BTREE_MIN_VERSION &&
+		   input->btm_version <= BTREE_VERSION);
 
-	if (metad->btm_version == BTREE_VERSION)
+	cached_metad = (BTMetaPageData *) rel->rd_amcache;
+	if (input->btm_version >= BTREE_NOVAC_VERSION)
 	{
-		/* Last version of meta-data, no need to upgrade */
-		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		/* Version with compatible meta-data, no need to upgrade */
+		memcpy(cached_metad, input, sizeof(BTMetaPageData));
 	}
 	else
 	{
-		BTMetaPageData *cached_metad = (BTMetaPageData *) rel->rd_amcache;
-
 		/*
 		 * Upgrade meta-data: copy available information from meta-page and
 		 * fill new fields with default values.
+		 *
+		 * Note that we cannot upgrade to version 4+ without a REINDEX, since
+		 * extensive on-disk changes are required.
 		 */
-		memcpy(rel->rd_amcache, metad, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
-		cached_metad->btm_version = BTREE_VERSION;
+		memcpy(cached_metad, input, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
+		cached_metad->btm_version = BTREE_NOVAC_VERSION;
 		cached_metad->btm_oldest_btpo_xact = InvalidTransactionId;
 		cached_metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	}
 }
 
+/*
+ * Get metadata from share-locked buffer containing metapage, while performing
+ * standard sanity checks.  Sanity checks here must match _bt_getroot().
+ */
+static BTMetaPageData *
+_bt_getmeta(Relation rel, Buffer metabuf)
+{
+	Page		metapg;
+	BTPageOpaque metaopaque;
+	BTMetaPageData *metad;
+
+	metapg = BufferGetPage(metabuf);
+	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
+	metad = BTPageGetMeta(metapg);
+
+	/* sanity-check the metapage */
+	if (!P_ISMETA(metaopaque) ||
+		metad->btm_magic != BTREE_MAGIC)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("index \"%s\" is not a btree",
+						RelationGetRelationName(rel))));
+
+	if (metad->btm_version < BTREE_MIN_VERSION ||
+		metad->btm_version > BTREE_VERSION)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("version mismatch in index \"%s\": file version %d, "
+						"current version %d, minimal supported version %d",
+						RelationGetRelationName(rel),
+						metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+
+	return metad;
+}
+
 /*
  *	_bt_update_meta_cleanup_info() -- Update cleanup-related information in
  *									  the metapage.
@@ -167,7 +206,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	metad = BTPageGetMeta(metapg);
 
 	/* outdated version of metapage always needs rewrite */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		needsRewrite = true;
 	else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
 			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
@@ -186,7 +225,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* update cleanup-related information */
@@ -202,6 +241,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = metad->btm_root;
 		md.level = metad->btm_level;
 		md.fastroot = metad->btm_fastroot;
@@ -376,7 +417,7 @@ _bt_getroot(Relation rel, int access)
 		START_CRIT_SECTION();
 
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 
 		metad->btm_root = rootblkno;
@@ -400,6 +441,8 @@ _bt_getroot(Relation rel, int access)
 			XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
 			XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			md.version = metad->btm_version;
 			md.root = rootblkno;
 			md.level = 0;
 			md.fastroot = rootblkno;
@@ -595,37 +638,12 @@ _bt_getrootheight(Relation rel)
 {
 	BTMetaPageData *metad;
 
-	/*
-	 * We can get what we need from the cached metapage data.  If it's not
-	 * cached yet, load it.  Sanity checks here must match _bt_getroot().
-	 */
 	if (rel->rd_amcache == NULL)
 	{
 		Buffer		metabuf;
-		Page		metapg;
-		BTPageOpaque metaopaque;
 
 		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
-		metapg = BufferGetPage(metabuf);
-		metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-		metad = BTPageGetMeta(metapg);
-
-		/* sanity-check the metapage */
-		if (!P_ISMETA(metaopaque) ||
-			metad->btm_magic != BTREE_MAGIC)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("index \"%s\" is not a btree",
-							RelationGetRelationName(rel))));
-
-		if (metad->btm_version < BTREE_MIN_VERSION ||
-			metad->btm_version > BTREE_VERSION)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("version mismatch in index \"%s\": file version %d, "
-							"current version %d, minimal supported version %d",
-							RelationGetRelationName(rel),
-							metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+		metad = _bt_getmeta(rel, metabuf);
 
 		/*
 		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
@@ -642,19 +660,70 @@ _bt_getrootheight(Relation rel)
 		 * Cache the metapage data for next time
 		 */
 		_bt_cachemetadata(rel, metad);
-
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
 		_bt_relbuf(rel, metabuf);
 	}
 
+	/* Get cached page */
 	metad = (BTMetaPageData *) rel->rd_amcache;
-	/* We shouldn't have cached it if any of these fail */
-	Assert(metad->btm_magic == BTREE_MAGIC);
-	Assert(metad->btm_version == BTREE_VERSION);
-	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
+/*
+ *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *
+ *		This is used to determine the rules that must be used to descend a
+ *		btree.  Version 4 indexes treat heap TID as a tie-breaker attribute.
+ *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
+ *		performance when inserting a new BTScanInsert-wise duplicate tuple
+ *		among many leaf pages already full of such duplicates.
+ */
+bool
+_bt_heapkeyspace(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			uint32		btm_version = metad->btm_version;
+
+			_bt_relbuf(rel, metabuf);
+			return btm_version > BTREE_NOVAC_VERSION;
+		}
+
+		/*
+		 * Cache the metapage data for next time
+		 */
+		_bt_cachemetadata(rel, metad);
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached page */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+
+	return metad->btm_version > BTREE_NOVAC_VERSION;
+}
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -1123,10 +1192,12 @@ _bt_is_page_halfdead(Relation rel, BlockNumber blk)
  * right sibling.
  *
  * "child" is the leaf page we wish to delete, and "stack" is a search stack
- * leading to it (approximately).  Note that we will update the stack
- * entry(s) to reflect current downlink positions --- this is harmless and
- * indeed saves later search effort in _bt_pagedel.  The caller should
- * initialize *target and *rightsib to the leaf page and its right sibling.
+ * leading to it (the stack is only approximate in !heapkeyspace indexes, and
+ * must be re-checked for staleness in all cases).  Note that we
+ * will update the stack entry(s) to reflect current downlink positions ---
+ * this is harmless and indeed saves later search effort in _bt_pagedel.  The
+ * caller should initialize *target and *rightsib to the leaf page and its
+ * right sibling.
  *
  * Note: it's OK to release page locks on any internal pages between the leaf
  * and *topparent, because a safe deletion can't become unsafe due to
@@ -1421,9 +1492,20 @@ _bt_pagedel(Relation rel, Buffer buf)
 
 				/* we need an insertion scan key for the search, so build one */
 				itup_key = _bt_mkscankey(rel, targetkey);
+				/* absent attributes need explicit minus infinity values */
+				itup_key->minusinfkey = true;
 				/* get stack to leaf page by searching index */
 				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
 
+				/*
+				 * Search will reliably relocate same leaf page.
+				 *
+				 * (However, prior to version 4 the search is for the leftmost
+				 * leaf page containing this key, which is okay because we
+				 * will tiebreak on downlink block number.)
+				 */
+				Assert(!itup_key->heapkeyspace ||
+					   BufferGetBlockNumber(buf) == BufferGetBlockNumber(lbuf));
 				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
@@ -1969,7 +2051,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	if (BufferIsValid(metabuf))
 	{
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 		metad->btm_fastroot = rightsib;
 		metad->btm_fastlevel = targetlevel;
@@ -2017,6 +2099,8 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 		{
 			XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			xlmeta.version = metad->btm_version;
 			xlmeta.root = metad->btm_root;
 			xlmeta.level = metad->btm_level;
 			xlmeta.fastroot = metad->btm_fastroot;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..ec2edae850 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -794,7 +794,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 	{
 		/*
 		 * Do cleanup if metapage needs upgrade, because we don't have
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7940297305..54a4c64304 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -152,8 +152,12 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link during the second half of a page split -- if caller ends
+		 * up splitting the child it usually ends up inserting a new pivot
+		 * tuple for child's new right sibling immediately after the original
+		 * bts_offset offset recorded here.  The downlink block will be needed
+		 * to check if bts_offset remains the position of this same pivot
+		 * tuple.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -251,11 +255,13 @@ _bt_moveright(Relation rel,
 	/*
 	 * When nextkey = false (normal case): if the scan key that brought us to
 	 * this page is > the high key stored on the page, then the page has split
-	 * and we need to move right.  (If the scan key is equal to the high key,
-	 * we might or might not need to move right; have to scan the page first
-	 * anyway.)
+	 * and we need to move right.  (pg_upgrade'd !heapkeyspace indexes could
+	 * have some duplicates to the right as well as the left, but that's
+	 * something that's only ever dealt with on the leaf level, after
+	 * _bt_search has found an initial leaf page.)
 	 *
 	 * When nextkey = true: move right if the scan key is >= page's high key.
+	 * (Note that key.scantid cannot be set in this case.)
 	 *
 	 * The page could even have split more than once, so scan as far as
 	 * needed.
@@ -358,6 +364,9 @@ _bt_binsrch(Relation rel,
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+
 	if (!key->restorebinsrch)
 	{
 		low = P_FIRSTDATAKEY(opaque);
@@ -367,6 +376,7 @@ _bt_binsrch(Relation rel,
 	else
 	{
 		/* Restore result of previous binary search against same page */
+		Assert(!key->heapkeyspace || key->scantid != NULL);
 		Assert(P_ISLEAF(opaque));
 		low = key->low;
 		high = key->stricthigh;
@@ -446,6 +456,7 @@ _bt_binsrch(Relation rel,
 	if (key->savebinsrch)
 	{
 		Assert(isleaf);
+		Assert(key->scantid == NULL);
 		key->low = low;
 		key->stricthigh = stricthigh;
 		key->savebinsrch = false;
@@ -492,19 +503,31 @@ _bt_compare(Relation rel,
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	ItemPointer heapTid;
 	ScanKey		scankey;
+	int			ncmpkey;
+	int			ntupatts;
 
-	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(key->heapkeyspace || key->scantid == NULL);
+	Assert(key->minusinfkey || key->heapkeyspace);
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
 	 * --- see NOTE above.
+	 *
+	 * A minus infinity key has all attributes truncated away, so this test is
+	 * redundant with the minus infinity attribute tie-breaker.  However, the
+	 * number of attributes in minus infinity tuples was not explicitly
+	 * represented as 0 until PostgreSQL v11, so an explicit offnum test is
+	 * still required.
 	 */
 	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -518,8 +541,10 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
+	ncmpkey = Min(ntupatts, key->keysz);
+	Assert(key->heapkeyspace || ncmpkey == key->keysz);
 	scankey = key->scankeys;
-	for (int i = 1; i <= key->keysz; i++)
+	for (int i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -570,8 +595,65 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * All non-truncated attributes (other than heap TID) were found to be
+	 * equal.  Treat truncated attributes as minus infinity when scankey has a
+	 * key attribute value that would otherwise be compared directly.
+	 *
+	 * Note: it doesn't matter if ntupatts includes non-key attributes;
+	 * scankey won't, so explicitly excluding non-key attributes isn't
+	 * necessary.
+	 */
+	if (key->keysz > ntupatts)
+		return 1;
+
+	/*
+	 * Use the heap TID attribute and scantid to try to break the tie.  The
+	 * rules are the same as any other key attribute -- only the
+	 * representation differs.  (This is also a convenient point to check if
+	 * the !minusinfkey optimization can be used.)
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (key->scantid == NULL)
+	{
+		/*
+		 * Most searches (all !minusinfkey searches) are not interested in
+		 * keys where minus infinity is explicitly represented, since that's a
+		 * sentinel value that never appears in non-pivot tuples.  It is safe
+		 * for these searches to have their scankey considered greater than a
+		 * truncated pivot tuple iff the scankey has equal values for
+		 * attributes up to and including the least significant untruncated
+		 * attribute in the pivot tuple.  The only would-be "match" that will
+		 * be "missed" is a single leaf page's high key (the leaf page whose
+		 * high key the affected pivot tuple's values originate from).
+		 *
+		 * This optimization prevents an extra leaf page visit when the
+		 * insertion scankey would otherwise be equal.  If this tiebreaker
+		 * wasn't performed, code like _bt_readpage() and _bt_readnextpage()
+		 * would often end up moving right having found no matches on the leaf
+		 * page that their search lands on initially.
+		 *
+		 * Note: the heap TID part of this test ensures that scankey is being
+		 * compared to a pivot tuple with one or more truncated key attributes
+		 * (often though not necessarily just the heap TID attribute).
+		 */
+		if (!key->minusinfkey && key->keysz == ntupatts && heapTid == NULL)
+			return 1;
+
+		/* All provided scankey arguments found to be equal */
+		return 0;
+	}
+
+	/*
+	 * Treat truncated heap TID as minus infinity, since scankey has a key
+	 * attribute value (scantid) that would otherwise be compared directly
+	 */
+	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+	if (heapTid == NULL)
+		return 1;
+
+	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+	return ItemPointerCompare(key->scantid, heapTid);
 }
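
The comparison order that falls out of all this can be modeled with a short
standalone program (integer attributes and a toy TID struct; NULL handling
and the !minusinfkey optimization are ignored, and the real code only ever
sets scantid when the scan key covers every key attribute).  Truncated
suffix attributes and a truncated heap TID both behave as minus infinity, so
a scan key carrying a scantid sorts after any pivot tuple that it otherwise
equals:

    #include <stdio.h>

    typedef struct
    {
        int     blkno;
        int     offset;
    } TidModel;

    typedef struct
    {
        int     natts;              /* untruncated key attributes */
        int     attrs[3];
        const TidModel *htid;       /* NULL means truncated (pivot tuple) */
    } TupleModel;

    static int
    tid_cmp(const TidModel *a, const TidModel *b)
    {
        if (a->blkno != b->blkno)
            return a->blkno < b->blkno ? -1 : 1;
        if (a->offset != b->offset)
            return a->offset < b->offset ? -1 : 1;
        return 0;
    }

    /*
     * Model of the comparison rules: compare the attributes both sides have,
     * treat attributes truncated away from the tuple as minus infinity, and
     * finally fall back on heap TID when the scan key carries a scantid.
     */
    static int
    compare(const int *skattrs, int keysz, const TidModel *scantid,
            const TupleModel *tup)
    {
        int     ncmp = keysz < tup->natts ? keysz : tup->natts;

        for (int i = 0; i < ncmp; i++)
        {
            if (skattrs[i] != tup->attrs[i])
                return skattrs[i] < tup->attrs[i] ? -1 : 1;
        }
        if (keysz > tup->natts)
            return 1;               /* truncated attribute: minus infinity */
        if (scantid == NULL)
            return 0;               /* no tie-breaker requested */
        if (tup->htid == NULL)
            return 1;               /* truncated heap TID: minus infinity */
        return tid_cmp(scantid, tup->htid);
    }

    int
    main(void)
    {
        int         skattrs[3] = {7, 42, 5};
        TidModel    scantid = {100, 3};
        TidModel    htid = {90, 1};
        TupleModel  pivot = {2, {7, 42, 0}, NULL};  /* suffix-truncated pivot */
        TupleModel  leaf = {3, {7, 42, 5}, &htid};  /* ordinary leaf tuple */

        printf("vs pivot: %d\n", compare(skattrs, 3, &scantid, &pivot)); /* 1 */
        printf("vs leaf:  %d\n", compare(skattrs, 3, &scantid, &leaf));  /* 1 */
        printf("no scantid vs leaf: %d\n",
               compare(skattrs, 3, NULL, &leaf));                        /* 0 */
        return 0;
    }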
 
 /*
@@ -1088,7 +1170,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/* Initialize remaining insertion scan key fields */
 	inskey.savebinsrch = inskey.restorebinsrch = false;
 	inskey.low = inskey.stricthigh = InvalidOffsetNumber;
+	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	inskey.minusinfkey = !inskey.heapkeyspace;
 	inskey.nextkey = nextkey;
+	inskey.scantid = NULL;
 	inskey.keysz = keysCount;
 
 	/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 759859c302..67cdb44cf5 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -746,6 +746,7 @@ _bt_sortaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -799,8 +800,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -817,27 +816,21 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	itupsz = MAXALIGN(itupsz);
 
 	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itupsz doesn't
-	 * include the ItemId.
+	 * Check whether the item can fit on a btree page at all.
 	 *
-	 * NOTE: similar code appears in _bt_insertonpg() to defend against
-	 * oversize items being inserted into an already-existing index. But
-	 * during creation of an index, we don't go through there.
+	 * Every newly built index will treat heap TID as part of the keyspace,
+	 * which imposes the requirement that new high keys must occasionally have
+	 * a heap TID appended within _bt_truncate().  That may leave a new pivot
+	 * tuple one or two MAXALIGN() quantums larger than the original first
+	 * right tuple it's derived from.  v4 deals with the problem by decreasing
+	 * the limit on the size of tuples inserted on the leaf level by the same
+	 * small amount.  Enforce the new v4+ limit on the leaf level, and the old
+	 * limit on internal levels, since pivot tuples may need to make use of
+	 * the reserved space.  This should never fail on internal pages.
 	 */
-	if (itupsz > BTMaxItemSize(npage))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itupsz, BTMaxItemSize(npage),
-						RelationGetRelationName(wstate->index)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(wstate->heap,
-									RelationGetRelationName(wstate->index))));
+	if (unlikely(itupsz > BTMaxItemSize(npage)))
+		_bt_check_third_page(wstate->index, wstate->heap,
+							 state->btps_level == 0, npage, itup);
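
For a rough sense of the space involved: an ItemPointerData is 6 bytes (a
4-byte block number plus a 2-byte offset number), which rounds up to a
single 8-byte quantum under the usual maximum alignment.  A tiny sanity
check of that arithmetic, using a stand-in for the real MAXALIGN() macro and
an assumed 8-byte alignment:

    #include <stdio.h>

    #define MAXIMUM_ALIGNOF 8       /* assumed; platform dependent */
    #define MAXALIGN(LEN) \
        (((LEN) + (MAXIMUM_ALIGNOF - 1)) & ~(MAXIMUM_ALIGNOF - 1))

    int
    main(void)
    {
        size_t  itemptr_sz = 4 + 2; /* block number + offset number */

        printf("heap TID needs %zu bytes after alignment\n",
               (size_t) MAXALIGN(itemptr_sz));  /* 8 */
        return 0;
    }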
 
 	/*
 	 * Check to see if page is "full".  It's definitely full if the item won't
@@ -883,24 +876,35 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
+			IndexTuple	lastleft;
 			IndexTuple	truncated;
 			Size		truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks
 			 * in internal pages are either negative infinity items, or get
 			 * their contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it more
+			 * likely that _bt_truncate() can truncate away more attributes,
+			 * whereas the split point passed to _bt_split() is chosen much
+			 * more delicately.  Suffix truncation is mostly useful because it
+			 * improves space utilization for workloads with random
+			 * insertions.  It doesn't seem worthwhile to add logic for
+			 * choosing a split point here for a benefit that is bound to be
+			 * much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
 			 * original high key, and add our own truncated high key at the
-			 * same offset.
+			 * same offset.  It's okay if the truncated tuple is slightly
+			 * larger due to containing a heap TID value, since this case is
+			 * known to _bt_check_third_page(), which reserves space.
 			 *
 			 * Note that the page layout won't be changed very much.  oitup is
 			 * already located at the physical beginning of tuple space, so we
@@ -908,7 +912,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_truncate(wstate->index, lastleft, oitup,
+									 wstate->inskey);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -927,8 +935,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -973,7 +982,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1032,8 +1041,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1126,6 +1136,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1133,7 +1145,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1150,6 +1161,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all
+				 * keys in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
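
To make the new ordering contract concrete: the _bt_load() merge step above and the tuplesort change further down reduce to the same rule, sketched here with a hypothetical helper (the per-attribute comparison itself is elided; only the tie-breaker is shown):

#include "postgres.h"
#include "access/itup.h"
#include "storage/itemptr.h"

/*
 * Sketch only: the comparison rule assumed by _bt_load() and
 * comparetup_index_btree() -- user-visible key attributes first, then heap
 * TID as the implicit last key attribute.  "keycmp" stands in for the
 * result of comparing the user-visible key attributes.
 */
static int32
example_tiebreak(IndexTuple itup1, IndexTuple itup2, int32 keycmp)
{
	if (keycmp != 0)
		return keycmp;			/* key attributes decided the order */

	/* heap TID breaks the tie; equal heap TIDs should be impossible */
	return ItemPointerCompare(&itup1->t_tid, &itup2->t_tid);
}
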
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index e010bcdcfa..15090b26d2 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,6 +49,8 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+			   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -56,9 +58,25 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		Result is intended for use with _bt_compare().  Callers that don't
- *		need to fill out the insertion scankey arguments (e.g. they use an own
- *		ad-hoc comparison routine) can pass a NULL index tuple.
+ *		When itup is a non-pivot tuple, the returned insertion scan key is
+ *		suitable for finding a place for it to go on the leaf level.  Pivot
+ *		tuples can be used to relocate a leaf page with a matching high key,
+ *		but then the caller needs to set the scan key's minusinfkey field.
+ *		This can be thought of as explicitly representing that absent
+ *		attributes in the scan key have minus infinity values.
+ *
+ *		Result is intended for use with _bt_compare() and _bt_truncate().
+ *		Callers that don't need to fill out the insertion scankey arguments
+ *		(e.g. they use their own ad-hoc comparison routine, or only need a
+ *		scankey for _bt_truncate()) can pass a NULL index tuple.  The
+ *		scankey will be initialized as if an "all truncated" pivot tuple
+ *		was passed instead.
+ *
+ *		Note that we may occasionally have to share lock the metapage to
+ *		determine whether or not the keys in the index are expected to be
+ *		unique (i.e. if this is a "heapkeyspace" index).  We assume a
+ *		heapkeyspace index when caller passes a NULL tuple, allowing index
+ *		build callers to avoid accessing the non-existent metapage.
  */
 BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
@@ -79,15 +97,38 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
-	 * We'll execute search using scan key constructed on key columns. Non-key
-	 * (INCLUDE index) columns are always omitted from scan keys.
+	 * We'll execute search using scan key constructed on key columns.
+	 * Truncated attributes and non-key attributes are omitted from the final
+	 * scan key.
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
+	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+
+	/*
+	 * Only heapkeyspace indexes support the "no minus infinity keys"
+	 * optimization.  !heapkeyspace indexes don't actually have minus infinity
+	 * attributes, but this allows us to avoid checking heapkeyspace
+	 * separately (explicit representation of number of key attributes in v3
+	 * indexes shouldn't confuse tie-breaker logic).
+	 *
+	 * There is never a need to explicitly represent truncated attributes as
+	 * having minus infinity values.  The only caller that may truly need to
+	 * search for negative infinity is the page deletion code.  It is
+	 * sufficient to omit trailing truncated attributes from the scankey
+	 * returned to that caller because caller relies on the fact that there
+	 * cannot be duplicate high keys in heapkeyspace indexes.  Caller also
+	 * opts out of the "no minus infinity key" optimization, so search moves
+	 * left on scankey-equal downlink in parent, allowing VACUUM caller to
+	 * reliably relocate leaf page undergoing deletion.
+	 */
+	key->minusinfkey = !key->heapkeyspace;
 	key->savebinsrch = key->restorebinsrch = false;
 	key->low = key->stricthigh = InvalidOffsetNumber;
 	key->nextkey = false;
 	key->keysz = Min(indnkeyatts, tupnatts);
+	key->scantid = key->heapkeyspace && itup ?
+		BTreeTupleGetHeapTID(itup) : NULL;
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -103,9 +144,9 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
 
 		/*
-		 * Key arguments built when caller provides no tuple are defensively
-		 * represented as NULL values, though they should still not
-		 * participate in comparisons.
+		 * Key arguments built from truncated attributes (or when caller
+		 * provides no tuple) are defensively represented as NULL values,
+		 * though they should still not participate in comparisons.
 		 */
 		if (i < tupnatts)
 			arg = index_getattr(itup, i + 1, itupdesc, &null);
@@ -2043,38 +2084,238 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that takes up more
+ * space than firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
  * Note that returned tuple's t_tid offset will hold the number of attributes
  * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * should only change truncated tuple's downlink.  Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID is appended) the size of the returned
+ * tuple is the size of the first right tuple plus an additional MAXALIGN()'d
+ * item pointer.  This guarantee is important, since callers need to stay
+ * under the 1/3 of a page restriction on tuple size.  If this routine is ever
+ * taught to truncate within an attribute/datum, it will need to avoid
+ * returning an enlarged tuple to caller when truncation + TOAST compression
+ * ends up enlarging the final datum.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			 BTScanInsert itup_key)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int16		natts = IndexRelationGetNumberOfAttributes(rel);
+	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+	IndexTuple	pivot;
+	ItemPointer pivotheaptid;
+	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples.  It's never okay to
+	 * truncate a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be kept in truncated tuple */
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
 
-	return truncated;
+#ifdef DEBUG_NO_TRUNCATE
+	/* Force truncation to be ineffective for testing purposes */
+	keepnatts = nkeyatts + 1;
+#endif
+
+	if (keepnatts <= natts)
+	{
+		IndexTuple	tidpivot;
+
+		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+
+		/*
+		 * If there is a distinguishing key attribute within the new pivot
+		 * tuple, there is no need to add an explicit heap TID attribute.
+		 */
+		if (keepnatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			return pivot;
+		}
+
+		/*
+		 * Only truncation of non-key attributes was possible, since key
+		 * attributes are all equal.  It's necessary to add a heap TID
+		 * attribute to the new pivot tuple.
+		 */
+		Assert(natts != nkeyatts);
+		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		/* cannot leak memory here */
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * We have to use heap TID as a unique-ifier in the new pivot tuple, since
+	 * no non-TID key attribute in the right item readily distinguishes the
+	 * right side of the split from the left side.  Use enlarged space that
+	 * holds a copy of first right tuple; place a heap TID value within the
+	 * extra space that remains at the end.
+	 *
+	 * nbtree conceptualizes this case as an inability to truncate away any
+	 * key attribute.  We must use an alternative representation of heap TID
+	 * within pivots because heap TID is only treated as an attribute within
+	 * nbtree (e.g., there is no explicit pg_attribute entry).
+	 */
+	Assert(itup_key->heapkeyspace);
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/*
+	 * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+	 * consider suffix truncation.  It seems like a good idea to follow that
+	 * example in cases where no truncation takes place -- use lastleft's heap
+	 * TID.  (This is also the closest value to negative infinity that's
+	 * legally usable.)
+	 */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  sizeof(ItemPointerData));
+	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+
+	/*
+	 * Lehman and Yao require that the downlink to the right page, which is to
+	 * be inserted into the parent page in the second phase of a page split, be
+	 * a strict lower bound on items on the right page, and a non-strict upper
+	 * bound for items on the left page.  Assert that heap TIDs follow these
+	 * invariants, since a heap TID value is apparently needed as a
+	 * tiebreaker.
+	 */
+#ifndef DEBUG_NO_TRUNCATE
+	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#else
+
+	/*
+	 * Those invariants aren't guaranteed to hold for lastleft + firstright
+	 * heap TID attribute values when they're considered here only because
+	 * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+	 * needed as a tiebreaker).  DEBUG_NO_TRUNCATE must therefore use a heap
+	 * TID value that always works as a strict lower bound for items to the
+	 * right.  In particular, it must avoid using firstright's leading key
+	 * attribute values along with lastleft's heap TID value when lastleft's
+	 * TID happens to be greater than firstright's TID.
+	 *
+	 * (We could just use all of lastleft instead, but that would complicate
+	 * caller's free space accounting, which makes the assumption that the new
+	 * pivot must be no larger than firstright plus a MAXALIGN()'d item
+	 * pointer.)
+	 */
+	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+
+	/*
+	 * Pivot heap TID should never be fully equal to firstright.  Note that
+	 * the pivot heap TID will still end up equal to lastleft's heap TID when
+	 * that's the only value that's legally usable.
+	 */
+	ItemPointerSetOffsetNumber(pivotheaptid,
+							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#endif
+
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point.  Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			   BTScanInsert itup_key)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keepnatts;
+	ScanKey		scankey;
+
+	/*
+	 * Be consistent about the representation of BTREE_VERSION 3 tuples across
+	 * Postgres versions; don't allow new pivot tuples to have truncated key
+	 * attributes there.  This keeps things consistent and simple for
+	 * verification tools that have to handle multiple versions.
+	 */
+	if (!itup_key->heapkeyspace)
+	{
+		Assert(nkeyatts != IndexRelationGetNumberOfAttributes(rel));
+		return nkeyatts;
+	}
+
+	scankey = itup_key->scankeys;
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+											scankey->sk_collation,
+											datum1,
+											datum2)) != 0)
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
 }
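
As an illustration of the return convention, here is a simplified model (not patch code; it substitutes plain integer arrays for index tuples and skips NULL handling):

#include "postgres.h"

/*
 * Simplified model of _bt_keep_natts(): return the number of leading
 * attributes to keep, which is nkeyatts + 1 when nothing distinguishes
 * lastleft from firstright.
 */
static int
example_keep_natts(const int *lastleft, const int *firstright, int nkeyatts)
{
	int			keepnatts = 1;

	for (int attnum = 1; attnum <= nkeyatts; attnum++)
	{
		if (lastleft[attnum - 1] != firstright[attnum - 1])
			break;
		keepnatts++;
	}

	/*
	 * With nkeyatts = 2: (5, 7) vs (5, 9) gives 2 (keep both attributes, no
	 * heap TID needed), while (5, 7) vs (5, 7) gives 3, telling
	 * _bt_truncate() to append a heap TID tie-breaker.
	 */
	return keepnatts;
}
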
 
 /*
@@ -2088,15 +2329,17 @@ _bt_nonkey_truncate(Relation rel, IndexTuple itup)
  * preferred to calling here.  That's usually more convenient, and is always
  * more explicit.  Call here instead when offnum's tuple may be a negative
  * infinity tuple that uses the pre-v11 on-disk representation, or when a low
- * context check is appropriate.
+ * context check is appropriate.  This routine is as strict as possible about
+ * what is expected on each version of btree.
  */
 bool
-_bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
+_bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 {
 	int16		natts = IndexRelationGetNumberOfAttributes(rel);
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2116,16 +2359,26 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Leaf tuples that are not the page high key (non-pivot tuples)
-			 * should never be truncated
+			 * Non-pivot tuples currently never use alternative heap TID
+			 * representation -- even those within heapkeyspace indexes
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+				return false;
+
+			/*
+			 * Leaf tuples that are not the page high key (non-pivot tuples)
+			 * should never be truncated.  (Note that tupnatts must have been
+			 * inferred, rather than coming from an explicit on-disk
+			 * representation.)
+			 */
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2135,8 +2388,15 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 */
 			Assert(!P_RIGHTMOST(opaque));
 
-			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			/*
+			 * !heapkeyspace high key tuple contains only key attributes. Note
+			 * that tupnatts will only have been explicitly represented in
+			 * !heapkeyspace indexes that happen to have non-key attributes.
+			 */
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2148,7 +2408,11 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * its high key) is its negative infinity tuple.  Negative
 			 * infinity tuples are always truncated to zero attributes.  They
 			 * are a particular kind of pivot tuple.
-			 *
+			 */
+			if (heapkeyspace)
+				return tupnatts == 0;
+
+			/*
 			 * The number of attributes won't be explicitly represented if the
 			 * negative infinity tuple was generated during a page split that
 			 * occurred with a version of Postgres before v11.  There must be
@@ -2159,18 +2423,109 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == 0 ||
+			return tupnatts == 0 ||
 				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
 		{
 			/*
-			 * Tuple contains only key attributes despite on is it page high
-			 * key or not
+			 * !heapkeyspace downlink tuple with separator key contains only
+			 * key attributes.  Note that tupnatts will only have been
+			 * explicitly represented in !heapkeyspace indexes that happen to
+			 * have non-key attributes.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 
 	}
+
+	/* Handle heapkeyspace pivot tuples (excluding minus infinity items) */
+	Assert(heapkeyspace);
+
+	/*
+	 * Explicit representation of the number of attributes is mandatory with
+	 * heapkeyspace index pivot tuples, regardless of whether or not there are
+	 * non-key attributes.
+	 */
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+
+	/*
+	 * Heap TID is a tie-breaker key attribute, so it cannot be untruncated
+	 * when any other key attribute is truncated
+	 */
+	if (BTreeTupleGetHeapTID(itup) != NULL && tupnatts != nkeyatts)
+		return false;
+
+	/*
+	 * Pivot tuple must have at least one untruncated key attribute (minus
+	 * infinity pivot tuples are the only exception).  Pivot tuples can never
+	 * represent that there is a value present for a key attribute that
+	 * exceeds pg_index.indnkeyatts for the index.
+	 */
+	return tupnatts > 0 && tupnatts <= nkeyatts;
+}
+
+/*
+ *	_bt_check_third_page() -- check whether tuple fits on a btree page at all.
+ *
+ * We actually need to be able to fit three items on every page, so restrict
+ * any one item to 1/3 the per-page available space.  Note that itemsz should
+ * not include the ItemId overhead.
+ *
+ * It might be useful to apply TOAST methods rather than throw an error here.
+ * Using out of line storage would break assumptions made by suffix truncation
+ * and by contrib/amcheck, though.
+ */
+void
+_bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
+					 Page page, IndexTuple newtup)
+{
+	Size		itemsz;
+	BTPageOpaque opaque;
+
+	itemsz = MAXALIGN(IndexTupleSize(newtup));
+
+	/* Double check item size against limit */
+	if (itemsz <= BTMaxItemSize(page))
+		return;
+
+	/*
+	 * Tuple is probably too large to fit on page, but it's possible that the
+	 * index uses version 2 or version 3, or that page is an internal page, in
+	 * which case a slightly higher limit applies.
+	 */
+	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
+		return;
+
+	/*
+	 * Internal page insertions cannot fail here, because that would mean that
+	 * an earlier leaf level insertion that should have failed didn't
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (!P_ISLEAF(opaque))
+		elog(ERROR, "cannot insert oversized tuple of size %zu on internal page of index \"%s\"",
+			 itemsz, RelationGetRelationName(rel));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+			 errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
+					itemsz,
+					needheaptidspace ? BTREE_VERSION : BTREE_NOVAC_VERSION,
+					needheaptidspace ? BTMaxItemSize(page) :
+					BTMaxItemSizeNoHeapTid(page),
+					RelationGetRelationName(rel)),
+			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+					   ItemPointerGetBlockNumber(&newtup->t_tid),
+					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   RelationGetRelationName(heap)),
+			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+					 "Consider a function index of an MD5 hash of the value, "
+					 "or use full text indexing."),
+			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
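
For a sense of the arithmetic behind _bt_check_third_page() and the new BTMaxItemSize() definition (see nbtree.h further down), here is an illustrative helper.  The function name is made up, and the 2704/2712 figures assume the default 8kB BLCKSZ:

#include "postgres.h"
#include "access/nbtree.h"

/*
 * Illustration only: the leaf-level limit reserves one MAXALIGN()'d item
 * pointer per tuple, so a worst-case pivot built by _bt_truncate()
 * (firstright plus an appended heap TID) still fits under the limit that
 * internal pages must obey.  With 8kB pages that is 2704 vs 2712 bytes.
 */
static void
example_show_limits(Page page)
{
	Size		leaf_limit = BTMaxItemSize(page);
	Size		pivot_limit = BTMaxItemSizeNoHeapTid(page);
	Size		worst_pivot = leaf_limit + MAXALIGN(sizeof(ItemPointerData));

	elog(DEBUG1, "leaf limit %zu, worst-case pivot %zu, internal limit %zu",
		 leaf_limit, worst_pivot, pivot_limit);
}
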
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index b0666b42df..876ff0c40f 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -103,7 +103,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 
 	md = BTPageGetMeta(metapg);
 	md->btm_magic = BTREE_MAGIC;
-	md->btm_version = BTREE_VERSION;
+	md->btm_version = xlrec->version;
 	md->btm_root = xlrec->root;
 	md->btm_level = xlrec->level;
 	md->btm_fastroot = xlrec->fastroot;
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -284,6 +268,8 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		OffsetNumber off;
 		IndexTuple	newitem = NULL;
 		Size		newitemsz = 0;
+		IndexTuple	left_hikey = NULL;
+		Size		left_hikeysz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 8d5c6ae0ab..fcac0cd8a9 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index f97a82ae7b..5b7637883e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4057,9 +4057,10 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required for
+	 * btree indexes, since heap TID is treated as an implicit last key
+	 * attribute in order to ensure that all keys in the index are physically
+	 * unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
@@ -4076,6 +4077,9 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
@@ -4128,6 +4132,9 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index dc2eafb566..e7293bbaec 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -112,18 +112,53 @@ typedef struct BTMetaPageData
 #define BTPageGetMeta(p) \
 	((BTMetaPageData *) PageGetContents(p))
 
+/*
+ * Btree version 4 (used by indexes initialized by PostgreSQL v12) made
+ * general changes to the on-disk representation to add support for
+ * heapkeyspace semantics, necessitating a REINDEX to get heapkeyspace
+ * semantics in pg_upgrade scenarios.  We continue to offer support for
+ * BTREE_MIN_VERSION in order to support upgrades from PostgreSQL versions
+ * up to and including v10 to v12+ without requiring a REINDEX.
+ * Similarly, we continue to offer support for BTREE_NOVAC_VERSION to
+ * support upgrades from v11 to v12+ without requiring a REINDEX.
+ *
+ * We maintain PostgreSQL v11's ability to upgrade from BTREE_MIN_VERSION
+ * to BTREE_NOVAC_VERSION automatically.  v11's "no vacuuming" enhancement
+ * (the ability to skip full index scans during vacuuming) only requires
+ * two new metapage fields, which makes it possible to upgrade at any
+ * point that the metapage must be updated anyway (e.g. during a root page
+ * split).  Note also that there happened to be no changes in metapage
+ * layout for btree version 4.  All current metapage fields should have
+ * valid values set when a metapage WAL record is replayed.
+ *
+ * It's convenient to consider the "no vacuuming" enhancement (metapage
+ * layout compatibility) separately from heapkeyspace semantics, since
+ * each issue affects different areas.  This is just a convention; in
+ * practice a heapkeyspace index is always also a "no vacuuming" index.
+ */
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
+#define BTREE_VERSION	4		/* current version number */
 #define BTREE_MIN_VERSION	2	/* minimal supported version number */
+#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_truncate() will need to enlarge
+ * a heap index tuple to make space for a tie-breaker heap TID
+ * attribute, which we account for here.
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*sizeof(ItemPointerData)) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeNoHeapTid(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
@@ -203,22 +238,25 @@ typedef struct BTMetaPageData
  * their item pointer offset field, since pivot tuples never need to store a
  * real offset (downlinks only need to store a block number).  The offset
  * field only stores the number of attributes when the INDEX_ALT_TID_MASK
- * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * bit is set, though that number doesn't include the trailing heap TID
+ * attribute sometimes stored in pivot tuples -- that's represented by the
+ * presence of BT_HEAP_TID_ATTR.  INDEX_ALT_TID_MASK is only used for pivot
+ * tuples at present, though it's possible that it will be used within
+ * non-pivot tuples in the future.  All pivot tuples must have
+ * INDEX_ALT_TID_MASK set as of BTREE_VERSION 4.
  *
  * The 12 least significant offset bits are used to represent the number of
- * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
- * for future use (BT_RESERVED_OFFSET_MASK bits). BT_N_KEYS_OFFSET_MASK should
- * be large enough to store any number <= INDEX_MAX_KEYS.
+ * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 status bits
+ * (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for future
+ * use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any number of
+ * attributes <= INDEX_MAX_KEYS.
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
+
+/* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +279,16 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.  Note also that the number of key attributes must be
+ * explicitly represented in heapkeyspace pivot tuples.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +297,46 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
+ * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
+ * (since all pivot tuples must, as of BTREE_VERSION 4).  When non-pivot
+ * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
+ * probably also contain a heap TID at the end of the tuple.  We currently
+ * assume that a tuple with INDEX_ALT_TID_MASK set is a pivot tuple within
+ * heapkeyspace indexes (and that a tuple without it set must be a non-pivot
+ * tuple), but it might also be used by non-pivot tuples in the future.
+ * pg_upgrade'd !heapkeyspace indexes only set INDEX_ALT_TID_MASK in pivot
+ * tuples that actually originated with the truncation of one or more
+ * attributes.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   sizeof(ItemPointerData)) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -326,25 +402,55 @@ typedef BTStackData *BTStack;
  * _bt_search.  For details on its mutable state, see _bt_binsrch and
  * _bt_findinsertloc.
  *
+ * heapkeyspace indicates if we expect all keys in the index to be unique by
+ * treating heap TID as a tie-breaker attribute (i.e. the index is
+ * BTREE_VERSION 4+).  scantid should never be set when index is not a
+ * heapkeyspace index.
+ *
+ * minusinfkey controls an optimization used by heapkeyspace indexes.
+ * Searches that are not specifically interested in keys with the value minus
+ * infinity (all searches bar those performed by VACUUM for page deletion)
+ * apply the optimization by setting the field to false.  The optimization
+ * avoids unnecessarily reading the left sibling of the leaf page that
+ * matching tuples can appear on first.  Work is saved when the insertion
+ * scankey happens to search on all the untruncated "separator" key attributes
+ * for some pivot tuple, without also providing a key value for a remaining
+ * truncated-in-pivot-tuple attribute.  Reasoning about minus infinity values
+ * specifically allows this case to use a special tie-breaker, guiding search
+ * right instead of left on the next level down.  This is particularly likely
+ * to help in the common case where insertion scankey has no scantid but has
+ * values for all other attributes, especially with indexes that happen to
+ * have few distinct values (once heap TID is excluded) on each leaf page.
+ *
  * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
  * locate the first item >= scankey.  When nextkey is true, they will locate
  * the first item > scan key.
  *
- * keysz is the number of insertion scankeys present.
+ * scantid is the heap TID that is used as a final tie-breaker attribute,
+ * which may be set to NULL to indicate its absence.  When inserting new
+ * tuples, it must be set, since every tuple in the tree unambiguously belongs
+ * in one exact position, even when there are entries in the tree that are
+ * considered duplicates by external code.  Unique insertions set scantid only
+ * after unique checking indicates that it's safe to insert.  Despite the
+ * representational difference, scantid is just another insertion scankey to
+ * routines like _bt_search.
  *
- * scankeys is an array of scan key entries for attributes that are compared.
- * During insertion, there must be a scan key for every attribute, but when
- * starting a regular index scan some can be omitted.  The array is used as a
- * flexible array member, though it's sized in a way that makes it possible to
- * use stack allocations.  See nbtree/README for full details.
+ * keysz is the number of insertion scankeys present, not including scantid.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared
+ * before scantid (user-visible attributes).  During insertion, there must be
+ * a scan key for every attribute, but when starting a regular index scan some
+ * can be omitted.  The array is used as a flexible array member, though it's
+ * sized in a way that makes it possible to use stack allocations.  See
+ * nbtree/README for full details.
  */
 
 typedef struct BTScanInsertData
 {
 	/*
 	 * Mutable state used by _bt_binsrch to inexpensively repeat a binary
-	 * search on the leaf level.  Only used for insertions where
-	 * _bt_check_unique is called.
+	 * search on the leaf level when only scantid has changed.  Only used for
+	 * insertions where _bt_check_unique is called.
 	 */
 	bool		savebinsrch;
 	bool		restorebinsrch;
@@ -352,7 +458,10 @@ typedef struct BTScanInsertData
 	OffsetNumber stricthigh;
 
 	/* State used to locate a position at the leaf level */
+	bool		heapkeyspace;
+	bool		minusinfkey;
 	bool		nextkey;
+	ItemPointer scantid;		/* tiebreaker for scankeys */
 	int			keysz;			/* Size of scankeys */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
 } BTScanInsertData;
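
A rough sketch of the calling convention implied by the comments above (the function name is invented; the real control flow lives in nbtinsert.c):

#include "postgres.h"
#include "access/nbtree.h"
#include "utils/rel.h"

/*
 * Sketch only: how an insertion is expected to use scantid.  Unique
 * insertions search without it first, then set it once it is known to be
 * safe to insert.
 */
static void
example_unique_insert_key(Relation rel, IndexTuple itup)
{
	BTScanInsert itup_key = _bt_mkscankey(rel, itup);

	/* search on user-visible key attributes only while checking uniqueness */
	itup_key->scantid = NULL;

	/* ... _bt_check_unique() would be consulted here ... */

	/* heap TID then picks out the one exact position on the leaf level */
	itup_key->scantid = BTreeTupleGetHeapTID(itup);

	pfree(itup_key);
}
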
@@ -582,6 +691,7 @@ extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
+extern bool _bt_heapkeyspace(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -635,8 +745,12 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
-extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+			 IndexTuple firstright, BTScanInsert itup_key);
+extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
+				OffsetNumber offnum);
+extern void _bt_check_third_page(Relation rel, Relation heap,
+					 bool needheaptidspace, Page page, IndexTuple newtup);
 
 /*
  * prototypes for functions in nbtvalidate.c
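
The macros above combine as follows when a pivot is read back; a hedged sketch (hypothetical function name) of the rules that _bt_check_natts() enforces:

#include "postgres.h"
#include "access/nbtree.h"
#include "utils/rel.h"

/*
 * Illustration only: interpreting a heapkeyspace pivot tuple produced by
 * _bt_truncate().
 */
static bool
example_pivot_is_sane(Relation rel, IndexTuple pivot)
{
	int			tupnatts = BTreeTupleGetNAtts(pivot, rel);
	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);

	/* truncated key attributes behave as minus infinity in _bt_compare() */
	if (tupnatts > nkeyatts)
		return false;

	/*
	 * A tie-breaker heap TID can only be present when no user-visible key
	 * attribute was truncated away.
	 */
	if (BTreeTupleGetHeapTID(pivot) != NULL && tupnatts != nkeyatts)
		return false;

	return true;
}
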
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index a605851c98..a4cbdff283 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,7 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50 and 0x60 are unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -47,6 +46,7 @@
  */
 typedef struct xl_btree_metadata
 {
+	uint32		version;
 	BlockNumber root;
 	uint32		level;
 	BlockNumber fastroot;
@@ -82,20 +82,16 @@ typedef struct xl_btree_insert
  *
  * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
  * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always explicitly log the left page high key value.
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * In the _R variants, the new item is one of the right page's tuples.  An
+ * IndexTuple representing the HIKEY of the left page always follows.  With
+ * suffix truncation, the leaf level high key is no longer necessarily the
+ * same as the leftmost key in the new right page, so it must be logged too.
  *
  * Backup Blk 1: new right page
  *
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index 0bd48dc5a0..46c1002e5a 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -128,28 +128,22 @@ select proname from pg_proc where proname like E'RI\\_FKey%del' order by 1;
 (5 rows)
 
 --
--- Test B-tree page deletion. In particular, deleting a non-leaf page.
+-- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
--- First create a tree that's at least four levels deep. The text inserted
--- is long and poorly compressible. That way only a few index tuples fit on
--- each page, allowing us to get a tall tree with fewer pages.
+-- First create a tree that's at least three levels deep (i.e. has one level
+-- between the root and leaf levels). The text inserted is long.  It won't be
+-- compressed because we use plain storage in the table.  Only a few index
+-- tuples fit on each internal page, allowing us to get a tall tree with few
+-- pages.  (A tall tree is required to trigger caching.)
+--
+-- The text column must be the leading column in the index, since suffix
+-- truncation would otherwise truncate tuples on internal pages, leaving us
+-- with a short tree.
 create table btree_tall_tbl(id int4, t text);
-create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
-insert into btree_tall_tbl
-  select g, g::text || '_' ||
-          (select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
-from generate_series(1, 100) g;
--- Delete most entries, and vacuum. This causes page deletions.
-delete from btree_tall_tbl where id < 950;
-vacuum btree_tall_tbl;
---
--- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
--- WAL record type). This happens when a "fast root" page is split.
---
--- The vacuum above should've turned the leaf page into a fast root. We just
--- need to insert some rows to cause the fast root page to split.
-insert into btree_tall_tbl (id, t)
-  select g, repeat('x', 100) from generate_series(1, 500) g;
+alter table btree_tall_tbl alter COLUMN t set storage plain;
+create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
+insert into btree_tall_tbl select g, repeat('x', 250)
+from generate_series(1, 130) g;
 --
 -- Test vacuum_cleanup_index_scale_factor
 --
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 4932869c19..baa4e515f7 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3207,11 +3207,22 @@ explain (costs off)
 CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 --
+-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
+-- WAL record type). This happens when a "fast root" page is split.  This
+-- also creates coverage for nbtree FSM page recycling.
+--
+-- The vacuum above should've turned the leaf page into a fast root. We just
+-- need to insert some rows to cause the fast root page to split.
+INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
+--
 -- REINDEX (VERBOSE)
 --
 CREATE TABLE reindex_verbose(id integer primary key);
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8e50f8ffbb..8d31110b87 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 -- doesn't work: grant still exists
 DROP USER regress_dep_user1;
 ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
-DETAIL:  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
+DETAIL:  privileges for table deptest1
 privileges for database regression
-privileges for table deptest1
+owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
 DROP OWNED BY regress_dep_user1;
 DROP USER regress_dep_user1;
 \set VERBOSITY terse
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index 0e32d5c427..ac41419c7b 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -187,9 +187,9 @@ ERROR:  event trigger "regress_event_trigger" does not exist
 -- should fail, regress_evt_user owns some objects
 drop role regress_evt_user;
 ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
-DETAIL:  owner of event trigger regress_event_trigger3
+DETAIL:  owner of user mapping for regress_evt_user on server useless_server
 owner of default privileges on new relations belonging to role regress_evt_user
-owner of user mapping for regress_evt_user on server useless_server
+owner of event trigger regress_event_trigger3
 -- cleanup before next test
 -- these are all OK; the second one should emit a NOTICE
 drop event trigger if exists regress_event_trigger2;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 4d82d3a7e8..9c763ec184 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
 RESET ROLE;
 DROP ROLE regress_test_indirect;                            -- ERROR
 ERROR:  role "regress_test_indirect" cannot be dropped because some objects depend on it
-DETAIL:  owner of server s1
-privileges for foreign-data wrapper foo
+DETAIL:  privileges for foreign-data wrapper foo
+owner of server s1
 \des+
                                                                                  List of foreign servers
  Name |           Owner           | Foreign-data wrapper |                   Access privileges                   |  Type  | Version |             FDW options              | Description 
@@ -1998,9 +1998,9 @@ DROP TABLE temp_parted;
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 ERROR:  role "regress_test_role" cannot be dropped because some objects depend on it
-DETAIL:  privileges for server s4
+DETAIL:  owner of user mapping for regress_test_role on server s6
 privileges for foreign-data wrapper foo
-owner of user mapping for regress_test_role on server s6
+privileges for server s4
 DROP SERVER t1 CASCADE;
 NOTICE:  drop cascades to user mapping for public on server t1
 DROP USER MAPPING FOR regress_test_role SERVER s6;
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index 1d12b01068..06fe44d39a 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3502,8 +3502,8 @@ SELECT refclassid::regclass, deptype
 SAVEPOINT q;
 DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
 ERROR:  role "regress_rls_eve" cannot be dropped because some objects depend on it
-DETAIL:  target of policy p on table tbl1
-privileges for table tbl1
+DETAIL:  privileges for table tbl1
+target of policy p on table tbl1
 ROLLBACK TO q;
 ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
 SAVEPOINT q;
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 21171f7762..f6be2b6b1f 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -66,32 +66,23 @@ set enable_bitmapscan to true;
 select proname from pg_proc where proname like E'RI\\_FKey%del' order by 1;
 
 --
--- Test B-tree page deletion. In particular, deleting a non-leaf page.
+-- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
 
--- First create a tree that's at least four levels deep. The text inserted
--- is long and poorly compressible. That way only a few index tuples fit on
--- each page, allowing us to get a tall tree with fewer pages.
+-- First create a tree that's at least three levels deep (i.e. has one level
+-- between the root and leaf levels). The text inserted is long.  It won't be
+-- compressed because we use plain storage in the table.  Only a few index
+-- tuples fit on each internal page, allowing us to get a tall tree with few
+-- pages.  (A tall tree is required to trigger caching.)
+--
+-- The text column must be the leading column in the index, since suffix
+-- truncation would otherwise truncate tuples on internal pages, leaving us
+-- with a short tree.
 create table btree_tall_tbl(id int4, t text);
-create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
-insert into btree_tall_tbl
-  select g, g::text || '_' ||
-          (select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
-from generate_series(1, 100) g;
-
--- Delete most entries, and vacuum. This causes page deletions.
-delete from btree_tall_tbl where id < 950;
-vacuum btree_tall_tbl;
-
---
--- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
--- WAL record type). This happens when a "fast root" page is split.
---
-
--- The vacuum above should've turned the leaf page into a fast root. We just
--- need to insert some rows to cause the fast root page to split.
-insert into btree_tall_tbl (id, t)
-  select g, repeat('x', 100) from generate_series(1, 500) g;
+alter table btree_tall_tbl alter COLUMN t set storage plain;
+create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
+insert into btree_tall_tbl select g, repeat('x', 250)
+from generate_series(1, 130) g;
 
 --
 -- Test vacuum_cleanup_index_scale_factor
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 59da6b6592..b2ea3bfedb 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1142,11 +1142,23 @@ explain (costs off)
 CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 
+--
+-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
+-- WAL record type). This happens when a "fast root" page is split.  This
+-- also creates coverage for nbtree FSM page recycling.
+--
+-- The vacuum above should've turned the leaf page into a fast root. We just
+-- need to insert some rows to cause the fast root page to split.
+INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
+
 --
 -- REINDEX (VERBOSE)
 --
-- 
2.17.1

Attachment: v12-0001-Refactor-nbtree-insertion-scankeys.patch (application/x-patch)
From 0d5787e7dc3f3b1af8a65ac27606aa930def1c2a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 29 Dec 2018 15:34:48 -0800
Subject: [PATCH v12 1/7] Refactor nbtree insertion scankeys.

Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.

Use the new struct to store mutable state about an in-progress binary
search, rather than having _bt_check_unique() callers cache
_bt_binsrch() effort in an ad-hoc manner.  This makes it easy to add a
new optimization: _bt_check_unique() now falls out of its loop
immediately in the common case where it's already clear that there
couldn't possibly be a duplicate.  More importantly, the new
_bt_check_unique() scheme makes it a lot easier to manage cached binary
search effort afterwards, from within _bt_findinsertloc().  This is
needed for the upcoming patch to make nbtree tuples unique by treating
heap TID as a final tie-breaker column.

Based on a suggestion by Andrey Lepikhov.
---
 contrib/amcheck/verify_nbtree.c       |  52 ++--
 src/backend/access/nbtree/nbtinsert.c | 334 ++++++++++++++++----------
 src/backend/access/nbtree/nbtpage.c   |  13 +-
 src/backend/access/nbtree/nbtsearch.c | 168 ++++++++-----
 src/backend/access/nbtree/nbtsort.c   |   8 +-
 src/backend/access/nbtree/nbtutils.c  |  98 +++-----
 src/backend/utils/sort/tuplesort.c    |  16 +-
 src/include/access/nbtree.h           |  61 ++++-
 8 files changed, 430 insertions(+), 320 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 964200a767..053ac9d192 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -126,9 +126,9 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
-static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+static BTScanInsert bt_right_page_check_scankey(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
@@ -138,14 +138,14 @@ static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber upperbound);
 static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber lowerbound);
 static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
+							   BTScanInsert key,
+							   Page nontarget,
 							   OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
 
@@ -837,8 +837,8 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		BTScanInsert skey;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -1029,7 +1029,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
-			ScanKey		rightkey;
+			BTScanInsert	rightkey;
 
 			/* Get item in next/right page */
 			rightkey = bt_right_page_check_scankey(state);
@@ -1081,7 +1081,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, skey, childblock);
 		}
 	}
 
@@ -1110,11 +1110,12 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
+static BTScanInsert
 bt_right_page_check_scankey(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
+	IndexTuple	firstitup;
 	BlockNumber targetnext;
 	Page		rightpage;
 	OffsetNumber nline;
@@ -1302,8 +1303,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * Return first real item scankey.  Note that this relies on right page
 	 * memory remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
+	return _bt_mkscankey(state->rel, firstitup);
 }
 
 /*
@@ -1316,8 +1317,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  * verification this way around is much more practical.
  */
 static void
-bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1422,8 +1423,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1863,13 +1863,12 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
+invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
@@ -1882,13 +1881,12 @@ invariant_leq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
+invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
 	return cmp >= 0;
 }
@@ -1904,14 +1902,12 @@ invariant_geq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							   Page nontarget, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
 	return cmp <= 0;
 }
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 5c2b8034f5..b3fbba276d 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -51,19 +51,19 @@ typedef struct
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
-static TransactionId _bt_check_unique(Relation rel, IndexTuple itup,
-				 Relation heapRel, Buffer buf, OffsetNumber offset,
-				 ScanKey itup_scankey,
+static TransactionId _bt_check_unique(Relation rel, BTScanInsert itup_key,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken);
-static void _bt_findinsertloc(Relation rel,
+static OffsetNumber _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_key,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool checkingunique,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel);
+static bool _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -83,8 +83,8 @@ static void _bt_checksplitloc(FindSplitData *state,
 				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey);
+static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key,
+			Page page, OffsetNumber offnum);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
@@ -97,7 +97,9 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
  *		UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
  *		For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- *		don't actually insert.
+ *		don't actually insert.  If rel is a unique index, then every call
+ *		here is a checkingunique call (i.e. every call does a duplicate
+ *		check, though perhaps only a tentative check).
  *
  *		The result value is only significant for UNIQUE_CHECK_PARTIAL:
  *		it must be true if the entry is known unique, else false.
@@ -110,18 +112,14 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel)
 {
 	bool		is_unique = false;
-	int			indnkeyatts;
-	ScanKey		itup_scankey;
+	BTScanInsert itup_key;
 	BTStack		stack = NULL;
 	Buffer		buf;
-	OffsetNumber offset;
 	bool		fastpath;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	Assert(indnkeyatts != 0);
+	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);
 
 	/* we need an insertion scan key to do our search, so build one */
-	itup_scankey = _bt_mkscankey(rel, itup);
+	itup_key = _bt_mkscankey(rel, itup);
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -144,7 +142,6 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	 */
 top:
 	fastpath = false;
-	offset = InvalidOffsetNumber;
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
 		Size		itemsz;
@@ -179,8 +176,7 @@ top:
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, itup_key, page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -219,8 +215,7 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, itup_key, &buf, BT_WRITE, NULL);
 	}
 
 	/*
@@ -244,13 +239,12 @@ top:
 	 * let the tuple in and return false for possibly non-unique, or true for
 	 * definitely unique.
 	 */
-	if (checkUnique != UNIQUE_CHECK_NO)
+	if (checkingunique)
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
-		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
+		xwait = _bt_check_unique(rel, itup_key, itup, heapRel, buf,
 								 checkUnique, &is_unique, &speculativeToken);
 
 		if (TransactionIdIsValid(xwait))
@@ -277,6 +271,8 @@ top:
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		OffsetNumber newitemoff;
+
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -287,10 +283,16 @@ top:
 		 * attributes are not considered part of the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
-		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+
+		/*
+		 * Do the insertion.  Note that itup_key contains mutable state used
+		 * by _bt_check_unique to help _bt_findinsertloc avoid repeating its
+		 * binary search.  !checkingunique case must start own binary search.
+		 */
+		newitemoff = _bt_findinsertloc(rel, itup_key, &buf, checkingunique,
+									   itup, stack, heapRel);
+		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, newitemoff,
+					   false);
 	}
 	else
 	{
@@ -301,7 +303,7 @@ top:
 	/* be tidy */
 	if (stack)
 		_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
+	pfree(itup_key);
 
 	return is_unique;
 }
@@ -309,9 +311,9 @@ top:
 /*
  *	_bt_check_unique() -- Check for violation of unique index constraint
  *
- * offset points to the first possible item that could conflict. It can
- * also point to end-of-page, which means that the first tuple to check
- * is the first tuple on the next page.
+ * Sets state in itup_key sufficient for later _bt_findinsertloc() call to
+ * reuse most of the work of our initial binary search to find conflicting
+ * tuples.
  *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
@@ -326,14 +328,14 @@ top:
  * core code must redo the uniqueness check later.
  */
 static TransactionId
-_bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
-				 Buffer buf, OffsetNumber offset, ScanKey itup_scankey,
+_bt_check_unique(Relation rel, BTScanInsert itup_key,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	SnapshotData SnapshotDirty;
+	OffsetNumber offset;
 	OffsetNumber maxoff;
 	Page		page;
 	BTPageOpaque opaque;
@@ -349,9 +351,17 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
+	/*
+	 * Save binary search bounds.  Note that this is also used within
+	 * _bt_findinsertloc() later.
+	 */
+	itup_key->savebinsrch = true;
+	offset = _bt_binsrch(rel, itup_key, buf);
+
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
+	Assert(itup_key->low == offset);
 	for (;;)
 	{
 		ItemId		curitemid;
@@ -364,6 +374,26 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 		 */
 		if (offset <= maxoff)
 		{
+			/*
+			 * Fastpath: _bt_binsrch() search bounds can be used to limit our
+			 * consideration to items that are definitely duplicates in most
+			 * cases (though not when original page is empty, or when initial
+			 * offset is past the end of the original page, which may indicate
+			 * that we'll have to examine a second or subsequent page).
+			 *
+			 * Note that this optimization avoids calling _bt_isequal()
+			 * entirely when there are no duplicates, provided initial offset
+			 * isn't past end of the initial page (and provided page has at
+			 * least one item).
+			 */
+			if (nbuf == InvalidBuffer && offset == itup_key->stricthigh)
+			{
+				Assert(itup_key->low >= P_FIRSTDATAKEY(opaque));
+				Assert(itup_key->low <= itup_key->stricthigh);
+				Assert(!_bt_isequal(itupdesc, itup_key, page, offset));
+				break;
+			}
+
 			curitemid = PageGetItemId(page, offset);
 
 			/*
@@ -378,7 +408,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			 * first, so that we didn't actually get to exit any sooner
 			 * anyway. So now we just advance over killed items as quickly as
 			 * we can. We only apply _bt_isequal() when we get to a non-killed
-			 * item or the end of the page.
+			 * item.
 			 */
 			if (!ItemIdIsDead(curitemid))
 			{
@@ -391,7 +421,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 				 * in real comparison, but only for ordering/finding items on
 				 * pages. - vadim 03/24/97
 				 */
-				if (!_bt_isequal(itupdesc, page, offset, indnkeyatts, itup_scankey))
+				if (!_bt_isequal(itupdesc, itup_key, page, offset))
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
@@ -552,11 +582,14 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
+			int			highkeycmp;
+
 			/* If scankey == hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+			Assert(highkeycmp <= 0);
+			if (highkeycmp != 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -611,39 +644,40 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
  *		Once we have chosen the page to put the key on, we'll insert it before
  *		any existing equal keys because of the way _bt_binsrch() works.
  *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
+ *		_bt_check_unique() callers arrange for their insertion scan key to
+ *		save the progress of the last binary search performed.  No additional
+ *		binary search comparisons occur in the common case where there was no
+ *		existing duplicate tuple, though we may occasionally still not be able
+ *		to reuse their work for our own reasons.  Even when there are garbage
+ *		duplicates, very few binary search comparisons will be performed
+ *		without being strictly necessary.
  *
- *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
- *		page returned to the caller is exclusively locked instead.
+ *		The caller should hold an exclusive lock on *bufptr in all cases.  On
+ *		exit,  bufptr points to the chosen insert location in all cases.  If
+ *		we have to move right, the lock and pin on the original page will be
+ *		released, and the new page returned to the caller is exclusively
+ *		locked instead.  In any case, we return the offset that caller should
+ *		use to insert into the buffer pointed to by bufptr on return.
  *
- *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.  It has to happen here, since it may invalidate a
+ *		_bt_check_unique() caller's cached binary search work.
  */
-static void
+static OffsetNumber
 _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_key,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool checkingunique,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel)
 {
 	Buffer		buf = *bufptr;
 	Page		page = BufferGetPage(buf);
+	bool		restorebinsrch = checkingunique;
 	Size		itemsz;
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
 	OffsetNumber newitemoff;
-	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -672,55 +706,36 @@ _bt_findinsertloc(Relation rel,
 				 errtableconstraint(heapRel,
 									RelationGetRelationName(rel))));
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
-	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	for (;;)
 	{
+		int			cmpval;
 		Buffer		rbuf;
 		BlockNumber rblkno;
 
 		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
+		 * The checkingunique (restorebinsrch) case may well have established
+		 * bounds within _bt_check_unique()'s binary search that preclude the
+		 * need for a further high key check.  This fastpath isn't used when
+		 * there are no items on the existing page (other than high key), or
+		 * when it looks like the new item belongs last on the page, but it
+		 * might go on a later page instead.
 		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
-		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
+		if (restorebinsrch && itup_key->low <= itup_key->stricthigh &&
+			itup_key->stricthigh <= PageGetMaxOffsetNumber(page))
+			break;
 
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
-		}
+		if (P_RIGHTMOST(lpageop))
+			break;
+		cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
 
 		/*
-		 * nope, so check conditions (b) and (c) enumerated above
+		 * May have to handle case where there is a choice of which page to
+		 * place new tuple on, and we must balance space utilization as best
+		 * we can.
 		 */
-		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
+												&restorebinsrch, itemsz))
 			break;
 
 		/*
@@ -763,27 +778,98 @@ _bt_findinsertloc(Relation rel,
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		restorebinsrch = false;
+	}
+
+	/* Loop should not break until correct page located */
+	Assert(P_RIGHTMOST(lpageop) ||
+		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+
+	/*
+	 * Perform microvacuuming of the page we're about to insert tuple on if it
+	 * looks like it has LP_DEAD items.  Only microvacuum when it's likely to
+	 * forestall a page split, though.
+	 */
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < itemsz)
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		restorebinsrch = false;
 	}
 
 	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
+	 * Reuse binary search bounds established within _bt_check_unique if
+	 * caller is checkingunique caller, and first page locked is also where
+	 * new tuple should be inserted
 	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
-	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	itup_key->restorebinsrch = restorebinsrch;
+	newitemoff = _bt_binsrch(rel, itup_key, buf);
+	Assert(!itup_key->restorebinsrch);
+	Assert(!restorebinsrch || newitemoff == _bt_binsrch(rel, itup_key, buf));
 
 	*bufptr = buf;
-	*offsetptr = newitemoff;
+	return newitemoff;
+}
+
+/*
+ *	_bt_useduplicatepage() -- Settle for this page of duplicates?
+ *
+ *		This function handles the question of whether or not an insertion
+ *		of a duplicate into a pg_upgrade'd !heapkeyspace index should
+ *		insert on the page contained in buf when a choice must be made.
+ *		Preemptive microvacuuming is performed here when that could allow
+ *		caller to insert on to the page in buf.
+ *
+ *		Returns true if caller should proceed with insert on buf's page.
+ *		Otherwise, caller should move on to the page to the right (caller
+ *		must always be able to still move right following call here).
+ */
+static bool
+_bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz)
+{
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque lpageop;
+
+	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	Assert(P_ISLEAF(lpageop) && !P_RIGHTMOST(lpageop));
+
+	/* Easy case -- there is space free on this page already */
+	if (PageGetFreeSpace(page) >= itemsz)
+		return true;
+
+	if (P_HAS_GARBAGE(lpageop))
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		*restorebinsrch = false;
+		if (PageGetFreeSpace(page) >= itemsz)
+			return true;		/* OK, now we have enough space */
+	}
+
+	/*----------
+	 * It's now clear that _bt_findinsertloc() caller will need to split
+	 * the page if it is to insert new item on to it.  The choice to move
+	 * right to the next page remains open to it, but we should not search
+	 * for free space exhaustively when there are many pages to look through.
+	 *
+	 *	_bt_findinsertloc() keeps scanning right until it:
+	 *		(a) reaches the last page where the tuple can legally go
+	 *	Or until we:
+	 *		(b) find a page with enough free space, or
+	 *		(c) get tired of searching.
+	 * (c) is not flippant; it is important because if there are many
+	 * pages' worth of equal keys, it's better to split one of the early
+	 * pages than to scan all the way to the end of the run of equal keys
+	 * on every insert.  We implement "get tired" as a random choice,
+	 * since stopping after scanning a fixed number of pages wouldn't work
+	 * well (we'd never reach the right-hand side of previously split
+	 * pages).  The probability of moving right is set at 0.99, which may
+	 * seem too high to change the behavior much, but it does an excellent
+	 * job of preventing O(N^2) behavior with many equal keys.
+	 *----------
+	 */
+	return random() <= (MAX_RANDOM_VALUE / 100);
 }
 
 /*----------
@@ -1189,8 +1275,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	 * If the page we're splitting is not the rightmost page at its level in
 	 * the tree, then the first entry on the page is the high key for the
 	 * page.  We need to copy that to the right half.  Otherwise (meaning the
-	 * rightmost page case), all the items on the right half will be user
-	 * data.
+	 * rightmost page case), all the items on the right half will be user data
+	 * (there is no existing high key that needs to be relocated to the new
+	 * right page).
 	 */
 	rightoff = P_HIKEY;
 
@@ -2311,24 +2398,21 @@ _bt_pgaddtup(Page page,
  * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
-_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey)
+_bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
+			OffsetNumber offnum)
 {
 	IndexTuple	itup;
+	ScanKey		scankey;
 	int			i;
 
-	/* Better be comparing to a leaf item */
+	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
 
+	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
-	for (i = 1; i <= keysz; i++)
+	for (i = 1; i <= itup_key->keysz; i++)
 	{
 		AttrNumber	attno;
 		Datum		datum;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 1d72fe5408..7569a37b2e 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1370,7 +1370,7 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 */
 			if (!stack)
 			{
-				ScanKey		itup_scankey;
+				BTScanInsert itup_key;
 				ItemId		itemid;
 				IndexTuple	targetkey;
 				Buffer		lbuf;
@@ -1420,12 +1420,11 @@ _bt_pagedel(Relation rel, Buffer buf)
 				}
 
 				/* we need an insertion scan key for the search, so build one */
-				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
-				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
-				/* don't need a pin on the page */
+				itup_key = _bt_mkscankey(rel, targetkey);
+				/* get stack to leaf page by searching index */
+				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
+
+				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
 				/*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 92832237a8..7940297305 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -71,13 +71,9 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
  * but it can omit the rightmost column(s) of the index.
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * Return value is a stack of parent-page pointers.  *bufP is set to the
  * address of the leaf-page buffer, which is read-locked and pinned.
  * No locks are held on the parent pages, however!
@@ -93,8 +89,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+		   Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -130,8 +126,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  (access == BT_WRITE), stack_in,
+		*bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
 		/* if this is a leaf page, we're done */
@@ -144,7 +139,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -198,8 +193,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  true, stack_in, BT_WRITE, snapshot);
+		*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+							  snapshot);
 	}
 
 	return stack_in;
@@ -215,16 +210,17 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * or strictly to the right of it.
  *
  * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page.  If that entry
- * is strictly less than the scankey, or <= the scankey in the nextkey=true
- * case, then we followed the wrong link and we need to move right.
+ * tree by examining the high key entry on the page.  If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key.  When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
  *
  * If forupdate is true, we will attempt to finish any incomplete splits
  * that we encounter.  This is required when locking a target page for an
@@ -241,10 +237,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  */
 Buffer
 _bt_moveright(Relation rel,
+			  BTScanInsert key,
 			  Buffer buf,
-			  int keysz,
-			  ScanKey scankey,
-			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
 			  int access,
@@ -269,7 +263,7 @@ _bt_moveright(Relation rel,
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
 	 */
-	cmpval = nextkey ? 0 : 1;
+	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
@@ -304,7 +298,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -324,13 +318,6 @@ _bt_moveright(Relation rel,
 /*
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
  * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
  * particular, this means it is possible to return a value 1 greater than the
@@ -346,27 +333,45 @@ _bt_moveright(Relation rel,
  *
  * This procedure is not responsible for walking right, it just examines
  * the given page.  _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
+ * on the buffer.  When key.savebinsrch is set, modifies mutable fields
+ * of insertion scan key, so that a subsequent call where caller sets
+ * key.restorebinsrch can reuse the low and strict high bound of original
+ * binary search.  Callers that use these fields directly must be
+ * prepared for the case where stricthigh isn't on the same page (it
+ * exceeds maxoff for the page), and the case where there are no items
+ * on the page (high < low).
  */
 OffsetNumber
 _bt_binsrch(Relation rel,
-			Buffer buf,
-			int keysz,
-			ScanKey scankey,
-			bool nextkey)
+			BTScanInsert key,
+			Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber low,
-				high;
+				high,
+				stricthigh;
 	int32		result,
 				cmpval;
+	bool		isleaf;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	low = P_FIRSTDATAKEY(opaque);
-	high = PageGetMaxOffsetNumber(page);
+	if (!key->restorebinsrch)
+	{
+		low = P_FIRSTDATAKEY(opaque);
+		high = PageGetMaxOffsetNumber(page);
+		isleaf = P_ISLEAF(opaque);
+	}
+	else
+	{
+		/* Restore result of previous binary search against same page */
+		Assert(P_ISLEAF(opaque));
+		low = key->low;
+		high = key->stricthigh;
+		isleaf = true;
+	}
 
 	/*
 	 * If there are no keys on the page, return the first available slot. Note
@@ -375,8 +380,19 @@ _bt_binsrch(Relation rel,
 	 * This can never happen on an internal page, however, since they are
 	 * never empty (an internal page must have children).
 	 */
-	if (high < low)
+	if (unlikely(high < low))
+	{
+		if (key->savebinsrch)
+		{
+			Assert(isleaf);
+			/* Caller can't use stricthigh */
+			key->low = low;
+			key->stricthigh = high;
+		}
+		key->savebinsrch = false;
+		key->restorebinsrch = false;
 		return low;
+	}
 
 	/*
 	 * Binary search to find the first key on the page >= scan key, or first
@@ -390,9 +406,12 @@ _bt_binsrch(Relation rel,
 	 *
 	 * We can fall out when high == low.
 	 */
-	high++;						/* establish the loop invariant for high */
+	if (!key->restorebinsrch)
+		high++;					/* establish the loop invariant for high */
+	key->restorebinsrch = false;
+	stricthigh = high;			/* high initially strictly higher */
 
-	cmpval = nextkey ? 0 : 1;	/* select comparison value */
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
 
 	while (high > low)
 	{
@@ -400,12 +419,21 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
 		else
+		{
 			high = mid;
+
+			/*
+			 * high can only be reused by more restrictive binary search when
+			 * it's known to be strictly greater than the original scankey
+			 */
+			if (result != 0)
+				stricthigh = high;
+		}
 	}
 
 	/*
@@ -415,7 +443,14 @@ _bt_binsrch(Relation rel,
 	 * On a leaf page, we always return the first key >= scan key (resp. >
 	 * scan key), which could be the last slot + 1.
 	 */
-	if (P_ISLEAF(opaque))
+	if (key->savebinsrch)
+	{
+		Assert(isleaf);
+		key->low = low;
+		key->stricthigh = stricthigh;
+		key->savebinsrch = false;
+	}
+	if (isleaf)
 		return low;
 
 	/*
@@ -428,13 +463,8 @@ _bt_binsrch(Relation rel,
 }
 
 /*----------
- *	_bt_compare() -- Compare scankey to a particular tuple on the page.
+ *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- *	keysz: number of key conditions to be checked (might be less than the
- *		number of index columns!)
  *	page/offnum: location of btree item to be compared to.
  *
  *		This routine returns:
@@ -455,17 +485,17 @@ _bt_binsrch(Relation rel,
  */
 int32
 _bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
+			BTScanInsert key,
 			Page page,
 			OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
-	int			i;
+	ScanKey		scankey;
 
 	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
@@ -488,7 +518,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	scankey = key->scankeys;
+	for (int i = 1; i <= key->keysz; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -574,8 +605,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat;
 	bool		nextkey;
 	bool		goback;
+	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
-	ScanKeyData scankeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -821,8 +852,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
 	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built in the local
-	 * scankeys[] array, using the keys identified by startKeys[].
+	 * identified above.  The insertion scankey is built using the keys
+	 * identified by startKeys[].  (Remaining insertion scankey fields are
+	 * initialized after initial-positioning strategy is finalized.)
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
 	for (i = 0; i < keysCount; i++)
@@ -850,7 +882,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				_bt_parallel_done(scan);
 				return false;
 			}
-			memcpy(scankeys + i, subkey, sizeof(ScanKeyData));
+			memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
 
 			/*
 			 * If the row comparison is the last positioning key we accepted,
@@ -882,7 +914,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					if (subkey->sk_flags & SK_ISNULL)
 						break;	/* can't use null keys */
 					Assert(keysCount < INDEX_MAX_KEYS);
-					memcpy(scankeys + keysCount, subkey, sizeof(ScanKeyData));
+					memcpy(inskey.scankeys + keysCount, subkey,
+						   sizeof(ScanKeyData));
 					keysCount++;
 					if (subkey->sk_flags & SK_ROW_END)
 					{
@@ -928,7 +961,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				FmgrInfo   *procinfo;
 
 				procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
-				ScanKeyEntryInitializeWithInfo(scankeys + i,
+				ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
 											   cur->sk_flags,
 											   cur->sk_attno,
 											   InvalidStrategy,
@@ -949,7 +982,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
 						 BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
 						 cur->sk_attno, RelationGetRelationName(rel));
-				ScanKeyEntryInitialize(scankeys + i,
+				ScanKeyEntryInitialize(inskey.scankeys + i,
 									   cur->sk_flags,
 									   cur->sk_attno,
 									   InvalidStrategy,
@@ -1052,12 +1085,17 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
+	/* Initialize remaining insertion scan key fields */
+	inskey.savebinsrch = inskey.restorebinsrch = false;
+	inskey.low = inskey.stricthigh = InvalidOffsetNumber;
+	inskey.nextkey = nextkey;
+	inskey.keysz = keysCount;
+
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
@@ -1086,7 +1124,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, &inskey, buf);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e1186..759859c302 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -254,6 +254,7 @@ typedef struct BTWriteState
 {
 	Relation	heap;
 	Relation	index;
+	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
@@ -531,6 +532,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
+	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 
 	/*
 	 * We need to log index creation in WAL iff WAL archiving/streaming is
@@ -1076,7 +1078,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
-	ScanKey		indexScanKey = NULL;
 	SortSupport sortKeys;
 
 	if (merge)
@@ -1089,7 +1090,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		/* the preparation of merge */
 		itup = tuplesort_getindextuple(btspool->sortstate, true);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-		indexScanKey = _bt_mkscankey_nodata(wstate->index);
 
 		/* Prepare SortSupport data for each column */
 		sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
@@ -1097,7 +1097,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		for (i = 0; i < keysz; i++)
 		{
 			SortSupport sortKey = sortKeys + i;
-			ScanKey		scanKey = indexScanKey + i;
+			ScanKey		scanKey = wstate->inskey->scankeys + i;
 			int16		strategy;
 
 			sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1116,8 +1116,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
 		}
 
-		_bt_freeskey(indexScanKey);
-
 		for (;;)
 		{
 			load1 = true;		/* load BTSpool next ? */
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 2c05fb5e45..e010bcdcfa 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -56,34 +56,39 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().
+ *		Result is intended for use with _bt_compare().  Callers that don't
+ *		need to fill out the insertion scankey arguments (e.g. they use their own
+ *		ad-hoc comparison routine) can pass a NULL index tuple.
  */
-ScanKey
+BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
+	BTScanInsert key;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
-	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
+	int			tupnatts;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
-	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
+	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
 	 * (INCLUDE index) columns are always omitted from scan keys.
 	 */
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
+	key = palloc(offsetof(BTScanInsertData, scankeys) +
+				 sizeof(ScanKeyData) * indnkeyatts);
+	key->savebinsrch = key->restorebinsrch = false;
+	key->low = key->stricthigh = InvalidOffsetNumber;
+	key->nextkey = false;
+	key->keysz = Min(indnkeyatts, tupnatts);
+	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
 		FmgrInfo   *procinfo;
@@ -96,7 +101,19 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Key arguments built when caller provides no tuple are defensively
+		 * represented as NULL values, though they should still not
+		 * participate in comparisons.
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -108,64 +125,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 									   arg);
 	}
 
-	return skey;
-}
-
-/*
- * _bt_mkscankey_nodata
- *		Build an insertion scan key that contains 3-way comparator routines
- *		appropriate to the key datatypes, but no comparison data.  The
- *		comparison data ultimately used must match the key datatypes.
- *
- *		The result cannot be used with _bt_compare(), unless comparison
- *		data is first stored into the key entries.  Currently this
- *		routine is only called by nbtsort.c and tuplesort.c, which have
- *		their own comparison routines.
- */
-ScanKey
-_bt_mkscankey_nodata(Relation rel)
-{
-	ScanKey		skey;
-	int			indnkeyatts;
-	int16	   *indoption;
-	int			i;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	indoption = rel->rd_indoption;
-
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
-	for (i = 0; i < indnkeyatts; i++)
-	{
-		FmgrInfo   *procinfo;
-		int			flags;
-
-		/*
-		 * We can use the cached (default) support procs since no cross-type
-		 * comparison can be needed.
-		 */
-		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		flags = SK_ISNULL | (indoption[i] << SK_BT_INDOPTION_SHIFT);
-		ScanKeyEntryInitializeWithInfo(&skey[i],
-									   flags,
-									   (AttrNumber) (i + 1),
-									   InvalidStrategy,
-									   InvalidOid,
-									   rel->rd_indcollation[i],
-									   procinfo,
-									   (Datum) 0);
-	}
-
-	return skey;
-}
-
-/*
- * free a scan key made by either _bt_mkscankey or _bt_mkscankey_nodata.
- */
-void
-_bt_freeskey(ScanKey skey)
-{
-	pfree(skey);
+	return key;
 }
 
 /*
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7b10fd2974..f97a82ae7b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -884,7 +884,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -919,7 +919,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	if (state->indexInfo->ii_Expressions != NULL)
 	{
@@ -945,7 +945,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -964,7 +964,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -981,7 +981,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -1014,7 +1014,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->indexRel = indexRel;
 	state->enforceUnique = enforceUnique;
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	/* Prepare SortSupport data for each column */
 	state->sortKeys = (SortSupport) palloc0(state->nKeys *
@@ -1023,7 +1023,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1042,7 +1042,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4fb92d60a1..dc2eafb566 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -319,6 +319,47 @@ typedef struct BTStackData
 
 typedef BTStackData *BTStack;
 
+/*
+ * BTScanInsert is the btree-private state needed to find an initial position
+ * for an indexscan, or to insert new tuples -- an "insertion scankey" (not to
+ * be confused with a search scankey).  It's used to descend a B-Tree using
+ * _bt_search.  For details on its mutable state, see _bt_binsrch and
+ * _bt_findinsertloc.
+ *
+ * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
+ * locate the first item >= scankey.  When nextkey is true, they will locate
+ * the first item > scan key.
+ *
+ * keysz is the number of insertion scankeys present.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared.
+ * During insertion, there must be a scan key for every attribute, but when
+ * starting a regular index scan some can be omitted.  The array is used as a
+ * flexible array member, though it's sized in a way that makes it possible to
+ * use stack allocations.  See nbtree/README for full details.
+ */
+
+typedef struct BTScanInsertData
+{
+	/*
+	 * Mutable state used by _bt_binsrch to inexpensively repeat a binary
+	 * search on the leaf level.  Only used for insertions where
+	 * _bt_check_unique is called.
+	 */
+	bool		savebinsrch;
+	bool		restorebinsrch;
+	OffsetNumber low;
+	OffsetNumber stricthigh;
+
+	/* State used to locate a position at the leaf level */
+	bool		nextkey;
+	int			keysz;			/* Size of scankeys */
+	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
+} BTScanInsertData;
+
+typedef BTScanInsertData *BTScanInsert;
+
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -558,16 +599,12 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
 /*
  * prototypes for functions in nbtsearch.c
  */
-extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
-extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
+		   int access, Snapshot snapshot);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -576,9 +613,7 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 /*
  * prototypes for functions in nbtutils.c
  */
-extern ScanKey _bt_mkscankey(Relation rel, IndexTuple itup);
-extern ScanKey _bt_mkscankey_nodata(Relation rel);
-extern void _bt_freeskey(ScanKey skey);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-- 
2.17.1

In reply to: Peter Geoghegan (#56)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Mon, Feb 11, 2019 at 12:54 PM Peter Geoghegan <pg@bowt.ie> wrote:

Notable improvements in v12:

I've been benchmarking v12, once again using a slightly modified
BenchmarkSQL that doesn't do up-front CREATE INDEX builds [1]https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations, since
the problems with index bloat don't take so long to manifest
themselves when the indexes are inserted into incrementally from the
very beginning. This benchmarking process took over 20 hours, with a
database that started off at about 90GB (700 TPC-C/BenchmarkSQL
warehouses were used). That easily exceeded available main memory on
my test server, which was 32GB. This is a pretty I/O bound workload,
and a fairly write-heavy one at that. I used a Samsung 970 PRO 512GB,
NVMe PCIe M.2 2280 SSD for both pg_wal and the default and only
tablespace.

Importantly, I figured out that I should disable both hash joins and
merge joins with BenchmarkSQL, in order to force all joins to be
nested loop joins. Otherwise, the "stock level" transaction eventually
starts to use a hash join, even though that's about 10x slower than a
nestloop join (~4ms vs. ~40ms on this machine) -- the hash join
produces a lot of noise without really testing anything. It usually
takes a couple of hours before we start to get obviously-bad plans,
but it also usually takes about that long until the patch series
starts to noticeably overtake the master branch. I don't think that
TPC-C will ever benefit from using a hash join or a merge join, since
it's supposed to be a pure OLTP benchmark, and is a benchmark that
MySQL is known to do at least respectably well on.

This is the first benchmark I've published that was considerably I/O
bound. There are significant improvements in performance across the
board, on every measure, though it takes several hours for that to
really show. The benchmark was not rate-limited. 16
clients/"terminals" are used throughout. There were 5 runs for master
and 5 for patch, interlaced, each lasting 2 hours. Initialization
occurred once, so it's expected that both databases will gradually get
larger across runs.

Summary (appears in same order as the execution of each run) -- each
run is 2 hours, so 20 hours total excluding initial load time (2 hours
* 5 runs for master + 2 hours * 5 runs for patch):

Run 1 -- master: Measured tpmTOTAL = 90063.79, Measured tpmC
(NewOrders) = 39172.37
Run 1 -- patch: Measured tpmTOTAL = 90922.63, Measured tpmC
(NewOrders) = 39530.2

Run 2 -- master: Measured tpmTOTAL = 77091.63, Measured tpmC
(NewOrders) = 33530.66
Run 2 -- patch: Measured tpmTOTAL = 83905.48, Measured tpmC
(NewOrders) = 36508.38 <-- 8.8% increase in tpmTOTAL/throughput

Run 3 -- master: Measured tpmTOTAL = 71224.25, Measured tpmC
(NewOrders) = 30949.24
Run 3 -- patch: Measured tpmTOTAL = 78268.29, Measured tpmC
(NewOrders) = 34021.98 <-- 9.8% increase in tpmTOTAL/throughput

Run 4 -- master: Measured tpmTOTAL = 71671.96, Measured tpmC
(NewOrders) = 31163.29
Run 4 -- patch: Measured tpmTOTAL = 73097.42, Measured tpmC
(NewOrders) = 31793.99

Run 5 -- master: Measured tpmTOTAL = 66503.38, Measured tpmC
(NewOrders) = 28908.8
Run 5 -- patch: Measured tpmTOTAL = 71072.3, Measured tpmC (NewOrders)
= 30885.56 <-- 6.9% increase in tpmTOTAL/throughput

There were *also* significant reductions in transaction latency for
the patch -- see the full html reports in the provided tar archive for
full details (URL provided below). The html reports have nice SVG
graphs, generated by BenchmarkSQL using R -- one for transaction
throughput, and another for transaction latency. The overall picture
is that the patched version starts out ahead, and has a much more
gradual decline as the database becomes larger and more bloated.

Note also that the statistics collector stats show a *big* reduction
in blocks read into shared_buffers for the duration of these runs. For
example, here is what pg_stat_database shows for run 3 (I reset the
stats between runs):

master: blks_read = 78,412,640, blks_hit = 4,022,619,556
patch: blks_read = 70,033,583, blks_hit = 4,505,308,517 <-- 10.7%
reduction in blks_read/logical I/O

This suggests an indirect benefit, likely related to how buffers are
evicted in each case. pg_stat_bgwriter indicates that more buffers are
written out during checkpoints, while fewer are written out by
backends. I won't speculate further on what all of this means right
now, though.

You can find the raw details for blks_read for each and every run in
the full tar archive. It is available for download from:

https://drive.google.com/file/d/1kN4fDmh1a9jtOj8URPrnGYAmuMPmcZax/view?usp=sharing

There are also dumps of the other pg_stat* views at the end of each
run, logs for each run, etc. There's more information than anybody
else is likely to find interesting.

If anyone needs help in recreating this benchmark, then I'd be happy
to assist in that. There is a shell script (zsh) included in the tar
archive, although that will need to be changed a bit to point to the
correct installations and so on. Independent validation of the
performance of the patch series on this and other benchmarks is very
welcome.

[1]: https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations
--
Peter Geoghegan

In reply to: Heikki Linnakangas (#53)
7 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I spent some time first trying to understand the current algorithm, and
then rewriting it in a way that I find easier to understand. I came up
with the attached. I think it optimizes for the same goals as your
patch, but the approach is quite different.

Attached is v13 of the patch series, which significantly refactors
nbtsplitloc.c to implement the algorithm using the approach from your
prototype posted on January 28 -- I now take a "top down" approach
that materializes all legal split points up-front, as opposed to the
initial "bottom up" approach that used recursion, and weighed
everything (balance of free space, suffix truncation, etc) all at
once. Some of the code is directly lifted from your prototype, so
there is now a question about whether or not you should be listed as a
co-author. (I think that you should be credited as a secondary author
of the nbtsplitloc.c patch, and as a secondary author in the release
notes for the feature as a whole, which I imagine will be rolled into
one item there.)
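
To make the shape of the new "top down" approach concrete, here is a
toy, stand-alone illustration of the three steps (nothing below is
lifted from the patch -- the keys, the item sizes, the split interval,
and the shared-prefix "penalty" standing in for _bt_keep_natts_fast()
are all invented for the example):

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	int			firstright;	/* index of first item that goes to the right */
	int			delta;		/* |left bytes - right bytes| */
} CandidateSplit;

static int
cmp_delta(const void *a, const void *b)
{
	const CandidateSplit *c1 = (const CandidateSplit *) a;
	const CandidateSplit *c2 = (const CandidateSplit *) b;

	return (c1->delta > c2->delta) - (c1->delta < c2->delta);
}

/* stand-in penalty: length of prefix shared by the keys enclosing the split */
static int
penalty(const char **keys, int firstright)
{
	const char *l = keys[firstright - 1];
	const char *r = keys[firstright];
	int			n = 0;

	while (l[n] != '\0' && l[n] == r[n])
		n++;
	return n;
}

int
main(void)
{
	const char *keys[] = {"new hampshire", "new jersey", "new mexico",
						  "new york", "north carolina", "north dakota"};
	int			sizes[] = {40, 36, 38, 32, 48, 44};
	int			nitems = 6;
	int			total = 0;
	CandidateSplit cand[8];
	int			ncand = 0;

	for (int i = 0; i < nitems; i++)
		total += sizes[i];

	/* Step 1: materialize every candidate split point (points between items) */
	for (int firstright = 1, left = 0; firstright < nitems; firstright++)
	{
		left += sizes[firstright - 1];
		cand[ncand].firstright = firstright;
		cand[ncand].delta = abs(left - (total - left));
		ncand++;
	}

	/* Step 2: sort candidates by delta; only the first few are "acceptable" */
	qsort(cand, ncand, sizeof(CandidateSplit), cmp_delta);

	/* Step 3: among the acceptable candidates, take the lowest penalty */
	int			interval = 3;
	int			best = 0;
	int			bestpen = INT_MAX;

	for (int i = 0; i < interval && i < ncand; i++)
	{
		int			p = penalty(keys, cand[i].firstright);

		if (p < bestpen)
		{
			bestpen = p;
			best = i;
		}
	}

	printf("split before \"%s\" (delta=%d, shared prefix=%d)\n",
		   keys[cand[best].firstright], cand[best].delta, bestpen);
	return 0;
}

The output is a split just before "north carolina": the space-optimal
point (between "new mexico" and "new york") gets nudged to a nearby
point that is far more distinguishing, which is exactly the trade-off
the real code makes. The real code is of course much fussier about
legality, the incoming tuple, and fillfactor, but the three steps are
the same.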

I always knew that a "top down" approach would be simpler, but I
underestimated how much better it would be overall, and how manageable
the downsides are -- the added cycles are not actually noticeable when
compared to the master branch, even during microbenchmarks. Thanks for
suggesting this approach!

I don't even need to simulate recursion with a loop or a goto;
everything is structured as a linear series of steps now. There are
still the same modes as before, though; the algorithm is essentially
unchanged. All of my tests show that it's at least as effective as v12
was in terms of how effective the final _bt_findsplitloc() results are
in reducing index size. The new approach will make more sophisticated
suffix truncation costing much easier to implement in a future
release, when suffix truncation is taught to truncate *within*
individual datums/attributes (e.g. generate the text string "new m"
given a split point between "new jersey" and "new york", by using some
new opclass infrastructure). "Top down" also makes the implementation
of the "split after new item" optimization safer and simpler -- we
already have all split points conveniently available, so we can seek
out an exact match instead of interpolating where it ought to appear
later using a dynamic fillfactor. We can back out of the "split after
new item" optimization in the event that the *precise* split point we
want to use isn't legal. That shouldn't be necessary, and isn't
necessary in practice, but it seems like a good idea to be defensive
with something as delicate as this.
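
In case it helps to see it spelled out, "seek out an exact match"
amounts to something like the following, in terms of the SplitPoint
array from the attached patch (the function name and exact shape here
are mine, not the patch's -- a rough sketch only):

static bool
_bt_afternewitemsplit(FindSplitData *state, OffsetNumber newitemoff,
					  OffsetNumber *firstright, bool *newitemonleft)
{
	for (int i = 0; i < state->nsplits; i++)
	{
		SplitPoint *split = state->splits + i;

		/*
		 * The "split after new item" split point is the one where the new
		 * item ends up as the last item on the left half: the first old
		 * item on the right is the tuple the new item was inserted in
		 * front of, and the new item itself goes to the left.
		 */
		if (split->firstoldonright == newitemoff && split->newitemonleft)
		{
			*firstright = split->firstoldonright;
			*newitemonleft = true;
			return true;
		}
	}

	/*
	 * The precise split point we want isn't among the legal split points,
	 * so back out and let the default strategy decide instead.
	 */
	return false;
}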

I'm using qsort() to sort the candidate split points array. I'm not
trying to do something clever to avoid the up-front effort of sorting
everything, even though we could probably get away with that much of
the time (e.g. by doing a top-N sort in default mode). Testing has
shown that using an inlined qsort() routine in the style of
tuplesort.c would make my serial test cases/microbenchmarks faster,
without adding much complexity. We're already competitive with the
master branch even without this microoptimization, so I've put that
off for now. What do you think of the idea of specializing an
inlineable qsort() for nbtsplitloc.c?
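
To illustrate the kind of specialization I have in mind (a sketch
only, not something that's in the patch -- a tuplesort.c-style
template would generate a full quicksort, and an insertion sort is
shown here just to keep the example short):

static void
_bt_deltasort_specialized(SplitPoint *splits, int nsplits)
{
	for (int i = 1; i < nsplits; i++)
	{
		SplitPoint	cur = splits[i];
		int			j = i - 1;

		/* shift larger-delta entries right to make room, comparing inline */
		while (j >= 0 && splits[j].curdelta > cur.curdelta)
		{
			splits[j + 1] = splits[j];
			j--;
		}
		splits[j + 1] = cur;
	}
}

The win would come from comparing curdelta directly, instead of going
through qsort()'s indirect comparator calls for every comparison.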

Performance is at least as good as v12 with a more relevant workload,
such as BenchmarkSQL. Transaction throughput is 5% - 10% greater in my
most recent tests (benchmarks for v13 specifically).

Unlike in your prototype, v13 makes the array for holding candidate
split points into a single big allocation that is always exactly
BLCKSZ. The idea is that palloc() can thereby recycle the big
_bt_findsplitloc() allocation within _bt_split(). palloc() considers
8KiB to be the upper limit on the size of individual blocks it
manages, and we'll always go on to palloc(BLCKSZ) through the
_bt_split() call to PageGetTempPage(). In a sense, we're not even
allocating memory that we weren't allocating already. (Not sure that
this really matters, but it is easy to do it that way.)

Other changes from your prototype:

* I found a more efficient representation than a pair of raw
IndexTuple pointers for each candidate split. Actually, I use the same
old representation (firstoldonright + newitemonleft) in each split,
and provide routines to work backwards from that to get the lastleft
and firstright tuples. This approach is far more space efficient, and
space efficiency matters when you've allocating space for hundreds of
items in a critical path like this.

* You seemed to refactor _bt_checksplitloc() in passing, making it not
do the newitemisfirstonright thing. I changed that back. Did I miss
something that you intended here?

* Fixed a bug in the loop that adds split points. Your refactoring
made the main loop responsible for new item space handling, as just
mentioned, but it didn't create a split where the new item is first on
the page and the split puts the new item on the left page on its own,
with all existing items on the new right page. I couldn't prove that
this caused failures to find a legal split, but it still seemed like a
bug.

In general, I think that we should generate our initial list of split
points in exactly the same manner as we do so already. The only
difference as far as split legality/feasibility goes is that we
pessimistically assume that suffix truncation will have to add a heap
TID in all cases. I don't see any advantage to going further than
that.

Changes to my own code since v12:

* Simplified "Add "split after new tuple" optimization" commit, and
made it more consistent with associated code. This is something that
was made a lot easier by the new approach. It would be great to hear
what you think of this.

* Removed subtly wrong assertion in nbtpage.c, concerning VACUUM's
page deletion. Even a page that is about to be deleted can be filled
up again and split when we release and reacquire a lock, however
unlikely that may be.

* Rename _bt_checksplitloc() to _bt_recordsplit(). I think that it
makes more sense to make that about recording a split point, rather
than about making sure a split point is legal. It still does that, but
perhaps 99%+ of calls to _bt_recordsplit()/_bt_checksplitloc() result
in the split being deemed legal, so the new name makes much more
sense.

--
Peter Geoghegan

Attachments:

v13-0003-Consider-secondary-factors-during-nbtree-splits.patch (text/x-patch; charset=US-ASCII)
From 9926bd4a937ccfeaa19e412064d9c46c39b42792 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 15:51:53 -0700
Subject: [PATCH v13 3/7] Consider secondary factors during nbtree splits.

Teach nbtree to give some consideration to how "distinguishing"
candidate leaf page split points are.  This should not noticeably affect
the balance of free space within each half of the split, while still
making suffix truncation truncate away significantly more attributes on
average.

The logic for choosing a leaf split point now uses a fallback mode in
the case where the page is full of duplicates and it isn't possible to
find even a minimally distinguishing split point.  When the page is full
of duplicates, the split should pack the left half very tightly, while
leaving the right half mostly empty.  Our assumption is that logical
duplicates will almost always be inserted in ascending heap TID order
with v4 indexes.  This strategy leaves most of the free space on the
half of the split that will likely be where future logical duplicates of
the same value need to be placed.

The number of cycles added is not very noticeable.  This is important
because deciding on a split point takes place while at least one
exclusive buffer lock is held.  We avoid using authoritative insertion
scankey comparisons to save cycles, unlike suffix truncation proper.  We
use a faster binary comparison instead.

Note that even pre-pg_upgrade'd v3 indexes make use of these
optimizations.  Benchmarking has shown that even v3 indexes benefit,
despite the fact that suffix truncation will only truncate non-key
attributes in INCLUDE indexes.  Grouping relatively similar tuples
together is beneficial in and of itself, since it reduces the number of
leaf pages that must be accessed by subsequent index scans.
---
 src/backend/access/nbtree/Makefile      |   2 +-
 src/backend/access/nbtree/README        |  47 +-
 src/backend/access/nbtree/nbtinsert.c   | 299 +--------
 src/backend/access/nbtree/nbtsplitloc.c | 779 ++++++++++++++++++++++++
 src/backend/access/nbtree/nbtutils.c    |  49 ++
 src/include/access/nbtree.h             |  15 +-
 6 files changed, 897 insertions(+), 294 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtsplitloc.c

diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bbb21d235c..9aab9cf64a 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = nbtcompare.o nbtinsert.o nbtpage.o nbtree.o nbtsearch.o \
-       nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
+       nbtsplitloc.o nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index be9bf61d47..cdd68b6f75 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -155,9 +155,9 @@ Lehman and Yao assume fixed-size keys, but we must deal with
 variable-size keys.  Therefore there is not a fixed maximum number of
 keys per page; we just stuff in as many as will fit.  When we split a
 page, we try to equalize the number of bytes, not items, assigned to
-each of the resulting pages.  Note we must include the incoming item in
-this calculation, otherwise it is possible to find that the incoming
-item doesn't fit on the split page where it needs to go!
+pages (though suffix truncation is also considered).  Note we must include
+the incoming item in this calculation, otherwise it is possible to find
+that the incoming item doesn't fit on the split page where it needs to go!
 
 The Deletion Algorithm
 ----------------------
@@ -657,6 +657,47 @@ variable-length types, such as text.  An opclass support function could
 manufacture the shortest possible key value that still correctly separates
 each half of a leaf page split.
 
+There are sophisticated criteria for choosing a leaf page split point.  The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split.  The
+choice of leaf split point can be thought of as a choice among points
+*between* items on the page to be split, at least if you pretend that the
+incoming tuple was placed on the page already (you have to pretend because
+there won't actually be enough space for it on the page).  Choosing the
+split point between two index tuples where the first non-equal attribute
+appears as early as possible results in truncating away as many suffix
+attributes as possible.  Evenly balancing space among each half of the
+split is usually the first concern, but even small adjustments in the
+precise split point can allow truncation to be far more effective.
+
+Suffix truncation is primarily valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective.  Even truncation that doesn't make pivot tuples
+smaller due to alignment still prevents pivot tuples from being more
+restrictive than truly necessary in how they describe which values belong
+on which pages.
+
+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page.  The split point chosen is the one whose downlink (the
+tuple to be inserted into the parent) is the smallest one available within
+an acceptable range of the fillfactor-wise optimal split point.  This idea
+also comes from the Prefix B-Tree paper, and has much in common with what
+happens at the leaf level to make suffix truncation effective.  The overall
+effect is that suffix truncation tends to produce smaller and less
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, less discriminating
+pivot tuples end up in the root page, delaying root page splits.
+
+Logical duplicates are given special consideration.  The logic for
+selecting a split point goes to great lengths to avoid having duplicates
+span more than one page, and almost always manages to pick a split point
+between two user-key-distinct tuples, accepting a completely lopsided split
+if it must.  When a page that's already full of duplicates must be split,
+the fallback strategy assumes that duplicates are mostly inserted in
+ascending heap TID order.  The page is split in a way that leaves the left
+half of the page mostly full, and the right half of the page mostly empty.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 6f1d179c67..c666698c1e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,26 +28,6 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
-typedef struct
-{
-	/* context data for _bt_checksplitloc */
-	Size		newitemsz;		/* size of new item to be inserted */
-	int			fillfactor;		/* needed when splitting rightmost page */
-	bool		is_leaf;		/* T if splitting a leaf page */
-	bool		is_rightmost;	/* T if splitting a rightmost page */
-	OffsetNumber newitemoff;	/* where the new item is to be inserted */
-	int			leftspace;		/* space available for items on left page */
-	int			rightspace;		/* space available for items on right page */
-	int			olddataitemstotal;	/* space taken by old items */
-
-	bool		have_split;		/* found a valid split? */
-
-	/* these fields valid only if have_split is true */
-	bool		newitemonleft;	/* new item on left or right of best split */
-	OffsetNumber firstright;	/* best split point */
-	int			best_delta;		/* best size delta so far */
-} FindSplitData;
-
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -76,13 +56,6 @@ static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
-static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft);
-static void _bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright, bool newitemonleft,
-				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key,
@@ -324,7 +297,9 @@ top:
  * Sets state in itup_key sufficient for later _bt_findinsertloc() call to
  * reuse most of the work of our initial binary search to find conflicting
  * tuples.  This won't be usable if caller's tuple is determined to not belong
- * on buf following scantid being filled-in.
+ * on buf following scantid being filled-in, but that should be very rare in
+ * practice, since the logic for choosing a leaf split point works hard to
+ * avoid splitting within a group of duplicates.
  *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
@@ -913,8 +888,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
  *
  *		This recursive procedure does the following things:
  *
- *			+  if necessary, splits the target page (making sure that the
- *			   split is equitable as far as post-insert free space goes).
+ *			+  if necessary, splits the target page.
  *			+  inserts the tuple.
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
@@ -1009,7 +983,7 @@ _bt_insertonpg(Relation rel,
 
 		/* Choose the split point */
 		firstright = _bt_findsplitloc(rel, page,
-									  newitemoff, itemsz,
+									  newitemoff, itemsz, itup,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
@@ -1345,6 +1319,11 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * to go into the new right page, or possibly a truncated version if this
 	 * is a leaf page split.  This might be either the existing data item at
 	 * position firstright, or the incoming tuple.
+	 *
+	 * Lehman and Yao use the last left item as the new high key for the left
+	 * page (on leaf level).  Despite appearances, the new high key is
+	 * generated in a way that's consistent with their approach.  See comments
+	 * above _bt_findsplitloc for an explanation.
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1684,264 +1663,6 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	return rbuf;
 }
 
-/*
- *	_bt_findsplitloc() -- find an appropriate place to split a page.
- *
- * The idea here is to equalize the free space that will be on each split
- * page, *after accounting for the inserted tuple*.  (If we fail to account
- * for it, we might find ourselves with too little room on the page that
- * it needs to go into!)
- *
- * If the page is the rightmost page on its level, we instead try to arrange
- * to leave the left split page fillfactor% full.  In this way, when we are
- * inserting successively increasing keys (consider sequences, timestamps,
- * etc) we will end up with a tree whose pages are about fillfactor% full,
- * instead of the 50% full result that we'd get without this special case.
- * This is the same as nbtsort.c produces for a newly-created tree.  Note
- * that leaf and nonleaf pages use different fillfactors.
- *
- * We are passed the intended insert position of the new tuple, expressed as
- * the offsetnumber of the tuple it must go in front of.  (This could be
- * maxoff+1 if the tuple is to go at the end.)
- *
- * We return the index of the first existing tuple that should go on the
- * righthand page, plus a boolean indicating whether the new tuple goes on
- * the left or right page.  The bool is necessary to disambiguate the case
- * where firstright == newitemoff.
- */
-static OffsetNumber
-_bt_findsplitloc(Relation rel,
-				 Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft)
-{
-	BTPageOpaque opaque;
-	OffsetNumber offnum;
-	OffsetNumber maxoff;
-	ItemId		itemid;
-	FindSplitData state;
-	int			leftspace,
-				rightspace,
-				goodenough,
-				olddataitemstotal,
-				olddataitemstoleft;
-	bool		goodenoughfound;
-
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
-	newitemsz += sizeof(ItemIdData);
-
-	/* Total free space available on a btree page, after fixed overhead */
-	leftspace = rightspace =
-		PageGetPageSize(page) - SizeOfPageHeaderData -
-		MAXALIGN(sizeof(BTPageOpaqueData));
-
-	/* The right page will have the same high key as the old page */
-	if (!P_RIGHTMOST(opaque))
-	{
-		itemid = PageGetItemId(page, P_HIKEY);
-		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
-							 sizeof(ItemIdData));
-	}
-
-	/* Count up total space in data items without actually scanning 'em */
-	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
-
-	state.newitemsz = newitemsz;
-	state.is_leaf = P_ISLEAF(opaque);
-	state.is_rightmost = P_RIGHTMOST(opaque);
-	state.have_split = false;
-	if (state.is_leaf)
-		state.fillfactor = RelationGetFillFactor(rel,
-												 BTREE_DEFAULT_FILLFACTOR);
-	else
-		state.fillfactor = BTREE_NONLEAF_FILLFACTOR;
-	state.newitemonleft = false;	/* these just to keep compiler quiet */
-	state.firstright = 0;
-	state.best_delta = 0;
-	state.leftspace = leftspace;
-	state.rightspace = rightspace;
-	state.olddataitemstotal = olddataitemstotal;
-	state.newitemoff = newitemoff;
-
-	/*
-	 * Finding the best possible split would require checking all the possible
-	 * split points, because of the high-key and left-key special cases.
-	 * That's probably more work than it's worth; instead, stop as soon as we
-	 * find a "good-enough" split, where good-enough is defined as an
-	 * imbalance in free space of no more than pagesize/16 (arbitrary...) This
-	 * should let us stop near the middle on most pages, instead of plowing to
-	 * the end.
-	 */
-	goodenough = leftspace / 16;
-
-	/*
-	 * Scan through the data items and calculate space usage for a split at
-	 * each possible position.
-	 */
-	olddataitemstoleft = 0;
-	goodenoughfound = false;
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	for (offnum = P_FIRSTDATAKEY(opaque);
-		 offnum <= maxoff;
-		 offnum = OffsetNumberNext(offnum))
-	{
-		Size		itemsz;
-
-		itemid = PageGetItemId(page, offnum);
-		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
-
-		/*
-		 * Will the new item go to left or right of split?
-		 */
-		if (offnum > newitemoff)
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-		else if (offnum < newitemoff)
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		else
-		{
-			/* need to try it both ways! */
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		}
-
-		/* Abort scan once we find a good-enough choice */
-		if (state.have_split && state.best_delta <= goodenough)
-		{
-			goodenoughfound = true;
-			break;
-		}
-
-		olddataitemstoleft += itemsz;
-	}
-
-	/*
-	 * If the new item goes as the last item, check for splitting so that all
-	 * the old items go to the left page and the new item goes to the right
-	 * page.
-	 */
-	if (newitemoff > maxoff && !goodenoughfound)
-		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
-
-	/*
-	 * I believe it is not possible to fail to find a feasible split, but just
-	 * in case ...
-	 */
-	if (!state.have_split)
-		elog(ERROR, "could not find a feasible split point for index \"%s\"",
-			 RelationGetRelationName(rel));
-
-	*newitemonleft = state.newitemonleft;
-	return state.firstright;
-}
-
-/*
- * Subroutine to analyze a particular possible split choice (ie, firstright
- * and newitemonleft settings), and record the best split so far in *state.
- *
- * firstoldonright is the offset of the first item on the original page
- * that goes to the right page, and firstoldonrightsz is the size of that
- * tuple. firstoldonright can be > max offset, which means that all the old
- * items go to the left page and only the new item goes to the right page.
- * In that case, firstoldonrightsz is not used.
- *
- * olddataitemstoleft is the total size of all old items to the left of
- * firstoldonright.
- */
-static void
-_bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright,
-				  bool newitemonleft,
-				  int olddataitemstoleft,
-				  Size firstoldonrightsz)
-{
-	int			leftfree,
-				rightfree;
-	Size		firstrightitemsz;
-	bool		newitemisfirstonright;
-
-	/* Is the new item going to be the first item on the right page? */
-	newitemisfirstonright = (firstoldonright == state->newitemoff
-							 && !newitemonleft);
-
-	if (newitemisfirstonright)
-		firstrightitemsz = state->newitemsz;
-	else
-		firstrightitemsz = firstoldonrightsz;
-
-	/* Account for all the old tuples */
-	leftfree = state->leftspace - olddataitemstoleft;
-	rightfree = state->rightspace -
-		(state->olddataitemstotal - olddataitemstoleft);
-
-	/*
-	 * The first item on the right page becomes the high key of the left page;
-	 * therefore it counts against left space as well as right space. When
-	 * index has included attributes, then those attributes of left page high
-	 * key will be truncated leaving that page with slightly more free space.
-	 * However, that shouldn't affect our ability to find valid split
-	 * location, because anyway split location should exists even without high
-	 * key truncation.
-	 */
-	leftfree -= firstrightitemsz;
-
-	/* account for the new item */
-	if (newitemonleft)
-		leftfree -= (int) state->newitemsz;
-	else
-		rightfree -= (int) state->newitemsz;
-
-	/*
-	 * If we are not on the leaf level, we will be able to discard the key
-	 * data from the first item that winds up on the right page.
-	 */
-	if (!state->is_leaf)
-		rightfree += (int) firstrightitemsz -
-			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
-
-	/*
-	 * If feasible split point, remember best delta.
-	 */
-	if (leftfree >= 0 && rightfree >= 0)
-	{
-		int			delta;
-
-		if (state->is_rightmost)
-		{
-			/*
-			 * If splitting a rightmost page, try to put (100-fillfactor)% of
-			 * free space on left page. See comments for _bt_findsplitloc.
-			 */
-			delta = (state->fillfactor * leftfree)
-				- ((100 - state->fillfactor) * rightfree);
-		}
-		else
-		{
-			/* Otherwise, aim for equal free space on both sides */
-			delta = leftfree - rightfree;
-		}
-
-		if (delta < 0)
-			delta = -delta;
-		if (!state->have_split || delta < state->best_delta)
-		{
-			state->have_split = true;
-			state->newitemonleft = newitemonleft;
-			state->firstright = firstoldonright;
-			state->best_delta = delta;
-		}
-	}
-}
-
 /*
  * _bt_insert_parent() -- Insert downlink into parent after a page split.
  *
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
new file mode 100644
index 0000000000..6027b1f1a1
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -0,0 +1,779 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsplitloc.c
+ *	  Choose split point code for Postgres btree implementation.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtsplitloc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "storage/lmgr.h"
+
+/* limits on split interval */
+#define LEAF_SPLIT_INTERVAL			9
+#define INTERNAL_SPLIT_INTERVAL		3
+
+typedef enum
+{
+	/* strategy for searching through materialized list of split points */
+	SPLIT_DEFAULT,				/* give some weight to truncation */
+	SPLIT_MANY_DUPLICATES,		/* find minimally distinguishing point */
+	SPLIT_SINGLE_VALUE			/* leave left page almost full */
+} SplitMode;
+
+typedef struct
+{
+	/* details of free space left by split */
+	int			curdelta;		/* current leftfree/rightfree delta */
+	int			leftfree;		/* space left on left page post-split */
+	int			rightfree;		/* space left on right page post-split */
+
+	/* split point identifying fields (returned by _bt_findsplitloc) */
+	OffsetNumber firstoldonright;	/* first item on new right page */
+	bool		newitemonleft;	/* new item goes on left, or right? */
+
+} SplitPoint;
+
+typedef struct
+{
+	/* context data for _bt_recordsplit */
+	Size		newitemsz;		/* size of new item to be inserted */
+	bool		is_leaf;		/* T if splitting a leaf page */
+	OffsetNumber newitemoff;	/* where the new item is to be inserted */
+	int			leftspace;		/* space available for items on left page */
+	int			rightspace;		/* space available for items on right page */
+	int			olddataitemstotal;	/* space taken by old items */
+
+	/* candidate split point data */
+	int			maxsplits;		/* maximum number of splits */
+	int			nsplits;		/* current number of splits */
+	SplitPoint *splits;			/* all candidate split points for page */
+	int			splitinterval;	/* current range of acceptable split points */
+} FindSplitData;
+
+static void _bt_recordsplit(FindSplitData *state,
+				OffsetNumber firstoldonright, bool newitemonleft,
+				int olddataitemstoleft, Size firstoldonrightsz);
+static void _bt_deltasortsplits(FindSplitData *state, double fillfactormult,
+					bool usemult);
+static int	_bt_splitcmp(const void *arg1, const void *arg2);
+static OffsetNumber _bt_bestsplitloc(Relation rel, Page page,
+				 FindSplitData *state, int perfectpenalty,
+				 OffsetNumber newitemoff, IndexTuple newitem,
+				 bool *newitemonleft);
+static int _bt_perfect_penalty(Relation rel, Page page,
+					FindSplitData *state, OffsetNumber newitemoff,
+					IndexTuple newitem, SplitMode *secondmode);
+static int _bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split, bool is_leaf);
+static inline IndexTuple _bt_split_lastleft(Page page, SplitPoint *split,
+				   IndexTuple newitem, OffsetNumber newitemoff);
+static inline IndexTuple _bt_split_firstright(Page page, SplitPoint *split,
+					 IndexTuple newitem, OffsetNumber newitemoff);
+
+
+/*
+ *	_bt_findsplitloc() -- find an appropriate place to split a page.
+ *
+ * The main goal here is to equalize the free space that will be on each
+ * split page, *after accounting for the inserted tuple*.  (If we fail to
+ * account for it, we might find ourselves with too little room on the page
+ * that it needs to go into!)
+ *
+ * If the page is the rightmost page on its level, we instead try to arrange
+ * to leave the left split page fillfactor% full.  In this way, when we are
+ * inserting successively increasing keys (consider sequences, timestamps,
+ * etc) we will end up with a tree whose pages are about fillfactor% full,
+ * instead of the 50% full result that we'd get without this special case.
+ * This is the same as nbtsort.c produces for a newly-created tree.  Note
+ * that leaf and nonleaf pages use different fillfactors.
+ *
+ * We are passed the intended insert position of the new tuple, expressed as
+ * the offsetnumber of the tuple it must go in front of (this could be
+ * maxoff+1 if the tuple is to go at the end).  The new tuple itself is also
+ * passed, since it's needed to give some weight to how effective suffix
+ * truncation will be.  The implementation picks the split point that
+ * maximizes the effectiveness of suffix truncation from a small list of
+ * alternative candidate split points that leave each side of the split with
+ * about the same share of free space.  Suffix truncation is secondary to
+ * equalizing free space, except in cases with large numbers of duplicates.
+ * Note that it is always assumed that caller goes on to perform truncation,
+ * even with pg_upgrade'd indexes where that isn't actually the case
+ * (!heapkeyspace indexes).  See nbtree/README for more information about
+ * suffix truncation.
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating whether the new tuple goes on
+ * the left or right page.  The bool is necessary to disambiguate the case
+ * where firstright == newitemoff.
+ *
+ * The high key for the left page is formed using the first item on the
+ * right page, which may seem to be contrary to Lehman & Yao's approach of
+ * using the left page's last item as its new high key on the leaf level.
+ * It isn't, though: suffix truncation will leave the left page's high key
+ * fully equal to the last item on the left page when two tuples with equal
+ * key values (excluding heap TID) enclose the split point.  It isn't
+ * necessary for a new leaf high key to be equal to the last item on the
+ * left for the L&Y "subtree" invariant to hold.  It's sufficient to make
+ * sure that the new leaf high key is strictly less than the first item on
+ * the right leaf page, and greater than the last item on the left page.
+ * When suffix truncation isn't possible, L&Y's exact approach to leaf
+ * splits is taken (actually, a tuple with all the keys from firstright but
+ * the heap TID from lastleft is formed, so as to not introduce a special
+ * case).
+ *
+ * Starting with the first right item minimizes the divergence between leaf
+ * and internal splits when checking if a candidate split point is legal.
+ * It is also inherently necessary for suffix truncation, since truncation
+ * is a subtractive process that specifically requires lastleft and
+ * firstright inputs.
+ */
+OffsetNumber
+_bt_findsplitloc(Relation rel,
+				 Page page,
+				 OffsetNumber newitemoff,
+				 Size newitemsz,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	BTPageOpaque opaque;
+	int			leftspace,
+				rightspace,
+				olddataitemstotal,
+				olddataitemstoleft,
+				perfectpenalty,
+				leaffillfactor;
+	FindSplitData state;
+	ItemId		itemid;
+	OffsetNumber offnum,
+				maxoff,
+				foundfirstright;
+	SplitMode	secondmode;
+	double		fillfactormult;
+	bool		usemult;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Total free space available on a btree page, after fixed overhead */
+	leftspace = rightspace =
+		PageGetPageSize(page) - SizeOfPageHeaderData -
+		MAXALIGN(sizeof(BTPageOpaqueData));
+
+	/* The right page will have the same high key as the old page */
+	if (!P_RIGHTMOST(opaque))
+	{
+		itemid = PageGetItemId(page, P_HIKEY);
+		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
+							 sizeof(ItemIdData));
+	}
+
+	/* Count up total space in data items before actually scanning 'em */
+	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
+	leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	state.newitemsz = newitemsz;
+	state.is_leaf = P_ISLEAF(opaque);
+	state.leftspace = leftspace;
+	state.rightspace = rightspace;
+	state.olddataitemstotal = olddataitemstotal;
+	state.newitemoff = newitemoff;
+
+	/*
+	 * Allocate work space for candidate split points.  Round up allocation to
+	 * BLCKSZ, so that palloc() will be able to recycle block later on, when a
+	 * temp buffer is used by _bt_split().  The work space would almost be a
+	 * full BLCKSZ even without this optimization.
+	 *
+	 * maxsplits should never exceed maxoff because there will be at most as
+	 * many candidate split points as there are points _between_ tuples, once
+	 * you imagine that the new item is already on the original page (the
+	 * final number of splits may be slightly lower because not all splits
+	 * will be legal).  Even still, add space for an extra two splits out of
+	 * sheer paranoia.
+	 */
+	state.maxsplits = maxoff + 2;
+	state.splits = palloc(Max(BLCKSZ, sizeof(SplitPoint) * state.maxsplits));
+	state.nsplits = 0;
+
+	/*
+	 * Scan through the data items and calculate space usage for a split at
+	 * each possible position.  We start at the first data offset rather than
+	 * the second data offset to handle the "newitemoff == first data offset"
+	 * case (otherwise, a split whose firstoldonright is the first data offset
+	 * can't be legal, and won't actually end up being recorded by
+	 * _bt_recordsplit).
+	 *
+	 * Still, it's typical for almost all calls to _bt_recordsplit to
+	 * determine that the split is legal, and therefore enter it into the
+	 * candidate split point array for later consideration.
+	 */
+	olddataitemstoleft = 0;
+
+	for (offnum = P_FIRSTDATAKEY(opaque);
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		Size		itemsz;
+
+		itemid = PageGetItemId(page, offnum);
+		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
+
+		/*
+		 * Will the new item go to left or right of split?
+		 */
+		if (offnum > newitemoff)
+			_bt_recordsplit(&state, offnum, true, olddataitemstoleft, itemsz);
+		else if (offnum < newitemoff)
+			_bt_recordsplit(&state, offnum, false, olddataitemstoleft, itemsz);
+		else
+		{
+			/* may need to record a split on one or both sides of new item */
+			_bt_recordsplit(&state, offnum, true, olddataitemstoleft, itemsz);
+			_bt_recordsplit(&state, offnum, false, olddataitemstoleft, itemsz);
+		}
+
+		olddataitemstoleft += itemsz;
+	}
+
+	/*
+	 * If the new item goes as the last item, record the split point that
+	 * leaves all the old items on the left page, and the new item on the
+	 * right page.  This is required because a split that leaves the new item
+	 * as the firstoldonright won't have been reached within the loop.  We
+	 * always record every possible split point.
+	 */
+	Assert(olddataitemstoleft == olddataitemstotal);
+	if (newitemoff > maxoff)
+		_bt_recordsplit(&state, newitemoff, false, olddataitemstotal, 0);
+
+	/*
+	 * I believe it is not possible to fail to find a feasible split, but just
+	 * in case ...
+	 */
+	if (state.nsplits == 0)
+		elog(ERROR, "could not find a feasible split point for index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	/*
+	 * Start search for a split point among list of legal split points.  Give
+	 * primary consideration to equalizing available free space in each half
+	 * of the split initially (start with default mode), while applying
+	 * rightmost optimization where appropriate.  Either of the two other
+	 * fallback modes may be required for cases with a large number of
+	 * duplicates around the original/space-optimal split point.
+	 *
+	 * Default mode gives some weight to suffix truncation in deciding a split
+	 * point on leaf pages.  It attempts to select a split point where a
+	 * distinguishing attribute appears earlier in the new high key for the
+	 * left side of the split, in order to maximize the number of trailing
+	 * attributes that can be truncated away.  Only candidate split points
+	 * that imply an acceptable balance of free space on each side are
+	 * considered.
+	 */
+	if (!state.is_leaf)
+	{
+		/* fillfactormult only used on rightmost page */
+		usemult = P_RIGHTMOST(opaque);
+		fillfactormult = BTREE_NONLEAF_FILLFACTOR / 100.0;
+	}
+	else if (P_RIGHTMOST(opaque))
+	{
+		/* Rightmost leaf page --  fillfactormult always used */
+		usemult = true;
+		fillfactormult = leaffillfactor / 100.0;
+	}
+	else
+	{
+		/* Other leaf page.  50:50 page split. */
+		usemult = false;
+		fillfactormult = 0.50;
+	}
+
+	/* Give split points a delta, based on fillfactormult, and sort */
+	_bt_deltasortsplits(&state, fillfactormult, usemult);
+
+	/*
+	 * Set an initial limit on the split interval/number of candidate split
+	 * points as appropriate.  The "Prefix B-Trees" paper refers to this as
+	 * sigma l for leaf splits and sigma b for internal ("branch") splits.
+	 * It's hard to provide a theoretical justification for the initial size
+	 * of the split interval, though it's clear that a small split interval
+	 * makes suffix truncation much more effective without noticeably
+	 * affecting space utilization over time.
+	 */
+	if (!state.is_leaf)
+		state.splitinterval = INTERNAL_SPLIT_INTERVAL;
+	else
+		state.splitinterval = Min(Max(3, maxoff * 0.05), LEAF_SPLIT_INTERVAL);
+
+	/*
+	 * Find lowest possible penalty among split points currently regarded as
+	 * acceptable -- the "perfect" penalty.  The perfect penalty often saves
+	 * _bt_bestsplitloc() additional work around calculating penalties.  This
+	 * is also a convenient point to determine if default mode worked out, or
+	 * if we should instead reassess which split points should be considered
+	 * acceptable (split interval, and possibly fillfactormult).
+	 */
+	perfectpenalty = _bt_perfect_penalty(rel, page, &state, newitemoff,
+										 newitem, &secondmode);
+
+	/*
+	 * Reconsider strategy when call to _bt_perfect_penalty() tells us to,
+	 * using the second mode it indicated.  We do all we can to avoid having
+	 * to append a heap TID in the new high key when default mode and its
+	 * initial fillfactormult won't be able to avoid that, including enlarging
+	 * split interval to consider all possible split points.
+	 *
+	 * Many duplicates mode may be used when a heap TID would otherwise be
+	 * appended, but the page isn't completely full of logical duplicates
+	 * (there may be as few as two distinct values).  The split interval is
+	 * widened to include all possible candidate split points.  Many
+	 * duplicates mode has no hard requirements for space utilization, though
+	 * it still keeps the use of space balanced as a non-binding secondary
+	 * goal (perfect penalty is set so that the first/lowest delta split
+	 * points that's minimally distinguishing is used).
+	 *
+	 * Single value mode is used when it is impossible to avoid appending a
+	 * heap TID.  It arranges to leave the left page very full.  This
+	 * maximizes space utilization in cases where tuples with the same
+	 * attribute values span many pages.  Newly inserted duplicates will tend
+	 * to have higher heap TID values, so we'll end up splitting to the right
+	 * consistently. (Single value mode is harmless though not particularly
+	 * useful with !heapkeyspace indexes.)
+	 */
+	if (secondmode == SPLIT_MANY_DUPLICATES)
+	{
+		Assert(state.is_leaf);
+		/* No need to resort splits -- no change in fillfactormult/deltas */
+		state.splitinterval = state.nsplits;
+		/* Settle for lowest delta split that avoids appending heap TID */
+		perfectpenalty = IndexRelationGetNumberOfKeyAttributes(rel);
+	}
+	else if (secondmode == SPLIT_SINGLE_VALUE)
+	{
+		Assert(state.is_leaf);
+		/* Split towards the end of the page */
+		usemult = true;
+		fillfactormult = BTREE_SINGLEVAL_FILLFACTOR / 100.0;
+		/* Resort split points with new delta */
+		_bt_deltasortsplits(&state, fillfactormult, usemult);
+		state.splitinterval = 1;
+		/* Accept that appending a heap TID is inevitable */
+		perfectpenalty = IndexRelationGetNumberOfKeyAttributes(rel) + 1;
+	}
+	else
+	{
+		/* Common case: default mode worked out, or internal page */
+		Assert(secondmode == SPLIT_DEFAULT);
+		/* Original perfectpenalty still stands */
+	}
+
+	/*
+	 * Search among acceptable split points (the first splitinterval points)
+	 * for the entry that has the lowest penalty, and is therefore expected to
+	 * maximize fan-out.  Sets *newitemonleft for us.
+	 */
+	foundfirstright = _bt_bestsplitloc(rel, page, &state, perfectpenalty,
+									   newitemoff, newitem, newitemonleft);
+	pfree(state.splits);
+
+	return foundfirstright;
+}
+
+/*
+ * Subroutine to record a particular point between two tuples (possibly the
+ * new item) on page (ie, combination of firstright and newitemonleft
+ * settings) in *state for later analysis.  This is also a convenient point
+ * to check if the split is legal (if it isn't, it won't be recorded).
+ *
+ * firstoldonright is the offset of the first item on the original page that
+ * goes to the right page, and firstoldonrightsz is the size of that tuple.
+ * firstoldonright can be > max offset, which means that all the old items go
+ * to the left page and only the new item goes to the right page.  In that
+ * case, firstoldonrightsz is not used.
+ *
+ * olddataitemstoleft is the total size of all old items to the left of the
+ * split point that is recorded here when legal.  Should not include
+ * newitemsz, since that is handled here.
+ */
+static void
+_bt_recordsplit(FindSplitData *state,
+				OffsetNumber firstoldonright,
+				bool newitemonleft,
+				int olddataitemstoleft,
+				Size firstoldonrightsz)
+{
+	int			leftfree,
+				rightfree;
+	Size		firstrightitemsz;
+	bool		newitemisfirstonright;
+
+	/* Is the new item going to be the first item on the right page? */
+	newitemisfirstonright = (firstoldonright == state->newitemoff
+							 && !newitemonleft);
+
+	if (newitemisfirstonright)
+		firstrightitemsz = state->newitemsz;
+	else
+		firstrightitemsz = firstoldonrightsz;
+
+	/* Account for all the old tuples */
+	leftfree = state->leftspace - olddataitemstoleft;
+	rightfree = state->rightspace -
+		(state->olddataitemstotal - olddataitemstoleft);
+
+	/*
+	 * The first item on the right page becomes the high key of the left page;
+	 * therefore it counts against left space as well as right space (we
+	 * cannot assume that suffix truncation will make it any smaller).  When
+	 * index has included attributes, then those attributes of left page high
+	 * key will be truncated leaving that page with slightly more free space.
+	 * However, that shouldn't affect our ability to find valid split
+	 * location, since we err in the direction of being pessimistic about free
+	 * space on the left half.  Besides, even when suffix truncation of
+	 * non-TID attributes occurs, the new high key often won't even be a
+	 * single MAXALIGN() quantum smaller than the firstright tuple it's based
+	 * on.
+	 *
+	 * If we are on the leaf level, assume that suffix truncation cannot avoid
+	 * adding a heap TID to the left half's new high key when splitting at the
+	 * leaf level.  In practice the new high key will often be smaller and
+	 * will rarely be larger, but conservatively assume the worst case.
+	 */
+	if (state->is_leaf)
+		leftfree -= (int) (firstrightitemsz +
+						   MAXALIGN(sizeof(ItemPointerData)));
+	else
+		leftfree -= (int) firstrightitemsz;
+
+	/* account for the new item */
+	if (newitemonleft)
+		leftfree -= (int) state->newitemsz;
+	else
+		rightfree -= (int) state->newitemsz;
+
+	/*
+	 * If we are not on the leaf level, we will be able to discard the key
+	 * data from the first item that winds up on the right page.
+	 */
+	if (!state->is_leaf)
+		rightfree += (int) firstrightitemsz -
+			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
+
+	/* Record split if legal */
+	if (leftfree >= 0 && rightfree >= 0)
+	{
+		/* 2 extra items in maxsplits shouldn't be necessary */
+		Assert(state->nsplits < state->maxsplits - 2);
+
+		state->splits[state->nsplits].curdelta = 0;
+		state->splits[state->nsplits].leftfree = leftfree;
+		state->splits[state->nsplits].rightfree = rightfree;
+		state->splits[state->nsplits].firstoldonright = firstoldonright;
+		state->splits[state->nsplits].newitemonleft = newitemonleft;
+		state->nsplits++;
+	}
+}
+
+/*
+ * Subroutine to assign space deltas to materialized array of candidate split
+ * points based on current fillfactor, and to sort array using that fillfactor
+ */
+static void
+_bt_deltasortsplits(FindSplitData *state, double fillfactormult,
+					bool usemult)
+{
+	for (int i = 0; i < state->nsplits; i++)
+	{
+		SplitPoint *split = state->splits + i;
+		int			delta;
+
+		if (usemult)
+			delta = fillfactormult * split->leftfree -
+				(1.0 - fillfactormult) * split->rightfree;
+		else
+			delta = split->leftfree - split->rightfree;
+
+		if (delta < 0)
+			delta = -delta;
+
+		/* Save delta */
+		split->curdelta = delta;
+	}
+
+	qsort(state->splits, state->nsplits, sizeof(SplitPoint), _bt_splitcmp);
+}
+
+/*
+ * qsort-style comparator used by _bt_deltasortsplits()
+ */
+static int
+_bt_splitcmp(const void *arg1, const void *arg2)
+{
+	SplitPoint *split1 = (SplitPoint *) arg1;
+	SplitPoint *split2 = (SplitPoint *) arg2;
+
+	if (split1->curdelta > split2->curdelta)
+		return 1;
+	if (split1->curdelta < split2->curdelta)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * Subroutine to find the "best" split point among an array of acceptable
+ * candidate split points that split without there being an excessively high
+ * delta between the space left free on the left and right halves.  The "best"
+ * split point is the split point with the lowest penalty, which is an
+ * abstract idea whose definition varies depending on whether we're splitting
+ * at the leaf level, or an internal level.  See _bt_split_penalty() for the
+ * definition.
+ *
+ * "perfectpenalty" is assumed to be the lowest possible penalty among
+ * candidate split points.  This allows us to return early without wasting
+ * cycles on calculating the first differing attribute for all candidate
+ * splits when that clearly cannot improve our choice.  This optimization is
+ * important for several common cases, including insertion into a primary key
+ * index on an auto-incremented or monotonically increasing integer column.
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating if new item is on left of split
+ * point.
+ */
+static OffsetNumber
+_bt_bestsplitloc(Relation rel,
+				 Page page,
+				 FindSplitData *state,
+				 int perfectpenalty,
+				 OffsetNumber newitemoff,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	int			bestpenalty,
+				lowsplit;
+	int			highsplit = Min(state->splitinterval, state->nsplits);
+
+	/*
+	 * No point in calculating penalty when there's only one choice.  Note
+	 * that single value mode always has one choice.
+	 */
+	if (state->nsplits == 1)
+	{
+		*newitemonleft = state->splits[0].newitemonleft;
+		return state->splits[0].firstoldonright;
+	}
+
+	bestpenalty = INT_MAX;
+	lowsplit = 0;
+	for (int i = lowsplit; i < highsplit; i++)
+	{
+		int			penalty;
+
+		penalty = _bt_split_penalty(rel, page, newitemoff, newitem,
+									state->splits + i, state->is_leaf);
+
+		if (penalty <= perfectpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+			break;
+		}
+
+		if (penalty < bestpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+		}
+	}
+
+	*newitemonleft = state->splits[lowsplit].newitemonleft;
+	return state->splits[lowsplit].firstoldonright;
+}
+
+/*
+ * Subroutine to find the lowest possible penalty for any acceptable candidate
+ * split point when still in default mode.  This may be lower than any real
+ * penalty for any of the candidate split points, in which case the
+ * optimization is ineffective.  Split penalties are discrete rather than
+ * continuous, so an actually-obtainable penalty is common.
+ *
+ * This is also a convenient point to decide to either finish splitting the
+ * page using default mode, or, alternatively, to consider alternative modes.
+ * (This can only happen with leaf pages.)
+ */
+static int
+_bt_perfect_penalty(Relation rel, Page page, FindSplitData *state,
+					OffsetNumber newitemoff, IndexTuple newitem,
+					SplitMode *secondmode)
+{
+	ItemId		itemid;
+	IndexTuple	leftmost,
+				rightmost;
+	SplitPoint *low,
+			   *high;
+	int			perfectpenalty;
+	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+	/* Assume that alternative-mode split won't be required for now */
+	*secondmode = SPLIT_DEFAULT;
+
+	/*
+	 * There are a much smaller number of candidate split points when
+	 * splitting an internal page, so we can afford to be exhaustive.  Only
+	 * give up when pivot that will be inserted into parent is as small as
+	 * possible.
+	 */
+	if (!state->is_leaf)
+		return MAXALIGN(sizeof(IndexTupleData) + 1);
+
+	/*
+	 * Use leftmost and rightmost tuples within current acceptable range of
+	 * split points (using current split interval).
+	 */
+	low = state->splits;
+	high = state->splits + (Min(state->splitinterval, state->nsplits) - 1);
+	leftmost = _bt_split_lastleft(page, low, newitem, newitemoff);
+	rightmost = _bt_split_firstright(page, high, newitem, newitemoff);
+	perfectpenalty = _bt_keep_natts_fast(rel, leftmost, rightmost);
+
+	/*
+	 * Work out which type of second pass caller should perform, if any, when
+	 * even their "perfect" penalty fails to avoid appending a heap TID to new
+	 * pivot tuple
+	 */
+	if (perfectpenalty > indnkeyatts)
+	{
+		BTPageOpaque opaque;
+		OffsetNumber maxoff;
+		int			origpagepenalty;
+
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If page has many duplicates but is not entirely full of duplicates,
+		 * a many duplicates mode split will be performed.  If page is
+		 * entirely full of duplicates and it appears that the duplicates have
+		 * been inserted in sequential order (i.e. heap TID order), a single
+		 * value mode split will be performed.
+		 *
+		 * Deliberately ignore new item here, since a split that leaves only
+		 * one item on either page is often deemed unworkable by
+		 * _bt_recordsplit().
+		 */
+		itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+		leftmost = (IndexTuple) PageGetItem(page, itemid);
+		itemid = PageGetItemId(page, maxoff);
+		rightmost = (IndexTuple) PageGetItem(page, itemid);
+		origpagepenalty = _bt_keep_natts_fast(rel, leftmost, rightmost);
+
+		if (origpagepenalty <= indnkeyatts)
+			*secondmode = SPLIT_MANY_DUPLICATES;
+		else if (P_RIGHTMOST(opaque))
+			*secondmode = SPLIT_SINGLE_VALUE;
+		else
+		{
+			itemid = PageGetItemId(page, P_HIKEY);
+			if (ItemIdGetLength(itemid) !=
+				IndexTupleSize(newitem) + MAXALIGN(sizeof(ItemPointerData)))
+				*secondmode = SPLIT_SINGLE_VALUE;
+			else
+			{
+				IndexTuple	hikey;
+
+				hikey = (IndexTuple) PageGetItem(page, itemid);
+				origpagepenalty = _bt_keep_natts_fast(rel, hikey, newitem);
+				if (origpagepenalty <= indnkeyatts)
+					*secondmode = SPLIT_SINGLE_VALUE;
+			}
+		}
+
+		/*
+		 * Have caller finish the split using default mode when the new item
+		 * does not appear to be a duplicate inserted into the rightmost page
+		 * that duplicates can be found on (found by a scan that omits
+		 * scantid).  Sharing space evenly between the two halves of the
+		 * split avoids pathological performance in that case.
+		 *
+		 * Note that single value mode should generally still be used when
+		 * duplicate insertions have heap TIDs that are only slightly out of
+		 * order, which is most likely just an artifact of concurrency.
+		 */
+	}
+
+	return perfectpenalty;
+}
+
+/*
+ * Subroutine to find penalty for caller's candidate split point.
+ *
+ * On leaf pages, penalty is the attribute number that distinguishes each side
+ * of a split.  It's the last attribute that needs to be included in new high
+ * key for left page.  It can be greater than the number of key attributes in
+ * cases where a heap TID will need to be appended during truncation.
+ *
+ * On internal pages, penalty is simply the size of the first item on the
+ * right half of the split (excluding ItemId overhead) which becomes the new
+ * high key for the left page.
+ */
+static int
+_bt_split_penalty(Relation rel, Page page, OffsetNumber newitemoff,
+				  IndexTuple newitem, SplitPoint *split, bool is_leaf)
+{
+	IndexTuple	lastleftuple;
+	IndexTuple	firstrighttuple;
+
+	firstrighttuple = _bt_split_firstright(page, split, newitem, newitemoff);
+
+	if (!is_leaf)
+		return IndexTupleSize(firstrighttuple);
+
+	lastleftuple = _bt_split_lastleft(page, split, newitem, newitemoff);
+
+	Assert(lastleftuple != firstrighttuple);
+	return _bt_keep_natts_fast(rel, lastleftuple, firstrighttuple);
+}
+
+/*
+ * Subroutine to get a lastleft IndexTuple for a split point from page
+ */
+static inline IndexTuple
+_bt_split_lastleft(Page page, SplitPoint *split, IndexTuple newitem,
+				   OffsetNumber newitemoff)
+{
+	ItemId		itemid;
+
+	if (split->newitemonleft && split->firstoldonright == newitemoff)
+		return newitem;
+
+	itemid = PageGetItemId(page, OffsetNumberPrev(split->firstoldonright));
+	return (IndexTuple) PageGetItem(page, itemid);
+}
+
+/*
+ * Subroutine to get a firstright IndexTuple for a split point from page
+ */
+static inline IndexTuple
+_bt_split_firstright(Page page, SplitPoint *split, IndexTuple newitem,
+					 OffsetNumber newitemoff)
+{
+	ItemId		itemid;
+
+	if (!split->newitemonleft && split->firstoldonright == newitemoff)
+		return newitem;
+
+	itemid = PageGetItemId(page, split->firstoldonright);
+	return (IndexTuple) PageGetItem(page, itemid);
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 15090b26d2..146de1b2e4 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -22,6 +22,7 @@
 #include "access/relscan.h"
 #include "miscadmin.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -2318,6 +2319,54 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	return keepnatts;
 }
 
+/*
+ * _bt_keep_natts_fast - fast, approximate variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location.  A naive bitwise approach to datum comparisons is used to
+ * save cycles.  This is inherently approximate, but usually provides the same
+ * answer as the authoritative approach that _bt_keep_natts takes, since the
+ * vast majority of types in Postgres cannot be equal according to any
+ * available opclass unless they're bitwise equal.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in the new pivot tuple.
+ */
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= keysz; attnum++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		att = TupleDescAttr(itupdesc, attnum - 1);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
+}
+
 /*
  *  _bt_check_natts() -- Verify tuple has expected number of attributes.
  *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c24b9a7c37..31815c63fe 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -168,11 +168,15 @@ typedef struct BTMetaPageData
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
  * a rightmost page; when splitting non-rightmost pages we try to
- * divide the data equally.
+ * divide the data equally.  When splitting a page that's entirely
+ * filled with a single value (duplicates), the effective leaf-page
+ * fillfactor is 96%, regardless of whether the page is a rightmost
+ * page.
  */
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#define BTREE_SINGLEVAL_FILLFACTOR	96
 
 /*
  *	In general, the btree code tries to localize its knowledge about
@@ -681,6 +685,13 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 
+/*
+ * prototypes for functions in nbtsplitloc.c
+ */
+extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
+				 OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+				 bool *newitemonleft);
+
 /*
  * prototypes for functions in nbtpage.c
  */
@@ -747,6 +758,8 @@ extern bool btproperty(Oid index_oid, int attno,
 		   bool *res, bool *isnull);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 			 IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
+					IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 				OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
-- 
2.17.1
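
To make the split penalty numbers in the suffix-truncation patch above a
little more concrete, here is a purely hypothetical illustration.  Assume a
two-column index on (last_name, first_name), so indnkeyatts = 2; the column
names and values are invented for the example and don't come from the patch
or any test case, but the penalty semantics follow _bt_keep_natts_fast() and
_bt_split_penalty() as written:

    lastleft = ('Jones', 'Zoe'),   firstright = ('Smith', 'Alice')  => penalty 1
        (the new high key could be truncated to just its first attribute, ('Smith'))
    lastleft = ('Smith', 'Alice'), firstright = ('Smith', 'Bob')    => penalty 2
        (both key attributes must be kept in the new high key)
    lastleft = ('Smith', 'Alice'), firstright = ('Smith', 'Alice')  => penalty 3
        (penalty > indnkeyatts, so a heap TID would have to be appended to the new pivot)

_bt_bestsplitloc() prefers the candidate split with the lowest such penalty,
since that yields the smallest new pivot tuple and the most effective suffix
truncation.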

Attachment: v13-0001-Refactor-nbtree-insertion-scankeys.patch (text/x-patch)
From 77f4117cb16aa94cc19343a28db2398e8b32f53b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 29 Dec 2018 15:34:48 -0800
Subject: [PATCH v13 1/7] Refactor nbtree insertion scankeys.

Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.

Use the new struct to store mutable state about an in-progress binary
search, rather than having _bt_check_unique() callers cache
_bt_binsrch() effort in an ad-hoc manner.  This makes it easy to add a
new optimization: _bt_check_unique() now falls out of its loop
immediately in the common case where it's already clear that there
couldn't possibly be a duplicate.  More importantly, the new
_bt_check_unique() scheme makes it a lot easier to manage cached binary
search effort afterwards, from within _bt_findinsertloc().  This is
needed for the upcoming patch to make nbtree tuples unique by treating
heap TID as a final tie-breaker column.

Based on a suggestion by Andrey Lepikhov.
---
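(Editorial aside, not part of the patch: for anyone reading the diff without
the full nbtree.h hunk in front of them, the following is a rough sketch of
the struct this patch introduces, reconstructed only from how the diff below
uses it.  The authoritative definition lives in the nbtree.h changes, and its
exact field order and comments may differ.)

typedef struct BTScanInsertData
{
	/* Mutable state: bounds saved/restored across _bt_binsrch() calls */
	bool		savebinsrch;	/* have _bt_binsrch() save its bounds? */
	bool		restorebinsrch; /* reuse bounds from a previous search? */
	OffsetNumber low;			/* lower bound from previous binary search */
	OffsetNumber stricthigh;	/* strict upper bound from previous search */

	/* State describing the search itself */
	bool		nextkey;		/* find first item > scan key, not >= ? */
	int			keysz;			/* number of entries used in scankeys[] */
	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* per-attribute comparison info */
} BTScanInsertData;

typedef BTScanInsertData *BTScanInsert;

(_bt_first() appears to place a BTScanInsertData on the stack and fill
inskey.scankeys[] directly, while _bt_mkscankey() pallocs only as many
scankeys[] entries as the index has key columns.)
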
 contrib/amcheck/verify_nbtree.c       |  52 ++--
 src/backend/access/nbtree/nbtinsert.c | 334 ++++++++++++++++----------
 src/backend/access/nbtree/nbtpage.c   |  12 +-
 src/backend/access/nbtree/nbtsearch.c | 168 ++++++++-----
 src/backend/access/nbtree/nbtsort.c   |   8 +-
 src/backend/access/nbtree/nbtutils.c  |  98 +++-----
 src/backend/utils/sort/tuplesort.c    |  16 +-
 src/include/access/nbtree.h           |  61 ++++-
 8 files changed, 429 insertions(+), 320 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 964200a767..053ac9d192 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -126,9 +126,9 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
-static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+static BTScanInsert bt_right_page_check_scankey(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
@@ -138,14 +138,14 @@ static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber upperbound);
 static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber lowerbound);
 static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
+							   BTScanInsert key,
+							   Page nontarget,
 							   OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
 
@@ -837,8 +837,8 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		BTScanInsert skey;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -1029,7 +1029,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
-			ScanKey		rightkey;
+			BTScanInsert	rightkey;
 
 			/* Get item in next/right page */
 			rightkey = bt_right_page_check_scankey(state);
@@ -1081,7 +1081,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, skey, childblock);
 		}
 	}
 
@@ -1110,11 +1110,12 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
+static BTScanInsert
 bt_right_page_check_scankey(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
+	IndexTuple	firstitup;
 	BlockNumber targetnext;
 	Page		rightpage;
 	OffsetNumber nline;
@@ -1302,8 +1303,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * Return first real item scankey.  Note that this relies on right page
 	 * memory remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
+	return _bt_mkscankey(state->rel, firstitup);
 }
 
 /*
@@ -1316,8 +1317,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  * verification this way around is much more practical.
  */
 static void
-bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1422,8 +1423,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1863,13 +1863,12 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
+invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
@@ -1882,13 +1881,12 @@ invariant_leq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
+invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
 	return cmp >= 0;
 }
@@ -1904,14 +1902,12 @@ invariant_geq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							   Page nontarget, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
 	return cmp <= 0;
 }
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index a4d46a0f44..3c968bad89 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -51,19 +51,19 @@ typedef struct
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
-static TransactionId _bt_check_unique(Relation rel, IndexTuple itup,
-				 Relation heapRel, Buffer buf, OffsetNumber offset,
-				 ScanKey itup_scankey,
+static TransactionId _bt_check_unique(Relation rel, BTScanInsert itup_key,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken);
-static void _bt_findinsertloc(Relation rel,
+static OffsetNumber _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_key,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool checkingunique,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel);
+static bool _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -83,8 +83,8 @@ static void _bt_checksplitloc(FindSplitData *state,
 				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey);
+static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key,
+			Page page, OffsetNumber offnum);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
@@ -97,7 +97,9 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
  *		UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
  *		For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- *		don't actually insert.
+ *		don't actually insert.  If rel is a unique index, then every call
+ *		here is a checkingunique call (i.e. every call does a duplicate
+ *		check, though perhaps only a tentative check).
  *
  *		The result value is only significant for UNIQUE_CHECK_PARTIAL:
  *		it must be true if the entry is known unique, else false.
@@ -110,18 +112,14 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel)
 {
 	bool		is_unique = false;
-	int			indnkeyatts;
-	ScanKey		itup_scankey;
+	BTScanInsert itup_key;
 	BTStack		stack = NULL;
 	Buffer		buf;
-	OffsetNumber offset;
 	bool		fastpath;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	Assert(indnkeyatts != 0);
+	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);
 
 	/* we need an insertion scan key to do our search, so build one */
-	itup_scankey = _bt_mkscankey(rel, itup);
+	itup_key = _bt_mkscankey(rel, itup);
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -144,7 +142,6 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	 */
 top:
 	fastpath = false;
-	offset = InvalidOffsetNumber;
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
 		Size		itemsz;
@@ -179,8 +176,7 @@ top:
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, itup_key, page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -219,8 +215,7 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, itup_key, &buf, BT_WRITE, NULL);
 	}
 
 	/*
@@ -244,13 +239,12 @@ top:
 	 * let the tuple in and return false for possibly non-unique, or true for
 	 * definitely unique.
 	 */
-	if (checkUnique != UNIQUE_CHECK_NO)
+	if (checkingunique)
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
-		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
+		xwait = _bt_check_unique(rel, itup_key, itup, heapRel, buf,
 								 checkUnique, &is_unique, &speculativeToken);
 
 		if (TransactionIdIsValid(xwait))
@@ -277,6 +271,8 @@ top:
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		OffsetNumber newitemoff;
+
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -287,10 +283,16 @@ top:
 		 * attributes are not considered part of the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
-		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+
+		/*
+		 * Do the insertion.  Note that itup_key contains mutable state used
+		 * by _bt_check_unique to help _bt_findinsertloc avoid repeating its
+		 * binary search.  !checkingunique case must start own binary search.
+		 */
+		newitemoff = _bt_findinsertloc(rel, itup_key, &buf, checkingunique,
+									   itup, stack, heapRel);
+		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, newitemoff,
+					   false);
 	}
 	else
 	{
@@ -301,7 +303,7 @@ top:
 	/* be tidy */
 	if (stack)
 		_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
+	pfree(itup_key);
 
 	return is_unique;
 }
@@ -309,9 +311,9 @@ top:
 /*
  *	_bt_check_unique() -- Check for violation of unique index constraint
  *
- * offset points to the first possible item that could conflict. It can
- * also point to end-of-page, which means that the first tuple to check
- * is the first tuple on the next page.
+ * Sets state in itup_key sufficient for later _bt_findinsertloc() call to
+ * reuse most of the work of our initial binary search to find conflicting
+ * tuples.
  *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
@@ -326,14 +328,14 @@ top:
  * core code must redo the uniqueness check later.
  */
 static TransactionId
-_bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
-				 Buffer buf, OffsetNumber offset, ScanKey itup_scankey,
+_bt_check_unique(Relation rel, BTScanInsert itup_key,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	SnapshotData SnapshotDirty;
+	OffsetNumber offset;
 	OffsetNumber maxoff;
 	Page		page;
 	BTPageOpaque opaque;
@@ -349,9 +351,17 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
+	/*
+	 * Save binary search bounds.  Note that this is also used within
+	 * _bt_findinsertloc() later.
+	 */
+	itup_key->savebinsrch = true;
+	offset = _bt_binsrch(rel, itup_key, buf);
+
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
+	Assert(itup_key->low == offset);
 	for (;;)
 	{
 		ItemId		curitemid;
@@ -364,6 +374,26 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 		 */
 		if (offset <= maxoff)
 		{
+			/*
+			 * Fastpath: _bt_binsrch() search bounds can be used to limit our
+			 * consideration to items that are definitely duplicates in most
+			 * cases (though not when original page is empty, or when initial
+			 * offset is past the end of the original page, which may indicate
+			 * that we'll have to examine a second or subsequent page).
+			 *
+			 * Note that this optimization avoids calling _bt_isequal()
+			 * entirely when there are no duplicates, provided initial offset
+			 * isn't past end of the initial page (and provided page has at
+			 * least one item).
+			 */
+			if (nbuf == InvalidBuffer && offset == itup_key->stricthigh)
+			{
+				Assert(itup_key->low >= P_FIRSTDATAKEY(opaque));
+				Assert(itup_key->low <= itup_key->stricthigh);
+				Assert(!_bt_isequal(itupdesc, itup_key, page, offset));
+				break;
+			}
+
 			curitemid = PageGetItemId(page, offset);
 
 			/*
@@ -378,7 +408,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			 * first, so that we didn't actually get to exit any sooner
 			 * anyway. So now we just advance over killed items as quickly as
 			 * we can. We only apply _bt_isequal() when we get to a non-killed
-			 * item or the end of the page.
+			 * item.
 			 */
 			if (!ItemIdIsDead(curitemid))
 			{
@@ -391,7 +421,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 				 * in real comparison, but only for ordering/finding items on
 				 * pages. - vadim 03/24/97
 				 */
-				if (!_bt_isequal(itupdesc, page, offset, indnkeyatts, itup_scankey))
+				if (!_bt_isequal(itupdesc, itup_key, page, offset))
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
@@ -552,11 +582,14 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
+			int			highkeycmp;
+
 			/* If scankey == hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+			Assert(highkeycmp <= 0);
+			if (highkeycmp != 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -611,39 +644,40 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
  *		Once we have chosen the page to put the key on, we'll insert it before
  *		any existing equal keys because of the way _bt_binsrch() works.
  *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
+ *		_bt_check_unique() callers arrange for their insertion scan key to
+ *		save the progress of the last binary search performed.  No additional
+ *		binary search comparisons occur in the common case where there was no
+ *		existing duplicate tuple, though we may occasionally still not be able
+ *		to reuse their work for our own reasons.  Even when there are garbage
+ *		duplicates, very few binary search comparisons that are not strictly
+ *		necessary will be performed.
  *
- *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
- *		page returned to the caller is exclusively locked instead.
+ *		The caller should hold an exclusive lock on *bufptr in all cases.  On
+ *		exit, *bufptr points to the chosen insert location.  If we have to
+ *		move right, the lock and pin on the original page will be released,
+ *		and the new page returned to the caller is exclusively locked
+ *		instead.  In any case, we return the offset that the caller should
+ *		use to insert into the buffer pointed to by *bufptr on return.
  *
- *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.  It has to happen here, since it may invalidate a
+ *		_bt_check_unique() caller's cached binary search work.
  */
-static void
+static OffsetNumber
 _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_key,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool checkingunique,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel)
 {
 	Buffer		buf = *bufptr;
 	Page		page = BufferGetPage(buf);
+	bool		restorebinsrch = checkingunique;
 	Size		itemsz;
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
 	OffsetNumber newitemoff;
-	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -672,55 +706,36 @@ _bt_findinsertloc(Relation rel,
 				 errtableconstraint(heapRel,
 									RelationGetRelationName(rel))));
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
-	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	for (;;)
 	{
+		int			cmpval;
 		Buffer		rbuf;
 		BlockNumber rblkno;
 
 		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
+		 * The checkingunique (restorebinsrch) case may well have established
+		 * bounds within _bt_check_unique()'s binary search that preclude the
+		 * need for a further high key check.  This fastpath isn't used when
+		 * there are no items on the existing page (other than high key), or
+		 * when it looks like the new item belongs last on the page, but it
+		 * might go on a later page instead.
 		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
-		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
+		if (restorebinsrch && itup_key->low <= itup_key->stricthigh &&
+			itup_key->stricthigh <= PageGetMaxOffsetNumber(page))
+			break;
 
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
-		}
+		if (P_RIGHTMOST(lpageop))
+			break;
+		cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
 
 		/*
-		 * nope, so check conditions (b) and (c) enumerated above
+		 * May have to handle case where there is a choice of which page to
+		 * place new tuple on, and we must balance space utilization as best
+		 * we can.
 		 */
-		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
+												&restorebinsrch, itemsz))
 			break;
 
 		/*
@@ -763,27 +778,98 @@ _bt_findinsertloc(Relation rel,
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		restorebinsrch = false;
+	}
+
+	/* Loop should not break until correct page located */
+	Assert(P_RIGHTMOST(lpageop) ||
+		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+
+	/*
+	 * Perform microvacuuming of the page we're about to insert tuple on if it
+	 * looks like it has LP_DEAD items.  Only microvacuum when it's likely to
+	 * forestall a page split, though.
+	 */
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < itemsz)
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		restorebinsrch = false;
 	}
 
 	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
+	 * Reuse binary search bounds established within _bt_check_unique if
+	 * caller is checkingunique caller, and first page locked is also where
+	 * new tuple should be inserted
 	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
-	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	itup_key->restorebinsrch = restorebinsrch;
+	newitemoff = _bt_binsrch(rel, itup_key, buf);
+	Assert(!itup_key->restorebinsrch);
+	Assert(!restorebinsrch || newitemoff == _bt_binsrch(rel, itup_key, buf));
 
 	*bufptr = buf;
-	*offsetptr = newitemoff;
+	return newitemoff;
+}
+
+/*
+ *	_bt_useduplicatepage() -- Settle for this page of duplicates?
+ *
+ *		This function handles the question of whether or not an insertion
+ *		of a duplicate into a pg_upgrade'd !heapkeyspace index should
+ *		insert on the page contained in buf when a choice must be made.
+ *		Preemptive microvacuuming is performed here when that could allow
+ *		caller to insert on to the page in buf.
+ *
+ *		Returns true if caller should proceed with insert on buf's page.
+ *		Otherwise, caller should move on to the page to the right (caller
+ *		must always be able to still move right following call here).
+ */
+static bool
+_bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz)
+{
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque lpageop;
+
+	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	Assert(P_ISLEAF(lpageop) && !P_RIGHTMOST(lpageop));
+
+	/* Easy case -- there is space free on this page already */
+	if (PageGetFreeSpace(page) >= itemsz)
+		return true;
+
+	if (P_HAS_GARBAGE(lpageop))
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		*restorebinsrch = false;
+		if (PageGetFreeSpace(page) >= itemsz)
+			return true;		/* OK, now we have enough space */
+	}
+
+	/*----------
+	 * It's now clear that _bt_findinsertloc() caller will need to split
+	 * the page if it is to insert new item on to it.  The choice to move
+	 * right to the next page remains open to it, but we should not search
+	 * for free space exhaustively when there are many pages to look through.
+	 *
+	 *	_bt_findinsertloc() keeps scanning right until it:
+	 *		(a) reaches the last page where the tuple can legally go
+	 *	Or until we:
+	 *		(b) find a page with enough free space, or
+	 *		(c) get tired of searching.
+	 * (c) is not flippant; it is important because if there are many
+	 * pages' worth of equal keys, it's better to split one of the early
+	 * pages than to scan all the way to the end of the run of equal keys
+	 * on every insert.  We implement "get tired" as a random choice,
+	 * since stopping after scanning a fixed number of pages wouldn't work
+	 * well (we'd never reach the right-hand side of previously split
+	 * pages).  The probability of moving right is set at 0.99, which may
+	 * seem too high to change the behavior much, but it does an excellent
+	 * job of preventing O(N^2) behavior with many equal keys.
+	 *----------
+	 */
+	return random() <= (MAX_RANDOM_VALUE / 100);
 }
 
 /*----------
@@ -1189,8 +1275,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	 * If the page we're splitting is not the rightmost page at its level in
 	 * the tree, then the first entry on the page is the high key for the
 	 * page.  We need to copy that to the right half.  Otherwise (meaning the
-	 * rightmost page case), all the items on the right half will be user
-	 * data.
+	 * rightmost page case), all the items on the right half will be user data
+	 * (there is no existing high key that needs to be relocated to the new
+	 * right page).
 	 */
 	rightoff = P_HIKEY;
 
@@ -2312,24 +2399,21 @@ _bt_pgaddtup(Page page,
  * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
-_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey)
+_bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
+			OffsetNumber offnum)
 {
 	IndexTuple	itup;
+	ScanKey		scankey;
 	int			i;
 
-	/* Better be comparing to a leaf item */
+	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
 
+	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
-	for (i = 1; i <= keysz; i++)
+	for (i = 1; i <= itup_key->keysz; i++)
 	{
 		AttrNumber	attno;
 		Datum		datum;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c785bca95..56041c3d38 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1371,7 +1371,7 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 */
 			if (!stack)
 			{
-				ScanKey		itup_scankey;
+				BTScanInsert itup_key;
 				ItemId		itemid;
 				IndexTuple	targetkey;
 				Buffer		lbuf;
@@ -1421,12 +1421,10 @@ _bt_pagedel(Relation rel, Buffer buf)
 				}
 
 				/* we need an insertion scan key for the search, so build one */
-				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
-				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
-				/* don't need a pin on the page */
+				itup_key = _bt_mkscankey(rel, targetkey);
+				/* get stack to leaf page by searching index */
+				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
+				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
 				/*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 92832237a8..7940297305 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -71,13 +71,9 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
  * but it can omit the rightmost column(s) of the index.
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * Return value is a stack of parent-page pointers.  *bufP is set to the
  * address of the leaf-page buffer, which is read-locked and pinned.
  * No locks are held on the parent pages, however!
@@ -93,8 +89,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+		   Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -130,8 +126,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  (access == BT_WRITE), stack_in,
+		*bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
 		/* if this is a leaf page, we're done */
@@ -144,7 +139,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -198,8 +193,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  true, stack_in, BT_WRITE, snapshot);
+		*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+							  snapshot);
 	}
 
 	return stack_in;
@@ -215,16 +210,17 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * or strictly to the right of it.
  *
  * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page.  If that entry
- * is strictly less than the scankey, or <= the scankey in the nextkey=true
- * case, then we followed the wrong link and we need to move right.
+ * tree by examining the high key entry on the page.  If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key.  When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
  *
  * If forupdate is true, we will attempt to finish any incomplete splits
  * that we encounter.  This is required when locking a target page for an
@@ -241,10 +237,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  */
 Buffer
 _bt_moveright(Relation rel,
+			  BTScanInsert key,
 			  Buffer buf,
-			  int keysz,
-			  ScanKey scankey,
-			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
 			  int access,
@@ -269,7 +263,7 @@ _bt_moveright(Relation rel,
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
 	 */
-	cmpval = nextkey ? 0 : 1;
+	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
@@ -304,7 +298,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -324,13 +318,6 @@ _bt_moveright(Relation rel,
 /*
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
  * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
  * particular, this means it is possible to return a value 1 greater than the
@@ -346,27 +333,45 @@ _bt_moveright(Relation rel,
  *
  * This procedure is not responsible for walking right, it just examines
  * the given page.  _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
+ * on the buffer.  When key.savebinsrch is set, modifies mutable fields
+ * of insertion scan key, so that a subsequent call where the caller sets
+ * key.restorebinsrch can reuse the low and strict high bound of the
+ * original binary search.  Callers that use these fields directly must be
+ * prepared for the case where stricthigh isn't on the same page (it
+ * exceeds maxoff for the page), and the case where there are no items
+ * on the page (high < low).
  */
 OffsetNumber
 _bt_binsrch(Relation rel,
-			Buffer buf,
-			int keysz,
-			ScanKey scankey,
-			bool nextkey)
+			BTScanInsert key,
+			Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber low,
-				high;
+				high,
+				stricthigh;
 	int32		result,
 				cmpval;
+	bool		isleaf;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	low = P_FIRSTDATAKEY(opaque);
-	high = PageGetMaxOffsetNumber(page);
+	if (!key->restorebinsrch)
+	{
+		low = P_FIRSTDATAKEY(opaque);
+		high = PageGetMaxOffsetNumber(page);
+		isleaf = P_ISLEAF(opaque);
+	}
+	else
+	{
+		/* Restore result of previous binary search against same page */
+		Assert(P_ISLEAF(opaque));
+		low = key->low;
+		high = key->stricthigh;
+		isleaf = true;
+	}
 
 	/*
 	 * If there are no keys on the page, return the first available slot. Note
@@ -375,8 +380,19 @@ _bt_binsrch(Relation rel,
 	 * This can never happen on an internal page, however, since they are
 	 * never empty (an internal page must have children).
 	 */
-	if (high < low)
+	if (unlikely(high < low))
+	{
+		if (key->savebinsrch)
+		{
+			Assert(isleaf);
+			/* Caller can't use stricthigh */
+			key->low = low;
+			key->stricthigh = high;
+		}
+		key->savebinsrch = false;
+		key->restorebinsrch = false;
 		return low;
+	}
 
 	/*
 	 * Binary search to find the first key on the page >= scan key, or first
@@ -390,9 +406,12 @@ _bt_binsrch(Relation rel,
 	 *
 	 * We can fall out when high == low.
 	 */
-	high++;						/* establish the loop invariant for high */
+	if (!key->restorebinsrch)
+		high++;					/* establish the loop invariant for high */
+	key->restorebinsrch = false;
+	stricthigh = high;			/* high initially strictly higher */
 
-	cmpval = nextkey ? 0 : 1;	/* select comparison value */
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
 
 	while (high > low)
 	{
@@ -400,12 +419,21 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
 		else
+		{
 			high = mid;
+
+			/*
+			 * high can only be reused by more restrictive binary search when
+			 * it's known to be strictly greater than the original scankey
+			 */
+			if (result != 0)
+				stricthigh = high;
+		}
 	}
 
 	/*
@@ -415,7 +443,14 @@ _bt_binsrch(Relation rel,
 	 * On a leaf page, we always return the first key >= scan key (resp. >
 	 * scan key), which could be the last slot + 1.
 	 */
-	if (P_ISLEAF(opaque))
+	if (key->savebinsrch)
+	{
+		Assert(isleaf);
+		key->low = low;
+		key->stricthigh = stricthigh;
+		key->savebinsrch = false;
+	}
+	if (isleaf)
 		return low;
 
 	/*
@@ -428,13 +463,8 @@ _bt_binsrch(Relation rel,
 }
 
 /*----------
- *	_bt_compare() -- Compare scankey to a particular tuple on the page.
+ *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- *	keysz: number of key conditions to be checked (might be less than the
- *		number of index columns!)
  *	page/offnum: location of btree item to be compared to.
  *
  *		This routine returns:
@@ -455,17 +485,17 @@ _bt_binsrch(Relation rel,
  */
 int32
 _bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
+			BTScanInsert key,
 			Page page,
 			OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
-	int			i;
+	ScanKey		scankey;
 
 	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
@@ -488,7 +518,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	scankey = key->scankeys;
+	for (int i = 1; i <= key->keysz; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -574,8 +605,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat;
 	bool		nextkey;
 	bool		goback;
+	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
-	ScanKeyData scankeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -821,8 +852,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
 	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built in the local
-	 * scankeys[] array, using the keys identified by startKeys[].
+	 * identified above.  The insertion scankey is built using the keys
+	 * identified by startKeys[].  (Remaining insertion scankey fields are
+	 * initialized after initial-positioning strategy is finalized.)
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
 	for (i = 0; i < keysCount; i++)
@@ -850,7 +882,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				_bt_parallel_done(scan);
 				return false;
 			}
-			memcpy(scankeys + i, subkey, sizeof(ScanKeyData));
+			memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
 
 			/*
 			 * If the row comparison is the last positioning key we accepted,
@@ -882,7 +914,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					if (subkey->sk_flags & SK_ISNULL)
 						break;	/* can't use null keys */
 					Assert(keysCount < INDEX_MAX_KEYS);
-					memcpy(scankeys + keysCount, subkey, sizeof(ScanKeyData));
+					memcpy(inskey.scankeys + keysCount, subkey,
+						   sizeof(ScanKeyData));
 					keysCount++;
 					if (subkey->sk_flags & SK_ROW_END)
 					{
@@ -928,7 +961,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				FmgrInfo   *procinfo;
 
 				procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
-				ScanKeyEntryInitializeWithInfo(scankeys + i,
+				ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
 											   cur->sk_flags,
 											   cur->sk_attno,
 											   InvalidStrategy,
@@ -949,7 +982,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
 						 BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
 						 cur->sk_attno, RelationGetRelationName(rel));
-				ScanKeyEntryInitialize(scankeys + i,
+				ScanKeyEntryInitialize(inskey.scankeys + i,
 									   cur->sk_flags,
 									   cur->sk_attno,
 									   InvalidStrategy,
@@ -1052,12 +1085,17 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
+	/* Initialize remaining insertion scan key fields */
+	inskey.savebinsrch = inskey.restorebinsrch = false;
+	inskey.low = inskey.stricthigh = InvalidOffsetNumber;
+	inskey.nextkey = nextkey;
+	inskey.keysz = keysCount;
+
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
@@ -1086,7 +1124,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, &inskey, buf);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e1186..759859c302 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -254,6 +254,7 @@ typedef struct BTWriteState
 {
 	Relation	heap;
 	Relation	index;
+	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
@@ -531,6 +532,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
+	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 
 	/*
 	 * We need to log index creation in WAL iff WAL archiving/streaming is
@@ -1076,7 +1078,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
-	ScanKey		indexScanKey = NULL;
 	SortSupport sortKeys;
 
 	if (merge)
@@ -1089,7 +1090,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		/* the preparation of merge */
 		itup = tuplesort_getindextuple(btspool->sortstate, true);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-		indexScanKey = _bt_mkscankey_nodata(wstate->index);
 
 		/* Prepare SortSupport data for each column */
 		sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
@@ -1097,7 +1097,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		for (i = 0; i < keysz; i++)
 		{
 			SortSupport sortKey = sortKeys + i;
-			ScanKey		scanKey = indexScanKey + i;
+			ScanKey		scanKey = wstate->inskey->scankeys + i;
 			int16		strategy;
 
 			sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1116,8 +1116,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
 		}
 
-		_bt_freeskey(indexScanKey);
-
 		for (;;)
 		{
 			load1 = true;		/* load BTSpool next ? */
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 2c05fb5e45..e010bcdcfa 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -56,34 +56,39 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().
+ *		Result is intended for use with _bt_compare().  Callers that don't
+ *		need to fill out the insertion scankey arguments (e.g. they use their
+ *		own ad-hoc comparison routine) can pass a NULL index tuple.
  */
-ScanKey
+BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
+	BTScanInsert key;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
-	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
+	int			tupnatts;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
-	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
+	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
 	 * (INCLUDE index) columns are always omitted from scan keys.
 	 */
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
+	key = palloc(offsetof(BTScanInsertData, scankeys) +
+				 sizeof(ScanKeyData) * indnkeyatts);
+	key->savebinsrch = key->restorebinsrch = false;
+	key->low = key->stricthigh = InvalidOffsetNumber;
+	key->nextkey = false;
+	key->keysz = Min(indnkeyatts, tupnatts);
+	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
 		FmgrInfo   *procinfo;
@@ -96,7 +101,19 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Key arguments built when caller provides no tuple are defensively
+		 * represented as NULL values, though they should still not
+		 * participate in comparisons.
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -108,64 +125,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 									   arg);
 	}
 
-	return skey;
-}
-
-/*
- * _bt_mkscankey_nodata
- *		Build an insertion scan key that contains 3-way comparator routines
- *		appropriate to the key datatypes, but no comparison data.  The
- *		comparison data ultimately used must match the key datatypes.
- *
- *		The result cannot be used with _bt_compare(), unless comparison
- *		data is first stored into the key entries.  Currently this
- *		routine is only called by nbtsort.c and tuplesort.c, which have
- *		their own comparison routines.
- */
-ScanKey
-_bt_mkscankey_nodata(Relation rel)
-{
-	ScanKey		skey;
-	int			indnkeyatts;
-	int16	   *indoption;
-	int			i;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	indoption = rel->rd_indoption;
-
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
-	for (i = 0; i < indnkeyatts; i++)
-	{
-		FmgrInfo   *procinfo;
-		int			flags;
-
-		/*
-		 * We can use the cached (default) support procs since no cross-type
-		 * comparison can be needed.
-		 */
-		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		flags = SK_ISNULL | (indoption[i] << SK_BT_INDOPTION_SHIFT);
-		ScanKeyEntryInitializeWithInfo(&skey[i],
-									   flags,
-									   (AttrNumber) (i + 1),
-									   InvalidStrategy,
-									   InvalidOid,
-									   rel->rd_indcollation[i],
-									   procinfo,
-									   (Datum) 0);
-	}
-
-	return skey;
-}
-
-/*
- * free a scan key made by either _bt_mkscankey or _bt_mkscankey_nodata.
- */
-void
-_bt_freeskey(ScanKey skey)
-{
-	pfree(skey);
+	return key;
 }
 
 /*
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7b10fd2974..f97a82ae7b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -884,7 +884,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -919,7 +919,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	if (state->indexInfo->ii_Expressions != NULL)
 	{
@@ -945,7 +945,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -964,7 +964,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -981,7 +981,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -1014,7 +1014,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->indexRel = indexRel;
 	state->enforceUnique = enforceUnique;
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	/* Prepare SortSupport data for each column */
 	state->sortKeys = (SortSupport) palloc0(state->nKeys *
@@ -1023,7 +1023,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1042,7 +1042,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 60622ea790..950a61958d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -319,6 +319,47 @@ typedef struct BTStackData
 
 typedef BTStackData *BTStack;
 
+/*
+ * BTScanInsert is the btree-private state needed to find an initial position
+ * for an indexscan, or to insert new tuples -- an "insertion scankey" (not to
+ * be confused with a search scankey).  It's used to descend a B-Tree using
+ * _bt_search.  For details on its mutable state, see _bt_binsrch and
+ * _bt_findinsertloc.
+ *
+ * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
+ * locate the first item >= scankey.  When nextkey is true, they will locate
+ * the first item > scan key.
+ *
+ * keysz is the number of insertion scankeys present.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared.
+ * During insertion, there must be a scan key for every attribute, but when
+ * starting a regular index scan some can be omitted.  The array is used as a
+ * flexible array member, though it's sized in a way that makes it possible to
+ * use stack allocations.  See nbtree/README for full details.
+ */
+
+typedef struct BTScanInsertData
+{
+	/*
+	 * Mutable state used by _bt_binsrch to inexpensively repeat a binary
+	 * search on the leaf level.  Only used for insertions where
+	 * _bt_check_unique is called.
+	 */
+	bool		savebinsrch;
+	bool		restorebinsrch;
+	OffsetNumber low;
+	OffsetNumber stricthigh;
+
+	/* State used to locate a position at the leaf level */
+	bool		nextkey;
+	int			keysz;			/* Size of scankeys */
+	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
+} BTScanInsertData;
+
+typedef BTScanInsertData *BTScanInsert;
+
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -558,16 +599,12 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
 /*
  * prototypes for functions in nbtsearch.c
  */
-extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
-extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
+		   int access, Snapshot snapshot);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -576,9 +613,7 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 /*
  * prototypes for functions in nbtutils.c
  */
-extern ScanKey _bt_mkscankey(Relation rel, IndexTuple itup);
-extern ScanKey _bt_mkscankey_nodata(Relation rel);
-extern void _bt_freeskey(ScanKey skey);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-- 
2.17.1

Attachment: v13-0002-Make-heap-TID-a-tie-breaker-nbtree-index-column.patch (text/x-patch; charset=US-ASCII)
From fe015798d426a1236801627fed0d68330ce1545b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v13 2/7] Make heap TID a tie-breaker nbtree index column.

Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique.  This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.
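
As a hypothetical illustration (the values are invented here, not taken
from the patch), consider a leaf page split in an index on (a, b):

    lastleft   = (5, 42)  with heap TID (0,7)
    firstright = (6,  1)  with heap TID (0,8)

    Attribute "a" alone distinguishes the two halves, so the new high key
    (later reused as the downlink for the new right page) can be truncated
    to (6, -inf), with no heap TID appended.  The heap TID only needs to
    be kept in the pivot when lastleft and firstright are equal on every
    user attribute.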

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tie-breaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing the newer/more strict invariants with
version 4 indexes.

A later patch will enhance the logic used by nbtree to pick a split
point.  This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at.  Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d item pointer in a new
high key during leaf page splits.  The user-facing definition of the
"1/3 of a page" restriction is already imprecise, and so does not need
to be revised.  However, there should be a compatibility note in the v12
release notes.  The new maximum allowed size is 2704 bytes on 64-bit
systems, down from 2712 bytes.
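
Back-of-envelope arithmetic for those figures, following the existing
BTMaxItemSize() definition and assuming the default 8KB block size with
8-byte MAXALIGN (only the 2712/2704 numbers come from the patch itself):

    old limit = MAXALIGN_DOWN((8192 - 40 - 16) / 3) = 2712
                (40 = MAXALIGN(SizeOfPageHeaderData + 3 * sizeof(ItemIdData)),
                 16 = MAXALIGN(sizeof(BTPageOpaqueData)))
    new limit = 2712 - MAXALIGN(sizeof(ItemPointerData)) = 2712 - 8 = 2704
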
---
 contrib/amcheck/expected/check_btree.out     |   5 +-
 contrib/amcheck/sql/check_btree.sql          |   5 +-
 contrib/amcheck/verify_nbtree.c              | 340 +++++++++++++--
 contrib/pageinspect/btreefuncs.c             |   2 +-
 contrib/pageinspect/expected/btree.out       |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out |  10 +-
 doc/src/sgml/indices.sgml                    |  24 +-
 src/backend/access/common/indextuple.c       |   6 +-
 src/backend/access/nbtree/README             | 160 ++++---
 src/backend/access/nbtree/nbtinsert.c        | 308 +++++++------
 src/backend/access/nbtree/nbtpage.c          | 196 ++++++---
 src/backend/access/nbtree/nbtree.c           |   2 +-
 src/backend/access/nbtree/nbtsearch.c        | 103 ++++-
 src/backend/access/nbtree/nbtsort.c          |  91 ++--
 src/backend/access/nbtree/nbtutils.c         | 433 +++++++++++++++++--
 src/backend/access/nbtree/nbtxlog.c          |  43 +-
 src/backend/access/rmgrdesc/nbtdesc.c        |   8 -
 src/backend/utils/sort/tuplesort.c           |  13 +-
 src/include/access/nbtree.h                  | 164 +++++--
 src/include/access/nbtxlog.h                 |  20 +-
 src/test/regress/expected/btree_index.out    |  34 +-
 src/test/regress/expected/create_index.out   |  13 +-
 src/test/regress/expected/dependency.out     |   4 +-
 src/test/regress/expected/event_trigger.out  |   4 +-
 src/test/regress/expected/foreign_data.out   |   8 +-
 src/test/regress/expected/rowsecurity.out    |   4 +-
 src/test/regress/sql/btree_index.sql         |  37 +-
 src/test/regress/sql/create_index.sql        |  14 +-
 28 files changed, 1540 insertions(+), 513 deletions(-)

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index ef5c9e1a1c..1e6079ddd2 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -130,9 +130,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true);
  bt_index_parent_check 
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index 0ad1631476..3f1e0d17ef 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -82,9 +82,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true);
 
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 053ac9d192..a7d060b3ec 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -45,6 +45,8 @@ PG_MODULE_MAGIC;
  * block per level, which is bound by the range of BlockNumber:
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
 
 /*
  * State associated with verifying a B-Tree index
@@ -66,6 +68,8 @@ typedef struct BtreeCheckState
 	/* B-Tree Index Relation and associated heap relation */
 	Relation	rel;
 	Relation	heaprel;
+	/* rel is heapkeyspace index? */
+	bool		heapkeyspace;
 	/* ShareLock held on heap/index, rather than AccessShareLock? */
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
@@ -122,7 +126,7 @@ static void bt_index_check_internal(Oid indrelid, bool parentcheck,
 						bool heapallindexed);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -137,17 +141,22 @@ static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 						   IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
+static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
 					 BTScanInsert key,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 BTScanInsert key,
-					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   BTScanInsert key,
-							   Page nontarget,
-							   OffsetNumber upperbound);
+static inline bool invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							 BTScanInsert key,
+							 Page nontarget,
+							 OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline BTScanInsert bt_mkscankey_minusinfkey(Relation rel,
+													IndexTuple itup);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+							IndexTuple itup, bool nonpivot);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -204,6 +213,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	Oid			heapid;
 	Relation	indrel;
 	Relation	heaprel;
+	bool		heapkeyspace;
 	LOCKMODE	lockmode;
 
 	if (parentcheck)
@@ -254,7 +264,9 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	btree_index_checkable(indrel);
 
 	/* Check index, possibly against table it is an index on */
-	bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
+	heapkeyspace = _bt_heapkeyspace(indrel);
+	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
+						 heapallindexed);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -324,8 +336,8 @@ btree_index_checkable(Relation rel)
  * parent/child check cannot be affected.)
  */
 static void
-bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
-					 bool heapallindexed)
+bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
+					 bool readonly, bool heapallindexed)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -346,6 +358,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
 	state = palloc0(sizeof(BtreeCheckState));
 	state->rel = rel;
 	state->heaprel = heaprel;
+	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
 
@@ -806,7 +819,8 @@ bt_target_page_check(BtreeCheckState *state)
 	 * doesn't contain a high key, so nothing to check
 	 */
 	if (!P_RIGHTMOST(topaque) &&
-		!_bt_check_natts(state->rel, state->target, P_HIKEY))
+		!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+						 P_HIKEY))
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
@@ -839,6 +853,7 @@ bt_target_page_check(BtreeCheckState *state)
 		IndexTuple	itup;
 		size_t		tupsize;
 		BTScanInsert skey;
+		bool		lowersizelimit;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -865,7 +880,8 @@ bt_target_page_check(BtreeCheckState *state)
 					 errhint("This could be a torn page problem.")));
 
 		/* Check the number of index tuple attributes */
-		if (!_bt_check_natts(state->rel, state->target, offset))
+		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+							 offset))
 		{
 			char	   *itid,
 					   *htid;
@@ -906,7 +922,56 @@ bt_target_page_check(BtreeCheckState *state)
 			continue;
 
 		/* Build insertion scankey for current page offset */
-		skey = _bt_mkscankey(state->rel, itup);
+		skey = bt_mkscankey_minusinfkey(state->rel, itup);
+
+		/*
+		 * Make sure tuple size does not exceed the relevant BTREE_VERSION
+		 * specific limit.
+		 *
+		 * BTREE_VERSION 4 (which introduced heapkeyspace rules) requisitioned
+		 * a small amount of space from BTMaxItemSize() in order to ensure
+		 * that suffix truncation always has enough space to add an explicit
+		 * heap TID back to a tuple -- we pessimistically assume that every
+		 * newly inserted tuple will eventually need to have a heap TID
+		 * appended during a future leaf page split, when the tuple becomes
+		 * the basis of the new high key (pivot tuple) for the leaf page.
+		 *
+		 * Since the reclaimed space is reserved for that purpose, we must not
+		 * enforce the slightly lower limit when the extra space has been used
+		 * as intended.  In other words, there is only a cross-version
+		 * difference in the limit on tuple size within leaf pages.
+		 *
+		 * Still, we're particular about the details within BTREE_VERSION 4
+		 * internal pages.  Pivot tuples may only use the extra space for their
+		 * designated purpose.  Enforce the lower limit for pivot tuples when
+		 * an explicit heap TID isn't actually present. (In all other cases
+		 * suffix truncation is guaranteed to generate a pivot tuple that's no
+		 * larger than the first right tuple provided to it by its caller.)
+		 */
+		lowersizelimit = skey->heapkeyspace &&
+			(P_ISLEAF(topaque) || BTreeTupleGetHeapTID(itup) == NULL);
+		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
+					   BTMaxItemSizeNoHeapTid(state->target)))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
+							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index row size %zu exceeds maximum for index \"%s\"",
+							tupsize, RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap" : "index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -940,9 +1005,35 @@ bt_target_page_check(BtreeCheckState *state)
 		 * grandparents (as well as great-grandparents, and so on).  We don't
 		 * go to those lengths because that would be prohibitively expensive,
 		 * and probably not markedly more effective in practice.
+		 *
+		 * On the leaf level, we check that the key is <= the highkey.
+		 * However, on non-leaf levels we check that the key is < the highkey,
+		 * because the high key is "just another separator" rather than a copy
+		 * of some existing key item; we expect it to be unique among all keys
+		 * on the same level.  (Suffix truncation will sometimes produce a
+		 * leaf highkey that is an untruncated copy of the lastleft item, but
+		 * never any other item, which necessitates weakening the leaf level
+		 * check to <=.)
+		 *
+		 * Full explanation for why a highkey is never truly a copy of another
+		 * item from the same level on internal levels:
+		 *
+		 * While the new left page's high key is copied from the first offset
+		 * on the right page during an internal page split, that's not the
+		 * full story.  In effect, internal pages are split in the middle of
+		 * the firstright tuple, not between the would-be lastleft and
+		 * firstright tuples: the firstright key ends up on the left side as
+		 * left's new highkey, and the firstright downlink ends up on the
+		 * right side as right's new "negative infinity" item.  The negative
+		 * infinity tuple is truncated to zero attributes, so we're only left
+		 * with the downlink.  In other words, the copying is just an
+		 * implementation detail of splitting in the middle of a (pivot)
+		 * tuple. (See also: "Notes About Data Representation" in the nbtree
+		 * README.)
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
+			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
 			char	   *itid,
 					   *htid;
@@ -968,11 +1059,10 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1035,7 +1125,7 @@ bt_target_page_check(BtreeCheckState *state)
 			rightkey = bt_right_page_check_scankey(state);
 
 			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+				!invariant_g_offset(state, rightkey, max))
 			{
 				/*
 				 * As explained at length in bt_right_page_check_scankey(),
@@ -1304,7 +1394,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * memory remaining allocated.
 	 */
 	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
-	return _bt_mkscankey(state->rel, firstitup);
+	return bt_mkscankey_minusinfkey(state->rel, firstitup);
 }
 
 /*
@@ -1367,7 +1457,8 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the
+	 * page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1416,14 +1507,29 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 	{
 		/*
 		 * Skip comparison of target page key against "negative infinity"
-		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * item, if any.  Checking it would indicate that it's not a strict
+		 * lower bound, but that's only because of the hard-coding for
+		 * negative infinity items within _bt_compare().
+		 *
+		 * If nbtree didn't truncate negative infinity tuples during internal
+		 * page splits then we'd expect child's negative infinity key to be
+		 * equal to the scankey/downlink from target/parent (it would be a
+		 * "low key" in this hypothetical scenario, and so it would still need
+		 * to be treated as a special case here).
+		 *
+		 * Negative infinity items can be thought of as a strict lower bound
+		 * that works transitively, with the last non-negative-infinity pivot
+		 * followed during a descent from the root as its "true" strict lower
+		 * bound.  Only a small number of negative infinity items are truly
+		 * negative infinity; those that are the first items of leftmost
+		 * internal pages.  In more general terms, a negative infinity item is
+		 * only negative infinity with respect to the subtree that the page is
+		 * at the root of.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
+		if (!invariant_l_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1855,6 +1961,66 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return invariant_leq_offset(state, key, upperbound);
+
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+
+	/*
+	 * _bt_compare() is capable of determining that a scankey with a
+	 * filled-out attribute is greater than pivot tuples where the comparison
+	 * is resolved at a truncated attribute (value of attribute in pivot is
+	 * minus infinity).  It is even capable of determining that a "minus
+	 * infinity value" from a "minusinfkey" scankey is equal to a pivot's
+	 * truncated attribute.  However, it is not capable of determining that a
+	 * scankey ("minusinfkey" or otherwise) is _less than_ a tuple on the
+	 * basis of a comparison resolved at _scankey_ minus infinity attribute.
+	 *
+	 * Somebody could teach _bt_compare() to handle this on its own, but that
+	 * would add overhead to index scans.  Complete an extra step to make it
+	 * work here instead.
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		nonpivot = P_ISLEAF(topaque) && upperbound >= P_FIRSTDATAKEY(topaque);
+
+		/* Get number of keys + heap TID for item to the right */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup, nonpivot);
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && rheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1874,42 +2040,84 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound)
 {
 	int32		cmp;
 
 	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
-	return cmp >= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp >= 0;
+
+	/*
+	 * No need to consider the possibility that scankey has attributes that we
+	 * need to force to be interpreted as negative infinity.  _bt_compare() is
+	 * able to determine that scankey is greater than negative infinity.  The
+	 * distinction between "==" and "<" isn't interesting here, since
+	 * corruption is indicated either way.
+	 */
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e.  the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
-							   Page nontarget, OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							 Page nontarget, OffsetNumber upperbound)
 {
 	int32		cmp;
 
 	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
-	return cmp <= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp <= 0;
+
+	/* See invariant_l_offset() for an explanation of this extra step */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+		BTPageOpaque copaque;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		nonpivot = P_ISLEAF(copaque) && upperbound >= P_FIRSTDATAKEY(copaque);
+
+		/* Get number of keys + heap TID for child/non-target item */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+		childheaptid = BTreeTupleGetHeapTIDCareful(state, child, nonpivot);
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && childheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -2065,3 +2273,61 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * _bt_mkscankey() wrapper that automatically sets insertion scankey to have
+ * minus infinity values for truncated attributes from itup (when itup is a
+ * pivot tuple with one or more truncated attributes).
+ *
+ * In a non-corrupt heapkeyspace index, all pivot tuples on a level have
+ * unique keys, so the !minusinfkey optimization correctly guides scans that
+ * aren't interested in relocating a leaf page using leaf page's high key
+ * (i.e. optimization can safely be used by the vast majority of all
+ * _bt_search() calls).  nbtree verification should always use "minusinfkey"
+ * semantics, though, because the !minusinfkey optimization might mask a
+ * problem in a corrupt index.
+ *
+ * For example, invariant_g_offset() might miss a cross-page invariant failure
+ * on an internal level if the scankey built from the first item on the
+ * target's right sibling page happened to be equal to (not greater than) the
+ * last item on target page.  The !minusinfkey tie-breaker might otherwise
+ * cause amcheck to conclude that the scankey is greater, missing index
+ * corruption.  It's unlikely that the same problem would not be caught some
+ * other way, but the !minusinfkey optimization has no upside for amcheck, so
+ * it seems sensible to always avoid it.
+ */
+static inline BTScanInsert
+bt_mkscankey_minusinfkey(Relation rel, IndexTuple itup)
+{
+	BTScanInsert skey;
+
+	skey = _bt_mkscankey(rel, itup);
+	skey->minusinfkey = true;
+
+	return skey;
+}
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
+ * be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	BlockNumber targetblock = state->targetblock;
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index bfa0c04c2f..8d27c9b0f6 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -561,7 +561,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	 * Get values of extended metadata if available, use default values
 	 * otherwise.
 	 */
-	if (metad->btm_version == BTREE_VERSION)
+	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b..07c2dcd771 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d4..9920dbfd40 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f427b312..21c978503a 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -504,8 +504,9 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
 
   <para>
    By default, B-tree indexes store their entries in ascending order
-   with nulls last.  This means that a forward scan of an index on
-   column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
+   with nulls last (table TID is treated as a tiebreaker column among
+   otherwise equal entries).  This means that a forward scan of an
+   index on column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
    (or more verbosely, <literal>ORDER BY x ASC NULLS LAST</literal>).  The
    index can also be scanned backward, producing output satisfying
    <literal>ORDER BY x DESC</literal>
@@ -1162,10 +1163,21 @@ CREATE INDEX tab_x_y ON tab(x, y);
    the extra columns are trailing columns; making them be leading columns is
    unwise for the reasons explained in <xref linkend="indexes-multicolumn"/>.
    However, this method doesn't support the case where you want the index to
-   enforce uniqueness on the key column(s).  Also, explicitly marking
-   non-searchable columns as <literal>INCLUDE</literal> columns makes the
-   index slightly smaller, because such columns need not be stored in upper
-   B-tree levels.
+   enforce uniqueness on the key column(s).
+  </para>
+
+  <para>
+   <firstterm>Suffix truncation</firstterm> always removes non-key
+   columns from upper B-Tree levels.  As payload columns, they are
+   never used to guide index scans.  The truncation process also
+   removes one or more trailing key column(s) when the remaining
+   prefix of key column(s) happens to be sufficient to describe tuples
+   on the lowest B-Tree level.  In practice, covering indexes without
+   an <literal>INCLUDE</literal> clause often avoid storing columns
+   that are effectively payload in the upper levels.  However,
+   explicitly defining payload columns as non-key columns
+   <emphasis>reliably</emphasis> keeps the tuples in upper levels
+   small.
   </para>
 
   <para>
diff --git a/src/backend/access/common/indextuple.c b/src/backend/access/common/indextuple.c
index 6a22b17203..53e43ce80e 100644
--- a/src/backend/access/common/indextuple.c
+++ b/src/backend/access/common/indextuple.c
@@ -475,7 +475,11 @@ index_truncate_tuple(TupleDesc sourceDescriptor, IndexTuple source,
 	bool		isnull[INDEX_MAX_KEYS];
 	IndexTuple	truncated;
 
-	Assert(leavenatts < sourceDescriptor->natts);
+	Assert(leavenatts <= sourceDescriptor->natts);
+
+	/* Easy case: no truncation actually required */
+	if (leavenatts == sourceDescriptor->natts)
+		return CopyIndexTuple(source);
 
 	/* Create temporary descriptor to scribble on */
 	truncdesc = palloc(TupleDescSize(sourceDescriptor));
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89..be9bf61d47 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -28,37 +28,50 @@ right-link to find the new page containing the key range you're looking
 for.  This might need to be repeated, if the page has been split more than
 once.
 
+Lehman and Yao talk about pairs of "separator" keys and downlinks in
+internal pages rather than tuples or records.  We use the term "pivot"
+tuple to refer to tuples which don't point to heap tuples, that are used
+only for tree navigation.  All tuples on non-leaf pages and high keys on
+leaf pages are pivot tuples.  Since pivot tuples are only used to represent
+which part of the key space belongs on each page, they can have attribute
+values copied from non-pivot tuples that were deleted and killed by VACUUM
+some time ago.  A pivot tuple may contain a "separator" key and downlink,
+just a separator key (i.e. the downlink value is implicitly undefined), or
+just a downlink (i.e. all attributes are truncated away).  We aren't always
+clear on which case applies, but it should be obvious from context.
+
+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute.  Logical duplicates are sorted in heap TID
+order.  This is necessary because Lehman and Yao also require that the key
+range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are
+the adjacent keys in the parent page (Ki must be _strictly_ less than v,
+which can be assured by having reliably unique keys).
+
+A search where the key is equal to a pivot tuple in an upper tree level
+must descend to the left of that pivot to ensure it finds any equal keys.
+The equal item(s) being searched for must therefore be to the left of that
+downlink page on the next level down.  A handy property of this design is
+that a scan where all attributes/keys are used behaves just the same as a
+scan where only some prefix of attributes are used; equality never needs to
+be treated as a special case.
+
+In practice, exact equality with pivot tuples on internal pages is
+extremely rare when all attributes (including even the heap TID attribute)
+are used in a search.  This is due to suffix truncation: truncated
+attributes are treated as having the value negative infinity, and
+truncation almost always manages to at least truncate away the trailing
+heap TID attribute.  While Lehman and Yao don't have anything to say about
+suffix truncation, the design used by nbtree is perfectly complementary.
+The later section on suffix truncation will be helpful if it's unclear how
+the Lehman & Yao invariants work with a real world example involving
+suffix truncation.
+
 Differences to the Lehman & Yao algorithm
 -----------------------------------------
 
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
-
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
-
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
 among backends.  As a result, we do page-level read locking on btree
@@ -194,9 +207,7 @@ be prepared for the possibility that the item it wants is to the left of
 the recorded position (but it can't have moved left out of the recorded
 page).  Since we hold a lock on the lower page (per L&Y) until we have
 re-found the parent item that links to it, we can be assured that the
-parent item does still exist and can't have been deleted.  Also, because
-we are matching downlink page numbers and not data keys, we don't have any
-problem with possibly misidentifying the parent item.
+parent item does still exist and can't have been deleted.
 
 Page Deletion
 -------------
@@ -595,36 +606,56 @@ scankey point to comparison functions that return boolean, such as int4lt.
 There might be more than one scankey entry for a given index column, or
 none at all.  (We require the keys to appear in index column order, but
 the order of multiple keys for a given column is unspecified.)  An
-insertion scankey uses the same array-of-ScanKey data structure, but the
+insertion scankey uses a similar array-of-ScanKey data structure, but the
 sk_func pointers point to btree comparison support functions (ie, 3-way
 comparators that return int4 values interpreted as <0, =0, >0).  In an
-insertion scankey there is exactly one entry per index column.  Insertion
-scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
-used to locate the starting point of a scan, as well as for locating the
-place to insert a new index tuple.  (Note: in the case of an insertion
-scankey built from a search scankey, there might be fewer keys than
-index columns, indicating that we have no constraints for the remaining
-index columns.)  After we have located the starting point of a scan, the
-original search scankey is consulted as each index entry is sequentially
-scanned to decide whether to return the entry and whether the scan can
-stop (see _bt_checkkeys()).
+insertion scankey there is at most one entry per index column.  There is
+also other data about the rules used to locate where to begin the scan,
+such as whether or not the scan is a "nextkey" scan.  Insertion scankeys
+are built within the btree code (eg, by _bt_mkscankey()) and are used to
+locate the starting point of a scan, as well as for locating the place to
+insert a new index tuple.  (Note: in the case of an insertion scankey built
+from a search scankey or built from a truncated pivot tuple, there might be
+fewer keys than index columns, indicating that we have no constraints for
+the remaining index columns.) After we have located the starting point of a
+scan, the original search scankey is consulted as each index entry is
+sequentially scanned to decide whether to return the entry and whether the
+scan can stop (see _bt_checkkeys()).
 
-We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+Notes about suffix truncation
+-----------------------------
+
+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split.  The remaining attributes must distinguish
+the last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page.  A truncated tuple logically retains all key attributes,
+though they implicitly have "negative infinity" as their value, and have no
+storage overhead.  Since the high key is subsequently reused as the
+downlink in the parent page for the new right page, suffix truncation makes
+pivot tuples short.  INCLUDE indexes are guaranteed to have non-key
+attributes truncated at the time of a leaf page split, but may also have
+some key attributes truncated away, based on the usual criteria for key
+attributes.  They are not a special case, since non-key attributes are
+merely payload to B-Tree searches.
+
+The goal of suffix truncation of key attributes is to improve index
+fan-out.  The technique was first described by Bayer and Unterauer (R.Bayer
+and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
+2, No. 1, March 1977, pp 11-26).  The Postgres implementation is loosely
+based on their paper.  Note that Postgres only implements what the paper
+refers to as simple prefix B-Trees.  Note also that the paper assumes that
+the tree has keys that consist of single strings that maintain the "prefix
+property", much like strings that are stored in a suffix tree (comparisons
+of earlier bytes must always be more significant than comparisons of later
+bytes, and, in general, the strings must compare in a way that doesn't
+break transitive consistency as they're split into pieces).  Suffix
+truncation in Postgres currently only works at the whole-attribute
+granularity, but it would be straightforward to invent opclass
+infrastructure that manufactures a smaller attribute value in the case of
+variable-length types, such as text.  An opclass support function could
+manufacture the shortest possible key value that still correctly separates
+each half of a leaf page split.
 
 Notes About Data Representation
 -------------------------------
@@ -637,20 +668,26 @@ don't need to renumber any existing pages when splitting the root.)
 
 The Postgres disk block data format (an array of items) doesn't fit
 Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
-so we have to play some games.
+so we have to play some games.  (Presumably things are explained this
+way because of internal page splits, which conceptually split at the
+middle of an existing pivot tuple -- the tuple's "separator" key goes on
+the left side of the split as the left side's new high key, while the
+tuple's pointer/downlink goes on the right side as the first/minus
+infinity downlink.)
 
 On a page that is not rightmost in its tree level, the "high key" is
 kept in the page's first item, and real data items start at item 2.
 The link portion of the "high key" item goes unused.  A page that is
-rightmost has no "high key", so data items start with the first item.
-Putting the high key at the left, rather than the right, may seem odd,
-but it avoids moving the high key as we add data items.
+rightmost has no "high key" (it's implicitly positive infinity), so
+data items start with the first item.  Putting the high key at the
+left, rather than the right, may seem odd, but it avoids moving the
+high key as we add data items.
 
 On a leaf page, the data items are simply links to (TIDs of) tuples
 in the relation being indexed, with the associated key values.
 
 On a non-leaf page, the data items are down-links to child pages with
-bounding keys.  The key in each data item is the *lower* bound for
+bounding keys.  The key in each data item is a strict lower bound for
 keys on that child page, so logically the key is to the left of that
 downlink.  The high key (if present) is the upper bound for the last
 downlink.  The first data item on each such page has no lower bound
@@ -658,4 +695,5 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 3c968bad89..6f1d179c67 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -64,14 +64,16 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
 				  Relation heapRel);
 static bool _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
 					 bool *restorebinsrch, Size itemsz);
-static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
+static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
+			   Buffer buf,
+			   Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
 			   bool split_only_page);
-static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
-		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
-		  IndexTuple newitem, bool newitemonleft);
+static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
+		  Buffer cbuf, OffsetNumber firstright, OffsetNumber newitemoff,
+		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
@@ -120,6 +122,9 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_key = _bt_mkscankey(rel, itup);
+	/* No scantid until uniqueness established in checkingunique case */
+	if (checkingunique && itup_key->heapkeyspace)
+		itup_key->scantid = NULL;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -225,12 +230,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tie-breaker attribute.  Any other would-be inserter
+	 * of the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -267,6 +273,10 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/* Uniqueness is established -- restore heap tid as scantid */
+		if (itup_key->heapkeyspace)
+			itup_key->scantid = &itup->t_tid;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
@@ -275,12 +285,12 @@ top:
 
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
-		 * an index tuple insert conflicts with an existing lock.  Since the
-		 * actual location of the insert is hard to predict because of the
-		 * random search used to prevent O(N^2) performance when there are
-		 * many duplicate entries, we can just use the "first valid" page.
-		 * This reasoning also applies to INCLUDE indexes, whose extra
-		 * attributes are not considered part of the key space.
+		 * an index tuple insert conflicts with an existing lock.  The actual
+		 * location of the insert is unsettled in the checkingunique case
+		 * because scantid was not filled in initially, but it's okay to use
+		 * the "first valid" page instead.  This reasoning also applies to
+		 * INCLUDE indexes, whose extra attributes are not considered part of
+		 * the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
 
@@ -291,8 +301,8 @@ top:
 		 */
 		newitemoff = _bt_findinsertloc(rel, itup_key, &buf, checkingunique,
 									   itup, stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, newitemoff,
-					   false);
+		_bt_insertonpg(rel, itup_key, buf, InvalidBuffer, stack, itup,
+					   newitemoff, false);
 	}
 	else
 	{
@@ -313,7 +323,8 @@ top:
  *
  * Sets state in itup_key sufficient for later _bt_findinsertloc() call to
  * reuse most of the work of our initial binary search to find conflicting
- * tuples.
+ * tuples.  This saved state won't be usable if the caller's tuple turns out
+ * not to belong on buf once scantid has been filled in.
  *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
@@ -362,6 +373,7 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
 	Assert(itup_key->low == offset);
+	Assert(itup_key->scantid == NULL);
 	for (;;)
 	{
 		ItemId		curitemid;
@@ -399,16 +411,14 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 			/*
 			 * We can skip items that are marked killed.
 			 *
-			 * Formerly, we applied _bt_isequal() before checking the kill
-			 * flag, so as to fall out of the item loop as soon as possible.
-			 * However, in the presence of heavy update activity an index may
-			 * contain many killed items with the same key; running
-			 * _bt_isequal() on each killed item gets expensive. Furthermore
-			 * it is likely that the non-killed version of each key appears
-			 * first, so that we didn't actually get to exit any sooner
-			 * anyway. So now we just advance over killed items as quickly as
-			 * we can. We only apply _bt_isequal() when we get to a non-killed
-			 * item.
+			 * In the presence of heavy update activity an index may contain
+			 * many killed items with the same key; running _bt_isequal() on
+			 * each killed item gets expensive. Just advance over killed items
+			 * as quickly as we can. We only apply _bt_isequal() when we get
+			 * to a non-killed item. Even those comparisons could be avoided
+			 * (in the common case where there is only one page to visit) by
+			 * reusing bounds, but just skipping dead items is sufficiently
+			 * effective.
 			 */
 			if (!ItemIdIsDead(curitemid))
 			{
@@ -633,16 +643,16 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 /*
  *	_bt_findinsertloc() -- Finds an insert location for a tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
+ *		On entry, *bufptr contains the page that the new tuple unambiguously
+ *		belongs on.  This may not be quite right for callers that just called
+ *		_bt_check_unique(), though, since they won't have initially searched
+ *		using a scantid.  In the rare case where a unique index has many
+ *		physical duplicates of the value spanning multiple leaf pages,
+ *		scantid may direct us to a page of duplicates somewhere to the
+ *		right, where the new tuple must actually go.  (Actually, since
+ *		!heapkeyspace pg_upgrade'd non-unique indexes never get a scantid,
+ *		they too may require that we move right.  We treat them somewhat
+ *		like unique indexes.)
  *
  *		_bt_check_unique() callers arrange for their insertion scan key to
  *		save the progress of the last binary search performed.  No additional
@@ -685,28 +695,26 @@ _bt_findinsertloc(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itemsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
-	 */
-	if (itemsz > BTMaxItemSize(page))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itemsz, BTMaxItemSize(page),
-						RelationGetRelationName(rel)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(heapRel,
-									RelationGetRelationName(rel))));
+	/* Check 1/3 of a page restriction */
+	if (unlikely(itemsz > BTMaxItemSize(page)))
+		_bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+							 newtup);
 
+	/*
+	 * We may have to walk right through leaf pages to find the one leaf page
+	 * that we must insert on to, though only when inserting into unique
+	 * indexes.  This is necessary because a scantid is not used by the
+	 * insertion scan key initially in the case of unique indexes -- a scantid
+	 * is only set after the absence of duplicates (whose heap tuples are not
+	 * dead or recently dead) has been established by _bt_check_unique().
+	 * Non-unique index insertions will break out of the loop immediately.
+	 *
+	 * (Actually, non-unique indexes may still need to grovel through leaf
+	 * pages full of duplicates with a pg_upgrade'd !heapkeyspace index.)
+	 */
 	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+	Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
 	for (;;)
 	{
 		int			cmpval;
@@ -714,6 +722,13 @@ _bt_findinsertloc(Relation rel,
 		BlockNumber rblkno;
 
 		/*
+		 * Fastpaths that avoid an extra high key check.
+		 *
+		 * No need to check high key when inserting into a non-unique index;
+		 * _bt_search() already checked this when it checked if a move to the
+		 * right was required for leaf page.  Insertion scankey's scantid
+		 * would have been filled out at the time.
+		 *
 		 * The checkingunique (restorebinsrch) case may well have established
 		 * bounds within _bt_check_unique()'s binary search that preclude the
 		 * need for a further high key check.  This fastpath isn't used when
@@ -721,22 +736,33 @@ _bt_findinsertloc(Relation rel,
 		 * when it looks like the new item belongs last on the page, but it
 		 * might go on a later page instead.
 		 */
-		if (restorebinsrch && itup_key->low <= itup_key->stricthigh &&
-			itup_key->stricthigh <= PageGetMaxOffsetNumber(page))
+		if (!checkingunique && itup_key->heapkeyspace)
+			break;
+		else if (restorebinsrch && itup_key->low <= itup_key->stricthigh &&
+				 itup_key->stricthigh <= PageGetMaxOffsetNumber(page))
 			break;
 
 		if (P_RIGHTMOST(lpageop))
 			break;
 		cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
-
-		/*
-		 * May have to handle case where there is a choice of which page to
-		 * place new tuple on, and we must balance space utilization as best
-		 * we can.
-		 */
-		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
-												&restorebinsrch, itemsz))
-			break;
+		if (itup_key->heapkeyspace)
+		{
+			if (cmpval <= 0)
+				break;
+		}
+		else
+		{
+			/*
+			 * pg_upgrade'd !heapkeyspace index.
+			 *
+			 * May have to handle legacy case where there is a choice of which
+			 * page to place new tuple on, and we must balance space
+			 * utilization as best we can.
+			 */
+			if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
+													&restorebinsrch, itemsz))
+				break;
+		}
 
 		/*
 		 * step right to next non-dead page
@@ -745,6 +771,8 @@ _bt_findinsertloc(Relation rel,
 		 * page; else someone else's _bt_check_unique scan could fail to see
 		 * our insertion.  write locks on intermediate dead pages won't do
 		 * because we don't know when they will get de-linked from the tree.
+		 * (This is more aggressive than it needs to be for non-unique
+		 * !heapkeyspace indexes.)
 		 */
 		rbuf = InvalidBuffer;
 
@@ -759,7 +787,10 @@ _bt_findinsertloc(Relation rel,
 			 * If this page was incompletely split, finish the split now. We
 			 * do this while holding a lock on the left sibling, which is not
 			 * good because finishing the split could be a fairly lengthy
-			 * operation.  But this should happen very seldom.
+			 * operation.  But this should only happen when inserting into a
+			 * unique index with more than an entire page's worth of duplicates
+			 * of the value being inserted.  (!heapkeyspace non-unique indexes
+			 * are an exception, once again.)
 			 */
 			if (P_INCOMPLETE_SPLIT(lpageop))
 			{
@@ -814,6 +845,11 @@ _bt_findinsertloc(Relation rel,
 /*
  *	_bt_useduplicatepage() -- Settle for this page of duplicates?
  *
+ *		Prior to PostgreSQL 12/Btree version 4, heap TID was never treated
+ *		as a part of the keyspace.  If there were many tuples of the same
+ *		value spanning more than one leaf page, a new tuple of that same
+ *		value could legally be placed on any one of the pages.
+ *
  *		This function handles the question of whether or not an insertion
  *		of a duplicate into a pg_upgrade'd !heapkeyspace index should
  *		insert on the page contained in buf when a choice must be made.
@@ -904,6 +940,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
  */
 static void
 _bt_insertonpg(Relation rel,
+			   BTScanInsert itup_key,
 			   Buffer buf,
 			   Buffer cbuf,
 			   BTStack stack,
@@ -926,7 +963,7 @@ _bt_insertonpg(Relation rel,
 		   BTreeTupleGetNAtts(itup, rel) ==
 		   IndexRelationGetNumberOfAttributes(rel));
 	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
+		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/* The caller should've finished any incomplete splits already. */
@@ -976,8 +1013,8 @@ _bt_insertonpg(Relation rel,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, buf, cbuf, firstright,
-						 newitemoff, itemsz, itup, newitemonleft);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, firstright, newitemoff,
+						 itemsz, itup, newitemonleft);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1059,7 +1096,7 @@ _bt_insertonpg(Relation rel,
 		if (BufferIsValid(metabuf))
 		{
 			/* upgrade meta-page if needed */
-			if (metad->btm_version < BTREE_VERSION)
+			if (metad->btm_version < BTREE_NOVAC_VERSION)
 				_bt_upgrademetapage(metapg);
 			metad->btm_fastroot = itup_blkno;
 			metad->btm_fastlevel = lpageop->btpo.level;
@@ -1114,6 +1151,8 @@ _bt_insertonpg(Relation rel,
 
 			if (BufferIsValid(metabuf))
 			{
+				Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+				xlmeta.version = metad->btm_version;
 				xlmeta.root = metad->btm_root;
 				xlmeta.level = metad->btm_level;
 				xlmeta.fastroot = metad->btm_fastroot;
@@ -1181,17 +1220,19 @@ _bt_insertonpg(Relation rel,
  *		new right page.  newitemoff etc. tell us about the new item that
  *		must be inserted along with the data from the old page.
  *
- *		When splitting a non-leaf page, 'cbuf' is the left-sibling of the
- *		page we're inserting the downlink for.  This function will clear the
- *		INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *		itup_key is used for suffix truncation on leaf pages (internal
+ *		page callers pass NULL).  When splitting a non-leaf page, 'cbuf'
+ *		is the left-sibling of the page we're inserting the downlink for.
+ *		This function will clear the INCOMPLETE_SPLIT flag on it, and
+ *		release the buffer.
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
-_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
-		  bool newitemonleft)
+_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
+		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
+		  IndexTuple newitem, bool newitemonleft)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1286,7 +1327,8 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1300,8 +1342,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 
 	/*
 	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
-	 * item at position firstright, or the incoming tuple.
+	 * to go into the new right page, or possibly a truncated version if this
+	 * is a leaf page split.  This might be either the existing data item at
+	 * position firstright, or the incoming tuple.
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1319,25 +1362,58 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
-	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * Truncate nondistinguishing key attributes of the high key item before
+	 * inserting it on the left page.  This can only happen at the leaf level,
+	 * since in general all pivot tuple values originate from leaf level high
+	 * keys.  This isn't just about avoiding unnecessary work, though;
+	 * truncating unneeded key suffix attributes can only be performed at the
+	 * leaf level anyway.  This is because a pivot tuple in a grandparent page
+	 * must guide a search not only to the correct parent page, but also to
+	 * the correct leaf page.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (isleaf && (itup_key->heapkeyspace || indnatts != indnkeyatts))
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		IndexTuple	lastleft;
+
+		/*
+		 * Determine which tuple will become the last on the left page.  The
+		 * last left tuple and the first right tuple enclose the split point,
+		 * and are needed to determine how far truncation can go while still
+		 * leaving us with a high key that distinguishes the left side from
+		 * the right side.
+		 */
+		if (newitemonleft && newitemoff == firstright)
+		{
+			/* incoming tuple will become last on left page */
+			lastleft = newitem;
+		}
+		else
+		{
+			OffsetNumber lastleftoff;
+
+			/* item just before firstright will become last on left page */
+			lastleftoff = OffsetNumberPrev(firstright);
+			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		/*
+		 * Truncate first item on the right side to create a new high key for
+		 * the left side.  The high key must be strictly less than all tuples
+		 * on the right side of the split, but can be equal to the last item
+		 * on the left side of the split within leaf pages.
+		 */
+		Assert(lastleft != item);
+		lefthikey = _bt_truncate(rel, lastleft, item, itup_key);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1530,7 +1606,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1559,22 +1634,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* We must also log the left page's (possibly truncated) high key */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1592,9 +1655,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -1957,7 +2018,7 @@ _bt_insert_parent(Relation rel,
 			_bt_relbuf(rel, pbuf);
 		}
 
-		/* get high key from left page == lower bound for new right page */
+		/* get high key from left, a strict lower bound for new right page */
 		ritem = (IndexTuple) PageGetItem(page,
 										 PageGetItemId(page, P_HIKEY));
 
@@ -1987,7 +2048,7 @@ _bt_insert_parent(Relation rel,
 				 RelationGetRelationName(rel), bknum, rbknum);
 
 		/* Recursively update the parent */
-		_bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
+		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
 					   new_item, stack->bts_offset + 1,
 					   is_only);
 
@@ -2248,7 +2309,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	START_CRIT_SECTION();
 
 	/* upgrade metapage if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* set btree special data */
@@ -2283,7 +2344,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2316,6 +2378,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
 		XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = rootblknum;
 		md.level = metad->btm_level;
 		md.fastroot = rootblknum;
@@ -2380,6 +2444,7 @@ _bt_pgaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -2395,8 +2460,8 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
@@ -2409,6 +2474,7 @@ _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
 	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
 	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(itup_key->scantid == NULL);
 
 	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
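
As a companion to the _bt_split()/_bt_truncate() logic above, here is a sketch (same simplified integer-attribute assumption as the earlier sketch, not the patch's actual code) of how many leading attributes a new high key needs to keep: everything up to and including the first attribute where lastleft and firstright differ, with natts + 1 meaning the heap TID tie-breaker must be appended.

/*
 * Sketch: how many leading attributes must the new high key keep so that it
 * is still >= lastleft and < firstright?  Keep attributes up to and
 * including the first one that differs; a result of natts + 1 means every
 * user attribute is equal and a heap TID tie-breaker must be appended.
 */
static int
keep_natts_sketch(const long *lastleft, const long *firstright, int natts)
{
	int			keep = 1;

	for (int i = 0; i < natts; i++, keep++)
	{
		if (lastleft[i] != firstright[i])
			break;
	}

	return keep;
}
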
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 56041c3d38..72af1ef3c1 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -34,6 +34,7 @@
 #include "utils/snapmgr.h"
 
 static void _bt_cachemetadata(Relation rel, BTMetaPageData *metad);
+static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 						 bool *rightsib_empty);
@@ -77,7 +78,9 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 }
 
 /*
- *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to the new.
+ *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to version
+ *		3, the last version that can be updated without broadly affecting
+ *		on-disk compatibility.  (A REINDEX is required to upgrade to v4.)
  *
  *		This routine does purely in-memory image upgrade.  Caller is
  *		responsible for locking, WAL-logging etc.
@@ -93,11 +96,11 @@ _bt_upgrademetapage(Page page)
 
 	/* It must be really a meta page of upgradable version */
 	Assert(metaopaque->btpo_flags & BTP_META);
-	Assert(metad->btm_version < BTREE_VERSION);
+	Assert(metad->btm_version < BTREE_NOVAC_VERSION);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 
 	/* Set version number and fill extra fields added into version 3 */
-	metad->btm_version = BTREE_VERSION;
+	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 
@@ -107,43 +110,79 @@ _bt_upgrademetapage(Page page)
 }
 
 /*
- * Cache metadata from meta page to rel->rd_amcache.
+ * Cache metadata from input meta page to rel->rd_amcache.
  */
 static void
-_bt_cachemetadata(Relation rel, BTMetaPageData *metad)
+_bt_cachemetadata(Relation rel, BTMetaPageData *input)
 {
+	BTMetaPageData *cached_metad;
+
 	/* We assume rel->rd_amcache was already freed by caller */
 	Assert(rel->rd_amcache == NULL);
 	rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
 										 sizeof(BTMetaPageData));
 
-	/*
-	 * Meta page should be of supported version (should be already checked by
-	 * caller).
-	 */
-	Assert(metad->btm_version >= BTREE_MIN_VERSION &&
-		   metad->btm_version <= BTREE_VERSION);
+	/* Meta page should be of supported version */
+	Assert(input->btm_version >= BTREE_MIN_VERSION &&
+		   input->btm_version <= BTREE_VERSION);
 
-	if (metad->btm_version == BTREE_VERSION)
+	cached_metad = (BTMetaPageData *) rel->rd_amcache;
+	if (input->btm_version >= BTREE_NOVAC_VERSION)
 	{
-		/* Last version of meta-data, no need to upgrade */
-		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		/* Version with compatible meta-data, no need to upgrade */
+		memcpy(cached_metad, input, sizeof(BTMetaPageData));
 	}
 	else
 	{
-		BTMetaPageData *cached_metad = (BTMetaPageData *) rel->rd_amcache;
-
 		/*
 		 * Upgrade meta-data: copy available information from meta-page and
 		 * fill new fields with default values.
+		 *
+		 * Note that we cannot upgrade to version 4+ without a REINDEX, since
+		 * extensive on-disk changes are required.
 		 */
-		memcpy(rel->rd_amcache, metad, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
-		cached_metad->btm_version = BTREE_VERSION;
+		memcpy(cached_metad, input, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
+		cached_metad->btm_version = BTREE_NOVAC_VERSION;
 		cached_metad->btm_oldest_btpo_xact = InvalidTransactionId;
 		cached_metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	}
 }
 
+/*
+ * Get metadata from share-locked buffer containing metapage, while performing
+ * standard sanity checks.  Sanity checks here must match _bt_getroot().
+ */
+static BTMetaPageData *
+_bt_getmeta(Relation rel, Buffer metabuf)
+{
+	Page		metapg;
+	BTPageOpaque metaopaque;
+	BTMetaPageData *metad;
+
+	metapg = BufferGetPage(metabuf);
+	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
+	metad = BTPageGetMeta(metapg);
+
+	/* sanity-check the metapage */
+	if (!P_ISMETA(metaopaque) ||
+		metad->btm_magic != BTREE_MAGIC)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("index \"%s\" is not a btree",
+						RelationGetRelationName(rel))));
+
+	if (metad->btm_version < BTREE_MIN_VERSION ||
+		metad->btm_version > BTREE_VERSION)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("version mismatch in index \"%s\": file version %d, "
+						"current version %d, minimal supported version %d",
+						RelationGetRelationName(rel),
+						metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+
+	return metad;
+}
+
 /*
  *	_bt_update_meta_cleanup_info() -- Update cleanup-related information in
  *									  the metapage.
@@ -167,7 +206,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	metad = BTPageGetMeta(metapg);
 
 	/* outdated version of metapage always needs rewrite */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		needsRewrite = true;
 	else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
 			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
@@ -186,7 +225,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* update cleanup-related information */
@@ -202,6 +241,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = metad->btm_root;
 		md.level = metad->btm_level;
 		md.fastroot = metad->btm_fastroot;
@@ -376,7 +417,7 @@ _bt_getroot(Relation rel, int access)
 		START_CRIT_SECTION();
 
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 
 		metad->btm_root = rootblkno;
@@ -400,6 +441,8 @@ _bt_getroot(Relation rel, int access)
 			XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
 			XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			md.version = metad->btm_version;
 			md.root = rootblkno;
 			md.level = 0;
 			md.fastroot = rootblkno;
@@ -595,37 +638,12 @@ _bt_getrootheight(Relation rel)
 {
 	BTMetaPageData *metad;
 
-	/*
-	 * We can get what we need from the cached metapage data.  If it's not
-	 * cached yet, load it.  Sanity checks here must match _bt_getroot().
-	 */
 	if (rel->rd_amcache == NULL)
 	{
 		Buffer		metabuf;
-		Page		metapg;
-		BTPageOpaque metaopaque;
 
 		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
-		metapg = BufferGetPage(metabuf);
-		metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-		metad = BTPageGetMeta(metapg);
-
-		/* sanity-check the metapage */
-		if (!P_ISMETA(metaopaque) ||
-			metad->btm_magic != BTREE_MAGIC)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("index \"%s\" is not a btree",
-							RelationGetRelationName(rel))));
-
-		if (metad->btm_version < BTREE_MIN_VERSION ||
-			metad->btm_version > BTREE_VERSION)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("version mismatch in index \"%s\": file version %d, "
-							"current version %d, minimal supported version %d",
-							RelationGetRelationName(rel),
-							metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+		metad = _bt_getmeta(rel, metabuf);
 
 		/*
 		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
@@ -642,19 +660,70 @@ _bt_getrootheight(Relation rel)
 		 * Cache the metapage data for next time
 		 */
 		_bt_cachemetadata(rel, metad);
-
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
 		_bt_relbuf(rel, metabuf);
 	}
 
+	/* Get cached page */
 	metad = (BTMetaPageData *) rel->rd_amcache;
-	/* We shouldn't have cached it if any of these fail */
-	Assert(metad->btm_magic == BTREE_MAGIC);
-	Assert(metad->btm_version == BTREE_VERSION);
-	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
+/*
+ *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *
+ *		This is used to determine the rules that must be used to descend a
+ *		btree.  Version 4 indexes treat heap TID as a tie-breaker attribute.
+ *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
+ *		performance when inserting a new BTScanInsert-wise duplicate tuple
+ *		among many leaf pages already full of such duplicates.
+ */
+bool
+_bt_heapkeyspace(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			uint32		btm_version = metad->btm_version;
+
+			_bt_relbuf(rel, metabuf);
+			return btm_version > BTREE_NOVAC_VERSION;
+		}
+
+		/*
+		 * Cache the metapage data for next time
+		 */
+		_bt_cachemetadata(rel, metad);
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached page */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+
+	return metad->btm_version > BTREE_NOVAC_VERSION;
+}
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -1123,11 +1192,12 @@ _bt_is_page_halfdead(Relation rel, BlockNumber blk)
  * right sibling.
  *
  * "child" is the leaf page we wish to delete, and "stack" is a search stack
- * leading to it (approximately).  Note that we will update the stack
- * entry(s) to reflect current downlink positions --- this is essentially the
- * same as the corresponding step of splitting, and is not expected to affect
- * caller.  The caller should initialize *target and *rightsib to the leaf
- * page and its right sibling.
+ * leading to it (in !heapkeyspace indexes it actually leads to the leftmost
+ * leaf page whose high key matches that of the page to be deleted).  Note
+ * that we will update the stack entry(s) to reflect current downlink
+ * positions --- this is essentially the same as the corresponding step of
+ * splitting, and is not expected to affect caller.  The caller should
+ * initialize *target and *rightsib to the leaf page and its right sibling.
  *
  * Note: it's OK to release page locks on any internal pages between the leaf
  * and *topparent, because a safe deletion can't become unsafe due to
@@ -1149,8 +1219,10 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
 	BlockNumber leftsib;
 
 	/*
-	 * Locate the downlink of "child" in the parent (updating the stack entry
-	 * if needed)
+	 * Locate the downlink of "child" in the parent, updating the stack entry
+	 * if needed.  This is how !heapkeyspace indexes deal with having
+	 * non-unique high keys in leaf level pages.  Even heapkeyspace indexes
+	 * can have a stale stack due to insertions into the parent.
 	 */
 	stack->bts_btentry = child;
 	pbuf = _bt_getstackbuf(rel, stack);
@@ -1422,6 +1494,8 @@ _bt_pagedel(Relation rel, Buffer buf)
 
 				/* we need an insertion scan key for the search, so build one */
 				itup_key = _bt_mkscankey(rel, targetkey);
+				/* absent attributes need explicit minus infinity values */
+				itup_key->minusinfkey = true;
 				/* get stack to leaf page by searching index */
 				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
 				/* don't need a lock or second pin on the page */
@@ -1969,7 +2043,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	if (BufferIsValid(metabuf))
 	{
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 		metad->btm_fastroot = rightsib;
 		metad->btm_fastlevel = targetlevel;
@@ -2017,6 +2091,8 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 		{
 			XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			xlmeta.version = metad->btm_version;
 			xlmeta.root = metad->btm_root;
 			xlmeta.level = metad->btm_level;
 			xlmeta.fastroot = metad->btm_fastroot;
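
For what it's worth, the version gating inside _bt_heapkeyspace() reduces to a single comparison against the cached metapage version; here is a sketch with the constant written out (3 standing in for BTREE_NOVAC_VERSION, the newest format reachable by in-place upgrade in this patch).

#include <stdbool.h>

/*
 * Sketch: metapage versions newer than the in-place-upgradable format have
 * heap TID in the key space; only a freshly built (REINDEXed) version 4
 * index is a "heapkeyspace" index.
 */
static bool
heapkeyspace_sketch(unsigned int btm_version)
{
	const unsigned int novac_version = 3;	/* BTREE_NOVAC_VERSION */

	return btm_version > novac_version;
}
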
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..ec2edae850 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -794,7 +794,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 	{
 		/*
 		 * Do cleanup if metapage needs upgrade, because we don't have
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7940297305..54a4c64304 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -152,8 +152,12 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link during the second half of a page split -- if the caller
+		 * ends up splitting the child, it usually inserts a new pivot tuple
+		 * for the child's new right sibling immediately after the original
+		 * bts_offset offset recorded here.  The downlink block will be needed
+		 * to check if bts_offset remains the position of this same pivot
+		 * tuple.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -251,11 +255,13 @@ _bt_moveright(Relation rel,
 	/*
 	 * When nextkey = false (normal case): if the scan key that brought us to
 	 * this page is > the high key stored on the page, then the page has split
-	 * and we need to move right.  (If the scan key is equal to the high key,
-	 * we might or might not need to move right; have to scan the page first
-	 * anyway.)
+	 * and we need to move right.  (pg_upgrade'd !heapkeyspace indexes could
+	 * have some duplicates to the right as well as the left, but that's
+	 * something that's only ever dealt with on the leaf level, after
+	 * _bt_search has found an initial leaf page.)
 	 *
 	 * When nextkey = true: move right if the scan key is >= page's high key.
+	 * (Note that key.scantid cannot be set in this case.)
 	 *
 	 * The page could even have split more than once, so scan as far as
 	 * needed.
@@ -358,6 +364,9 @@ _bt_binsrch(Relation rel,
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+
 	if (!key->restorebinsrch)
 	{
 		low = P_FIRSTDATAKEY(opaque);
@@ -367,6 +376,7 @@ _bt_binsrch(Relation rel,
 	else
 	{
 		/* Restore result of previous binary search against same page */
+		Assert(!key->heapkeyspace || key->scantid != NULL);
 		Assert(P_ISLEAF(opaque));
 		low = key->low;
 		high = key->stricthigh;
@@ -446,6 +456,7 @@ _bt_binsrch(Relation rel,
 	if (key->savebinsrch)
 	{
 		Assert(isleaf);
+		Assert(key->scantid == NULL);
 		key->low = low;
 		key->stricthigh = stricthigh;
 		key->savebinsrch = false;
@@ -492,19 +503,31 @@ _bt_compare(Relation rel,
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	ItemPointer heapTid;
 	ScanKey		scankey;
+	int			ncmpkey;
+	int			ntupatts;
 
-	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(key->heapkeyspace || key->scantid == NULL);
+	Assert(key->minusinfkey || key->heapkeyspace);
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
 	 * --- see NOTE above.
+	 *
+	 * A minus infinity key has all attributes truncated away, so this test is
+	 * redundant with the minus infinity attribute tie-breaker.  However, the
+	 * number of attributes in minus infinity tuples was not explicitly
+	 * represented as 0 until PostgreSQL v11, so an explicit offnum test is
+	 * still required.
 	 */
 	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -518,8 +541,10 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
+	ncmpkey = Min(ntupatts, key->keysz);
+	Assert(key->heapkeyspace || ncmpkey == key->keysz);
 	scankey = key->scankeys;
-	for (int i = 1; i <= key->keysz; i++)
+	for (int i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -570,8 +595,65 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * All non-truncated attributes (other than heap TID) were found to be
+	 * equal.  Treat truncated attributes as minus infinity when scankey has a
+	 * key attribute value that would otherwise be compared directly.
+	 *
+	 * Note: it doesn't matter if ntupatts includes non-key attributes;
+	 * scankey won't, so explicitly excluding non-key attributes isn't
+	 * necessary.
+	 */
+	if (key->keysz > ntupatts)
+		return 1;
+
+	/*
+	 * Use the heap TID attribute and scantid to try to break the tie.  The
+	 * rules are the same as any other key attribute -- only the
+	 * representation differs.  (This is also a convenient point to check if
+	 * the !minusinfkey optimization can be used.)
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (key->scantid == NULL)
+	{
+		/*
+		 * Most searches (all !minusinfkey searches) are not interested in
+		 * keys where minus infinity is explicitly represented, since that's a
+		 * sentinel value that never appears in non-pivot tuples.  It is safe
+		 * for these searches to have their scankey considered greater than a
+		 * truncated pivot tuple iff the scankey has equal values for
+		 * attributes up to and including the least significant untruncated
+		 * attribute in pivot tuple.  The only would-be "match" that will be
+		 * "missed" is a single leaf page's high key (the leaf page whose high
+		 * attribute in the pivot tuple.  The only would-be "match" that will be
+		 *
+		 * This optimization prevents an extra leaf page visit when the
+		 * insertion scankey would otherwise be equal.  If this tiebreaker
+		 * wasn't performed, code like _bt_readpage() and _bt_readnextpage()
+		 * would often end up moving right having found no matches on the leaf
+		 * page that their search lands on initially.
+		 *
+		 * Note: the heap TID part of this test ensures that scankey is being
+		 * compared to a pivot tuple with one or more truncated key attributes
+		 * (often though not necessarily just the heap TID attribute).
+		 */
+		if (!key->minusinfkey && key->keysz == ntupatts && heapTid == NULL)
+			return 1;
+
+		/* All provided scankey arguments found to be equal */
+		return 0;
+	}
+
+	/*
+	 * Treat truncated heap TID as minus infinity, since scankey has a key
+	 * attribute value (scantid) that would otherwise be compared directly
+	 */
+	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+	if (heapTid == NULL)
+		return 1;
+
+	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+	return ItemPointerCompare(key->scantid, heapTid);
 }
 
 /*
@@ -1088,7 +1170,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/* Initialize remaining insertion scan key fields */
 	inskey.savebinsrch = inskey.restorebinsrch = false;
 	inskey.low = inskey.stricthigh = InvalidOffsetNumber;
+	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	inskey.minusinfkey = !inskey.heapkeyspace;
 	inskey.nextkey = nextkey;
+	inskey.scantid = NULL;
 	inskey.keysz = keysCount;
 
 	/*
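
The new tail of _bt_compare() ends in an ItemPointerCompare() of scantid against the tuple's heap TID, so otherwise-equal entries are ordered by heap location.  A small standalone illustration using the ordinary ItemPointer macros (the block/offset numbers are arbitrary examples, not taken from the patch):

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Sketch: two entries with identical user key values are ordered by their
 * heap TIDs, so the leaf level never contains two fully equal keys.  Here
 * (block 10, offset 1) sorts before (block 10, offset 2).
 */
static int
heap_tid_tiebreak_example(void)
{
	ItemPointerData tid_a;
	ItemPointerData tid_b;

	ItemPointerSet(&tid_a, 10, 1);
	ItemPointerSet(&tid_b, 10, 2);

	return ItemPointerCompare(&tid_a, &tid_b);	/* returns -1 */
}
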
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 759859c302..67cdb44cf5 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -746,6 +746,7 @@ _bt_sortaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -799,8 +800,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -817,27 +816,21 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	itupsz = MAXALIGN(itupsz);
 
 	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itupsz doesn't
-	 * include the ItemId.
+	 * Check whether the item can fit on a btree page at all.
 	 *
-	 * NOTE: similar code appears in _bt_insertonpg() to defend against
-	 * oversize items being inserted into an already-existing index. But
-	 * during creation of an index, we don't go through there.
+	 * Every newly built index will treat heap TID as part of the keyspace,
+	 * which imposes the requirement that new high keys must occasionally have
+	 * a heap TID appended within _bt_truncate().  That may leave a new pivot
+	 * tuple one or two MAXALIGN() quantums larger than the original first
+	 * right tuple it's derived from.  v4 deals with the problem by decreasing
+	 * the limit on the size of tuples inserted on the leaf level by the same
+	 * small amount.  Enforce the new v4+ limit on the leaf level, and the old
+	 * limit on internal levels, since pivot tuples may need to make use of
+	 * the reserved space.  This should never fail on internal pages.
 	 */
-	if (itupsz > BTMaxItemSize(npage))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itupsz, BTMaxItemSize(npage),
-						RelationGetRelationName(wstate->index)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(wstate->heap,
-									RelationGetRelationName(wstate->index))));
+	if (unlikely(itupsz > BTMaxItemSize(npage)))
+		_bt_check_third_page(wstate->index, wstate->heap,
+							 state->btps_level == 0, npage, itup);
 
 	/*
 	 * Check to see if page is "full".  It's definitely full if the item won't
@@ -883,24 +876,35 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
+			IndexTuple	lastleft;
 			IndexTuple	truncated;
 			Size		truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks
 			 * in internal pages are either negative infinity items, or get
 			 * their contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it more
+			 * likely that _bt_truncate() can truncate away more attributes,
+			 * whereas the split point passed to _bt_split() is chosen much
+			 * more delicately.  Suffix truncation is mostly useful because it
+			 * improves space utilization for workloads with random
+			 * insertions.  It doesn't seem worthwhile to add logic for
+			 * choosing a split point here for a benefit that is bound to be
+			 * much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
 			 * original high key, and add our own truncated high key at the
-			 * same offset.
+			 * same offset.  It's okay if the truncated tuple is slightly
+			 * larger due to containing a heap TID value, since this case is
+			 * known to _bt_check_third_page(), which reserves space.
 			 *
 			 * Note that the page layout won't be changed very much.  oitup is
 			 * already located at the physical beginning of tuple space, so we
@@ -908,7 +912,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_truncate(wstate->index, lastleft, oitup,
+									 wstate->inskey);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -927,8 +935,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -973,7 +982,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1032,8 +1041,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1126,6 +1136,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1133,7 +1145,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1150,6 +1161,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all
+				 * keys in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
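
The reserved-space comment in _bt_buildadd() above refers to limit macros in nbtree.h that aren't shown in this hunk; the rough idea, with the reservation size stated here as an assumption rather than the patch's exact macro, is that the leaf-level ceiling gives up room for one MAXALIGN'd heap TID.

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Rough sketch only: the leaf-level limit gives up enough room for one
 * MAXALIGN'd heap TID, so _bt_truncate() can always append a tie-breaker
 * to a new pivot without exceeding the limit that internal pages still
 * use.  (The patch's real limit macros also account for item headers.)
 */
static Size
leaf_limit_sketch(Size internal_level_limit)
{
	return internal_level_limit - MAXALIGN(sizeof(ItemPointerData));
}
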
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index e010bcdcfa..15090b26d2 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,6 +49,8 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+			   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -56,9 +58,25 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		Result is intended for use with _bt_compare().  Callers that don't
- *		need to fill out the insertion scankey arguments (e.g. they use an own
- *		ad-hoc comparison routine) can pass a NULL index tuple.
+ *		When itup is a non-pivot tuple, the returned insertion scan key is
+ *		suitable for finding a place for it to go on the leaf level.  Pivot
+ *		tuples can be used to relocate the leaf page with a matching high
+ *		key, but then the caller needs to set the scan key's minusinfkey
+ *		field.  This can be thought of as explicitly representing that
+ *		attributes absent from the scan key have minus infinity values.
+ *
+ *		Result is intended for use with _bt_compare() and _bt_truncate().
+ *		Callers that don't need to fill out the insertion scankey arguments
+ *		(e.g. they use their own ad-hoc comparison routine, or only need a
+ *		scankey for _bt_truncate()) can pass a NULL index tuple.  The
+ *		scankey will be initialized as if an "all truncated" pivot tuple
+ *		was passed instead.
+ *
+ *		Note that we may occasionally have to share lock the metapage to
+ *		determine whether or not the keys in the index are expected to be
+ *		unique (i.e. if this is a "heapkeyspace" index).  We assume a
+ *		heapkeyspace index when caller passes a NULL tuple, allowing index
+ *		build callers to avoid accessing the non-existent metapage.
  */
 BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
@@ -79,15 +97,38 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
-	 * We'll execute search using scan key constructed on key columns. Non-key
-	 * (INCLUDE index) columns are always omitted from scan keys.
+	 * We'll execute search using scan key constructed on key columns.
+	 * Truncated attributes and non-key attributes are omitted from the final
+	 * scan key.
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
+	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+
+	/*
+	 * Only heapkeyspace indexes support the "no minus infinity keys"
+	 * optimization.  !heapkeyspace indexes don't actually have minus infinity
+	 * attributes, but this allows us to avoid checking heapkeyspace
+	 * separately (explicit representation of number of key attributes in v3
+	 * indexes shouldn't confuse tie-breaker logic).
+	 *
+	 * There is never a need to explicitly represent truncated attributes as
+	 * having minus infinity values.  The only caller that may truly need to
+	 * search for negative infinity is the page deletion code.  It is
+	 * sufficient to omit trailing truncated attributes from the scankey
+	 * returned to that caller because caller relies on the fact that there
+	 * cannot be duplicate high keys in heapkeyspace indexes.  Caller also
+	 * opts out of the "no minus infinity key" optimization, so search moves
+	 * left on scankey-equal downlink in parent, allowing VACUUM caller to
+	 * reliably relocate leaf page undergoing deletion.
+	 */
+	key->minusinfkey = !key->heapkeyspace;
 	key->savebinsrch = key->restorebinsrch = false;
 	key->low = key->stricthigh = InvalidOffsetNumber;
 	key->nextkey = false;
 	key->keysz = Min(indnkeyatts, tupnatts);
+	key->scantid = key->heapkeyspace && itup ?
+		BTreeTupleGetHeapTID(itup) : NULL;
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -103,9 +144,9 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
 
 		/*
-		 * Key arguments built when caller provides no tuple are defensively
-		 * represented as NULL values, though they should still not
-		 * participate in comparisons.
+		 * Key arguments built from truncated attributes (or when caller
+		 * provides no tuple) are defensively represented as NULL values,
+		 * though they should still not participate in comparisons.
 		 */
 		if (i < tupnatts)
 			arg = index_getattr(itup, i + 1, itupdesc, &null);
@@ -2043,38 +2084,238 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that takes up more
+ * space than firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
  * Note that returned tuple's t_tid offset will hold the number of attributes
  * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * should only change truncated tuple's downlink.  Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID is appended) the size of the returned
+ * tuple is the size of the first right tuple plus an additional MAXALIGN()'d
+ * item pointer.  This guarantee is important, since callers need to stay
+ * under the 1/3 of a page restriction on tuple size.  If this routine is ever
+ * taught to truncate within an attribute/datum, it will need to avoid
+ * returning an enlarged tuple to caller when truncation + TOAST compression
+ * ends up enlarging the final datum.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			 BTScanInsert itup_key)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int16		natts = IndexRelationGetNumberOfAttributes(rel);
+	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+	IndexTuple	pivot;
+	ItemPointer pivotheaptid;
+	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples.  It's never okay to
+	 * truncate a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be kept in truncated tuple */
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
 
-	return truncated;
+#ifdef DEBUG_NO_TRUNCATE
+	/* Force truncation to be ineffective for testing purposes */
+	keepnatts = nkeyatts + 1;
+#endif
+
+	if (keepnatts <= natts)
+	{
+		IndexTuple	tidpivot;
+
+		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+
+		/*
+		 * If there is a distinguishing key attribute within the new pivot
+		 * tuple, there is no need to add an explicit heap TID attribute
+		 */
+		if (keepnatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			return pivot;
+		}
+
+		/*
+		 * Only truncation of non-key attributes was possible, since key
+		 * attributes are all equal.  It's necessary to add a heap TID
+		 * attribute to the new pivot tuple.
+		 */
+		Assert(natts != nkeyatts);
+		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		/* cannot leak memory here */
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * We have to use heap TID as a unique-ifier in the new pivot tuple, since
+	 * no non-TID key attribute in the right item readily distinguishes the
+	 * right side of the split from the left side.  Use enlarged space that
+	 * holds a copy of first right tuple; place a heap TID value within the
+	 * extra space that remains at the end.
+	 *
+	 * nbtree conceptualizes this case as an inability to truncate away any
+	 * key attribute.  We must use an alternative representation of heap TID
+	 * within pivots because heap TID is only treated as an attribute within
+	 * nbtree (e.g., there is no explicit pg_attribute entry).
+	 */
+	Assert(itup_key->heapkeyspace);
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/*
+	 * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+	 * consider suffix truncation.  It seems like a good idea to follow that
+	 * example in cases where no truncation takes place -- use lastleft's heap
+	 * TID.  (This is also the closest value to negative infinity that's
+	 * legally usable.)
+	 */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  sizeof(ItemPointerData));
+	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+
+	/*
+	 * Lehman and Yao require that the downlink to the right page, which is to
+	 * be inserted into the parent page in the second phase of a page split, be
+	 * a strict lower bound on items on the right page, and a non-strict upper
+	 * bound for items on the left page.  Assert that heap TIDs follow these
+	 * invariants, since a heap TID value is apparently needed as a
+	 * tiebreaker.
+	 */
+#ifndef DEBUG_NO_TRUNCATE
+	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#else
+
+	/*
+	 * Those invariants aren't guaranteed to hold for lastleft + firstright
+	 * heap TID attribute values when they're considered here only because
+	 * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+	 * needed as a tiebreaker).  DEBUG_NO_TRUNCATE must therefore use a heap
+	 * TID value that always works as a strict lower bound for items to the
+	 * right.  In particular, it must avoid using firstright's leading key
+	 * attribute values along with lastleft's heap TID value when lastleft's
+	 * TID happens to be greater than firstright's TID.
+	 *
+	 * (We could just use all of lastleft instead, but that would complicate
+	 * caller's free space accounting, which makes the assumption that the new
+	 * pivot must be no larger than firstright plus a MAXALIGN()'d item
+	 * pointer.)
+	 */
+	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+
+	/*
+	 * Pivot heap TID should never be fully equal to firstright.  Note that
+	 * the pivot heap TID will still end up equal to lastleft's heap TID when
+	 * that's the only value that's legally usable.
+	 */
+	ItemPointerSetOffsetNumber(pivotheaptid,
+							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#endif
+
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point.  Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			   BTScanInsert itup_key)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keepnatts;
+	ScanKey		scankey;
+
+	/*
+	 * Be consistent about the representation of BTREE_VERSION 3 tuples across
+	 * Postgres versions; don't allow new pivot tuples to have truncated key
+	 * attributes there.  This keeps things consistent and simple for
+	 * verification tools that have to handle multiple versions.
+	 */
+	if (!itup_key->heapkeyspace)
+	{
+		Assert(nkeyatts != IndexRelationGetNumberOfAttributes(rel));
+		return nkeyatts;
+	}
+
+	scankey = itup_key->scankeys;
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+											scankey->sk_collation,
+											datum1,
+											datum2)) != 0)
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
 }
 
 /*
@@ -2088,15 +2329,17 @@ _bt_nonkey_truncate(Relation rel, IndexTuple itup)
  * preferred to calling here.  That's usually more convenient, and is always
  * more explicit.  Call here instead when offnum's tuple may be a negative
  * infinity tuple that uses the pre-v11 on-disk representation, or when a low
- * context check is appropriate.
+ * context check is appropriate.  This routine is as strict as possible about
+ * what is expected on each version of btree.
  */
 bool
-_bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
+_bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 {
 	int16		natts = IndexRelationGetNumberOfAttributes(rel);
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2116,16 +2359,26 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Leaf tuples that are not the page high key (non-pivot tuples)
-			 * should never be truncated
+			 * Non-pivot tuples currently never use alternative heap TID
+			 * representation -- even those within heapkeyspace indexes
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+				return false;
+
+			/*
+			 * Leaf tuples that are not the page high key (non-pivot tuples)
+			 * should never be truncated.  (Note that tupnatts must have been
+			 * inferred, rather than coming from an explicit on-disk
+			 * representation.)
+			 */
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2135,8 +2388,15 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 */
 			Assert(!P_RIGHTMOST(opaque));
 
-			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			/*
+			 * !heapkeyspace high key tuple contains only key attributes. Note
+			 * that tupnatts will only have been explicitly represented in
+			 * !heapkeyspace indexes that happen to have non-key attributes.
+			 */
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2148,7 +2408,11 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * its high key) is its negative infinity tuple.  Negative
 			 * infinity tuples are always truncated to zero attributes.  They
 			 * are a particular kind of pivot tuple.
-			 *
+			 */
+			if (heapkeyspace)
+				return tupnatts == 0;
+
+			/*
 			 * The number of attributes won't be explicitly represented if the
 			 * negative infinity tuple was generated during a page split that
 			 * occurred with a version of Postgres before v11.  There must be
@@ -2159,18 +2423,109 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == 0 ||
+			return tupnatts == 0 ||
 				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
 		{
 			/*
-			 * Tuple contains only key attributes despite on is it page high
-			 * key or not
+			 * !heapkeyspace downlink tuple with separator key contains only
+			 * key attributes.  Note that tupnatts will only have been
+			 * explicitly represented in !heapkeyspace indexes that happen to
+			 * have non-key attributes.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 
 	}
+
+	/* Handle heapkeyspace pivot tuples (excluding minus infinity items) */
+	Assert(heapkeyspace);
+
+	/*
+	 * Explicit representation of the number of attributes is mandatory with
+	 * heapkeyspace index pivot tuples, regardless of whether or not there are
+	 * non-key attributes.
+	 */
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+
+	/*
+	 * Heap TID is a tie-breaker key attribute, so it cannot be untruncated
+	 * when any other key attribute is truncated
+	 */
+	if (BTreeTupleGetHeapTID(itup) != NULL && tupnatts != nkeyatts)
+		return false;
+
+	/*
+	 * Pivot tuple must have at least one untruncated key attribute (minus
+	 * infinity pivot tuples are the only exception).  Pivot tuples can never
+	 * represent that there is a value present for a key attribute that
+	 * exceeds pg_index.indnkeyatts for the index.
+	 */
+	return tupnatts > 0 && tupnatts <= nkeyatts;
+}
+
+/*
+ *	_bt_check_third_page() -- check whether tuple fits on a btree page at all.
+ *
+ * We actually need to be able to fit three items on every page, so restrict
+ * any one item to 1/3 the per-page available space.  Note that itemsz should
+ * not include the ItemId overhead.
+ *
+ * It might be useful to apply TOAST methods rather than throw an error here.
+ * Using out of line storage would break assumptions made by suffix truncation
+ * and by contrib/amcheck, though.
+ */
+void
+_bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
+					 Page page, IndexTuple newtup)
+{
+	Size		itemsz;
+	BTPageOpaque opaque;
+
+	itemsz = MAXALIGN(IndexTupleSize(newtup));
+
+	/* Double check item size against limit */
+	if (itemsz <= BTMaxItemSize(page))
+		return;
+
+	/*
+	 * Tuple is probably too large to fit on the page, but it's possible that
+	 * the index uses version 2 or version 3, or that the page is an internal
+	 * page, in which case a slightly higher limit applies.
+	 */
+	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
+		return;
+
+	/*
+	 * Internal page insertions cannot fail here, because that would mean that
+	 * an earlier leaf level insertion that should have failed didn't
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (!P_ISLEAF(opaque))
+		elog(ERROR, "cannot insert oversized tuple of size %zu on internal page of index \"%s\"",
+			 itemsz, RelationGetRelationName(rel));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+			 errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
+					itemsz,
+					needheaptidspace ? BTREE_VERSION : BTREE_NOVAC_VERSION,
+					needheaptidspace ? BTMaxItemSize(page) :
+					BTMaxItemSizeNoHeapTid(page),
+					RelationGetRelationName(rel)),
+			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+					   ItemPointerGetBlockNumber(&newtup->t_tid),
+					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   RelationGetRelationName(heap)),
+			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+					 "Consider a function index of an MD5 hash of the value, "
+					 "or use full text indexing."),
+			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index b0666b42df..876ff0c40f 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -103,7 +103,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 
 	md = BTPageGetMeta(metapg);
 	md->btm_magic = BTREE_MAGIC;
-	md->btm_version = BTREE_VERSION;
+	md->btm_version = xlrec->version;
 	md->btm_root = xlrec->root;
 	md->btm_level = xlrec->level;
 	md->btm_fastroot = xlrec->fastroot;
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -284,6 +268,8 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		OffsetNumber off;
 		IndexTuple	newitem = NULL;
 		Size		newitemsz = 0;
+		IndexTuple	left_hikey = NULL;
+		Size		left_hikeysz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 8d5c6ae0ab..fcac0cd8a9 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index f97a82ae7b..5b7637883e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4057,9 +4057,10 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required for
+	 * btree indexes, since heap TID is treated as an implicit last key
+	 * attribute in order to ensure that all keys in the index are physically
+	 * unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
@@ -4076,6 +4077,9 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
@@ -4128,6 +4132,9 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 950a61958d..c24b9a7c37 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -112,18 +112,53 @@ typedef struct BTMetaPageData
 #define BTPageGetMeta(p) \
 	((BTMetaPageData *) PageGetContents(p))
 
+/*
+ * Btree version 4 (used by indexes initialized by PostgreSQL v12) made
+ * general changes to the on-disk representation to add support for
+ * heapkeyspace semantics, necessitating a REINDEX to get heapkeyspace
+ * semantics in pg_upgrade scenarios.  We continue to offer support for
+ * BTREE_MIN_VERSION in order to support upgrades from PostgreSQL versions
+ * up to and including v10 to v12+ without requiring a REINDEX.
+ * Similarly, we continue to offer support for BTREE_NOVAC_VERSION to
+ * support upgrades from v11 to v12+ without requiring a REINDEX.
+ *
+ * We maintain PostgreSQL v11's ability to upgrade from BTREE_MIN_VERSION
+ * to BTREE_NOVAC_VERSION automatically.  v11's "no vacuuming" enhancement
+ * (the ability to skip full index scans during vacuuming) only requires
+ * two new metapage fields, which makes it possible to upgrade at any
+ * point that the metapage must be updated anyway (e.g. during a root page
+ * split).  Note also that there happened to be no changes in metapage
+ * layout for btree version 4.  All current metapage fields should have
+ * valid values set when a metapage WAL record is replayed.
+ *
+ * It's convenient to consider the "no vacuuming" enhancement (metapage
+ * layout compatibility) separately from heapkeyspace semantics, since
+ * each issue affects different areas.  This is just a convention; in
+ * practice a heapkeyspace index is always also a "no vacuuming" index.
+ */
 #define BTREE_METAPAGE	0		/* first page is meta */
 #define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
+#define BTREE_VERSION	4		/* current version number */
 #define BTREE_MIN_VERSION	2	/* minimal supported version number */
+#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_truncate() will need to enlarge
+ * a heap index tuple to make space for a tie-breaker heap TID
+ * attribute, which we account for here.
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*sizeof(ItemPointerData)) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeNoHeapTid(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
@@ -203,22 +238,25 @@ typedef struct BTMetaPageData
  * their item pointer offset field, since pivot tuples never need to store a
  * real offset (downlinks only need to store a block number).  The offset
  * field only stores the number of attributes when the INDEX_ALT_TID_MASK
- * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * bit is set, though that number doesn't include the trailing heap TID
+ * attribute sometimes stored in pivot tuples -- that's represented by the
+ * presence of BT_HEAP_TID_ATTR.  INDEX_ALT_TID_MASK is only used for pivot
+ * tuples at present, though it's possible that it will be used within
+ * non-pivot tuples in the future.  All pivot tuples must have
+ * INDEX_ALT_TID_MASK set as of BTREE_VERSION 4.
  *
  * The 12 least significant offset bits are used to represent the number of
- * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
- * for future use (BT_RESERVED_OFFSET_MASK bits). BT_N_KEYS_OFFSET_MASK should
- * be large enough to store any number <= INDEX_MAX_KEYS.
+ * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 status bits
+ * (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for future
+ * use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any number of
+ * attributes <= INDEX_MAX_KEYS.
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
+
+/* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +279,16 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.  Note also that the number of key attributes must be
+ * explicitly represented in heapkeyspace pivot tuples.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +297,46 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
+ * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
+ * (as all pivot tuples must, as of BTREE_VERSION 4).  When non-pivot
+ * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
+ * probably also contain a heap TID at the end of the tuple.  We currently
+ * assume that a tuple with INDEX_ALT_TID_MASK set is a pivot tuple within
+ * heapkeyspace indexes (and that a tuple without it set must be a non-pivot
+ * tuple), but it might also be used by non-pivot tuples in the future.
+ * pg_upgrade'd !heapkeyspace indexes only set INDEX_ALT_TID_MASK in pivot
+ * tuples that actually originated with the truncation of one or more
+ * attributes.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   sizeof(ItemPointerData)) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -326,25 +402,55 @@ typedef BTStackData *BTStack;
  * _bt_search.  For details on its mutable state, see _bt_binsrch and
  * _bt_findinsertloc.
  *
+ * heapkeyspace indicates if we expect all keys in the index to be unique by
+ * treating heap TID as a tie-breaker attribute (i.e. the index is
+ * BTREE_VERSION 4+).  scantid should never be set when index is not a
+ * heapkeyspace index.
+ *
+ * minusinfkey controls an optimization used by heapkeyspace indexes.
+ * Searches that are not specifically interested in keys with the value minus
+ * infinity (all searches bar those performed by VACUUM for page deletion)
+ * apply the optimization by setting the field to false.  The optimization
+ * avoids unnecessarily reading the left sibling of the leaf page that
+ * matching tuples can appear on first.  Work is saved when the insertion
+ * scankey happens to search on all the untruncated "separator" key attributes
+ * for some pivot tuple, without also providing a key value for a remaining
+ * truncated-in-pivot-tuple attribute.  Reasoning about minus infinity values
+ * specifically allows this case to use a special tie-breaker, guiding search
+ * right instead of left on the next level down.  This is particularly likely
+ * to help in the common case where insertion scankey has no scantid but has
+ * values for all other attributes, especially with indexes that happen to
+ * have few distinct values (once heap TID is excluded) on each leaf page.
+ *
  * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
  * locate the first item >= scankey.  When nextkey is true, they will locate
  * the first item > scan key.
  *
- * keysz is the number of insertion scankeys present.
+ * scantid is the heap TID that is used as a final tie-breaker attribute,
+ * which may be set to NULL to indicate its absence.  When inserting new
+ * tuples, it must be set, since every tuple in the tree unambiguously belongs
+ * in one exact position, even when there are entries in the tree that are
+ * considered duplicates by external code.  Unique insertions set scantid only
+ * after unique checking indicates that it's safe to insert.  Despite the
+ * representational difference, scantid is just another insertion scankey to
+ * routines like _bt_search.
  *
- * scankeys is an array of scan key entries for attributes that are compared.
- * During insertion, there must be a scan key for every attribute, but when
- * starting a regular index scan some can be omitted.  The array is used as a
- * flexible array member, though it's sized in a way that makes it possible to
- * use stack allocations.  See nbtree/README for full details.
+ * keysz is the number of insertion scankeys present, not including scantid.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared
+ * before scantid (user-visible attributes).  During insertion, there must be
+ * a scan key for every attribute, but when starting a regular index scan some
+ * can be omitted.  The array is used as a flexible array member, though it's
+ * sized in a way that makes it possible to use stack allocations.  See
+ * nbtree/README for full details.
  */
 
 typedef struct BTScanInsertData
 {
 	/*
 	 * Mutable state used by _bt_binsrch to inexpensively repeat a binary
-	 * search on the leaf level.  Only used for insertions where
-	 * _bt_check_unique is called.
+	 * search on the leaf level when only scantid has changed.  Only used for
+	 * insertions where _bt_check_unique is called.
 	 */
 	bool		savebinsrch;
 	bool		restorebinsrch;
@@ -352,7 +458,10 @@ typedef struct BTScanInsertData
 	OffsetNumber stricthigh;
 
 	/* State used to locate a position at the leaf level */
+	bool		heapkeyspace;
+	bool		minusinfkey;
 	bool		nextkey;
+	ItemPointer scantid;		/* tiebreaker for scankeys */
 	int			keysz;			/* Size of scankeys */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
 } BTScanInsertData;
@@ -582,6 +691,7 @@ extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
+extern bool _bt_heapkeyspace(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -635,8 +745,12 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
-extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+			 IndexTuple firstright, BTScanInsert itup_key);
+extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
+				OffsetNumber offnum);
+extern void _bt_check_third_page(Relation rel, Relation heap,
+					 bool needheaptidspace, Page page, IndexTuple newtup);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index a605851c98..a4cbdff283 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,7 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50 and 0x60 are unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -47,6 +46,7 @@
  */
 typedef struct xl_btree_metadata
 {
+	uint32		version;
 	BlockNumber root;
 	uint32		level;
 	BlockNumber fastroot;
@@ -82,20 +82,16 @@ typedef struct xl_btree_insert
  *
  * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
  * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always explicitly log the left page high key value.
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * In the _R variants, the new item is one of the right page's tuples.  An
+ * IndexTuple representing the HIKEY of the left page always follows: with
+ * suffix truncation, the left page's high key can no longer be assumed to
+ * match the leftmost key in the new right page.
  *
  * Backup Blk 1: new right page
  *
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index b21298a2a6..ff443a476c 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -199,28 +199,22 @@ reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
 --
--- Test B-tree page deletion. In particular, deleting a non-leaf page.
+-- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
--- First create a tree that's at least four levels deep. The text inserted
--- is long and poorly compressible. That way only a few index tuples fit on
--- each page, allowing us to get a tall tree with fewer pages.
+-- First create a tree that's at least three levels deep (i.e. has one level
+-- between the root and leaf levels). The text inserted is long.  It won't be
+-- compressed because we use plain storage in the table.  Only a few index
+-- tuples fit on each internal page, allowing us to get a tall tree with few
+-- pages.  (A tall tree is required to trigger caching.)
+--
+-- The text column must be the leading column in the index, since suffix
+-- truncation would otherwise truncate tuples on internal pages, leaving us
+-- with a short tree.
 create table btree_tall_tbl(id int4, t text);
-create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
-insert into btree_tall_tbl
-  select g, g::text || '_' ||
-          (select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
-from generate_series(1, 100) g;
--- Delete most entries, and vacuum. This causes page deletions.
-delete from btree_tall_tbl where id < 950;
-vacuum btree_tall_tbl;
---
--- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
--- WAL record type). This happens when a "fast root" page is split.
---
--- The vacuum above should've turned the leaf page into a fast root. We just
--- need to insert some rows to cause the fast root page to split.
-insert into btree_tall_tbl (id, t)
-  select g, repeat('x', 100) from generate_series(1, 500) g;
+alter table btree_tall_tbl alter COLUMN t set storage plain;
+create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
+insert into btree_tall_tbl select g, repeat('x', 250)
+from generate_series(1, 130) g;
 --
 -- Test vacuum_cleanup_index_scale_factor
 --
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 5d4eb59a0c..54d3eee197 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3225,11 +3225,22 @@ explain (costs off)
 CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 --
+-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
+-- WAL record type). This happens when a "fast root" page is split.  This
+-- also creates coverage for nbtree FSM page recycling.
+--
+-- The vacuum above should've turned the leaf page into a fast root. We just
+-- need to insert some rows to cause the fast root page to split.
+INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
+--
 -- REINDEX (VERBOSE)
 --
 CREATE TABLE reindex_verbose(id integer primary key);
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8e50f8ffbb..8d31110b87 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 -- doesn't work: grant still exists
 DROP USER regress_dep_user1;
 ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
-DETAIL:  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
+DETAIL:  privileges for table deptest1
 privileges for database regression
-privileges for table deptest1
+owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
 DROP OWNED BY regress_dep_user1;
 DROP USER regress_dep_user1;
 \set VERBOSITY terse
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index 0e32d5c427..ac41419c7b 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -187,9 +187,9 @@ ERROR:  event trigger "regress_event_trigger" does not exist
 -- should fail, regress_evt_user owns some objects
 drop role regress_evt_user;
 ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
-DETAIL:  owner of event trigger regress_event_trigger3
+DETAIL:  owner of user mapping for regress_evt_user on server useless_server
 owner of default privileges on new relations belonging to role regress_evt_user
-owner of user mapping for regress_evt_user on server useless_server
+owner of event trigger regress_event_trigger3
 -- cleanup before next test
 -- these are all OK; the second one should emit a NOTICE
 drop event trigger if exists regress_event_trigger2;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 4d82d3a7e8..9c763ec184 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
 RESET ROLE;
 DROP ROLE regress_test_indirect;                            -- ERROR
 ERROR:  role "regress_test_indirect" cannot be dropped because some objects depend on it
-DETAIL:  owner of server s1
-privileges for foreign-data wrapper foo
+DETAIL:  privileges for foreign-data wrapper foo
+owner of server s1
 \des+
                                                                                  List of foreign servers
  Name |           Owner           | Foreign-data wrapper |                   Access privileges                   |  Type  | Version |             FDW options              | Description 
@@ -1998,9 +1998,9 @@ DROP TABLE temp_parted;
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 ERROR:  role "regress_test_role" cannot be dropped because some objects depend on it
-DETAIL:  privileges for server s4
+DETAIL:  owner of user mapping for regress_test_role on server s6
 privileges for foreign-data wrapper foo
-owner of user mapping for regress_test_role on server s6
+privileges for server s4
 DROP SERVER t1 CASCADE;
 NOTICE:  drop cascades to user mapping for public on server t1
 DROP USER MAPPING FOR regress_test_role SERVER s6;
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index 2e170497c9..bad5199d9e 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3503,8 +3503,8 @@ SELECT refclassid::regclass, deptype
 SAVEPOINT q;
 DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
 ERROR:  role "regress_rls_eve" cannot be dropped because some objects depend on it
-DETAIL:  target of policy p on table tbl1
-privileges for table tbl1
+DETAIL:  privileges for table tbl1
+target of policy p on table tbl1
 ROLLBACK TO q;
 ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
 SAVEPOINT q;
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 2b087be796..19fbfa8b72 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -84,32 +84,23 @@ reset enable_indexscan;
 reset enable_bitmapscan;
 
 --
--- Test B-tree page deletion. In particular, deleting a non-leaf page.
+-- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
 
--- First create a tree that's at least four levels deep. The text inserted
--- is long and poorly compressible. That way only a few index tuples fit on
--- each page, allowing us to get a tall tree with fewer pages.
+-- First create a tree that's at least three levels deep (i.e. has one level
+-- between the root and leaf levels). The text inserted is long.  It won't be
+-- compressed because we use plain storage in the table.  Only a few index
+-- tuples fit on each internal page, allowing us to get a tall tree with few
+-- pages.  (A tall tree is required to trigger caching.)
+--
+-- The text column must be the leading column in the index, since suffix
+-- truncation would otherwise truncate tuples on internal pages, leaving us
+-- with a short tree.
 create table btree_tall_tbl(id int4, t text);
-create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
-insert into btree_tall_tbl
-  select g, g::text || '_' ||
-          (select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
-from generate_series(1, 100) g;
-
--- Delete most entries, and vacuum. This causes page deletions.
-delete from btree_tall_tbl where id < 950;
-vacuum btree_tall_tbl;
-
---
--- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
--- WAL record type). This happens when a "fast root" page is split.
---
-
--- The vacuum above should've turned the leaf page into a fast root. We just
--- need to insert some rows to cause the fast root page to split.
-insert into btree_tall_tbl (id, t)
-  select g, repeat('x', 100) from generate_series(1, 500) g;
+alter table btree_tall_tbl alter COLUMN t set storage plain;
+create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
+insert into btree_tall_tbl select g, repeat('x', 250)
+from generate_series(1, 130) g;
 
 --
 -- Test vacuum_cleanup_index_scale_factor
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 67ecad8dd5..4487421ef3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1146,11 +1146,23 @@ explain (costs off)
 CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 
+--
+-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
+-- WAL record type). This happens when a "fast root" page is split.  This
+-- also creates coverage for nbtree FSM page recycling.
+--
+-- The vacuum above should've turned the leaf page into a fast root. We just
+-- need to insert some rows to cause the fast root page to split.
+INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
+
 --
 -- REINDEX (VERBOSE)
 --
-- 
2.17.1
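
As a quick way to check which on-disk version (and therefore which key space
semantics) a given index actually has, the metapage can be inspected.  This is
only a sketch, and it assumes the pageinspect extension is installed;
btree_tall_idx is the index created by the regression test above:

    CREATE EXTENSION IF NOT EXISTS pageinspect;
    -- version 4 indicates a heapkeyspace index; 2 or 3 indicates a
    -- pg_upgrade'd index that keeps the old behavior until REINDEX
    SELECT version FROM bt_metap('btree_tall_idx');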

Attachment: v13-0006-Allow-tuples-to-be-relocated-from-root-by-amchec.patch (text/x-patch; charset=US-ASCII)
From 8c0c21b752dea0db00686b306a68980fad580806 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 31 Jan 2019 17:40:00 -0800
Subject: [PATCH v13 6/7] Allow tuples to be relocated from root by amcheck.

Teach contrib/amcheck's bt_index_parent_check() function to take
advantage of the uniqueness property of heapkeyspace indexes in support
of a new verification option: non-pivot tuples (non-highkey tuples on
the leaf level) can optionally be relocated using a new search that
starts from the root page.

The new "relocate" verification option is exhaustive, and can therefore
make a call to bt_index_parent_check() take a lot longer.  Relocating
tuples during verification is intended as an option for backend
developers, since the corruption scenarios that it alone is uniquely
capable of detecting seem fairly far-fetched.  For example, "relocate"
verification is generally the only way of detecting corruption of the
least significant byte of a key from a pivot tuple in the root page,
since only a few tuples on a cousin leaf page are liable to "be
overlooked" by index scans.
---
 contrib/amcheck/Makefile                 |   2 +-
 contrib/amcheck/amcheck--1.1--1.2.sql    |  19 +++
 contrib/amcheck/amcheck.control          |   2 +-
 contrib/amcheck/expected/check_btree.out |   5 +-
 contrib/amcheck/sql/check_btree.sql      |   5 +-
 contrib/amcheck/verify_nbtree.c          | 157 +++++++++++++++++++++--
 doc/src/sgml/amcheck.sgml                |   7 +-
 7 files changed, 181 insertions(+), 16 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.1--1.2.sql
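
For reference, a minimal usage sketch of the new option (assuming the patch
series is applied and amcheck was already installed at version 1.1); the index
name is the one used by the regression test changes below:

    -- pick up the new 1.2 signature on an existing installation
    ALTER EXTENSION amcheck UPDATE TO '1.2';
    -- parent check + heapallindexed check + the new exhaustive "relocate"
    -- check, which re-finds every non-pivot leaf tuple from the root
    SELECT bt_index_parent_check('delete_test_table_pkey', true, true);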

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index c5764b544f..dcec3b8520 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big	= amcheck
 OBJS		= verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
-DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
+DATA = amcheck--1.1--1.2.sql amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.1--1.2.sql b/contrib/amcheck/amcheck--1.1--1.2.sql
new file mode 100644
index 0000000000..de7b657f2f
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.1--1.2.sql
@@ -0,0 +1,19 @@
+/* contrib/amcheck/amcheck--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.2'" to load this file. \quit
+
+-- In order to avoid issues with dependencies when updating amcheck to 1.2,
+-- create new, overloaded version of the 1.1 function signature
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass,
+    heapallindexed boolean, relocate boolean)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want this to be available to public
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 469048403d..c6e310046d 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
 # amcheck extension
 comment = 'functions for verifying relation integrity'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/amcheck'
 relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 1e6079ddd2..687fde8fce 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -126,7 +126,8 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 (1 row)
 
 --
--- Test for multilevel page deletion/downlink present checks
+-- Test for multilevel page deletion/downlink present checks, and relocate
+-- checks
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
@@ -137,7 +138,7 @@ VACUUM delete_test_table;
 -- root"
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
-SELECT bt_index_parent_check('delete_test_table_pkey', true);
+SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
  bt_index_parent_check 
 -----------------------
  
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index 3f1e0d17ef..d33d3e6682 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -78,7 +78,8 @@ INSERT INTO bttest_multi SELECT i, i%2  FROM generate_series(1, 100000) as i;
 SELECT bt_index_parent_check('bttest_multi_idx', true);
 
 --
--- Test for multilevel page deletion/downlink present checks
+-- Test for multilevel page deletion/downlink present checks, and relocate
+-- checks
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
@@ -89,7 +90,7 @@ VACUUM delete_test_table;
 -- root"
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
-SELECT bt_index_parent_check('delete_test_table_pkey', true);
+SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
 
 --
 -- BUG #15597: must not assume consistent input toasting state when forming
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index a7d060b3ec..151c6d5fdb 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -74,6 +74,8 @@ typedef struct BtreeCheckState
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
 	bool		heapallindexed;
+	/* Also relocating non-pivot tuples? */
+	bool		relocate;
 	/* Per-page context */
 	MemoryContext targetcontext;
 	/* Buffer access strategy */
@@ -123,10 +125,11 @@ PG_FUNCTION_INFO_V1(bt_index_check);
 PG_FUNCTION_INFO_V1(bt_index_parent_check);
 
 static void bt_index_check_internal(Oid indrelid, bool parentcheck,
-						bool heapallindexed);
+						bool heapallindexed, bool relocate);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool heapkeyspace, bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed,
+					 bool relocate);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -139,6 +142,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 						   IndexTuple itup);
+static bool bt_relocate_from_root(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
@@ -176,7 +180,7 @@ bt_index_check(PG_FUNCTION_ARGS)
 	if (PG_NARGS() == 2)
 		heapallindexed = PG_GETARG_BOOL(1);
 
-	bt_index_check_internal(indrelid, false, heapallindexed);
+	bt_index_check_internal(indrelid, false, heapallindexed, false);
 
 	PG_RETURN_VOID();
 }
@@ -195,11 +199,14 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
 {
 	Oid			indrelid = PG_GETARG_OID(0);
 	bool		heapallindexed = false;
+	bool		relocate = false;
 
-	if (PG_NARGS() == 2)
+	if (PG_NARGS() >= 2)
 		heapallindexed = PG_GETARG_BOOL(1);
+	if (PG_NARGS() == 3)
+		relocate = PG_GETARG_BOOL(2);
 
-	bt_index_check_internal(indrelid, true, heapallindexed);
+	bt_index_check_internal(indrelid, true, heapallindexed, relocate);
 
 	PG_RETURN_VOID();
 }
@@ -208,7 +215,8 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
  * Helper for bt_index_[parent_]check, coordinating the bulk of the work.
  */
 static void
-bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
+						bool relocate)
 {
 	Oid			heapid;
 	Relation	indrel;
@@ -266,7 +274,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	/* Check index, possibly against table it is an index on */
 	heapkeyspace = _bt_heapkeyspace(indrel);
 	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
-						 heapallindexed);
+						 heapallindexed, relocate);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -337,7 +345,7 @@ btree_index_checkable(Relation rel)
  */
 static void
 bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
-					 bool readonly, bool heapallindexed)
+					 bool readonly, bool heapallindexed, bool relocate)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -361,6 +369,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
+	state->relocate = relocate;
 
 	if (state->heapallindexed)
 	{
@@ -429,6 +438,14 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		}
 	}
 
+	Assert(!state->relocate || state->readonly);
+	if (state->relocate && !state->heapkeyspace)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("index \"%s\" does not support relocating tuples",
+						RelationGetRelationName(rel)),
+				 errhint("Only indexes initialized on PostgreSQL 12 support relocation verification.")));
+
 	/* Create context for page */
 	state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
 												 "amcheck context",
@@ -921,6 +938,32 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
+		/*
+		 * Readonly callers may optionally relocate non-pivot tuples for
+		 * heapkeyspace indexes.  A new search starting from the root
+		 * relocates every current entry in turn.
+		 */
+		if (state->relocate && P_ISLEAF(topaque) &&
+			!bt_relocate_from_root(state, itup))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("could not relocate tuple in index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+										itid, htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_minusinfkey(state->rel, itup);
 
@@ -1525,6 +1568,9 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 		 * internal pages.  In more general terms, a negative infinity item is
 		 * only negative infinity with respect to the subtree that the page is
 		 * at the root of.
+		 *
+		 * See also: bt_relocate_from_root(), which can even detect transitive
+		 * inconsistencies on cousin leaf pages.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
@@ -1925,6 +1971,101 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Search for itup in index, starting from fast root page.  itup must be a
+ * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
+ * we rely on having fully unique keys to relocate itup without visiting more
+ * than one page on each level, barring an interrupted page split, where we
+ * may have to move right.  (A concurrent page split is impossible because
+ * the caller must be a readonly caller.)
+ *
+ * This routine can detect very subtle transitive consistency issues across
+ * more than one level of the tree.  Leaf pages all have a high key (even the
+ * rightmost page has a conceptual positive infinity high key), but not a low
+ * key.  Their downlink in parent is a lower bound, which along with the high
+ * key is almost enough to detect every possible inconsistency.  A downlink
+ * separator key value won't always be available from parent, though, because
+ * the first items of internal pages are negative infinity items, truncated
+ * down to zero attributes during internal page splits.  While it's true that
+ * bt_downlink_check() and the high key check can detect most imaginable key
+ * space problems, there are remaining problems it won't detect with non-pivot
+ * tuples in cousin leaf pages.  Starting a search from the root for every
+ * existing leaf tuple detects small inconsistencies in upper levels of the
+ * tree that cannot be detected any other way.  (Besides all this, it's
+ * probably a useful testing strategy to exhaustively verify that all
+ * non-pivot tuples can be relocated in the index using the same code paths as
+ * those used by index scans.)
+ *
+ * Alternative nbtree design that could be used to perform "cousin verification"
+ * inexpensively/transitively (may make current approach clearer):
+ *
+ * A cousin leaf page has a lower bound that comes from its grandparent page
+ * rather than its parent page, as already discussed (note also that a "second
+ * cousin" leaf page gets its lower bound from its great-grandparent, and so
+ * on).  Any pivot tuple in the root page after the first tuple (which is an
+ * "absolute" negative infinity tuple, since its leftmost on the level) should
+ * separate every leaf page into <= and > pages.  Even with the current
+ * design, there should be an unbroken seam of identical-to-root-pivot high
+ * key separator values at the right edge of the <= pages, all the way down to
+ * (and including) the leaf level.  Recall that page deletion won't delete the
+ * rightmost child of a leaf page unless that page is the only child, and it
+ * is deleting the parent page as well.
+ *
+ * If we didn't truncate the item at first/negative infinity offset to zero
+ * attributes during internal page splits then there would also be an unbroken
+ * seam of identical-to-root-pivot "low key" separator values on the left edge
+ * of the pages that are > the separator value; this alternative design would
+ * allow us to verify the same invariants directly, without ever having to
+ * cross more than one level of the tree (we'd still have to cross one level
+ * because leaf pages would still not store a low key directly, and we'd still
+ * need bitwise-equality cross checks of downlink separator in parent against
+ * the low keys in their non-leaf children).
+ */
+static bool
+bt_relocate_from_root(BtreeCheckState *state, IndexTuple itup)
+{
+	BTScanInsert key;
+	BTStack		stack;
+	Buffer		lbuf;
+	bool		exists;
+
+	/* No need to use bt_mkscankey_minusinfkey() here */
+	key = _bt_mkscankey(state->rel, itup);
+	Assert(key->heapkeyspace && key->scantid != NULL);
+
+	/*
+	 * Search from root.
+	 *
+	 * Ideally, we would arrange to only move right within _bt_search() when
+	 * an interrupted page split is detected (i.e. when the incomplete split
+	 * bit is found to be set), but for now we accept the possibility that
+	 * that could conceal certain remaining inconsistencies.
+	 */
+	Assert(state->readonly && state->relocate);
+	exists = false;
+	stack = _bt_search(state->rel, key, &lbuf, BT_READ, NULL);
+
+	if (BufferIsValid(lbuf))
+	{
+		OffsetNumber offnum;
+		Page		page;
+
+		/* Get matching tuple on leaf page */
+		offnum = _bt_binsrch(state->rel, key, lbuf);
+		/* Compare first >= matching item on leaf page, if any */
+		page = BufferGetPage(lbuf);
+		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			_bt_compare(state->rel, key, page, offnum) == 0)
+			exists = true;
+		_bt_relbuf(state->rel, lbuf);
+	}
+
+	_bt_freestack(stack);
+	pfree(key);
+
+	return exists;
+}
+
 /*
  * Is particular offset within page (whose special state is passed by caller)
  * the page negative-infinity item?
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 8bb60d5c2d..c638456638 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -112,7 +112,7 @@ ORDER BY c.relpages DESC LIMIT 10;
 
    <varlistentry>
     <term>
-     <function>bt_index_parent_check(index regclass, heapallindexed boolean) returns void</function>
+     <function>bt_index_parent_check(index regclass, heapallindexed boolean, relocate boolean) returns void</function>
      <indexterm>
       <primary>bt_index_parent_check</primary>
      </indexterm>
@@ -126,7 +126,10 @@ ORDER BY c.relpages DESC LIMIT 10;
       argument is <literal>true</literal>, the function verifies the
       presence of all heap tuples that should be found within the
       index, and that there are no missing downlinks in the index
-      structure.  The checks that can be performed by
+      structure.  When the optional <parameter>relocate</parameter>
+      argument is <literal>true</literal>, verification relocates
+      tuples on the leaf level by performing a new search from the
+      root page.  The checks that can be performed by
       <function>bt_index_parent_check</function> are a superset of the
       checks that can be performed by <function>bt_index_check</function>.
       <function>bt_index_parent_check</function> can be thought of as
-- 
2.17.1
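
As a quick usage sketch (the index name is simply reused from the regression
test), upgrading an existing amcheck installation and running the new
heavyweight verification looks like this:

ALTER EXTENSION amcheck UPDATE TO '1.2';

-- parent/child checks, plus heapallindexed, plus relocating every leaf
-- tuple with a fresh search that descends from the root
SELECT bt_index_parent_check('delete_test_table_pkey', true, true);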

Attachment: v13-0004-Add-split-after-new-tuple-optimization.patch (text/x-patch)
From a7bbaa4797a1e1cce4e3d75f3947e2c8893373ed Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 16:48:08 -0700
Subject: [PATCH v13 4/7] Add "split after new tuple" optimization.

Add additional heuristics to the algorithm for locating an optimal split
location.  New logic identifies localized monotonically increasing
values.  Splitting between two groups of localized monotonically
increasing values can greatly improve space utilization.

Without this patch, affected cases will reliably leave leaf pages no
more than about 50% full.  50/50 page splits are only appropriate with a
pattern of truly random insertions.  The optimization is very similar to
the long established fillfactor optimization used during rightmost page
splits, where we usually leave the new left side of the split 90% full.
Split-after-new-tuple page splits target essentially the same case.  The
splits targeted are those at the rightmost point of a localized grouping
of values, rather than those at the rightmost point of the entire key
space.  Localized monotonically increasing insertion patterns are
presumed to be fairly common in real-world applications.

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.
---
 src/backend/access/nbtree/nbtsplitloc.c | 234 +++++++++++++++++++++++-
 1 file changed, 231 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 6027b1f1a1..7f25b0013f 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -65,6 +65,11 @@ static void _bt_recordsplit(FindSplitData *state,
 static void _bt_deltasortsplits(FindSplitData *state, double fillfactormult,
 					bool usemult);
 static int	_bt_splitcmp(const void *arg1, const void *arg2);
+static bool _bt_splitafternewitemoff(Relation rel, Page page,
+						 int olddataitemstotal, int leaffillfactor,
+						 OffsetNumber newitemoff, IndexTuple newitem,
+						 bool *usemult);
+static bool _bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid);
 static OffsetNumber _bt_bestsplitloc(Relation rel, Page page,
 				 FindSplitData *state, int perfectpenalty,
 				 OffsetNumber newitemoff, IndexTuple newitem,
@@ -269,9 +274,10 @@ _bt_findsplitloc(Relation rel,
 	 * Start search for a split point among list of legal split points.  Give
 	 * primary consideration to equalizing available free space in each half
 	 * of the split initially (start with default mode), while applying
-	 * rightmost optimization where appropriate.  Either of the two other
-	 * fallback modes may be required for cases with a large number of
-	 * duplicates around the original/space-optimal split point.
+	 * rightmost and split-after-new-item optimizations where appropriate.
+	 * Either of the two other fallback modes may be required for cases with a
+	 * large number of duplicates around the original/space-optimal split
+	 * point.
 	 *
 	 * Default mode gives some weight to suffix truncation in deciding a split
 	 * point on leaf pages.  It attempts to select a split point where a
@@ -293,6 +299,44 @@ _bt_findsplitloc(Relation rel,
 		usemult = true;
 		fillfactormult = leaffillfactor / 100.0;
 	}
+	else if (_bt_splitafternewitemoff(rel, page, olddataitemstotal,
+									  leaffillfactor, newitemoff, newitem,
+									  &usemult))
+	{
+		/*
+		 * New item inserted at rightmost point among a localized grouping on
+		 * a leaf page -- apply "split after new item" optimization
+		 */
+		if (usemult)
+		{
+			/* fillfactormult should be set based on leaf fillfactor */
+			fillfactormult = leaffillfactor / 100.0;
+		}
+		else
+		{
+			/* find precise split point after newitemoff */
+			for (int i = 0; i < state.nsplits; i++)
+			{
+				SplitPoint *split = state.splits + i;
+
+				if (split->newitemonleft &&
+					newitemoff == split->firstoldonright)
+				{
+					pfree(state.splits);
+					*newitemonleft = true;
+					return newitemoff;
+				}
+			}
+
+			/*
+			 * Cannot legally split after newitemoff; proceed with split
+			 * without using fillfactor multiplier.  This is defensive, and
+			 * should never be needed in practice.
+			 */
+			usemult = false;
+			fillfactormult = 0.50;
+		}
+	}
 	else
 	{
 		/* Other leaf page.  50:50 page split. */
@@ -532,6 +576,190 @@ _bt_splitcmp(const void *arg1, const void *arg2)
 	return 0;
 }
 
+/*
+ * Subroutine to determine whether or not the page should be split immediately
+ * after the would-be original page offset for the new/incoming tuple.  This
+ * is appropriate when there is a pattern of localized monotonically
+ * increasing insertions into a composite index, grouped by one or more
+ * leading attribute values.  This is prevalent in many real world
+ * applications.  Consider the example of a composite index on '(invoice_id,
+ * item_no)', where the item_no for each invoice is an identifier assigned in
+ * ascending order (invoice_id could itself be assigned in monotonically
+ * increasing order, but that shouldn't matter).  Without this optimization,
+ * approximately 50% of space in leaf pages will be wasted by 50:50/!usemult
+ * page splits.  With this optimization, space utilization will be close to
+ * that of a similar index where all tuple insertions modify the current
+ * rightmost leaf page in the index (i.e. typically 90% for leaf pages).
+ *
+ * When the optimization is applied, the new/incoming tuple becomes the last
+ * tuple on the new left page.  (Actually, newitemoff > maxoff cases often use
+ * this optimization within indexes where monotonically increasing insertions
+ * of each grouping come in multiple "bursts" over time, such as a composite
+ * index on '(supplier_id, invoice_id, item_no)'.  Caller applies leaf
+ * fillfactor in the style of a rightmost leaf page split when newitemoff is
+ * at or very near the end of the original page.)
+ *
+ * This optimization may leave extra free space remaining on the rightmost
+ * page of a "most significant column" grouping of tuples if that grouping
+ * never ends up having future insertions that use the free space.  That
+ * effect is self-limiting; a future grouping that becomes the "nearest on the
+ * right" grouping of the affected grouping usually puts the extra free space
+ * to good use.  In general, it's important to avoid a pattern of pathological
+ * page splits that consistently do the wrong thing.
+ *
+ * Caller uses optimization when routine returns true, though the exact action
+ * taken by caller varies.  Caller uses original leaf page fillfactor in
+ * standard way rather than using the new item offset directly when *usemult
+ * was also set to true here.  Otherwise, caller applies optimization by
+ * locating the legal split point that makes the new tuple the very last tuple
+ * on the left side of the split.
+ */
+static bool
+_bt_splitafternewitemoff(Relation rel, Page page, int olddataitemstotal,
+						 int leaffillfactor, OffsetNumber newitemoff,
+						 IndexTuple newitem, bool *usemult)
+{
+	OffsetNumber maxoff;
+	int16		nkeyatts;
+	ItemId		itemid;
+	int			interpdataitemstotal;
+	IndexTuple	tup;
+	int			keepnatts;
+
+	Assert(!P_RIGHTMOST((BTPageOpaque) PageGetSpecialPointer(page)));
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	/* Assume leaffillfactor will be used by caller for now */
+	*usemult = true;
+
+	/* Single key indexes not considered here */
+	if (nkeyatts == 1)
+		return false;
+
+	/* Ascending insertion pattern never inferred when new item is first */
+	if (newitemoff == P_FIRSTKEY)
+		return false;
+
+	/*
+	 * Avoid applying optimization when tuples are wider than a tuple
+	 * consisting of two non-NULL int8/int64 attributes (or four non-NULL
+	 * int4/int32 attributes)
+	 */
+	if (IndexTupleSize(newitem) >
+		MAXALIGN(sizeof(IndexTupleData) + sizeof(int64) * 2))
+		return false;
+
+	/*
+	 * Only apply optimization on pages with equisized tuples.  Surmise that
+	 * page has equisized tuples when page layout is consistent with having
+	 * maxoff-1 non-pivot tuples that are all the same size as the newly
+	 * inserted tuple (note that the possibly-truncated high key isn't counted
+	 * in olddataitemstotal).
+	 */
+	interpdataitemstotal =
+		(MAXALIGN(IndexTupleSize(newitem)) + sizeof(ItemIdData)) * (maxoff - 1);
+	if (interpdataitemstotal != olddataitemstotal)
+		return false;
+
+	/*
+	 * At least the first attribute's value must be equal to the corresponding
+	 * value in previous tuple to apply optimization.  New item cannot be a
+	 * duplicate, either.
+	 *
+	 * Handle case where new item is to the right of all items on the existing
+	 * page.  This is suggestive of monotonically increasing insertions in
+	 * itself, so the "heap TID adjacency" test is not applied here.
+	 * Concurrent insertions from backends associated with the same grouping
+	 * or sub-grouping should still have the optimization applied; if the
+	 * grouping is rather large, splits will consistently end up here.
+	 */
+	if (newitemoff > maxoff)
+	{
+		itemid = PageGetItemId(page, maxoff);
+		tup = (IndexTuple) PageGetItem(page, itemid);
+		keepnatts = _bt_keep_natts_fast(rel, tup, newitem);
+
+		if (keepnatts > 1 && keepnatts <= nkeyatts)
+			return true;
+
+		return false;
+	}
+
+	/*
+	 * When item isn't last (or first) on page, but is deemed suitable for the
+	 * optimization, caller splits at the point immediately after the would-be
+	 * position of the new item, and immediately before the item after the new
+	 * item.
+	 *
+	 * "Low cardinality leading column, high cardinality suffix column"
+	 * indexes with a random insertion pattern (e.g. an index with a boolean
+	 * column, such as an index on '(book_is_in_print, book_isbn)') present us
+	 * with a risk of consistently misapplying the optimization.  We're
+	 * willing to accept very occasional misapplication of the optimization,
+	 * provided the cases where we get it wrong are rare and self-limiting.
+	 * Heap TID adjacency strongly suggests that the item just to the left was
+	 * inserted very recently, which prevents most misfirings.  Besides, all
+	 * inappropriate cases triggered at this point will still split in the
+	 * middle of the page on average.
+	 */
+	itemid = PageGetItemId(page, OffsetNumberPrev(newitemoff));
+	tup = (IndexTuple) PageGetItem(page, itemid);
+	/* Do cheaper test first */
+	if (!_bt_adjacenthtid(&tup->t_tid, &newitem->t_tid))
+		return false;
+	/* Check same conditions as rightmost item case, too */
+	keepnatts = _bt_keep_natts_fast(rel, tup, newitem);
+
+	/*
+	 * Don't allow caller to split after a new item when it will result in a
+	 * split point to the right of the point that a leaf fillfactor split
+	 * would use -- have caller apply leaf fillfactor instead.  There is no
+	 * advantage to being very aggressive in any case.  It may not be legal to
+	 * split very close to maxoff.
+	 */
+	if (keepnatts > 1 && keepnatts <= nkeyatts)
+	{
+		double		interp = (double) newitemoff / ((double) maxoff + 1);
+		double		leaffillfactormult = (double) leaffillfactor / 100.0;
+
+		if (interp <= leaffillfactormult)
+			*usemult = false;
+
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Subroutine for determining if two heap TIDs are "adjacent".
+ *
+ * Adjacent means that the high TID is very likely to have been inserted into
+ * heap relation immediately after the low TID, probably by the same
+ * transaction.
+ */
+static bool
+_bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid)
+{
+	BlockNumber lowblk,
+				highblk;
+
+	lowblk = ItemPointerGetBlockNumber(lowhtid);
+	highblk = ItemPointerGetBlockNumber(highhtid);
+
+	/* Make optimistic assumption of adjacency when heap blocks match */
+	if (lowblk == highblk)
+		return true;
+
+	/* When heap block one up, second offset should be FirstOffsetNumber */
+	if (lowblk + 1 == highblk &&
+		ItemPointerGetOffsetNumber(highhtid) == FirstOffsetNumber)
+		return true;
+
+	return false;
+}
+
 /*
  * Subroutine to find the "best" split point among an array of acceptable
  * candidate split points that split without there being an excessively high
-- 
2.17.1
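
To make the target insertion pattern concrete, here is a minimal sketch
(hypothetical table and index names) of the '(invoice_id, item_no)' case
described in the comments, where each new tuple lands at the rightmost point
of its invoice_id grouping rather than at the rightmost point of the whole
key space:

CREATE TABLE invoice_items (invoice_id int8, item_no int8);
CREATE INDEX invoice_items_idx ON invoice_items (invoice_id, item_no);

-- item_no ascends within each invoice_id grouping, but the groupings are
-- filled in an interleaved fashion, so most insertions are not at the
-- rightmost point of the index as a whole
INSERT INTO invoice_items
SELECT invoice_id, item_no
FROM generate_series(1, 1000) AS invoice_id,
     generate_series(1, 100) AS item_no
ORDER BY item_no, invoice_id;

Without the optimization, leaf pages in an index like this end up only about
half full; with it, they should come out close to the leaf fillfactor, much
as with rightmost splits.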

Attachment: v13-0005-Add-high-key-continuescan-optimization.patch (text/x-patch)
From a5c8bf68f322630c21777fece534ddb3c0e666f1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 12 Nov 2018 13:11:21 -0800
Subject: [PATCH v13 5/7] Add high key "continuescan" optimization.

Teach B-Tree forward index scans to check the high key before moving to
the next page, in the hopes of finding that it isn't actually necessary
to move right at all.  We already opportunistically force a key
check of the last item on leaf pages, even when it's clear that it
cannot be returned to the scan due to being dead-to-all, for the same
reason.  Since forcing the last item to be key checked no longer makes
any difference in the case of forward scans, the existing extra key
check is now only used for backwards scans.  Like the existing check,
the new check won't always work out, but that seems like an acceptable
price to pay.

The new approach is more effective than just checking non-pivot tuples,
especially with composite indexes and non-unique indexes.  The high key
represents an upper bound on all values that can appear on the page,
which is often greater than whatever tuple happens to appear last at the
time of the check.  Also, suffix truncation's new logic for picking a
split point will often result in high keys that are relatively
dissimilar to the other (non-pivot) tuples on the page, and therefore
more likely to indicate that the scan need not proceed to the next page.

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.
---
 src/backend/access/nbtree/nbtsearch.c | 23 +++++++--
 src/backend/access/nbtree/nbtutils.c  | 70 +++++++++++++++++++--------
 2 files changed, 68 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 54a4c64304..5e0a33383a 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1364,6 +1364,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
 	}
 
+	continuescan = true;		/* default assumption */
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1412,16 +1413,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 				_bt_saveitem(so, itemIndex, offnum, itup);
 				itemIndex++;
 			}
+			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
-			{
-				/* there can't be any more matches, so stop */
-				so->currPos.moreRight = false;
 				break;
-			}
 
 			offnum = OffsetNumberNext(offnum);
 		}
 
+		/*
+		 * Forward scans need not visit page to the right when high key
+		 * indicates no more matches will be found there.
+		 *
+		 * Checking the high key like this works out more often than you might
+		 * think.  Leaf page splits pick a split point between the two most
+		 * dissimilar tuples (this is weighed against the need to evenly share
+		 * free space).  Leaf pages with high key attribute values that can
+		 * only appear on non-pivot tuples on the right sibling page are
+		 * common.
+		 */
+		if (continuescan && !P_RIGHTMOST(opaque))
+			_bt_checkkeys(scan, page, P_HIKEY, dir, &continuescan);
+
+		if (!continuescan)
+			so->currPos.moreRight = false;
+
 		Assert(itemIndex <= MaxIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 146de1b2e4..7c795c6bb6 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -48,7 +48,7 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
 static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
-					 IndexTuple tuple, TupleDesc tupdesc,
+					 IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
 static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
 			   IndexTuple firstright, BTScanInsert itup_key);
@@ -1362,11 +1362,14 @@ _bt_mark_scankey_required(ScanKey skey)
  *
  * scan: index scan descriptor (containing a search-type scankey)
  * page: buffer page containing index tuple
- * offnum: offset number of index tuple (must be a valid item!)
+ * offnum: offset number of index tuple (must be hikey or a valid item!)
  * dir: direction we are scanning in
  * continuescan: output parameter (will be set correctly in all cases)
  *
- * Caller must hold pin and lock on the index page.
+ * Caller must hold pin and lock on the index page.  Caller can pass a high
+ * key offnum in the hopes of discovering that the scan need not continue on
+ * to a page to the right.  We don't currently bother limiting high key
+ * comparisons to SK_BT_REQFWD scan keys.
  */
 IndexTuple
 _bt_checkkeys(IndexScanDesc scan,
@@ -1376,6 +1379,7 @@ _bt_checkkeys(IndexScanDesc scan,
 	ItemId		iid = PageGetItemId(page, offnum);
 	bool		tuple_alive;
 	IndexTuple	tuple;
+	int			tupnatts;
 	TupleDesc	tupdesc;
 	BTScanOpaque so;
 	int			keysz;
@@ -1389,24 +1393,21 @@ _bt_checkkeys(IndexScanDesc scan,
 	 * killed tuple as not passing the qual.  Most of the time, it's a win to
 	 * not bother examining the tuple's index keys, but just return
 	 * immediately with continuescan = true to proceed to the next tuple.
-	 * However, if this is the last tuple on the page, we should check the
-	 * index keys to prevent uselessly advancing to the next page.
+	 * However, if this is the first tuple on the page, and we're doing a
+	 * backward scan, we should check the index keys to prevent uselessly
+	 * advancing to the page to the left.  This is similar to the high key
+	 * optimization used by forward scan callers.
 	 */
 	if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
 	{
-		/* return immediately if there are more tuples on the page */
-		if (ScanDirectionIsForward(dir))
-		{
-			if (offnum < PageGetMaxOffsetNumber(page))
-				return NULL;
-		}
-		else
-		{
-			BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-			if (offnum > P_FIRSTDATAKEY(opaque))
-				return NULL;
-		}
+		/* forward scan callers check high key instead */
+		Assert(offnum >= P_FIRSTDATAKEY(opaque));
+		if (ScanDirectionIsForward(dir))
+			return NULL;
+		else if (offnum > P_FIRSTDATAKEY(opaque))
+			return NULL;
 
 		/*
 		 * OK, we want to check the keys so we can set continuescan correctly,
@@ -1418,6 +1419,7 @@ _bt_checkkeys(IndexScanDesc scan,
 		tuple_alive = true;
 
 	tuple = (IndexTuple) PageGetItem(page, iid);
+	tupnatts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
 
 	tupdesc = RelationGetDescr(scan->indexRelation);
 	so = (BTScanOpaque) scan->opaque;
@@ -1429,11 +1431,24 @@ _bt_checkkeys(IndexScanDesc scan,
 		bool		isNull;
 		Datum		test;
 
-		Assert(key->sk_attno <= BTreeTupleGetNAtts(tuple, scan->indexRelation));
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (key->sk_attno > tupnatts)
+		{
+			Assert(offnum == P_HIKEY);
+			Assert(ScanDirectionIsForward(dir));
+			continue;
+		}
+
 		/* row-comparison keys need special processing */
 		if (key->sk_flags & SK_ROW_HEADER)
 		{
-			if (_bt_check_rowcompare(key, tuple, tupdesc, dir, continuescan))
+			if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+									 continuescan))
 				continue;
 			return NULL;
 		}
@@ -1564,8 +1579,8 @@ _bt_checkkeys(IndexScanDesc scan,
  * This is a subroutine for _bt_checkkeys, which see for more info.
  */
 static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
-					 ScanDirection dir, bool *continuescan)
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
 {
 	ScanKey		subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
 	int32		cmpresult = 0;
@@ -1582,6 +1597,19 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
 
 		Assert(subkey->sk_flags & SK_ROW_MEMBER);
 
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (subkey->sk_attno > tupnatts)
+		{
+			Assert(ScanDirectionIsForward(dir));
+			cmpresult = 0;
+			continue;
+		}
+
 		datum = index_getattr(tuple,
 							  subkey->sk_attno,
 							  tupdesc,
-- 
2.17.1
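
As a purely illustrative example of the kind of scan that benefits, reusing
the hypothetical invoice index from the previous patch's sketch: a forward
scan whose required keys stop being satisfied at a leaf page boundary, such
as

SELECT * FROM invoice_items
WHERE invoice_id = 42
ORDER BY invoice_id, item_no;

If the last page read for invoice_id = 42 has a (suffix truncated) high key
whose invoice_id value is already greater than 42, the _bt_checkkeys() call
against the high key sets continuescan to false, and the scan never visits
the right sibling at all.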

Attachment: v13-0007-DEBUG-Add-pageinspect-instrumentation.patch (text/x-patch)
From 7d607c792cea59e269e8ff5c94c15b0bb23dc2dc Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v13 7/7] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 68 +++++++++++++++----
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 ++++++
 3 files changed, 79 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..95c81c0808 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -254,9 +256,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[7];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -265,6 +267,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer htid;
+	BTPageOpaque opaque;
 
 	id = PageGetItemId(page, offset);
 
@@ -283,16 +287,53 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = natts;
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		indnkeyatts = rel->rd_index->indnkeyatts;
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (P_ISLEAF(opaque) && offset >= P_FIRSTDATAKEY(opaque))
+		htid = &itup->t_tid;
+	else if (_bt_heapkeyspace(rel))
+		htid = BTreeTupleGetHeapTID(itup);
+	else
+		htid = NULL;
+
+	if (htid)
+		values[j] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(htid),
+							 ItemPointerGetOffsetNumberNoCheck(htid));
+	else
+		values[j] = NULL;
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -366,11 +407,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -397,12 +438,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -482,7 +524,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1
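
For anyone trying out the hacked pageinspect, the decoded output can be
inspected with something like the following (the index comes from the
pageinspect regression test; block 1 is its only leaf page):

SELECT itemoffset, ctid, data, htid
FROM bt_page_items('test1_a_idx', 1);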

#59 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#58)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 26/02/2019 12:31, Peter Geoghegan wrote:

On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I spent some time first trying to understand the current algorithm, and
then rewriting it in a way that I find easier to understand. I came up
with the attached. I think it optimizes for the same goals as your
patch, but the approach is quite different.

Attached is v13 of the patch series, which significantly refactors
nbtsplitloc.c to implement the algorithm using the approach from your
prototype posted on January 28 -- I now take a "top down" approach
that materializes all legal split points up-front, as opposed to the
initial "bottom up" approach that used recursion, and weighed
everything (balance of free space, suffix truncation, etc) all at
once.

Great, looks much simpler now, indeed! Now I finally understand the
algorithm.

I'm using qsort() to sort the candidate split points array. I'm not
trying to do something clever to avoid the up-front effort of sorting
everything, even though we could probably get away with that much of
the time (e.g. by doing a top-N sort in default mode). Testing has
shown that using an inlined qsort() routine in the style of
tuplesort.c would make my serial test cases/microbenchmarks faster,
without adding much complexity. We're already competitive with the
master branch even without this microoptimization, so I've put that
off for now. What do you think of the idea of specializing an
inlineable qsort() for nbtsplitloc.c?

If the performance is acceptable without it, let's not bother. We can
optimize later.

What would be the worst case scenario for this? Splitting a page that
has as many tuples as possible, I guess, so maybe inserting into a table
with a single-column index, with 32k BLCKSZ. Have you done performance
testing on something like that?
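
Something like this untested sketch, perhaps (the 32k BLCKSZ case needs a
server built with --with-blocksize=32):

CREATE TABLE split_worst_case (a int4);
CREATE INDEX split_worst_case_idx ON split_worst_case (a);
-- random retail insertions: each leaf page fills with the maximum number
-- of small, equisized tuples before it has to split
INSERT INTO split_worst_case
SELECT (random() * 2000000000)::int4
FROM generate_series(1, 10000000);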

Unlike in your prototype, v13 makes the array for holding candidate
split points into a single big allocation that is always exactly
BLCKSZ. The idea is that palloc() can thereby recycle the big
_bt_findsplitloc() allocation within _bt_split(). palloc() considers
8KiB to be the upper limit on the size of individual blocks it
manages, and we'll always go on to palloc(BLCKSZ) through the
_bt_split() call to PageGetTempPage(). In a sense, we're not even
allocating memory that we weren't allocating already. (Not sure that
this really matters, but it is easy to do it that way.)

Rounding up the allocation to BLCKSZ seems like a premature
optimization. Even if it saved some cycles, I don't think it's worth the
trouble of having to explain all that in the comment.

I think you could change the curdelta, leftfree, and rightfree fields in
SplitPoint to int16, to make the array smaller.

Other changes from your prototype:

* I found a more efficient representation than a pair of raw
IndexTuple pointers for each candidate split. Actually, I use the same
old representation (firstoldonright + newitemonleft) in each split,
and provide routines to work backwards from that to get the lastleft
and firstright tuples. This approach is far more space efficient, and
space efficiency matters when you've allocating space for hundreds of
items in a critical path like this.

Ok.

* You seemed to refactor _bt_checksplitloc() in passing, making it not
do the newitemisfirstonright thing. I changed that back. Did I miss
something that you intended here?

My patch treated the new item the same as all the old items, in
_bt_checksplitloc(), so it didn't need newitemisonright. You still need
it with your approach.

Changes to my own code since v12:

* Simplified "Add "split after new tuple" optimization" commit, and
made it more consistent with associated code. This is something that
was made a lot easier by the new approach. It would be great to hear
what you think of this.

I looked at it very briefly. Yeah, it's pretty simple now. Nice!

About this comment on _bt_findsplit_loc():

/*
* _bt_findsplitloc() -- find an appropriate place to split a page.
*
* The main goal here is to equalize the free space that will be on each
* split page, *after accounting for the inserted tuple*. (If we fail to
* account for it, we might find ourselves with too little room on the page
* that it needs to go into!)
*
* If the page is the rightmost page on its level, we instead try to arrange
* to leave the left split page fillfactor% full. In this way, when we are
* inserting successively increasing keys (consider sequences, timestamps,
* etc) we will end up with a tree whose pages are about fillfactor% full,
* instead of the 50% full result that we'd get without this special case.
* This is the same as nbtsort.c produces for a newly-created tree. Note
* that leaf and nonleaf pages use different fillfactors.
*
* We are passed the intended insert position of the new tuple, expressed as
* the offsetnumber of the tuple it must go in front of (this could be
* maxoff+1 if the tuple is to go at the end). The new tuple itself is also
* passed, since it's needed to give some weight to how effective suffix
* truncation will be. The implementation picks the split point that
* maximizes the effectiveness of suffix truncation from a small list of
* alternative candidate split points that leave each side of the split with
* about the same share of free space. Suffix truncation is secondary to
* equalizing free space, except in cases with large numbers of duplicates.
* Note that it is always assumed that caller goes on to perform truncation,
* even with pg_upgrade'd indexes where that isn't actually the case
* (!heapkeyspace indexes). See nbtree/README for more information about
* suffix truncation.
*
* We return the index of the first existing tuple that should go on the
* righthand page, plus a boolean indicating whether the new tuple goes on
* the left or right page. The bool is necessary to disambiguate the case
* where firstright == newitemoff.
*
* The high key for the left page is formed using the first item on the
* right page, which may seem to be contrary to Lehman & Yao's approach of
* using the left page's last item as its new high key on the leaf level.
* It isn't, though: suffix truncation will leave the left page's high key
* fully equal to the last item on the left page when two tuples with equal
* key values (excluding heap TID) enclose the split point. It isn't
* necessary for a new leaf high key to be equal to the last item on the
* left for the L&Y "subtree" invariant to hold. It's sufficient to make
* sure that the new leaf high key is strictly less than the first item on
* the right leaf page, and greater than the last item on the left page.
* When suffix truncation isn't possible, L&Y's exact approach to leaf
* splits is taken (actually, a tuple with all the keys from firstright but
* the heap TID from lastleft is formed, so as to not introduce a special
* case).
*
* Starting with the first right item minimizes the divergence between leaf
* and internal splits when checking if a candidate split point is legal.
* It is also inherently necessary for suffix truncation, since truncation
* is a subtractive process that specifically requires lastleft and
* firstright inputs.
*/

This is pretty good, but I think some copy-editing can make this even
better. If you look at the old comment, it had this structure:

1. Explain what the function does.
2. Explain the arguments
3. Explain the return value.

The additions to this comment broke the structure. The explanations of
argument and return value are now in the middle, in 3rd and 4th
paragraphs. And the 3rd paragraph that explains the arguments, now also
goes into detail on what the function does with it. I'd suggest moving
things around to restore the old structure, that was more clear.

The explanation of how the high key for the left page is formed (5th
paragraph), seems out-of-place here, because the high key is not formed
here.

Somewhere in the 1st or 2nd paragraph, perhaps we should mention that
the function effectively uses a different fillfactor in some other
scenarios too, not only when it's the rightmost page.

In the function itself:

* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
* you imagine that the new item is already on the original page (the
* final number of splits may be slightly lower because not all splits
* will be legal). Even still, add space for an extra two splits out of
* sheer paranoia.
*/
state.maxsplits = maxoff + 2;
state.splits = palloc(Max(BLCKSZ, sizeof(SplitPoint) * state.maxsplits));
state.nsplits = 0;

I wouldn't be that paranoid. The code that populates the array is pretty
straightforward.

/*
* Scan through the data items and calculate space usage for a split at
* each possible position. We start at the first data offset rather than
* the second data offset to handle the "newitemoff == first data offset"
* case (otherwise, a split whose firstoldonright is the first data offset
* can't be legal, and won't actually end up being recorded by
* _bt_recordsplit).
*
* Still, it's typical for almost all calls to _bt_recordsplit to
* determine that the split is legal, and therefore enter it into the
* candidate split point array for later consideration.
*/

Suggestion: Remove the "Still" word. The observation that typically all
split points are legal is valid, but it seems unrelated to the first
paragraph. (Do we need to mention it at all?)

/*
* If the new item goes as the last item, record the split point that
* leaves all the old items on the left page, and the new item on the
* right page. This is required because a split that leaves the new item
* as the firstoldonright won't have been reached within the loop. We
* always record every possible split point.
*/

Suggestion: Remove the last sentence. An earlier comment already said
that we calculate space usage for a split at each possible position,
that seems sufficient. Like it was before this patch.

/*
* Find lowest possible penalty among split points currently regarded as
* acceptable -- the "perfect" penalty. The perfect penalty often saves
* _bt_bestsplitloc() additional work around calculating penalties. This
* is also a convenient point to determine if default mode worked out, or
* if we should instead reassess which split points should be considered
* acceptable (split interval, and possibly fillfactormult).
*/
perfectpenalty = _bt_perfect_penalty(rel, page, &state, newitemoff,
newitem, &secondmode);

ISTM that figuring out which "mode" we want to operate in is actually
the *primary* purpose of _bt_perfect_penalty(). We only really use the
penalty as an optimization that we pass on to _bt_bestsplitloc(). So I'd
suggest changing the function name to something like _bt_choose_mode(),
and have secondmode be the primary return value from it, with
perfectpenalty being the secondary result through a pointer.

It doesn't really choose the mode, either, though. At least after the
next "Add split after new tuple optimization" patch. The caller has a
big part in choosing what to do. So maybe split _bt_perfect_penalty into
two functions: _bt_perfect_penalty(), which just computes the perfect
penalty, and _bt_analyze_split_interval(), which determines how many
duplicates there are in the top-N split points.

BTW, I like the word "strategy", like you called it in the comment on
SplitPoint struct, better than "mode".

+		if (usemult)
+			delta = fillfactormult * split->leftfree -
+				(1.0 - fillfactormult) * split->rightfree;
+		else
+			delta = split->leftfree - split->rightfree;

How about removing the "usemult" variable, and just check if
fillfactormult == 0.5?

/*
* There are a much smaller number of candidate split points when
* splitting an internal page, so we can afford to be exhaustive. Only
* give up when pivot that will be inserted into parent is as small as
* possible.
*/
if (!state->is_leaf)
return MAXALIGN(sizeof(IndexTupleData) + 1);

Why are there fewer candidate split points on an internal page?

- Heikki

#60 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#58)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Some comments on
v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below.
Mostly about code comments. In general, I think a round of copy-editing
the comments, to use simpler language, would do good. The actual code
changes look good to me.

/*
* _bt_findinsertloc() -- Finds an insert location for a tuple
*
* On entry, *bufptr contains the page that the new tuple unambiguously
* belongs on. This may not be quite right for callers that just called
* _bt_check_unique(), though, since they won't have initially searched
* using a scantid. They'll have to insert into a page somewhere to the
* right in rare cases where there are many physical duplicates in a
* unique index, and their scantid directs us to some page full of
* duplicates to the right, where the new tuple must go. (Actually,
* since !heapkeyspace pg_upgraded'd non-unique indexes never get a
* scantid, they too may require that we move right. We treat them
* somewhat like unique indexes.)

Seems confusing to first say assertively that "*bufptr contains the page
that the new tuple unambiguously belongs to", and then immediately go on
to list a whole bunch of exceptions. Maybe just remove "unambiguously".

@@ -759,7 +787,10 @@ _bt_findinsertloc(Relation rel,
* If this page was incompletely split, finish the split now. We
* do this while holding a lock on the left sibling, which is not
* good because finishing the split could be a fairly lengthy
-			 * operation.  But this should happen very seldom.
+			 * operation.  But this should only happen when inserting into a
+			 * unique index that has more than an entire page for duplicates
+			 * of the value being inserted.  (!heapkeyspace non-unique indexes
+			 * are an exception, once again.)
*/
if (P_INCOMPLETE_SPLIT(lpageop))
{

This happens very seldom, because you only get an incomplete split if
you crash in the middle of a page split, and that should be very rare. I
don't think we need to list more fine-grained conditions here; that just
confuses the reader.

/*
* _bt_useduplicatepage() -- Settle for this page of duplicates?
*
* Prior to PostgreSQL 12/Btree version 4, heap TID was never treated
* as a part of the keyspace. If there were many tuples of the same
* value spanning more than one leaf page, a new tuple of that same
* value could legally be placed on any one of the pages.
*
* This function handles the question of whether or not an insertion
* of a duplicate into a pg_upgrade'd !heapkeyspace index should
* insert on the page contained in buf when a choice must be made.
* Preemptive microvacuuming is performed here when that could allow
* caller to insert on to the page in buf.
*
* Returns true if caller should proceed with insert on buf's page.
* Otherwise, caller should move on to the page to the right (caller
* must always be able to still move right following call here).
*/

So, this function is only used for legacy pg_upgraded indexes. The
comment implies that, but doesn't actually say it.

/*
* Get tie-breaker heap TID attribute, if any. Macro works with both pivot
* and non-pivot tuples, despite differences in how heap TID is represented.
*
* Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
* points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
* (since all pivot tuples must as of BTREE_VERSION 4). When non-pivot
* tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
* probably also contain a heap TID at the end of the tuple. We currently
* assume that a tuple with INDEX_ALT_TID_MASK set is a pivot tuple within
* heapkeyspace indexes (and that a tuple without it set must be a non-pivot
* tuple), but it might also be used by non-pivot tuples in the future.
* pg_upgrade'd !heapkeyspace indexes only set INDEX_ALT_TID_MASK in pivot
* tuples that actually originated with the truncation of one or more
* attributes.
*/
#define BTreeTupleGetHeapTID(itup) ...

The comment claims that "all pivot tuples must be as of BTREE_VERSION
4". I thought that all internal tuples are called pivot tuples, even on
version 3. I think what this means to say is that this macro is only
used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only
have a heap TID in BTREE_VERSION 4 indexes.

This macro (and many others in nbtree.h) is quite complicated. A static
inline function might be easier to read.
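
For illustration, here's roughly the shape I have in mind -- just an
untested sketch, not the patch's actual definition. The flag names follow
the patch (INDEX_ALT_TID_MASK, BT_HEAP_TID_ATTR); the assumption that
BT_HEAP_TID_ATTR lives in t_tid's offset field, and the helper name, are
mine:

#include "postgres.h"
#include "access/nbtree.h"

/* Sketch of an inline-function replacement for the macro */
static inline ItemPointer
BTreeTupleGetHeapTIDSketch(IndexTuple itup)
{
    if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
    {
        /* Non-pivot tuple: t_tid points straight into the heap */
        return &itup->t_tid;
    }

    /* Pivot tuple: heap TID, when present, is stored at the end */
    if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_HEAP_TID_ATTR) != 0)
        return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
                              sizeof(ItemPointerData));

    /* The heap TID attribute was truncated away */
    return NULL;
}

Something along those lines would also let the compiler check the argument
type, and the branches are easier to follow than a long macro expression.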

@@ -1114,6 +1151,8 @@ _bt_insertonpg(Relation rel,

if (BufferIsValid(metabuf))
{
+				Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+				xlmeta.version = metad->btm_root;
xlmeta.root = metad->btm_root;
xlmeta.level = metad->btm_level;
xlmeta.fastroot = metad->btm_fastroot;

'xlmeta.version' is set incorrectly.

/*
* Btree version 4 (used by indexes initialized by PostgreSQL v12) made
* general changes to the on-disk representation to add support for
* heapkeyspace semantics, necessitating a REINDEX to get heapkeyspace
* semantics in pg_upgrade scenarios. We continue to offer support for
* BTREE_MIN_VERSION in order to support upgrades from PostgreSQL versions
* up to and including v10 to v12+ without requiring a REINDEX.
* Similarly, we continue to offer support for BTREE_NOVAC_VERSION to
* support upgrades from v11 to v12+ without requiring a REINDEX.
*
* We maintain PostgreSQL v11's ability to upgrade from BTREE_MIN_VERSION
* to BTREE_NOVAC_VERSION automatically. v11's "no vacuuming" enhancement
* (the ability to skip full index scans during vacuuming) only requires
* two new metapage fields, which makes it possible to upgrade at any
* point that the metapage must be updated anyway (e.g. during a root page
* split). Note also that there happened to be no changes in metapage
* layout for btree version 4. All current metapage fields should have
* valid values set when a metapage WAL record is replayed.
*
* It's convenient to consider the "no vacuuming" enhancement (metapage
* layout compatibility) separately from heapkeyspace semantics, since
* each issue affects different areas. This is just a convention; in
* practice a heapkeyspace index is always also a "no vacuuming" index.
*/
#define BTREE_METAPAGE 0 /* first page is meta */
#define BTREE_MAGIC 0x053162 /* magic number of btree pages */
#define BTREE_VERSION 4 /* current version number */
#define BTREE_MIN_VERSION 2 /* minimal supported version number */
#define BTREE_NOVAC_VERSION 3 /* minimal version with all meta fields */

I find this comment difficult to read. I suggest rewriting it to:

/*
* The current Btree version is 4. That's what you'll get when you create
* a new index.
*
* Btree version 3 was used in PostgreSQL v11. It is mostly the same as
* version 4, but heap TIDs were not part of the keyspace. Index tuples
* with duplicate keys could be stored in any order. We continue to
* support reading and writing Btree version 3, so that they don't need
* to be immediately re-indexed at pg_upgrade. In order to get the new
* heapkeyspace semantics, however, a REINDEX is needed.
*
* Btree version 2 is the same as version 3, except for two new fields
* in the metapage that were introduced in version 3. A version 2 metapage
* will be automatically upgraded to version 3 on the first insert to it.
*/

Now that the index tuple format is becoming more complicated, I feel that
there should be some kind of overview explaining the format. All the
information is there, in the comments in nbtree.h, but you have to piece
together all the details to get the overall picture. I wrote this to
keep my head straight:

B-tree tuple format
===================

Leaf tuples
-----------

t_tid | t_info | key values | INCLUDE columns if any

t_tid contains the heap TID, pointing to the heap tuple.

Pivot tuples
------------

All tuples on internal pages are pivot tuples. Also, the high keys on
leaf pages.

t_tid | t_info | key values | [heap TID]

The INDEX_ALT_TID_MASK bit in t_info is set.

The block number in 't_tid' points to the lower B-tree page.

The lower bits in 't_tid.ip_posid' store the number of key attributes
present (it can be less than the number of keys in the index, if some
keys have been suffix-truncated). If the BT_HEAP_TID_ATTR flag is set,
there's an extra
heap TID field after the key datums.

(In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set. In
that case, the number of keys is implicitly the same as the number of keys
in the index, and there is no heap TID.)

I think adding something like this in nbtree.h would be good.
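
To make the pivot-tuple layout above concrete, a tiny decoding sketch
(untested; BT_N_KEYS_MASK is a placeholder name and value of mine, not
necessarily what the patch calls it):

#include "postgres.h"
#include "access/nbtree.h"
#include "utils/rel.h"

#define BT_N_KEYS_MASK  0x0FFF      /* hypothetical mask for ip_posid's low bits */

static inline int
BTreeTupleGetNKeyAttsSketch(IndexTuple itup, Relation rel)
{
    /* Without INDEX_ALT_TID_MASK, the tuple implicitly has all key columns */
    if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
        return IndexRelationGetNumberOfKeyAttributes(rel);

    /* Pivot tuple: the (possibly truncated) key count lives in ip_posid */
    return ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_N_KEYS_MASK;
}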

- Heikki

In reply to: Heikki Linnakangas (#59)
7 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sun, Mar 3, 2019 at 5:41 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Great, looks much simpler now, indeed! Now I finally understand the
algorithm.

Glad to hear it. Thanks for the additional review!

Attached is v14, which has changes based on your feedback. This
includes changes based on your more recent feedback on
v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch, though
I'll respond to those points directly in a later email.

v14 also changes the logic that decides whether an alternative strategy
should be used: it now considers the leftmost and rightmost splits for the
entire page, rather than accessing the page directly. We always handle the
newitem-at-end edge case correctly now, since the new "top down"
approach has all legal splits close at hand. This is more elegant,
more obviously correct, and even more effective, at least in some
cases -- it's another example of why "top down" is the superior
approach for nbtsplitloc.c. This made my "UK land registry data" index
have about 2.5% fewer leaf pages than with v13, which is small but
significant.

Separately, I made most of the new nbtsplitloc.c functions use a
FindSplitData argument in v14, which simplifies their signatures quite
a bit.

What would be the worst case scenario for this? Splitting a page that
has as many tuples as possible, I guess, so maybe inserting into a table
with a single-column index, with 32k BLCKSZ. Have you done performance
testing on something like that?

I'll test that (added to my project TODO list), though it's not
obvious that that's the worst case. Page splits will be less frequent,
and have better choices about where to split.

Rounding up the allocation to BLCKSZ seems like a premature
optimization. Even if it saved some cycles, I don't think it's worth the
trouble of having to explain all that in the comment.

Removed that optimization.

I think you could change the curdelta, leftfree, and rightfree fields in
SplitPoint to int16, to make the array smaller.

Added this alternative optimization to replace the BLCKSZ allocation
thing. I even found a way to get the array elements down to 8 bytes,
but that made the code noticeably slower with "many duplicates"
splits, so I didn't end up doing that (I used bitfields, plus the same
pragmas that we use to make sure that item pointers are packed).
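
To give an idea of the layout being discussed, here is a sketch of a
compact candidate split point (field names are illustrative, not
necessarily exactly what v14 has):

#include "postgres.h"
#include "storage/off.h"

typedef struct SplitPointSketch
{
    OffsetNumber firstoldonright;   /* first pre-existing tuple on right page */
    bool        newitemonleft;      /* does the new item go on the left half? */
    int16       curdelta;           /* current space delta */
    int16       leftfree;           /* space left free on left page */
    int16       rightfree;          /* space left free on right page */
} SplitPointSketch;

With the three space fields as int16 this is about 10 bytes per element;
squeezing things together with bitfields (plus the packing pragmas) is
roughly how the 8-byte variant worked, but as mentioned it wasn't worth
the slowdown.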

* You seemed to refactor _bt_checksplitloc() in passing, making it not
do the newitemisfirstonright thing. I changed that back. Did I miss
something that you intended here?

My patch treated the new item the same as all the old items, in
_bt_checksplitloc(), so it didn't need newitemisonright. You still need
it with your approach.

I would feel better about it if we stuck to the same method for
calculating if a split point is legal as before (the only difference
being that we pessimistically add heap TID overhead to new high key on
leaf level). That seems safer to me.

Changes to my own code since v12:

* Simplified "Add "split after new tuple" optimization" commit, and
made it more consistent with associated code. This is something that
was made a lot easier by the new approach. It would be great to hear
what you think of this.

I looked at it very briefly. Yeah, it's pretty simple now. Nice!

I can understand why it might be difficult to express an opinion on
the heuristics themselves. The specific cut-off points (e.g. details
of what "heap TID adjacency" actually means) are not that easy to
defend with a theoretical justification, though they have been
carefully tested. I think it's worth comparing the "split after new
tuple" optimization to the traditional leaf fillfactor of 90, which is
a very similar situation. Why should it be 90? Why not 85, or 95? Why
is it okay to assume that the rightmost page shouldn't be split 50/50?

The answers to all of these questions about the well established idea
of a leaf fillfactor boil down to this: it's very likely to be correct
on average, and when it isn't correct the problem is self-limiting,
and has an infinitesimally small chance of continually recurring
(unless you imagine an *adversarial* case). Similarly, it doesn't
matter if these new heuristics get it wrong once every 1000 splits (a
very pessimistic estimate), because even then the mistakes will cancel each
other out in the long run. It is necessary to take a holistic view of
things. We're talking about an optimization that makes the two largest
TPC-C indexes over 40% smaller -- I can hold my nose if I must in
order to get that benefit. There were also a couple of indexes in the
real-world mouse genome database that this made much smaller, so this
will clearly help in the real world.

Besides all this, the "split after new tuple" optimization fixes an
existing worst case, rather than being an optimization, at least in my
mind. It's not supposed to be possible to have leaf pages that are all
only 50% full without any deletes, and yet we allow it to happen in
this one weird case. Even completely random insertions result in 65% -
70% average space utilization, so the existing worst case really is
exceptional. We are forced to take a holistic view, and infer
something about the pattern of insertions over time, even though
holistic is a dirty word.

(New header comment block over _bt_findsplitloc())

This is pretty good, but I think some copy-editing can make this even
better.

I've restored the old structure of the _bt_findsplitloc() header comments.

The explanation of how the high key for the left page is formed (5th
paragraph), seems out-of-place here, because the high key is not formed
here.

Moved that to _bt_split(), which seems like a good compromise.

Somewhere in the 1st or 2nd paragraph, perhaps we should mention that
the function effectively uses a different fillfactor in some other
scenarios too, not only when it's the rightmost page.

Done.

state.maxsplits = maxoff + 2;
state.splits = palloc(Max(BLCKSZ, sizeof(SplitPoint) * state.maxsplits));
state.nsplits = 0;

I wouldn't be that paranoid. The code that populates the array is pretty
straightforward.

Done that way. But are you sure? Some of the attempts to create a new
split point are bound to fail, because they try to put everything
(including the new item) on one side of the split. I'll leave the
assertion there.

* Still, it's typical for almost all calls to _bt_recordsplit to
* determine that the split is legal, and therefore enter it into the
* candidate split point array for later consideration.
*/

Suggestion: Remove the "Still" word. The observation that typically all
split points are legal is valid, but it seems unrelated to the first
paragraph. (Do we need to mention it at all?)

Removed the second paragraph entirely.

/*
* If the new item goes as the last item, record the split point that
* leaves all the old items on the left page, and the new item on the
* right page. This is required because a split that leaves the new item
* as the firstoldonright won't have been reached within the loop. We
* always record every possible split point.
*/

Suggestion: Remove the last sentence.

Agreed. Removed.

ISTM that figuring out which "mode" we want to operate in is actually
the *primary* purpose of _bt_perfect_penalty(). We only really use the
penalty as an optimization that we pass on to _bt_bestsplitloc(). So I'd
suggest changing the function name to something like _bt_choose_mode(),
and have secondmode be the primary return value from it, with
perfectpenalty being the secondary result through a pointer.

I renamed _bt_perfect_penalty() to _bt_strategy(), since I agree that
its primary purpose is to decide on a strategy (the term I'm now using
instead of "mode", per your request a bit further down). It still returns
perfectpenalty, though, since that seemed more natural to me, probably
because its style matches the style of caller/_bt_findsplitloc().
perfectpenalty isn't a mere optimization -- it is important to prevent
many duplicates mode from going overboard with suffix truncation. It
does more than just save _bt_bestsplitloc() cycles, which I've tried
to make clearer in v14.

It doesn't really choose the mode, either, though. At least after the
next "Add split after new tuple optimization" patch. The caller has a
big part in choosing what to do. So maybe split _bt_perfect_penalty into
two functions: _bt_perfect_penalty(), which just computes the perfect
penalty, and _bt_analyze_split_interval(), which determines how many
duplicates there are in the top-N split points.

Hmm. I didn't create a _bt_analyze_split_interval(), because now
_bt_perfect_penalty()/_bt_strategy() is responsible for setting the
perfect penalty in all cases. It was a mistake for me to move some
perfect penalty stuff for alternative modes/strategies out to the
caller in v13. In v14, we never make _bt_findsplitloc() change its
perfect penalty -- it only changes its split interval, based on the
strategy/mode, possibly after sorting. Let me know what you think of
this.

BTW, I like the word "strategy", like you called it in the comment on
SplitPoint struct, better than "mode".

I've adopted that terminology in v14 -- it's always "strategy", never "mode".

How about removing the "usemult" variable, and just check if
fillfactormult == 0.5?

I need to use "usemult" to determine if the "split after new tuple"
optimization should apply leaf fillfactor, or should instead split at
the exact point after the newly inserted tuple -- it's very natural to
have a single bool flag for that. It's seems simpler to continue to
use "usemult" for everything, and not distinguish "split after new
tuple" as a special case later on. (Besides, the master branch already
uses a bool for this, even though it handles far fewer things.)

/*
* There are a much smaller number of candidate split points when
* splitting an internal page, so we can afford to be exhaustive. Only
* give up when pivot that will be inserted into parent is as small as
* possible.
*/
if (!state->is_leaf)
return MAXALIGN(sizeof(IndexTupleData) + 1);

Why are there fewer candidate split points on an internal page?

The comment should say that there is typically a much smaller split
interval (this used to be controlled by limiting the size of the array
initially -- should have updated this for v13 of the patch). I believe
that you understand that, and are interested in why the split interval
itself is different on internal pages. Or why we are more conservative
with internal pages in general. Assuming that's what you meant, here
is my answer:

The "Prefix B-Tree" paper establishes the idea that there are
different split intervals for leaf pages and internal pages (which it
calls branch pages). We care about different things in each case. With
leaf pages, we care about choosing the split point that allows us to
create the smallest possible pivot tuple as our secondary goal
(primary goal is balancing space). With internal pages, we care about
choosing the smallest tuple to insert into the parent of the internal page
(often the root) as our secondary goal, but don't care about
truncation, because _bt_split() won't truncate the new pivot. That's why
the definition of "penalty" varies according to whether we're
splitting an internal page or a leaf page. Clearly the idea of having
separate split intervals is well established, and makes sense.
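
Expressed as a sketch (the concept only, not the patch's actual penalty
code):

#include "postgres.h"
#include "access/itup.h"

/*
 * Internal split: the first tuple headed for the new right page becomes
 * the parent's pivot verbatim, so its size is the penalty.  Leaf split:
 * the penalty is the size of the suffix-truncated high key the split
 * would produce (passed in here as truncatedhighkeysz, standing in for
 * a call to the truncation code).
 */
static Size
split_penalty_sketch(bool is_leaf, IndexTuple firstright,
                     Size truncatedhighkeysz)
{
    if (!is_leaf)
        return IndexTupleSize(firstright);

    return truncatedhighkeysz;
}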

It's fair to ask if I'm being too conservative (or not conservative
enough) with split interval in either case. Unfortunately, the Prefix
B-Tree paper never seems to give practical advice about how to come up
with an interval. They say:

"We have not analyzed the influence of sigma L [leaf interval] or
sigma B [branch/internal interval] on the performance of the trees. We
expect such an analysis to be quite involved and difficult. We are
quite confident, however, that small split intervals improve
performance considerably. Sets of keys that arise in practical
applications are often far from random, and clusters of similar keys
differing only in the last few letters (e.g. plural forms) are quite
common."

I am aware of another, not-very-notable paper that tries to impose
some theory here, but doesn't really help much [1]. Anyway, I've found
that I was too conservative with split interval for internal pages. It
pays to make the internal interval rather higher than the leaf interval, because
internal pages cover a much bigger portion of the key space than leaf
pages, which will tend to get filled up one way or another. Leaf pages
cover a tight part of the key space, in contrast. In v14, I've
increased the internal page split interval to 18, a big increase from 3,
and twice what it is for leaf splits (still 9 -- no change there). This
mostly isn't that different from 3, since there usually are pivot tuples that are all
the same size anyway. However, with cases where suffix truncation
makes pivot tuples a lot smaller (e.g. UK land registry test case),
this makes the items in the root a lot smaller on average, and even
makes the whole index smaller. My entire test suite has a few cases
that are noticeably improved by this change, and no cases that are
hurt.

I'm going to have to revalidate the performance of long-running
benchmarks with this change, so this should be considered provisional.
I think that it will probably be kept, though. Not expecting it to
noticeably impact either BenchmarkSQL or pgbench benchmarks.

[1]: https://shareok.org/bitstream/handle/11244/16442/Thesis-1983-T747e.pdf?sequence=1
--
Peter Geoghegan

Attachments:

v14-0001-Refactor-nbtree-insertion-scankeys.patch (application/octet-stream)
From eed4ac506df1ee54faf82d820fde8b1943140f34 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 29 Dec 2018 15:34:48 -0800
Subject: [PATCH v14 1/7] Refactor nbtree insertion scankeys.

Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.

Use the new struct to store mutable state about an in-progress binary
search, rather than having _bt_check_unique() callers cache
_bt_binsrch() effort in an ad-hoc manner.  This makes it easy to add a
new optimization: _bt_check_unique() now falls out of its loop
immediately in the common case where it's already clear that there
couldn't possibly be a duplicate.  More importantly, the new
_bt_check_unique() scheme makes it a lot easier to manage cached binary
search effort afterwards, from within _bt_findinsertloc().  This is
needed for the upcoming patch to make nbtree tuples unique by treating
heap TID as a final tie-breaker column.

Based on a suggestion by Andrey Lepikhov.
---
 contrib/amcheck/verify_nbtree.c       |  52 ++--
 src/backend/access/nbtree/nbtinsert.c | 334 ++++++++++++++++----------
 src/backend/access/nbtree/nbtpage.c   |  12 +-
 src/backend/access/nbtree/nbtsearch.c | 168 ++++++++-----
 src/backend/access/nbtree/nbtsort.c   |   8 +-
 src/backend/access/nbtree/nbtutils.c  |  98 +++-----
 src/backend/utils/sort/tuplesort.c    |  16 +-
 src/include/access/nbtree.h           |  61 ++++-
 8 files changed, 429 insertions(+), 320 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 964200a767..053ac9d192 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -126,9 +126,9 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
-static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+static BTScanInsert bt_right_page_check_scankey(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
@@ -138,14 +138,14 @@ static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber upperbound);
 static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber lowerbound);
 static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
+							   BTScanInsert key,
+							   Page nontarget,
 							   OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
 
@@ -837,8 +837,8 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		BTScanInsert skey;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -1029,7 +1029,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
-			ScanKey		rightkey;
+			BTScanInsert	rightkey;
 
 			/* Get item in next/right page */
 			rightkey = bt_right_page_check_scankey(state);
@@ -1081,7 +1081,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, skey, childblock);
 		}
 	}
 
@@ -1110,11 +1110,12 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
+static BTScanInsert
 bt_right_page_check_scankey(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
+	IndexTuple	firstitup;
 	BlockNumber targetnext;
 	Page		rightpage;
 	OffsetNumber nline;
@@ -1302,8 +1303,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * Return first real item scankey.  Note that this relies on right page
 	 * memory remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
+	return _bt_mkscankey(state->rel, firstitup);
 }
 
 /*
@@ -1316,8 +1317,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  * verification this way around is much more practical.
  */
 static void
-bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1422,8 +1423,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1863,13 +1863,12 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
+invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
@@ -1882,13 +1881,12 @@ invariant_leq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
+invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
 	return cmp >= 0;
 }
@@ -1904,14 +1902,12 @@ invariant_geq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							   Page nontarget, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
 	return cmp <= 0;
 }
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 2b18028823..f143ea8be2 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -51,19 +51,19 @@ typedef struct
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
-static TransactionId _bt_check_unique(Relation rel, IndexTuple itup,
-				 Relation heapRel, Buffer buf, OffsetNumber offset,
-				 ScanKey itup_scankey,
+static TransactionId _bt_check_unique(Relation rel, BTScanInsert itup_key,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken);
-static void _bt_findinsertloc(Relation rel,
+static OffsetNumber _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_key,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool checkingunique,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel);
+static bool _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -83,8 +83,8 @@ static void _bt_checksplitloc(FindSplitData *state,
 				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey);
+static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key,
+			Page page, OffsetNumber offnum);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
@@ -97,7 +97,9 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
  *		UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
  *		For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- *		don't actually insert.
+ *		don't actually insert.  If rel is a unique index, then every call
+ *		here is a checkingunique call (i.e. every call does a duplicate
+ *		check, though perhaps only a tentative check).
  *
  *		The result value is only significant for UNIQUE_CHECK_PARTIAL:
  *		it must be true if the entry is known unique, else false.
@@ -110,18 +112,14 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel)
 {
 	bool		is_unique = false;
-	int			indnkeyatts;
-	ScanKey		itup_scankey;
+	BTScanInsert itup_key;
 	BTStack		stack = NULL;
 	Buffer		buf;
-	OffsetNumber offset;
 	bool		fastpath;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	Assert(indnkeyatts != 0);
+	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);
 
 	/* we need an insertion scan key to do our search, so build one */
-	itup_scankey = _bt_mkscankey(rel, itup);
+	itup_key = _bt_mkscankey(rel, itup);
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -144,7 +142,6 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	 */
 top:
 	fastpath = false;
-	offset = InvalidOffsetNumber;
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
 		Size		itemsz;
@@ -179,8 +176,7 @@ top:
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, itup_key, page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -219,8 +215,7 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, itup_key, &buf, BT_WRITE, NULL);
 	}
 
 	/*
@@ -244,13 +239,12 @@ top:
 	 * let the tuple in and return false for possibly non-unique, or true for
 	 * definitely unique.
 	 */
-	if (checkUnique != UNIQUE_CHECK_NO)
+	if (checkingunique)
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
-		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
+		xwait = _bt_check_unique(rel, itup_key, itup, heapRel, buf,
 								 checkUnique, &is_unique, &speculativeToken);
 
 		if (TransactionIdIsValid(xwait))
@@ -277,6 +271,8 @@ top:
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		OffsetNumber newitemoff;
+
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -287,10 +283,16 @@ top:
 		 * attributes are not considered part of the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
-		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+
+		/*
+		 * Do the insertion.  Note that itup_key contains mutable state used
+		 * by _bt_check_unique to help _bt_findinsertloc avoid repeating its
+		 * binary search.  !checkingunique case must start own binary search.
+		 */
+		newitemoff = _bt_findinsertloc(rel, itup_key, &buf, checkingunique,
+									   itup, stack, heapRel);
+		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, newitemoff,
+					   false);
 	}
 	else
 	{
@@ -301,7 +303,7 @@ top:
 	/* be tidy */
 	if (stack)
 		_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
+	pfree(itup_key);
 
 	return is_unique;
 }
@@ -309,9 +311,9 @@ top:
 /*
  *	_bt_check_unique() -- Check for violation of unique index constraint
  *
- * offset points to the first possible item that could conflict. It can
- * also point to end-of-page, which means that the first tuple to check
- * is the first tuple on the next page.
+ * Sets state in itup_key sufficient for later _bt_findinsertloc() call to
+ * reuse most of the work of our initial binary search to find conflicting
+ * tuples.
  *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
@@ -326,14 +328,14 @@ top:
  * core code must redo the uniqueness check later.
  */
 static TransactionId
-_bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
-				 Buffer buf, OffsetNumber offset, ScanKey itup_scankey,
+_bt_check_unique(Relation rel, BTScanInsert itup_key,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	SnapshotData SnapshotDirty;
+	OffsetNumber offset;
 	OffsetNumber maxoff;
 	Page		page;
 	BTPageOpaque opaque;
@@ -349,9 +351,17 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
+	/*
+	 * Save binary search bounds.  Note that this is also used within
+	 * _bt_findinsertloc() later.
+	 */
+	itup_key->savebinsrch = true;
+	offset = _bt_binsrch(rel, itup_key, buf);
+
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
+	Assert(itup_key->low == offset);
 	for (;;)
 	{
 		ItemId		curitemid;
@@ -364,6 +374,26 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 		 */
 		if (offset <= maxoff)
 		{
+			/*
+			 * Fastpath: _bt_binsrch() search bounds can be used to limit our
+			 * consideration to items that are definitely duplicates in most
+			 * cases (though not when original page is empty, or when initial
+			 * offset is past the end of the original page, which may indicate
+			 * that we'll have to examine a second or subsequent page).
+			 *
+			 * Note that this optimization avoids calling _bt_isequal()
+			 * entirely when there are no duplicates, provided initial offset
+			 * isn't past end of the initial page (and provided page has at
+			 * least one item).
+			 */
+			if (nbuf == InvalidBuffer && offset == itup_key->stricthigh)
+			{
+				Assert(itup_key->low >= P_FIRSTDATAKEY(opaque));
+				Assert(itup_key->low <= itup_key->stricthigh);
+				Assert(!_bt_isequal(itupdesc, itup_key, page, offset));
+				break;
+			}
+
 			curitemid = PageGetItemId(page, offset);
 
 			/*
@@ -378,7 +408,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			 * first, so that we didn't actually get to exit any sooner
 			 * anyway. So now we just advance over killed items as quickly as
 			 * we can. We only apply _bt_isequal() when we get to a non-killed
-			 * item or the end of the page.
+			 * item.
 			 */
 			if (!ItemIdIsDead(curitemid))
 			{
@@ -391,7 +421,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 				 * in real comparison, but only for ordering/finding items on
 				 * pages. - vadim 03/24/97
 				 */
-				if (!_bt_isequal(itupdesc, page, offset, indnkeyatts, itup_scankey))
+				if (!_bt_isequal(itupdesc, itup_key, page, offset))
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
@@ -552,11 +582,14 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
+			int			highkeycmp;
+
 			/* If scankey == hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+			Assert(highkeycmp <= 0);
+			if (highkeycmp != 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -611,39 +644,40 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
  *		Once we have chosen the page to put the key on, we'll insert it before
  *		any existing equal keys because of the way _bt_binsrch() works.
  *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
+ *		_bt_check_unique() callers arrange for their insertion scan key to
+ *		save the progress of the last binary search performed.  No additional
+ *		binary search comparisons occur in the common case where there was no
+ *		existing duplicate tuple, though we may occasionally still not be able
+ *		to reuse their work for our own reasons.  Even when there are garbage
+ *		duplicates, very few binary search comparisons will be performed
+ *		without being strictly necessary.
  *
- *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
- *		page returned to the caller is exclusively locked instead.
+ *		The caller should hold an exclusive lock on *bufptr in all cases.  On
+ *		exit,  bufptr points to the chosen insert location in all cases.  If
+ *		we have to move right, the lock and pin on the original page will be
+ *		released, and the new page returned to the caller is exclusively
+ *		locked instead.  In any case, we return the offset that caller should
+ *		use to insert into the buffer pointed to by bufptr on return.
  *
- *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.  It has to happen here, since it may invalidate a
+ *		_bt_check_unique() caller's cached binary search work.
  */
-static void
+static OffsetNumber
 _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_key,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool checkingunique,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel)
 {
 	Buffer		buf = *bufptr;
 	Page		page = BufferGetPage(buf);
+	bool		restorebinsrch = checkingunique;
 	Size		itemsz;
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
 	OffsetNumber newitemoff;
-	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -672,55 +706,36 @@ _bt_findinsertloc(Relation rel,
 				 errtableconstraint(heapRel,
 									RelationGetRelationName(rel))));
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
-	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	for (;;)
 	{
+		int			cmpval;
 		Buffer		rbuf;
 		BlockNumber rblkno;
 
 		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
+		 * The checkingunique (restorebinsrch) case may well have established
+		 * bounds within _bt_check_unique()'s binary search that preclude the
+		 * need for a further high key check.  This fastpath isn't used when
+		 * there are no items on the existing page (other than high key), or
+		 * when it looks like the new item belongs last on the page, but it
+		 * might go on a later page instead.
 		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
-		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
+		if (restorebinsrch && itup_key->low <= itup_key->stricthigh &&
+			itup_key->stricthigh <= PageGetMaxOffsetNumber(page))
+			break;
 
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
-		}
+		if (P_RIGHTMOST(lpageop))
+			break;
+		cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
 
 		/*
-		 * nope, so check conditions (b) and (c) enumerated above
+		 * May have to handle case where there is a choice of which page to
+		 * place new tuple on, and we must balance space utilization as best
+		 * we can.
 		 */
-		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
+												&restorebinsrch, itemsz))
 			break;
 
 		/*
@@ -763,27 +778,98 @@ _bt_findinsertloc(Relation rel,
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		restorebinsrch = false;
+	}
+
+	/* Loop should not break until correct page located */
+	Assert(P_RIGHTMOST(lpageop) ||
+		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+
+	/*
+	 * Perform microvacuuming of the page we're about to insert tuple on if it
+	 * looks like it has LP_DEAD items.  Only microvacuum when it's likely to
+	 * forestall a page split, though.
+	 */
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < itemsz)
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		restorebinsrch = false;
 	}
 
 	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
+	 * Reuse binary search bounds established within _bt_check_unique if
+	 * caller is checkingunique caller, and first page locked is also where
+	 * new tuple should be inserted
 	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
-	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	itup_key->restorebinsrch = restorebinsrch;
+	newitemoff = _bt_binsrch(rel, itup_key, buf);
+	Assert(!itup_key->restorebinsrch);
+	Assert(!restorebinsrch || newitemoff == _bt_binsrch(rel, itup_key, buf));
 
 	*bufptr = buf;
-	*offsetptr = newitemoff;
+	return newitemoff;
+}
+
+/*
+ *	_bt_useduplicatepage() -- Settle for this page of duplicates?
+ *
+ *		This function handles the question of whether or not an insertion
+ *		of a duplicate into a pg_upgrade'd !heapkeyspace index should
+ *		insert on the page contained in buf when a choice must be made.
+ *		Preemptive microvacuuming is performed here when that could allow
+ *		caller to insert on to the page in buf.
+ *
+ *		Returns true if caller should proceed with insert on buf's page.
+ *		Otherwise, caller should move on to the page to the right (caller
+ *		must always be able to still move right following call here).
+ */
+static bool
+_bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz)
+{
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque lpageop;
+
+	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	Assert(P_ISLEAF(lpageop) && !P_RIGHTMOST(lpageop));
+
+	/* Easy case -- there is space free on this page already */
+	if (PageGetFreeSpace(page) >= itemsz)
+		return true;
+
+	if (P_HAS_GARBAGE(lpageop))
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		*restorebinsrch = false;
+		if (PageGetFreeSpace(page) >= itemsz)
+			return true;		/* OK, now we have enough space */
+	}
+
+	/*----------
+	 * It's now clear that _bt_findinsertloc() caller will need to split
+	 * the page if it is to insert new item on to it.  The choice to move
+	 * right to the next page remains open to it, but we should not search
+	 * for free space exhaustively when there are many pages to look through.
+	 *
+	 *	_bt_findinsertloc() keeps scanning right until it:
+	 *		(a) reaches the last page where the tuple can legally go
+	 *	Or until we:
+	 *		(b) find a page with enough free space, or
+	 *		(c) get tired of searching.
+	 * (c) is not flippant; it is important because if there are many
+	 * pages' worth of equal keys, it's better to split one of the early
+	 * pages than to scan all the way to the end of the run of equal keys
+	 * on every insert.  We implement "get tired" as a random choice,
+	 * since stopping after scanning a fixed number of pages wouldn't work
+	 * well (we'd never reach the right-hand side of previously split
+	 * pages).  The probability of moving right is set at 0.99, which may
+	 * seem too high to change the behavior much, but it does an excellent
+	 * job of preventing O(N^2) behavior with many equal keys.
+	 *----------
+	 */
+	return random() <= (MAX_RANDOM_VALUE / 100);
 }
 
 /*----------
@@ -1189,8 +1275,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	 * If the page we're splitting is not the rightmost page at its level in
 	 * the tree, then the first entry on the page is the high key for the
 	 * page.  We need to copy that to the right half.  Otherwise (meaning the
-	 * rightmost page case), all the items on the right half will be user
-	 * data.
+	 * rightmost page case), all the items on the right half will be user data
+	 * (there is no existing high key that needs to be relocated to the new
+	 * right page).
 	 */
 	rightoff = P_HIKEY;
 
@@ -2310,24 +2397,21 @@ _bt_pgaddtup(Page page,
  * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
-_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey)
+_bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
+			OffsetNumber offnum)
 {
 	IndexTuple	itup;
+	ScanKey		scankey;
 	int			i;
 
-	/* Better be comparing to a leaf item */
+	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
 
+	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
-	for (i = 1; i <= keysz; i++)
+	for (i = 1; i <= itup_key->keysz; i++)
 	{
 		AttrNumber	attno;
 		Datum		datum;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c785bca95..56041c3d38 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1371,7 +1371,7 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 */
 			if (!stack)
 			{
-				ScanKey		itup_scankey;
+				BTScanInsert itup_key;
 				ItemId		itemid;
 				IndexTuple	targetkey;
 				Buffer		lbuf;
@@ -1421,12 +1421,10 @@ _bt_pagedel(Relation rel, Buffer buf)
 				}
 
 				/* we need an insertion scan key for the search, so build one */
-				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
-				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
-				/* don't need a pin on the page */
+				itup_key = _bt_mkscankey(rel, targetkey);
+				/* get stack to leaf page by searching index */
+				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
+				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
 				/*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 92832237a8..7940297305 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -71,13 +71,9 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
  * but it can omit the rightmost column(s) of the index.
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * Return value is a stack of parent-page pointers.  *bufP is set to the
  * address of the leaf-page buffer, which is read-locked and pinned.
  * No locks are held on the parent pages, however!
@@ -93,8 +89,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+		   Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -130,8 +126,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  (access == BT_WRITE), stack_in,
+		*bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
 		/* if this is a leaf page, we're done */
@@ -144,7 +139,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -198,8 +193,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  true, stack_in, BT_WRITE, snapshot);
+		*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+							  snapshot);
 	}
 
 	return stack_in;
@@ -215,16 +210,17 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * or strictly to the right of it.
  *
  * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page.  If that entry
- * is strictly less than the scankey, or <= the scankey in the nextkey=true
- * case, then we followed the wrong link and we need to move right.
+ * tree by examining the high key entry on the page.  If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key.  When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
  *
  * If forupdate is true, we will attempt to finish any incomplete splits
  * that we encounter.  This is required when locking a target page for an
@@ -241,10 +237,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  */
 Buffer
 _bt_moveright(Relation rel,
+			  BTScanInsert key,
 			  Buffer buf,
-			  int keysz,
-			  ScanKey scankey,
-			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
 			  int access,
@@ -269,7 +263,7 @@ _bt_moveright(Relation rel,
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
 	 */
-	cmpval = nextkey ? 0 : 1;
+	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
@@ -304,7 +298,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -324,13 +318,6 @@ _bt_moveright(Relation rel,
 /*
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
  * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
  * particular, this means it is possible to return a value 1 greater than the
@@ -346,27 +333,45 @@ _bt_moveright(Relation rel,
  *
  * This procedure is not responsible for walking right, it just examines
  * the given page.  _bt_binsrch() has no lock or refcount side effects
- * on the buffer.
+ * on the buffer.  When key.savebinsrch is set, modifies mutable fields
+ * of insertion scan key, so that a subsequent call where caller sets
+ * key.savebinsrch can reuse the low and strict high bound of original
+ * binary search.  Callers that use these fields directly must be
+ * prepared for the case where stricthigh isn't on the same page (it
+ * exceeds maxoff for the page), and the case where there are no items
+ * on the page (high < low).
  */
 OffsetNumber
 _bt_binsrch(Relation rel,
-			Buffer buf,
-			int keysz,
-			ScanKey scankey,
-			bool nextkey)
+			BTScanInsert key,
+			Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber low,
-				high;
+				high,
+				stricthigh;
 	int32		result,
 				cmpval;
+	bool		isleaf;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	low = P_FIRSTDATAKEY(opaque);
-	high = PageGetMaxOffsetNumber(page);
+	if (!key->restorebinsrch)
+	{
+		low = P_FIRSTDATAKEY(opaque);
+		high = PageGetMaxOffsetNumber(page);
+		isleaf = P_ISLEAF(opaque);
+	}
+	else
+	{
+		/* Restore result of previous binary search against same page */
+		Assert(P_ISLEAF(opaque));
+		low = key->low;
+		high = key->stricthigh;
+		isleaf = true;
+	}
 
 	/*
 	 * If there are no keys on the page, return the first available slot. Note
@@ -375,8 +380,19 @@ _bt_binsrch(Relation rel,
 	 * This can never happen on an internal page, however, since they are
 	 * never empty (an internal page must have children).
 	 */
-	if (high < low)
+	if (unlikely(high < low))
+	{
+		if (key->savebinsrch)
+		{
+			Assert(isleaf);
+			/* Caller can't use stricthigh */
+			key->low = low;
+			key->stricthigh = high;
+		}
+		key->savebinsrch = false;
+		key->restorebinsrch = false;
 		return low;
+	}
 
 	/*
 	 * Binary search to find the first key on the page >= scan key, or first
@@ -390,9 +406,12 @@ _bt_binsrch(Relation rel,
 	 *
 	 * We can fall out when high == low.
 	 */
-	high++;						/* establish the loop invariant for high */
+	if (!key->restorebinsrch)
+		high++;					/* establish the loop invariant for high */
+	key->restorebinsrch = false;
+	stricthigh = high;			/* high initially strictly higher */
 
-	cmpval = nextkey ? 0 : 1;	/* select comparison value */
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
 
 	while (high > low)
 	{
@@ -400,12 +419,21 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
 		else
+		{
 			high = mid;
+
+			/*
+			 * high can only be reused by a more restrictive binary search when
+			 * it's known to be strictly greater than the original scankey
+			 */
+			if (result != 0)
+				stricthigh = high;
+		}
 	}
 
 	/*
@@ -415,7 +443,14 @@ _bt_binsrch(Relation rel,
 	 * On a leaf page, we always return the first key >= scan key (resp. >
 	 * scan key), which could be the last slot + 1.
 	 */
-	if (P_ISLEAF(opaque))
+	if (key->savebinsrch)
+	{
+		Assert(isleaf);
+		key->low = low;
+		key->stricthigh = stricthigh;
+		key->savebinsrch = false;
+	}
+	if (isleaf)
 		return low;
 
 	/*
@@ -428,13 +463,8 @@ _bt_binsrch(Relation rel,
 }
 
 /*----------
- *	_bt_compare() -- Compare scankey to a particular tuple on the page.
+ *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- *	keysz: number of key conditions to be checked (might be less than the
- *		number of index columns!)
  *	page/offnum: location of btree item to be compared to.
  *
  *		This routine returns:
@@ -455,17 +485,17 @@ _bt_binsrch(Relation rel,
  */
 int32
 _bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
+			BTScanInsert key,
 			Page page,
 			OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
-	int			i;
+	ScanKey		scankey;
 
 	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
@@ -488,7 +518,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	scankey = key->scankeys;
+	for (int i = 1; i <= key->keysz; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -574,8 +605,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat;
 	bool		nextkey;
 	bool		goback;
+	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
-	ScanKeyData scankeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -821,8 +852,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
 	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built in the local
-	 * scankeys[] array, using the keys identified by startKeys[].
+	 * identified above.  The insertion scankey is built using the keys
+	 * identified by startKeys[].  (Remaining insertion scankey fields are
+	 * initialized after initial-positioning strategy is finalized.)
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
 	for (i = 0; i < keysCount; i++)
@@ -850,7 +882,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				_bt_parallel_done(scan);
 				return false;
 			}
-			memcpy(scankeys + i, subkey, sizeof(ScanKeyData));
+			memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
 
 			/*
 			 * If the row comparison is the last positioning key we accepted,
@@ -882,7 +914,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					if (subkey->sk_flags & SK_ISNULL)
 						break;	/* can't use null keys */
 					Assert(keysCount < INDEX_MAX_KEYS);
-					memcpy(scankeys + keysCount, subkey, sizeof(ScanKeyData));
+					memcpy(inskey.scankeys + keysCount, subkey,
+						   sizeof(ScanKeyData));
 					keysCount++;
 					if (subkey->sk_flags & SK_ROW_END)
 					{
@@ -928,7 +961,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				FmgrInfo   *procinfo;
 
 				procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
-				ScanKeyEntryInitializeWithInfo(scankeys + i,
+				ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
 											   cur->sk_flags,
 											   cur->sk_attno,
 											   InvalidStrategy,
@@ -949,7 +982,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
 						 BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
 						 cur->sk_attno, RelationGetRelationName(rel));
-				ScanKeyEntryInitialize(scankeys + i,
+				ScanKeyEntryInitialize(inskey.scankeys + i,
 									   cur->sk_flags,
 									   cur->sk_attno,
 									   InvalidStrategy,
@@ -1052,12 +1085,17 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
+	/* Initialize remaining insertion scan key fields */
+	inskey.savebinsrch = inskey.restorebinsrch = false;
+	inskey.low = inskey.stricthigh = InvalidOffsetNumber;
+	inskey.nextkey = nextkey;
+	inskey.keysz = keysCount;
+
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
@@ -1086,7 +1124,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, &inskey, buf);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e1186..759859c302 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -254,6 +254,7 @@ typedef struct BTWriteState
 {
 	Relation	heap;
 	Relation	index;
+	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
@@ -531,6 +532,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
+	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 
 	/*
 	 * We need to log index creation in WAL iff WAL archiving/streaming is
@@ -1076,7 +1078,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
-	ScanKey		indexScanKey = NULL;
 	SortSupport sortKeys;
 
 	if (merge)
@@ -1089,7 +1090,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		/* the preparation of merge */
 		itup = tuplesort_getindextuple(btspool->sortstate, true);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-		indexScanKey = _bt_mkscankey_nodata(wstate->index);
 
 		/* Prepare SortSupport data for each column */
 		sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
@@ -1097,7 +1097,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		for (i = 0; i < keysz; i++)
 		{
 			SortSupport sortKey = sortKeys + i;
-			ScanKey		scanKey = indexScanKey + i;
+			ScanKey		scanKey = wstate->inskey->scankeys + i;
 			int16		strategy;
 
 			sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1116,8 +1116,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
 		}
 
-		_bt_freeskey(indexScanKey);
-
 		for (;;)
 		{
 			load1 = true;		/* load BTSpool next ? */
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 2c05fb5e45..e010bcdcfa 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -56,34 +56,39 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().
+ *		Result is intended for use with _bt_compare().  Callers that don't
+ *		need to fill out the insertion scankey arguments (e.g. they use their
+ *		own ad-hoc comparison routine) can pass a NULL index tuple.
  */
-ScanKey
+BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
+	BTScanInsert key;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
-	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
+	int			tupnatts;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
-	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
+	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
 	 * (INCLUDE index) columns are always omitted from scan keys.
 	 */
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
+	key = palloc(offsetof(BTScanInsertData, scankeys) +
+				 sizeof(ScanKeyData) * indnkeyatts);
+	key->savebinsrch = key->restorebinsrch = false;
+	key->low = key->stricthigh = InvalidOffsetNumber;
+	key->nextkey = false;
+	key->keysz = Min(indnkeyatts, tupnatts);
+	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
 		FmgrInfo   *procinfo;
@@ -96,7 +101,19 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Key arguments built when the caller provides no tuple are
+		 * defensively represented as NULL values, though they should never
+		 * actually participate in comparisons.
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -108,64 +125,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 									   arg);
 	}
 
-	return skey;
-}
-
-/*
- * _bt_mkscankey_nodata
- *		Build an insertion scan key that contains 3-way comparator routines
- *		appropriate to the key datatypes, but no comparison data.  The
- *		comparison data ultimately used must match the key datatypes.
- *
- *		The result cannot be used with _bt_compare(), unless comparison
- *		data is first stored into the key entries.  Currently this
- *		routine is only called by nbtsort.c and tuplesort.c, which have
- *		their own comparison routines.
- */
-ScanKey
-_bt_mkscankey_nodata(Relation rel)
-{
-	ScanKey		skey;
-	int			indnkeyatts;
-	int16	   *indoption;
-	int			i;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	indoption = rel->rd_indoption;
-
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
-	for (i = 0; i < indnkeyatts; i++)
-	{
-		FmgrInfo   *procinfo;
-		int			flags;
-
-		/*
-		 * We can use the cached (default) support procs since no cross-type
-		 * comparison can be needed.
-		 */
-		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		flags = SK_ISNULL | (indoption[i] << SK_BT_INDOPTION_SHIFT);
-		ScanKeyEntryInitializeWithInfo(&skey[i],
-									   flags,
-									   (AttrNumber) (i + 1),
-									   InvalidStrategy,
-									   InvalidOid,
-									   rel->rd_indcollation[i],
-									   procinfo,
-									   (Datum) 0);
-	}
-
-	return skey;
-}
-
-/*
- * free a scan key made by either _bt_mkscankey or _bt_mkscankey_nodata.
- */
-void
-_bt_freeskey(ScanKey skey)
-{
-	pfree(skey);
+	return key;
 }
 
 /*
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7b10fd2974..f97a82ae7b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -884,7 +884,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -919,7 +919,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	if (state->indexInfo->ii_Expressions != NULL)
 	{
@@ -945,7 +945,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -964,7 +964,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -981,7 +981,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -1014,7 +1014,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->indexRel = indexRel;
 	state->enforceUnique = enforceUnique;
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	/* Prepare SortSupport data for each column */
 	state->sortKeys = (SortSupport) palloc0(state->nKeys *
@@ -1023,7 +1023,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1042,7 +1042,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 60622ea790..950a61958d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -319,6 +319,47 @@ typedef struct BTStackData
 
 typedef BTStackData *BTStack;
 
+/*
+ * BTScanInsert is the btree-private state needed to find an initial position
+ * for an indexscan, or to insert new tuples -- an "insertion scankey" (not to
+ * be confused with a search scankey).  It's used to descend a B-Tree using
+ * _bt_search.  For details on its mutable state, see _bt_binsrch and
+ * _bt_findinsertloc.
+ *
+ * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
+ * locate the first item >= scankey.  When nextkey is true, they will locate
+ * the first item > scan key.
+ *
+ * keysz is the number of insertion scankeys present.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared.
+ * During insertion, there must be a scan key for every attribute, but when
+ * starting a regular index scan some can be omitted.  The array is treated
+ * as a flexible array member, though it's sized so that callers can also
+ * use stack allocations.  See nbtree/README for full details.
+ */
+
+typedef struct BTScanInsertData
+{
+	/*
+	 * Mutable state used by _bt_binsrch to inexpensively repeat a binary
+	 * search on the leaf level.  Only used for insertions where
+	 * _bt_check_unique is called.
+	 */
+	bool		savebinsrch;
+	bool		restorebinsrch;
+	OffsetNumber low;
+	OffsetNumber stricthigh;
+
+	/* State used to locate a position at the leaf level */
+	bool		nextkey;
+	int			keysz;			/* Number of entries in scankeys[] */
+	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
+} BTScanInsertData;
+
+typedef BTScanInsertData *BTScanInsert;
+
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -558,16 +599,12 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
 /*
  * prototypes for functions in nbtsearch.c
  */
-extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
-extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
+		   int access, Snapshot snapshot);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -576,9 +613,7 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 /*
  * prototypes for functions in nbtutils.c
  */
-extern ScanKey _bt_mkscankey(Relation rel, IndexTuple itup);
-extern ScanKey _bt_mkscankey_nodata(Relation rel);
-extern void _bt_freeskey(ScanKey skey);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-- 
2.17.1

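As a quick orientation for reviewers, here is a minimal caller-side sketch
(not part of the patch) of the consolidated interface that 0002 introduces.
Only the _bt_* signatures shown in the hunks above are real; the helper name
example_descend_to_leaf is invented for illustration, and locking/cleanup
details are elided.

#include "postgres.h"

#include "access/nbtree.h"

/*
 * Illustrative only: descend to the leaf page an index tuple belongs on,
 * using the BTScanInsert interface from 0002.  Real callers such as
 * _bt_first() do this internally; this helper is a made-up name.
 */
static OffsetNumber
example_descend_to_leaf(Relation rel, IndexTuple itup, Buffer *bufp)
{
	/* One palloc'd struct now carries keysz, nextkey and scankeys[] */
	BTScanInsert itup_key = _bt_mkscankey(rel, itup);
	BTStack		stack;
	OffsetNumber offnum;

	/* _bt_search/_bt_moveright/_bt_binsrch/_bt_compare all take the key */
	stack = _bt_search(rel, itup_key, bufp, BT_WRITE, NULL);
	offnum = _bt_binsrch(rel, itup_key, *bufp);

	_bt_freestack(stack);
	pfree(itup_key);			/* _bt_freeskey() is gone; plain pfree */

	return offnum;
}
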
Attachment: v14-0003-Consider-secondary-factors-during-nbtree-splits.patch (application/octet-stream)
From aae225442012f8191638505f1d78e0257cb6fc42 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 15:51:53 -0700
Subject: [PATCH v14 3/7] Consider secondary factors during nbtree splits.

Teach nbtree to give some consideration to how "distinguishing"
candidate leaf page split points are.  This should not noticeably affect
the balance of free space within each half of the split, while still
making suffix truncation truncate away significantly more attributes on
average.

The logic for choosing a leaf split point now uses a fallback mode in
the case where the page is full of duplicates and it isn't possible to
find even a minimally distinguishing split point.  When the page is full
of duplicates, the split should pack the left half very tightly, while
leaving the right half mostly empty.  Our assumption is that logical
duplicates will almost always be inserted in ascending heap TID order
with v4 indexes.  This strategy leaves most of the free space on the
half of the split that will likely be where future logical duplicates of
the same value need to be placed.

The number of cycles added is not very noticeable.  This is important
because deciding on a split point takes place while at least one
exclusive buffer lock is held.  We avoid using authoritative insertion
scankey comparisons to save cycles, unlike suffix truncation proper.  We
use a faster binary comparison instead.

Note that even pre-pg_upgrade'd v3 indexes make use of these
optimizations.  Benchmarking has shown that even v3 indexes benefit,
despite the fact that suffix truncation will only truncate non-key
attributes in INCLUDE indexes.  Grouping relatively similar tuples
together is beneficial in and of itself, since it reduces the number of
leaf pages that must be accessed by subsequent index scans.
---
 src/backend/access/nbtree/Makefile      |   2 +-
 src/backend/access/nbtree/README        |  47 +-
 src/backend/access/nbtree/nbtinsert.c   | 294 +---------
 src/backend/access/nbtree/nbtsplitloc.c | 750 ++++++++++++++++++++++++
 src/backend/access/nbtree/nbtutils.c    |  49 ++
 src/include/access/nbtree.h             |  15 +-
 6 files changed, 863 insertions(+), 294 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtsplitloc.c

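Before the diffs themselves, a small sketch (not part of the patch) of the
fillfactor-weighted delta that candidate split points are sorted by, since
the commit message only describes it in words.  The function name
example_split_delta and the byte counts in the comment are invented for
illustration; the weighting itself mirrors _bt_deltasortsplits() further
down.

#include <stdbool.h>

/*
 * Illustrative only: the delta assigned to one candidate split point.
 * With usemult (e.g. a rightmost leaf split and leaffillfactor = 90, so
 * fillfactormult = 0.90), leftfree = 1000 and rightfree = 7000 give
 * |0.90*1000 - 0.10*7000| = 200, while an even split with
 * leftfree = rightfree = 4000 gives |0.90*4000 - 0.10*4000| = 3200.
 * The lopsided split therefore sorts earlier, leaving the left page about
 * fillfactor% full, which is the rightmost-page behavior described in
 * _bt_findsplitloc()'s header comment.
 */
static int
example_split_delta(int leftfree, int rightfree,
					double fillfactormult, bool usemult)
{
	double		delta;

	if (usemult)
		delta = fillfactormult * leftfree -
			(1.0 - fillfactormult) * rightfree;
	else
		delta = leftfree - rightfree;	/* plain 50:50 balance */

	return (int) (delta < 0 ? -delta : delta);
}
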
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bbb21d235c..9aab9cf64a 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = nbtcompare.o nbtinsert.o nbtpage.o nbtree.o nbtsearch.o \
-       nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
+       nbtsplitloc.o nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 9c0b4718b6..325bd25c29 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -155,9 +155,9 @@ Lehman and Yao assume fixed-size keys, but we must deal with
 variable-size keys.  Therefore there is not a fixed maximum number of
 keys per page; we just stuff in as many as will fit.  When we split a
 page, we try to equalize the number of bytes, not items, assigned to
-each of the resulting pages.  Note we must include the incoming item in
-this calculation, otherwise it is possible to find that the incoming
-item doesn't fit on the split page where it needs to go!
+pages (though suffix truncation is also considered).  Note we must include
+the incoming item in this calculation, otherwise it is possible to find
+that the incoming item doesn't fit on the split page where it needs to go!
 
 The Deletion Algorithm
 ----------------------
@@ -657,6 +657,47 @@ variable-length types, such as text.  An opclass support function could
 manufacture the shortest possible key value that still correctly separates
 each half of a leaf page split.
 
+There are sophisticated criteria for choosing a leaf page split point.  The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split.  The
+choice of leaf split point can be thought of as a choice among points
+*between* items on the page to be split, at least if you pretend that the
+incoming tuple was placed on the page already (you have to pretend because
+there won't actually be enough space for it on the page).  Choosing the
+split point between two index tuples where the first non-equal attribute
+appears as early as possible results in truncating away as many suffix
+attributes as possible.  Evenly balancing space between the two halves of the
+split is usually the first concern, but even small adjustments in the
+precise split point can allow truncation to be far more effective.
+
+Suffix truncation is primarily valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective.  Even truncation that doesn't make pivot tuples
+smaller due to alignment still prevents pivot tuples from being more
+restrictive than truly necessary in how they describe which values belong
+on which pages.
+
+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page.  We choose the split point whose downlink (the tuple
+inserted into the parent) is the smallest one available within an acceptable
+range of the fillfactor-wise optimal split point.  This idea also comes
+from the Prefix B-Tree paper.  This process has much in common with what
+happens at the leaf level to make suffix truncation effective.  The overall
+effect is that suffix truncation tends to produce smaller, more
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, smaller pivot tuples
+end up in the root page, delaying root page splits.
+
+Logical duplicates are given special consideration.  The logic for
+selecting a split point goes to great lengths to avoid having duplicates
+span more than one page, and almost always manages to pick a split point
+between two user-key-distinct tuples, accepting a completely lopsided split
+if it must.  When a page that's already full of duplicates must be split,
+the fallback strategy assumes that duplicates are mostly inserted in
+ascending heap TID order.  The page is split in a way that leaves the left
+half of the page mostly full, and the right half of the page mostly empty.
+
 Notes About Data Representation
 -------------------------------
 
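To make the new README text concrete: at the leaf level, a candidate split
point's "penalty" can be thought of as the number of leading key attributes
the new high key must keep in order to tell the last tuple on the left apart
from the first tuple on the right.  The toy function below is not part of
the patch and works on plain integer keys purely for illustration; the real
code compares whole index tuples, and per the commit message it uses a cheap
binary comparison rather than authoritative scankey comparisons.

/*
 * Illustrative only: a toy "penalty" for a candidate leaf split point,
 * using arrays of integer keys instead of real index tuples.  Lower is
 * better: it is the number of leading attributes a new high key would
 * need in order to distinguish the tuples on either side of the split.
 */
static int
example_leaf_split_penalty(const int *lastleft, const int *firstright,
						   int nkeyatts)
{
	for (int attno = 0; attno < nkeyatts; attno++)
	{
		if (lastleft[attno] != firstright[attno])
			return attno + 1;	/* keep attributes 1 .. attno + 1 */
	}

	/* all user attributes equal: a heap TID would have to be appended */
	return nkeyatts + 1;
}

For example, with a two-attribute index, a split between tuples that already
differ on the first attribute scores 1, while a split anywhere inside a run
of entirely equal keys scores 3; that difference is what steers the default
strategy toward the more distinguishing split point when one is available
within the split interval.
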
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 818683ac2e..907dcb36be 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,26 +28,6 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
-typedef struct
-{
-	/* context data for _bt_checksplitloc */
-	Size		newitemsz;		/* size of new item to be inserted */
-	int			fillfactor;		/* needed when splitting rightmost page */
-	bool		is_leaf;		/* T if splitting a leaf page */
-	bool		is_rightmost;	/* T if splitting a rightmost page */
-	OffsetNumber newitemoff;	/* where the new item is to be inserted */
-	int			leftspace;		/* space available for items on left page */
-	int			rightspace;		/* space available for items on right page */
-	int			olddataitemstotal;	/* space taken by old items */
-
-	bool		have_split;		/* found a valid split? */
-
-	/* these fields valid only if have_split is true */
-	bool		newitemonleft;	/* new item on left or right of best split */
-	OffsetNumber firstright;	/* best split point */
-	int			best_delta;		/* best size delta so far */
-} FindSplitData;
-
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -76,13 +56,6 @@ static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
-static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft);
-static void _bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright, bool newitemonleft,
-				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key,
@@ -324,7 +297,9 @@ top:
  * Sets state in itup_key sufficient for later _bt_findinsertloc() call to
  * reuse most of the work of our initial binary search to find conflicting
  * tuples.  This won't be usable if caller's tuple is determined to not belong
- * on buf following scantid being filled-in.
+ * on buf following scantid being filled-in, but that should be very rare in
+ * practice, since the logic for choosing a leaf split point works hard to
+ * avoid splitting within a group of duplicates.
  *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
@@ -913,8 +888,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
  *
  *		This recursive procedure does the following things:
  *
- *			+  if necessary, splits the target page (making sure that the
- *			   split is equitable as far as post-insert free space goes).
+ *			+  if necessary, splits the target page.
  *			+  inserts the tuple.
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
@@ -1009,7 +983,7 @@ _bt_insertonpg(Relation rel,
 
 		/* Choose the split point */
 		firstright = _bt_findsplitloc(rel, page,
-									  newitemoff, itemsz,
+									  newitemoff, itemsz, itup,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
@@ -1702,264 +1676,6 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	return rbuf;
 }
 
-/*
- *	_bt_findsplitloc() -- find an appropriate place to split a page.
- *
- * The idea here is to equalize the free space that will be on each split
- * page, *after accounting for the inserted tuple*.  (If we fail to account
- * for it, we might find ourselves with too little room on the page that
- * it needs to go into!)
- *
- * If the page is the rightmost page on its level, we instead try to arrange
- * to leave the left split page fillfactor% full.  In this way, when we are
- * inserting successively increasing keys (consider sequences, timestamps,
- * etc) we will end up with a tree whose pages are about fillfactor% full,
- * instead of the 50% full result that we'd get without this special case.
- * This is the same as nbtsort.c produces for a newly-created tree.  Note
- * that leaf and nonleaf pages use different fillfactors.
- *
- * We are passed the intended insert position of the new tuple, expressed as
- * the offsetnumber of the tuple it must go in front of.  (This could be
- * maxoff+1 if the tuple is to go at the end.)
- *
- * We return the index of the first existing tuple that should go on the
- * righthand page, plus a boolean indicating whether the new tuple goes on
- * the left or right page.  The bool is necessary to disambiguate the case
- * where firstright == newitemoff.
- */
-static OffsetNumber
-_bt_findsplitloc(Relation rel,
-				 Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft)
-{
-	BTPageOpaque opaque;
-	OffsetNumber offnum;
-	OffsetNumber maxoff;
-	ItemId		itemid;
-	FindSplitData state;
-	int			leftspace,
-				rightspace,
-				goodenough,
-				olddataitemstotal,
-				olddataitemstoleft;
-	bool		goodenoughfound;
-
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
-	newitemsz += sizeof(ItemIdData);
-
-	/* Total free space available on a btree page, after fixed overhead */
-	leftspace = rightspace =
-		PageGetPageSize(page) - SizeOfPageHeaderData -
-		MAXALIGN(sizeof(BTPageOpaqueData));
-
-	/* The right page will have the same high key as the old page */
-	if (!P_RIGHTMOST(opaque))
-	{
-		itemid = PageGetItemId(page, P_HIKEY);
-		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
-							 sizeof(ItemIdData));
-	}
-
-	/* Count up total space in data items without actually scanning 'em */
-	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
-
-	state.newitemsz = newitemsz;
-	state.is_leaf = P_ISLEAF(opaque);
-	state.is_rightmost = P_RIGHTMOST(opaque);
-	state.have_split = false;
-	if (state.is_leaf)
-		state.fillfactor = RelationGetFillFactor(rel,
-												 BTREE_DEFAULT_FILLFACTOR);
-	else
-		state.fillfactor = BTREE_NONLEAF_FILLFACTOR;
-	state.newitemonleft = false;	/* these just to keep compiler quiet */
-	state.firstright = 0;
-	state.best_delta = 0;
-	state.leftspace = leftspace;
-	state.rightspace = rightspace;
-	state.olddataitemstotal = olddataitemstotal;
-	state.newitemoff = newitemoff;
-
-	/*
-	 * Finding the best possible split would require checking all the possible
-	 * split points, because of the high-key and left-key special cases.
-	 * That's probably more work than it's worth; instead, stop as soon as we
-	 * find a "good-enough" split, where good-enough is defined as an
-	 * imbalance in free space of no more than pagesize/16 (arbitrary...) This
-	 * should let us stop near the middle on most pages, instead of plowing to
-	 * the end.
-	 */
-	goodenough = leftspace / 16;
-
-	/*
-	 * Scan through the data items and calculate space usage for a split at
-	 * each possible position.
-	 */
-	olddataitemstoleft = 0;
-	goodenoughfound = false;
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	for (offnum = P_FIRSTDATAKEY(opaque);
-		 offnum <= maxoff;
-		 offnum = OffsetNumberNext(offnum))
-	{
-		Size		itemsz;
-
-		itemid = PageGetItemId(page, offnum);
-		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
-
-		/*
-		 * Will the new item go to left or right of split?
-		 */
-		if (offnum > newitemoff)
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-		else if (offnum < newitemoff)
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		else
-		{
-			/* need to try it both ways! */
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		}
-
-		/* Abort scan once we find a good-enough choice */
-		if (state.have_split && state.best_delta <= goodenough)
-		{
-			goodenoughfound = true;
-			break;
-		}
-
-		olddataitemstoleft += itemsz;
-	}
-
-	/*
-	 * If the new item goes as the last item, check for splitting so that all
-	 * the old items go to the left page and the new item goes to the right
-	 * page.
-	 */
-	if (newitemoff > maxoff && !goodenoughfound)
-		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
-
-	/*
-	 * I believe it is not possible to fail to find a feasible split, but just
-	 * in case ...
-	 */
-	if (!state.have_split)
-		elog(ERROR, "could not find a feasible split point for index \"%s\"",
-			 RelationGetRelationName(rel));
-
-	*newitemonleft = state.newitemonleft;
-	return state.firstright;
-}
-
-/*
- * Subroutine to analyze a particular possible split choice (ie, firstright
- * and newitemonleft settings), and record the best split so far in *state.
- *
- * firstoldonright is the offset of the first item on the original page
- * that goes to the right page, and firstoldonrightsz is the size of that
- * tuple. firstoldonright can be > max offset, which means that all the old
- * items go to the left page and only the new item goes to the right page.
- * In that case, firstoldonrightsz is not used.
- *
- * olddataitemstoleft is the total size of all old items to the left of
- * firstoldonright.
- */
-static void
-_bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright,
-				  bool newitemonleft,
-				  int olddataitemstoleft,
-				  Size firstoldonrightsz)
-{
-	int			leftfree,
-				rightfree;
-	Size		firstrightitemsz;
-	bool		newitemisfirstonright;
-
-	/* Is the new item going to be the first item on the right page? */
-	newitemisfirstonright = (firstoldonright == state->newitemoff
-							 && !newitemonleft);
-
-	if (newitemisfirstonright)
-		firstrightitemsz = state->newitemsz;
-	else
-		firstrightitemsz = firstoldonrightsz;
-
-	/* Account for all the old tuples */
-	leftfree = state->leftspace - olddataitemstoleft;
-	rightfree = state->rightspace -
-		(state->olddataitemstotal - olddataitemstoleft);
-
-	/*
-	 * The first item on the right page becomes the high key of the left page;
-	 * therefore it counts against left space as well as right space. When
-	 * index has included attributes, then those attributes of left page high
-	 * key will be truncated leaving that page with slightly more free space.
-	 * However, that shouldn't affect our ability to find valid split
-	 * location, because anyway split location should exists even without high
-	 * key truncation.
-	 */
-	leftfree -= firstrightitemsz;
-
-	/* account for the new item */
-	if (newitemonleft)
-		leftfree -= (int) state->newitemsz;
-	else
-		rightfree -= (int) state->newitemsz;
-
-	/*
-	 * If we are not on the leaf level, we will be able to discard the key
-	 * data from the first item that winds up on the right page.
-	 */
-	if (!state->is_leaf)
-		rightfree += (int) firstrightitemsz -
-			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
-
-	/*
-	 * If feasible split point, remember best delta.
-	 */
-	if (leftfree >= 0 && rightfree >= 0)
-	{
-		int			delta;
-
-		if (state->is_rightmost)
-		{
-			/*
-			 * If splitting a rightmost page, try to put (100-fillfactor)% of
-			 * free space on left page. See comments for _bt_findsplitloc.
-			 */
-			delta = (state->fillfactor * leftfree)
-				- ((100 - state->fillfactor) * rightfree);
-		}
-		else
-		{
-			/* Otherwise, aim for equal free space on both sides */
-			delta = leftfree - rightfree;
-		}
-
-		if (delta < 0)
-			delta = -delta;
-		if (!state->have_split || delta < state->best_delta)
-		{
-			state->have_split = true;
-			state->newitemonleft = newitemonleft;
-			state->firstright = firstoldonright;
-			state->best_delta = delta;
-		}
-	}
-}
-
 /*
  * _bt_insert_parent() -- Insert downlink into parent after a page split.
  *
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
new file mode 100644
index 0000000000..cd24480634
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -0,0 +1,750 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsplitloc.c
+ *	  Choose split point code for Postgres btree implementation.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtsplitloc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "storage/lmgr.h"
+
+/* limits on split interval (default strategy only) */
+#define MAX_LEAF_INTERVAL			9
+#define MAX_INTERNAL_INTERVAL		18
+
+typedef enum
+{
+	/* strategy for searching through materialized list of split points */
+	SPLIT_DEFAULT,				/* give some weight to truncation */
+	SPLIT_MANY_DUPLICATES,		/* find minimally distinguishing point */
+	SPLIT_SINGLE_VALUE			/* leave left page almost full */
+} FindSplitStrat;
+
+typedef struct
+{
+	/* details of free space left by split */
+	int16		curdelta;		/* current leftfree/rightfree delta */
+	int16		leftfree;		/* space left on left page post-split */
+	int16		rightfree;		/* space left on right page post-split */
+
+	/* split point identifying fields (returned by _bt_findsplitloc) */
+	OffsetNumber firstoldonright;	/* first item on new right page */
+	bool		newitemonleft;	/* new item goes on left, or right? */
+
+} SplitPoint;
+
+typedef struct
+{
+	/* context data for _bt_recsplitloc */
+	Relation	rel;			/* index relation */
+	Page		page;			/* page undergoing split */
+	IndexTuple	newitem;		/* new item (cause of page split) */
+	Size		newitemsz;		/* size of newitem (includes line pointer) */
+	bool		is_leaf;		/* T if splitting a leaf page */
+	bool		is_rightmost;	/* T if splitting rightmost page on level */
+	OffsetNumber newitemoff;	/* where the new item is to be inserted */
+	int			leftspace;		/* space available for items on left page */
+	int			rightspace;		/* space available for items on right page */
+	int			olddataitemstotal;	/* space taken by old items */
+
+	/* candidate split point data */
+	int			maxsplits;		/* maximum number of splits */
+	int			nsplits;		/* current number of splits */
+	SplitPoint *splits;			/* all candidate split points for page */
+	int			interval;		/* current range of acceptable split points */
+} FindSplitData;
+
+static void _bt_recsplitloc(FindSplitData *state,
+				OffsetNumber firstoldonright, bool newitemonleft,
+				int olddataitemstoleft, Size firstoldonrightsz);
+static void _bt_deltasortsplits(FindSplitData *state, double fillfactormult,
+					bool usemult);
+static int	_bt_splitcmp(const void *arg1, const void *arg2);
+static OffsetNumber _bt_bestsplitloc(FindSplitData *state, int perfectpenalty,
+				 bool *newitemonleft);
+static int _bt_strategy(FindSplitData *state, SplitPoint *lowpage,
+			 SplitPoint *highpage, FindSplitStrat *strategy);
+static inline int _bt_split_penalty(FindSplitData *state, SplitPoint *split);
+static inline IndexTuple _bt_split_lastleft(FindSplitData *state,
+				   SplitPoint *split);
+static inline IndexTuple _bt_split_firstright(FindSplitData *state,
+					 SplitPoint *split);
+
+
+/*
+ *	_bt_findsplitloc() -- find an appropriate place to split a page.
+ *
+ * The main goal here is to equalize the free space that will be on each
+ * split page, *after accounting for the inserted tuple*.  (If we fail to
+ * account for it, we might find ourselves with too little room on the page
+ * that it needs to go into!)
+ *
+ * If the page is the rightmost page on its level, we instead try to arrange
+ * to leave the left split page fillfactor% full.  In this way, when we are
+ * inserting successively increasing keys (consider sequences, timestamps,
+ * etc) we will end up with a tree whose pages are about fillfactor% full,
+ * instead of the 50% full result that we'd get without this special case.
+ * This is the same as nbtsort.c produces for a newly-created tree.  Note
+ * that leaf and nonleaf pages use different fillfactors.  Note also that
+ * there are a number of further special cases where fillfactor is not
+ * applied in the standard way.
+ *
+ * We are passed the intended insert position of the new tuple, expressed as
+ * the offsetnumber of the tuple it must go in front of (this could be
+ * maxoff+1 if the tuple is to go at the end).  The new tuple itself is also
+ * passed, since it's needed to give some weight to how effective suffix
+ * truncation will be.  The implementation picks the split point that
+ * maximizes the effectiveness of suffix truncation from a small list of
+ * alternative candidate split points that leave each side of the split with
+ * about the same share of free space.  Suffix truncation is secondary to
+ * equalizing free space, except in cases with large numbers of duplicates.
+ * Note that it is always assumed that caller goes on to perform truncation,
+ * even with pg_upgrade'd indexes where that isn't actually the case
+ * (!heapkeyspace indexes).  See nbtree/README for more information about
+ * suffix truncation.
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating whether the new tuple goes on
+ * the left or right page.  The bool is necessary to disambiguate the case
+ * where firstright == newitemoff.
+ */
+OffsetNumber
+_bt_findsplitloc(Relation rel,
+				 Page page,
+				 OffsetNumber newitemoff,
+				 Size newitemsz,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	BTPageOpaque opaque;
+	int			leftspace,
+				rightspace,
+				olddataitemstotal,
+				olddataitemstoleft,
+				perfectpenalty,
+				leaffillfactor;
+	FindSplitData state;
+	FindSplitStrat strategy;
+	ItemId		itemid;
+	OffsetNumber offnum,
+				maxoff,
+				foundfirstright;
+	double		fillfactormult;
+	bool		usemult;
+	SplitPoint	lowpage,
+				highpage;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Total free space available on a btree page, after fixed overhead */
+	leftspace = rightspace =
+		PageGetPageSize(page) - SizeOfPageHeaderData -
+		MAXALIGN(sizeof(BTPageOpaqueData));
+
+	/* The right page will have the same high key as the old page */
+	if (!P_RIGHTMOST(opaque))
+	{
+		itemid = PageGetItemId(page, P_HIKEY);
+		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
+							 sizeof(ItemIdData));
+	}
+
+	/* Count up total space in data items before actually scanning 'em */
+	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
+	leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	state.rel = rel;
+	state.page = page;
+	state.newitem = newitem;
+	state.newitemsz = newitemsz;
+	state.is_leaf = P_ISLEAF(opaque);
+	state.is_rightmost = P_RIGHTMOST(opaque);
+	state.leftspace = leftspace;
+	state.rightspace = rightspace;
+	state.olddataitemstotal = olddataitemstotal;
+	state.newitemoff = newitemoff;
+
+	/*
+	 * maxsplits should never exceed maxoff because there will be at most as
+	 * many candidate split points as there are points _between_ tuples, once
+	 * you imagine that the new item is already on the original page (the
+	 * final number of splits may be slightly lower because not all points
+	 * between tuples will be legal).
+	 */
+	state.maxsplits = maxoff;
+	state.splits = palloc(sizeof(SplitPoint) * state.maxsplits);
+	state.nsplits = 0;
+
+	/*
+	 * Scan through the data items and calculate space usage for a split at
+	 * each possible position.  We start at the first data offset rather than
+	 * the second data offset to handle the "newitemoff == first data offset"
+	 * case (any other split whose firstoldonright is the first data offset
+	 * can't be legal, though, and so won't actually end up being recorded in
+	 * the first loop iteration).
+	 */
+	olddataitemstoleft = 0;
+
+	for (offnum = P_FIRSTDATAKEY(opaque);
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		Size		itemsz;
+
+		itemid = PageGetItemId(page, offnum);
+		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
+
+		/*
+		 * Will the new item go to left or right of split?
+		 */
+		if (offnum > newitemoff)
+			_bt_recsplitloc(&state, offnum, true, olddataitemstoleft, itemsz);
+		else if (offnum < newitemoff)
+			_bt_recsplitloc(&state, offnum, false, olddataitemstoleft, itemsz);
+		else
+		{
+			/* may need to record a split on one or both sides of new item */
+			_bt_recsplitloc(&state, offnum, true, olddataitemstoleft, itemsz);
+			_bt_recsplitloc(&state, offnum, false, olddataitemstoleft, itemsz);
+		}
+
+		olddataitemstoleft += itemsz;
+	}
+
+	/*
+	 * If the new item goes as the last item, record the split point that
+	 * leaves all the old items on the left page, and the new item on the
+	 * right page.  This is required because a split that leaves the new item
+	 * as the firstoldonright won't have been reached within the loop.
+	 */
+	Assert(olddataitemstoleft == olddataitemstotal);
+	if (newitemoff > maxoff)
+		_bt_recsplitloc(&state, newitemoff, false, olddataitemstotal, 0);
+
+	/*
+	 * I believe it is not possible to fail to find a feasible split, but just
+	 * in case ...
+	 */
+	if (state.nsplits == 0)
+		elog(ERROR, "could not find a feasible split point for index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	/*
+	 * Start search for a split point among list of legal split points.  Give
+	 * primary consideration to equalizing available free space in each half
+	 * of the split initially (start with default strategy), while applying
+	 * the rightmost-page fillfactor where appropriate.  Either of the two
+	 * other fallback strategies may be required for cases with many duplicates
+	 * around the original/space-optimal split point.
+	 *
+	 * Default strategy gives some weight to suffix truncation in deciding a
+	 * split point on leaf pages.  It attempts to select a split point where a
+	 * distinguishing attribute appears earlier in the new high key for the
+	 * left side of the split, in order to maximize the number of trailing
+	 * attributes that can be truncated away.  Only candidate split points
+	 * that imply an acceptable balance of free space on each side are
+	 * considered.
+	 */
+	if (!state.is_leaf)
+	{
+		/* fillfactormult only used on rightmost page */
+		usemult = state.is_rightmost;
+		fillfactormult = BTREE_NONLEAF_FILLFACTOR / 100.0;
+	}
+	else if (state.is_rightmost)
+	{
+		/* Rightmost leaf page --  fillfactormult always used */
+		usemult = true;
+		fillfactormult = leaffillfactor / 100.0;
+	}
+	else
+	{
+		/* Other leaf page.  50:50 page split. */
+		usemult = false;
+		/* fillfactormult not used, but be tidy */
+		fillfactormult = 0.50;
+	}
+
+	/*
+	 * Set an initial limit on the split interval/number of candidate split
+	 * points as appropriate.  The "Prefix B-Trees" paper refers to this as
+	 * sigma l for leaf splits and sigma b for internal ("branch") splits.
+	 * It's hard to provide a theoretical justification for the initial size
+	 * of the split interval, though it's clear that a small split interval
+	 * makes suffix truncation much more effective without noticeably
+	 * affecting space utilization over time.
+	 */
+	state.interval = Min(Max(1, state.nsplits * 0.05),
+						 state.is_leaf ? MAX_LEAF_INTERVAL :
+						 MAX_INTERNAL_INTERVAL);
+
+	/*
+	 * Save leftmost and rightmost splits for page before original ordinal
+	 * sort order is lost by delta/fillfactormult sort
+	 */
+	lowpage = state.splits[0];
+	highpage = state.splits[state.nsplits - 1];
+
+	/* Give split points a fillfactormult-wise delta, and sort on deltas */
+	_bt_deltasortsplits(&state, fillfactormult, usemult);
+
+	/*
+	 * Determine if default strategy/split interval will produce a
+	 * sufficiently distinguishing split, or if we should change strategies.
+	 * Alternative strategies change the range of split points that are
+	 * considered acceptable (split interval), and possibly change
+	 * fillfactormult, in order to deal with pages with a large number of
+	 * duplicates gracefully.
+	 *
+	 * Pass low and high splits for the entire page (including even newitem).
+	 * These are used when the initial split interval encloses split points
+	 * that are full of duplicates, and we need to consider if it's even
+	 * possible to avoid appending a heap TID.
+	 */
+	perfectpenalty = _bt_strategy(&state, &lowpage, &highpage, &strategy);
+
+	if (strategy == SPLIT_DEFAULT)
+	{
+		/*
+		 * Default strategy worked out (always works out with internal page).
+		 * Original split interval still stands.
+		 */
+	}
+
+	/*
+	 * Many duplicates strategy is used when a heap TID would otherwise be
+	 * appended, but the page isn't completely full of logical duplicates.
+	 *
+	 * The split interval is widened to include all legal candidate split
+	 * points.  There may be as few as two distinct values in the whole-page
+	 * split interval.  Many duplicates strategy has no hard requirements for
+	 * space utilization, though it still keeps the use of space balanced as a
+	 * non-binding secondary goal (perfect penalty is set so that the
+	 * first/lowest delta split points that's minimally distinguishing is
+	 * first/lowest delta split point that's minimally distinguishing is
+	 *
+	 * Single value strategy is used when it is impossible to avoid appending
+	 * a heap TID.  It arranges to leave the left page very full.  This
+	 * maximizes space utilization in cases where tuples with the same
+	 * attribute values span many pages.  Newly inserted duplicates will tend
+	 * to have higher heap TID values, so we'll end up splitting to the right
+	 * consistently.  (Single value strategy is harmless though not
+	 * particularly useful with !heapkeyspace indexes.)
+	 */
+	else if (strategy == SPLIT_MANY_DUPLICATES)
+	{
+		Assert(state.is_leaf);
+		/* No need to resort splits -- no change in fillfactormult/deltas */
+		state.interval = state.nsplits;
+	}
+	else if (strategy == SPLIT_SINGLE_VALUE)
+	{
+		Assert(state.is_leaf);
+		/* Split near the end of the page */
+		usemult = true;
+		fillfactormult = BTREE_SINGLEVAL_FILLFACTOR / 100.0;
+		/* Resort split points with new delta */
+		_bt_deltasortsplits(&state, fillfactormult, usemult);
+		/* Appending a heap TID is unavoidable, so interval of 1 is fine */
+		state.interval = 1;
+	}
+
+	/*
+	 * Search among acceptable split points (using final split interval) for
+	 * the entry that has the lowest penalty, and is therefore expected to
+	 * maximize fan-out.  Sets *newitemonleft for us.
+	 */
+	foundfirstright = _bt_bestsplitloc(&state, perfectpenalty, newitemonleft);
+	pfree(state.splits);
+
+	return foundfirstright;
+}
+
+/*
+ * Subroutine to record a particular point between two tuples (possibly the
+ * new item) on page (ie, combination of firstright and newitemonleft
+ * settings) in *state for later analysis.  This is also a convenient point
+ * to check if the split is legal (if it isn't, it won't be recorded).
+ *
+ * firstoldonright is the offset of the first item on the original page that
+ * goes to the right page, and firstoldonrightsz is the size of that tuple.
+ * firstoldonright can be > max offset, which means that all the old items go
+ * to the left page and only the new item goes to the right page.  In that
+ * case, firstoldonrightsz is not used.
+ *
+ * olddataitemstoleft is the total size of all old items to the left of the
+ * split point that is recorded here when legal.  Should not include
+ * newitemsz, since that is handled here.
+ */
+static void
+_bt_recsplitloc(FindSplitData *state,
+				OffsetNumber firstoldonright,
+				bool newitemonleft,
+				int olddataitemstoleft,
+				Size firstoldonrightsz)
+{
+	int16		leftfree,
+				rightfree;
+	Size		firstrightitemsz;
+	bool		newitemisfirstonright;
+
+	/* Is the new item going to be the first item on the right page? */
+	newitemisfirstonright = (firstoldonright == state->newitemoff
+							 && !newitemonleft);
+
+	if (newitemisfirstonright)
+		firstrightitemsz = state->newitemsz;
+	else
+		firstrightitemsz = firstoldonrightsz;
+
+	/* Account for all the old tuples */
+	leftfree = state->leftspace - olddataitemstoleft;
+	rightfree = state->rightspace -
+		(state->olddataitemstotal - olddataitemstoleft);
+
+	/*
+	 * The first item on the right page becomes the high key of the left page;
+	 * therefore it counts against left space as well as right space (we
+	 * cannot assume that suffix truncation will make it any smaller).  When
+	 * index has included attributes, then those attributes of left page high
+	 * key will be truncated leaving that page with slightly more free space.
+	 * However, that shouldn't affect our ability to find valid split
+	 * location, since we err in the direction of being pessimistic about free
+	 * space on the left half.  Besides, even when suffix truncation of
+	 * non-TID attributes occurs, the new high key often won't even be a
+	 * single MAXALIGN() quantum smaller than the firstright tuple it's based
+	 * on.
+	 *
+	 * If we are on the leaf level, assume that suffix truncation cannot avoid
+	 * adding a heap TID to the left half's new high key when splitting at the
+	 * leaf level.  In practice the new high key will often be smaller and
+	 * will rarely be larger, but conservatively assume the worst case.
+	 */
+	if (state->is_leaf)
+		leftfree -= (int16) (firstrightitemsz +
+							 MAXALIGN(sizeof(ItemPointerData)));
+	else
+		leftfree -= (int16) firstrightitemsz;
+
+	/* account for the new item */
+	if (newitemonleft)
+		leftfree -= (int16) state->newitemsz;
+	else
+		rightfree -= (int16) state->newitemsz;
+
+	/*
+	 * If we are not on the leaf level, we will be able to discard the key
+	 * data from the first item that winds up on the right page.
+	 */
+	if (!state->is_leaf)
+		rightfree += (int16) firstrightitemsz -
+			(int16) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
+
+	/* Record split if legal */
+	if (leftfree >= 0 && rightfree >= 0)
+	{
+		Assert(state->nsplits < state->maxsplits);
+
+		state->splits[state->nsplits].curdelta = 0;
+		state->splits[state->nsplits].leftfree = leftfree;
+		state->splits[state->nsplits].rightfree = rightfree;
+		state->splits[state->nsplits].firstoldonright = firstoldonright;
+		state->splits[state->nsplits].newitemonleft = newitemonleft;
+		state->nsplits++;
+	}
+}
+
+/*
+ * Subroutine to assign space deltas to materialized array of candidate split
+ * points based on current fillfactor, and to sort array using that fillfactor
+ */
+static void
+_bt_deltasortsplits(FindSplitData *state, double fillfactormult,
+					bool usemult)
+{
+	for (int i = 0; i < state->nsplits; i++)
+	{
+		SplitPoint *split = state->splits + i;
+		int16		delta;
+
+		if (usemult)
+			delta = fillfactormult * split->leftfree -
+				(1.0 - fillfactormult) * split->rightfree;
+		else
+			delta = split->leftfree - split->rightfree;
+
+		if (delta < 0)
+			delta = -delta;
+
+		/* Save delta */
+		split->curdelta = delta;
+	}
+
+	qsort(state->splits, state->nsplits, sizeof(SplitPoint), _bt_splitcmp);
+}
+
+/*
+ * qsort-style comparator used by _bt_deltasortsplits()
+ */
+static int
+_bt_splitcmp(const void *arg1, const void *arg2)
+{
+	SplitPoint *split1 = (SplitPoint *) arg1;
+	SplitPoint *split2 = (SplitPoint *) arg2;
+
+	if (split1->curdelta > split2->curdelta)
+		return 1;
+	if (split1->curdelta < split2->curdelta)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * Subroutine to find the "best" split point among an array of acceptable
+ * candidate split points, i.e. those that split the page without an
+ * excessively high delta between the free space left on each half.  The "best"
+ * split point is the split point with the lowest penalty among split points
+ * that fall within current/final split interval.  Penalty is an abstract
+ * score, with a definition that varies depending on whether we're splitting a
+ * leaf page or an internal page.  See _bt_split_penalty() for details.
+ *
+ * "perfectpenalty" is assumed to be the lowest possible penalty among
+ * candidate split points.  This allows us to return early without wasting
+ * cycles on calculating the first differing attribute for all candidate
+ * splits when that clearly cannot improve our choice (or when we only want a
+ * minimally distinguishing split point, and don't want to make the split any
+ * more unbalanced than is necessary).
+ *
+ * We return the index of the first existing tuple that should go on the right
+ * page, plus a boolean indicating if new item is on left of split point.
+ */
+static OffsetNumber
+_bt_bestsplitloc(FindSplitData *state, int perfectpenalty, bool *newitemonleft)
+{
+	int			bestpenalty,
+				lowsplit;
+	int			highsplit = Min(state->interval, state->nsplits);
+
+	/* No point in calculating penalty when there's only one choice */
+	if (state->nsplits == 1)
+	{
+		*newitemonleft = state->splits[0].newitemonleft;
+		return state->splits[0].firstoldonright;
+	}
+
+	bestpenalty = INT_MAX;
+	lowsplit = 0;
+	for (int i = lowsplit; i < highsplit; i++)
+	{
+		int			penalty;
+
+		penalty = _bt_split_penalty(state, state->splits + i);
+
+		if (penalty <= perfectpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+			break;
+		}
+
+		if (penalty < bestpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+		}
+	}
+
+	*newitemonleft = state->splits[lowsplit].newitemonleft;
+	return state->splits[lowsplit].firstoldonright;
+}
+
+/*
+ * Subroutine to decide whether split should use default strategy/initial
+ * split interval, or whether it should finish splitting the page using
+ * alternative strategies (this is only possible with leaf pages).
+ *
+ * Caller uses alternative strategy (or sticks with default strategy) based
+ * on how *strategy is set here.  Return value is "perfect penalty", which is
+ * passed to _bt_bestsplitloc() as a final constraint on how far caller is
+ * willing to go to avoid appending a heap TID in many duplicates mode (it
+ * also saves _bt_bestsplitloc() from wasting cycles).
+ */
+static int
+_bt_strategy(FindSplitData *state, SplitPoint *lowpage, SplitPoint *highpage,
+			 FindSplitStrat *strategy)
+{
+	IndexTuple	leftmost,
+				rightmost;
+	SplitPoint *lowinterval,
+			   *highinterval;
+	int			perfectpenalty;
+	int			highsplit = Min(state->interval, state->nsplits);
+	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
+
+	/* Assume that alternative strategy won't be used for now */
+	*strategy = SPLIT_DEFAULT;
+
+	/*
+	 * With internal pages there is no way to know in advance which candidate
+	 * split point has the smallest firstright tuple.  Inspecting tuple sizes
+	 * is cheap, though, so return the smallest conceivable penalty, making
+	 * caller exhaustively search for the smallest pivot within the interval.
+	 */
+	if (!state->is_leaf)
+		return MAXALIGN(sizeof(IndexTupleData) + 1);
+
+	/*
+	 * Use leftmost and rightmost tuples within current acceptable range of
+	 * split points (using current split interval)
+	 */
+	lowinterval = state->splits;
+	highinterval = state->splits + highsplit - 1;
+	leftmost = _bt_split_lastleft(state, lowinterval);
+	rightmost = _bt_split_firstright(state, highinterval);
+
+	/*
+	 * If initial split interval can produce a split point that is minimally
+	 * distinguishing (that avoids the need to append a heap TID to new
+	 * pivot), we're done.  Finish split with default strategy.
+	 */
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	if (perfectpenalty <= indnkeyatts)
+		return perfectpenalty;
+
+	/*
+	 * Work out how caller should finish split when even their "perfect"
+	 * penalty fails to avoid appending a heap TID to new pivot tuple.
+	 *
+	 * Use the leftmost split's lastleft tuple and the rightmost split's
+	 * firstright tuple to assess the whole page (including new item).
+	 */
+	leftmost = _bt_split_lastleft(state, lowpage);
+	rightmost = _bt_split_firstright(state, highpage);
+
+	/*
+	 * If page (including new item) has many duplicates but is not entirely
+	 * full of duplicates, a many duplicates strategy split will be performed
+	 * (caller will find and use minimally distinguishing split point).  If
+	 * page is entirely full of duplicates, a single value strategy split will
+	 * be performed (caller accepts that it must append heap TID, but splits
+	 * towards the end of the page).
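+	 *
+	 * For example, if the page (including the new item) holds hundreds of
+	 * tuples with key value 7 and just a few with key value 8, the
+	 * comparison below finds a distinguishing attribute, so many duplicates
+	 * mode is used.  If every tuple has key value 7, single value mode is
+	 * considered instead.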
+	 */
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	if (perfectpenalty <= indnkeyatts)
+	{
+		*strategy = SPLIT_MANY_DUPLICATES;
+
+		/*
+		 * Caller should settle on lowest delta split that manages to avoid
+		 * appending a heap TID -- caller should not try for a lower penalty,
+		 * now that it's not using original/small split interval
+		 */
+		return indnkeyatts;
+	}
+
+	/*
+	 * Single value strategy is only appropriate with ever-increasing heap
+	 * TIDs; otherwise, original default strategy split should proceed to
+	 * avoid pathological performance.  Use page high key to infer if this is
+	 * the rightmost page among pages that store the same duplicate value.
+	 * This should not prevent insertions of heap TIDs that are slightly out
+	 * of order from using single value strategy, since that's expected with
+	 * concurrent inserters of the same duplicate value.
+	 */
+	else if (state->is_rightmost)
+		*strategy = SPLIT_SINGLE_VALUE;
+	else
+	{
+		ItemId		itemid;
+		IndexTuple	hikey;
+
+		itemid = PageGetItemId(state->page, P_HIKEY);
+		hikey = (IndexTuple) PageGetItem(state->page, itemid);
+		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
+											 state->newitem);
+		if (perfectpenalty <= indnkeyatts)
+			*strategy = SPLIT_SINGLE_VALUE;
+		else
+		{
+			/*
+			 * Have caller finish split using default strategy, since page
+			 * does not appear to be the rightmost page for duplicates of the
+			 * value the page is filled with
+			 */
+		}
+	}
+
+	return perfectpenalty;
+}
+
+/*
+ * Subroutine to find penalty for caller's candidate split point.
+ *
+ * On leaf pages, penalty is the attribute number that distinguishes each side
+ * of a split.  It's the last attribute that needs to be included in the new
+ * high key for the left page.  It can be greater than the number of key
+ * attributes in cases where a heap TID will need to be appended during
+ * truncation.
+ *
+ * On internal pages, penalty is simply the size of the first item on the
+ * right half of the split (excluding ItemId overhead) which becomes the new
+ * high key for the left page.
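+ *
+ * For example, on a leaf page in an index with two key attributes, a split
+ * point whose lastleft and firstright tuples differ in their first
+ * attribute has penalty 1, while a split point that falls between two
+ * tuples that are complete duplicates has penalty 3 (a heap TID would have
+ * to be appended during truncation).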
+ */
+static inline int
+_bt_split_penalty(FindSplitData *state, SplitPoint *split)
+{
+	IndexTuple	lastleftuple;
+	IndexTuple	firstrighttuple;
+
+	firstrighttuple = _bt_split_firstright(state, split);
+
+	if (!state->is_leaf)
+		return IndexTupleSize(firstrighttuple);
+
+	lastleftuple = _bt_split_lastleft(state, split);
+
+	Assert(lastleftuple != firstrighttuple);
+	return _bt_keep_natts_fast(state->rel, lastleftuple, firstrighttuple);
+}
+
+/*
+ * Subroutine to get a lastleft IndexTuple for a split point from the page
+ */
+static inline IndexTuple
+_bt_split_lastleft(FindSplitData *state, SplitPoint *split)
+{
+	ItemId		itemid;
+
+	if (split->newitemonleft && split->firstoldonright == state->newitemoff)
+		return state->newitem;
+
+	itemid = PageGetItemId(state->page,
+						   OffsetNumberPrev(split->firstoldonright));
+	return (IndexTuple) PageGetItem(state->page, itemid);
+}
+
+/*
+ * Subroutine to get a firstright IndexTuple for a split point from the page
+ */
+static inline IndexTuple
+_bt_split_firstright(FindSplitData *state, SplitPoint *split)
+{
+	ItemId		itemid;
+
+	if (!split->newitemonleft && split->firstoldonright == state->newitemoff)
+		return state->newitem;
+
+	itemid = PageGetItemId(state->page, split->firstoldonright);
+	return (IndexTuple) PageGetItem(state->page, itemid);
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 15090b26d2..146de1b2e4 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -22,6 +22,7 @@
 #include "access/relscan.h"
 #include "miscadmin.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -2318,6 +2319,54 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	return keepnatts;
 }
 
+/*
+ * _bt_keep_natts_fast - fast, approximate variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location.  A naive bitwise approach to datum comparisons is used to
+ * save cycles.  This is inherently approximate, but usually provides the same
+ * answer as the authoritative approach that _bt_keep_natts takes, since the
+ * vast majority of types in Postgres cannot be equal according to any
+ * available opclass unless they're bitwise equal.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in the new pivot tuple.
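+ *
+ * For example, in an index on (a int4, b int4), comparing lastleft (5, 1)
+ * against firstright (5, 2) returns 2, since attribute 2 is the first
+ * attribute whose values differ.  Comparing (5, 1) against another (5, 1)
+ * tuple returns 3, indicating that only a heap TID can break the tie.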
+ */
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= keysz; attnum++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		att = TupleDescAttr(itupdesc, attnum - 1);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
+}
+
 /*
  *  _bt_check_natts() -- Verify tuple has expected number of attributes.
  *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9332bf4086..18068a8c1a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -159,11 +159,15 @@ typedef struct BTMetaPageData
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
  * a rightmost page; when splitting non-rightmost pages we try to
- * divide the data equally.
+ * divide the data equally.  When splitting a page that's entirely
+ * filled with a single value (duplicates), the effective leaf-page
+ * fillfactor is 96%, regardless of whether the page is a rightmost
+ * page.
  */
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#define BTREE_SINGLEVAL_FILLFACTOR	96
 
 /*
  *	In general, the btree code tries to localize its knowledge about
@@ -708,6 +712,13 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 
+/*
+ * prototypes for functions in nbtsplitloc.c
+ */
+extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
+				 OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+				 bool *newitemonleft);
+
 /*
  * prototypes for functions in nbtpage.c
  */
@@ -774,6 +785,8 @@ extern bool btproperty(Oid index_oid, int attno,
 		   bool *res, bool *isnull);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 			 IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
+					IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 				OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
-- 
2.17.1

Attachment: v14-0004-Add-split-after-new-tuple-optimization.patch
From 9cb5f0d9f275ddd41b746f7636eb357d31095df4 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 16:48:08 -0700
Subject: [PATCH v14 4/7] Add "split after new tuple" optimization.

Add additional heuristics to the algorithm for locating an optimal split
location.  New logic identifies localized monotonically increasing
values.  When this insertion pattern is detected, the page is split just
after the new item that provoked the page split (or leaf fillfactor is
applied, in the style of a rightmost page split).

Without this patch, affected cases will reliably leave leaf pages no
more than about 50% full, without future insertions ever making use of
the free space left behind.  50/50 page splits are only appropriate with
a pattern of truly random insertions, where the average space
utilization ends up at 65% - 70%.  This patch addresses the worst case
for space utilization, where leaf pages are unusually sparsely filled
despite the fact that there are never any dead tuples.

The optimization is very similar to the long established fillfactor
optimization used during rightmost page splits, where we usually leave
the new left side of the split 90% full.  Split-after-new-tuple page
splits target essentially the same case.  The splits targeted are those
at the rightmost point of a localized grouping of values, rather than
those at the rightmost point of the entire key space.  Localized
monotonically increasing insertion patterns are presumed to be fairly
common in real-world applications.
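As a rough illustration of the benefit: leaf pages left 90% full rather than
50% full store the same tuples in roughly 50/90 (about 56%) as many pages,
provided those pages never receive further insertions.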

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.
---
 src/backend/access/nbtree/nbtsplitloc.c | 227 +++++++++++++++++++++++-
 1 file changed, 224 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index cd24480634..7fed1c151e 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -69,6 +69,9 @@ static void _bt_recsplitloc(FindSplitData *state,
 static void _bt_deltasortsplits(FindSplitData *state, double fillfactormult,
 					bool usemult);
 static int	_bt_splitcmp(const void *arg1, const void *arg2);
+static bool _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
+					int leaffillfactor, bool *usemult);
+static bool _bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid);
 static OffsetNumber _bt_bestsplitloc(FindSplitData *state, int perfectpenalty,
 				 bool *newitemonleft);
 static int _bt_strategy(FindSplitData *state, SplitPoint *lowpage,
@@ -245,9 +248,10 @@ _bt_findsplitloc(Relation rel,
 	 * Start search for a split point among list of legal split points.  Give
 	 * primary consideration to equalizing available free space in each half
 	 * of the split initially (start with default strategy), while applying
-	 * rightmost where appropriate.  Either of the two other fallback
-	 * strategies may be required for cases with a large number of duplicates
-	 * around the original/space-optimal split point.
+	 * rightmost and split-after-new-item optimizations where appropriate.
+	 * Either of the two other fallback strategies may be required for cases
+	 * with a large number of duplicates around the original/space-optimal
+	 * split point.
 	 *
 	 * Default strategy gives some weight to suffix truncation in deciding a
 	 * split point on leaf pages.  It attempts to select a split point where a
@@ -269,6 +273,44 @@ _bt_findsplitloc(Relation rel,
 		usemult = true;
 		fillfactormult = leaffillfactor / 100.0;
 	}
+	else if (_bt_afternewitemoff(&state, maxoff, leaffillfactor, &usemult))
+	{
+		/*
+		 * New item inserted at rightmost point among a localized grouping on
+		 * a leaf page -- apply "split after new item" optimization, either by
+		 * applying leaf fillfactor multiplier, or by choosing the exact split
+		 * point that leaves the new item as last on the left. (usemult is set
+		 * for us.)
+		 */
+		if (usemult)
+		{
+			/* fillfactormult should be set based on leaf fillfactor */
+			fillfactormult = leaffillfactor / 100.0;
+		}
+		else
+		{
+			/* find precise split point after newitemoff */
+			for (int i = 0; i < state.nsplits; i++)
+			{
+				SplitPoint *split = state.splits + i;
+
+				if (split->newitemonleft &&
+					newitemoff == split->firstoldonright)
+				{
+					pfree(state.splits);
+					*newitemonleft = true;
+					return newitemoff;
+				}
+			}
+
+			/*
+			 * Cannot legally split after newitemoff; proceed with split
+			 * without using fillfactor multiplier.  This is defensive, and
+			 * should never be needed in practice.
+			 */
+			fillfactormult = 0.50;
+		}
+	}
 	else
 	{
 		/* Other leaf page.  50:50 page split. */
@@ -512,6 +554,185 @@ _bt_splitcmp(const void *arg1, const void *arg2)
 	return 0;
 }
 
+/*
+ * Subroutine to determine whether or not the page should be split immediately
+ * after the would-be original page offset for the new/incoming tuple.  This
+ * is appropriate when there is a pattern of localized monotonically
+ * increasing insertions into a composite index, grouped by one or more
+ * leading attribute values.  This is prevalent in many real world
+ * applications.  Consider the example of a composite index on '(invoice_id,
+ * item_no)', where the item_no for each invoice is an identifier assigned in
+ * ascending order (invoice_id could itself be assigned in monotonically
+ * increasing order, but that shouldn't matter).  Without this optimization,
+ * approximately 50% of space in leaf pages will be wasted by 50:50/!usemult
+ * page splits.  With this optimization, space utilization will be close to
+ * that of a similar index where all tuple insertions modify the current
+ * rightmost leaf page in the index (i.e. typically 90% for leaf pages).
+ *
+ * When the optimization is applied, the new/incoming tuple becomes the last
+ * tuple on the new left page.  (Actually, newitemoff > maxoff cases often use
+ * this optimization within indexes where monotonically increasing insertions
+ * of each grouping come in multiple "bursts" over time, such as a composite
+ * index on '(supplier_id, invoice_id, item_no)'.  Caller applies leaf
+ * fillfactor in the style of a rightmost leaf page split when netitemoff is
+ * at or very near the end of the original page.)
+ *
+ * This optimization may leave extra free space remaining on the rightmost
+ * page of a "most significant column" grouping of tuples if that grouping
+ * never ends up having future insertions that use the free space.  That
+ * effect is self-limiting; a future grouping that becomes the "nearest on the
+ * right" grouping of the affected grouping usually puts the extra free space
+ * to good use.  In general, it's important to avoid a pattern of pathological
+ * page splits that consistently do the wrong thing.
+ *
+ * Caller uses optimization when routine returns true, though the exact action
+ * taken by caller varies.  Caller uses original leaf page fillfactor in
+ * standard way rather than using the new item offset directly when *usemult
+ * was also set to true here.  Otherwise, caller applies optimization by
+ * locating the legal split point that makes the new tuple the very last tuple
+ * on the left side of the split.
+ */
+static bool
+_bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
+					int leaffillfactor, bool *usemult)
+{
+	int16		nkeyatts;
+	ItemId		itemid;
+	IndexTuple	tup;
+	int			keepnatts;
+
+	Assert(!state->is_rightmost);
+
+	nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
+	/* Assume leaffillfactor will be used by caller for now */
+	*usemult = true;
+
+	/* Single key indexes not considered here */
+	if (nkeyatts == 1)
+		return false;
+
+	/* Ascending insertion pattern never inferred when new item is first */
+	if (state->newitemoff == P_FIRSTKEY)
+		return false;
+
+	/*
+	 * Avoid applying optimization when tuples are wider than a tuple
+	 * consisting of two non-NULL int8/int64 attributes (or four non-NULL
+	 * int4/int32 attributes)
+	 */
+	if (state->newitemsz >
+		MAXALIGN(sizeof(IndexTupleData) + sizeof(int64) * 2) +
+		sizeof(ItemIdData))
+		return false;
+
+	/*
+	 * Only apply optimization on pages with equisized tuples.  Surmise that
+	 * page has equisized tuples when page layout is consistent with having
+	 * maxoff-1 non-pivot tuples that are all the same size as the newly
+	 * inserted tuple (note that the possibly-truncated high key isn't counted
+	 * in olddataitemstotal).
+	 */
+	if (state->newitemsz * (maxoff - 1) != state->olddataitemstotal)
+		return false;
+
+	/*
+	 * At least the first attribute's value must be equal to the corresponding
+	 * value in previous tuple to apply optimization.  New item cannot be a
+	 * duplicate, either.
+	 *
+	 * Handle case where new item is to the right of all items on the existing
+	 * page.  This is suggestive of monotonically increasing insertions in
+	 * itself, so the "heap TID adjacency" test is not applied here.
+	 * Concurrent insertions from backends associated with the same grouping
+	 * or sub-grouping should still have the optimization applied; if the
+	 * grouping is rather large, splits will consistently end up here.
+	 */
+	if (state->newitemoff > maxoff)
+	{
+		itemid = PageGetItemId(state->page, maxoff);
+		tup = (IndexTuple) PageGetItem(state->page, itemid);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+
+		if (keepnatts > 1 && keepnatts <= nkeyatts)
+			return true;
+
+		return false;
+	}
+
+	/*
+	 * When item isn't last (or first) on page, but is deemed suitable for the
+	 * optimization, caller splits at the point immediately after the would-be
+	 * position of the new item, and immediately before the item after the new
+	 * item.
+	 *
+	 * "Low cardinality leading column, high cardinality suffix column"
+	 * indexes with a random insertion pattern (e.g. an index with a boolean
+	 * column, such as an index on '(book_is_in_print, book_isbn)') present us
+	 * with a risk of consistently misapplying the optimization.  We're
+	 * willing to accept very occasional misapplication of the optimization,
+	 * provided the cases where we get it wrong are rare and self-limiting.
+	 * Heap TID adjacency strongly suggests that the item just to the left was
+	 * inserted very recently, which prevents most misfirings.  Besides, all
+	 * inappropriate cases triggered at this point will still split in the
+	 * middle of the page on average.
+	 */
+	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
+	tup = (IndexTuple) PageGetItem(state->page, itemid);
+	/* Do cheaper test first */
+	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+		return false;
+	/* Check same conditions as rightmost item case, too */
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+
+	/*
+	 * Don't allow caller to split after a new item when it will result in a
+	 * split point to the right of the point that a leaf fillfactor split
+	 * would use -- have caller apply leaf fillfactor instead.  There is no
+	 * advantage to being very aggressive in any case.  It may not be legal to
+	 * split very close to maxoff.
+	 */
+	if (keepnatts > 1 && keepnatts <= nkeyatts)
+	{
+		double		interp = (double) state->newitemoff / ((double) maxoff + 1);
+		double		leaffillfactormult = (double) leaffillfactor / 100.0;
+
+		if (interp <= leaffillfactormult)
+			*usemult = false;
+
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Subroutine for determining if two heap TIDs are "adjacent".
+ *
+ * Adjacent means that the high TID is very likely to have been inserted into
+ * heap relation immediately after the low TID, probably by the same
+ * transaction.
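+ *
+ * For example, (1000,17) and (1000,20) are considered adjacent (same heap
+ * block), as are (1000,200) and (1001,1) (next heap block, first offset
+ * number), whereas (1000,17) and (1002,1) are not.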
+ */
+static bool
+_bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid)
+{
+	BlockNumber lowblk,
+				highblk;
+
+	lowblk = ItemPointerGetBlockNumber(lowhtid);
+	highblk = ItemPointerGetBlockNumber(highhtid);
+
+	/* Make optimistic assumption of adjacency when heap blocks match */
+	if (lowblk == highblk)
+		return true;
+
+	/* When heap block one up, second offset should be FirstOffsetNumber */
+	if (lowblk + 1 == highblk &&
+		ItemPointerGetOffsetNumber(highhtid) == FirstOffsetNumber)
+		return true;
+
+	return false;
+}
+
 /*
  * Subroutine to find the "best" split point among an array of acceptable
  * candidate split points that split without there being an excessively high
-- 
2.17.1

Attachment: v14-0005-Add-high-key-continuescan-optimization.patch
From fa72a1c23470317acb7d4c50f5dc052c81fb4a1c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 12 Nov 2018 13:11:21 -0800
Subject: [PATCH v14 5/7] Add high key "continuescan" optimization.

Teach B-Tree forward index scans to check the high key before moving to
the next page in the hopes of finding that it isn't actually necessary
to move to the next page.  We already opportunistically force a key
check of the last item on leaf pages, even when it's clear that it
cannot be returned to the scan due to being dead-to-all, for the same
reason.  Since forcing the last item to be key checked no longer makes
any difference in the case of forward scans, the existing extra key
check is now only used for backwards scans.  Like the existing check,
the new check won't always work out, but that seems like an acceptable
price to pay.

The new approach is more effective than just checking non-pivot tuples,
especially with composite indexes and non-unique indexes.  The high key
represents an upper bound on all values that can appear on the page,
which is often greater than whatever tuple happens to appear last at the
time of the check.  Also, suffix truncation's new logic for picking a
split point will often result in high keys that are relatively
dissimilar to the other (non-pivot) tuples on the page, and therefore
more likely to indicate that the scan need not proceed to the next page.
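For example, a forward scan with a qual such as "WHERE a = 5" that reaches
the end of a leaf page whose high key has a = 6 in its first attribute can
set continuescan to false and skip the right sibling page entirely.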

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.
---
 src/backend/access/nbtree/nbtsearch.c | 23 +++++++--
 src/backend/access/nbtree/nbtutils.c  | 70 +++++++++++++++++++--------
 2 files changed, 68 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 2999971cfd..aab6a7399d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1364,6 +1364,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
 	}
 
+	continuescan = true;		/* default assumption */
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1412,16 +1413,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 				_bt_saveitem(so, itemIndex, offnum, itup);
 				itemIndex++;
 			}
+			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
-			{
-				/* there can't be any more matches, so stop */
-				so->currPos.moreRight = false;
 				break;
-			}
 
 			offnum = OffsetNumberNext(offnum);
 		}
 
+		/*
+		 * Forward scans need not visit the page to the right when the high
+		 * key indicates that no more matches will be found there.
+		 *
+		 * Checking the high key like this works out more often than you might
+		 * think.  Leaf page splits pick a split point between the two most
+		 * dissimilar tuples (this is weighed against the need to evenly share
+		 * free space).  Leaf pages with high key attribute values that can
+		 * only appear on non-pivot tuples on the right sibling page are
+		 * common.
+		 */
+		if (continuescan && !P_RIGHTMOST(opaque))
+			_bt_checkkeys(scan, page, P_HIKEY, dir, &continuescan);
+
+		if (!continuescan)
+			so->currPos.moreRight = false;
+
 		Assert(itemIndex <= MaxIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 146de1b2e4..7c795c6bb6 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -48,7 +48,7 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
 static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
-					 IndexTuple tuple, TupleDesc tupdesc,
+					 IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
 static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
 			   IndexTuple firstright, BTScanInsert itup_key);
@@ -1362,11 +1362,14 @@ _bt_mark_scankey_required(ScanKey skey)
  *
  * scan: index scan descriptor (containing a search-type scankey)
  * page: buffer page containing index tuple
- * offnum: offset number of index tuple (must be a valid item!)
+ * offnum: offset number of index tuple (must be hikey or a valid item!)
  * dir: direction we are scanning in
  * continuescan: output parameter (will be set correctly in all cases)
  *
- * Caller must hold pin and lock on the index page.
+ * Caller must hold pin and lock on the index page.  Caller can pass a high
+ * key offnum in the hopes of discovering that the scan need not continue on
+ * to a page to the right.  We don't currently bother limiting high key
+ * comparisons to SK_BT_REQFWD scan keys.
  */
 IndexTuple
 _bt_checkkeys(IndexScanDesc scan,
@@ -1376,6 +1379,7 @@ _bt_checkkeys(IndexScanDesc scan,
 	ItemId		iid = PageGetItemId(page, offnum);
 	bool		tuple_alive;
 	IndexTuple	tuple;
+	int			tupnatts;
 	TupleDesc	tupdesc;
 	BTScanOpaque so;
 	int			keysz;
@@ -1389,24 +1393,21 @@ _bt_checkkeys(IndexScanDesc scan,
 	 * killed tuple as not passing the qual.  Most of the time, it's a win to
 	 * not bother examining the tuple's index keys, but just return
 	 * immediately with continuescan = true to proceed to the next tuple.
-	 * However, if this is the last tuple on the page, we should check the
-	 * index keys to prevent uselessly advancing to the next page.
+	 * However, if this is the first tuple on the page, and we're doing a
+	 * backward scan, we should check the index keys to prevent uselessly
+	 * advancing to the page to the left.  This is similar to the high key
+	 * optimization used by forward scan callers.
 	 */
 	if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
 	{
-		/* return immediately if there are more tuples on the page */
-		if (ScanDirectionIsForward(dir))
-		{
-			if (offnum < PageGetMaxOffsetNumber(page))
-				return NULL;
-		}
-		else
-		{
-			BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-			if (offnum > P_FIRSTDATAKEY(opaque))
-				return NULL;
-		}
+		/* forward scan callers check high key instead */
+		Assert(offnum >= P_FIRSTDATAKEY(opaque));
+		if (ScanDirectionIsForward(dir))
+			return NULL;
+		else if (offnum > P_FIRSTDATAKEY(opaque))
+			return NULL;
 
 		/*
 		 * OK, we want to check the keys so we can set continuescan correctly,
@@ -1418,6 +1419,7 @@ _bt_checkkeys(IndexScanDesc scan,
 		tuple_alive = true;
 
 	tuple = (IndexTuple) PageGetItem(page, iid);
+	tupnatts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
 
 	tupdesc = RelationGetDescr(scan->indexRelation);
 	so = (BTScanOpaque) scan->opaque;
@@ -1429,11 +1431,24 @@ _bt_checkkeys(IndexScanDesc scan,
 		bool		isNull;
 		Datum		test;
 
-		Assert(key->sk_attno <= BTreeTupleGetNAtts(tuple, scan->indexRelation));
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (key->sk_attno > tupnatts)
+		{
+			Assert(offnum == P_HIKEY);
+			Assert(ScanDirectionIsForward(dir));
+			continue;
+		}
+
 		/* row-comparison keys need special processing */
 		if (key->sk_flags & SK_ROW_HEADER)
 		{
-			if (_bt_check_rowcompare(key, tuple, tupdesc, dir, continuescan))
+			if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+									 continuescan))
 				continue;
 			return NULL;
 		}
@@ -1564,8 +1579,8 @@ _bt_checkkeys(IndexScanDesc scan,
  * This is a subroutine for _bt_checkkeys, which see for more info.
  */
 static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
-					 ScanDirection dir, bool *continuescan)
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
 {
 	ScanKey		subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
 	int32		cmpresult = 0;
@@ -1582,6 +1597,19 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
 
 		Assert(subkey->sk_flags & SK_ROW_MEMBER);
 
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (subkey->sk_attno > tupnatts)
+		{
+			Assert(ScanDirectionIsForward(dir));
+			cmpresult = 0;
+			continue;
+		}
+
 		datum = index_getattr(tuple,
 							  subkey->sk_attno,
 							  tupdesc,
-- 
2.17.1

Attachment: v14-0002-Make-heap-TID-a-tie-breaker-nbtree-index-column.patch
From 8d6a512d91099e7ba298734ba3de858ad89813de Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v14 2/7] Make heap TID a tie-breaker nbtree index column.

Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique.  This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.
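For example, when a leaf page split in an index on (country, city, street)
separates a lastleft tuple of ('UK', 'London', 'Abbey Road') from a
firstright tuple of ('UK', 'Oxford', 'Banbury Road'), the new pivot only
needs its leading two attributes to separate the halves; the street
attribute and the heap TID can both be truncated away.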

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tie-breaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing the newer/more strict invariants with
version 4 indexes.

A later patch will enhance the logic used by nbtree to pick a split
point.  This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at.  Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d item pointer in a new
high key during leaf page splits.  The user-facing definition of the
"1/3 of a page" restriction is already imprecise, and so does not need
to be revised.  However, there should be a compatibility note in the v12
release notes.  The new maximum allowed size is 2704 bytes on 64-bit
systems, down from 2712 bytes.
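The 8 byte difference is exactly MAXALIGN(sizeof(ItemPointerData)) on such
platforms.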
---
 contrib/amcheck/expected/check_btree.out     |   5 +-
 contrib/amcheck/sql/check_btree.sql          |   5 +-
 contrib/amcheck/verify_nbtree.c              | 344 +++++++++++++--
 contrib/pageinspect/btreefuncs.c             |   2 +-
 contrib/pageinspect/expected/btree.out       |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out |  10 +-
 doc/src/sgml/indices.sgml                    |  24 +-
 src/backend/access/common/indextuple.c       |   6 +-
 src/backend/access/nbtree/README             | 160 ++++---
 src/backend/access/nbtree/nbtinsert.c        | 326 +++++++++-----
 src/backend/access/nbtree/nbtpage.c          | 196 ++++++---
 src/backend/access/nbtree/nbtree.c           |   2 +-
 src/backend/access/nbtree/nbtsearch.c        | 103 ++++-
 src/backend/access/nbtree/nbtsort.c          |  91 ++--
 src/backend/access/nbtree/nbtutils.c         | 433 +++++++++++++++++--
 src/backend/access/nbtree/nbtxlog.c          |  43 +-
 src/backend/access/rmgrdesc/nbtdesc.c        |   8 -
 src/backend/utils/sort/tuplesort.c           |  13 +-
 src/include/access/nbtree.h                  | 215 +++++++--
 src/include/access/nbtxlog.h                 |  35 +-
 src/test/regress/expected/btree_index.out    |  34 +-
 src/test/regress/expected/create_index.out   |  13 +-
 src/test/regress/expected/dependency.out     |   4 +-
 src/test/regress/expected/event_trigger.out  |   4 +-
 src/test/regress/expected/foreign_data.out   |   8 +-
 src/test/regress/expected/rowsecurity.out    |   4 +-
 src/test/regress/sql/btree_index.sql         |  37 +-
 src/test/regress/sql/create_index.sql        |  14 +-
 28 files changed, 1611 insertions(+), 530 deletions(-)

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index ef5c9e1a1c..1e6079ddd2 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -130,9 +130,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true);
  bt_index_parent_check 
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index 0ad1631476..3f1e0d17ef 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -82,9 +82,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true);
 
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 053ac9d192..0a005afa34 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -45,6 +45,8 @@ PG_MODULE_MAGIC;
  * block per level, which is bound by the range of BlockNumber:
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
 
 /*
  * State associated with verifying a B-Tree index
@@ -66,6 +68,8 @@ typedef struct BtreeCheckState
 	/* B-Tree Index Relation and associated heap relation */
 	Relation	rel;
 	Relation	heaprel;
+	/* rel is heapkeyspace index? */
+	bool		heapkeyspace;
 	/* ShareLock held on heap/index, rather than AccessShareLock? */
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
@@ -122,7 +126,7 @@ static void bt_index_check_internal(Oid indrelid, bool parentcheck,
 						bool heapallindexed);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -137,17 +141,22 @@ static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 						   IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
+static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
 					 BTScanInsert key,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 BTScanInsert key,
-					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   BTScanInsert key,
-							   Page nontarget,
-							   OffsetNumber upperbound);
+static inline bool invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							 BTScanInsert key,
+							 Page nontarget,
+							 OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline BTScanInsert bt_mkscankey_minusinfkey(Relation rel,
+													IndexTuple itup);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+							IndexTuple itup, bool nonpivot);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -204,6 +213,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	Oid			heapid;
 	Relation	indrel;
 	Relation	heaprel;
+	bool		heapkeyspace;
 	LOCKMODE	lockmode;
 
 	if (parentcheck)
@@ -254,7 +264,9 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	btree_index_checkable(indrel);
 
 	/* Check index, possibly against table it is an index on */
-	bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
+	heapkeyspace = _bt_heapkeyspace(indrel);
+	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
+						 heapallindexed);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -324,8 +336,8 @@ btree_index_checkable(Relation rel)
  * parent/child check cannot be affected.)
  */
 static void
-bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
-					 bool heapallindexed)
+bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
+					 bool readonly, bool heapallindexed)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -346,6 +358,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
 	state = palloc0(sizeof(BtreeCheckState));
 	state->rel = rel;
 	state->heaprel = heaprel;
+	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
 
@@ -806,7 +819,8 @@ bt_target_page_check(BtreeCheckState *state)
 	 * doesn't contain a high key, so nothing to check
 	 */
 	if (!P_RIGHTMOST(topaque) &&
-		!_bt_check_natts(state->rel, state->target, P_HIKEY))
+		!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+						 P_HIKEY))
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
@@ -839,6 +853,7 @@ bt_target_page_check(BtreeCheckState *state)
 		IndexTuple	itup;
 		size_t		tupsize;
 		BTScanInsert skey;
+		bool		lowersizelimit;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -865,7 +880,8 @@ bt_target_page_check(BtreeCheckState *state)
 					 errhint("This could be a torn page problem.")));
 
 		/* Check the number of index tuple attributes */
-		if (!_bt_check_natts(state->rel, state->target, offset))
+		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+							 offset))
 		{
 			char	   *itid,
 					   *htid;
@@ -906,7 +922,56 @@ bt_target_page_check(BtreeCheckState *state)
 			continue;
 
 		/* Build insertion scankey for current page offset */
-		skey = _bt_mkscankey(state->rel, itup);
+		skey = bt_mkscankey_minusinfkey(state->rel, itup);
+
+		/*
+		 * Make sure tuple size does not exceed the relevant BTREE_VERSION
+		 * specific limit.
+		 *
+		 * BTREE_VERSION 4 (which introduced heapkeyspace rules) requisitioned
+		 * a small amount of space from BTMaxItemSize() in order to ensure
+		 * that suffix truncation always has enough space to add an explicit
+		 * heap TID back to a tuple -- we pessimistically assume that every
+		 * newly inserted tuple will eventually need to have a heap TID
+		 * appended during a future leaf page split, when the tuple becomes
+		 * the basis of the new high key (pivot tuple) for the leaf page.
+		 *
+		 * Since the reclaimed space is reserved for that purpose, we must not
+		 * enforce the slightly lower limit when the extra space has been used
+		 * as intended.  In other words, there is only a cross-version
+		 * difference in the limit on tuple size within leaf pages.
+		 *
+		 * Still, we're particular about the details within BTREE_VERSION 4
+		 * internal pages.  Pivot tuples may only use the extra space for its
+		 * designated purpose.  Enforce the lower limit for pivot tuples when
+		 * an explicit heap TID isn't actually present. (In all other cases
+		 * suffix truncation is guaranteed to generate a pivot tuple that's no
+		 * larger than the first right tuple provided to it by its caller.)
+		 */
+		lowersizelimit = skey->heapkeyspace &&
+			(P_ISLEAF(topaque) || BTreeTupleGetHeapTID(itup) == NULL);
+		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
+					   BTMaxItemSizeNoHeapTid(state->target)))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
+							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index row size %zu exceeds maximum for index \"%s\"",
+							tupsize, RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap" : "index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -940,9 +1005,35 @@ bt_target_page_check(BtreeCheckState *state)
 		 * grandparents (as well as great-grandparents, and so on).  We don't
 		 * go to those lengths because that would be prohibitively expensive,
 		 * and probably not markedly more effective in practice.
+		 *
+		 * On the leaf level, we check that the key is <= the highkey.
+		 * However, on non-leaf levels we check that the key is < the highkey,
+		 * because the high key is "just another separator" rather than a copy
+		 * of some existing key item; we expect it to be unique among all keys
+		 * on the same level.  (Suffix truncation will sometimes produce a
+		 * leaf highkey that is an untruncated copy of the lastleft item, but
+		 * never any other item, which necessitates weakening the leaf level
+		 * check to <=.)
+		 *
+		 * Full explanation for why a highkey is never truly a copy of another
+		 * item from the same level on internal levels:
+		 *
+		 * While the new left page's high key is copied from the first offset
+		 * on the right page during an internal page split, that's not the
+		 * full story.  In effect, internal pages are split in the middle of
+		 * the firstright tuple, not between the would-be lastleft and
+		 * firstright tuples: the firstright key ends up on the left side as
+		 * left's new highkey, and the firstright downlink ends up on the
+		 * right side as right's new "negative infinity" item.  The negative
+		 * infinity tuple is truncated to zero attributes, so we're only left
+		 * with the downlink.  In other words, the copying is just an
+		 * implementation detail of splitting in the middle of a (pivot)
+		 * tuple. (See also: "Notes About Data Representation" in the nbtree
+		 * README.)
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
+			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
 			char	   *itid,
 					   *htid;
@@ -968,11 +1059,10 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1035,7 +1125,7 @@ bt_target_page_check(BtreeCheckState *state)
 			rightkey = bt_right_page_check_scankey(state);
 
 			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+				!invariant_g_offset(state, rightkey, max))
 			{
 				/*
 				 * As explained at length in bt_right_page_check_scankey(),
@@ -1213,9 +1303,9 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * continued existence of target block as non-ignorable (not half-dead or
 	 * deleted) implies that target page was not merged into from the right by
 	 * deletion; the key space at or after target never moved left.  Target's
-	 * parent either has the same downlink to target as before, or a <=
+	 * parent either has the same downlink to target as before, or a <
 	 * downlink due to deletion at the left of target.  Target either has the
-	 * same highkey as before, or a highkey <= before when there is a page
+	 * same highkey as before, or a highkey < before when there is a page
 	 * split. (The rightmost concurrently-split-from-target-page page will
 	 * still have the same highkey as target was originally found to have,
 	 * which for our purposes is equivalent to target's highkey itself never
@@ -1304,7 +1394,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * memory remaining allocated.
 	 */
 	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
-	return _bt_mkscankey(state->rel, firstitup);
+	return bt_mkscankey_minusinfkey(state->rel, firstitup);
 }
 
 /*
@@ -1367,7 +1457,8 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the
+	 * page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1416,14 +1507,29 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 	{
 		/*
 		 * Skip comparison of target page key against "negative infinity"
-		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * item, if any.  Checking it would indicate that it's not a strict
+		 * lower bound, but that's only because of the hard-coding for
+		 * negative infinity items within _bt_compare().
+		 *
+		 * If nbtree didn't truncate negative infinity tuples during internal
+		 * page splits then we'd expect child's negative infinity key to be
+		 * equal to the scankey/downlink from target/parent (it would be a
+		 * "low key" in this hypothetical scenario, and so it would still need
+		 * to be treated as a special case here).
+		 *
+		 * Negative infinity items can be thought of as a strict lower bound
+		 * that works transitively, with the last non-negative-infinity pivot
+		 * followed during a descent from the root as its "true" strict lower
+		 * bound.  Only a small number of negative infinity items are truly
+		 * negative infinity; those that are the first items of leftmost
+		 * internal pages.  In more general terms, a negative infinity item is
+		 * only negative infinity with respect to the subtree that the page is
+		 * at the root of.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
+		if (!invariant_l_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1855,6 +1961,66 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return invariant_leq_offset(state, key, upperbound);
+
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+
+	/*
+	 * _bt_compare() is capable of determining that a scankey with a
+	 * filled-out attribute is greater than pivot tuples where the comparison
+	 * is resolved at a truncated attribute (value of attribute in pivot is
+	 * minus infinity).  It is even capable of determining that a "minus
+	 * infinity value" from a "minusinfkey" scankey is equal to a pivot's
+	 * truncated attribute.  However, it is not capable of determining that a
+	 * scankey ("minusinfkey" or otherwise) is _less than_ a tuple on the
+	 * basis of a comparison resolved at _scankey_ minus infinity attribute.
+	 *
+	 * Somebody could teach _bt_compare() to handle this on its own, but that
+	 * would add overhead to index scans.  Complete an extra step to make it
+	 * work here instead.
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		nonpivot = P_ISLEAF(topaque) && upperbound >= P_FIRSTDATAKEY(topaque);
+
+		/* Get number of keys + heap TID for item to the right */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup, nonpivot);
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && rheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1874,42 +2040,84 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound)
 {
 	int32		cmp;
 
 	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
-	return cmp >= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp >= 0;
+
+	/*
+	 * No need to consider the possibility that scankey has attributes that we
+	 * need to force to be interpreted as negative infinity.  _bt_compare() is
+	 * able to determine that scankey is greater than negative infinity.  The
+	 * distinction between "==" and "<" isn't interesting here, since
+	 * corruption is indicated either way.
+	 */
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e. the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
-							   Page nontarget, OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							 Page nontarget, OffsetNumber upperbound)
 {
 	int32		cmp;
 
 	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
-	return cmp <= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp <= 0;
+
+	/* See invariant_l_offset() for an explanation of this extra step */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+		BTPageOpaque copaque;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		nonpivot = P_ISLEAF(copaque) && upperbound >= P_FIRSTDATAKEY(copaque);
+
+		/* Get number of keys + heap TID for child/non-target item */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+		childheaptid = BTreeTupleGetHeapTIDCareful(state, child, nonpivot);
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && childheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -2065,3 +2273,61 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * _bt_mkscankey() wrapper that automatically sets insertion scankey to have
+ * minus infinity values for truncated attributes from itup (when itup is a
+ * pivot tuple with one or more truncated attributes).
+ *
+ * In a non-corrupt heapkeyspace index, all pivot tuples on a level have
+ * unique keys, so the !minusinfkey optimization correctly guides scans that
+ * aren't interested in relocating a leaf page using leaf page's high key
+ * (i.e. optimization can safely be used by the vast majority of all
+ * _bt_search() calls).  nbtree verification should always use "minusinfkey"
+ * semantics, though, because the !minusinfkey optimization might mask a
+ * problem in a corrupt index.
+ *
+ * For example, invariant_g_offset() might miss a cross-page invariant failure
+ * on an internal level if the scankey built from the first item on the
+ * target's right sibling page happened to be equal to (not greater than) the
+ * last item on the target page.  The !minusinfkey tie-breaker might otherwise
+ * cause amcheck to conclude that the scankey is greater, missing index
+ * corruption.  It's unlikely that the same problem would not be caught some
+ * other way, but the !minusinfkey optimization has no upside for amcheck, so
+ * it seems sensible to always avoid it.
+ */
+static inline BTScanInsert
+bt_mkscankey_minusinfkey(Relation rel, IndexTuple itup)
+{
+	BTScanInsert skey;
+
+	skey = _bt_mkscankey(rel, itup);
+	skey->minusinfkey = true;
+
+	return skey;
+}
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
+ * be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	BlockNumber targetblock = state->targetblock;
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index bfa0c04c2f..8d27c9b0f6 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -561,7 +561,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	 * Get values of extended metadata if available, use default values
 	 * otherwise.
 	 */
-	if (metad->btm_version == BTREE_VERSION)
+	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b..07c2dcd771 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d4..9920dbfd40 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f427b312..21c978503a 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -504,8 +504,9 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
 
   <para>
    By default, B-tree indexes store their entries in ascending order
-   with nulls last.  This means that a forward scan of an index on
-   column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
+   with nulls last (table TID is treated as a tiebreaker column among
+   otherwise equal entries).  This means that a forward scan of an
+   index on column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
    (or more verbosely, <literal>ORDER BY x ASC NULLS LAST</literal>).  The
    index can also be scanned backward, producing output satisfying
    <literal>ORDER BY x DESC</literal>
@@ -1162,10 +1163,21 @@ CREATE INDEX tab_x_y ON tab(x, y);
    the extra columns are trailing columns; making them be leading columns is
    unwise for the reasons explained in <xref linkend="indexes-multicolumn"/>.
    However, this method doesn't support the case where you want the index to
-   enforce uniqueness on the key column(s).  Also, explicitly marking
-   non-searchable columns as <literal>INCLUDE</literal> columns makes the
-   index slightly smaller, because such columns need not be stored in upper
-   B-tree levels.
+   enforce uniqueness on the key column(s).
+  </para>
+
+  <para>
+   <firstterm>Suffix truncation</firstterm> always removes non-key
+   columns from upper B-Tree levels.  As payload columns, they are
+   never used to guide index scans.  The truncation process also
+   removes one or more trailing key column(s) when the remaining
+   prefix of key column(s) happens to be sufficient to describe tuples
+   on the lowest B-Tree level.  In practice, covering indexes without
+   an <literal>INCLUDE</literal> clause often avoid storing columns
+   that are effectively payload in the upper levels.  However,
+   explicitly defining payload columns as non-key columns
+   <emphasis>reliably</emphasis> keeps the tuples in upper levels
+   small.
   </para>
 
   <para>
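
As a rough illustration of why the documentation stresses keeping
upper-level tuples small, here is a back-of-the-envelope calculation (the
byte counts below are invented for illustration, not measured) of how
pivot tuple width translates into internal-page fan-out:

    /*
     * Back-of-the-envelope sketch (illustrative numbers only): smaller pivot
     * tuples mean more downlinks per internal page, so the tree stays
     * shallower for the same number of leaf pages.
     */
    #include <stdio.h>

    int
    main(void)
    {
        const int   usable = 8192 - 192;    /* assumed usable bytes per page */
        const int   itemid = 4;             /* line pointer overhead per item */
        int         widths[] = {16, 48};    /* truncated vs. untruncated pivot */
        int         i;

        for (i = 0; i < 2; i++)
        {
            int     fanout = usable / (widths[i] + itemid);

            printf("pivot width %2d bytes -> fan-out ~%d\n", widths[i], fanout);
        }
        return 0;
    }

With these made-up numbers the truncated pivot more than doubles the
number of downlinks per internal page, which is the fan-out benefit the
paragraph above is getting at.
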
diff --git a/src/backend/access/common/indextuple.c b/src/backend/access/common/indextuple.c
index 32c0ebb93a..cb23be859d 100644
--- a/src/backend/access/common/indextuple.c
+++ b/src/backend/access/common/indextuple.c
@@ -536,7 +536,11 @@ index_truncate_tuple(TupleDesc sourceDescriptor, IndexTuple source,
 	bool		isnull[INDEX_MAX_KEYS];
 	IndexTuple	truncated;
 
-	Assert(leavenatts < sourceDescriptor->natts);
+	Assert(leavenatts <= sourceDescriptor->natts);
+
+	/* Easy case: no truncation actually required */
+	if (leavenatts == sourceDescriptor->natts)
+		return CopyIndexTuple(source);
 
 	/* Create temporary descriptor to scribble on */
 	truncdesc = palloc(TupleDescSize(sourceDescriptor));
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89..9c0b4718b6 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -28,37 +28,50 @@ right-link to find the new page containing the key range you're looking
 for.  This might need to be repeated, if the page has been split more than
 once.
 
+Lehman and Yao talk about pairs of "separator" keys and downlinks in
+internal pages rather than tuples or records.  We use the term "pivot"
+tuple to refer to tuples that don't point to heap tuples and are used
+only for tree navigation.  All tuples on non-leaf pages and high keys on
+leaf pages are pivot tuples.  Since pivot tuples are only used to represent
+which part of the key space belongs on each page, they can have attribute
+values copied from non-pivot tuples that were deleted and killed by VACUUM
+some time ago.  A pivot tuple may contain a "separator" key and downlink,
+just a separator key (i.e. the downlink value is implicitly undefined), or
+just a downlink (i.e. all attributes are truncated away).  We aren't always
+clear on which case applies, but it should be obvious from context.
+
+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute.  Logical duplicates are sorted in heap TID
+order.  This is necessary because Lehman and Yao also require that the key
+range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are
+the adjacent keys in the parent page (Ki must be _strictly_ less than v,
+which can be assured by having reliably unique keys).
+
+A search where the key is equal to a pivot tuple in an upper tree level
+must descend to the left of that pivot to ensure it finds any equal keys.
+The equal item(s) being searched for must therefore be to the left of that
+downlink page on the next level down.  A handy property of this design is
+that a scan where all attributes/keys are used behaves just the same as a
+scan where only some prefix of attributes are used; equality never needs to
+be treated as a special case.
+
+In practice, exact equality with pivot tuples on internal pages is
+extremely rare when all attributes (including even the heap TID attribute)
+are used in a search.  This is due to suffix truncation: truncated
+attributes are treated as having the value negative infinity, and
+truncation almost always manages to at least truncate away the trailing
+heap TID attribute.  While Lehman and Yao don't have anything to say about
+suffix truncation, the design used by nbtree is perfectly complementary.
+The later section on suffix truncation will be helpful if it's unclear how
+the Lehman & Yao invariants work with a real world example involving
+suffix truncation.
+
 Differences to the Lehman & Yao algorithm
 -----------------------------------------
 
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
-
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
-
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
 among backends.  As a result, we do page-level read locking on btree
@@ -194,9 +207,7 @@ be prepared for the possibility that the item it wants is to the left of
 the recorded position (but it can't have moved left out of the recorded
 page).  Since we hold a lock on the lower page (per L&Y) until we have
 re-found the parent item that links to it, we can be assured that the
-parent item does still exist and can't have been deleted.  Also, because
-we are matching downlink page numbers and not data keys, we don't have any
-problem with possibly misidentifying the parent item.
+parent item does still exist and can't have been deleted.
 
 Page Deletion
 -------------
@@ -595,36 +606,56 @@ scankey point to comparison functions that return boolean, such as int4lt.
 There might be more than one scankey entry for a given index column, or
 none at all.  (We require the keys to appear in index column order, but
 the order of multiple keys for a given column is unspecified.)  An
-insertion scankey uses the same array-of-ScanKey data structure, but the
+insertion scankey uses a similar array-of-ScanKey data structure, but the
 sk_func pointers point to btree comparison support functions (ie, 3-way
 comparators that return int4 values interpreted as <0, =0, >0).  In an
-insertion scankey there is exactly one entry per index column.  Insertion
-scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
-used to locate the starting point of a scan, as well as for locating the
-place to insert a new index tuple.  (Note: in the case of an insertion
-scankey built from a search scankey, there might be fewer keys than
-index columns, indicating that we have no constraints for the remaining
-index columns.)  After we have located the starting point of a scan, the
-original search scankey is consulted as each index entry is sequentially
-scanned to decide whether to return the entry and whether the scan can
-stop (see _bt_checkkeys()).
+insertion scankey there is at most one entry per index column.  There is
+also other data about the rules used to locate where to begin the scan,
+such as whether or not the scan is a "nextkey" scan.  Insertion scankeys
+are built within the btree code (eg, by _bt_mkscankey()) and are used to
+locate the starting point of a scan, as well as for locating the place to
+insert a new index tuple.  (Note: in the case of an insertion scankey built
+from a search scankey or built from a truncated pivot tuple, there might be
+fewer keys than index columns, indicating that we have no constraints for
+the remaining index columns.) After we have located the starting point of a
+scan, the original search scankey is consulted as each index entry is
+sequentially scanned to decide whether to return the entry and whether the
+scan can stop (see _bt_checkkeys()).
 
-We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+Notes about suffix truncation
+-----------------------------
+
+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split.  The remaining attributes must distinguish
+the last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page.  Tuples logically retain truncated key attributes,
+though they implicitly have "negative infinity" as their value, and have no
+storage overhead.  Since the high key is subsequently reused as the
+downlink in the parent page for the new right page, suffix truncation makes
+pivot tuples short.  INCLUDE indexes are guaranteed to have non-key
+attributes truncated at the time of a leaf page split, but may also have
+some key attributes truncated away, based on the usual criteria for key
+attributes.  They are not a special case, since non-key attributes are
+merely payload to B-Tree searches.
+
+The goal of suffix truncation of key attributes is to improve index
+fan-out.  The technique was first described by Bayer and Unterauer (R.Bayer
+and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
+2, No. 1, March 1977, pp 11-26).  The Postgres implementation is loosely
+based on their paper.  Note that Postgres only implements what the paper
+refers to as simple prefix B-Trees.  Note also that the paper assumes that
+the tree has keys that consist of single strings that maintain the "prefix
+property", much like strings that are stored in a suffix tree (comparisons
+of earlier bytes must always be more significant than comparisons of later
+bytes, and, in general, the strings must compare in a way that doesn't
+break transitive consistency as they're split into pieces).  Suffix
+truncation in Postgres currently only works at the whole-attribute
+granularity, but it would be straightforward to invent opclass
+infrastructure that manufactures a smaller attribute value in the case of
+variable-length types, such as text.  An opclass support function could
+manufacture the shortest possible key value that still correctly separates
+each half of a leaf page split.
 
 Notes About Data Representation
 -------------------------------
@@ -637,20 +668,26 @@ don't need to renumber any existing pages when splitting the root.)
 
 The Postgres disk block data format (an array of items) doesn't fit
 Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
-so we have to play some games.
+so we have to play some games.  (Presumably L&Y describe pages this
+way because of internal page splits, which conceptually split at the
+middle of an existing pivot tuple -- the tuple's "separator" key goes on
+the left side of the split as the left side's new high key, while the
+tuple's pointer/downlink goes on the right side as the first/minus
+infinity downlink.)
 
 On a page that is not rightmost in its tree level, the "high key" is
 kept in the page's first item, and real data items start at item 2.
 The link portion of the "high key" item goes unused.  A page that is
-rightmost has no "high key", so data items start with the first item.
-Putting the high key at the left, rather than the right, may seem odd,
-but it avoids moving the high key as we add data items.
+rightmost has no "high key" (it's implicitly positive infinity), so
+data items start with the first item.  Putting the high key at the
+left, rather than the right, may seem odd, but it avoids moving the
+high key as we add data items.
 
 On a leaf page, the data items are simply links to (TIDs of) tuples
 in the relation being indexed, with the associated key values.
 
 On a non-leaf page, the data items are down-links to child pages with
-bounding keys.  The key in each data item is the *lower* bound for
+bounding keys.  The key in each data item is a strict lower bound for
 keys on that child page, so logically the key is to the left of that
 downlink.  The high key (if present) is the upper bound for the last
 downlink.  The first data item on each such page has no lower bound
@@ -658,4 +695,5 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
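
Since the README text above leans heavily on the idea of keeping just
enough of the key, here is a toy sketch (invented names, fixed-width
integer keys; not the patch's actual _bt_truncate() logic, just the shape
of the decision) of choosing how many attributes a new leaf high key
needs: the smallest prefix of the first right-hand tuple's key attributes
that still distinguishes it from the last left-hand tuple, with the heap
TID kept only when every key attribute is equal:

    /*
     * Toy sketch: pick the number of attributes a new leaf high key needs.
     * Keep the heap TID only when every key attribute is equal.
     */
    #include <stdio.h>

    #define NKEYATTS 3

    static int
    toy_keep_natts(const int *lastleft, const int *firstright, int *keep_heap_tid)
    {
        int     attnum;

        *keep_heap_tid = 0;
        for (attnum = 0; attnum < NKEYATTS; attnum++)
        {
            if (lastleft[attnum] != firstright[attnum])
                return attnum + 1;  /* prefix up to here already distinguishes */
        }

        /* All key attributes equal: heap TID must act as the tiebreaker */
        *keep_heap_tid = 1;
        return NKEYATTS;
    }

    int
    main(void)
    {
        int     lastleft[NKEYATTS] = {7, 5, 100};
        int     firstright[NKEYATTS] = {7, 6, 1};
        int     keep_tid;
        int     natts = toy_keep_natts(lastleft, firstright, &keep_tid);

        /* Prints "keep 2 attributes, heap TID: 0" for this input */
        printf("keep %d attributes, heap TID: %d\n", natts, keep_tid);
        return 0;
    }

In the example the tuples already differ on the second attribute, so the
third attribute and the heap TID can both be truncated away.
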
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index f143ea8be2..818683ac2e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -64,14 +64,16 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
 				  Relation heapRel);
 static bool _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
 					 bool *restorebinsrch, Size itemsz);
-static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
+static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
+			   Buffer buf,
+			   Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
 			   bool split_only_page);
-static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
-		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
-		  IndexTuple newitem, bool newitemonleft);
+static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
+		  Buffer cbuf, OffsetNumber firstright, OffsetNumber newitemoff,
+		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
@@ -120,6 +122,9 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_key = _bt_mkscankey(rel, itup);
+	/* No scantid until uniqueness established in checkingunique case */
+	if (checkingunique && itup_key->heapkeyspace)
+		itup_key->scantid = NULL;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -225,12 +230,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tie-breaker attribute.  Any other would-be inserter
+	 * of the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -267,6 +273,10 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/* Uniqueness is established -- restore heap tid as scantid */
+		if (itup_key->heapkeyspace)
+			itup_key->scantid = &itup->t_tid;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
@@ -275,12 +285,12 @@ top:
 
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
-		 * an index tuple insert conflicts with an existing lock.  Since the
-		 * actual location of the insert is hard to predict because of the
-		 * random search used to prevent O(N^2) performance when there are
-		 * many duplicate entries, we can just use the "first valid" page.
-		 * This reasoning also applies to INCLUDE indexes, whose extra
-		 * attributes are not considered part of the key space.
+		 * an index tuple insert conflicts with an existing lock.  The actual
+		 * location of the insert is unsettled in the checkingunique case
+		 * because scantid was not filled in initially, but it's okay to use
+		 * the "first valid" page instead.  This reasoning also applies to
+		 * INCLUDE indexes, whose extra attributes are not considered part of
+		 * the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
 
@@ -291,8 +301,8 @@ top:
 		 */
 		newitemoff = _bt_findinsertloc(rel, itup_key, &buf, checkingunique,
 									   itup, stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, newitemoff,
-					   false);
+		_bt_insertonpg(rel, itup_key, buf, InvalidBuffer, stack, itup,
+					   newitemoff, false);
 	}
 	else
 	{
@@ -313,7 +323,8 @@ top:
  *
  * Sets state in itup_key sufficient for later _bt_findinsertloc() call to
  * reuse most of the work of our initial binary search to find conflicting
- * tuples.
+ * tuples.  This saved state won't be usable if the caller's tuple turns out
+ * not to belong on buf once scantid has been filled in.
  *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
@@ -362,6 +373,7 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
 	Assert(itup_key->low == offset);
+	Assert(itup_key->scantid == NULL);
 	for (;;)
 	{
 		ItemId		curitemid;
@@ -399,16 +411,14 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 			/*
 			 * We can skip items that are marked killed.
 			 *
-			 * Formerly, we applied _bt_isequal() before checking the kill
-			 * flag, so as to fall out of the item loop as soon as possible.
-			 * However, in the presence of heavy update activity an index may
-			 * contain many killed items with the same key; running
-			 * _bt_isequal() on each killed item gets expensive. Furthermore
-			 * it is likely that the non-killed version of each key appears
-			 * first, so that we didn't actually get to exit any sooner
-			 * anyway. So now we just advance over killed items as quickly as
-			 * we can. We only apply _bt_isequal() when we get to a non-killed
-			 * item.
+			 * In the presence of heavy update activity an index may contain
+			 * many killed items with the same key; running _bt_isequal() on
+			 * each killed item gets expensive. Just advance over killed items
+			 * as quickly as we can. We only apply _bt_isequal() when we get
+			 * to a non-killed item. Even those comparisons could be avoided
+			 * (in the common case where there is only one page to visit) by
+			 * reusing bounds, but just skipping dead items is sufficiently
+			 * effective.
 			 */
 			if (!ItemIdIsDead(curitemid))
 			{
@@ -633,16 +643,16 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 /*
  *	_bt_findinsertloc() -- Finds an insert location for a tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
+ *		On entry, *bufptr contains the page that the new tuple belongs on.
+ *		Occasionally, this won't be exactly right for callers that just
+ *		called _bt_check_unique() and did their initial search without a
+ *		scantid.  In the rare case where a unique index contains many
+ *		physical duplicates, the scantid (once filled in) may direct us to
+ *		a page full of duplicates further to the right, where the new tuple
+ *		must go.  (Actually, since !heapkeyspace pg_upgrade'd non-unique
+ *		indexes never get a scantid, they too may require that we move
+ *		right.  We treat them somewhat like unique indexes.)
  *
  *		_bt_check_unique() callers arrange for their insertion scan key to
  *		save the progress of the last binary search performed.  No additional
@@ -685,28 +695,26 @@ _bt_findinsertloc(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itemsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
-	 */
-	if (itemsz > BTMaxItemSize(page))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itemsz, BTMaxItemSize(page),
-						RelationGetRelationName(rel)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(heapRel,
-									RelationGetRelationName(rel))));
+	/* Check 1/3 of a page restriction */
+	if (unlikely(itemsz > BTMaxItemSize(page)))
+		_bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+							 newtup);
 
+	/*
+	 * We may have to walk right through leaf pages to find the one leaf page
+	 * that we must insert onto, though only when inserting into unique
+	 * indexes.  This is necessary because a scantid is not used by the
+	 * insertion scan key initially in the case of unique indexes -- a scantid
+	 * is only set after the absence of duplicates (whose heap tuples are not
+	 * dead or recently dead) has been established by _bt_check_unique().
+	 * Non-unique index insertions will break out of the loop immediately.
+	 *
+	 * (Actually, non-unique insertions into a pg_upgrade'd !heapkeyspace
+	 * index may still need to grovel through leaf pages full of duplicates.)
+	 */
 	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+	Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
 	for (;;)
 	{
 		int			cmpval;
@@ -714,6 +722,13 @@ _bt_findinsertloc(Relation rel,
 		BlockNumber rblkno;
 
 		/*
+		 * Fastpaths that avoid an extra high key check.
+		 *
+		 * There is no need to check the high key when inserting into a
+		 * non-unique index; _bt_search() already checked this when it decided
+		 * whether a move to the right was required for the leaf page.  The
+		 * insertion scankey's scantid would have been filled out at the time.
+		 *
 		 * The checkingunique (restorebinsrch) case may well have established
 		 * bounds within _bt_check_unique()'s binary search that preclude the
 		 * need for a further high key check.  This fastpath isn't used when
@@ -721,22 +736,33 @@ _bt_findinsertloc(Relation rel,
 		 * when it looks like the new item belongs last on the page, but it
 		 * might go on a later page instead.
 		 */
-		if (restorebinsrch && itup_key->low <= itup_key->stricthigh &&
-			itup_key->stricthigh <= PageGetMaxOffsetNumber(page))
+		if (!checkingunique && itup_key->heapkeyspace)
+			break;
+		else if (restorebinsrch && itup_key->low <= itup_key->stricthigh &&
+				 itup_key->stricthigh <= PageGetMaxOffsetNumber(page))
 			break;
 
 		if (P_RIGHTMOST(lpageop))
 			break;
 		cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
-
-		/*
-		 * May have to handle case where there is a choice of which page to
-		 * place new tuple on, and we must balance space utilization as best
-		 * we can.
-		 */
-		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
-												&restorebinsrch, itemsz))
-			break;
+		if (itup_key->heapkeyspace)
+		{
+			if (cmpval <= 0)
+				break;
+		}
+		else
+		{
+			/*
+			 * pg_upgrade'd !heapkeyspace index.
+			 *
+			 * May have to handle legacy case where there is a choice of which
+			 * page to place new tuple on, and we must balance space
+			 * utilization as best we can.
+			 */
+			if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
+													&restorebinsrch, itemsz))
+				break;
+		}
 
 		/*
 		 * step right to next non-dead page
@@ -745,6 +771,8 @@ _bt_findinsertloc(Relation rel,
 		 * page; else someone else's _bt_check_unique scan could fail to see
 		 * our insertion.  write locks on intermediate dead pages won't do
 		 * because we don't know when they will get de-linked from the tree.
+		 * (This is more aggressive than it needs to be for non-unique
+		 * !heapkeyspace indexes.)
 		 */
 		rbuf = InvalidBuffer;
 
@@ -814,9 +842,17 @@ _bt_findinsertloc(Relation rel,
 /*
  *	_bt_useduplicatepage() -- Settle for this page of duplicates?
  *
+ *		Prior to PostgreSQL 12/Btree version 4, heap TID was never treated
+ *		as a part of the keyspace.  If there were many tuples of the same
+ *		value spanning more than one leaf page, a new tuple of that same
+ *		value could legally be placed on any one of the pages.
+ *
  *		This function handles the question of whether or not an insertion
- *		of a duplicate into a pg_upgrade'd !heapkeyspace index should
- *		insert on the page contained in buf when a choice must be made.
+ *		of a duplicate into a pg_upgrade'd !heapkeyspace index should insert
+ *		on the page contained in buf when a choice must be made.  It is only
+ *		used with pg_upgrade'd version 2 and version 3 indexes (!heapkeyspace
+ *		indexes).
+ *
  *		Preemptive microvacuuming is performed here when that could allow
  *		caller to insert on to the page in buf.
  *
@@ -904,6 +940,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
  */
 static void
 _bt_insertonpg(Relation rel,
+			   BTScanInsert itup_key,
 			   Buffer buf,
 			   Buffer cbuf,
 			   BTStack stack,
@@ -926,7 +963,7 @@ _bt_insertonpg(Relation rel,
 		   BTreeTupleGetNAtts(itup, rel) ==
 		   IndexRelationGetNumberOfAttributes(rel));
 	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
+		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/* The caller should've finished any incomplete splits already. */
@@ -976,8 +1013,8 @@ _bt_insertonpg(Relation rel,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, buf, cbuf, firstright,
-						 newitemoff, itemsz, itup, newitemonleft);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, firstright, newitemoff,
+						 itemsz, itup, newitemonleft);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1059,7 +1096,7 @@ _bt_insertonpg(Relation rel,
 		if (BufferIsValid(metabuf))
 		{
 			/* upgrade meta-page if needed */
-			if (metad->btm_version < BTREE_VERSION)
+			if (metad->btm_version < BTREE_NOVAC_VERSION)
 				_bt_upgrademetapage(metapg);
 			metad->btm_fastroot = itup_blkno;
 			metad->btm_fastlevel = lpageop->btpo.level;
@@ -1114,6 +1151,8 @@ _bt_insertonpg(Relation rel,
 
 			if (BufferIsValid(metabuf))
 			{
+				Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+				xlmeta.version = metad->btm_version;
 				xlmeta.root = metad->btm_root;
 				xlmeta.level = metad->btm_level;
 				xlmeta.fastroot = metad->btm_fastroot;
@@ -1181,17 +1220,19 @@ _bt_insertonpg(Relation rel,
  *		new right page.  newitemoff etc. tell us about the new item that
  *		must be inserted along with the data from the old page.
  *
- *		When splitting a non-leaf page, 'cbuf' is the left-sibling of the
- *		page we're inserting the downlink for.  This function will clear the
- *		INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *		itup_key is used for suffix truncation on leaf pages (internal
+ *		page callers pass NULL).  When splitting a non-leaf page, 'cbuf'
+ *		is the left-sibling of the page we're inserting the downlink for.
+ *		This function will clear the INCOMPLETE_SPLIT flag on it, and
+ *		release the buffer.
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
-_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
-		  bool newitemonleft)
+_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
+		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
+		  IndexTuple newitem, bool newitemonleft)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1286,7 +1327,8 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1299,9 +1341,30 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
+	 * The "high key" for the new left page will be the first key that's
+	 * going to go into the new right page, or possibly a truncated version
+	 * if this is a leaf page split.  This might be either the existing data
 	 * item at position firstright, or the incoming tuple.
+	 *
+	 * The high key for the left page is formed using the first item on the
+	 * right page, which may seem to be contrary to Lehman & Yao's approach
+	 * of using the left page's last item as its new high key when splitting
+	 * on the leaf level.  It isn't, though: suffix truncation will leave
+	 * the left page's high key fully equal to the last item on the left
+	 * page when two tuples with equal key values (excluding heap TID)
+	 * enclose the split point.  It isn't actually necessary for a new leaf
+	 * high key to be equal to the last item on the left for the L&Y
+	 * "subtree" invariant to hold.  It's sufficient to make sure that the
+	 * new leaf high key is strictly less than the first item on the right
+	 * leaf page, and greater than or equal to (i.e. not necessarily equal
+	 * to) the last item on the left leaf page.
+	 *
+	 * In other words, when suffix truncation isn't possible, L&Y's exact
+	 * approach to leaf splits is taken.  (Actually, even that is slightly
+	 * inaccurate.  A tuple with all the keys from firstright but the heap
+	 * TID from lastleft will be used as the new high key, since the last
+	 * left tuple could be physically larger despite being opclass-equal with
+	 * respect to all attributes prior to the heap TID attribute.)
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1319,25 +1382,58 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
+	 * Truncate unneeded key attributes of the high key item before
+	 * inserting it on the left page.  This can only happen at the leaf
 	 * level, since in general all pivot tuple values originate from leaf
 	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * though; truncating unneeded key suffix attributes can only be
+	 * performed at the leaf level anyway.  This is because a pivot tuple in
+	 * a grandparent page must guide a search not only to the correct parent
+	 * page, but also to the correct leaf page.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (isleaf && (itup_key->heapkeyspace || indnatts != indnkeyatts))
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		IndexTuple	lastleft;
+
+		/*
+		 * Determine which tuple will become the last on the left page.  The
+		 * last left tuple and the first right tuple enclose the split point,
+		 * and are needed to determine how far truncation can go while still
+		 * leaving us with a high key that distinguishes the left side from
+		 * the right side.
+		 */
+		if (newitemonleft && newitemoff == firstright)
+		{
+			/* incoming tuple will become last on left page */
+			lastleft = newitem;
+		}
+		else
+		{
+			OffsetNumber lastleftoff;
+
+			/* item just before firstright will become last on left page */
+			lastleftoff = OffsetNumberPrev(firstright);
+			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		/*
+		 * Truncate first item on the right side to create a new high key for
+		 * the left side.  The high key must be strictly less than all tuples
+		 * on the right side of the split, but can be equal to the last item
+		 * on the left side of the split within leaf pages.
+		 */
+		Assert(lastleft != item);
+		lefthikey = _bt_truncate(rel, lastleft, item, itup_key);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1530,7 +1626,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1559,22 +1654,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log the left page's new high key */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1590,9 +1673,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -1955,7 +2036,7 @@ _bt_insert_parent(Relation rel,
 			_bt_relbuf(rel, pbuf);
 		}
 
-		/* get high key from left page == lower bound for new right page */
+		/* get high key from left, a strict lower bound for new right page */
 		ritem = (IndexTuple) PageGetItem(page,
 										 PageGetItemId(page, P_HIKEY));
 
@@ -1985,7 +2066,7 @@ _bt_insert_parent(Relation rel,
 				 RelationGetRelationName(rel), bknum, rbknum);
 
 		/* Recursively update the parent */
-		_bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
+		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
 					   new_item, stack->bts_offset + 1,
 					   is_only);
 
@@ -2246,7 +2327,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	START_CRIT_SECTION();
 
 	/* upgrade metapage if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* set btree special data */
@@ -2281,7 +2362,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2314,6 +2396,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
 		XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = rootblknum;
 		md.level = metad->btm_level;
 		md.fastroot = rootblknum;
@@ -2378,6 +2462,7 @@ _bt_pgaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -2393,8 +2478,8 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
@@ -2407,6 +2492,7 @@ _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
 	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
 	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(itup_key->scantid == NULL);
 
 	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
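
The checkingunique changes above are easier to follow as a sequence of
steps, so here is a toy sketch of just the control flow (all names are
invented stand-ins, not the patch's functions): search and run the
uniqueness check with the heap TID deliberately left out of the insertion
key, and only once uniqueness is established restore the TID and settle
the exact position among any duplicates:

    /*
     * Toy sketch of the ordering of steps for a unique-index insertion under
     * the new scheme: check uniqueness with the "scantid" left out, then fill
     * it in and only afterwards settle on the exact leaf position.
     */
    #include <stdio.h>
    #include <stdbool.h>

    typedef struct ToyInsertKey
    {
        int     keyval;         /* the single user-visible key attribute */
        long   *scantid;        /* heap TID tiebreaker, or NULL while unset */
    } ToyInsertKey;

    /* Stand-ins for the uniqueness check and the insert-location search */
    static bool
    toy_check_unique(const ToyInsertKey *key)
    {
        /* would scan every equal tuple, regardless of heap TID */
        (void) key;
        return true;            /* pretend no live conflict was found */
    }

    static int
    toy_find_insert_offset(const ToyInsertKey *key)
    {
        /* with scantid set, the position among duplicates is now fixed */
        return (key->scantid != NULL) ? 42 : -1;
    }

    int
    main(void)
    {
        long            heaptid = 1001;
        ToyInsertKey    key = {7, NULL};    /* scantid deliberately unset */

        if (!toy_check_unique(&key))
            return 1;                       /* would report a conflict */

        key.scantid = &heaptid;             /* uniqueness established: restore */
        printf("insert at offset %d\n", toy_find_insert_offset(&key));
        return 0;
    }

The important ordering is that scantid stays unset across the uniqueness
check, so the check starts at the first page the value could be on, and it
is only filled in afterwards, which is why _bt_findinsertloc() may then
have to move right.
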
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 56041c3d38..72af1ef3c1 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -34,6 +34,7 @@
 #include "utils/snapmgr.h"
 
 static void _bt_cachemetadata(Relation rel, BTMetaPageData *metad);
+static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 						 bool *rightsib_empty);
@@ -77,7 +78,9 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 }
 
 /*
- *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to the new.
+ *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to version
+ *		3, the last version that can be updated without broadly affecting
+ *		on-disk compatibility.  (A REINDEX is required to upgrade to v4.)
  *
  *		This routine does purely in-memory image upgrade.  Caller is
  *		responsible for locking, WAL-logging etc.
@@ -93,11 +96,11 @@ _bt_upgrademetapage(Page page)
 
 	/* It must be really a meta page of upgradable version */
 	Assert(metaopaque->btpo_flags & BTP_META);
-	Assert(metad->btm_version < BTREE_VERSION);
+	Assert(metad->btm_version < BTREE_NOVAC_VERSION);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 
 	/* Set version number and fill extra fields added into version 3 */
-	metad->btm_version = BTREE_VERSION;
+	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 
@@ -107,43 +110,79 @@ _bt_upgrademetapage(Page page)
 }
 
 /*
- * Cache metadata from meta page to rel->rd_amcache.
+ * Cache metadata from input meta page to rel->rd_amcache.
  */
 static void
-_bt_cachemetadata(Relation rel, BTMetaPageData *metad)
+_bt_cachemetadata(Relation rel, BTMetaPageData *input)
 {
+	BTMetaPageData *cached_metad;
+
 	/* We assume rel->rd_amcache was already freed by caller */
 	Assert(rel->rd_amcache == NULL);
 	rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
 										 sizeof(BTMetaPageData));
 
-	/*
-	 * Meta page should be of supported version (should be already checked by
-	 * caller).
-	 */
-	Assert(metad->btm_version >= BTREE_MIN_VERSION &&
-		   metad->btm_version <= BTREE_VERSION);
+	/* Meta page should be of supported version */
+	Assert(input->btm_version >= BTREE_MIN_VERSION &&
+		   input->btm_version <= BTREE_VERSION);
 
-	if (metad->btm_version == BTREE_VERSION)
+	cached_metad = (BTMetaPageData *) rel->rd_amcache;
+	if (input->btm_version >= BTREE_NOVAC_VERSION)
 	{
-		/* Last version of meta-data, no need to upgrade */
-		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		/* Version with compatible meta-data, no need to upgrade */
+		memcpy(cached_metad, input, sizeof(BTMetaPageData));
 	}
 	else
 	{
-		BTMetaPageData *cached_metad = (BTMetaPageData *) rel->rd_amcache;
-
 		/*
 		 * Upgrade meta-data: copy available information from meta-page and
 		 * fill new fields with default values.
+		 *
+		 * Note that we cannot upgrade to version 4+ without a REINDEX, since
+		 * extensive on-disk changes are required.
 		 */
-		memcpy(rel->rd_amcache, metad, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
-		cached_metad->btm_version = BTREE_VERSION;
+		memcpy(cached_metad, input, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
+		cached_metad->btm_version = BTREE_NOVAC_VERSION;
 		cached_metad->btm_oldest_btpo_xact = InvalidTransactionId;
 		cached_metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	}
 }
 
+/*
+ * Get metadata from share-locked buffer containing metapage, while performing
+ * standard sanity checks.  Sanity checks here must match _bt_getroot().
+ */
+static BTMetaPageData *
+_bt_getmeta(Relation rel, Buffer metabuf)
+{
+	Page		metapg;
+	BTPageOpaque metaopaque;
+	BTMetaPageData *metad;
+
+	metapg = BufferGetPage(metabuf);
+	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
+	metad = BTPageGetMeta(metapg);
+
+	/* sanity-check the metapage */
+	if (!P_ISMETA(metaopaque) ||
+		metad->btm_magic != BTREE_MAGIC)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("index \"%s\" is not a btree",
+						RelationGetRelationName(rel))));
+
+	if (metad->btm_version < BTREE_MIN_VERSION ||
+		metad->btm_version > BTREE_VERSION)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("version mismatch in index \"%s\": file version %d, "
+						"current version %d, minimal supported version %d",
+						RelationGetRelationName(rel),
+						metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+
+	return metad;
+}
+
 /*
  *	_bt_update_meta_cleanup_info() -- Update cleanup-related information in
  *									  the metapage.
@@ -167,7 +206,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	metad = BTPageGetMeta(metapg);
 
 	/* outdated version of metapage always needs rewrite */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		needsRewrite = true;
 	else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
 			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
@@ -186,7 +225,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* update cleanup-related information */
@@ -202,6 +241,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = metad->btm_root;
 		md.level = metad->btm_level;
 		md.fastroot = metad->btm_fastroot;
@@ -376,7 +417,7 @@ _bt_getroot(Relation rel, int access)
 		START_CRIT_SECTION();
 
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 
 		metad->btm_root = rootblkno;
@@ -400,6 +441,8 @@ _bt_getroot(Relation rel, int access)
 			XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
 			XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			md.version = metad->btm_version;
 			md.root = rootblkno;
 			md.level = 0;
 			md.fastroot = rootblkno;
@@ -595,37 +638,12 @@ _bt_getrootheight(Relation rel)
 {
 	BTMetaPageData *metad;
 
-	/*
-	 * We can get what we need from the cached metapage data.  If it's not
-	 * cached yet, load it.  Sanity checks here must match _bt_getroot().
-	 */
 	if (rel->rd_amcache == NULL)
 	{
 		Buffer		metabuf;
-		Page		metapg;
-		BTPageOpaque metaopaque;
 
 		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
-		metapg = BufferGetPage(metabuf);
-		metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-		metad = BTPageGetMeta(metapg);
-
-		/* sanity-check the metapage */
-		if (!P_ISMETA(metaopaque) ||
-			metad->btm_magic != BTREE_MAGIC)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("index \"%s\" is not a btree",
-							RelationGetRelationName(rel))));
-
-		if (metad->btm_version < BTREE_MIN_VERSION ||
-			metad->btm_version > BTREE_VERSION)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("version mismatch in index \"%s\": file version %d, "
-							"current version %d, minimal supported version %d",
-							RelationGetRelationName(rel),
-							metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+		metad = _bt_getmeta(rel, metabuf);
 
 		/*
 		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
@@ -642,19 +660,70 @@ _bt_getrootheight(Relation rel)
 		 * Cache the metapage data for next time
 		 */
 		_bt_cachemetadata(rel, metad);
-
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
 		_bt_relbuf(rel, metabuf);
 	}
 
+	/* Get cached page */
 	metad = (BTMetaPageData *) rel->rd_amcache;
-	/* We shouldn't have cached it if any of these fail */
-	Assert(metad->btm_magic == BTREE_MAGIC);
-	Assert(metad->btm_version == BTREE_VERSION);
-	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
+/*
+ *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *
+ *		This is used to determine the rules that must be used to descend a
+ *		btree.  Version 4 indexes treat heap TID as a tie-breaker attribute.
+ *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
+ *		performance when inserting a new BTScanInsert-wise duplicate tuple
+ *		among many leaf pages already full of such duplicates.
+ */
+bool
+_bt_heapkeyspace(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			uint32		btm_version = metad->btm_version;
+
+			_bt_relbuf(rel, metabuf);
+			return btm_version > BTREE_NOVAC_VERSION;
+		}
+
+		/*
+		 * Cache the metapage data for next time
+		 */
+		_bt_cachemetadata(rel, metad);
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached page */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+
+	return metad->btm_version > BTREE_NOVAC_VERSION;
+}
+
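
(Aside, to make the version rules above concrete: v2 and v3 metapages remain readable, a v2 metapage is upgraded in place to v3 the next time it is rewritten, and only v4 gets the new heap-TID-as-key semantics.  Here's a tiny standalone sketch of the decision _bt_heapkeyspace() makes; the thresholds match the patch, but the helper name below is made up and none of this is the patch's actual code:

#include <stdbool.h>
#include <stdio.h>

/* Same thresholds as the patch; only the helper below is invented */
#define BTREE_MIN_VERSION	2	/* oldest on-disk version we can read */
#define BTREE_NOVAC_VERSION	3	/* oldest version with all meta fields */
#define BTREE_VERSION		4	/* current version; heap TID is a key column */

/* Mirrors the test _bt_heapkeyspace() applies to btm_version */
static bool
toy_heapkeyspace(unsigned int btm_version)
{
	return btm_version > BTREE_NOVAC_VERSION;
}

int
main(void)
{
	for (unsigned int v = BTREE_MIN_VERSION; v <= BTREE_VERSION; v++)
		printf("metapage version %u: heapkeyspace = %s\n",
			   v, toy_heapkeyspace(v) ? "true" : "false");
	return 0;
}

End of aside.)
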
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -1123,11 +1192,12 @@ _bt_is_page_halfdead(Relation rel, BlockNumber blk)
  * right sibling.
  *
  * "child" is the leaf page we wish to delete, and "stack" is a search stack
- * leading to it (approximately).  Note that we will update the stack
- * entry(s) to reflect current downlink positions --- this is essentially the
- * same as the corresponding step of splitting, and is not expected to affect
- * caller.  The caller should initialize *target and *rightsib to the leaf
- * page and its right sibling.
+ * leading to it (in !heapkeyspace indexes, it actually leads to the leftmost
+ * leaf page whose high key matches that of the page to be deleted).  Note
+ * that we will update the stack entry(s) to reflect current downlink
+ * positions --- this is essentially the same as the corresponding step of
+ * splitting, and is not expected to affect caller.  The caller should
+ * initialize *target and *rightsib to the leaf page and its right sibling.
  *
  * Note: it's OK to release page locks on any internal pages between the leaf
  * and *topparent, because a safe deletion can't become unsafe due to
@@ -1149,8 +1219,10 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
 	BlockNumber leftsib;
 
 	/*
-	 * Locate the downlink of "child" in the parent (updating the stack entry
-	 * if needed)
+	 * Locate the downlink of "child" in the parent, updating the stack entry
+	 * if needed.  This is how !heapkeyspace indexes deal with having
+	 * non-unique high keys in leaf level pages.  Even heapkeyspace indexes
+	 * can have a stale stack due to insertions into the parent.
 	 */
 	stack->bts_btentry = child;
 	pbuf = _bt_getstackbuf(rel, stack);
@@ -1422,6 +1494,8 @@ _bt_pagedel(Relation rel, Buffer buf)
 
 				/* we need an insertion scan key for the search, so build one */
 				itup_key = _bt_mkscankey(rel, targetkey);
+				/* absent attributes need explicit minus infinity values */
+				itup_key->minusinfkey = true;
 				/* get stack to leaf page by searching index */
 				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
 				/* don't need a lock or second pin on the page */
@@ -1969,7 +2043,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	if (BufferIsValid(metabuf))
 	{
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 		metad->btm_fastroot = rightsib;
 		metad->btm_fastlevel = targetlevel;
@@ -2017,6 +2091,8 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 		{
 			XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			xlmeta.version = metad->btm_version;
 			xlmeta.root = metad->btm_root;
 			xlmeta.level = metad->btm_level;
 			xlmeta.fastroot = metad->btm_fastroot;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..ec2edae850 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -794,7 +794,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 	{
 		/*
 		 * Do cleanup if metapage needs upgrade, because we don't have
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7940297305..2999971cfd 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -152,8 +152,12 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link during the second half of a page split -- if caller ends
+		 * up splitting the child, it usually inserts a new pivot tuple for
+		 * child's new right sibling immediately after the original
+		 * bts_offset offset recorded here.  The downlink block will be needed
+		 * to check if bts_offset remains the position of this same pivot
+		 * tuple.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -251,11 +255,13 @@ _bt_moveright(Relation rel,
 	/*
 	 * When nextkey = false (normal case): if the scan key that brought us to
 	 * this page is > the high key stored on the page, then the page has split
-	 * and we need to move right.  (If the scan key is equal to the high key,
-	 * we might or might not need to move right; have to scan the page first
-	 * anyway.)
+	 * and we need to move right.  (pg_upgrade'd !heapkeyspace indexes could
+	 * have some duplicates to the right as well as the left, but that's
+	 * something that's only ever dealt with on the leaf level, after
+	 * _bt_search has found an initial leaf page.)
 	 *
 	 * When nextkey = true: move right if the scan key is >= page's high key.
+	 * (Note that key.scantid cannot be set in this case.)
 	 *
 	 * The page could even have split more than once, so scan as far as
 	 * needed.
@@ -358,6 +364,9 @@ _bt_binsrch(Relation rel,
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+
 	if (!key->restorebinsrch)
 	{
 		low = P_FIRSTDATAKEY(opaque);
@@ -367,6 +376,7 @@ _bt_binsrch(Relation rel,
 	else
 	{
 		/* Restore result of previous binary search against same page */
+		Assert(!key->heapkeyspace || key->scantid != NULL);
 		Assert(P_ISLEAF(opaque));
 		low = key->low;
 		high = key->stricthigh;
@@ -446,6 +456,7 @@ _bt_binsrch(Relation rel,
 	if (key->savebinsrch)
 	{
 		Assert(isleaf);
+		Assert(key->scantid == NULL);
 		key->low = low;
 		key->stricthigh = stricthigh;
 		key->savebinsrch = false;
@@ -492,19 +503,31 @@ _bt_compare(Relation rel,
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	ItemPointer heapTid;
 	ScanKey		scankey;
+	int			ncmpkey;
+	int			ntupatts;
 
-	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(key->heapkeyspace || key->scantid == NULL);
+	Assert(key->minusinfkey || key->heapkeyspace);
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
 	 * --- see NOTE above.
+	 *
+	 * A minus infinity key has all attributes truncated away, so this test is
+	 * redundant with the minus infinity attribute tie-breaker.  However, the
+	 * number of attributes in minus infinity tuples is not explicitly
+	 * represented as 0 within btree version 2 indexes, so an explicit offnum
+	 * test is still required.
 	 */
 	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -518,8 +541,10 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
+	ncmpkey = Min(ntupatts, key->keysz);
+	Assert(key->heapkeyspace || ncmpkey == key->keysz);
 	scankey = key->scankeys;
-	for (int i = 1; i <= key->keysz; i++)
+	for (int i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -570,8 +595,65 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * All non-truncated attributes (other than heap TID) were found to be
+	 * equal.  Treat truncated attributes as minus infinity when scankey has a
+	 * key attribute value that would otherwise be compared directly.
+	 *
+	 * Note: it doesn't matter if ntupatts includes non-key attributes;
+	 * scankey won't, so explicitly excluding non-key attributes isn't
+	 * necessary.
+	 */
+	if (key->keysz > ntupatts)
+		return 1;
+
+	/*
+	 * Use the heap TID attribute and scantid to try to break the tie.  The
+	 * rules are the same as any other key attribute -- only the
+	 * representation differs.  (This is also a convenient point to check if
+	 * the !minusinfkey optimization can be used.)
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (key->scantid == NULL)
+	{
+		/*
+		 * Most searches (all !minusinfkey searches) are not interested in
+		 * keys where minus infinity is explicitly represented, since that's a
+		 * sentinel value that never appears in non-pivot tuples.  It is safe
+		 * for these searches to have their scankey considered greater than a
+		 * truncated pivot tuple iff the scankey has equal values for
+		 * attributes up to and including the least significant untruncated
+		 * attribute in pivot tuple.  The only would-be "match" that will be
+		 * "missed" is a single leaf page's high key (the leaf page whose high
+		 * key the affected pivot tuple's values originate from).
+		 *
+		 * This optimization prevents an extra leaf page visit when the
+		 * insertion scankey would otherwise be equal.  If this tiebreaker
+		 * wasn't performed, code like _bt_readpage() and _bt_readnextpage()
+		 * would often end up moving right having found no matches on the leaf
+		 * page that their search lands on initially.
+		 *
+		 * Note: the heap TID part of this test ensures that scankey is being
+		 * compared to a pivot tuple with one or more truncated key attributes
+		 * (often though not necessarily just the heap TID attribute).
+		 */
+		if (!key->minusinfkey && key->keysz == ntupatts && heapTid == NULL)
+			return 1;
+
+		/* All provided scankey arguments found to be equal */
+		return 0;
+	}
+
+	/*
+	 * Treat truncated heap TID as minus infinity, since scankey has a key
+	 * attribute value (scantid) that would otherwise be compared directly
+	 */
+	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+	if (heapTid == NULL)
+		return 1;
+
+	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+	return ItemPointerCompare(key->scantid, heapTid);
 }
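
(To restate the ordering that _bt_compare() now implements: compare whichever untruncated key attributes both sides have, treat any truncated attribute on the tuple side -- including a truncated heap TID -- as minus infinity, and only then compare scantid against the tuple's heap TID.  A minimal sketch of just that ordering, using plain ints and a toy TID type; it deliberately ignores NULLs and the !minusinfkey optimization, and none of these names appear in the patch:

#include <stddef.h>

typedef struct ToyTid { unsigned int block; unsigned int offset; } ToyTid;

static int
toy_tid_cmp(const ToyTid *a, const ToyTid *b)
{
	if (a->block != b->block)
		return a->block < b->block ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * skeyvals has keysz untruncated attribute values plus an optional scantid;
 * the index tuple has tupnatts untruncated values plus an optional heap TID.
 * A truncated attribute on the tuple side compares as minus infinity, so the
 * scankey is considered greater.
 */
static int
toy_compare(const int *skeyvals, int keysz, const ToyTid *scantid,
			const int *tupvals, int tupnatts, const ToyTid *heaptid)
{
	int			ncmpkey = keysz < tupnatts ? keysz : tupnatts;

	for (int i = 0; i < ncmpkey; i++)
		if (skeyvals[i] != tupvals[i])
			return skeyvals[i] < tupvals[i] ? -1 : 1;
	if (keysz > tupnatts)
		return 1;				/* truncated key attribute on tuple side */
	if (scantid == NULL)
		return 0;				/* caller isn't comparing heap TIDs at all */
	if (heaptid == NULL)
		return 1;				/* truncated heap TID on tuple side */
	return toy_tid_cmp(scantid, heaptid);
}

End of aside.)
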
 
 /*
@@ -1088,7 +1170,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/* Initialize remaining insertion scan key fields */
 	inskey.savebinsrch = inskey.restorebinsrch = false;
 	inskey.low = inskey.stricthigh = InvalidOffsetNumber;
+	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	inskey.minusinfkey = !inskey.heapkeyspace;
 	inskey.nextkey = nextkey;
+	inskey.scantid = NULL;
 	inskey.keysz = keysCount;
 
 	/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 759859c302..67cdb44cf5 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -746,6 +746,7 @@ _bt_sortaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -799,8 +800,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -817,27 +816,21 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	itupsz = MAXALIGN(itupsz);
 
 	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itupsz doesn't
-	 * include the ItemId.
+	 * Check whether the item can fit on a btree page at all.
 	 *
-	 * NOTE: similar code appears in _bt_insertonpg() to defend against
-	 * oversize items being inserted into an already-existing index. But
-	 * during creation of an index, we don't go through there.
+	 * Every newly built index will treat heap TID as part of the keyspace,
+	 * which imposes the requirement that new high keys must occasionally have
+	 * a heap TID appended within _bt_truncate().  That may leave a new pivot
+	 * tuple one or two MAXALIGN() quantums larger than the original first
+	 * right tuple it's derived from.  v4 deals with the problem by decreasing
+	 * the limit on the size of tuples inserted on the leaf level by the same
+	 * small amount.  Enforce the new v4+ limit on the leaf level, and the old
+	 * limit on internal levels, since pivot tuples may need to make use of
+	 * the reserved space.  This should never fail on internal pages.
 	 */
-	if (itupsz > BTMaxItemSize(npage))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itupsz, BTMaxItemSize(npage),
-						RelationGetRelationName(wstate->index)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(wstate->heap,
-									RelationGetRelationName(wstate->index))));
+	if (unlikely(itupsz > BTMaxItemSize(npage)))
+		_bt_check_third_page(wstate->index, wstate->heap,
+							 state->btps_level == 0, npage, itup);
 
 	/*
 	 * Check to see if page is "full".  It's definitely full if the item won't
@@ -883,24 +876,35 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
+			IndexTuple	lastleft;
 			IndexTuple	truncated;
 			Size		truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks
 			 * in internal pages are either negative infinity items, or get
 			 * their contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it more
+			 * likely that _bt_truncate() can truncate away more attributes,
+			 * whereas the split point passed to _bt_split() is chosen much
+			 * more delicately.  Suffix truncation is mostly useful because it
+			 * improves space utilization for workloads with random
+			 * insertions.  It doesn't seem worthwhile to add logic for
+			 * choosing a split point here for a benefit that is bound to be
+			 * much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
 			 * original high key, and add our own truncated high key at the
-			 * same offset.
+			 * same offset.  It's okay if the truncated tuple is slightly
+			 * larger due to containing a heap TID value, since this case is
+			 * known to _bt_check_third_page(), which reserves space.
 			 *
 			 * Note that the page layout won't be changed very much.  oitup is
 			 * already located at the physical beginning of tuple space, so we
@@ -908,7 +912,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_truncate(wstate->index, lastleft, oitup,
+									 wstate->inskey);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -927,8 +935,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -973,7 +982,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1032,8 +1041,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1126,6 +1136,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1133,7 +1145,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1150,6 +1161,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all
+				 * keys in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
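
(Both this merge step and the tuplesort change further down enforce the same rule: when every user-visible key attribute compares equal, heap TID decides, and two index tuples can never compare as fully equal.  Here's a toy comparator showing the intended total order -- illustrative names only, not code from the patch:

#include <assert.h>

/* Toy index tuple: two key values plus a heap TID (block, offset) */
typedef struct ToyIndexTuple
{
	int			keys[2];
	unsigned int block;
	unsigned int offset;
} ToyIndexTuple;

/* Key attributes first; heap TID acts as the implicit last key attribute */
static int
toy_tuple_cmp(const ToyIndexTuple *a, const ToyIndexTuple *b, int nkeys)
{
	for (int i = 0; i < nkeys; i++)
		if (a->keys[i] != b->keys[i])
			return a->keys[i] < b->keys[i] ? -1 : 1;
	if (a->block != b->block)
		return a->block < b->block ? -1 : 1;
	/* heap TIDs are physically unique, so the offsets must differ here */
	assert(a->offset != b->offset);
	return a->offset < b->offset ? -1 : 1;
}

End of aside.)
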
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index e010bcdcfa..15090b26d2 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,6 +49,8 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+			   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -56,9 +58,25 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		Result is intended for use with _bt_compare().  Callers that don't
- *		need to fill out the insertion scankey arguments (e.g. they use an own
- *		ad-hoc comparison routine) can pass a NULL index tuple.
+ *		When itup is a non-pivot tuple, the returned insertion scan key is
+ *		suitable for finding a place for it to go on the leaf level.  Pivot
+ *		tuples can be used to relocate leaf page with matching high key, but
+ *		then caller needs to set scan key's minusinfkey field.  This can be
+ *		thought of as explicitly representing that absent attributes in scan
+ *		key have minus infinity values.
+ *
+ *		Result is intended for use with _bt_compare() and _bt_truncate().
+ *		Callers that don't need to fill out the insertion scankey arguments
+ *		(e.g. they use their own ad-hoc comparison routine, or only need a
+ *		scankey for _bt_truncate()) can pass a NULL index tuple.  The
+ *		scankey will be initialized as if an "all truncated" pivot tuple
+ *		was passed instead.
+ *
+ *		Note that we may occasionally have to share lock the metapage to
+ *		determine whether or not the keys in the index are expected to be
+ *		unique (i.e. if this is a "heapkeyspace" index).  We assume a
+ *		heapkeyspace index when caller passes a NULL tuple, allowing index
+ *		build callers to avoid accessing the non-existent metapage.
  */
 BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
@@ -79,15 +97,38 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
-	 * We'll execute search using scan key constructed on key columns. Non-key
-	 * (INCLUDE index) columns are always omitted from scan keys.
+	 * We'll execute search using scan key constructed on key columns.
+	 * Truncated attributes and non-key attributes are omitted from the final
+	 * scan key.
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
+	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+
+	/*
+	 * Only heapkeyspace indexes support the "no minus infinity keys"
+	 * optimization.  !heapkeyspace indexes don't actually have minus infinity
+	 * attributes, but this allows us to avoid checking heapkeyspace
+	 * separately (explicit representation of number of key attributes in v3
+	 * indexes shouldn't confuse tie-breaker logic).
+	 *
+	 * There is never a need to explicitly represent truncated attributes as
+	 * having minus infinity values.  The only caller that may truly need to
+	 * search for negative infinity is the page deletion code.  It is
+	 * sufficient to omit trailing truncated attributes from the scankey
+	 * returned to that caller because caller relies on the fact that there
+	 * cannot be duplicate high keys in heapkeyspace indexes.  Caller also
+	 * opts out of the "no minus infinity key" optimization, so search moves
+	 * left on scankey-equal downlink in parent, allowing VACUUM caller to
+	 * reliably relocate leaf page undergoing deletion.
+	 */
+	key->minusinfkey = !key->heapkeyspace;
 	key->savebinsrch = key->restorebinsrch = false;
 	key->low = key->stricthigh = InvalidOffsetNumber;
 	key->nextkey = false;
 	key->keysz = Min(indnkeyatts, tupnatts);
+	key->scantid = key->heapkeyspace && itup ?
+		BTreeTupleGetHeapTID(itup) : NULL;
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -103,9 +144,9 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
 
 		/*
-		 * Key arguments built when caller provides no tuple are defensively
-		 * represented as NULL values, though they should still not
-		 * participate in comparisons.
+		 * Key arguments built from truncated attributes (or when caller
+		 * provides no tuple) are defensively represented as NULL values,
+		 * though they should still not participate in comparisons.
 		 */
 		if (i < tupnatts)
 			arg = index_getattr(itup, i + 1, itupdesc, &null);
@@ -2043,38 +2084,238 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that takes up more
+ * space than firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
  * Note that returned tuple's t_tid offset will hold the number of attributes
  * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * should only change truncated tuple's downlink.  Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID is appended) the size of the returned
+ * tuple is the size of the first right tuple plus an additional MAXALIGN()'d
+ * item pointer.  This guarantee is important, since callers need to stay
+ * under the 1/3 of a page restriction on tuple size.  If this routine is ever
+ * taught to truncate within an attribute/datum, it will need to avoid
+ * returning an enlarged tuple to caller when truncation + TOAST compression
+ * ends up enlarging the final datum.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			 BTScanInsert itup_key)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int16		natts = IndexRelationGetNumberOfAttributes(rel);
+	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+	IndexTuple	pivot;
+	ItemPointer pivotheaptid;
+	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples.  It's never okay to
+	 * truncate a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be kept in truncated tuple */
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
 
-	return truncated;
+#ifdef DEBUG_NO_TRUNCATE
+	/* Force truncation to be ineffective for testing purposes */
+	keepnatts = nkeyatts + 1;
+#endif
+
+	if (keepnatts <= natts)
+	{
+		IndexTuple	tidpivot;
+
+		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+
+		/*
+		 * If there is a distinguishing key attribute within new pivot tuple,
+		 * there is no need to add an explicit heap TID attribute
+		 */
+		if (keepnatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			return pivot;
+		}
+
+		/*
+		 * Only truncation of non-key attributes was possible, since key
+		 * attributes are all equal.  It's necessary to add a heap TID
+		 * attribute to the new pivot tuple.
+		 */
+		Assert(natts != nkeyatts);
+		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		/* cannot leak memory here */
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * We have to use heap TID as a unique-ifier in the new pivot tuple, since
+	 * no non-TID key attribute in the right item readily distinguishes the
+	 * right side of the split from the left side.  Use enlarged space that
+	 * holds a copy of first right tuple; place a heap TID value within the
+	 * extra space that remains at the end.
+	 *
+	 * nbtree conceptualizes this case as an inability to truncate away any
+	 * key attribute.  We must use an alternative representation of heap TID
+	 * within pivots because heap TID is only treated as an attribute within
+	 * nbtree (e.g., there is no explicit pg_attribute entry).
+	 */
+	Assert(itup_key->heapkeyspace);
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/*
+	 * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+	 * consider suffix truncation.  It seems like a good idea to follow that
+	 * example in cases where no truncation takes place -- use lastleft's heap
+	 * TID.  (This is also the closest value to negative infinity that's
+	 * legally usable.)
+	 */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  sizeof(ItemPointerData));
+	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+
+	/*
+	 * Lehman and Yao require that the downlink to the right page, which is to
+	 * be inserted into the parent page in the second phase of a page split, be
+	 * a strict lower bound on items on the right page, and a non-strict upper
+	 * bound for items on the left page.  Assert that heap TIDs follow these
+	 * invariants, since a heap TID value is apparently needed as a
+	 * tiebreaker.
+	 */
+#ifndef DEBUG_NO_TRUNCATE
+	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#else
+
+	/*
+	 * Those invariants aren't guaranteed to hold for lastleft + firstright
+	 * heap TID attribute values when they're considered here only because
+	 * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+	 * needed as a tiebreaker).  DEBUG_NO_TRUNCATE must therefore use a heap
+	 * TID value that always works as a strict lower bound for items to the
+	 * right.  In particular, it must avoid using firstright's leading key
+	 * attribute values along with lastleft's heap TID value when lastleft's
+	 * TID happens to be greater than firstright's TID.
+	 *
+	 * (We could just use all of lastleft instead, but that would complicate
+	 * caller's free space accounting, which makes the assumption that the new
+	 * pivot must be no larger than firstright plus a MAXALIGN()'d item
+	 * pointer.)
+	 */
+	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+
+	/*
+	 * Pivot heap TID should never be fully equal to firstright.  Note that
+	 * the pivot heap TID will still end up equal to lastleft's heap TID when
+	 * that's the only value that's legally usable.
+	 */
+	ItemPointerSetOffsetNumber(pivotheaptid,
+							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#endif
+
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point.  Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			   BTScanInsert itup_key)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keepnatts;
+	ScanKey		scankey;
+
+	/*
+	 * Be consistent about the representation of BTREE_VERSION 3 tuples across
+	 * Postgres versions; don't allow new pivot tuples to have truncated key
+	 * attributes there.  This keeps things consistent and simple for
+	 * verification tools that have to handle multiple versions.
+	 */
+	if (!itup_key->heapkeyspace)
+	{
+		Assert(nkeyatts != IndexRelationGetNumberOfAttributes(rel));
+		return nkeyatts;
+	}
+
+	scankey = itup_key->scankeys;
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+											scankey->sk_collation,
+											datum1,
+											datum2)) != 0)
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
 }
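
(Stated as a rule: keep the leading attributes of firstright up to and including the first attribute that differs from lastleft; if all nkeyatts key attributes are equal, the result is nkeyatts + 1 and the caller must append a heap TID.  A standalone sketch with integer keys -- hypothetical names, and no NULL or collation handling:

/*
 * Example: lastleft = (7, 42, 5), firstright = (7, 43, 1), nkeyatts = 3
 * returns 2, so the new pivot only needs to keep (7, 43).
 */
static int
toy_keep_natts(const int *lastleft, const int *firstright, int nkeyatts)
{
	int			keepnatts = 1;

	for (int i = 0; i < nkeyatts; i++)
	{
		if (lastleft[i] != firstright[i])
			break;
		keepnatts++;
	}
	return keepnatts;		/* nkeyatts + 1 => every key attribute is equal */
}

End of aside.)
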
 
 /*
@@ -2088,15 +2329,17 @@ _bt_nonkey_truncate(Relation rel, IndexTuple itup)
  * preferred to calling here.  That's usually more convenient, and is always
  * more explicit.  Call here instead when offnum's tuple may be a negative
  * infinity tuple that uses the pre-v11 on-disk representation, or when a low
- * context check is appropriate.
+ * context check is appropriate.  This routine is as strict as possible about
+ * what is expected on each version of btree.
  */
 bool
-_bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
+_bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 {
 	int16		natts = IndexRelationGetNumberOfAttributes(rel);
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2116,16 +2359,26 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Leaf tuples that are not the page high key (non-pivot tuples)
-			 * should never be truncated
+			 * Non-pivot tuples currently never use alternative heap TID
+			 * representation -- even those within heapkeyspace indexes
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+				return false;
+
+			/*
+			 * Leaf tuples that are not the page high key (non-pivot tuples)
+			 * should never be truncated.  (Note that tupnatts must have been
+			 * inferred, rather than coming from an explicit on-disk
+			 * representation.)
+			 */
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2135,8 +2388,15 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 */
 			Assert(!P_RIGHTMOST(opaque));
 
-			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			/*
+			 * !heapkeyspace high key tuple contains only key attributes. Note
+			 * that tupnatts will only have been explicitly represented in
+			 * !heapkeyspace indexes that happen to have non-key attributes.
+			 */
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2148,7 +2408,11 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * its high key) is its negative infinity tuple.  Negative
 			 * infinity tuples are always truncated to zero attributes.  They
 			 * are a particular kind of pivot tuple.
-			 *
+			 */
+			if (heapkeyspace)
+				return tupnatts == 0;
+
+			/*
 			 * The number of attributes won't be explicitly represented if the
 			 * negative infinity tuple was generated during a page split that
 			 * occurred with a version of Postgres before v11.  There must be
@@ -2159,18 +2423,109 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == 0 ||
+			return tupnatts == 0 ||
 				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
 		{
 			/*
-			 * Tuple contains only key attributes despite on is it page high
-			 * key or not
+			 * !heapkeyspace downlink tuple with separator key contains only
+			 * key attributes.  Note that tupnatts will only have been
+			 * explicitly represented in !heapkeyspace indexes that happen to
+			 * have non-key attributes.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 
 	}
+
+	/* Handle heapkeyspace pivot tuples (excluding minus infinity items) */
+	Assert(heapkeyspace);
+
+	/*
+	 * Explicit representation of the number of attributes is mandatory with
+	 * heapkeyspace index pivot tuples, regardless of whether or not there are
+	 * non-key attributes.
+	 */
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+
+	/*
+	 * Heap TID is a tie-breaker key attribute, so it cannot be untruncated
+	 * when any other key attribute is truncated
+	 */
+	if (BTreeTupleGetHeapTID(itup) != NULL && tupnatts != nkeyatts)
+		return false;
+
+	/*
+	 * Pivot tuple must have at least one untruncated key attribute (minus
+	 * infinity pivot tuples are the only exception).  Pivot tuples can never
+	 * represent that there is a value present for a key attribute that
+	 * exceeds pg_index.indnkeyatts for the index.
+	 */
+	return tupnatts > 0 && tupnatts <= nkeyatts;
+}
+
+/*
+ *	_bt_check_third_page() -- check whether tuple fits on a btree page at all.
+ *
+ * We actually need to be able to fit three items on every page, so restrict
+ * any one item to 1/3 the per-page available space.  Note that itemsz should
+ * not include the ItemId overhead.
+ *
+ * It might be useful to apply TOAST methods rather than throw an error here.
+ * Using out of line storage would break assumptions made by suffix truncation
+ * and by contrib/amcheck, though.
+ */
+void
+_bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
+					 Page page, IndexTuple newtup)
+{
+	Size		itemsz;
+	BTPageOpaque opaque;
+
+	itemsz = MAXALIGN(IndexTupleSize(newtup));
+
+	/* Double check item size against limit */
+	if (itemsz <= BTMaxItemSize(page))
+		return;
+
+	/*
+	 * Tuple is probably too large to fit on page, but it's possible that the
+	 * index uses version 2 or version 3, or that page is an internal page, in
+	 * which case a slightly higher limit applies.
+	 */
+	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
+		return;
+
+	/*
+	 * Internal page insertions cannot fail here, because that would mean that
+	 * an earlier leaf level insertion that should have failed didn't
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (!P_ISLEAF(opaque))
+		elog(ERROR, "cannot insert oversized tuple of size %zu on internal page of index \"%s\"",
+			 itemsz, RelationGetRelationName(rel));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+			 errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
+					itemsz,
+					needheaptidspace ? BTREE_VERSION : BTREE_NOVAC_VERSION,
+					needheaptidspace ? BTMaxItemSize(page) :
+					BTMaxItemSizeNoHeapTid(page),
+					RelationGetRelationName(rel)),
+			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+					   ItemPointerGetBlockNumber(&newtup->t_tid),
+					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   RelationGetRelationName(heap)),
+			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+					 "Consider a function index of an MD5 hash of the value, "
+					 "or use full text indexing."),
+			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
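
(For reference, plugging the usual 8kB page geometry into the two limits that _bt_check_third_page() enforces shows what the heap TID reservation costs per item.  This little program recomputes them under the assumptions stated in its comments (8192-byte block, MAXALIGN of 8, 24-byte page header, 4-byte line pointers, 6-byte item pointers, 16-byte special space); the real macros live in the nbtree.h changes further down, and the actual values are platform-dependent:

#include <stdio.h>

#define TOY_MAXALIGN(x)			(((x) + 7) & ~((size_t) 7))
#define TOY_MAXALIGN_DOWN(x)	((x) & ~((size_t) 7))

int
main(void)
{
	size_t		blcksz = 8192;	/* BLCKSZ */
	size_t		pagehdr = 24;	/* SizeOfPageHeaderData */
	size_t		itemid = 4;		/* sizeof(ItemIdData) */
	size_t		itemptr = 6;	/* sizeof(ItemPointerData) */
	size_t		opaque = 16;	/* sizeof(BTPageOpaqueData) */

	size_t		noheaptid = TOY_MAXALIGN_DOWN((blcksz -
											   TOY_MAXALIGN(pagehdr + 3 * itemid) -
											   TOY_MAXALIGN(opaque)) / 3);
	size_t		withheaptid = TOY_MAXALIGN_DOWN((blcksz -
												 TOY_MAXALIGN(pagehdr + 3 * itemid +
															  3 * itemptr) -
												 TOY_MAXALIGN(opaque)) / 3);

	printf("old limit (no heap TID reservation): %zu\n", noheaptid);	/* 2712 */
	printf("new limit (heap TID reserved):       %zu\n", withheaptid);	/* 2704 */
	return 0;
}

End of aside.)
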
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index b0666b42df..876ff0c40f 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -103,7 +103,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 
 	md = BTPageGetMeta(metapg);
 	md->btm_magic = BTREE_MAGIC;
-	md->btm_version = BTREE_VERSION;
+	md->btm_version = xlrec->version;
 	md->btm_root = xlrec->root;
 	md->btm_level = xlrec->level;
 	md->btm_fastroot = xlrec->fastroot;
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -284,6 +268,8 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		OffsetNumber off;
 		IndexTuple	newitem = NULL;
 		Size		newitemsz = 0;
+		IndexTuple	left_hikey = NULL;
+		Size		left_hikeysz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 8d5c6ae0ab..fcac0cd8a9 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index f97a82ae7b..5b7637883e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4057,9 +4057,10 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required for
+	 * btree indexes, since heap TID is treated as an implicit last key
+	 * attribute in order to ensure that all keys in the index are physically
+	 * unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
@@ -4076,6 +4077,9 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
@@ -4128,6 +4132,9 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 950a61958d..9332bf4086 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -112,18 +112,44 @@ typedef struct BTMetaPageData
 #define BTPageGetMeta(p) \
 	((BTMetaPageData *) PageGetContents(p))
 
+/*
+ * The current Btree version is 4.  That's what you'll get when you create
+ * a new index.
+ *
+ * Btree version 3 was used in PostgreSQL v11.  It is mostly the same as
+ * version 4, but heap TIDs were not part of the keyspace.  Index tuples
+ * with duplicate keys could be stored in any order.  We continue to
+ * support reading and writing Btree version 3, so that such indexes don't
+ * need to be immediately re-indexed at pg_upgrade.  In order to get the new
+ * heapkeyspace semantics, however, a REINDEX is needed.
+ *
+ * Btree version 2 is the same as version 3, except for two new fields in
+ * the metapage that were introduced in version 3.  A version 2 metapage
+ * will be automatically upgraded to version 3 on the first insert to it.
+ */
 #define BTREE_METAPAGE	0		/* first page is meta */
-#define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
+#define BTREE_MAGIC		0x053162	/* magic number in metapage */
+#define BTREE_VERSION	4		/* current version number */
 #define BTREE_MIN_VERSION	2	/* minimal supported version number */
+#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_truncate() will need to enlarge
+ * a heap index tuple to make space for a tie-breaker heap TID
+ * attribute, which we account for here.
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*sizeof(ItemPointerData)) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeNoHeapTid(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
@@ -187,38 +213,71 @@ typedef struct BTMetaPageData
 #define P_FIRSTDATAKEY(opaque)	(P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)
 
 /*
+ * Notes on B-Tree tuple format, and key and non-key attributes:
+ *
  * INCLUDE B-Tree indexes have non-key attributes.  These are extra
  * attributes that may be returned by index-only scans, but do not influence
  * the order of items in the index (formally, non-key attributes are not
  * considered to be part of the key space).  Non-key attributes are only
  * present in leaf index tuples whose item pointers actually point to heap
- * tuples.  All other types of index tuples (collectively, "pivot" tuples)
- * only have key attributes, since pivot tuples only ever need to represent
- * how the key space is separated.  In general, any B-Tree index that has
- * more than one level (i.e. any index that does not just consist of a
- * metapage and a single leaf root page) must have some number of pivot
- * tuples, since pivot tuples are used for traversing the tree.
+ * tuples (non-pivot tuples).
  *
- * We store the number of attributes present inside pivot tuples by abusing
- * their item pointer offset field, since pivot tuples never need to store a
- * real offset (downlinks only need to store a block number).  The offset
- * field only stores the number of attributes when the INDEX_ALT_TID_MASK
- * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * Non-pivot tuple format:
  *
- * The 12 least significant offset bits are used to represent the number of
- * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
- * for future use (BT_RESERVED_OFFSET_MASK bits). BT_N_KEYS_OFFSET_MASK should
- * be large enough to store any number <= INDEX_MAX_KEYS.
+ *  t_tid | t_info | key values | INCLUDE columns, if any
+ *
+ * t_tid points to the heap TID, which is a tie-breaker key column as of
+ * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
+ * set.
+ *
+ * All other types of index tuples (collectively, "pivot" tuples) only
+ * have key columns, since pivot tuples only ever need to represent how
+ * the key space is separated.  In general, any B-Tree index that has more
+ * than one level (i.e. any index that does not just consist of a metapage
+ * and a single leaf root page) must have some number of pivot tuples,
+ * since pivot tuples are used for traversing the tree.  Suffix truncation
+ * can omit trailing key columns when a new pivot is formed, which makes
+ * minus infinity their logical value.  Since BTREE_VERSION 4 indexes
+ * treat heap TID as a trailing key column that ensures that all index
+ * tuples are unique, it is necessary to represent heap TID as a trailing
+ * key column in pivot tuples, though very often this can be truncated
+ * away, just like any other key column. (Actually, the heap TID is
+ * omitted rather than truncated, since its representation is different to
+ * the non-pivot representation.)
+ *
+ * Pivot tuple format:
+ *
+ *  t_tid | t_info | key values | [heap TID]
+ *
+ * We store the number of columns present inside pivot tuples by abusing
+ * their t_tid offset field, since pivot tuples never need to store a real
+ * offset (downlinks only need to store a block number in t_tid).  The
+ * offset field only stores the number of columns/attributes when the
+ * INDEX_ALT_TID_MASK bit is set, which doesn't count the trailing heap
+ * TID column sometimes stored in pivot tuples -- that's represented by
+ * the presence of BT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in t_info
+ * is always set on BTREE_VERSION 4.  BT_HEAP_TID_ATTR can only be set on
+ * BTREE_VERSION 4.
+ *
+ * In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set.  In
+ * that case, the number of key columns is implicitly the same as the number
+ * of key columns in the index.  It is never set on version 2 indexes,
+ * which predate the introduction of INCLUDE indexes. (INCLUDE indexes are
+ * the only indexes that use INDEX_ALT_TID_MASK on version 3.)
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of columns in INDEX_ALT_TID_MASK tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
+ * number of columns/attributes <= INDEX_MAX_KEYS.
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
+
+/* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +300,16 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.  Note also that the number of key attributes must be
+ * explicitly represented in heapkeyspace pivot tuples.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +318,52 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * Only BTREE_VERSION 4 indexes treat heap TID as a tie-breaker key attribute.
+ * This macro can be used with tuples from indexes that use earlier versions,
+ * even though the result won't be meaningful.  The expectation is that higher
+ * level code will ensure that the result is never used, for example by never
+ * providing a scantid that the result is compared against.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
+ * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
+ * (since all pivot tuples must as of BTREE_VERSION 4).  When non-pivot
+ * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
+ * probably also contain a heap TID at the end of the tuple.  We currently
+ * assume that a tuple with INDEX_ALT_TID_MASK set is a pivot tuple within
+ * heapkeyspace indexes (and that a tuple without it set must be a non-pivot
+ * tuple), but it might also be used by non-pivot tuples in the future.
+ * pg_upgrade'd !heapkeyspace indexes only set INDEX_ALT_TID_MASK in pivot
+ * tuples that actually originated with the truncation of one or more
+ * attributes.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   sizeof(ItemPointerData)) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -326,25 +429,55 @@ typedef BTStackData *BTStack;
  * _bt_search.  For details on its mutable state, see _bt_binsrch and
  * _bt_findinsertloc.
  *
+ * heapkeyspace indicates if we expect all keys in the index to be unique by
+ * treating heap TID as a tie-breaker attribute (i.e. the index is
+ * BTREE_VERSION 4+).  scantid should never be set when index is not a
+ * heapkeyspace index.
+ *
+ * minusinfkey controls an optimization used by heapkeyspace indexes.
+ * Searches that are not specifically interested in keys with the value minus
+ * infinity (all searches bar those performed by VACUUM for page deletion)
+ * apply the optimization by setting the field to false.  The optimization
+ * avoids unnecessarily reading the left sibling of the leaf page that
+ * matching tuples can appear on first.  Work is saved when the insertion
+ * scankey happens to search on all the untruncated "separator" key attributes
+ * for some pivot tuple, without also providing a key value for a remaining
+ * truncated-in-pivot-tuple attribute.  Reasoning about minus infinity values
+ * specifically allows this case to use a special tie-breaker, guiding search
+ * right instead of left on the next level down.  This is particularly likely
+ * to help in the common case where insertion scankey has no scantid but has
+ * values for all other attributes, especially with indexes that happen to
+ * have few distinct values (once heap TID is excluded) on each leaf page.
+ *
  * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
  * locate the first item >= scankey.  When nextkey is true, they will locate
  * the first item > scan key.
  *
- * keysz is the number of insertion scankeys present.
+ * scantid is the heap TID that is used as a final tie-breaker attribute,
+ * which may be set to NULL to indicate its absence.  When inserting new
+ * tuples, it must be set, since every tuple in the tree unambiguously belongs
+ * in one exact position, even when there are entries in the tree that are
+ * considered duplicates by external code.  Unique insertions set scantid only
+ * after unique checking indicates that it's safe to insert.  Despite the
+ * representational difference, scantid is just another insertion scankey to
+ * routines like _bt_search.
  *
- * scankeys is an array of scan key entries for attributes that are compared.
- * During insertion, there must be a scan key for every attribute, but when
- * starting a regular index scan some can be omitted.  The array is used as a
- * flexible array member, though it's sized in a way that makes it possible to
- * use stack allocations.  See nbtree/README for full details.
+ * keysz is the number of insertion scankeys present, not including scantid.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared
+ * before scantid (user-visible attributes).  During insertion, there must be
+ * a scan key for every attribute, but when starting a regular index scan some
+ * can be omitted.  The array is used as a flexible array member, though it's
+ * sized in a way that makes it possible to use stack allocations.  See
+ * nbtree/README for full details.
  */
 
 typedef struct BTScanInsertData
 {
 	/*
 	 * Mutable state used by _bt_binsrch to inexpensively repeat a binary
-	 * search on the leaf level.  Only used for insertions where
-	 * _bt_check_unique is called.
+	 * search on the leaf level when only scantid has changed.  Only used for
+	 * insertions where _bt_check_unique is called.
 	 */
 	bool		savebinsrch;
 	bool		restorebinsrch;
@@ -352,7 +485,10 @@ typedef struct BTScanInsertData
 	OffsetNumber stricthigh;
 
 	/* State used to locate a position at the leaf level */
+	bool		heapkeyspace;
+	bool		minusinfkey;
 	bool		nextkey;
+	ItemPointer scantid;		/* tiebreaker for scankeys */
 	int			keysz;			/* Size of scankeys */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
 } BTScanInsertData;
@@ -582,6 +718,7 @@ extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
+extern bool _bt_heapkeyspace(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -635,8 +772,12 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
-extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+			 IndexTuple firstright, BTScanInsert itup_key);
+extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
+				OffsetNumber offnum);
+extern void _bt_check_third_page(Relation rel, Relation heap,
+					 bool needheaptidspace, Page page, IndexTuple newtup);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index a605851c98..6320a0098f 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,7 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50 and 0x60 are unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -47,6 +46,7 @@
  */
 typedef struct xl_btree_metadata
 {
+	uint32		version;
 	BlockNumber root;
 	uint32		level;
 	BlockNumber fastroot;
@@ -80,27 +80,30 @@ typedef struct xl_btree_insert
  * whole page image.  The left page, however, is handled in the normal
  * incremental-update fashion.
  *
- * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
- * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * Note: XLOG_BTREE_SPLIT_L and XLOG_BTREE_SPLIT_R share this data record.
+ * There are two variants to indicate whether the inserted tuple went into the
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always log the left page high key because suffix
+ * truncation can generate a new leaf high key using user-defined code.  This
+ * is also necessary on internal pages, since the first right item that the
+ * left page's high key was based on will have been truncated to zero
+ * attributes in the right page (the original is unavailable from the right
+ * page).
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * An IndexTuple representing the high key of the left page must follow with
+ * either variant.
  *
  * Backup Blk 1: new right page
  *
- * The right page's data portion contains the right page's tuples in the
- * form used by _bt_restore_page.
+ * The right page's data portion contains the right page's tuples in the form
+ * used by _bt_restore_page.  This includes the new item, if it's the _R
+ * variant.  The right page's tuples also include the right page's high key
+ * with either variant (moved from the left/original page during the split),
+ * unless the split happened to be of the rightmost page on its level, where
+ * there is no high key for the new right page.
  *
  * Backup Blk 2: next block (orig page's rightlink), if any
  * Backup Blk 3: child's left sibling, if non-leaf split
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index b21298a2a6..ff443a476c 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -199,28 +199,22 @@ reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
 --
--- Test B-tree page deletion. In particular, deleting a non-leaf page.
+-- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
--- First create a tree that's at least four levels deep. The text inserted
--- is long and poorly compressible. That way only a few index tuples fit on
--- each page, allowing us to get a tall tree with fewer pages.
+-- First create a tree that's at least three levels deep (i.e. has one level
+-- between the root and leaf levels). The text inserted is long.  It won't be
+-- compressed because we use plain storage in the table.  Only a few index
+-- tuples fit on each internal page, allowing us to get a tall tree with few
+-- pages.  (A tall tree is required to trigger caching.)
+--
+-- The text column must be the leading column in the index, since suffix
+-- truncation would otherwise truncate tuples on internal pages, leaving us
+-- with a short tree.
 create table btree_tall_tbl(id int4, t text);
-create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
-insert into btree_tall_tbl
-  select g, g::text || '_' ||
-          (select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
-from generate_series(1, 100) g;
--- Delete most entries, and vacuum. This causes page deletions.
-delete from btree_tall_tbl where id < 950;
-vacuum btree_tall_tbl;
---
--- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
--- WAL record type). This happens when a "fast root" page is split.
---
--- The vacuum above should've turned the leaf page into a fast root. We just
--- need to insert some rows to cause the fast root page to split.
-insert into btree_tall_tbl (id, t)
-  select g, repeat('x', 100) from generate_series(1, 500) g;
+alter table btree_tall_tbl alter COLUMN t set storage plain;
+create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
+insert into btree_tall_tbl select g, repeat('x', 250)
+from generate_series(1, 130) g;
 --
 -- Test vacuum_cleanup_index_scale_factor
 --
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 5d4eb59a0c..54d3eee197 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3225,11 +3225,22 @@ explain (costs off)
 CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 --
+-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
+-- WAL record type). This happens when a "fast root" page is split.  This
+-- also creates coverage for nbtree FSM page recycling.
+--
+-- The vacuum above should've turned the leaf page into a fast root. We just
+-- need to insert some rows to cause the fast root page to split.
+INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
+--
 -- REINDEX (VERBOSE)
 --
 CREATE TABLE reindex_verbose(id integer primary key);
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8e50f8ffbb..8d31110b87 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 -- doesn't work: grant still exists
 DROP USER regress_dep_user1;
 ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
-DETAIL:  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
+DETAIL:  privileges for table deptest1
 privileges for database regression
-privileges for table deptest1
+owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
 DROP OWNED BY regress_dep_user1;
 DROP USER regress_dep_user1;
 \set VERBOSITY terse
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index 0e32d5c427..ac41419c7b 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -187,9 +187,9 @@ ERROR:  event trigger "regress_event_trigger" does not exist
 -- should fail, regress_evt_user owns some objects
 drop role regress_evt_user;
 ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
-DETAIL:  owner of event trigger regress_event_trigger3
+DETAIL:  owner of user mapping for regress_evt_user on server useless_server
 owner of default privileges on new relations belonging to role regress_evt_user
-owner of user mapping for regress_evt_user on server useless_server
+owner of event trigger regress_event_trigger3
 -- cleanup before next test
 -- these are all OK; the second one should emit a NOTICE
 drop event trigger if exists regress_event_trigger2;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 4d82d3a7e8..9c763ec184 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
 RESET ROLE;
 DROP ROLE regress_test_indirect;                            -- ERROR
 ERROR:  role "regress_test_indirect" cannot be dropped because some objects depend on it
-DETAIL:  owner of server s1
-privileges for foreign-data wrapper foo
+DETAIL:  privileges for foreign-data wrapper foo
+owner of server s1
 \des+
                                                                                  List of foreign servers
  Name |           Owner           | Foreign-data wrapper |                   Access privileges                   |  Type  | Version |             FDW options              | Description 
@@ -1998,9 +1998,9 @@ DROP TABLE temp_parted;
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 ERROR:  role "regress_test_role" cannot be dropped because some objects depend on it
-DETAIL:  privileges for server s4
+DETAIL:  owner of user mapping for regress_test_role on server s6
 privileges for foreign-data wrapper foo
-owner of user mapping for regress_test_role on server s6
+privileges for server s4
 DROP SERVER t1 CASCADE;
 NOTICE:  drop cascades to user mapping for public on server t1
 DROP USER MAPPING FOR regress_test_role SERVER s6;
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index 2e170497c9..bad5199d9e 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3503,8 +3503,8 @@ SELECT refclassid::regclass, deptype
 SAVEPOINT q;
 DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
 ERROR:  role "regress_rls_eve" cannot be dropped because some objects depend on it
-DETAIL:  target of policy p on table tbl1
-privileges for table tbl1
+DETAIL:  privileges for table tbl1
+target of policy p on table tbl1
 ROLLBACK TO q;
 ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
 SAVEPOINT q;
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 2b087be796..19fbfa8b72 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -84,32 +84,23 @@ reset enable_indexscan;
 reset enable_bitmapscan;
 
 --
--- Test B-tree page deletion. In particular, deleting a non-leaf page.
+-- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
 
--- First create a tree that's at least four levels deep. The text inserted
--- is long and poorly compressible. That way only a few index tuples fit on
--- each page, allowing us to get a tall tree with fewer pages.
+-- First create a tree that's at least three levels deep (i.e. has one level
+-- between the root and leaf levels). The text inserted is long.  It won't be
+-- compressed because we use plain storage in the table.  Only a few index
+-- tuples fit on each internal page, allowing us to get a tall tree with few
+-- pages.  (A tall tree is required to trigger caching.)
+--
+-- The text column must be the leading column in the index, since suffix
+-- truncation would otherwise truncate tuples on internal pages, leaving us
+-- with a short tree.
 create table btree_tall_tbl(id int4, t text);
-create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
-insert into btree_tall_tbl
-  select g, g::text || '_' ||
-          (select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
-from generate_series(1, 100) g;
-
--- Delete most entries, and vacuum. This causes page deletions.
-delete from btree_tall_tbl where id < 950;
-vacuum btree_tall_tbl;
-
---
--- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
--- WAL record type). This happens when a "fast root" page is split.
---
-
--- The vacuum above should've turned the leaf page into a fast root. We just
--- need to insert some rows to cause the fast root page to split.
-insert into btree_tall_tbl (id, t)
-  select g, repeat('x', 100) from generate_series(1, 500) g;
+alter table btree_tall_tbl alter COLUMN t set storage plain;
+create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
+insert into btree_tall_tbl select g, repeat('x', 250)
+from generate_series(1, 130) g;
 
 --
 -- Test vacuum_cleanup_index_scale_factor
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 67ecad8dd5..4487421ef3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1146,11 +1146,23 @@ explain (costs off)
 CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 
+--
+-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
+-- WAL record type). This happens when a "fast root" page is split.  This
+-- also creates coverage for nbtree FSM page recycling.
+--
+-- The vacuum above should've turned the leaf page into a fast root. We just
+-- need to insert some rows to cause the fast root page to split.
+INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
+
 --
 -- REINDEX (VERBOSE)
 --
-- 
2.17.1

v14-0006-Allow-tuples-to-be-relocated-from-root-by-amchec.patch (application/octet-stream)
From 3ee2cf2e2078525c8ae6432ebdabefba78f0e37c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 31 Jan 2019 17:40:00 -0800
Subject: [PATCH v14 6/7] Allow tuples to be relocated from root by amcheck.

Teach contrib/amcheck's bt_index_parent_check() function to take
advantage of the uniqueness property of heapkeyspace indexes in support
of a new verification option: non-pivot tuples (non-highkey tuples on
the leaf level) can optionally be relocated using a new search that
starts from the root page.

The new "relocate" verification option is exhaustive, and can therefore
make a call to bt_index_parent_check() take a lot longer.  Relocating
tuples during verification is intended as an option for backend
developers, since the corruption scenarios that it alone is uniquely
capable of detecting seem fairly far-fetched.  For example, "relocate"
verification is generally the only way of detecting corruption of the
least significant byte of a key from a pivot tuple in the root page,
since only a few tuples on a cousin leaf page are liable to "be
overlooked" by index scans.
---
 contrib/amcheck/Makefile                 |   2 +-
 contrib/amcheck/amcheck--1.1--1.2.sql    |  19 +++
 contrib/amcheck/amcheck.control          |   2 +-
 contrib/amcheck/expected/check_btree.out |   5 +-
 contrib/amcheck/sql/check_btree.sql      |   5 +-
 contrib/amcheck/verify_nbtree.c          | 157 +++++++++++++++++++++--
 doc/src/sgml/amcheck.sgml                |   7 +-
 7 files changed, 181 insertions(+), 16 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.1--1.2.sql

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index c5764b544f..dcec3b8520 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big	= amcheck
 OBJS		= verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
-DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
+DATA = amcheck--1.1--1.2.sql amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.1--1.2.sql b/contrib/amcheck/amcheck--1.1--1.2.sql
new file mode 100644
index 0000000000..de7b657f2f
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.1--1.2.sql
@@ -0,0 +1,19 @@
+/* contrib/amcheck/amcheck--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.2'" to load this file. \quit
+
+-- In order to avoid issues with dependencies when updating amcheck to 1.2,
+-- create new, overloaded version of the 1.1 function signature
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass,
+    heapallindexed boolean, relocate boolean)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want this to be available to public
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 469048403d..c6e310046d 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
 # amcheck extension
 comment = 'functions for verifying relation integrity'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/amcheck'
 relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 1e6079ddd2..687fde8fce 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -126,7 +126,8 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 (1 row)
 
 --
--- Test for multilevel page deletion/downlink present checks
+-- Test for multilevel page deletion/downlink present checks, and relocate
+-- checks
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
@@ -137,7 +138,7 @@ VACUUM delete_test_table;
 -- root"
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
-SELECT bt_index_parent_check('delete_test_table_pkey', true);
+SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
  bt_index_parent_check 
 -----------------------
  
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index 3f1e0d17ef..d33d3e6682 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -78,7 +78,8 @@ INSERT INTO bttest_multi SELECT i, i%2  FROM generate_series(1, 100000) as i;
 SELECT bt_index_parent_check('bttest_multi_idx', true);
 
 --
--- Test for multilevel page deletion/downlink present checks
+-- Test for multilevel page deletion/downlink present checks, and relocate
+-- checks
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
@@ -89,7 +90,7 @@ VACUUM delete_test_table;
 -- root"
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
-SELECT bt_index_parent_check('delete_test_table_pkey', true);
+SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
 
 --
 -- BUG #15597: must not assume consistent input toasting state when forming
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 0a005afa34..ee97463894 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -74,6 +74,8 @@ typedef struct BtreeCheckState
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
 	bool		heapallindexed;
+	/* Also relocating non-pivot tuples? */
+	bool		relocate;
 	/* Per-page context */
 	MemoryContext targetcontext;
 	/* Buffer access strategy */
@@ -123,10 +125,11 @@ PG_FUNCTION_INFO_V1(bt_index_check);
 PG_FUNCTION_INFO_V1(bt_index_parent_check);
 
 static void bt_index_check_internal(Oid indrelid, bool parentcheck,
-						bool heapallindexed);
+						bool heapallindexed, bool relocate);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool heapkeyspace, bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed,
+					 bool relocate);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -139,6 +142,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 						   IndexTuple itup);
+static bool bt_relocate_from_root(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
@@ -176,7 +180,7 @@ bt_index_check(PG_FUNCTION_ARGS)
 	if (PG_NARGS() == 2)
 		heapallindexed = PG_GETARG_BOOL(1);
 
-	bt_index_check_internal(indrelid, false, heapallindexed);
+	bt_index_check_internal(indrelid, false, heapallindexed, false);
 
 	PG_RETURN_VOID();
 }
@@ -195,11 +199,14 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
 {
 	Oid			indrelid = PG_GETARG_OID(0);
 	bool		heapallindexed = false;
+	bool		relocate = false;
 
-	if (PG_NARGS() == 2)
+	if (PG_NARGS() >= 2)
 		heapallindexed = PG_GETARG_BOOL(1);
+	if (PG_NARGS() == 3)
+		relocate = PG_GETARG_BOOL(2);
 
-	bt_index_check_internal(indrelid, true, heapallindexed);
+	bt_index_check_internal(indrelid, true, heapallindexed, relocate);
 
 	PG_RETURN_VOID();
 }
@@ -208,7 +215,8 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
  * Helper for bt_index_[parent_]check, coordinating the bulk of the work.
  */
 static void
-bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
+						bool relocate)
 {
 	Oid			heapid;
 	Relation	indrel;
@@ -266,7 +274,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	/* Check index, possibly against table it is an index on */
 	heapkeyspace = _bt_heapkeyspace(indrel);
 	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
-						 heapallindexed);
+						 heapallindexed, relocate);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -337,7 +345,7 @@ btree_index_checkable(Relation rel)
  */
 static void
 bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
-					 bool readonly, bool heapallindexed)
+					 bool readonly, bool heapallindexed, bool relocate)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -361,6 +369,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
+	state->relocate = relocate;
 
 	if (state->heapallindexed)
 	{
@@ -429,6 +438,14 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		}
 	}
 
+	Assert(!state->relocate || state->readonly);
+	if (state->relocate && !state->heapkeyspace)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("index \"%s\" does not support relocating tuples",
+						RelationGetRelationName(rel)),
+				 errhint("Only indexes initialized on PostgreSQL 12 support relocation verification.")));
+
 	/* Create context for page */
 	state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
 												 "amcheck context",
@@ -921,6 +938,32 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
+		/*
+		 * Readonly callers may optionally relocate non-pivot tuples for
+		 * heapkeyspace indexes.  A new search starting from the root
+		 * relocates every current entry in turn.
+		 */
+		if (state->relocate && P_ISLEAF(topaque) &&
+			!bt_relocate_from_root(state, itup))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("could not relocate tuple in index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+										itid, htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_minusinfkey(state->rel, itup);
 
@@ -1525,6 +1568,9 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 		 * internal pages.  In more general terms, a negative infinity item is
 		 * only negative infinity with respect to the subtree that the page is
 		 * at the root of.
+		 *
+		 * See also: bt_relocate_from_root(), which can even detect transitive
+		 * inconsistencies on cousin leaf pages.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
@@ -1925,6 +1971,101 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Search for itup in index, starting from fast root page.  itup must be a
+ * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
+ * we rely on having fully unique keys to relocate itup without visiting more
+ * than one page on each level, barring an interrupted page split, where we
+ * may have to move right.  (A concurrent page split is impossible because
+ * the caller must be a readonly caller.)
+ *
+ * This routine can detect very subtle transitive consistency issues across
+ * more than one level of the tree.  Leaf pages all have a high key (even the
+ * rightmost page has a conceptual positive infinity high key), but not a low
+ * key.  Their downlink in parent is a lower bound, which along with the high
+ * key is almost enough to detect every possible inconsistency.  A downlink
+ * separator key value won't always be available from parent, though, because
+ * the first items of internal pages are negative infinity items, truncated
+ * down to zero attributes during internal page splits.  While it's true that
+ * bt_downlink_check() and the high key check can detect most imaginable key
+ * space problems, there are remaining problems it won't detect with non-pivot
+ * tuples in cousin leaf pages.  Starting a search from the root for every
+ * existing leaf tuple detects small inconsistencies in upper levels of the
+ * tree that cannot be detected any other way.  (Besides all this, it's
+ * probably a useful testing strategy to exhaustively verify that all
+ * non-pivot tuples can be relocated in the index using the same code paths as
+ * those used by index scans.)
+ *
+ * Alternative nbtree design that could be used to perform "cousin verification"
+ * inexpensively/transitively (may make current approach clearer):
+ *
+ * A cousin leaf page has a lower bound that comes from its grandparent page
+ * rather than its parent page, as already discussed (note also that a "second
+ * cousin" leaf page gets its lower bound from its great-grandparent, and so
+ * on).  Any pivot tuple in the root page after the first tuple (which is an
+ * "absolute" negative infinity tuple, since its leftmost on the level) should
+ * separate every leaf page into <= and > pages.  Even with the current
+ * design, there should be an unbroken seam of identical-to-root-pivot high
+ * key separator values at the right edge of the <= pages, all the way down to
+ * (and including) the leaf level.  Recall that page deletion won't delete the
+ * rightmost child of a parent page unless the child page is the only child,
+ * in which case the parent is deleted with the child.
+ *
+ * If we didn't truncate the item at first/negative infinity offset to zero
+ * attributes during internal page splits then there would also be an unbroken
+ * seam of identical-to-root-pivot "low key" separator values on the left edge
+ * of the pages that are > the separator value; this alternative design would
+ * allow us to verify the same invariants directly, without ever having to
+ * cross more than one level of the tree (we'd still have to cross one level
+ * because leaf pages would still not store a low key directly, and we'd still
+ * need bitwise-equality cross checks of downlink separator in parent against
+ * the low keys in their non-leaf children).
+ */
+static bool
+bt_relocate_from_root(BtreeCheckState *state, IndexTuple itup)
+{
+	BTScanInsert key;
+	BTStack		stack;
+	Buffer		lbuf;
+	bool		exists;
+
+	/* No need to use bt_mkscankey_minusinfkey() here */
+	key = _bt_mkscankey(state->rel, itup);
+	Assert(key->heapkeyspace && key->scantid != NULL);
+
+	/*
+	 * Search from root.
+	 *
+	 * Ideally, we would arrange to only move right within _bt_search() when
+	 * an interrupted page split is detected (i.e. when the incomplete split
+	 * bit is found to be set), but for now we accept the possibility that
+	 * that could conceal certain remaining inconsistencies.
+	 */
+	Assert(state->readonly && state->relocate);
+	exists = false;
+	stack = _bt_search(state->rel, key, &lbuf, BT_READ, NULL);
+
+	if (BufferIsValid(lbuf))
+	{
+		OffsetNumber offnum;
+		Page		page;
+
+		/* Get matching tuple on leaf page */
+		offnum = _bt_binsrch(state->rel, key, lbuf);
+		/* Compare first >= matching item on leaf page, if any */
+		page = BufferGetPage(lbuf);
+		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			_bt_compare(state->rel, key, page, offnum) == 0)
+			exists = true;
+		_bt_relbuf(state->rel, lbuf);
+	}
+
+	_bt_freestack(stack);
+	pfree(key);
+
+	return exists;
+}
+
 /*
  * Is particular offset within page (whose special state is passed by caller)
  * the page negative-infinity item?
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 8bb60d5c2d..c638456638 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -112,7 +112,7 @@ ORDER BY c.relpages DESC LIMIT 10;
 
    <varlistentry>
     <term>
-     <function>bt_index_parent_check(index regclass, heapallindexed boolean) returns void</function>
+     <function>bt_index_parent_check(index regclass, heapallindexed boolean, relocate boolean) returns void</function>
      <indexterm>
       <primary>bt_index_parent_check</primary>
      </indexterm>
@@ -126,7 +126,10 @@ ORDER BY c.relpages DESC LIMIT 10;
       argument is <literal>true</literal>, the function verifies the
       presence of all heap tuples that should be found within the
       index, and that there are no missing downlinks in the index
-      structure.  The checks that can be performed by
+      structure.  When the optional <parameter>relocate</parameter>
+      argument is <literal>true</literal>, verification relocates
+      tuples on the leaf level by performing a new search from the
+      root page.  The checks that can be performed by
       <function>bt_index_parent_check</function> are a superset of the
       checks that can be performed by <function>bt_index_check</function>.
       <function>bt_index_parent_check</function> can be thought of as
-- 
2.17.1

v14-0007-DEBUG-Add-pageinspect-instrumentation.patch (application/octet-stream)
From ac5b51e2204596887c59aa57cf8122e22c62f514 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v14 7/7] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 68 +++++++++++++++----
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 ++++++
 3 files changed, 79 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..95c81c0808 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -254,9 +256,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[7];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -265,6 +267,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer htid;
+	BTPageOpaque opaque;
 
 	id = PageGetItemId(page, offset);
 
@@ -283,16 +287,53 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = natts;
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		indnkeyatts = rel->rd_index->indnkeyatts;
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (P_ISLEAF(opaque) && offset >= P_FIRSTDATAKEY(opaque))
+		htid = &itup->t_tid;
+	else if (_bt_heapkeyspace(rel))
+		htid = BTreeTupleGetHeapTID(itup);
+	else
+		htid = NULL;
+
+	if (htid)
+		values[j] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(htid),
+							 ItemPointerGetOffsetNumberNoCheck(htid));
+	else
+		values[j] = NULL;
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -366,11 +407,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -397,12 +438,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -482,7 +524,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

In reply to: Heikki Linnakangas (#60)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sun, Mar 3, 2019 at 10:02 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Some comments on
v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below.
Mostly about code comments. In general, I think a round of copy-editing
the comments, to use simpler language, would do good. The actual code
changes look good to me.

I'm delighted that the code looks good to you, and makes sense
overall. I worked hard to make the patch a natural adjunct to the
existing code, which wasn't easy.

Seems confusing to first say assertively that "*bufptr contains the page
that the new tuple unambiguously belongs to", and then immediately go on
to list a whole bunch of exceptions. Maybe just remove "unambiguously".

This is fixed in v14 of the patch series.

This happens very seldom, because you only get an incomplete split if
you crash in the middle of a page split, and that should be very rare. I
don't think we need to list more fine-grained conditions here, that just
confuses the reader.

Fixed in v14.

/*
* _bt_useduplicatepage() -- Settle for this page of duplicates?

So, this function is only used for legacy pg_upgraded indexes. The
comment implies that, but doesn't actually say it.

I made that more explicit in v14.

/*
* Get tie-breaker heap TID attribute, if any. Macro works with both pivot
* and non-pivot tuples, despite differences in how heap TID is represented.

#define BTreeTupleGetHeapTID(itup) ...

I fixed up the comments above BTreeTupleGetHeapTID() significantly.

The comment claims that "all pivot tuples must be as of BTREE_VERSION
4". I thought that all internal tuples are called pivot tuples, even on
version 3.

In my mind, "pivot tuple" is a term that describes any tuple that
contains a separator key, which could apply to any nbtree version.
It's useful to have a distinct term (to not just say "separator key
tuple") because Lehman and Yao think of separator keys as separate and
distinct from downlinks. Internal page splits actually split *between*
a separator key and a downlink. So nbtree internal page splits must
split "inside a pivot tuple", leaving its separator on the left hand
side (new high key), and its downlink on the right hand side (new
minus infinity tuple).

Pivot tuples may contain a separator key and a downlink, just a
downlink, or just a separator key (sometimes this is implicit, and the
block number is garbage). I am particular about the terminology
because the pivot tuple vs. downlink vs. separator key thing causes a
lot of confusion, particularly when you're using Lehman and Yao (or
Lanin and Shasha) to understand how things work in Postgres.

We want to have a broad term that refers to the tuples that describe
the keyspace (pivot tuples), since it's often helpful to refer to them
collectively, without seeming to contradict Lehman and Yao.

I think what this means to say is that this macro is only
used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only
have a heap TID in BTREE_VERSION 4 indexes.

My high level approach to pg_upgrade/versioning is for index scans to
*pretend* that every nbtree index (even on v2 and v3) has a heap
attribute that actually makes the keys unique. The difference is that
v4 gets to use a scantid, and actually rely on the sort order of heap
TIDs, whereas pg_upgrade'd indexes "are not allowed to look at the
heap attribute", and must never provide a scantid (they also cannot
use the !minusinfkey optimization, but this is only an optimization
that v4 indexes don't truly need). They always do the right thing
(move left) on otherwise-equal pivot tuples, since they have no
scantid.
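
Roughly speaking, that rule reduces to something like this at the point
where an insertion scankey is built (a sketch only, not the patch
verbatim -- the helper name is invented here, and unique insertions
additionally hold back scantid until after the unique check):

static ItemPointer
get_insert_scantid(Relation rel, IndexTuple itup)
{
	/* pg_upgrade'd v2/v3 indexes are never allowed to look at heap TID */
	if (!_bt_heapkeyspace(rel))
		return NULL;

	/* v4 indexes: the new tuple's own heap TID is the tie-breaker */
	return &itup->t_tid;
}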

That's why _bt_compare() can use BTreeTupleGetHeapTID() without caring
about the version the index uses. It might be NULL for a pivot tuple
in a v3 index, even though we imagine/pretend that it should have a
value set. But that doesn't matter, because higher level code knows
that !heapkeyspace indexes should never get a scantid (_bt_compare()
does Assert() that they got that detail right, though). We "have no
reason to peak", because we don't have a scantid, so index scans work
essentially the same way, regardless of the version in use.
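
To make that concrete, the tail end of _bt_compare() boils down to
something like the following sketch, reached only once every
user-visible scankey attribute has compared equal (simplified, not the
exact code from the patch -- the helper name is invented here, and the
!minusinfkey detail is left out):

static int
heap_tid_tiebreak(BTScanInsert key, IndexTuple itup)
{
	ItemPointer	heapTid = BTreeTupleGetHeapTID(itup);

	/* v2/v3 indexes and plain index scans never provide a scantid */
	if (key->scantid == NULL)
		return 0;

	/* A heap TID that was truncated away behaves as minus infinity */
	if (heapTid == NULL)
		return 1;

	return ItemPointerCompare(key->scantid, heapTid);
}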

There are a few specific cross-version things that we need to think about
outside of making sure that there is never a scantid (and !minusinfkey
optimization is unused) in < v4 indexes, but these are all related to
unique indexes. "Pretending that all indexes have a heap TID" is a
very useful mental model. Nothing really changes, even though you
might guess that changing the classic "Subtree S is described by Ki <
v <= Ki+1" invariant would need to break code in
_bt_binsrch()/_bt_compare(). Just pretend that the classic invariant
was there since the Berkeley days, and don't do anything that breaks
the useful illusion on versions before v4.

This macro (and many others in nbtree.h) is quite complicated. A static
inline function might be easier to read.

I agree that the macros are complicated, but that seems to be because
the rules are complicated. I'd rather leave the macros in place, and
improve the commentary on the rules.
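
For what it's worth, an inline-function rendering of the rules would
look roughly like this (shown only to illustrate what the macro does --
the function name is invented here, and the patch keeps the macro form):

static inline ItemPointer
BTreeTupleGetHeapTIDInline(IndexTuple itup)
{
	/* Non-pivot tuple: t_tid points straight at the heap */
	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
		return &itup->t_tid;

	/* Pivot tuple with an untruncated heap TID stored at the end */
	if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_HEAP_TID_ATTR) != 0)
		return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
							  sizeof(ItemPointerData));

	/* Pivot tuple whose heap TID was truncated away */
	return NULL;
}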

'xlmeta.version' is set incorrectly.

Oops. Fixed in v14.

I find this comment difficult to read. I suggest rewriting it to:

/*
* The current Btree version is 4. That's what you'll get when you create
* a new index.

I used your wording for this in v14, almost verbatim.

Now that the index tuple format becomes more complicated, I feel that
there should be some kind of an overview explaining the format. All the
information is there, in the comments in nbtree.h, but you have to piece
together all the details to get the overall picture. I wrote this to
keep my head straight:

v14 uses your diagrams in nbtree.h, and expands some existing
discussion of INCLUDE indexes/non-key attributes/tuple format. Let me
know what you think.

--
Peter Geoghegan

#63Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#61)
1 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

I'm looking at the first patch in the series now. I'd suggest that you
commit that very soon. It's useful on its own, and seems pretty much
ready to be committed already. I don't think it will be much affected by
whatever changes we make to the later patches, anymore.

I did some copy-editing of the code comments, see attached patch which
applies on top of v14-0001-Refactor-nbtree-insertion-scankeys.patch.
Mostly, to use more Plain English: use active voice instead of passive,
split long sentences, avoid difficult words.

I also had a few comments and questions on some details. I added them in
the same patch, marked with "HEIKKI:". Please take a look.

- Heikki

Attachments:

v14-0001-Refactor-nbtree-insertion-scankeys-HEIKKI-comments.patch (text/x-patch)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89a..eb4df2ebbe6 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -609,6 +609,9 @@ original search scankey is consulted as each index entry is sequentially
 scanned to decide whether to return the entry and whether the scan can
 stop (see _bt_checkkeys()).
 
+HEIKKI: The above probably needs some updating, now that we have a
+separate BTScanInsert struct to represent an insertion scan key.
+
 We use term "pivot" index tuples to distinguish tuples which don't point
 to heap tuples, but rather used for tree navigation.  Pivot tuples includes
 all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b3fbba276dd..2a2d6576060 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -97,9 +97,12 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
  *		UNIQUE_CHECK_EXISTING) it will throw error for a duplicate.
  *		For UNIQUE_CHECK_EXISTING we merely run the duplicate check, and
- *		don't actually insert.  If rel is a unique index, then every call
- *		here is a checkingunique call (i.e. every call does a duplicate
- *		check, though perhaps only a tentative check).
+ *		don't actually insert.
+
+HEIKKI: 'checkingunique' is a local variable in the function. Seems a bit
+weird to talk about it in the function comment. I didn't understand what
+the point of adding this sentence was, so I removed it.
+
  *
  *		The result value is only significant for UNIQUE_CHECK_PARTIAL:
  *		it must be true if the entry is known unique, else false.
@@ -285,9 +288,10 @@ top:
 		CheckForSerializableConflictIn(rel, NULL, buf);
 
 		/*
-		 * Do the insertion.  Note that itup_key contains mutable state used
-		 * by _bt_check_unique to help _bt_findinsertloc avoid repeating its
-		 * binary search.  !checkingunique case must start own binary search.
+		 * Do the insertion.  Note that itup_key contains state filled in by
+		 * _bt_check_unique to help _bt_findinsertloc avoid repeating its
+		 * binary search.  !checkingunique case must start its own binary
+		 * search.
 		 */
 		newitemoff = _bt_findinsertloc(rel, itup_key, &buf, checkingunique,
 									   itup, stack, heapRel);
@@ -311,10 +315,6 @@ top:
 /*
  *	_bt_check_unique() -- Check for violation of unique index constraint
  *
- * Sets state in itup_key sufficient for later _bt_findinsertloc() call to
- * reuse most of the work of our initial binary search to find conflicting
- * tuples.
- *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
  * conflict is detected, no return --- just ereport().  If an xact ID is
@@ -326,6 +326,10 @@ top:
  * InvalidTransactionId because we don't want to wait.  In this case we
  * set *is_unique to false if there is a potential conflict, and the
  * core code must redo the uniqueness check later.
+ *
+ * As a side-effect, sets state in itup_key that can later be used by
+ * _bt_findinsertloc() to reuse most of the binary search work we do
+ * here.
  */
 static TransactionId
 _bt_check_unique(Relation rel, BTScanInsert itup_key,
@@ -352,8 +356,8 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	/*
-	 * Save binary search bounds.  Note that this is also used within
-	 * _bt_findinsertloc() later.
+	 * Save binary search bounds.  We use them in the fastpath below, but
+	 * also in the _bt_findinsertloc() call later.
 	 */
 	itup_key->savebinsrch = true;
 	offset = _bt_binsrch(rel, itup_key, buf);
@@ -375,16 +379,16 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 		if (offset <= maxoff)
 		{
 			/*
-			 * Fastpath: _bt_binsrch() search bounds can be used to limit our
-			 * consideration to items that are definitely duplicates in most
-			 * cases (though not when original page is empty, or when initial
-			 * offset is past the end of the original page, which may indicate
-			 * that we'll have to examine a second or subsequent page).
+			 * Fastpath: In most cases, we can use _bt_binsrch search bounds
+			 * to limit our consideration to items that are definitely
+			 * duplicates.  This fastpath doesn't apply, when the original
+			 * page is empty, or when initial offset is past the end of the
+			 * original page, which may indicate that we need to examine a
+			 * second or subsequent page.
 			 *
 			 * Note that this optimization avoids calling _bt_isequal()
-			 * entirely when there are no duplicates, provided initial offset
-			 * isn't past end of the initial page (and provided page has at
-			 * least one item).
+			 * entirely when there are no duplicates, as long as the location
+			 * where the key would belong to is not at the end of the page.
 			 */
 			if (nbuf == InvalidBuffer && offset == itup_key->stricthigh)
 			{
@@ -588,6 +592,17 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 			if (P_RIGHTMOST(opaque))
 				break;
 			highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+
+			/*
+			 * HEIKKI: This assertion might fire if the user-defined opclass
+			 * is broken. It's just an assertion, so maybe that's ok. With a
+			 * broken opclass, it's obviously "garbage in, garbage out", but
+			 * we should try to behave sanely anyway. I don't remember what
+			 * our general policy on that is; should we assert, elog(ERROR),
+			 * or continue silently in that case? An elog(ERROR) or
+			 * elog(WARNING) would feel best to me, but I don't remember what
+			 * we usually do.
+			 */
 			Assert(highkeycmp <= 0);
 			if (highkeycmp != 0)
 				break;
@@ -644,30 +659,55 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
  *		Once we have chosen the page to put the key on, we'll insert it before
  *		any existing equal keys because of the way _bt_binsrch() works.
  *
- *		_bt_check_unique() callers arrange for their insertion scan key to
- *		save the progress of the last binary search performed.  No additional
- *		binary search comparisons occur in the common case where there was no
- *		existing duplicate tuple, though we may occasionally still not be able
- *		to reuse their work for our own reasons.  Even when there are garbage
- *		duplicates, very few binary search comparisons will be performed
- *		without being strictly necessary.
+ *		_bt_check_unique() saves the progress of the binary search it
+ *		performs, in the insertion scan key.  In the common case that there
+ *		were no duplicates, we don't need to do any additional binary search
+ *		comparisons here.  Though occasionally, we may still not be able to
+ *		reuse the saved state for our own reasons. Even when there are garbage
+ *		duplicates, we do very few binary search comparisons that are not
+ *		strictly necessary.
  *
- *		The caller should hold an exclusive lock on *bufptr in all cases.  On
- *		exit,  bufptr points to the chosen insert location in all cases.  If
- *		we have to move right, the lock and pin on the original page will be
- *		released, and the new page returned to the caller is exclusively
- *		locked instead.  In any case, we return the offset that caller should
- *		use to insert into the buffer pointed to by bufptr on return.
+HEIKKI:
+
+Should we mention explicitly that this binary-search reuse is only applicable
+if unique checks were performed? It's kind of implied by the fact that it's
+_bt_check_unique() that saves the state, but perhaps we should be more clear
+about it.
+
+What is a "garbage duplicate"? Same as a "dead duplicate"?
+
+The last sentence, about garbage duplicates, seems really vague. Why do we
+ever do any comparisons that are not strictly necessary? Perhaps it's best to
+just remove that last sentence.
+
+ *
+ *		On entry, *bufptr points to the first legal page where the new tuple
+ *		could be inserted.  The caller must hold an exclusive lock on *bufptr.
+ *
+ *		On exit, *bufptr points to the chosen insertion page, and the offset
+ *		within that page is returned.  If _bt_findinsertloc decides to move
+ *		right, the lock and pin on the original page is released, and the new
+ *		page returned to the caller is exclusively locked instead.
  *
  *		This is also where opportunistic microvacuuming of LP_DEAD tuples
  *		occurs.  It has to happen here, since it may invalidate a
  *		_bt_check_unique() caller's cached binary search work.
+
+HEIKKI: I don't buy the argument that microvacuuming has to happen here. You
+could easily imagine a separate function that does microvacuuming, and resets
+(or even updates) the binary-search cache in the insertion key. I agree this
+is a convenient place to do it, though.
+
  */
 static OffsetNumber
 _bt_findinsertloc(Relation rel,
 				  BTScanInsert itup_key,
 				  Buffer *bufptr,
 				  bool checkingunique,
+/* HEIKKI:
+Do we need 'checkunique' as an argument? If unique checks were not
+performed, the insertion key will simply not have saved state.
+*/
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel)
@@ -706,6 +746,30 @@ _bt_findinsertloc(Relation rel,
 				 errtableconstraint(heapRel,
 									RelationGetRelationName(rel))));
 
+	/* HEIKKI: I liked this comment that we used to have here, before this patch: */
+	/*----------
+	 * If we will need to split the page to put the item on this page,
+	 * check whether we can put the tuple somewhere to the right,
+	 * instead.  Keep scanning right until we
+	 *		(a) find a page with enough free space,
+	 *		(b) reach the last page where the tuple can legally go, or
+	 *		(c) get tired of searching.
+	 * (c) is not flippant; it is important because if there are many
+	 * pages' worth of equal keys, it's better to split one of the early
+	 * pages than to scan all the way to the end of the run of equal keys
+	 * on every insert.  We implement "get tired" as a random choice,
+	 * since stopping after scanning a fixed number of pages wouldn't work
+	 * well (we'd never reach the right-hand side of previously split
+	 * pages).  Currently the probability of moving right is set at 0.99,
+	 * which may seem too high to change the behavior much, but it does an
+	 * excellent job of preventing O(N^2) behavior with many equal keys.
+	 *----------
+	 */
+	/* HEIKKI: Maybe it's not relevant with the later patches, but at least
+	 * with just this first patch, it's still valid. I noticed that the
+	 * comment is now in _bt_useduplicatepage, it seems a bit out-of-place
+	 * there. */
+
 	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
 	for (;;)
 	{
@@ -714,29 +778,35 @@ _bt_findinsertloc(Relation rel,
 		BlockNumber rblkno;
 
 		/*
-		 * The checkingunique (restorebinsrch) case may well have established
-		 * bounds within _bt_check_unique()'s binary search that preclude the
-		 * need for a further high key check.  This fastpath isn't used when
-		 * there are no items on the existing page (other than high key), or
-		 * when it looks like the new item belongs last on the page, but it
-		 * might go on a later page instead.
+		 * An earlier _bt_check_unique() call may well have saved bounds that
+		 * we can use to skip the high key check.  This fastpath cannot be
+		 * used when there are no items on the existing page (other than the
+		 * high key), or when it looks like the new item belongs on the page
+		 * but it might go on a later page instead.
 		 */
 		if (restorebinsrch && itup_key->low <= itup_key->stricthigh &&
 			itup_key->stricthigh <= PageGetMaxOffsetNumber(page))
 			break;
 
+		/*
+		 * If this is the last page that the tuple can legally go to, stop
+		 * here.
+		 */
 		if (P_RIGHTMOST(lpageop))
 			break;
 		cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
+		if (cmpval != 0)
+			break;
 
 		/*
-		 * May have to handle case where there is a choice of which page to
-		 * place new tuple on, and we must balance space utilization as best
-		 * we can.
+		 * Otherwise, we have a choice to insert here, or move right to a
+		 * later page.  Try to balance space utilization the best we can.
 		 */
-		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
-												&restorebinsrch, itemsz))
+		if (_bt_useduplicatepage(rel, heapRel, buf, &restorebinsrch, itemsz))
+		{
+			/* decided to insert here */
 			break;
+		}
 
 		/*
 		 * step right to next non-dead page
@@ -786,9 +856,17 @@ _bt_findinsertloc(Relation rel,
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
 	/*
-	 * Perform microvacuuming of the page we're about to insert tuple on if it
-	 * looks like it has LP_DEAD items.  Only microvacuum when it's likely to
-	 * forestall a page split, though.
+	 * If the page we're about to insert to doesn't have enough room for the
+	 * new tuple, we will have to split it.  If it looks like the page has
+	 * LP_DEAD items, try to remove them, in hope of making room for the new
+	 * item and avoiding the split.
+
+HEIKKI: In some scenarios, if the BTP_HAS_GARBAGE flag is falsely set, we would
+try to microvacuum the page twice: first in _bt_useduplicatepage, and second
+time here. That's because _bt_vacuum_one_page() doesn't clear the flag, if
+there are in fact no LP_DEAD items. That's probably insignificant and not worth
+worrying about, but I thought I'd mention it.
+
 	 */
 	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < itemsz)
 	{
@@ -814,15 +892,15 @@ _bt_findinsertloc(Relation rel,
 /*
  *	_bt_useduplicatepage() -- Settle for this page of duplicates?
  *
- *		This function handles the question of whether or not an insertion
- *		of a duplicate into a pg_upgrade'd !heapkeyspace index should
- *		insert on the page contained in buf when a choice must be made.
- *		Preemptive microvacuuming is performed here when that could allow
- *		caller to insert on to the page in buf.
+ *		If we have the choice to insert to the current page, or to some
+ *		later page to the right, this function decides what to do.
  *
- *		Returns true if caller should proceed with insert on buf's page.
- *		Otherwise, caller should move on to the page to the right (caller
- *		must always be able to still move right following call here).
+ *		If the current page doesn't have enough free space for the new
+ *		tuple, we "microvacuum" the page, removing LP_DEAD items, in
+ *		hope that it will make enough room.
+ *
+ *		Returns true if caller should proceed with insert on the current
+ *		page.  Otherwise, caller should move on to the page to the right.
  */
 static bool
 _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
@@ -838,6 +916,10 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
 	if (PageGetFreeSpace(page) >= itemsz)
 		return true;
 
+	/*
+	 * Before considering moving right, see if we can obtain enough space by
+	 * erasing LP_DEAD items.
+	 */
 	if (P_HAS_GARBAGE(lpageop))
 	{
 		_bt_vacuum_one_page(rel, buf, heapRel);
@@ -1275,9 +1357,11 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	 * If the page we're splitting is not the rightmost page at its level in
 	 * the tree, then the first entry on the page is the high key for the
 	 * page.  We need to copy that to the right half.  Otherwise (meaning the
-	 * rightmost page case), all the items on the right half will be user data
-	 * (there is no existing high key that needs to be relocated to the new
-	 * right page).
+	 * rightmost page case), all the items on the right half will be user
+	 * data.
+	 *
+HEIKKI: I don't think the comment change you made here was needed or
+helpful, so I reverted it.
 	 */
 	rightoff = P_HIKEY;
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7940297305d..f27148eb27d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -162,8 +162,8 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		new_stack->bts_parent = stack_in;
 
 		/*
-		 * Page level 1 is lowest non-leaf page level prior to leaves.  So,
-		 * if we're on the level 1 and asked to lock leaf page in write mode,
+		 * Page level 1 is lowest non-leaf page level prior to leaves.  So, if
+		 * we're on the level 1 and asked to lock leaf page in write mode,
 		 * then lock next page in write mode, because it must be a leaf.
 		 */
 		if (opaque->btpo.level == 1 && access == BT_WRITE)
@@ -333,13 +333,14 @@ _bt_moveright(Relation rel,
  *
  * This procedure is not responsible for walking right, it just examines
  * the given page.  _bt_binsrch() has no lock or refcount side effects
- * on the buffer.  When key.savebinsrch is set, modifies mutable fields
- * of insertion scan key, so that a subsequent call where caller sets
- * key.savebinsrch can reuse the low and strict high bound of original
- * binary search.  Callers that use these fields directly must be
- * prepared for the case where stricthigh isn't on the same page (it
- * exceeds maxoff for the page), and the case where there are no items
- * on the page (high < low).
+ * on the buffer.
+ *
+ * When key.savebinsrch is set, modifies mutable fields of insertion scan
+ * key, so that a subsequent call where caller sets key.restorebinsrch can
+ * reuse the low and strict high bound of original binary search.  Callers
+ * that use these fields directly must be prepared for the case where
+ * stricthigh isn't on the same page (it exceeds maxoff for the page), and
+ * the case where there are no items on the page (high < low).
  */
 OffsetNumber
 _bt_binsrch(Relation rel,
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index e010bcdcfa9..3daf5829f82 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -103,9 +103,8 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
 
 		/*
-		 * Key arguments built when caller provides no tuple are defensively
-		 * represented as NULL values, though they should still not
-		 * participate in comparisons.
+		 * If the caller provides no tuple, the key arguments should never be
+		 * used.  Set them to NULL, anyway, to be defensive.
 		 */
 		if (i < tupnatts)
 			arg = index_getattr(itup, i + 1, itupdesc, &null);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index dc2eafb5665..45899454bba 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -323,8 +323,7 @@ typedef BTStackData *BTStack;
  * BTScanInsert is the btree-private state needed to find an initial position
  * for an indexscan, or to insert new tuples -- an "insertion scankey" (not to
  * be confused with a search scankey).  It's used to descend a B-Tree using
- * _bt_search.  For details on its mutable state, see _bt_binsrch and
- * _bt_findinsertloc.
+ * _bt_search.
  *
  * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
  * locate the first item >= scankey.  When nextkey is true, they will locate
@@ -334,9 +333,14 @@ typedef BTStackData *BTStack;
  *
  * scankeys is an array of scan key entries for attributes that are compared.
  * During insertion, there must be a scan key for every attribute, but when
- * starting a regular index scan some can be omitted.  The array is used as a
+ * starting a regular index scan, some can be omitted.  The array is used as a
  * flexible array member, though it's sized in a way that makes it possible to
  * use stack allocations.  See nbtree/README for full details.
+
+HEIKKI: I don't see anything in the README about stack allocations. What
+exactly does the README reference refer to? No code seems to actually allocate
+this in the stack, so we don't really need that.
+
  */
 
 typedef struct BTScanInsertData
@@ -344,7 +348,8 @@ typedef struct BTScanInsertData
 	/*
 	 * Mutable state used by _bt_binsrch to inexpensively repeat a binary
 	 * search on the leaf level.  Only used for insertions where
-	 * _bt_check_unique is called.
+	 * _bt_check_unique is called.  See _bt_binsrch and _bt_findinsertloc
+	 * for details.
 	 */
 	bool		savebinsrch;
 	bool		restorebinsrch;
#64Peter Geoghegan
pg@bowt.ie
In reply to: Heikki Linnakangas (#63)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Mar 5, 2019 at 3:37 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm looking at the first patch in the series now. I'd suggest that you
commit that very soon. It's useful on its own, and seems pretty much
ready to be committed already. I don't think it will be much affected by
whatever changes we make to the later patches, anymore.

I agree that the parts covered by the first patch in the series are
very unlikely to need changes, but I hesitate to commit it weeks ahead
of the other patches. Some of the things that make _bt_findinsertloc()
fast are missing for v3 indexes. The "consider secondary factors
during nbtree splits" patch actually more than compensates for that
with v3 indexes, at least in some cases, but the first patch applied
on its own will slightly regress performance. At least, I benchmarked
the first patch on its own several months ago and noticed a small
regression at the time, though I don't have the exact details at hand.
It might have been an invalid result, because I wasn't particularly
thorough at the time.

We do make some gains in the first patch (the _bt_check_unique()
thing), but we also check the high key more than we need to within
_bt_findinsertloc() for non-unique indexes. Plus, the microvacuuming
thing isn't as streamlined.

It's a lot of work to validate and revalidate the performance of a
patch like this, and I'd rather commit the first three patches within
a couple of days of each other (I can validate v3 indexes and v4
indexes separately, though). We can put off the other patches for
longer, and treat them as independent. I guess I'd also push the final
amcheck patch following the first three -- no point in holding back on
that. Then we'd be left with "Add "split after new tuple"
optimization", and "Add high key "continuescan" optimization" as
independent improvements that can be pushed at the last minute of the
final CF.

I also had a few comments and questions on some details. I added them in
the same patch, marked with "HEIKKI:". Please take a look.

Will respond now. Any point that I haven't responded to directly has
been accepted.

+HEIKKI: 'checkingunique' is a local variable in the function. Seems a bit
+weird to talk about it in the function comment. I didn't understand what
+the point of adding this sentence was, so I removed it.

Maybe there is no point in the comment you reference here, but I like
the idea of "checkingunique", because that symbol name is a common
thread between a number of functions that coordinate with each other.
It's not just a local variable in one function.

@@ -588,6 +592,17 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
if (P_RIGHTMOST(opaque))
break;
highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+
+           /*
+            * HEIKKI: This assertion might fire if the user-defined opclass
+            * is broken. It's just an assertion, so maybe that's ok. With a
+            * broken opclass, it's obviously "garbage in, garbage out", but
+            * we should try to behave sanely anyway. I don't remember what
+            * our general policy on that is; should we assert, elog(ERROR),
+            * or continue silently in that case? An elog(ERROR) or
+            * elog(WARNING) would feel best to me, but I don't remember what
+            * we usually do.
+            */
Assert(highkeycmp <= 0);
if (highkeycmp != 0)
break;

We don't really have a general policy on it. However, I don't have any
sympathy for the idea of trying to soldier on with a corrupt index. I
also don't think that it's worth making this a "can't happen" error.
Like many of my assertions, this assertion is intended to document an
invariant. I don't actually anticipate that it could ever really fail.

+Should we mention explicitly that this binary-search reuse is only applicable
+if unique checks were performed? It's kind of implied by the fact that it's
+_bt_check_unique() that saves the state, but perhaps we should be more clear
+about it.

I guess so.

+What is a "garbage duplicate"? Same as a "dead duplicate"?

Yes.

+The last sentence, about garbage duplicates, seems really vague. Why do we
+ever do any comparisons that are not strictly necessary? Perhaps it's best to
+just remove that last sentence.

Okay -- will remove.

+
+HEIKKI: I don't buy the argument that microvacuuming has to happen here. You
+could easily imagine a separate function that does microvacuuming, and resets
+(or even updates) the binary-search cache in the insertion key. I agree this
+is a convenient place to do it, though.

It wasn't supposed to be a water-tight argument. I'll just say that
it's convenient.

+/* HEIKKI:
+Do we need 'checkunique' as an argument? If unique checks were not
+performed, the insertion key will simply not have saved state.
+*/

We need it in the next patch in the series, because it's also useful
for optimizing away the high key check with non-unique indexes. We
know that _bt_moveright() was called at the leaf level, with scantid
filled in, so there is no question of needing to move right within
_bt_findinsertloc() (provided it's a heapkeyspace index).

Actually, we even need it in the first patch: we only restore a binary
search because we know that there is something to restore, and must
ask for it to be restored explicitly (anything else seems unsafe).
Maybe we can't restore it because it's not a unique index, or maybe we
can't restore it because we microvacuumed, or moved right to get free
space. I don't think that it'll be helpful to make _bt_findinsertloc()
pretend that it doesn't know exactly where the binary search bounds
come from -- it already knows plenty about unique indexes
specifically, and about how it may have to invalidate the bounds. The
whole way that it couples buffer locks is only useful for unique
indexes, so it already knows *plenty* about unique indexes
specifically.

I actually like the idea of making certain insertion scan key mutable
state relating to search bounds hidden in the case of "dynamic prefix
truncation" [1]/messages/by-id/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com -- Peter Geoghegan. Doesn't seem to make sense here, though.

+   /* HEIKKI: I liked this comment that we used to have here, before this patch: */
+   /*----------
+    * If we will need to split the page to put the item on this page,
+    * check whether we can put the tuple somewhere to the right,
+    * instead.  Keep scanning right until we
+   /* HEIKKI: Maybe it's not relevant with the later patches, but at least
+    * with just this first patch, it's still valid. I noticed that the
+    * comment is now in _bt_useduplicatepage, it seems a bit out-of-place
+    * there. */

I don't think it matters, because I don't think that the first patch
can be justified as an independent piece of work. I like the idea of
breaking up the patch series, because it makes it all easier to
understand, but the first three patches are kind of intertwined.

+HEIKKI: In some scenarios, if the BTP_HAS_GARBAGE flag is falsely set, we would
+try to microvacuum the page twice: first in _bt_useduplicatepage, and second
+time here. That's because _bt_vacuum_one_page() doesn't clear the flag, if
+there are in fact no LP_DEAD items. That's probably insignificant and not worth
+worrying about, but I thought I'd mention it.

Right. It's also true that all future insertions will reach
_bt_vacuum_one_page() and do the same again, until there either is
garbage, or until the page splits.

-    * rightmost page case), all the items on the right half will be user data
-    * (there is no existing high key that needs to be relocated to the new
-    * right page).
+    * rightmost page case), all the items on the right half will be user
+    * data.
+    *
+HEIKKI: I don't think the comment change you made here was needed or
+helpful, so I reverted it.

I thought it added something when you're looking at it from a
WAL-logging point of view. But I can live without this.

- * starting a regular index scan some can be omitted.  The array is used as a
+ * starting a regular index scan, some can be omitted.  The array is used as a
* flexible array member, though it's sized in a way that makes it possible to
* use stack allocations.  See nbtree/README for full details.
+
+HEIKKI: I don't see anything in the README about stack allocations. What
+exactly does the README reference refer to? No code seems to actually allocate
+this in the stack, so we don't really need that.

The README discusses insertion scankeys in general, though. I think
that you read it that way because you're focussed on my changes; the
comment doesn't actually imply that the README talks about the stack
allocation thing specifically. (But I can change it if you like.)

There is a stack allocation in _bt_first(). This was once just another
dynamic allocation, that called _bt_mkscankey(), but that regressed
nested loop joins, so I had to make it work the same way as before. I
noticed this about six months ago, because there was a clear impact on
the TPC-C "Stock level" transaction, which is now sometimes twice as
fast with the patch series. Note also that commit d961a568, from 2005,
changed the _bt_first() code to use a stack allocation. Besides,
sticking to a stack allocation makes the changes to _bt_first()
simpler, even though it has to duplicate a few things from
_bt_mkscankey().
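
(The pattern in question is just a worst-case-sized array member, so
that a caller like _bt_first() can declare the whole key as a local
variable. A trimmed-down sketch, with field names approximated rather
than copied from the patch:)

    typedef struct BTScanInsertData
    {
        bool        nextkey;
        int         keysz;      /* number of valid entries in scankeys[] */

        /*
         * Declared with its maximum possible size so that a plain local
         * variable works (as in _bt_first()), even though palloc'd keys
         * only need space for keysz entries.
         */
        ScanKeyData scankeys[INDEX_MAX_KEYS];
    } BTScanInsertData;

    typedef BTScanInsertData *BTScanInsert;

    /* e.g. in _bt_first(): no palloc/pfree in this hot path */
    BTScanInsertData inskey;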

I could get you a v15 that integrates your changes pretty quickly, but
I'll hold off on that for at least a few days. I have a feeling that
you'll have more feedback for me to work through before too long.

[1]: /messages/by-id/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
--
Peter Geoghegan

#65Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#64)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Mar 5, 2019 at 3:03 PM Peter Geoghegan <pg@bowt.ie> wrote:

I agree that the parts covered by the first patch in the series are
very unlikely to need changes, but I hesitate to commit it weeks ahead
of the other patches.

I know I'm stating the obvious here, but we don't have many weeks left
at this point. I have not reviewed any code, but I have been
following this thread and I'd really like to see this work go into
PostgreSQL 12, assuming it's in good enough shape. It sounds like
really good stuff.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#66Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#65)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Wed, Mar 6, 2019 at 1:37 PM Robert Haas <robertmhaas@gmail.com> wrote:

I know I'm stating the obvious here, but we don't have many weeks left
at this point. I have not reviewed any code, but I have been
following this thread and I'd really like to see this work go into
PostgreSQL 12, assuming it's in good enough shape. It sounds like
really good stuff.

Thanks!

Barring any objections, I plan to commit the first 3 patches (plus the
amcheck "relocate" patch) within 7 - 10 days (that's almost
everything). Heikki hasn't reviewed 'Add high key "continuescan"
optimization' yet, and it seems like he should take a look at that
before I proceed with it. But that seems like the least controversial
enhancement within the entire patch series, so I'm not very worried
about it.

I'm currently working on v15, which has comment-only revisions
requested by Heikki. I expect to continue to work with him to make
sure that he is happy with the presentation. I'll also need to
revalidate the performance of the patch series following recent minor
changes to the logic for choosing a split point. That can take days.
This is why I don't want to commit the first patch without committing
at least the first three all at once -- it increases the amount of
performance validation work I'll have to do considerably. (I have to
consider both v4 and v3 indexes already, which seems like enough
work.)

Two of the later patches (one of which I plan to push as part of the
first batch of commits) use heuristics to decide where to split the
page. As a Postgres contributor, I have learned to avoid inventing
heuristics, so this automatically makes me a bit uneasy. However, I
don't feel so bad about it here, on reflection. The on-disk size of
the TPC-C indexes is reduced by 35% with the 'Add "split after new
tuple" optimization' patch (I think that the entire database is
usually about 12% smaller). There simply isn't a fundamentally better
way to get the same benefit, and I'm sure that nobody will argue that
we should just accept the fact that the most influential database
benchmark of all time has a big index bloat problem with Postgres.
That would be crazy.

That said, it's not impossible that somebody will shout at me because
my heuristics made their index bloated. I can't see how that could
happen, but I am prepared. I can always adjust the heuristics when new
information comes to light. I have fairly thorough test cases that
should allow me to do this without regressing anything else. This is a
risk that can be managed sensibly.

There is no gnawing ambiguity about the on-disk changes laid down in
the second patch (nor the first patch), though. Making on-disk changes
is always a bit scary, but making the keys unique is clearly a big
improvement architecturally, as it brings nbtree closer to the Lehman
& Yao design without breaking anything for v3 indexes (v3 indexes
simply aren't allowed to use a heap TID in their scankey). Unique keys
also allow amcheck to relocate every tuple in the index from the root
page, using the same code path as regular index scans. We'll be
relying on the uniqueness of keys within amcheck from the beginning,
before anybody teaches nbtree to perform retail index tuple deletion.

--
Peter Geoghegan

#67Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#64)
1 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 06/03/2019 04:03, Peter Geoghegan wrote:

On Tue, Mar 5, 2019 at 3:37 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm looking at the first patch in the series now. I'd suggest that you
commit that very soon. It's useful on its own, and seems pretty much
ready to be committed already. I don't think it will be much affected by
whatever changes we make to the later patches, anymore.

After staring at the first patch for a bit longer, a few things started to
bother me:

* The new struct is called BTScanInsert, but it's used for searches,
too. It makes sense when you read the README, which explains the
difference between "search scan keys" and "insertion scan keys", but now
that we have a separate struct for this, perhaps we should give insertion
scan keys a less confusing name. I don't know what to suggest, though.
"Positioning key"?

* We store the binary search bounds in BTScanInsertData, but they're
only used during insertions.

* The binary search bounds are specific for a particular buffer. But
that buffer is passed around separately from the bounds. It seems easy
to have them go out of sync, so that you try to use the cached bounds
for a different page. The savebinsrch and restorebinsrch flags are used to deal
with that, but it is pretty complicated.

I came up with the attached (against master), which addresses the 2nd
and 3rd points. I added a whole new BTInsertStateData struct, to hold
the binary search bounds. BTScanInsert now only holds the 'scankeys'
array, and the 'nextkey' flag. The new BTInsertStateData struct also
holds the current buffer we're considering to insert to, and a
'bounds_valid' flag to indicate if the saved bounds are valid for the
current buffer. That way, it's more straightforward to clear the
'bounds_valid' flag whenever we move right.
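
(Reconstructing it from the fields referenced in the attached patch, the
new working area looks roughly like this; the real definition may differ
in its details and comments:)

    typedef struct BTInsertStateData
    {
        IndexTuple  itup;           /* new tuple to insert */
        Size        itemsz;         /* MAXALIGN()'d size of itup */
        BTScanInsert itup_key;      /* insertion scan key built from itup */

        Buffer      buf;            /* current page considered for the insert */

        /*
         * Cached binary search bounds for buf, set by _bt_binsrch_insert().
         * Only trustworthy while bounds_valid is set; cleared e.g. whenever
         * we move right.
         */
        bool        bounds_valid;
        OffsetNumber low;
        OffsetNumber stricthigh;
    } BTInsertStateData;

    typedef BTInsertStateData *BTInsertState;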

I made a copy of _bt_binsrch, called _bt_binsrch_insert. It does the binary
search like _bt_binsrch does, but the bounds caching is only done in
_bt_binsrch_insert. Seems more clear to have separate functions for them
now, even though there's some duplication.

+/* HEIKKI:
+Do we need 'checkunique' as an argument? If unique checks were not
+performed, the insertion key will simply not have saved state.
+*/

We need it in the next patch in the series, because it's also useful
for optimizing away the high key check with non-unique indexes. We
know that _bt_moveright() was called at the leaf level, with scantid
filled in, so there is no question of needing to move right within
_bt_findinsertloc() (provided it's a heapkeyspace index).

Hmm. Perhaps it would be better to move the call to _bt_binsrch (or
_bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So
that _bt_findinsertloc would only be responsible for finding the correct
page to insert to. So the overall code, after patch #2, would be like:

/*
* Do the insertion. First move right to find the correct page to
* insert to, if necessary. If we're inserting to a non-unique index,
* _bt_search() already did this when it checked if a move to the
* right was required for leaf page. Insertion scankey's scantid
* would have been filled out at the time. On a unique index, the
* current buffer is the first buffer containing duplicates, however,
* so we may need to move right to the correct location for this
* tuple.
*/
if (checkingunique || itup_key->heapkeyspace)
_bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
newitemoff, false);

Does this make sense?

Actually, we even need it in the first patch: we only restore a binary
search because we know that there is something to restore, and must
ask for it to be restored explicitly (anything else seems unsafe).
Maybe we can't restore it because it's not a unique index, or maybe we
can't restore it because we microvacuumed, or moved right to get free
space. I don't think that it'll be helpful to make _bt_findinsertloc()
pretend that it doesn't know exactly where the binary search bounds
come from -- it already knows plenty about unique indexes
specifically, and about how it may have to invalidate the bounds. The
whole way that it couples buffer locks is only useful for unique
indexes, so it already knows *plenty* about unique indexes
specifically.

The attached new version simplifies this, IMHO. The bounds and the
current buffer go together in the same struct, so it's easier to keep
track of whether the bounds are valid or not.

- * starting a regular index scan some can be omitted.  The array is used as a
+ * starting a regular index scan, some can be omitted.  The array is used as a
* flexible array member, though it's sized in a way that makes it possible to
* use stack allocations.  See nbtree/README for full details.
+
+HEIKKI: I don't see anything in the README about stack allocations. What
+exactly does the README reference refer to? No code seems to actually allocate
+this in the stack, so we don't really need that.

The README discusses insertion scankeys in general, though. I think
that you read it that way because you're focussed on my changes, and
not because it actually implies that the README talks about the stack
thing specifically. (But I can change it if you like.)

There is a stack allocation in _bt_first(). This was once just another
dynamic allocation, that called _bt_mkscankey(), but that regressed
nested loop joins, so I had to make it work the same way as before.

Ah, gotcha, I missed that.

- Heikki

Attachments:

v14-heikki-2-0001-Refactor-nbtree-insertion-scankeys.patch (text/x-patch)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 1c7466c8158..9ec023fae38 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -126,9 +126,9 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
-static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+static BTScanInsert bt_right_page_check_scankey(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
@@ -136,14 +136,14 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber upperbound);
 static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber lowerbound);
 static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
+							   BTScanInsert key,
+							   Page nontarget,
 							   OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
 
@@ -835,8 +835,8 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		BTScanInsert skey;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -1018,7 +1018,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
-			ScanKey		rightkey;
+			BTScanInsert	rightkey;
 
 			/* Get item in next/right page */
 			rightkey = bt_right_page_check_scankey(state);
@@ -1070,7 +1070,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, skey, childblock);
 		}
 	}
 
@@ -1099,11 +1099,12 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
+static BTScanInsert
 bt_right_page_check_scankey(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
+	IndexTuple	firstitup;
 	BlockNumber targetnext;
 	Page		rightpage;
 	OffsetNumber nline;
@@ -1291,8 +1292,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * Return first real item scankey.  Note that this relies on right page
 	 * memory remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
+	return _bt_mkscankey(state->rel, firstitup);
 }
 
 /*
@@ -1305,8 +1306,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  * verification this way around is much more practical.
  */
 static void
-bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1411,8 +1412,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1760,13 +1760,12 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
+invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
@@ -1779,13 +1778,12 @@ invariant_leq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
+invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
 	return cmp >= 0;
 }
@@ -1801,14 +1799,12 @@ invariant_geq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							   Page nontarget, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
 	return cmp <= 0;
 }
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69b89a..eb4df2ebbe6 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -609,6 +609,9 @@ original search scankey is consulted as each index entry is sequentially
 scanned to decide whether to return the entry and whether the scan can
 stop (see _bt_checkkeys()).
 
+HEIKKI: The above probably needs some updating, now that we have a
+separate BTScanInsert struct to represent an insertion scan key.
+
 We use term "pivot" index tuples to distinguish tuples which don't point
 to heap tuples, but rather used for tree navigation.  Pivot tuples includes
 all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 5c2b8034f5e..c3a6c883449 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -51,19 +51,15 @@ typedef struct
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
-static TransactionId _bt_check_unique(Relation rel, IndexTuple itup,
-				 Relation heapRel, Buffer buf, OffsetNumber offset,
-				 ScanKey itup_scankey,
+static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
+				 Relation heapRel,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken);
-static void _bt_findinsertloc(Relation rel,
-				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
-				  IndexTuple newtup,
+static OffsetNumber _bt_findinsertloc(Relation rel,
+				  BTInsertState insertstate,
 				  BTStack stack,
 				  Relation heapRel);
+static bool _bt_useduplicatepage(Relation rel, Relation heapRel, BTInsertState insertstate);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -83,8 +79,8 @@ static void _bt_checksplitloc(FindSplitData *state,
 				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey);
+static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key,
+			Page page, OffsetNumber offnum);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
@@ -110,18 +106,28 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel)
 {
 	bool		is_unique = false;
-	int			indnkeyatts;
-	ScanKey		itup_scankey;
+	BTInsertStateData insertstate;
+	BTScanInsert itup_key;
 	BTStack		stack = NULL;
 	Buffer		buf;
-	OffsetNumber offset;
 	bool		fastpath;
+	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);
 
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	Assert(indnkeyatts != 0);
+	/* we need a search key to do our search, so build one */
+	itup_key = _bt_mkscankey(rel, itup);
 
-	/* we need an insertion scan key to do our search, so build one */
-	itup_scankey = _bt_mkscankey(rel, itup);
+	/*
+	 * Fill in the BTInsertState working area, to track the current page
+	 * and position within the page, to insert to.
+	 */
+	insertstate.itup = itup;
+	insertstate.itemsz = IndexTupleSize(itup);
+	insertstate.itemsz = MAXALIGN(insertstate.itemsz);	/* be safe, PageAddItem will do this but we
+								 * need to be consistent */
+	insertstate.itup_key = itup_key;
+
+	insertstate.bounds_valid = false;
+	insertstate.buf = InvalidBuffer;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -144,10 +150,8 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	 */
 top:
 	fastpath = false;
-	offset = InvalidOffsetNumber;
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
-		Size		itemsz;
 		Page		page;
 		BTPageOpaque lpageop;
 
@@ -166,9 +170,6 @@ top:
 			page = BufferGetPage(buf);
 
 			lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
-			itemsz = IndexTupleSize(itup);
-			itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this
-										 * but we need to be consistent */
 
 			/*
 			 * Check if the page is still the rightmost leaf page, has enough
@@ -177,10 +178,9 @@ top:
 			 */
 			if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
 				!P_IGNORE(lpageop) &&
-				(PageGetFreeSpace(page) > itemsz) &&
+				(PageGetFreeSpace(page) > insertstate.itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, itup_key, page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -219,10 +219,12 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, itup_key, &buf, BT_WRITE, NULL);
 	}
 
+	insertstate.buf = buf;
+	buf = InvalidBuffer; 		/* insertstate.buf now owns the buffer */
+
 	/*
 	 * If we're not allowing duplicates, make sure the key isn't already in
 	 * the index.
@@ -244,19 +246,18 @@ top:
 	 * let the tuple in and return false for possibly non-unique, or true for
 	 * definitely unique.
 	 */
-	if (checkUnique != UNIQUE_CHECK_NO)
+	if (checkingunique)
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
-		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
+		xwait = _bt_check_unique(rel, &insertstate, heapRel,
 								 checkUnique, &is_unique, &speculativeToken);
 
 		if (TransactionIdIsValid(xwait))
 		{
 			/* Have to wait for the other guy ... */
-			_bt_relbuf(rel, buf);
+			_bt_relbuf(rel, insertstate.buf);
 
 			/*
 			 * If it's a speculative insertion, wait for it to finish (ie. to
@@ -277,6 +278,8 @@ top:
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		OffsetNumber newitemoff;
+
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -286,22 +289,22 @@ top:
 		 * This reasoning also applies to INCLUDE indexes, whose extra
 		 * attributes are not considered part of the key space.
 		 */
-		CheckForSerializableConflictIn(rel, NULL, buf);
+		CheckForSerializableConflictIn(rel, NULL, insertstate.buf);
 		/* do the insertion */
-		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+		newitemoff = _bt_findinsertloc(rel, &insertstate, stack, heapRel);
+		_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup, newitemoff,
+					   false);
 	}
 	else
 	{
 		/* just release the buffer */
-		_bt_relbuf(rel, buf);
+		_bt_relbuf(rel, insertstate.buf);
 	}
 
 	/* be tidy */
 	if (stack)
 		_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
+	pfree(itup_key);
 
 	return is_unique;
 }
@@ -309,10 +312,6 @@ top:
 /*
  *	_bt_check_unique() -- Check for violation of unique index constraint
  *
- * offset points to the first possible item that could conflict. It can
- * also point to end-of-page, which means that the first tuple to check
- * is the first tuple on the next page.
- *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
  * conflict is detected, no return --- just ereport().  If an xact ID is
@@ -324,16 +323,22 @@ top:
  * InvalidTransactionId because we don't want to wait.  In this case we
  * set *is_unique to false if there is a potential conflict, and the
  * core code must redo the uniqueness check later.
+ *
+ * As a side-effect, sets state in insertstate that can later be used by
+ * _bt_findinsertloc() to reuse most of the binary search work we do
+ * here.
  */
 static TransactionId
-_bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
-				 Buffer buf, OffsetNumber offset, ScanKey itup_scankey,
+_bt_check_unique(Relation rel, BTInsertState insertstate,
+				 Relation heapRel,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	IndexTuple	itup = insertstate->itup;
+	BTScanInsert itup_key = insertstate->itup_key;
 	SnapshotData SnapshotDirty;
+	OffsetNumber offset;
 	OffsetNumber maxoff;
 	Page		page;
 	BTPageOpaque opaque;
@@ -345,10 +350,18 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 
 	InitDirtySnapshot(SnapshotDirty);
 
-	page = BufferGetPage(buf);
+	page = BufferGetPage(insertstate->buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
+	/*
+	 * Find the first tuple with the same key.
+	 *
+	 * This also saves the binary search bounds in insertstate.  We use them
+	 * in the fastpath below, but also in the _bt_findinsertloc() call later.
+	 */
+	offset = _bt_binsrch_insert(rel, insertstate);
+
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
@@ -364,6 +377,26 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 		 */
 		if (offset <= maxoff)
 		{
+			/*
+			 * Fastpath: In most cases, we can use _bt_binsrch search bounds
+			 * to limit our consideration to items that are definitely
+			 * duplicates.  This fastpath doesn't apply when the original
+			 * page is empty, or when the initial offset is past the end of the
+			 * original page, which may indicate that we need to examine a
+			 * second or subsequent page.
+			 *
+			 * Note that this optimization avoids calling _bt_isequal()
+			 * entirely when there are no duplicates, as long as the location
+			 * where the key would belong is not at the end of the page.
+			 */
+			if (nbuf == InvalidBuffer && offset == insertstate->stricthigh)
+			{
+				Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
+				Assert(insertstate->low <= insertstate->stricthigh);
+				Assert(!_bt_isequal(itupdesc, itup_key, page, offset));
+				break;
+			}
+
 			curitemid = PageGetItemId(page, offset);
 
 			/*
@@ -378,7 +411,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			 * first, so that we didn't actually get to exit any sooner
 			 * anyway. So now we just advance over killed items as quickly as
 			 * we can. We only apply _bt_isequal() when we get to a non-killed
-			 * item or the end of the page.
+			 * item.
 			 */
 			if (!ItemIdIsDead(curitemid))
 			{
@@ -391,7 +424,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 				 * in real comparison, but only for ordering/finding items on
 				 * pages. - vadim 03/24/97
 				 */
-				if (!_bt_isequal(itupdesc, page, offset, indnkeyatts, itup_scankey))
+				if (!_bt_isequal(itupdesc, itup_key, page, offset))
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
@@ -488,7 +521,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 					 * otherwise be masked by this unique constraint
 					 * violation.
 					 */
-					CheckForSerializableConflictIn(rel, NULL, buf);
+					CheckForSerializableConflictIn(rel, NULL, insertstate->buf);
 
 					/*
 					 * This is a definite conflict.  Break the tuple down into
@@ -500,7 +533,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 					 */
 					if (nbuf != InvalidBuffer)
 						_bt_relbuf(rel, nbuf);
-					_bt_relbuf(rel, buf);
+					_bt_relbuf(rel, insertstate->buf);
+					insertstate->buf = InvalidBuffer;
 
 					{
 						Datum		values[INDEX_MAX_KEYS];
@@ -540,7 +574,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 					if (nbuf != InvalidBuffer)
 						MarkBufferDirtyHint(nbuf, true);
 					else
-						MarkBufferDirtyHint(buf, true);
+						MarkBufferDirtyHint(insertstate->buf, true);
 				}
 			}
 		}
@@ -552,11 +586,25 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
+			int			highkeycmp;
+
 			/* If scankey == hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+
+			/*
+			 * HEIKKI: This assertion might fire if the user-defined opclass
+			 * is broken. It's just an assertion, so maybe that's ok. With a
+			 * broken opclass, it's obviously "garbage in, garbage out", but
+			 * we should try to behave sanely anyway. I don't remember what
+			 * our general policy on that is; should we assert, elog(ERROR),
+			 * or continue silently in that case? An elog(ERROR) or
+			 * elog(WARNING) would feel best to me, but I don't remember what
+			 * we usually do.
+			 */
+			Assert(highkeycmp <= 0);
+			if (highkeycmp != 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -611,46 +659,37 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
  *		Once we have chosen the page to put the key on, we'll insert it before
  *		any existing equal keys because of the way _bt_binsrch() works.
  *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
+ *		_bt_check_unique() saves the progress of the binary search it
+ *		performs, in 'insertstate'.  In the common case that there were no
+ *		duplicates, we don't need to do any additional binary search
+ *		comparisons here.  Though occasionally, we may still not be able to
+ *		reuse the saved state for our own reasons.
  *
- *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
- *		page returned to the caller is exclusively locked instead.
+ *		On entry, insertstate->buf points to the first legal page where the new
+ *		tuple could be inserted.  It must be exclusively-locked.
  *
- *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		On exit, insertstate->buf points to the chosen insertion page, and the
+ *		offset within that page is returned.  If _bt_findinsertloc decides to move
+ *		right, the lock and pin on the original page are released, and the new
+ *		page is exclusively locked instead.
+ *
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.
  */
-static void
+static OffsetNumber
 _bt_findinsertloc(Relation rel,
-				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
-				  IndexTuple newtup,
+				  BTInsertState insertstate,
 				  BTStack stack,
 				  Relation heapRel)
 {
-	Buffer		buf = *bufptr;
+	Buffer		buf = insertstate->buf;
+	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(buf);
-	Size		itemsz;
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
 	OffsetNumber newitemoff;
-	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	itemsz = IndexTupleSize(newtup);
-	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
-								 * need to be consistent */
-
 	/*
 	 * Check whether the item can fit on a btree page at all. (Eventually, we
 	 * ought to try to apply TOAST methods if not.) We actually need to be
@@ -659,12 +698,14 @@ _bt_findinsertloc(Relation rel,
 	 * include the ItemId.
 	 *
 	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
+	 *
+	 * HEIKKI: perhaps this should be moved to beginning of _bt_doinsert now...
 	 */
-	if (itemsz > BTMaxItemSize(page))
+	if (insertstate->itemsz > BTMaxItemSize(page))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itemsz, BTMaxItemSize(page),
+						insertstate->itemsz, BTMaxItemSize(page),
 						RelationGetRelationName(rel)),
 				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 						 "Consider a function index of an MD5 hash of the value, "
@@ -672,56 +713,43 @@ _bt_findinsertloc(Relation rel,
 				 errtableconstraint(heapRel,
 									RelationGetRelationName(rel))));
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
-	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	for (;;)
 	{
+		int			cmpval;
 		Buffer		rbuf;
 		BlockNumber rblkno;
 
 		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
+		 * An earlier _bt_check_unique() call may well have saved bounds that
+		 * we can use to skip the high key check.  This fastpath cannot be
+		 * used when there are no items on the existing page (other than the
+		 * high key), or when it looks like the new item belongs on the page
+		 * but it might go on a later page instead.
 		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
-		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
-
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
+		if (insertstate->bounds_valid && insertstate->low <= insertstate->stricthigh &&
+			insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+			break;
 
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
-		}
+		/*
+		 * If this is the last page that the tuple can legally go to, stop
+		 * here.
+		 */
+		if (P_RIGHTMOST(lpageop))
+			break;
+		cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
+		if (cmpval != 0)
+			break;
 
 		/*
-		 * nope, so check conditions (b) and (c) enumerated above
+		 * Otherwise, we have a choice to insert here, or move right to a
+		 * later page.  Try to balance space utilization the best we can.
 		 */
-		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+		if (_bt_useduplicatepage(rel, heapRel, insertstate))
+		{
+			/* decided to insert here */
 			break;
+		}
 
 		/*
 		 * step right to next non-dead page
@@ -763,27 +791,100 @@ _bt_findinsertloc(Relation rel,
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		insertstate->buf = buf;
+		insertstate->bounds_valid = false;
 	}
 
+	/* Loop should not break until correct page located */
+	Assert(P_RIGHTMOST(lpageop) ||
+		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+
 	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
+	 * If the page we're about to insert to doesn't have enough room for the
+	 * new tuple, we will have to split it.  If it looks like the page has
+	 * LP_DEAD items, try to remove them, in hope of making room for the new
+	 * item and avoiding the split.
 	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
-	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < insertstate->itemsz)
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+		insertstate->bounds_valid = false;
+	}
+
+	/*
+	 * Find the position within the page to insert to.  (This will reuse the
+	 * binary search bounds in 'insertstate', if _bt_check_unique was called
+	 * earlier and we're inserting to the first legal page.)
+	 */
+	newitemoff = _bt_binsrch_insert(rel, insertstate);
 
-	*bufptr = buf;
-	*offsetptr = newitemoff;
+	return newitemoff;
+}
+
+/*
+ *	_bt_useduplicatepage() -- Settle for this page of duplicates?
+ *
+ *		If we have the choice to insert on the current page, or on some
+ *		later page to the right, this function decides what to do.
+ *
+ *		If the current page doesn't have enough free space for the new
+ *		tuple, we "microvacuum" the page, removing LP_DEAD items, in
+ *		hope that it will make enough room.
+ *
+ *		Returns true if caller should proceed with insert on the current
+ *		page.  Otherwise, caller should move on to the page to the right.
+ */
+static bool
+_bt_useduplicatepage(Relation rel, Relation heapRel,
+					 BTInsertState insertstate)
+{
+	Buffer		buf = insertstate->buf;
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque lpageop;
+
+	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	Assert(P_ISLEAF(lpageop) && !P_RIGHTMOST(lpageop));
+
+	/* Easy case -- there is space free on this page already */
+	if (PageGetFreeSpace(page) >= insertstate->itemsz)
+		return true;
+
+	/*
+	 * Before considering moving right, see if we can obtain enough space by
+	 * erasing LP_DEAD items.
+	 */
+	if (P_HAS_GARBAGE(lpageop))
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+		insertstate->bounds_valid = false;
+
+		if (PageGetFreeSpace(page) >= insertstate->itemsz)
+			return true;		/* OK, now we have enough space */
+	}
+
+	/*----------
+	 * It's now clear that the _bt_findinsertloc() caller will need to split
+	 * the page if it is to insert the new item onto it.  The choice to move
+	 * right to the next page remains open to it, but we should not search
+	 * for free space exhaustively when there are many pages to look through.
+	 *
+	 *	_bt_findinsertloc() keeps scanning right until it:
+	 *		(a) reaches the last page where the tuple can legally go
+	 *	Or until we:
+	 *		(b) find a page with enough free space, or
+	 *		(c) get tired of searching.
+	 * (c) is not flippant; it is important because if there are many
+	 * pages' worth of equal keys, it's better to split one of the early
+	 * pages than to scan all the way to the end of the run of equal keys
+	 * on every insert.  We implement "get tired" as a random choice,
+	 * since stopping after scanning a fixed number of pages wouldn't work
+	 * well (we'd never reach the right-hand side of previously split
+	 * pages).  The probability of moving right is set at 0.99, which may
+	 * seem too high to change the behavior much, but it does an excellent
+	 * job of preventing O(N^2) behavior with many equal keys.
+	 *----------
+	 */
+	return random() <= (MAX_RANDOM_VALUE / 100);
 }
 
 /*----------
@@ -2311,24 +2412,21 @@ _bt_pgaddtup(Page page,
  * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
-_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey)
+_bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
+			OffsetNumber offnum)
 {
 	IndexTuple	itup;
+	ScanKey		scankey;
 	int			i;
 
-	/* Better be comparing to a leaf item */
+	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
 
+	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
-	for (i = 1; i <= keysz; i++)
+	for (i = 1; i <= itup_key->keysz; i++)
 	{
 		AttrNumber	attno;
 		Datum		datum;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 1d72fe54081..2694811fd66 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1370,7 +1370,7 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 */
 			if (!stack)
 			{
-				ScanKey		itup_scankey;
+				BTScanInsert itup_key;
 				ItemId		itemid;
 				IndexTuple	targetkey;
 				Buffer		lbuf;
@@ -1420,12 +1420,10 @@ _bt_pagedel(Relation rel, Buffer buf)
 				}
 
 				/* we need an insertion scan key for the search, so build one */
-				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
-				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
-				/* don't need a pin on the page */
+				itup_key = _bt_mkscankey(rel, targetkey);
+				/* get stack to leaf page by searching index */
+				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
+				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
 				/*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 92832237a8b..88b030ee778 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -24,6 +24,7 @@
 #include "utils/rel.h"
 
 
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 			 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -71,13 +72,9 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
  * but it can omit the rightmost column(s) of the index.
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * Return value is a stack of parent-page pointers.  *bufP is set to the
  * address of the leaf-page buffer, which is read-locked and pinned.
  * No locks are held on the parent pages, however!
@@ -93,8 +90,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+		   Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -130,8 +127,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  (access == BT_WRITE), stack_in,
+		*bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
 		/* if this is a leaf page, we're done */
@@ -144,7 +140,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -167,8 +163,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		new_stack->bts_parent = stack_in;
 
 		/*
-		 * Page level 1 is lowest non-leaf page level prior to leaves.  So,
-		 * if we're on the level 1 and asked to lock leaf page in write mode,
+		 * Page level 1 is lowest non-leaf page level prior to leaves.  So, if
+		 * we're on the level 1 and asked to lock leaf page in write mode,
 		 * then lock next page in write mode, because it must be a leaf.
 		 */
 		if (opaque->btpo.level == 1 && access == BT_WRITE)
@@ -198,8 +194,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  true, stack_in, BT_WRITE, snapshot);
+		*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+							  snapshot);
 	}
 
 	return stack_in;
@@ -215,16 +211,17 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * or strictly to the right of it.
  *
  * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page.  If that entry
- * is strictly less than the scankey, or <= the scankey in the nextkey=true
- * case, then we followed the wrong link and we need to move right.
+ * tree by examining the high key entry on the page.  If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key.  When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
  *
  * If forupdate is true, we will attempt to finish any incomplete splits
  * that we encounter.  This is required when locking a target page for an
@@ -241,10 +238,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  */
 Buffer
 _bt_moveright(Relation rel,
+			  BTScanInsert key,
 			  Buffer buf,
-			  int keysz,
-			  ScanKey scankey,
-			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
 			  int access,
@@ -269,7 +264,7 @@ _bt_moveright(Relation rel,
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
 	 */
-	cmpval = nextkey ? 0 : 1;
+	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
@@ -304,7 +299,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -324,13 +319,6 @@ _bt_moveright(Relation rel,
 /*
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
  * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
  * particular, this means it is possible to return a value 1 greater than the
@@ -348,12 +336,10 @@ _bt_moveright(Relation rel,
  * the given page.  _bt_binsrch() has no lock or refcount side effects
  * on the buffer.
  */
-OffsetNumber
+static OffsetNumber
 _bt_binsrch(Relation rel,
-			Buffer buf,
-			int keysz,
-			ScanKey scankey,
-			bool nextkey)
+			BTScanInsert key,
+			Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
@@ -375,7 +361,7 @@ _bt_binsrch(Relation rel,
 	 * This can never happen on an internal page, however, since they are
 	 * never empty (an internal page must have children).
 	 */
-	if (high < low)
+	if (unlikely(high < low))
 		return low;
 
 	/*
@@ -392,7 +378,7 @@ _bt_binsrch(Relation rel,
 	 */
 	high++;						/* establish the loop invariant for high */
 
-	cmpval = nextkey ? 0 : 1;	/* select comparison value */
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
 
 	while (high > low)
 	{
@@ -400,7 +386,7 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -427,14 +413,117 @@ _bt_binsrch(Relation rel,
 	return OffsetNumberPrev(low);
 }
 
-/*----------
- *	_bt_compare() -- Compare scankey to a particular tuple on the page.
+/*
+ * Like _bt_binsrch(), but with support for caching the binary search bounds.
+ * Used during insertion, and only on the leaf level.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * Caches the bounds fields in insertstate, so that a subsequent call can
+ * reuse the low and strict high bound of original binary search.  Callers
+ * that use these fields directly must be prepared for the case where
+ * stricthigh isn't on the same page (it exceeds maxoff for the page), and
+ * the case where there are no items on the page (high < low).
+ */
+OffsetNumber
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
+{
+	BTScanInsert key = insertstate->itup_key;
+	Page		page;
+	BTPageOpaque opaque;
+	OffsetNumber low,
+				high,
+				stricthigh;
+	int32		result,
+				cmpval;
+
+	page = BufferGetPage(insertstate->buf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	Assert(P_ISLEAF(opaque));
+	Assert(!key->nextkey);
+
+	if (!insertstate->bounds_valid)
+	{
+		low = P_FIRSTDATAKEY(opaque);
+		high = PageGetMaxOffsetNumber(page);
+	}
+	else
+	{
+		/* Restore result of previous binary search against same page */
+		low = insertstate->low;
+		high = insertstate->stricthigh;
+	}
+
+	/*
+	 * If there are no keys on the page, return the first available slot. Note
+	 * this covers two cases: the page is really empty (no keys), or it
+	 * contains only a high key.  The latter case is possible after vacuuming.
+	 * This can never happen on an internal page, however, since they are
+	 * never empty (an internal page must have children).
+	 */
+	if (unlikely(high < low))
+	{
+		/* Caller can't use stricthigh */
+		insertstate->bounds_valid = false;
+		return low;
+	}
+
+	/*
+	 * Binary search to find the first key on the page >= scan key. (nextkey
+	 * is always false when inserting)
+	 *
+	 * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+	 * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+	 *
+	 * We can fall out when high == low.
+	 */
+	if (!insertstate->bounds_valid)
+		high++;					/* establish the loop invariant for high */
+	stricthigh = high;			/* high initially strictly higher */
+
+	cmpval = 1;	/* select comparison value */
+
+	while (high > low)
+	{
+		OffsetNumber mid = low + ((high - low) / 2);
+
+		/* We have low <= mid < high, so mid points at a real slot */
+
+		result = _bt_compare(rel, key, page, mid);
+
+		if (result >= cmpval)
+			low = mid + 1;
+		else
+		{
+			high = mid;
+
+			/*
+			 * high can only be reused by more restrictive binary search when
+			 * it's known to be strictly greater than the original scankey
+			 */
+			if (result != 0)
+				stricthigh = high;
+		}
+	}
+
+	/*
+	 * At this point we have high == low, but be careful: they could point
+	 * past the last slot on the page.
+	 *
+	 * On a leaf page, we always return the first key >= scan key (resp. >
+	 * scan key), which could be the last slot + 1.
+	 */
+	insertstate->low = low;
+	insertstate->stricthigh = stricthigh;
+	insertstate->bounds_valid = true;
+
+	return low;
+}
+
+
+
+/*----------
+ *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
- *	keysz: number of key conditions to be checked (might be less than the
- *		number of index columns!)
  *	page/offnum: location of btree item to be compared to.
  *
  *		This routine returns:
@@ -455,17 +544,17 @@ _bt_binsrch(Relation rel,
  */
 int32
 _bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
+			BTScanInsert key,
 			Page page,
 			OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
-	int			i;
+	ScanKey		scankey;
 
 	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
@@ -488,7 +577,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	scankey = key->scankeys;
+	for (int i = 1; i <= key->keysz; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -574,8 +664,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat;
 	bool		nextkey;
 	bool		goback;
+	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
-	ScanKeyData scankeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -821,8 +911,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
 	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built in the local
-	 * scankeys[] array, using the keys identified by startKeys[].
+	 * identified above.  The insertion scankey is built using the keys
+	 * identified by startKeys[].  (Remaining insertion scankey fields are
+	 * initialized after initial-positioning strategy is finalized.)
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
 	for (i = 0; i < keysCount; i++)
@@ -850,7 +941,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				_bt_parallel_done(scan);
 				return false;
 			}
-			memcpy(scankeys + i, subkey, sizeof(ScanKeyData));
+			memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
 
 			/*
 			 * If the row comparison is the last positioning key we accepted,
@@ -882,7 +973,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					if (subkey->sk_flags & SK_ISNULL)
 						break;	/* can't use null keys */
 					Assert(keysCount < INDEX_MAX_KEYS);
-					memcpy(scankeys + keysCount, subkey, sizeof(ScanKeyData));
+					memcpy(inskey.scankeys + keysCount, subkey,
+						   sizeof(ScanKeyData));
 					keysCount++;
 					if (subkey->sk_flags & SK_ROW_END)
 					{
@@ -928,7 +1020,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				FmgrInfo   *procinfo;
 
 				procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
-				ScanKeyEntryInitializeWithInfo(scankeys + i,
+				ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
 											   cur->sk_flags,
 											   cur->sk_attno,
 											   InvalidStrategy,
@@ -949,7 +1041,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
 						 BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
 						 cur->sk_attno, RelationGetRelationName(rel));
-				ScanKeyEntryInitialize(scankeys + i,
+				ScanKeyEntryInitialize(inskey.scankeys + i,
 									   cur->sk_flags,
 									   cur->sk_attno,
 									   InvalidStrategy,
@@ -1052,12 +1144,15 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
+	/* Initialize remaining insertion scan key fields */
+	inskey.nextkey = nextkey;
+	inskey.keysz = keysCount;
+
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
@@ -1086,7 +1181,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, &inskey, buf);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e11867..759859c302e 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -254,6 +254,7 @@ typedef struct BTWriteState
 {
 	Relation	heap;
 	Relation	index;
+	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
@@ -531,6 +532,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
+	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 
 	/*
 	 * We need to log index creation in WAL iff WAL archiving/streaming is
@@ -1076,7 +1078,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
-	ScanKey		indexScanKey = NULL;
 	SortSupport sortKeys;
 
 	if (merge)
@@ -1089,7 +1090,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		/* the preparation of merge */
 		itup = tuplesort_getindextuple(btspool->sortstate, true);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-		indexScanKey = _bt_mkscankey_nodata(wstate->index);
 
 		/* Prepare SortSupport data for each column */
 		sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
@@ -1097,7 +1097,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		for (i = 0; i < keysz; i++)
 		{
 			SortSupport sortKey = sortKeys + i;
-			ScanKey		scanKey = indexScanKey + i;
+			ScanKey		scanKey = wstate->inskey->scankeys + i;
 			int16		strategy;
 
 			sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1116,8 +1116,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
 		}
 
-		_bt_freeskey(indexScanKey);
-
 		for (;;)
 		{
 			load1 = true;		/* load BTSpool next ? */
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 2c05fb5e451..4438ea29c09 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -56,34 +56,37 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().
+ *		Result is intended for use with _bt_compare().  Callers that don't
+ *		need to fill out the insertion scankey arguments (e.g. they use an own
+ *		need to fill out the insertion scankey arguments (e.g. they use their own
  */
-ScanKey
+BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
+	BTScanInsert key;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
-	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
+	int			tupnatts;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
-	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
+	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
 	 * (INCLUDE index) columns are always omitted from scan keys.
 	 */
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
+	key = palloc(offsetof(BTScanInsertData, scankeys) +
+				 sizeof(ScanKeyData) * indnkeyatts);
+	key->nextkey = false;
+	key->keysz = Min(indnkeyatts, tupnatts);
+	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
 		FmgrInfo   *procinfo;
@@ -96,56 +99,19 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
-		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
-		ScanKeyEntryInitializeWithInfo(&skey[i],
-									   flags,
-									   (AttrNumber) (i + 1),
-									   InvalidStrategy,
-									   InvalidOid,
-									   rel->rd_indcollation[i],
-									   procinfo,
-									   arg);
-	}
-
-	return skey;
-}
-
-/*
- * _bt_mkscankey_nodata
- *		Build an insertion scan key that contains 3-way comparator routines
- *		appropriate to the key datatypes, but no comparison data.  The
- *		comparison data ultimately used must match the key datatypes.
- *
- *		The result cannot be used with _bt_compare(), unless comparison
- *		data is first stored into the key entries.  Currently this
- *		routine is only called by nbtsort.c and tuplesort.c, which have
- *		their own comparison routines.
- */
-ScanKey
-_bt_mkscankey_nodata(Relation rel)
-{
-	ScanKey		skey;
-	int			indnkeyatts;
-	int16	   *indoption;
-	int			i;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	indoption = rel->rd_indoption;
-
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
-	for (i = 0; i < indnkeyatts; i++)
-	{
-		FmgrInfo   *procinfo;
-		int			flags;
 
 		/*
-		 * We can use the cached (default) support procs since no cross-type
-		 * comparison can be needed.
+		 * If the caller provides no tuple, the key arguments should never be
+		 * used.  Set them to NULL, anyway, to be defensive.
 		 */
-		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		flags = SK_ISNULL | (indoption[i] << SK_BT_INDOPTION_SHIFT);
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
+		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
 									   (AttrNumber) (i + 1),
@@ -153,19 +119,10 @@ _bt_mkscankey_nodata(Relation rel)
 									   InvalidOid,
 									   rel->rd_indcollation[i],
 									   procinfo,
-									   (Datum) 0);
+									   arg);
 	}
 
-	return skey;
-}
-
-/*
- * free a scan key made by either _bt_mkscankey or _bt_mkscankey_nodata.
- */
-void
-_bt_freeskey(ScanKey skey)
-{
-	pfree(skey);
+	return key;
 }
 
 /*
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7b10fd2974c..f97a82ae7b3 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -884,7 +884,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -919,7 +919,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	if (state->indexInfo->ii_Expressions != NULL)
 	{
@@ -945,7 +945,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -964,7 +964,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -981,7 +981,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -1014,7 +1014,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->indexRel = indexRel;
 	state->enforceUnique = enforceUnique;
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	/* Prepare SortSupport data for each column */
 	state->sortKeys = (SortSupport) palloc0(state->nKeys *
@@ -1023,7 +1023,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1042,7 +1042,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4fb92d60a12..920f05cbb0c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -319,6 +319,77 @@ typedef struct BTStackData
 
 typedef BTStackData *BTStack;
 
+/*
+ * BTScanInsert is the btree-private state needed to find an initial position
+ * for an indexscan, or to insert new tuples -- an "insertion scankey" (not
+ * to be confused with a search scankey).  It's used to descend a B-Tree
+ * using _bt_search.
+ *
+ * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
+ * locate the first item >= scankey.  When nextkey is true, they will locate
+ * the first item > scan key.
+ *
+ * keysz is the number of insertion scankeys present.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared.
+ * During insertion, there must be a scan key for every attribute, but when
+ * starting a regular index scan, some can be omitted.  The array is used as a
+ * flexible array member, though it's sized in a way that makes it possible to
+ * use stack allocations.
+ *
+ * NOTE: The scankeys array looks similar to the scan keys used outside the
+ * btree code, the kind passed to btbeginscan() and btrescan(), but it is used
+ * differently.  The sk_func pointers in BTScanInsert point to btree comparison
+ * support functions (ie, 3-way comparators that return int4 values interpreted
+ * as <0, =0, >0), instead of boolean-returning less-than functions like int4lt.
+ * Also, there is exactly one entry per index column, in the same order as the
+ * columns in the index (although there might be fewer keys than index columns,
+ * indicating that we have no constraints for the remaining index columns.)
+ * See nbtree/README for full details.
+ */
+
+typedef struct BTScanInsertData
+{
+	/* State used to locate a position at the leaf level */
+	bool		nextkey;
+	int			keysz;			/* Size of scankeys */
+	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
+} BTScanInsertData;
+
+typedef BTScanInsertData *BTScanInsert;
+
+/*
+ * Working area used during insertion, to track where to insert to.
+ *
+ * _bt_doinsert fills this in, after descending the tree to the (first legal)
+ * leaf page the new tuple belongs to. It is used to track the current decision
+ * while we perform uniqueness checks and decide the final page to insert to.
+ *
+ * (This should be private within nbtinsert.c, but it's also used by
+ * _bt_binsrch_insert, which is defined in nbtsearch.c)
+ */
+typedef struct BTInsertStateData
+{
+	/* Item we're inserting */
+	IndexTuple	itup;
+	Size		itemsz;
+	BTScanInsert itup_key;
+
+	/* leaf page to insert to */
+	Buffer		buf;
+
+	/*
+	 * Cache of bounds within the current buffer, where the new key value
+	 * belongs.  Only used for insertions where _bt_check_unique is called.
+	 * See _bt_binsrch_insert and _bt_findinsertloc for details.
+	 */
+	bool		bounds_valid;
+	OffsetNumber low;
+	OffsetNumber stricthigh;
+} BTInsertStateData;
+
+typedef BTInsertStateData *BTInsertState;
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -558,16 +629,12 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
 /*
  * prototypes for functions in nbtsearch.c
  */
-extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
-extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
+		   int access, Snapshot snapshot);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -576,9 +643,7 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 /*
  * prototypes for functions in nbtutils.c
  */
-extern ScanKey _bt_mkscankey(Relation rel, IndexTuple itup);
-extern ScanKey _bt_mkscankey_nodata(Relation rel);
-extern void _bt_freeskey(ScanKey skey);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
In reply to: Heikki Linnakangas (#67)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

After staring at the first patch for a bit longer, a few things started to
bother me:

* The new struct is called BTScanInsert, but it's used for searches,
too. It makes sense when you read the README, which explains the
difference between "search scan keys" and "insertion scan keys", but now
that we have a separate struct for this, perhaps we call insertion scan
keys with a less confusing name. I don't know what to suggest, though.
"Positioning key"?

I think that insertion scan key is fine. It's been called that for
almost twenty years. It's not like it's an intuitive concept that
could be conveyed easily if only we came up with a new, pithy name.

* We store the binary search bounds in BTScanInsertData, but they're
only used during insertions.

* The binary search bounds are specific to a particular buffer. But
that buffer is passed around separately from the bounds. It seems easy
to have them go out of sync, so that you try to use the cached bounds
for a different page. The savebinsrch and restorebinsrch are used to deal
with that, but it is pretty complicated.

That might be an improvement, but I do think that using mutable state
in the insertion scankey to restrict a search is an idea that could
work well in at least one other way. That really isn't a once-off
thing, even though it looks that way.

I came up with the attached (against master), which addresses the 2nd
and 3rd points. I added a whole new BTInsertStateData struct, to hold
the binary search bounds. BTScanInsert now only holds the 'scankeys'
array, and the 'nextkey' flag.

It will also have to store heapkeyspace, of course. And minusinfkey.
BTW, I would like to hear what you think of the idea of minusinfkey
(and the !minusinfkey optimization) specifically.

The new BTInsertStateData struct also
holds the current buffer we're considering to insert to, and a
'bounds_valid' flag to indicate if the saved bounds are valid for the
current buffer. That way, it's more straightforward to clear the
'bounds_valid' flag whenever we move right.

I'm not sure that that's an improvement. Moving right should be very
rare with my patch. gcov shows that we never move right here anymore
with the regression tests, or within _bt_check_unique() -- not once.
For a second, I thought that you forgot to invalidate the bounds_valid
flag, because you didn't pass it directly, by value to
_bt_useduplicatepage().

I made a copy of _bt_binsrch, called _bt_binsrch_insert. It does the binary
search like _bt_binsrch does, but the bounds caching is only done in
_bt_binsrch_insert. Seems more clear to have separate functions for them
now, even though there's some duplication.

I'll have to think about that some more. Having a separate
_bt_binsrch_insert() may be worth it, but I'll need to do some
profiling.

Hmm. Perhaps it would be better to move the call to _bt_binsrch (or
_bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So
that _bt_findinsertloc would only be responsible for finding the correct
page to insert to. So the overall code, after patch #2, would be like:

Maybe, but as I said it's not like _bt_findinsertloc() doesn't know
all about unique indexes already. This is pointed out in a comment in
_bt_doinsert(), even. I guess that it might have to be changed to say
_bt_findinsertpage() instead, with your new approach.

/*
 * Do the insertion. First move right to find the correct page to
 * insert to, if necessary. If we're inserting to a non-unique index,
 * _bt_search() already did this when it checked if a move to the
 * right was required for leaf page. Insertion scankey's scantid
 * would have been filled out at the time. On a unique index, the
 * current buffer is the first buffer containing duplicates, however,
 * so we may need to move right to the correct location for this
 * tuple.
 */
if (checkingunique || itup_key->heapkeyspace)
    _bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
               newitemoff, false);

Does this make sense?

I guess you're saying this because you noticed that the for (;;) loop
in _bt_findinsertloc() doesn't do that much in many cases, because of
the fastpath.

I suppose that this could be an improvement, provided there are
assertions that verify that the work "_bt_findinsertpage()" would have
done, had it been called, was in fact unnecessary (e.g., checking the
high key/rightmost-ness).

The attached new version simplifies this, IMHO. The bounds and the
current buffer go together in the same struct, so it's easier to keep
track of whether the bounds are valid or not.

I'll look into integrating this with my current draft v15 tomorrow.
Need to sleep on it.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#68)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Wed, Mar 6, 2019 at 10:54 PM Peter Geoghegan <pg@bowt.ie> wrote:

It will also have to store heapkeyspace, of course. And minusinfkey.
BTW, I would like to hear what you think of the idea of minusinfkey
(and the !minusinfkey optimization) specifically.

I'm not sure that that's an improvement. Moving right should be very
rare with my patch. gcov shows that we never move right here anymore
with the regression tests, or within _bt_check_unique() -- not once.
For a second, I thought that you forgot to invalidate the bounds_valid
flag, because you didn't pass it directly, by value to
_bt_useduplicatepage().

BTW, the !minusinfkey optimization is why we literally never move
right within _bt_findinsertloc() while the regression tests run. We
always land on the correct leaf page to begin with. (It works with
unique index insertions, where scantid is NULL when we descend the
tree.)

In general, there are two good reasons for us to move right:

* There was a concurrent page split (or page deletion), and we just
missed the downlink in the parent, and need to recover.

* We omit some columns from our scan key (at least scantid), and there
are perhaps dozens of matches -- this is not relevant to
_bt_doinsert() code.

The single value strategy used by nbtsplitloc.c does a good job of
making it unlikely that _bt_check_unique()-wise duplicates will cross
leaf pages, so there will almost always be one leaf page to visit.
And, the !minusinfkey optimization ensures that the only reason we'll
move right is because of a concurrent page split, within
_bt_moveright().

The buffer lock coupling move to the right that _bt_findinsertloc()
does should be considered an edge case with all of these measures, at
least with v4 indexes.

--
Peter Geoghegan

#70 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#68)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 07/03/2019 14:54, Peter Geoghegan wrote:

On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

After staring at the first patch for a bit longer, a few things started to
bother me:

* The new struct is called BTScanInsert, but it's used for searches,
too. It makes sense when you read the README, which explains the
difference between "search scan keys" and "insertion scan keys", but now
that we have a separate struct for this, perhaps we call insertion scan
keys with a less confusing name. I don't know what to suggest, though.
"Positioning key"?

I think that insertion scan key is fine. It's been called that for
almost twenty years. It's not like it's an intuitive concept that
could be conveyed easily if only we came up with a new, pithy name.

Yeah. It's been like that forever, but I must confess I hadn't paid any
attention to it, until now. I had not understood that text in the README
explaining the difference between search and insertion scan keys, before
looking at this patch. Not sure I ever read it with any thought. Now
that I understand it, I don't like the "insertion scan key" name.

BTW, I would like to hear what you think of the idea of minusinfkey
(and the !minusinfkey optimization) specifically.

I don't understand it :-(. I guess that's valuable feedback on its own.
I'll spend more time reading the code around that, but meanwhile, if you
can think of a simpler way to explain it in the comments, that'd be good.

The new BTInsertStateData struct also
holds the current buffer we're considering to insert to, and a
'bounds_valid' flag to indicate if the saved bounds are valid for the
current buffer. That way, it's more straightforward to clear the
'bounds_valid' flag whenever we move right.

I'm not sure that that's an improvement. Moving right should be very
rare with my patch. gcov shows that we never move right here anymore
with the regression tests, or within _bt_check_unique() -- not once.

I haven't given performance much thought, really. I don't expect this
method to be any slower, but the point of the refactoring is to make the
code easier to understand.

/*
 * Do the insertion. First move right to find the correct page to
 * insert to, if necessary. If we're inserting to a non-unique index,
 * _bt_search() already did this when it checked if a move to the
 * right was required for leaf page. Insertion scankey's scantid
 * would have been filled out at the time. On a unique index, the
 * current buffer is the first buffer containing duplicates, however,
 * so we may need to move right to the correct location for this
 * tuple.
 */
if (checkingunique || itup_key->heapkeyspace)
    _bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
               newitemoff, false);

Does this make sense?

I guess you're saying this because you noticed that the for (;;) loop
in _bt_findinsertloc() doesn't do that much in many cases, because of
the fastpath.

The idea is that _bt_findinsertpage() would not need to know whether the
unique checks were performed or not. I'd like to encapsulate all the
information about the "insert position we're considering" in the
BTInsertStateData struct. Passing 'checkingunique' as a separate
argument violates that, because when it's set, the key means something
slightly different.

Hmm. Actually, with patch #2, _bt_findinsertloc() could look at whether
'scantid' is set, instead of 'checkingunique'. That would seem better.
If it looks at 'checkingunique', it's making the assumption that if
unique checks were not performed, then we are already positioned on the
correct page, according to the heap TID. But looking at 'scantid' seems
like a more direct way of getting the same information. And then we
won't need to pass the 'checkingunique' flag as an "out-of-band" argument.

So I'm specifically suggesting that we replace this, in _bt_findinsertloc:

    if (!checkingunique && itup_key->heapkeyspace)
        break;

With this:

    if (itup_key->scantid)
        break;

And remove the 'checkingunique' argument from _bt_findinsertloc.

- Heikki

#71 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#70)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Ah, scratch that. By the time we call _bt_findinsertloc(), scantid has
already been restored, even if it was not set originally when we did
_bt_search.

My dislike here is that passing 'checkingunique' as a separate argument
acts like a "modifier", slightly changing the meaning of the insertion
scan key. If it's not set, we know we're positioned on the correct page.
Otherwise, we might not be. And it's a pretty indirect way of saying
that, as it also depends on 'heapkeyspace'. Perhaps add a flag to
BTInsertStateData, to indicate the same thing more explicitly. Something
like "bool is_final_insertion_page; /* when set, no need to move right */".

- Heikki

#72 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#61)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 05/03/2019 05:16, Peter Geoghegan wrote:

Attached is v14, which has changes based on your feedback.

As a quick check of the backwards-compatibility code, i.e.
!heapkeyspace, I hacked _bt_initmetapage to force the version number to
3, and ran the regression tests. It failed an assertion in the
'create_index' test:

(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f2943f9a535 in __GI_abort () at abort.c:79
#2 0x00005622c7d9d6b4 in ExceptionalCondition
(conditionName=0x5622c7e4cbe8 "!(_bt_check_natts(rel, key->heapkeyspace,
page, offnum))", errorType=0x5622c7e4c62a "FailedAssertion",
fileName=0x5622c7e4c734 "nbtsearch.c", lineNumber=511) at assert.c:54
#3 0x00005622c78627fb in _bt_compare (rel=0x5622c85afbe0,
key=0x7ffd7a996db0, page=0x7f293d433780 "", offnum=2) at nbtsearch.c:511
#4 0x00005622c7862640 in _bt_binsrch (rel=0x5622c85afbe0,
key=0x7ffd7a996db0, buf=4622) at nbtsearch.c:432
#5 0x00005622c7861ec9 in _bt_search (rel=0x5622c85afbe0,
key=0x7ffd7a996db0, bufP=0x7ffd7a9976d4, access=1,
snapshot=0x5622c8353740) at nbtsearch.c:142
#6 0x00005622c7863a44 in _bt_first (scan=0x5622c841e828,
dir=ForwardScanDirection) at nbtsearch.c:1183
#7 0x00005622c785f8b0 in btgettuple (scan=0x5622c841e828,
dir=ForwardScanDirection) at nbtree.c:245
#8 0x00005622c78522e3 in index_getnext_tid (scan=0x5622c841e828,
direction=ForwardScanDirection) at indexam.c:542
#9 0x00005622c7a67784 in IndexOnlyNext (node=0x5622c83ad280) at
nodeIndexonlyscan.c:120
#10 0x00005622c7a438d5 in ExecScanFetch (node=0x5622c83ad280,
accessMtd=0x5622c7a67254 <IndexOnlyNext>, recheckMtd=0x5622c7a67bc9
<IndexOnlyRecheck>) at execScan.c:95
#11 0x00005622c7a4394a in ExecScan (node=0x5622c83ad280,
accessMtd=0x5622c7a67254 <IndexOnlyNext>, recheckMtd=0x5622c7a67bc9
<IndexOnlyRecheck>) at execScan.c:145
#12 0x00005622c7a67c73 in ExecIndexOnlyScan (pstate=0x5622c83ad280) at
nodeIndexonlyscan.c:322
#13 0x00005622c7a41814 in ExecProcNodeFirst (node=0x5622c83ad280) at
execProcnode.c:445
#14 0x00005622c7a501a5 in ExecProcNode (node=0x5622c83ad280) at
../../../src/include/executor/executor.h:231
#15 0x00005622c7a50693 in fetch_input_tuple (aggstate=0x5622c83acdd0) at
nodeAgg.c:406
#16 0x00005622c7a529d9 in agg_retrieve_direct (aggstate=0x5622c83acdd0)
at nodeAgg.c:1737
#17 0x00005622c7a525a9 in ExecAgg (pstate=0x5622c83acdd0) at nodeAgg.c:1552
#18 0x00005622c7a41814 in ExecProcNodeFirst (node=0x5622c83acdd0) at
execProcnode.c:445
#19 0x00005622c7a3621d in ExecProcNode (node=0x5622c83acdd0) at
../../../src/include/executor/executor.h:231
#20 0x00005622c7a38bd9 in ExecutePlan (estate=0x5622c83acb78,
planstate=0x5622c83acdd0, use_parallel_mode=false, operation=CMD_SELECT,
sendTuples=true, numberTuples=0,
direction=ForwardScanDirection, dest=0x5622c8462088,
execute_once=true) at execMain.c:1645
#21 0x00005622c7a36872 in standard_ExecutorRun
(queryDesc=0x5622c83a9eb8, direction=ForwardScanDirection, count=0,
execute_once=true) at execMain.c:363
#22 0x00005622c7a36696 in ExecutorRun (queryDesc=0x5622c83a9eb8,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:307
#23 0x00005622c7c357dc in PortalRunSelect (portal=0x5622c8336778,
forward=true, count=0, dest=0x5622c8462088) at pquery.c:929
#24 0x00005622c7c3546f in PortalRun (portal=0x5622c8336778,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x5622c8462088, altdest=0x5622c8462088,
completionTag=0x7ffd7a997d50 "") at pquery.c:770
#25 0x00005622c7c2f029 in exec_simple_query (query_string=0x5622c82cf508
"SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2
IS NULL;") at postgres.c:1215
#26 0x00005622c7c3369a in PostgresMain (argc=1, argv=0x5622c82faee0,
dbname=0x5622c82fac50 "regression", username=0x5622c82c81e8 "heikki") at
postgres.c:4256
#27 0x00005622c7b8bcf2 in BackendRun (port=0x5622c82f3d80) at
postmaster.c:4378
#28 0x00005622c7b8b45b in BackendStartup (port=0x5622c82f3d80) at
postmaster.c:4069
#29 0x00005622c7b87633 in ServerLoop () at postmaster.c:1699
#30 0x00005622c7b86e61 in PostmasterMain (argc=3, argv=0x5622c82c6160)
at postmaster.c:1372
#31 0x00005622c7aa9925 in main (argc=3, argv=0x5622c82c6160) at main.c:228

I haven't investigated any deeper, but apparently something's broken.
This was with patch v14, without any further changes.

- Heikki

In reply to: Heikki Linnakangas (#72)
1 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Thu, Mar 7, 2019 at 12:14 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I haven't investigated any deeper, but apparently something's broken.
This was with patch v14, without any further changes.

Try it with my patch -- attached.

I think that you missed that the INCLUDE indexes handling within
nbtsort.c needs to be changed back.

--
Peter Geoghegan

Attachments:

0008-DEBUG-Force-version-3-artificially.patch (text/x-patch, charset US-ASCII)
From cdfe29c5479da6198aa918f4c373cb8a1a1acfe1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 21 Jan 2019 15:35:37 -0800
Subject: [PATCH 08/12] DEBUG: Force version 3 artificially.

---
 contrib/amcheck/expected/check_btree.out     |  7 ++-----
 contrib/pageinspect/expected/btree.out       |  2 +-
 contrib/pgstattuple/expected/pgstattuple.out | 10 +++++-----
 src/backend/access/nbtree/nbtpage.c          |  2 +-
 src/backend/access/nbtree/nbtsort.c          | 16 ++++------------
 src/backend/postmaster/postmaster.c          |  1 +
 src/test/regress/expected/dependency.out     |  4 ++--
 src/test/regress/expected/event_trigger.out  |  4 ++--
 src/test/regress/expected/foreign_data.out   |  8 ++++----
 src/test/regress/expected/rowsecurity.out    |  4 ++--
 10 files changed, 24 insertions(+), 34 deletions(-)

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 687fde8fce..60bebb1c00 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -139,11 +139,8 @@ VACUUM delete_test_table;
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
- bt_index_parent_check 
------------------------
- 
-(1 row)
-
+ERROR:  index "delete_test_table_pkey" does not support relocating tuples
+HINT:  Only indexes initialized on PostgreSQL 12 support relocation verification.
 --
 -- BUG #15597: must not assume consistent input toasting state when forming
 -- tuple.  Bloom filter must fingerprint normalized index tuple representation.
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 067e73f21a..7f003bf801 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 4
+version                 | 3
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9920dbfd40..9858ea69d4 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (4,0,8192,0,0,0,0,0,NaN,NaN)
+ (3,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 72af1ef3c1..11231dbfeb 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -57,7 +57,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 
 	metad = BTPageGetMeta(page);
 	metad->btm_magic = BTREE_MAGIC;
-	metad->btm_version = BTREE_VERSION;
+	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_root = rootbknum;
 	metad->btm_level = level;
 	metad->btm_fastroot = rootbknum;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 67cdb44cf5..69e16b33f4 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -800,6 +800,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
+	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
+	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -817,20 +819,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 
 	/*
 	 * Check whether the item can fit on a btree page at all.
-	 *
-	 * Every newly built index will treat heap TID as part of the keyspace,
-	 * which imposes the requirement that new high keys must occasionally have
-	 * a heap TID appended within _bt_truncate().  That may leave a new pivot
-	 * tuple one or two MAXALIGN() quantums larger than the original first
-	 * right tuple it's derived from.  v4 deals with the problem by decreasing
-	 * the limit on the size of tuples inserted on the leaf level by the same
-	 * small amount.  Enforce the new v4+ limit on the leaf level, and the old
-	 * limit on internal levels, since pivot tuples may need to make use of
-	 * the resered space.  This should never fail on internal pages.
 	 */
 	if (unlikely(itupsz > BTMaxItemSize(npage)))
 		_bt_check_third_page(wstate->index, wstate->heap,
-							 state->btps_level == 0, npage, itup);
+							 false, npage, itup);
 
 	/*
 	 * Check to see if page is "full".  It's definitely full if the item won't
@@ -876,7 +868,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (P_ISLEAF(opageop))
+		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
 		{
 			IndexTuple	lastleft;
 			IndexTuple	truncated;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fe599632d3..d27577508e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2556,6 +2556,7 @@ InitProcessGlobals(void)
 			((uint64) MyStartTimestamp << 12) ^
 			((uint64) MyStartTimestamp >> 20);
 	}
+	rseed = 0;
 	srandom(rseed);
 }
 
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8d31110b87..8e50f8ffbb 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 -- doesn't work: grant still exists
 DROP USER regress_dep_user1;
 ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
-DETAIL:  privileges for table deptest1
+DETAIL:  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
 privileges for database regression
-owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
+privileges for table deptest1
 DROP OWNED BY regress_dep_user1;
 DROP USER regress_dep_user1;
 \set VERBOSITY terse
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index ac41419c7b..0e32d5c427 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -187,9 +187,9 @@ ERROR:  event trigger "regress_event_trigger" does not exist
 -- should fail, regress_evt_user owns some objects
 drop role regress_evt_user;
 ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
-DETAIL:  owner of user mapping for regress_evt_user on server useless_server
+DETAIL:  owner of event trigger regress_event_trigger3
 owner of default privileges on new relations belonging to role regress_evt_user
-owner of event trigger regress_event_trigger3
+owner of user mapping for regress_evt_user on server useless_server
 -- cleanup before next test
 -- these are all OK; the second one should emit a NOTICE
 drop event trigger if exists regress_event_trigger2;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 9c763ec184..4d82d3a7e8 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
 RESET ROLE;
 DROP ROLE regress_test_indirect;                            -- ERROR
 ERROR:  role "regress_test_indirect" cannot be dropped because some objects depend on it
-DETAIL:  privileges for foreign-data wrapper foo
-owner of server s1
+DETAIL:  owner of server s1
+privileges for foreign-data wrapper foo
 \des+
                                                                                  List of foreign servers
  Name |           Owner           | Foreign-data wrapper |                   Access privileges                   |  Type  | Version |             FDW options              | Description 
@@ -1998,9 +1998,9 @@ DROP TABLE temp_parted;
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 ERROR:  role "regress_test_role" cannot be dropped because some objects depend on it
-DETAIL:  owner of user mapping for regress_test_role on server s6
+DETAIL:  privileges for server s4
 privileges for foreign-data wrapper foo
-privileges for server s4
+owner of user mapping for regress_test_role on server s6
 DROP SERVER t1 CASCADE;
 NOTICE:  drop cascades to user mapping for public on server t1
 DROP USER MAPPING FOR regress_test_role SERVER s6;
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index bad5199d9e..2e170497c9 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3503,8 +3503,8 @@ SELECT refclassid::regclass, deptype
 SAVEPOINT q;
 DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
 ERROR:  role "regress_rls_eve" cannot be dropped because some objects depend on it
-DETAIL:  privileges for table tbl1
-target of policy p on table tbl1
+DETAIL:  target of policy p on table tbl1
+privileges for table tbl1
 ROLLBACK TO q;
 ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
 SAVEPOINT q;
-- 
2.17.1

In reply to: Heikki Linnakangas (#70)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Wed, Mar 6, 2019 at 11:41 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

BTW, I would like to hear what you think of the idea of minusinfkey
(and the !minusinfkey optimization) specifically.

I don't understand it :-(. I guess that's valuable feedback on its own.
I'll spend more time reading the code around that, but meanwhile, if you
can think of a simpler way to explain it in the comments, that'd be good.

Here is another way of explaining it:

When I drew you that picture while we were in Lisbon, I mentioned to
you that the patch sometimes used a sentinel scantid value that was
greater than minus infinity, but less than any real scantid. This
could be used to force an otherwise-equal-to-pivot search to go left
rather than uselessly going right. I explained this about 30 minutes
in, when I was drawing you a picture.

Well, that sentinel heap TID thing doesn't exist any more, because it
was replaced by the !minusinfkey optimization, which is a
*generalization* of the same idea that extends it to all columns
(not just the heap TID column). That way, you never have to go to two
pages just because you searched for a value that happened to be right
at the edge of a leaf page.

Page deletion wants to assume that truncated attributes from the high
key of the page being deleted have actual negative infinity values --
negative infinity is a value, just like any other, albeit one that can
only appear in pivot tuples. This is simulated by VACUUM using
"minusinfkey = true". We go left in the parent, not right, and land on
the correct leaf page. Technically we don't compare the negative
infinity values in the pivot to the negative infinity values in the
scankey, but we return 0 just as if we had, and found them equal.
Similarly, v3 indexes specify "minusinfkey = true" in all cases,
because they always want to go left -- just like in old Postgres
versions. They don't have negative infinity values (matches can be on
either side of the all-equal pivot, so they must go left).
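
To put that in code form: the sketch below only illustrates the behavior
described above, and is not the _bt_compare() logic from the patch.
'user_attr_cmp' stands for the result of comparing the untruncated,
user-visible attributes; the other names are hypothetical.

#include <stdbool.h>

int
compare_against_truncated_pivot(int user_attr_cmp, int scankey_natts,
								int pivot_natts, bool minusinfkey)
{
	/* A user-visible attribute already decided the comparison */
	if (user_attr_cmp != 0)
		return user_attr_cmp;

	/* Pivot was not truncated relative to the scan key: genuinely equal */
	if (scankey_natts <= pivot_natts)
		return 0;

	/*
	 * !minusinfkey optimization: any real value in the scan key sorts
	 * after a truncated ("minus infinity") pivot attribute, so report
	 * "greater" and let the descent continue to the right of this pivot.
	 */
	if (!minusinfkey)
		return 1;

	/*
	 * minusinfkey case (VACUUM page deletion; all searches in v3 indexes):
	 * act as though the scan key's remaining attributes were minus
	 * infinity too, and report "equal" so the descent goes left instead.
	 */
	return 0;
}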

--
Peter Geoghegan

In reply to: Peter Geoghegan (#74)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Thu, Mar 7, 2019 at 12:37 AM Peter Geoghegan <pg@bowt.ie> wrote:

When I drew you that picture while we were in Lisbon, I mentioned to
you that the patch sometimes used a sentinel scantid value that was
greater than minus infinity, but less than any real scantid. This
could be used to force an otherwise-equal-to-pivot search to go left
rather than uselessly going right. I explained this about 30 minutes
in, when I was drawing you a picture.

I meant the opposite: it could be used to make the descent go right,
instead of going left when descending the tree and then unnecessarily
moving right on the leaf level.

As I said, moving right on the leaf level (rather than during the
descent) should only happen when it's necessary, such as when there is
a concurrent page split. It shouldn't happen reliably when searching
for the same value, unless there really are matches across multiple
leaf pages, and that's just what we have to do.

--
Peter Geoghegan

In reply to: Heikki Linnakangas (#70)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Wed, Mar 6, 2019 at 11:41 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I don't understand it :-(. I guess that's valuable feedback on its own.
I'll spend more time reading the code around that, but meanwhile, if you
can think of a simpler way to explain it in the comments, that'd be good.

One more thing on this: If you force bitmap index scans (by disabling
index-only scans and index scans with the "enable_" GUCs), then you
get EXPLAIN (ANALYZE, BUFFERS) instrumentation for the index alone
(and the heap, separately). No visibility map accesses, which obscure
the same numbers for a similar index-only scan.

You can then observe that most searches for a single value will touch
the bare minimum number of index pages. For example, if there are 3
levels in the index, you should access only 3 index pages in total,
unless there are literally hundreds of matches and we cannot avoid
storing them on more than one leaf page. The scan touches the minimum
possible number of index pages because of:

* Many duplicates strategy. (Not single value strategy, which I
incorrectly mentioned in relation to this earlier.)

* The !minusinfkey optimization, which ensures that we go to the
right of an otherwise-equal pivot tuple in an internal page, rather
than left.

* The "continuescan" high key patch, which ensures that the scan
doesn't go to the right from the first leaf page to try to find even
more matches. The high key on the same leaf page will indicate that
the scan is over, without actually visiting the sibling. (Again, I'm
assuming that your search is for a single value.)

--
Peter Geoghegan

In reply to: Heikki Linnakangas (#67)
7 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I came up with the attached (against master), which addresses the 2nd
and 3rd points. I added a whole new BTInsertStateData struct, to hold
the binary search bounds. BTScanInsert now only holds the 'scankeys'
array, and the 'nextkey' flag. The new BTInsertStateData struct also
holds the current buffer we're considering to insert to, and a
'bounds_valid' flag to indicate if the saved bounds are valid for the
current buffer. That way, it's more straightforward to clear the
'bounds_valid' flag whenever we move right.

I made a copy of _bt_binsrch, called _bt_binsrch_insert. It does the binary
search like _bt_binsrch does, but the bounds caching is only done in
_bt_binsrch_insert. It seems clearer to have separate functions for them
now, even though there's some duplication.
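
The bound-caching idea can be illustrated in isolation. The sketch below
is generic and hypothetical -- in the patch the array is a leaf page's
item array, the key is the insertion scan key, and the cached bounds live
alongside the rest of the insertion state -- but the reuse rule is the
same: saved bounds are only valid for a repeat search with the same key
over the same, unmodified page.

#include <stdbool.h>

typedef struct
{
	bool		bounds_valid;	/* do low/high describe a previous search? */
	int			low;			/* lower bound for the next search */
	int			high;			/* strict upper bound for the next search */
} CachedBounds;

/*
 * Find the first index whose item is >= key, optionally starting from
 * bounds cached by an earlier search for the same key over the same array.
 */
static int
binsrch_cached(const int *items, int nitems, int key, CachedBounds *cache)
{
	int			low = cache->bounds_valid ? cache->low : 0;
	int			high = cache->bounds_valid ? cache->high : nitems;

	while (low < high)
	{
		int			mid = low + (high - low) / 2;

		if (items[mid] < key)
			low = mid + 1;
		else
			high = mid;
	}

	/* Remember the result, so a repeat search can skip the comparisons */
	cache->bounds_valid = true;
	cache->low = low;
	cache->high = low;

	return low;
}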

Attached is v15, which does not yet integrate these changes. However,
it does integrate earlier feedback that you posted for v14. I also
cleaned up some comments within nbtsplitloc.c.

I would like to work through these other items with you
(_bt_binsrch_insert() and so on), but I think that it would be helpful
if you made an effort to understand the minusinfkey stuff first. I
spent a lot of time improving the explanation of that within
_bt_compare(). It's important.

The !minusinfkey optimization is more than just a "nice to have".
Suffix truncation makes pivot tuples less restrictive about what can
go on each page, but that might actually hurt performance if we're not
also careful to descend directly to the leaf page where matches will
first appear (rather than descending to a page to its left). If we
needlessly descend to a page that's to the left of the leaf page we
really ought to go straight to, then there are cases that are
regressed rather than helped -- especially cases where splits use the
"many duplicates" strategy. You continue to get correct answers when
the !minusinfkey optimization is ripped out, but it seems almost
essential that we include it. While it's true that we've always had to
move right like this, it's also true that suffix truncation will make
it happen much more frequently. It would happen (without the
!minusinfkey optimization) most often where suffix truncation makes
pivot tuples smallest.

Once you grok the minusinfkey stuff, then we'll be in a better
position to work through the feedback about _bt_binsrch_insert() and
so on, I think. You may lack all of the context of how the second
patch goes on to use the new insertion scan key struct, so it will
probably make life easier if we're both on the same page. (Pun very
much intended.)

Thanks again!
--
Peter Geoghegan

Attachments:

v15-0001-Refactor-nbtree-insertion-scankeys.patch (application/octet-stream)
From 5cced8f6fc89c2883103de4a29aee47635b8700a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 29 Dec 2018 15:34:48 -0800
Subject: [PATCH v15 1/7] Refactor nbtree insertion scankeys.

Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.

Use the new struct to store mutable state about an in-progress binary
search, rather than having _bt_check_unique() callers cache
_bt_binsrch() effort in an ad-hoc manner.  This makes it easy to add a
new optimization: _bt_check_unique() now falls out of its loop
immediately in the common case where it's already clear that there
couldn't possibly be a duplicate.  More importantly, the new
_bt_check_unique() scheme makes it a lot easier to manage cached binary
search effort afterwards, from within _bt_findinsertloc().  This is
needed for the upcoming patch to make nbtree tuples unique by treating
heap TID as a final tie-breaker column.

Based on a suggestion by Andrey Lepikhov.
---
 contrib/amcheck/verify_nbtree.c       |  52 ++--
 src/backend/access/nbtree/README      |  29 ++-
 src/backend/access/nbtree/nbtinsert.c | 354 ++++++++++++++++----------
 src/backend/access/nbtree/nbtpage.c   |  12 +-
 src/backend/access/nbtree/nbtsearch.c | 167 +++++++-----
 src/backend/access/nbtree/nbtsort.c   |   8 +-
 src/backend/access/nbtree/nbtutils.c  |  98 +++----
 src/backend/utils/sort/tuplesort.c    |  16 +-
 src/include/access/nbtree.h           |  61 ++++-
 9 files changed, 458 insertions(+), 339 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 964200a767..053ac9d192 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -126,9 +126,9 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
-static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+static BTScanInsert bt_right_page_check_scankey(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
@@ -138,14 +138,14 @@ static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber upperbound);
 static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber lowerbound);
 static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
+							   BTScanInsert key,
+							   Page nontarget,
 							   OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
 
@@ -837,8 +837,8 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		BTScanInsert skey;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -1029,7 +1029,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
-			ScanKey		rightkey;
+			BTScanInsert	rightkey;
 
 			/* Get item in next/right page */
 			rightkey = bt_right_page_check_scankey(state);
@@ -1081,7 +1081,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, skey, childblock);
 		}
 	}
 
@@ -1110,11 +1110,12 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
+static BTScanInsert
 bt_right_page_check_scankey(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
+	IndexTuple	firstitup;
 	BlockNumber targetnext;
 	Page		rightpage;
 	OffsetNumber nline;
@@ -1302,8 +1303,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * Return first real item scankey.  Note that this relies on right page
 	 * memory remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
+	return _bt_mkscankey(state->rel, firstitup);
 }
 
 /*
@@ -1316,8 +1317,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  * verification this way around is much more practical.
  */
 static void
-bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1422,8 +1423,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1863,13 +1863,12 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
+invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
@@ -1882,13 +1881,12 @@ invariant_leq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
+invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
 	return cmp >= 0;
 }
@@ -1904,14 +1902,12 @@ invariant_geq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							   Page nontarget, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
 	return cmp <= 0;
 }
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index b0b4ab8b76..a295a7a286 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -598,19 +598,22 @@ scankey point to comparison functions that return boolean, such as int4lt.
 There might be more than one scankey entry for a given index column, or
 none at all.  (We require the keys to appear in index column order, but
 the order of multiple keys for a given column is unspecified.)  An
-insertion scankey uses the same array-of-ScanKey data structure, but the
-sk_func pointers point to btree comparison support functions (ie, 3-way
-comparators that return int4 values interpreted as <0, =0, >0).  In an
-insertion scankey there is exactly one entry per index column.  Insertion
-scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
-used to locate the starting point of a scan, as well as for locating the
-place to insert a new index tuple.  (Note: in the case of an insertion
-scankey built from a search scankey, there might be fewer keys than
-index columns, indicating that we have no constraints for the remaining
-index columns.)  After we have located the starting point of a scan, the
-original search scankey is consulted as each index entry is sequentially
-scanned to decide whether to return the entry and whether the scan can
-stop (see _bt_checkkeys()).
+insertion scankey ("BTScanInsert" data structure) uses a similar
+array-of-ScanKey data structure, but the sk_func pointers point to btree
+comparison support functions (ie, 3-way comparators that return int4 values
+interpreted as <0, =0, >0).  In an insertion scankey there is at most one
+entry per index column.  There is also other data about the rules used to
+locate where to begin the scan, such as whether or not the scan is a
+"nextkey" scan.  Insertion scankeys are built within the btree code (eg, by
+_bt_mkscankey()) and are used to locate the starting point of a scan, as
+well as for locating the place to insert a new index tuple.  (Note: in the
+case of an insertion scankey built from a search scankey or built from a
+truncated pivot tuple, there might be fewer keys than index columns,
+indicating that we have no constraints for the remaining index columns.)
+After we have located the starting point of a scan, the original search
+scankey is consulted as each index entry is sequentially scanned to decide
+whether to return the entry and whether the scan can stop (see
+_bt_checkkeys()).
 
 We use term "pivot" index tuples to distinguish tuples which don't point
 to heap tuples, but rather used for tree navigation.  Pivot tuples includes
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 2b18028823..ccd4549806 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -51,19 +51,19 @@ typedef struct
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
-static TransactionId _bt_check_unique(Relation rel, IndexTuple itup,
-				 Relation heapRel, Buffer buf, OffsetNumber offset,
-				 ScanKey itup_scankey,
+static TransactionId _bt_check_unique(Relation rel, BTScanInsert itup_key,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken);
-static void _bt_findinsertloc(Relation rel,
+static OffsetNumber _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_key,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool checkingunique,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel);
+static bool _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -83,8 +83,8 @@ static void _bt_checksplitloc(FindSplitData *state,
 				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey);
+static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key,
+			Page page, OffsetNumber offnum);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
@@ -110,18 +110,14 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel)
 {
 	bool		is_unique = false;
-	int			indnkeyatts;
-	ScanKey		itup_scankey;
+	BTScanInsert itup_key;
 	BTStack		stack = NULL;
 	Buffer		buf;
-	OffsetNumber offset;
 	bool		fastpath;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	Assert(indnkeyatts != 0);
+	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);
 
 	/* we need an insertion scan key to do our search, so build one */
-	itup_scankey = _bt_mkscankey(rel, itup);
+	itup_key = _bt_mkscankey(rel, itup);
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -144,7 +140,6 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	 */
 top:
 	fastpath = false;
-	offset = InvalidOffsetNumber;
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
 		Size		itemsz;
@@ -179,8 +174,7 @@ top:
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, itup_key, page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -219,8 +213,7 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, itup_key, &buf, BT_WRITE, NULL);
 	}
 
 	/*
@@ -244,13 +237,12 @@ top:
 	 * let the tuple in and return false for possibly non-unique, or true for
 	 * definitely unique.
 	 */
-	if (checkUnique != UNIQUE_CHECK_NO)
+	if (checkingunique)
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
-		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
+		xwait = _bt_check_unique(rel, itup_key, itup, heapRel, buf,
 								 checkUnique, &is_unique, &speculativeToken);
 
 		if (TransactionIdIsValid(xwait))
@@ -277,6 +269,8 @@ top:
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		OffsetNumber newitemoff;
+
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -287,10 +281,17 @@ top:
 		 * attributes are not considered part of the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
-		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+
+		/*
+		 * Do the insertion.  Note that itup_key contains state filled in by
+		 * _bt_check_unique to help _bt_findinsertloc avoid repeating its
+		 * binary search.  !checkingunique case must start its own binary
+		 * search.
+		 */
+		newitemoff = _bt_findinsertloc(rel, itup_key, &buf, checkingunique,
+									   itup, stack, heapRel);
+		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, newitemoff,
+					   false);
 	}
 	else
 	{
@@ -301,7 +302,7 @@ top:
 	/* be tidy */
 	if (stack)
 		_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
+	pfree(itup_key);
 
 	return is_unique;
 }
@@ -309,10 +310,6 @@ top:
 /*
  *	_bt_check_unique() -- Check for violation of unique index constraint
  *
- * offset points to the first possible item that could conflict. It can
- * also point to end-of-page, which means that the first tuple to check
- * is the first tuple on the next page.
- *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
  * conflict is detected, no return --- just ereport().  If an xact ID is
@@ -324,16 +321,20 @@ top:
  * InvalidTransactionId because we don't want to wait.  In this case we
  * set *is_unique to false if there is a potential conflict, and the
  * core code must redo the uniqueness check later.
+ *
+ * As a side-effect, sets state in itup_key that can later be used by
+ * _bt_findinsertloc() to reuse most of the binary search work we do
+ * here.
  */
 static TransactionId
-_bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
-				 Buffer buf, OffsetNumber offset, ScanKey itup_scankey,
+_bt_check_unique(Relation rel, BTScanInsert itup_key,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	SnapshotData SnapshotDirty;
+	OffsetNumber offset;
 	OffsetNumber maxoff;
 	Page		page;
 	BTPageOpaque opaque;
@@ -349,9 +350,17 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
+	/*
+	 * Save binary search bounds.  We use them in the fastpath below, but
+	 * also in the _bt_findinsertloc() call later.
+	 */
+	itup_key->savebinsrch = true;
+	offset = _bt_binsrch(rel, itup_key, buf);
+
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
+	Assert(itup_key->low == offset);
 	for (;;)
 	{
 		ItemId		curitemid;
@@ -364,21 +373,39 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 		 */
 		if (offset <= maxoff)
 		{
+			/*
+			 * Fastpath: In most cases, we can use _bt_binsrch search bounds
+			 * to limit our consideration to items that are definitely
+			 * duplicates.  This fastpath doesn't apply, when the original
+			 * page is empty, or when initial offset is past the end of the
+			 * original page, which may indicate that we need to examine a
+			 * second or subsequent page.
+			 *
+			 * Note that this optimization avoids calling _bt_isequal()
+			 * entirely when there are no duplicates, as long as the location
+			 * where the key would belong to is not at the end of the page.
+			 */
+			if (nbuf == InvalidBuffer && offset == itup_key->stricthigh)
+			{
+				Assert(itup_key->low >= P_FIRSTDATAKEY(opaque));
+				Assert(itup_key->low <= itup_key->stricthigh);
+				Assert(!_bt_isequal(itupdesc, itup_key, page, offset));
+				break;
+			}
+
 			curitemid = PageGetItemId(page, offset);
 
 			/*
 			 * We can skip items that are marked killed.
 			 *
-			 * Formerly, we applied _bt_isequal() before checking the kill
-			 * flag, so as to fall out of the item loop as soon as possible.
-			 * However, in the presence of heavy update activity an index may
-			 * contain many killed items with the same key; running
-			 * _bt_isequal() on each killed item gets expensive. Furthermore
-			 * it is likely that the non-killed version of each key appears
-			 * first, so that we didn't actually get to exit any sooner
-			 * anyway. So now we just advance over killed items as quickly as
-			 * we can. We only apply _bt_isequal() when we get to a non-killed
-			 * item or the end of the page.
+			 * In the presence of heavy update activity an index may contain
+			 * many killed items with the same key; running _bt_isequal() on
+			 * each killed item gets expensive.  Just advance over killed
+			 * items as quickly as we can.  We only apply _bt_isequal() when
+			 * we get to a non-killed item.  Even those comparisons could be
+			 * avoided (in the common case where there is only one page to
+			 * visit) by reusing bounds, but just skipping dead items is
+			 * sufficiently effective.
 			 */
 			if (!ItemIdIsDead(curitemid))
 			{
@@ -391,7 +418,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 				 * in real comparison, but only for ordering/finding items on
 				 * pages. - vadim 03/24/97
 				 */
-				if (!_bt_isequal(itupdesc, page, offset, indnkeyatts, itup_scankey))
+				if (!_bt_isequal(itupdesc, itup_key, page, offset))
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
@@ -552,11 +579,14 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
+			int			highkeycmp;
+
 			/* If scankey == hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+			Assert(highkeycmp <= 0);
+			if (highkeycmp != 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -611,39 +641,40 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
  *		Once we have chosen the page to put the key on, we'll insert it before
  *		any existing equal keys because of the way _bt_binsrch() works.
  *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
+ *		_bt_check_unique() saves the progress of the binary search it
+ *		performs, in the insertion scan key.  In the common case that there
+ *		were no duplicates, we don't need to do any additional binary search
+ *		comparisons here.  Though occasionally, we may still not be able to
+ *		reuse the saved state for our own reasons.
  *
- *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
+ *		On entry, *bufptr points to the first legal page where the new tuple
+ *		could be inserted.  The caller must hold an exclusive lock on *bufptr.
+ *
+ *		On exit, *bufptr points to the chosen insertion page, and the offset
+ *		within that page is returned.  If _bt_findinsertloc decides to move
+ *		right, the lock and pin on the original page is released, and the new
  *		page returned to the caller is exclusively locked instead.
  *
- *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.  It is convenient to make it happen here, since microvacuuming
+ *		may invalidate a _bt_check_unique() caller's cached binary search
+ *		work.
  */
-static void
+static OffsetNumber
 _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_key,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool checkingunique,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel)
 {
 	Buffer		buf = *bufptr;
 	Page		page = BufferGetPage(buf);
+	bool		restorebinsrch = checkingunique;
 	Size		itemsz;
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
 	OffsetNumber newitemoff;
-	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -672,55 +703,40 @@ _bt_findinsertloc(Relation rel,
 				 errtableconstraint(heapRel,
 									RelationGetRelationName(rel))));
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
-	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	for (;;)
 	{
+		int			cmpval;
 		Buffer		rbuf;
 		BlockNumber rblkno;
 
 		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
+		 * An earlier _bt_check_unique() call may well have established bounds
+		 * that we can use to skip the high key check for checkingunique
+		 * callers.  This fastpath cannot be used when there are no items on
+		 * the existing page (other than high key), or when it looks like the
+		 * new item belongs last on the page, but it might go on a later page
+		 * instead.
 		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
-		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
-
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
-		}
+		if (restorebinsrch && itup_key->low <= itup_key->stricthigh &&
+			itup_key->stricthigh <= PageGetMaxOffsetNumber(page))
+			break;
 
 		/*
-		 * nope, so check conditions (b) and (c) enumerated above
+		 * If this is the last page that the tuple can legally go on, stop
+		 * here
 		 */
-		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+		if (P_RIGHTMOST(lpageop))
+			break;
+		cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
+
+		/*
+		 * May have to handle case where there is a choice of which page to
+		 * place new tuple on, and we must balance space utilization as best
+		 * we can.
+		 */
+		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
+												&restorebinsrch, itemsz))
 			break;
 
 		/*
@@ -763,27 +779,104 @@ _bt_findinsertloc(Relation rel,
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		restorebinsrch = false;
+	}
+
+	/* Loop should not break until correct page located */
+	Assert(P_RIGHTMOST(lpageop) ||
+		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+
+	/*
+	 * If the page we're about to insert to doesn't have enough room for the
+	 * new tuple, we will have to split it.  If it looks like the page has
+	 * LP_DEAD items, try to remove them, in hope of making room for the new
+	 * item and avoiding the split.
+	 */
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < itemsz)
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		restorebinsrch = false;
 	}
 
 	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
+	 * Reuse binary search bounds established within _bt_check_unique if
+	 * caller is checkingunique caller, and first leaf page locked is still
+	 * locked because that's the page caller will insert on to
 	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
-	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	itup_key->restorebinsrch = restorebinsrch;
+	newitemoff = _bt_binsrch(rel, itup_key, buf);
+	Assert(!itup_key->restorebinsrch);
+	Assert(!restorebinsrch || newitemoff == _bt_binsrch(rel, itup_key, buf));
 
 	*bufptr = buf;
-	*offsetptr = newitemoff;
+	return newitemoff;
+}
+
+/*
+ *	_bt_useduplicatepage() -- Settle for this page of duplicates?
+ *
+ *		This function handles the question of whether or not an insertion of
+ *		a duplicate into an index should insert on the page contained in buf
+ *		when a choice must be made.
+ *
+ *		If the current page doesn't have enough free space for the new tuple
+ *		we "microvacuum" the page, removing LP_DEAD items, in the hope that it
+ *		will make enough room.
+ *
+ *		Returns true if caller should proceed with insert on buf's page.
+ *		Otherwise, caller should move on to the page to the right.
+ */
+static bool
+_bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz)
+{
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque lpageop;
+
+	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	Assert(P_ISLEAF(lpageop) && !P_RIGHTMOST(lpageop));
+
+	/* Easy case -- there is space free on this page already */
+	if (PageGetFreeSpace(page) >= itemsz)
+		return true;
+
+	/*
+	 * Before considering moving right, see if we can obtain enough space
+	 * by erasing LP_DEAD items
+	 */
+	if (P_HAS_GARBAGE(lpageop))
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		*restorebinsrch = false;
+		if (PageGetFreeSpace(page) >= itemsz)
+			return true;		/* OK, now we have enough space */
+	}
+
+	/*----------
+	 * It's now clear that _bt_findinsertloc() caller will need to split
+	 * the page if it is to insert new item on to it.  The choice to move
+	 * right to the next page remains open to it, but we should not search
+	 * for free space exhaustively when there are many pages to look through.
+	 *
+	 *	_bt_findinsertloc() keeps scanning right until it:
+	 *		(a) reaches the last page where the tuple can legally go
+	 *	Or until we:
+	 *		(b) find a page with enough free space, or
+	 *		(c) get tired of searching.
+	 * (c) is not flippant; it is important because if there are many
+	 * pages' worth of equal keys, it's better to split one of the early
+	 * pages than to scan all the way to the end of the run of equal keys
+	 * on every insert.  We implement "get tired" as a random choice,
+	 * since stopping after scanning a fixed number of pages wouldn't work
+	 * well (we'd never reach the right-hand side of previously split
+	 * pages).  The probability of moving right is set at 0.99, which may
+	 * seem too high to change the behavior much, but it does an excellent
+	 * job of preventing O(N^2) behavior with many equal keys.
+	 *----------
+	 */
+	return random() <= (MAX_RANDOM_VALUE / 100);
 }
 
 /*----------
@@ -2310,24 +2403,21 @@ _bt_pgaddtup(Page page,
  * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
-_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey)
+_bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
+			OffsetNumber offnum)
 {
 	IndexTuple	itup;
+	ScanKey		scankey;
 	int			i;
 
-	/* Better be comparing to a leaf item */
+	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
 
+	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
-	for (i = 1; i <= keysz; i++)
+	for (i = 1; i <= itup_key->keysz; i++)
 	{
 		AttrNumber	attno;
 		Datum		datum;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c785bca95..56041c3d38 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1371,7 +1371,7 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 */
 			if (!stack)
 			{
-				ScanKey		itup_scankey;
+				BTScanInsert itup_key;
 				ItemId		itemid;
 				IndexTuple	targetkey;
 				Buffer		lbuf;
@@ -1421,12 +1421,10 @@ _bt_pagedel(Relation rel, Buffer buf)
 				}
 
 				/* we need an insertion scan key for the search, so build one */
-				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
-				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
-				/* don't need a pin on the page */
+				itup_key = _bt_mkscankey(rel, targetkey);
+				/* get stack to leaf page by searching index */
+				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
+				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
 				/*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 92832237a8..30be88eb82 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -71,13 +71,9 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
  * but it can omit the rightmost column(s) of the index.
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * Return value is a stack of parent-page pointers.  *bufP is set to the
  * address of the leaf-page buffer, which is read-locked and pinned.
  * No locks are held on the parent pages, however!
@@ -93,8 +89,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+		   Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -130,8 +126,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  (access == BT_WRITE), stack_in,
+		*bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
 		/* if this is a leaf page, we're done */
@@ -144,7 +139,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -198,8 +193,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  true, stack_in, BT_WRITE, snapshot);
+		*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+							  snapshot);
 	}
 
 	return stack_in;
@@ -215,16 +210,17 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * or strictly to the right of it.
  *
  * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page.  If that entry
- * is strictly less than the scankey, or <= the scankey in the nextkey=true
- * case, then we followed the wrong link and we need to move right.
+ * tree by examining the high key entry on the page.  If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index (see nbtree/README).
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key.  When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
  *
  * If forupdate is true, we will attempt to finish any incomplete splits
  * that we encounter.  This is required when locking a target page for an
@@ -241,10 +237,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  */
 Buffer
 _bt_moveright(Relation rel,
+			  BTScanInsert key,
 			  Buffer buf,
-			  int keysz,
-			  ScanKey scankey,
-			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
 			  int access,
@@ -269,7 +263,7 @@ _bt_moveright(Relation rel,
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
 	 */
-	cmpval = nextkey ? 0 : 1;
+	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
@@ -304,7 +298,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -324,13 +318,6 @@ _bt_moveright(Relation rel,
 /*
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
  * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
  * particular, this means it is possible to return a value 1 greater than the
@@ -347,26 +334,45 @@ _bt_moveright(Relation rel,
  * This procedure is not responsible for walking right, it just examines
  * the given page.  _bt_binsrch() has no lock or refcount side effects
  * on the buffer.
+ *
+ * When key.savebinsrch is set, the mutable fields of the insertion scan
+ * key are updated so that a subsequent call with key.restorebinsrch set
+ * can reuse the low and strict high bounds of the original binary search.
+ * Callers that use these fields directly must be prepared for the case where
+ * stricthigh isn't on the same page (it exceeds maxoff for the page), and
+ * the case where there are no items on the page (high < low).
  */
 OffsetNumber
 _bt_binsrch(Relation rel,
-			Buffer buf,
-			int keysz,
-			ScanKey scankey,
-			bool nextkey)
+			BTScanInsert key,
+			Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber low,
-				high;
+				high,
+				stricthigh;
 	int32		result,
 				cmpval;
+	bool		isleaf;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	low = P_FIRSTDATAKEY(opaque);
-	high = PageGetMaxOffsetNumber(page);
+	if (!key->restorebinsrch)
+	{
+		low = P_FIRSTDATAKEY(opaque);
+		high = PageGetMaxOffsetNumber(page);
+		isleaf = P_ISLEAF(opaque);
+	}
+	else
+	{
+		/* Restore result of previous binary search against same page */
+		Assert(P_ISLEAF(opaque));
+		low = key->low;
+		high = key->stricthigh;
+		isleaf = true;
+	}
 
 	/*
 	 * If there are no keys on the page, return the first available slot. Note
@@ -375,8 +381,19 @@ _bt_binsrch(Relation rel,
 	 * This can never happen on an internal page, however, since they are
 	 * never empty (an internal page must have children).
 	 */
-	if (high < low)
+	if (unlikely(high < low))
+	{
+		if (key->savebinsrch)
+		{
+			Assert(isleaf);
+			/* Caller can't use stricthigh */
+			key->low = low;
+			key->stricthigh = high;
+		}
+		key->savebinsrch = false;
+		key->restorebinsrch = false;
 		return low;
+	}
 
 	/*
 	 * Binary search to find the first key on the page >= scan key, or first
@@ -390,9 +407,12 @@ _bt_binsrch(Relation rel,
 	 *
 	 * We can fall out when high == low.
 	 */
-	high++;						/* establish the loop invariant for high */
+	if (!key->restorebinsrch)
+		high++;					/* establish the loop invariant for high */
+	key->restorebinsrch = false;
+	stricthigh = high;			/* high initially strictly higher */
 
-	cmpval = nextkey ? 0 : 1;	/* select comparison value */
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
 
 	while (high > low)
 	{
@@ -400,12 +420,21 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
 		else
+		{
 			high = mid;
+
+			/*
+			 * high can only be reused by more restrictive binary search when
+			 * it's known to be strictly greater than the original scankey
+			 */
+			if (result != 0)
+				stricthigh = high;
+		}
 	}
 
 	/*
@@ -415,7 +444,14 @@ _bt_binsrch(Relation rel,
 	 * On a leaf page, we always return the first key >= scan key (resp. >
 	 * scan key), which could be the last slot + 1.
 	 */
-	if (P_ISLEAF(opaque))
+	if (key->savebinsrch)
+	{
+		Assert(isleaf);
+		key->low = low;
+		key->stricthigh = stricthigh;
+		key->savebinsrch = false;
+	}
+	if (isleaf)
 		return low;
 
 	/*
@@ -428,13 +464,8 @@ _bt_binsrch(Relation rel,
 }
 
 /*----------
- *	_bt_compare() -- Compare scankey to a particular tuple on the page.
+ *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- *	keysz: number of key conditions to be checked (might be less than the
- *		number of index columns!)
  *	page/offnum: location of btree item to be compared to.
  *
  *		This routine returns:
@@ -455,17 +486,17 @@ _bt_binsrch(Relation rel,
  */
 int32
 _bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
+			BTScanInsert key,
 			Page page,
 			OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
-	int			i;
+	ScanKey		scankey;
 
 	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
@@ -488,7 +519,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	scankey = key->scankeys;
+	for (int i = 1; i <= key->keysz; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -574,8 +606,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat;
 	bool		nextkey;
 	bool		goback;
+	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
-	ScanKeyData scankeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -821,8 +853,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
 	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built in the local
-	 * scankeys[] array, using the keys identified by startKeys[].
+	 * identified above.  The insertion scankey is built using the keys
+	 * identified by startKeys[].  (Remaining insertion scankey fields are
+	 * initialized after initial-positioning strategy is finalized.)
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
 	for (i = 0; i < keysCount; i++)
@@ -850,7 +883,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				_bt_parallel_done(scan);
 				return false;
 			}
-			memcpy(scankeys + i, subkey, sizeof(ScanKeyData));
+			memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
 
 			/*
 			 * If the row comparison is the last positioning key we accepted,
@@ -882,7 +915,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					if (subkey->sk_flags & SK_ISNULL)
 						break;	/* can't use null keys */
 					Assert(keysCount < INDEX_MAX_KEYS);
-					memcpy(scankeys + keysCount, subkey, sizeof(ScanKeyData));
+					memcpy(inskey.scankeys + keysCount, subkey,
+						   sizeof(ScanKeyData));
 					keysCount++;
 					if (subkey->sk_flags & SK_ROW_END)
 					{
@@ -928,7 +962,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				FmgrInfo   *procinfo;
 
 				procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
-				ScanKeyEntryInitializeWithInfo(scankeys + i,
+				ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
 											   cur->sk_flags,
 											   cur->sk_attno,
 											   InvalidStrategy,
@@ -949,7 +983,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
 						 BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
 						 cur->sk_attno, RelationGetRelationName(rel));
-				ScanKeyEntryInitialize(scankeys + i,
+				ScanKeyEntryInitialize(inskey.scankeys + i,
 									   cur->sk_flags,
 									   cur->sk_attno,
 									   InvalidStrategy,
@@ -1052,12 +1086,17 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
+	/* Initialize remaining insertion scan key fields */
+	inskey.savebinsrch = inskey.restorebinsrch = false;
+	inskey.low = inskey.stricthigh = InvalidOffsetNumber;
+	inskey.nextkey = nextkey;
+	inskey.keysz = keysCount;
+
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
@@ -1086,7 +1125,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, &inskey, buf);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e1186..759859c302 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -254,6 +254,7 @@ typedef struct BTWriteState
 {
 	Relation	heap;
 	Relation	index;
+	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
@@ -531,6 +532,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
+	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 
 	/*
 	 * We need to log index creation in WAL iff WAL archiving/streaming is
@@ -1076,7 +1078,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
-	ScanKey		indexScanKey = NULL;
 	SortSupport sortKeys;
 
 	if (merge)
@@ -1089,7 +1090,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		/* the preparation of merge */
 		itup = tuplesort_getindextuple(btspool->sortstate, true);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-		indexScanKey = _bt_mkscankey_nodata(wstate->index);
 
 		/* Prepare SortSupport data for each column */
 		sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
@@ -1097,7 +1097,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		for (i = 0; i < keysz; i++)
 		{
 			SortSupport sortKey = sortKeys + i;
-			ScanKey		scanKey = indexScanKey + i;
+			ScanKey		scanKey = wstate->inskey->scankeys + i;
 			int16		strategy;
 
 			sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1116,8 +1116,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
 		}
 
-		_bt_freeskey(indexScanKey);
-
 		for (;;)
 		{
 			load1 = true;		/* load BTSpool next ? */
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 2c05fb5e45..acd122aa53 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -56,34 +56,39 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().
+ *		Result is intended for use with _bt_compare().  Callers that don't
+ *		need to fill out the insertion scankey arguments (e.g. because they
+ *		use their own ad-hoc comparison routine) can pass a NULL index tuple.
  */
-ScanKey
+BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
+	BTScanInsert key;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
-	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
+	int			tupnatts;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
-	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
+	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
 	 * (INCLUDE index) columns are always omitted from scan keys.
 	 */
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
+	key = palloc(offsetof(BTScanInsertData, scankeys) +
+				 sizeof(ScanKeyData) * indnkeyatts);
+	key->savebinsrch = key->restorebinsrch = false;
+	key->low = key->stricthigh = InvalidOffsetNumber;
+	key->nextkey = false;
+	key->keysz = Min(indnkeyatts, tupnatts);
+	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
 		FmgrInfo   *procinfo;
@@ -96,7 +101,19 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Key arguments built when caller provides no tuple are
+		 * defensively represented as NULL values.  They should never be
+		 * used.
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -108,64 +125,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 									   arg);
 	}
 
-	return skey;
-}
-
-/*
- * _bt_mkscankey_nodata
- *		Build an insertion scan key that contains 3-way comparator routines
- *		appropriate to the key datatypes, but no comparison data.  The
- *		comparison data ultimately used must match the key datatypes.
- *
- *		The result cannot be used with _bt_compare(), unless comparison
- *		data is first stored into the key entries.  Currently this
- *		routine is only called by nbtsort.c and tuplesort.c, which have
- *		their own comparison routines.
- */
-ScanKey
-_bt_mkscankey_nodata(Relation rel)
-{
-	ScanKey		skey;
-	int			indnkeyatts;
-	int16	   *indoption;
-	int			i;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	indoption = rel->rd_indoption;
-
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
-	for (i = 0; i < indnkeyatts; i++)
-	{
-		FmgrInfo   *procinfo;
-		int			flags;
-
-		/*
-		 * We can use the cached (default) support procs since no cross-type
-		 * comparison can be needed.
-		 */
-		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		flags = SK_ISNULL | (indoption[i] << SK_BT_INDOPTION_SHIFT);
-		ScanKeyEntryInitializeWithInfo(&skey[i],
-									   flags,
-									   (AttrNumber) (i + 1),
-									   InvalidStrategy,
-									   InvalidOid,
-									   rel->rd_indcollation[i],
-									   procinfo,
-									   (Datum) 0);
-	}
-
-	return skey;
-}
-
-/*
- * free a scan key made by either _bt_mkscankey or _bt_mkscankey_nodata.
- */
-void
-_bt_freeskey(ScanKey skey)
-{
-	pfree(skey);
+	return key;
 }
 
 /*
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7b10fd2974..f97a82ae7b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -884,7 +884,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -919,7 +919,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	if (state->indexInfo->ii_Expressions != NULL)
 	{
@@ -945,7 +945,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -964,7 +964,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -981,7 +981,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -1014,7 +1014,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->indexRel = indexRel;
 	state->enforceUnique = enforceUnique;
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	/* Prepare SortSupport data for each column */
 	state->sortKeys = (SortSupport) palloc0(state->nKeys *
@@ -1023,7 +1023,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1042,7 +1042,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 60622ea790..7870f282d2 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -319,6 +319,47 @@ typedef struct BTStackData
 
 typedef BTStackData *BTStack;
 
+/*
+ * BTScanInsert is the btree-private state needed to find an initial position
+ * for an indexscan, or to insert new tuples -- an "insertion scankey" (not to
+ * be confused with a search scankey).  It's used to descend a B-Tree using
+ * _bt_search.
+ *
+ * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
+ * locate the first item >= scankey.  When nextkey is true, they will locate
+ * the first item > scan key.
+ *
+ * keysz is the number of insertion scankeys present.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared.
+ * During insertion, there must be a scan key for every attribute, but when
+ * starting a regular index scan some can be omitted.  The array is used as a
+ * flexible array member, though it's sized in a way that makes it possible to
+ * use stack allocations.  See nbtree/README for full details.
+ */
+
+typedef struct BTScanInsertData
+{
+	/*
+	 * Mutable state used by _bt_binsrch to inexpensively repeat a binary
+	 * search on the leaf level.  Only used for insertions where
+	 * _bt_check_unique is called.  See _bt_binsrch and _bt_findinsertloc for
+	 * details.
+	 */
+	bool		savebinsrch;
+	bool		restorebinsrch;
+	OffsetNumber low;
+	OffsetNumber stricthigh;
+
+	/* State used to locate a position at the leaf level */
+	bool		nextkey;
+	int			keysz;			/* Size of scankeys */
+	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
+} BTScanInsertData;
+
+typedef BTScanInsertData *BTScanInsert;
+
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -558,16 +599,12 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
 /*
  * prototypes for functions in nbtsearch.c
  */
-extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
-extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
+		   int access, Snapshot snapshot);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -576,9 +613,7 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 /*
  * prototypes for functions in nbtutils.c
  */
-extern ScanKey _bt_mkscankey(Relation rel, IndexTuple itup);
-extern ScanKey _bt_mkscankey_nodata(Relation rel);
-extern void _bt_freeskey(ScanKey skey);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-- 
2.17.1

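To make the consolidated insertion scankey API concrete, here is a rough
sketch of the shape a caller ends up with (this isn't taken from the patch;
the variable names are made up, and locking, pinning, and error handling are
omitted) -- essentially what the _bt_pagedel() hunk above does:

/* Sketch only: build a BTScanInsert from an index tuple, then descend */
BTScanInsert itup_key;
BTStack		stack;
Buffer		buf;
OffsetNumber offnum;

itup_key = _bt_mkscankey(rel, itup);		/* fills scankeys[] from itup */
stack = _bt_search(rel, itup_key, &buf, BT_READ, NULL);
offnum = _bt_binsrch(rel, itup_key, buf);	/* first item >= itup_key */

/* ... examine buf/offnum here ... */

_bt_relbuf(rel, buf);
_bt_freestack(stack);
pfree(itup_key);			/* plain pfree; _bt_freeskey() is gone */
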
Attachment: v15-0003-Consider-secondary-factors-during-nbtree-splits.patch (application/octet-stream)
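The 0003 patch that follows reworks how a leaf split point is chosen.  To
illustrate the idea only -- this is a simplified sketch and not the patch's
actual _bt_split_penalty() code; the helper name, signature, and return
convention are made up -- the "penalty" of a candidate split point can be
thought of as the number of leading key attributes that a new high key would
need in order to distinguish the last tuple on the left half from the first
tuple on the right half, found with cheap binary comparisons rather than full
scankey comparisons:

/*
 * Simplified sketch of the split penalty idea: lower is better, since a
 * lower penalty means more suffix attributes can be truncated away.
 */
static int
split_penalty_sketch(TupleDesc itupdesc, int nkeyatts,
					 IndexTuple lastleft, IndexTuple firstright)
{
	for (int attnum = 1; attnum <= nkeyatts; attnum++)
	{
		Form_pg_attribute att = TupleDescAttr(itupdesc, attnum - 1);
		Datum		datum1,
					datum2;
		bool		isnull1,
					isnull2;

		datum1 = index_getattr(lastleft, attnum, itupdesc, &isnull1);
		datum2 = index_getattr(firstright, attnum, itupdesc, &isnull2);

		/* the first attribute where the two tuples differ sets the penalty */
		if (isnull1 != isnull2)
			return attnum;
		if (!isnull1 &&
			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
			return attnum;
	}

	/* all key attributes equal -- the "many duplicates" case */
	return nkeyatts + 1;
}

Note that the sketch glosses over the fact that the patch records every
candidate split point up front, sorts candidates by free-space delta, and
only applies the penalty within an acceptable interval, with separate
fallback strategies for the many-duplicates and single-value cases.
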
From 203f4323a09b73773e91e762b75c8577d8f6febd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 15:51:53 -0700
Subject: [PATCH v15 3/7] Consider secondary factors during nbtree splits.

Teach nbtree to give some consideration to how "distinguishing"
candidate leaf page split points are.  This should not noticeably affect
the balance of free space within each half of the split, while still
making suffix truncation truncate away significantly more attributes on
average.

The logic for choosing a leaf split point now uses a fallback mode in
the case where the page is full of duplicates and it isn't possible to
find even a minimally distinguishing split point.  When the page is full
of duplicates, the split should pack the left half very tightly, while
leaving the right half mostly empty.  Our assumption is that logical
duplicates will almost always be inserted in ascending heap TID order
with v4 indexes.  This strategy leaves most of the free space on the
half of the split that will likely be where future logical duplicates of
the same value need to be placed.

The number of cycles added is not very noticeable.  This is important
because deciding on a split point takes place while at least one
exclusive buffer lock is held.  We avoid using authoritative insertion
scankey comparisons to save cycles, unlike suffix truncation proper.  We
use a faster binary comparison instead.

Note that even pre-pg_upgrade'd v3 indexes make use of these
optimizations.  Benchmarking has shown that even v3 indexes benefit,
despite the fact that suffix truncation will only truncate non-key
attributes in INCLUDE indexes.  Grouping relatively similar tuples
together is beneficial in and of itself, since it reduces the number of
leaf pages that must be accessed by subsequent index scans.
---
 src/backend/access/nbtree/Makefile      |   2 +-
 src/backend/access/nbtree/README        |  47 +-
 src/backend/access/nbtree/nbtinsert.c   | 290 +--------
 src/backend/access/nbtree/nbtsplitloc.c | 757 ++++++++++++++++++++++++
 src/backend/access/nbtree/nbtutils.c    |  49 ++
 src/include/access/nbtree.h             |  15 +-
 6 files changed, 867 insertions(+), 293 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtsplitloc.c

diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bbb21d235c..9aab9cf64a 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = nbtcompare.o nbtinsert.o nbtpage.o nbtree.o nbtsearch.o \
-       nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
+       nbtsplitloc.o nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 40ff25fe06..ca4fdf7ac4 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -155,9 +155,9 @@ Lehman and Yao assume fixed-size keys, but we must deal with
 variable-size keys.  Therefore there is not a fixed maximum number of
 keys per page; we just stuff in as many as will fit.  When we split a
 page, we try to equalize the number of bytes, not items, assigned to
-each of the resulting pages.  Note we must include the incoming item in
-this calculation, otherwise it is possible to find that the incoming
-item doesn't fit on the split page where it needs to go!
+pages (though suffix truncation is also considered).  Note we must include
+the incoming item in this calculation, otherwise it is possible to find
+that the incoming item doesn't fit on the split page where it needs to go!
 
 The Deletion Algorithm
 ----------------------
@@ -661,6 +661,47 @@ variable-length types, such as text.  An opclass support function could
 manufacture the shortest possible key value that still correctly separates
 each half of a leaf page split.
 
+There are sophisticated criteria for choosing a leaf page split point.  The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split.  The
+choice of leaf split point can be thought of as a choice among points
+*between* items on the page to be split, at least if you pretend that the
+incoming tuple was placed on the page already (you have to pretend because
+there won't actually be enough space for it on the page).  Choosing the
+split point between two index tuples where the first non-equal attribute
+appears as early as possible results in truncating away as many suffix
+attributes as possible.  Evenly balancing space between the two halves of the
+split is usually the first concern, but even small adjustments in the
+precise split point can allow truncation to be far more effective.
+
+Suffix truncation is primarily valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective.  Even truncation that doesn't make pivot tuples
+smaller due to alignment still prevents pivot tuples from being more
+restrictive than truly necessary in how they describe which values belong
+on which pages.
+
+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page.  The chosen split point is the one whose implied downlink
+(the tuple that must be inserted in the parent) is smallest, within an
+acceptable range of the fillfactor-wise optimal split point.  This idea also
+comes from the Prefix B-Tree paper.  This process has much in common with what
+happens at the leaf level to make suffix truncation effective.  The overall
+effect is that suffix truncation tends to produce smaller, more
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, smaller pivot tuples
+end up in the root page, delaying root page splits.
+
+Logical duplicates are given special consideration.  The logic for
+selecting a split point goes to great lengths to avoid having duplicates
+span more than one page, and almost always manages to pick a split point
+between two user-key-distinct tuples, accepting a completely lopsided split
+if it must.  When a page that's already full of duplicates must be split,
+the fallback strategy assumes that duplicates are mostly inserted in
+ascending heap TID order.  The page is split in a way that leaves the left
+half of the page mostly full, and the right half of the page mostly empty.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 0b01fa3d83..5f400372e7 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,26 +28,6 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
-typedef struct
-{
-	/* context data for _bt_checksplitloc */
-	Size		newitemsz;		/* size of new item to be inserted */
-	int			fillfactor;		/* needed when splitting rightmost page */
-	bool		is_leaf;		/* T if splitting a leaf page */
-	bool		is_rightmost;	/* T if splitting a rightmost page */
-	OffsetNumber newitemoff;	/* where the new item is to be inserted */
-	int			leftspace;		/* space available for items on left page */
-	int			rightspace;		/* space available for items on right page */
-	int			olddataitemstotal;	/* space taken by old items */
-
-	bool		have_split;		/* found a valid split? */
-
-	/* these fields valid only if have_split is true */
-	bool		newitemonleft;	/* new item on left or right of best split */
-	OffsetNumber firstright;	/* best split point */
-	int			best_delta;		/* best size delta so far */
-} FindSplitData;
-
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -76,13 +56,6 @@ static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
-static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft);
-static void _bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright, bool newitemonleft,
-				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key,
@@ -920,8 +893,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
  *
  *		This recursive procedure does the following things:
  *
- *			+  if necessary, splits the target page (making sure that the
- *			   split is equitable as far as post-insert free space goes).
+ *			+  if necessary, splits the target page.
  *			+  inserts the tuple.
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
@@ -1016,7 +988,7 @@ _bt_insertonpg(Relation rel,
 
 		/* Choose the split point */
 		firstright = _bt_findsplitloc(rel, page,
-									  newitemoff, itemsz,
+									  newitemoff, itemsz, itup,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
@@ -1700,264 +1672,6 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	return rbuf;
 }
 
-/*
- *	_bt_findsplitloc() -- find an appropriate place to split a page.
- *
- * The idea here is to equalize the free space that will be on each split
- * page, *after accounting for the inserted tuple*.  (If we fail to account
- * for it, we might find ourselves with too little room on the page that
- * it needs to go into!)
- *
- * If the page is the rightmost page on its level, we instead try to arrange
- * to leave the left split page fillfactor% full.  In this way, when we are
- * inserting successively increasing keys (consider sequences, timestamps,
- * etc) we will end up with a tree whose pages are about fillfactor% full,
- * instead of the 50% full result that we'd get without this special case.
- * This is the same as nbtsort.c produces for a newly-created tree.  Note
- * that leaf and nonleaf pages use different fillfactors.
- *
- * We are passed the intended insert position of the new tuple, expressed as
- * the offsetnumber of the tuple it must go in front of.  (This could be
- * maxoff+1 if the tuple is to go at the end.)
- *
- * We return the index of the first existing tuple that should go on the
- * righthand page, plus a boolean indicating whether the new tuple goes on
- * the left or right page.  The bool is necessary to disambiguate the case
- * where firstright == newitemoff.
- */
-static OffsetNumber
-_bt_findsplitloc(Relation rel,
-				 Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft)
-{
-	BTPageOpaque opaque;
-	OffsetNumber offnum;
-	OffsetNumber maxoff;
-	ItemId		itemid;
-	FindSplitData state;
-	int			leftspace,
-				rightspace,
-				goodenough,
-				olddataitemstotal,
-				olddataitemstoleft;
-	bool		goodenoughfound;
-
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
-	newitemsz += sizeof(ItemIdData);
-
-	/* Total free space available on a btree page, after fixed overhead */
-	leftspace = rightspace =
-		PageGetPageSize(page) - SizeOfPageHeaderData -
-		MAXALIGN(sizeof(BTPageOpaqueData));
-
-	/* The right page will have the same high key as the old page */
-	if (!P_RIGHTMOST(opaque))
-	{
-		itemid = PageGetItemId(page, P_HIKEY);
-		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
-							 sizeof(ItemIdData));
-	}
-
-	/* Count up total space in data items without actually scanning 'em */
-	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
-
-	state.newitemsz = newitemsz;
-	state.is_leaf = P_ISLEAF(opaque);
-	state.is_rightmost = P_RIGHTMOST(opaque);
-	state.have_split = false;
-	if (state.is_leaf)
-		state.fillfactor = RelationGetFillFactor(rel,
-												 BTREE_DEFAULT_FILLFACTOR);
-	else
-		state.fillfactor = BTREE_NONLEAF_FILLFACTOR;
-	state.newitemonleft = false;	/* these just to keep compiler quiet */
-	state.firstright = 0;
-	state.best_delta = 0;
-	state.leftspace = leftspace;
-	state.rightspace = rightspace;
-	state.olddataitemstotal = olddataitemstotal;
-	state.newitemoff = newitemoff;
-
-	/*
-	 * Finding the best possible split would require checking all the possible
-	 * split points, because of the high-key and left-key special cases.
-	 * That's probably more work than it's worth; instead, stop as soon as we
-	 * find a "good-enough" split, where good-enough is defined as an
-	 * imbalance in free space of no more than pagesize/16 (arbitrary...) This
-	 * should let us stop near the middle on most pages, instead of plowing to
-	 * the end.
-	 */
-	goodenough = leftspace / 16;
-
-	/*
-	 * Scan through the data items and calculate space usage for a split at
-	 * each possible position.
-	 */
-	olddataitemstoleft = 0;
-	goodenoughfound = false;
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	for (offnum = P_FIRSTDATAKEY(opaque);
-		 offnum <= maxoff;
-		 offnum = OffsetNumberNext(offnum))
-	{
-		Size		itemsz;
-
-		itemid = PageGetItemId(page, offnum);
-		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
-
-		/*
-		 * Will the new item go to left or right of split?
-		 */
-		if (offnum > newitemoff)
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-		else if (offnum < newitemoff)
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		else
-		{
-			/* need to try it both ways! */
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		}
-
-		/* Abort scan once we find a good-enough choice */
-		if (state.have_split && state.best_delta <= goodenough)
-		{
-			goodenoughfound = true;
-			break;
-		}
-
-		olddataitemstoleft += itemsz;
-	}
-
-	/*
-	 * If the new item goes as the last item, check for splitting so that all
-	 * the old items go to the left page and the new item goes to the right
-	 * page.
-	 */
-	if (newitemoff > maxoff && !goodenoughfound)
-		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
-
-	/*
-	 * I believe it is not possible to fail to find a feasible split, but just
-	 * in case ...
-	 */
-	if (!state.have_split)
-		elog(ERROR, "could not find a feasible split point for index \"%s\"",
-			 RelationGetRelationName(rel));
-
-	*newitemonleft = state.newitemonleft;
-	return state.firstright;
-}
-
-/*
- * Subroutine to analyze a particular possible split choice (ie, firstright
- * and newitemonleft settings), and record the best split so far in *state.
- *
- * firstoldonright is the offset of the first item on the original page
- * that goes to the right page, and firstoldonrightsz is the size of that
- * tuple. firstoldonright can be > max offset, which means that all the old
- * items go to the left page and only the new item goes to the right page.
- * In that case, firstoldonrightsz is not used.
- *
- * olddataitemstoleft is the total size of all old items to the left of
- * firstoldonright.
- */
-static void
-_bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright,
-				  bool newitemonleft,
-				  int olddataitemstoleft,
-				  Size firstoldonrightsz)
-{
-	int			leftfree,
-				rightfree;
-	Size		firstrightitemsz;
-	bool		newitemisfirstonright;
-
-	/* Is the new item going to be the first item on the right page? */
-	newitemisfirstonright = (firstoldonright == state->newitemoff
-							 && !newitemonleft);
-
-	if (newitemisfirstonright)
-		firstrightitemsz = state->newitemsz;
-	else
-		firstrightitemsz = firstoldonrightsz;
-
-	/* Account for all the old tuples */
-	leftfree = state->leftspace - olddataitemstoleft;
-	rightfree = state->rightspace -
-		(state->olddataitemstotal - olddataitemstoleft);
-
-	/*
-	 * The first item on the right page becomes the high key of the left page;
-	 * therefore it counts against left space as well as right space. When
-	 * index has included attributes, then those attributes of left page high
-	 * key will be truncated leaving that page with slightly more free space.
-	 * However, that shouldn't affect our ability to find valid split
-	 * location, because anyway split location should exists even without high
-	 * key truncation.
-	 */
-	leftfree -= firstrightitemsz;
-
-	/* account for the new item */
-	if (newitemonleft)
-		leftfree -= (int) state->newitemsz;
-	else
-		rightfree -= (int) state->newitemsz;
-
-	/*
-	 * If we are not on the leaf level, we will be able to discard the key
-	 * data from the first item that winds up on the right page.
-	 */
-	if (!state->is_leaf)
-		rightfree += (int) firstrightitemsz -
-			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
-
-	/*
-	 * If feasible split point, remember best delta.
-	 */
-	if (leftfree >= 0 && rightfree >= 0)
-	{
-		int			delta;
-
-		if (state->is_rightmost)
-		{
-			/*
-			 * If splitting a rightmost page, try to put (100-fillfactor)% of
-			 * free space on left page. See comments for _bt_findsplitloc.
-			 */
-			delta = (state->fillfactor * leftfree)
-				- ((100 - state->fillfactor) * rightfree);
-		}
-		else
-		{
-			/* Otherwise, aim for equal free space on both sides */
-			delta = leftfree - rightfree;
-		}
-
-		if (delta < 0)
-			delta = -delta;
-		if (!state->have_split || delta < state->best_delta)
-		{
-			state->have_split = true;
-			state->newitemonleft = newitemonleft;
-			state->firstright = firstoldonright;
-			state->best_delta = delta;
-		}
-	}
-}
-
 /*
  * _bt_insert_parent() -- Insert downlink into parent after a page split.
  *
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
new file mode 100644
index 0000000000..34165c52db
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -0,0 +1,757 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsplitloc.c
+ *	  Choose split point code for Postgres btree implementation.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtsplitloc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "storage/lmgr.h"
+
+/* limits on split interval (default strategy only) */
+#define MAX_LEAF_INTERVAL			9
+#define MAX_INTERNAL_INTERVAL		18
+
+typedef enum
+{
+	/* strategy for searching through materialized list of split points */
+	SPLIT_DEFAULT,				/* give some weight to truncation */
+	SPLIT_MANY_DUPLICATES,		/* find minimally distinguishing point */
+	SPLIT_SINGLE_VALUE			/* leave left page almost full */
+} FindSplitStrat;
+
+typedef struct
+{
+	/* details of free space left by split */
+	int16		curdelta;		/* current leftfree/rightfree delta */
+	int16		leftfree;		/* space left on left page post-split */
+	int16		rightfree;		/* space left on right page post-split */
+
+	/* split point identifying fields (returned by _bt_findsplitloc) */
+	OffsetNumber firstoldonright;	/* first item on new right page */
+	bool		newitemonleft;	/* new item goes on left, or right? */
+
+} SplitPoint;
+
+typedef struct
+{
+	/* context data for _bt_recsplitloc */
+	Relation	rel;			/* index relation */
+	Page		page;			/* page undergoing split */
+	IndexTuple	newitem;		/* new item (cause of page split) */
+	Size		newitemsz;		/* size of newitem (includes line pointer) */
+	bool		is_leaf;		/* T if splitting a leaf page */
+	bool		is_rightmost;	/* T if splitting rightmost page on level */
+	OffsetNumber newitemoff;	/* where the new item is to be inserted */
+	int			leftspace;		/* space available for items on left page */
+	int			rightspace;		/* space available for items on right page */
+	int			olddataitemstotal;	/* space taken by old items */
+
+	/* candidate split point data */
+	int			maxsplits;		/* maximum number of splits */
+	int			nsplits;		/* current number of splits */
+	SplitPoint *splits;			/* all candidate split points for page */
+	int			interval;		/* current range of acceptable split points */
+} FindSplitData;
+
+static void _bt_recsplitloc(FindSplitData *state,
+				OffsetNumber firstoldonright, bool newitemonleft,
+				int olddataitemstoleft, Size firstoldonrightsz);
+static void _bt_deltasortsplits(FindSplitData *state, double fillfactormult,
+					bool usemult);
+static int	_bt_splitcmp(const void *arg1, const void *arg2);
+static OffsetNumber _bt_bestsplitloc(FindSplitData *state, int perfectpenalty,
+				 bool *newitemonleft);
+static int _bt_strategy(FindSplitData *state, SplitPoint *lowpage,
+			 SplitPoint *highpage, FindSplitStrat *strategy);
+static inline int _bt_split_penalty(FindSplitData *state, SplitPoint *split);
+static inline IndexTuple _bt_split_lastleft(FindSplitData *state,
+				   SplitPoint *split);
+static inline IndexTuple _bt_split_firstright(FindSplitData *state,
+					 SplitPoint *split);
+
+
+/*
+ *	_bt_findsplitloc() -- find an appropriate place to split a page.
+ *
+ * The main goal here is to equalize the free space that will be on each
+ * split page, *after accounting for the inserted tuple*.  (If we fail to
+ * account for it, we might find ourselves with too little room on the page
+ * that it needs to go into!)
+ *
+ * If the page is the rightmost page on its level, we instead try to arrange
+ * to leave the left split page fillfactor% full.  In this way, when we are
+ * inserting successively increasing keys (consider sequences, timestamps,
+ * etc) we will end up with a tree whose pages are about fillfactor% full,
+ * instead of the 50% full result that we'd get without this special case.
+ * This is the same as nbtsort.c produces for a newly-created tree.  Note
+ * that leaf and nonleaf pages use different fillfactors.  Note also that
+ * there are a number of further special cases where fillfactor is not
+ * applied in the standard way.
+ *
+ * We are passed the intended insert position of the new tuple, expressed as
+ * the offsetnumber of the tuple it must go in front of (this could be
+ * maxoff+1 if the tuple is to go at the end).  The new tuple itself is also
+ * passed, since it's needed to give some weight to how effective suffix
+ * truncation will be.  The implementation picks the split point that
+ * maximizes the effectiveness of suffix truncation from a small list of
+ * alternative candidate split points that leave each side of the split with
+ * about the same share of free space.  Suffix truncation is secondary to
+ * equalizing free space, except in cases with large numbers of duplicates.
+ * Note that it is always assumed that caller goes on to perform truncation,
+ * even with pg_upgrade'd indexes where that isn't actually the case
+ * (!heapkeyspace indexes).  See nbtree/README for more information about
+ * suffix truncation.
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating whether the new tuple goes on
+ * the left or right page.  The bool is necessary to disambiguate the case
+ * where firstright == newitemoff.
+ */
+OffsetNumber
+_bt_findsplitloc(Relation rel,
+				 Page page,
+				 OffsetNumber newitemoff,
+				 Size newitemsz,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	BTPageOpaque opaque;
+	int			leftspace,
+				rightspace,
+				olddataitemstotal,
+				olddataitemstoleft,
+				perfectpenalty,
+				leaffillfactor;
+	FindSplitData state;
+	FindSplitStrat strategy;
+	ItemId		itemid;
+	OffsetNumber offnum,
+				maxoff,
+				foundfirstright;
+	double		fillfactormult;
+	bool		usemult;
+	SplitPoint	lowpage,
+				highpage;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Total free space available on a btree page, after fixed overhead */
+	leftspace = rightspace =
+		PageGetPageSize(page) - SizeOfPageHeaderData -
+		MAXALIGN(sizeof(BTPageOpaqueData));
+
+	/* The right page will have the same high key as the old page */
+	if (!P_RIGHTMOST(opaque))
+	{
+		itemid = PageGetItemId(page, P_HIKEY);
+		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
+							 sizeof(ItemIdData));
+	}
+
+	/* Count up total space in data items before actually scanning 'em */
+	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
+	leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	state.rel = rel;
+	state.page = page;
+	state.newitem = newitem;
+	state.newitemsz = newitemsz;
+	state.is_leaf = P_ISLEAF(opaque);
+	state.is_rightmost = P_RIGHTMOST(opaque);
+	state.leftspace = leftspace;
+	state.rightspace = rightspace;
+	state.olddataitemstotal = olddataitemstotal;
+	state.newitemoff = newitemoff;
+
+	/*
+	 * maxsplits should never exceed maxoff because there will be at most as
+	 * many candidate split points as there are points _between_ tuples, once
+	 * you imagine that the new item is already on the original page (the
+	 * final number of splits may be slightly lower because not all points
+	 * between tuples will be legal).
+	 */
+	state.maxsplits = maxoff;
+	state.splits = palloc(sizeof(SplitPoint) * state.maxsplits);
+	state.nsplits = 0;
+
+	/*
+	 * Scan through the data items and calculate space usage for a split at
+	 * each possible position.  We start at the first data offset rather than
+	 * the second data offset to handle the "newitemoff == first data offset"
+	 * case (any other split whose firstoldonright is the first data offset
+	 * can't be legal, though, and so won't actually end up being recorded in
+	 * the first loop iteration).
+	 */
+	olddataitemstoleft = 0;
+
+	for (offnum = P_FIRSTDATAKEY(opaque);
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		Size		itemsz;
+
+		itemid = PageGetItemId(page, offnum);
+		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
+
+		/*
+		 * Will the new item go to left or right of split?
+		 */
+		if (offnum > newitemoff)
+			_bt_recsplitloc(&state, offnum, true, olddataitemstoleft, itemsz);
+		else if (offnum < newitemoff)
+			_bt_recsplitloc(&state, offnum, false, olddataitemstoleft, itemsz);
+		else
+		{
+			/* may need to record a split on one or both sides of new item */
+			_bt_recsplitloc(&state, offnum, true, olddataitemstoleft, itemsz);
+			_bt_recsplitloc(&state, offnum, false, olddataitemstoleft, itemsz);
+		}
+
+		olddataitemstoleft += itemsz;
+	}
+
+	/*
+	 * If the new item goes as the last item, record the split point that
+	 * leaves all the old items on the left page, and the new item on the
+	 * right page.  This is required because a split that leaves the new item
+	 * as the firstoldonright won't have been reached within the loop.
+	 */
+	Assert(olddataitemstoleft == olddataitemstotal);
+	if (newitemoff > maxoff)
+		_bt_recsplitloc(&state, newitemoff, false, olddataitemstotal, 0);
+
+	/*
+	 * I believe it is not possible to fail to find a feasible split, but just
+	 * in case ...
+	 */
+	if (state.nsplits == 0)
+		elog(ERROR, "could not find a feasible split point for index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	/*
+	 * Start search for a split point among list of legal split points.  Give
+	 * primary consideration to equalizing available free space in each half
+	 * of the split initially (start with default strategy), while applying
+	 * rightmost where appropriate.  Either of the two other fallback
+	 * strategies may be required for cases with a large number of duplicates
+	 * around the original/space-optimal split point.
+	 *
+	 * Default strategy gives some weight to suffix truncation in deciding a
+	 * split point on leaf pages.  It attempts to select a split point where a
+	 * distinguishing attribute appears earlier in the new high key for the
+	 * left side of the split, in order to maximize the number of trailing
+	 * attributes that can be truncated away.  Only candidate split points
+	 * that imply an acceptable balance of free space on each side are
+	 * considered.
+	 */
+	if (!state.is_leaf)
+	{
+		/* fillfactormult only used on rightmost page */
+		usemult = state.is_rightmost;
+		fillfactormult = BTREE_NONLEAF_FILLFACTOR / 100.0;
+	}
+	else if (state.is_rightmost)
+	{
+		/* Rightmost leaf page --  fillfactormult always used */
+		usemult = true;
+		fillfactormult = leaffillfactor / 100.0;
+	}
+	else
+	{
+		/* Other leaf page.  50:50 page split. */
+		usemult = false;
+		/* fillfactormult not used, but be tidy */
+		fillfactormult = 0.50;
+	}
+
+	/*
+	 * Set an initial limit on the split interval/number of candidate split
+	 * points as appropriate.  The "Prefix B-Trees" paper refers to this as
+	 * sigma l for leaf splits and sigma b for internal ("branch") splits.
+	 * It's hard to provide a theoretical justification for the initial size
+	 * of the split interval, though it's clear that a small split interval
+	 * makes suffix truncation much more effective without noticeably
+	 * affecting space utilization over time.
+	 */
+	state.interval = Min(Max(1, state.nsplits * 0.05),
+						 state.is_leaf ? MAX_LEAF_INTERVAL :
+						 MAX_INTERNAL_INTERVAL);
+
+	/*
+	 * Save leftmost and rightmost splits for page before original ordinal
+	 * sort order is lost by delta/fillfactormult sort
+	 */
+	lowpage = state.splits[0];
+	highpage = state.splits[state.nsplits - 1];
+
+	/* Give split points a fillfactormult-wise delta, and sort on deltas */
+	_bt_deltasortsplits(&state, fillfactormult, usemult);
+
+	/*
+	 * Determine if default strategy/split interval will produce a
+	 * sufficiently distinguishing split, or if we should change strategies.
+	 * Alternative strategies change the range of split points that are
+	 * considered acceptable (split interval), and possibly change
+	 * fillfactormult, in order to deal with pages with a large number of
+	 * duplicates gracefully.
+	 *
+	 * Pass low and high splits for the entire page (including even newitem).
+	 * These are used when the initial split interval encloses split points
+	 * that are full of duplicates, and we need to consider if it's even
+	 * possible to avoid appending a heap TID.
+	 */
+	perfectpenalty = _bt_strategy(&state, &lowpage, &highpage, &strategy);
+
+	if (strategy == SPLIT_DEFAULT)
+	{
+		/*
+		 * Default strategy worked out (always works out with internal page).
+		 * Original split interval still stands.
+		 */
+	}
+
+	/*
+	 * Many duplicates strategy is used when a heap TID would otherwise be
+	 * appended, but the page isn't completely full of logical duplicates.
+	 *
+	 * The split interval is widened to include all legal candidate split
+	 * points.  There may be as few as two distinct values in the whole-page
+	 * split interval.  Many duplicates strategy has no hard requirements for
+	 * space utilization, though it still keeps the use of space balanced as a
+	 * non-binding secondary goal (perfect penalty is set so that the
+	 * first/lowest delta split point that avoids appending a heap TID is
+	 * used).
+	 *
+	 * Single value strategy is used when it is impossible to avoid appending
+	 * a heap TID.  It arranges to leave the left page very full.  This
+	 * maximizes space utilization in cases where tuples with the same
+	 * attribute values span many pages.  Newly inserted duplicates will tend
+	 * to have higher heap TID values, so we'll end up splitting to the right
+	 * consistently.  (Single value strategy is harmless though not
+	 * particularly useful with !heapkeyspace indexes.)
+	 */
+	else if (strategy == SPLIT_MANY_DUPLICATES)
+	{
+		Assert(state.is_leaf);
+		/* No need to resort splits -- no change in fillfactormult/deltas */
+		state.interval = state.nsplits;
+	}
+	else if (strategy == SPLIT_SINGLE_VALUE)
+	{
+		Assert(state.is_leaf);
+		/* Split near the end of the page */
+		usemult = true;
+		fillfactormult = BTREE_SINGLEVAL_FILLFACTOR / 100.0;
+		/* Resort split points with new delta */
+		_bt_deltasortsplits(&state, fillfactormult, usemult);
+		/* Appending a heap TID is unavoidable, so interval of 1 is fine */
+		state.interval = 1;
+	}
+
+	/*
+	 * Search among acceptable split points (using final split interval) for
+	 * the entry that has the lowest penalty, and is therefore expected to
+	 * maximize fan-out.  Sets *newitemonleft for us.
+	 */
+	foundfirstright = _bt_bestsplitloc(&state, perfectpenalty, newitemonleft);
+	pfree(state.splits);
+
+	return foundfirstright;
+}
+
+/*
+ * Subroutine to record a particular point between two tuples (possibly the
+ * new item) on page (ie, combination of firstright and newitemonleft
+ * settings) in *state for later analysis.  This is also a convenient point
+ * to check if the split is legal (if it isn't, it won't be recorded).
+ *
+ * firstoldonright is the offset of the first item on the original page that
+ * goes to the right page, and firstoldonrightsz is the size of that tuple.
+ * firstoldonright can be > max offset, which means that all the old items go
+ * to the left page and only the new item goes to the right page.  In that
+ * case, firstoldonrightsz is not used.
+ *
+ * olddataitemstoleft is the total size of all old items to the left of the
+ * split point that is recorded here when legal.  Should not include
+ * newitemsz, since that is handled here.
+ */
+static void
+_bt_recsplitloc(FindSplitData *state,
+				OffsetNumber firstoldonright,
+				bool newitemonleft,
+				int olddataitemstoleft,
+				Size firstoldonrightsz)
+{
+	int16		leftfree,
+				rightfree;
+	Size		firstrightitemsz;
+	bool		newitemisfirstonright;
+
+	/* Is the new item going to be the first item on the right page? */
+	newitemisfirstonright = (firstoldonright == state->newitemoff
+							 && !newitemonleft);
+
+	if (newitemisfirstonright)
+		firstrightitemsz = state->newitemsz;
+	else
+		firstrightitemsz = firstoldonrightsz;
+
+	/* Account for all the old tuples */
+	leftfree = state->leftspace - olddataitemstoleft;
+	rightfree = state->rightspace -
+		(state->olddataitemstotal - olddataitemstoleft);
+
+	/*
+	 * The first item on the right page becomes the high key of the left page;
+	 * therefore it counts against left space as well as right space (we
+	 * cannot assume that suffix truncation will make it any smaller).  When
+	 * the index has included attributes, those attributes of the left page
+	 * high key will be truncated, leaving that page with slightly more free
+	 * space.  However, that shouldn't affect our ability to find a valid split
+	 * location, since we err in the direction of being pessimistic about free
+	 * space on the left half.  Besides, even when suffix truncation of
+	 * non-TID attributes occurs, the new high key often won't even be a
+	 * single MAXALIGN() quantum smaller than the firstright tuple it's based
+	 * on.
+	 *
+	 * If we are splitting a leaf page, assume that suffix truncation cannot
+	 * avoid adding a heap TID to the left half's new high key.  In practice
+	 * the new high key will often be smaller and will rarely be larger, but
+	 * conservatively assume the worst case.
+	 */
+	if (state->is_leaf)
+		leftfree -= (int16) (firstrightitemsz +
+							 MAXALIGN(sizeof(ItemPointerData)));
+	else
+		leftfree -= (int16) firstrightitemsz;
+
+	/* account for the new item */
+	if (newitemonleft)
+		leftfree -= (int16) state->newitemsz;
+	else
+		rightfree -= (int16) state->newitemsz;
+
+	/*
+	 * If we are not on the leaf level, we will be able to discard the key
+	 * data from the first item that winds up on the right page.
+	 */
+	if (!state->is_leaf)
+		rightfree += (int16) firstrightitemsz -
+			(int16) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
+
+	/* Record split if legal */
+	if (leftfree >= 0 && rightfree >= 0)
+	{
+		Assert(state->nsplits < state->maxsplits);
+
+		state->splits[state->nsplits].curdelta = 0;
+		state->splits[state->nsplits].leftfree = leftfree;
+		state->splits[state->nsplits].rightfree = rightfree;
+		state->splits[state->nsplits].firstoldonright = firstoldonright;
+		state->splits[state->nsplits].newitemonleft = newitemonleft;
+		state->nsplits++;
+	}
+}
+
+/*
+ * Subroutine to assign space deltas to materialized array of candidate split
+ * points based on current fillfactor, and to sort array using that fillfactor
+ */
+static void
+_bt_deltasortsplits(FindSplitData *state, double fillfactormult,
+					bool usemult)
+{
+	for (int i = 0; i < state->nsplits; i++)
+	{
+		SplitPoint *split = state->splits + i;
+		int16		delta;
+
+		if (usemult)
+			delta = fillfactormult * split->leftfree -
+				(1.0 - fillfactormult) * split->rightfree;
+		else
+			delta = split->leftfree - split->rightfree;
+
+		if (delta < 0)
+			delta = -delta;
+
+		/* Save delta */
+		split->curdelta = delta;
+	}
+
+	qsort(state->splits, state->nsplits, sizeof(SplitPoint), _bt_splitcmp);
+}
+
+/*
+ * qsort-style comparator used by _bt_deltasortsplits()
+ */
+static int
+_bt_splitcmp(const void *arg1, const void *arg2)
+{
+	SplitPoint *split1 = (SplitPoint *) arg1;
+	SplitPoint *split2 = (SplitPoint *) arg2;
+
+	if (split1->curdelta > split2->curdelta)
+		return 1;
+	if (split1->curdelta < split2->curdelta)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * Subroutine to find the "best" split point among an array of acceptable
+ * candidate split points, which all split the page without an excessive
+ * delta between the free space left on the left and right halves.  The "best"
+ * split point is the split point with the lowest penalty among split points
+ * that fall within current/final split interval.  Penalty is an abstract
+ * score, with a definition that varies depending on whether we're splitting a
+ * leaf page or an internal page.  See _bt_split_penalty() for details.
+ *
+ * "perfectpenalty" is assumed to be the lowest possible penalty among
+ * candidate split points.  This allows us to return early without wasting
+ * cycles on calculating the first differing attribute for all candidate
+ * splits when that clearly cannot improve our choice (or when we only want a
+ * minimally distinguishing split point, and don't want to make the split any
+ * more unbalanced than is necessary).
+ *
+ * We return the index of the first existing tuple that should go on the right
+ * page, plus a boolean indicating if new item is on left of split point.
+ */
+static OffsetNumber
+_bt_bestsplitloc(FindSplitData *state, int perfectpenalty, bool *newitemonleft)
+{
+	int			bestpenalty,
+				lowsplit;
+	int			highsplit = Min(state->interval, state->nsplits);
+
+	/* No point in calculating penalty when there's only one choice */
+	if (state->nsplits == 1)
+	{
+		*newitemonleft = state->splits[0].newitemonleft;
+		return state->splits[0].firstoldonright;
+	}
+
+	bestpenalty = INT_MAX;
+	lowsplit = 0;
+	for (int i = lowsplit; i < highsplit; i++)
+	{
+		int			penalty;
+
+		penalty = _bt_split_penalty(state, state->splits + i);
+
+		if (penalty <= perfectpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+			break;
+		}
+
+		if (penalty < bestpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+		}
+	}
+
+	*newitemonleft = state->splits[lowsplit].newitemonleft;
+	return state->splits[lowsplit].firstoldonright;
+}
+
+/*
+ * Subroutine to decide whether split should use default strategy/initial
+ * split interval, or whether it should finish splitting the page using
+ * alternative strategies (this is only possible with leaf pages).
+ *
+ * Caller uses alternative strategy (or sticks with default strategy) based
+ * on how *strategy is set here.  Return value is "perfect penalty", which is
+ * passed to _bt_bestsplitloc() as a final constraint on how far caller is
+ * willing to go to avoid appending a heap TID when using the many duplicates
+ * strategy (it also saves _bt_bestsplitloc() useless cycles).
+ */
+static int
+_bt_strategy(FindSplitData *state, SplitPoint *lowpage, SplitPoint *highpage,
+			 FindSplitStrat *strategy)
+{
+	IndexTuple	leftmost,
+				rightmost;
+	SplitPoint *lowinterval,
+			   *highinterval;
+	int			perfectpenalty;
+	int			highsplit = Min(state->interval, state->nsplits);
+	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
+
+	/* Assume that alternative strategy won't be used for now */
+	*strategy = SPLIT_DEFAULT;
+
+	/*
+	 * With internal pages there is no way to efficiently know what the
+	 * smallest tuple could be, but inspecting tuple sizes is cheap, so caller
+	 * searches the interval exhaustively for the smallest pivot.
+	 */
+	if (!state->is_leaf)
+		return MAXALIGN(sizeof(IndexTupleData) + 1);
+
+	/*
+	 * Use leftmost and rightmost tuples within current acceptable range of
+	 * split points (using current split interval)
+	 */
+	lowinterval = state->splits;
+	highinterval = state->splits + highsplit - 1;
+	leftmost = _bt_split_lastleft(state, lowinterval);
+	rightmost = _bt_split_firstright(state, highinterval);
+
+	/*
+	 * If initial split interval can produce a split point that will at least
+	 * avoid appending a heap TID in new high key, we're done.  Finish split
+	 * with default strategy and initial split interval.
+	 */
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	if (perfectpenalty <= indnkeyatts)
+		return perfectpenalty;
+
+	/*
+	 * Work out how caller should finish split when even their "perfect"
+	 * penalty for initial/default split interval indicates that the interval
+	 * does not contain even a single split that avoids appending a heap TID.
+	 *
+	 * Use the leftmost split's lastleft tuple and the rightmost split's
+	 * firstright tuple to assess every possible split.
+	 */
+	leftmost = _bt_split_lastleft(state, lowpage);
+	rightmost = _bt_split_firstright(state, highpage);
+
+	/*
+	 * If page (including new item) has many duplicates but is not entirely
+	 * full of duplicates, a many duplicates strategy split will be performed.
+	 * If page is entirely full of duplicates, a single value strategy split
+	 * will be performed.
+	 */
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	if (perfectpenalty <= indnkeyatts)
+	{
+		*strategy = SPLIT_MANY_DUPLICATES;
+
+		/*
+		 * Caller should choose the lowest delta split that avoids appending a
+		 * heap TID.  Maximizing the number of attributes that can be
+		 * truncated away (returning perfectpenalty when it happens to be less
+		 * than the number of key attributes in index) can result in continual
+		 * unbalanced page splits.
+		 *
+		 * Just avoiding appending a heap TID can still make splits very
+		 * unbalanced, but this is self-limiting.  When final split has a very
+		 * high delta, one side of the split will likely consist of a single
+		 * value.  If that page is split once again, then that split will
+		 * likely use the single value strategy.
+		 */
+		return indnkeyatts;
+	}
+
+	/*
+	 * Single value strategy is only appropriate with ever-increasing heap
+	 * TIDs; otherwise, original default strategy split should proceed to
+	 * avoid pathological performance.  Use page high key to infer if this is
+	 * the rightmost page among pages that store the same duplicate value.
+	 * This should not prevent insertions of heap TIDs that are slightly out
+	 * of order from using single value strategy, since that's expected with
+	 * concurrent inserters of the same duplicate value.
+	 */
+	else if (state->is_rightmost)
+		*strategy = SPLIT_SINGLE_VALUE;
+	else
+	{
+		ItemId		itemid;
+		IndexTuple	hikey;
+
+		itemid = PageGetItemId(state->page, P_HIKEY);
+		hikey = (IndexTuple) PageGetItem(state->page, itemid);
+		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
+											 state->newitem);
+		if (perfectpenalty <= indnkeyatts)
+			*strategy = SPLIT_SINGLE_VALUE;
+		else
+		{
+			/*
+			 * Have caller finish split using default strategy, since page
+			 * does not appear to be the rightmost page for duplicates of the
+			 * value the page is filled with
+			 */
+		}
+	}
+
+	return perfectpenalty;
+}
+
+/*
+ * Subroutine to find penalty for caller's candidate split point.
+ *
+ * On leaf pages, penalty is the attribute number that distinguishes each side
+ * of a split.  It's the last attribute that needs to be included in new high
+ * key for left page.  It can be greater than the number of key attributes in
+ * cases where a heap TID will need to be appended during truncation.
+ *
+ * On internal pages, penalty is simply the size of the first item on the
+ * right half of the split (excluding ItemId overhead) which becomes the new
+ * high key for the left page.
+ */
+static inline int
+_bt_split_penalty(FindSplitData *state, SplitPoint *split)
+{
+	IndexTuple	lastleftuple;
+	IndexTuple	firstrighttuple;
+
+	firstrighttuple = _bt_split_firstright(state, split);
+
+	if (!state->is_leaf)
+		return IndexTupleSize(firstrighttuple);
+
+	lastleftuple = _bt_split_lastleft(state, split);
+
+	Assert(lastleftuple != firstrighttuple);
+	return _bt_keep_natts_fast(state->rel, lastleftuple, firstrighttuple);
+}
+
+/*
+ * Subroutine to get a lastleft IndexTuple for a split point from page
+ */
+static inline IndexTuple
+_bt_split_lastleft(FindSplitData *state, SplitPoint *split)
+{
+	ItemId		itemid;
+
+	if (split->newitemonleft && split->firstoldonright == state->newitemoff)
+		return state->newitem;
+
+	itemid = PageGetItemId(state->page,
+						   OffsetNumberPrev(split->firstoldonright));
+	return (IndexTuple) PageGetItem(state->page, itemid);
+}
+
+/*
+ * Subroutine to get a firstright IndexTuple for a split point from page
+ */
+static inline IndexTuple
+_bt_split_firstright(FindSplitData *state, SplitPoint *split)
+{
+	ItemId		itemid;
+
+	if (!split->newitemonleft && split->firstoldonright == state->newitemoff)
+		return state->newitem;
+
+	itemid = PageGetItemId(state->page, split->firstoldonright);
+	return (IndexTuple) PageGetItem(state->page, itemid);
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 5d53f58f09..2f0124d289 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -22,6 +22,7 @@
 #include "access/relscan.h"
 #include "miscadmin.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -2315,6 +2316,54 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	return keepnatts;
 }
 
+/*
+ * _bt_keep_natts_fast - fast, approximate variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location.  A naive bitwise approach to datum comparisons is used to
+ * save cycles.  This is inherently approximate, but usually provides the same
+ * answer as the authoritative approach that _bt_keep_natts takes, since the
+ * vast majority of types in Postgres cannot be equal according to any
+ * available opclass unless they're bitwise equal.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in the new pivot tuple.
+ */
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= keysz; attnum++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		att = TupleDescAttr(itupdesc, attnum - 1);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
+}
+
 /*
  *  _bt_check_natts() -- Verify tuple has expected number of attributes.
  *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 167b79f2f4..6d00089f53 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -160,11 +160,15 @@ typedef struct BTMetaPageData
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
  * a rightmost page; when splitting non-rightmost pages we try to
- * divide the data equally.
+ * divide the data equally.  When splitting a page that's entirely
+ * filled with a single value (duplicates), the effective leaf-page
+ * fillfactor is 96%, regardless of whether the page is a rightmost
+ * page.
  */
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#define BTREE_SINGLEVAL_FILLFACTOR	96
 
 /*
  *	In general, the btree code tries to localize its knowledge about
@@ -702,6 +706,13 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 
+/*
+ * prototypes for functions in nbtsplitloc.c
+ */
+extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
+				 OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+				 bool *newitemonleft);
+
 /*
  * prototypes for functions in nbtpage.c
  */
@@ -768,6 +779,8 @@ extern bool btproperty(Oid index_oid, int attno,
 		   bool *res, bool *isnull);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 			 IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
+					IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 				OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
-- 
2.17.1
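
A quick way to see the single value strategy in action (illustrative only:
the table/index names below are invented, and the exact figure depends on
tuple width and BLCKSZ): inserting nothing but duplicates of one value means
heap TIDs only ever increase, so rightmost splits of duplicate-filled leaf
pages should consistently use single value strategy, and contrib/pgstattuple
should report a leaf density approaching BTREE_SINGLEVAL_FILLFACTOR rather
than the ~50% that plain 50:50 splits would leave behind:

CREATE EXTENSION IF NOT EXISTS pgstattuple;
CREATE TABLE single_val_demo (v int4);
CREATE INDEX single_val_demo_idx ON single_val_demo (v);
-- Heap TIDs ascend, so leaf splits here should use single value strategy
INSERT INTO single_val_demo SELECT 1 FROM generate_series(1, 500000);
-- Expect avg_leaf_density to approach BTREE_SINGLEVAL_FILLFACTOR (96)
SELECT avg_leaf_density FROM pgstatindex('single_val_demo_idx');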

Attachment: v15-0005-Add-split-after-new-tuple-optimization.patch (application/octet-stream)
From ab3453766995aa5ffabc349f574cdcb993501b70 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 16:48:08 -0700
Subject: [PATCH v15 5/7] Add "split after new tuple" optimization.

Add additional heuristics to the algorithm for locating an optimal split
location.  New logic identifies localized monotonically increasing
values in indexes with multiple columns.  When this insertion pattern is
detected, the page is split just after the new item that provoked the page
split (or leaf fillfactor is applied in the style of a rightmost page split).
The optimization is very similar to the long-established fillfactor
optimization used during rightmost page splits, where leaf pages are
similarly left about 90% full.  It should be considered a variation of
the same optimization.

50/50 page splits are only appropriate with a pattern of truly random
insertions, where the average space utilization ends up at 65% - 70%.
Without this patch, affected cases have leaf pages that are no more than
about 50% full on average.  Future insertions can never make use of the
free space left behind.  With this patch, affected cases have leaf pages
that are about 90% full on average (with a fillfactor of 90).  Localized
monotonically increasing insertion patterns are presumed to be fairly
common in real-world applications.

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.
---
 src/backend/access/nbtree/nbtsplitloc.c | 227 +++++++++++++++++++++++-
 1 file changed, 224 insertions(+), 3 deletions(-)
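
To make the target workload concrete, here's an illustrative test case (all
names and row counts are invented; exact figures depend on tuple width).
item_no ascends within each invoice_id grouping, while insertions are
interleaved across many "open" invoices.  Without the optimization this kind
of index tends towards ~50% leaf density, since the left half of each 50:50
split never receives another insertion; with the patch applied it should come
out much closer to the 90% leaf fillfactor:

CREATE TABLE invoice_items (invoice_id int8, item_no int8);
CREATE INDEX invoice_items_idx ON invoice_items (invoice_id, item_no);
-- item_no increases over time within each invoice_id grouping
INSERT INTO invoice_items
SELECT i % 1000, i / 1000 FROM generate_series(1, 1000000) i;
-- Compare avg_leaf_density here against an unpatched server
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT avg_leaf_density FROM pgstatindex('invoice_items_idx');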

diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 34165c52db..745e394db4 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -69,6 +69,9 @@ static void _bt_recsplitloc(FindSplitData *state,
 static void _bt_deltasortsplits(FindSplitData *state, double fillfactormult,
 					bool usemult);
 static int	_bt_splitcmp(const void *arg1, const void *arg2);
+static bool _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
+					int leaffillfactor, bool *usemult);
+static bool _bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid);
 static OffsetNumber _bt_bestsplitloc(FindSplitData *state, int perfectpenalty,
 				 bool *newitemonleft);
 static int _bt_strategy(FindSplitData *state, SplitPoint *lowpage,
@@ -245,9 +248,10 @@ _bt_findsplitloc(Relation rel,
 	 * Start search for a split point among list of legal split points.  Give
 	 * primary consideration to equalizing available free space in each half
 	 * of the split initially (start with default strategy), while applying
-	 * rightmost where appropriate.  Either of the two other fallback
-	 * strategies may be required for cases with a large number of duplicates
-	 * around the original/space-optimal split point.
+	 * rightmost and split-after-new-item optimizations where appropriate.
+	 * Either of the two other fallback strategies may be required for cases
+	 * with a large number of duplicates around the original/space-optimal
+	 * split point.
 	 *
 	 * Default strategy gives some weight to suffix truncation in deciding a
 	 * split point on leaf pages.  It attempts to select a split point where a
@@ -269,6 +273,44 @@ _bt_findsplitloc(Relation rel,
 		usemult = true;
 		fillfactormult = leaffillfactor / 100.0;
 	}
+	else if (_bt_afternewitemoff(&state, maxoff, leaffillfactor, &usemult))
+	{
+		/*
+		 * New item inserted at rightmost point among a localized grouping on
+		 * a leaf page -- apply "split after new item" optimization, either by
+		 * applying leaf fillfactor multiplier, or by choosing the exact split
+		 * point that leaves the new item as last on the left. (usemult is set
+		 * for us.)
+		 */
+		if (usemult)
+		{
+			/* fillfactormult should be set based on leaf fillfactor */
+			fillfactormult = leaffillfactor / 100.0;
+		}
+		else
+		{
+			/* find precise split point after newitemoff */
+			for (int i = 0; i < state.nsplits; i++)
+			{
+				SplitPoint *split = state.splits + i;
+
+				if (split->newitemonleft &&
+					newitemoff == split->firstoldonright)
+				{
+					pfree(state.splits);
+					*newitemonleft = true;
+					return newitemoff;
+				}
+			}
+
+			/*
+			 * Cannot legally split after newitemoff; proceed with split
+			 * without using fillfactor multiplier.  This is defensive, and
+			 * should never be needed in practice.
+			 */
+			fillfactormult = 0.50;
+		}
+	}
 	else
 	{
 		/* Other leaf page.  50:50 page split. */
@@ -512,6 +554,185 @@ _bt_splitcmp(const void *arg1, const void *arg2)
 	return 0;
 }
 
+/*
+ * Subroutine to determine whether or not the page should be split immediately
+ * after the would-be original page offset for the new/incoming tuple.  This
+ * is appropriate when there is a pattern of localized monotonically
+ * increasing insertions into a composite index, grouped by one or more
+ * leading attribute values.  This is prevalent in many real world
+ * applications.  Consider the example of a composite index on '(invoice_id,
+ * item_no)', where the item_no for each invoice is an identifier assigned in
+ * ascending order (invoice_id could itself be assigned in monotonically
+ * increasing order, but that shouldn't matter).  Without this optimization,
+ * approximately 50% of space in leaf pages will be wasted by 50:50/!usemult
+ * page splits.  With this optimization, space utilization will be close to
+ * that of a similar index where all tuple insertions modify the current
+ * rightmost leaf page in the index (i.e. typically 90% for leaf pages).
+ *
+ * When the optimization is applied, the new/incoming tuple becomes the last
+ * tuple on the new left page.  (Actually, newitemoff > maxoff cases often use
+ * this optimization within indexes where monotonically increasing insertions
+ * of each grouping come in multiple "bursts" over time, such as a composite
+ * index on '(supplier_id, invoice_id, item_no)'.  Caller applies leaf
+ * fillfactor in the style of a rightmost leaf page split when newitemoff is
+ * at or very near the end of the original page.)
+ *
+ * This optimization may leave extra free space remaining on the rightmost
+ * page of a "most significant column" grouping of tuples if that grouping
+ * never ends up having future insertions that use the free space.  That
+ * effect is self-limiting; a future grouping that becomes the "nearest on the
+ * right" grouping of the affected grouping usually puts the extra free space
+ * to good use.  In general, it's important to avoid a pattern of pathological
+ * page splits that consistently do the wrong thing.
+ *
+ * Caller uses optimization when routine returns true, though the exact action
+ * taken by caller varies.  Caller uses original leaf page fillfactor in
+ * standard way rather than using the new item offset directly when *usemult
+ * was also set to true here.  Otherwise, caller applies optimization by
+ * locating the legal split point that makes the new tuple the very last tuple
+ * on the left side of the split.
+ */
+static bool
+_bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
+					int leaffillfactor, bool *usemult)
+{
+	int16		nkeyatts;
+	ItemId		itemid;
+	IndexTuple	tup;
+	int			keepnatts;
+
+	Assert(!state->is_rightmost);
+
+	nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
+	/* Assume leaffillfactor will be used by caller for now */
+	*usemult = true;
+
+	/* Single key indexes not considered here */
+	if (nkeyatts == 1)
+		return false;
+
+	/* Ascending insertion pattern never inferred when new item is first */
+	if (state->newitemoff == P_FIRSTKEY)
+		return false;
+
+	/*
+	 * Avoid applying optimization when tuples are wider than a tuple
+	 * consisting of two non-NULL int8/int64 attributes (or four non-NULL
+	 * int4/int32 attributes)
+	 */
+	if (state->newitemsz >
+		MAXALIGN(sizeof(IndexTupleData) + sizeof(int64) * 2) +
+		sizeof(ItemIdData))
+		return false;
+
+	/*
+	 * Only apply optimization on pages with equisized tuples.  Surmise that
+	 * page has equisized tuples when page layout is consistent with having
+	 * maxoff-1 non-pivot tuples that are all the same size as the newly
+	 * inserted tuple (note that the possibly-truncated high key isn't counted
+	 * in olddataitemstotal).
+	 */
+	if (state->newitemsz * (maxoff - 1) != state->olddataitemstotal)
+		return false;
+
+	/*
+	 * At least the first attribute's value must be equal to the corresponding
+	 * value in previous tuple to apply optimization.  New item cannot be a
+	 * duplicate, either.
+	 *
+	 * Handle case where new item is to the right of all items on the existing
+	 * page.  This is suggestive of monotonically increasing insertions in
+	 * itself, so the "heap TID adjacency" test is not applied here.
+	 * Concurrent insertions from backends associated with the same grouping
+	 * or sub-grouping should still have the optimization applied; if the
+	 * grouping is rather large, splits will consistently end up here.
+	 */
+	if (state->newitemoff > maxoff)
+	{
+		itemid = PageGetItemId(state->page, maxoff);
+		tup = (IndexTuple) PageGetItem(state->page, itemid);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+
+		if (keepnatts > 1 && keepnatts <= nkeyatts)
+			return true;
+
+		return false;
+	}
+
+	/*
+	 * When item isn't last (or first) on page, but is deemed suitable for the
+	 * optimization, caller splits at the point immediately after the would-be
+	 * position of the new item, and immediately before the item after the new
+	 * item.
+	 *
+	 * "Low cardinality leading column, high cardinality suffix column"
+	 * indexes with a random insertion pattern (e.g. an index with a boolean
+	 * column, such as an index on '(book_is_in_print, book_isbn)') present us
+	 * with a risk of consistently misapplying the optimization.  We're
+	 * willing to accept very occasional misapplication of the optimization,
+	 * provided the cases where we get it wrong are rare and self-limiting.
+	 * Heap TID adjacency strongly suggests that the item just to the left was
+	 * inserted very recently, which prevents most misfirings.  Besides, all
+	 * inappropriate cases triggered at this point will still split in the
+	 * middle of the page on average.
+	 */
+	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
+	tup = (IndexTuple) PageGetItem(state->page, itemid);
+	/* Do cheaper test first */
+	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+		return false;
+	/* Check same conditions as rightmost item case, too */
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+
+	/*
+	 * Don't allow caller to split after a new item when it will result in a
+	 * split point to the right of the point that a leaf fillfactor split
+	 * would use -- have caller apply leaf fillfactor instead.  There is no
+	 * advantage to being very aggressive in any case.  It may not be legal to
+	 * split very close to maxoff.
+	 */
+	if (keepnatts > 1 && keepnatts <= nkeyatts)
+	{
+		double		interp = (double) state->newitemoff / ((double) maxoff + 1);
+		double		leaffillfactormult = (double) leaffillfactor / 100.0;
+
+		if (interp <= leaffillfactormult)
+			*usemult = false;
+
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Subroutine for determining if two heap TIDs are "adjacent".
+ *
+ * Adjacent means that the high TID is very likely to have been inserted into
+ * heap relation immediately after the low TID, probably by the same
+ * transaction.
+ */
+static bool
+_bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid)
+{
+	BlockNumber lowblk,
+				highblk;
+
+	lowblk = ItemPointerGetBlockNumber(lowhtid);
+	highblk = ItemPointerGetBlockNumber(highhtid);
+
+	/* Make optimistic assumption of adjacency when heap blocks match */
+	if (lowblk == highblk)
+		return true;
+
+	/* When heap block one up, second offset should be FirstOffsetNumber */
+	if (lowblk + 1 == highblk &&
+		ItemPointerGetOffsetNumber(highhtid) == FirstOffsetNumber)
+		return true;
+
+	return false;
+}
+
 /*
  * Subroutine to find the "best" split point among an array of acceptable
  * candidate split points that split without there being an excessively high
-- 
2.17.1

Attachment: v15-0004-Allow-tuples-to-be-relocated-from-root-by-amchec.patch (application/octet-stream)
From 3a9e101fc838fb37f7389ea4140d1e3fd0d3d227 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 31 Jan 2019 17:40:00 -0800
Subject: [PATCH v15 4/7] Allow tuples to be relocated from root by amcheck.

Teach contrib/amcheck's bt_index_parent_check() function to take
advantage of the uniqueness property of heapkeyspace indexes in support
of a new verification option: non-pivot tuples (non-highkey tuples on
the leaf level) can optionally be relocated using a new search that
starts from the root page.

The new "relocate" verification option is exhaustive, and can therefore
make a call to bt_index_parent_check() take a lot longer.  Relocating
tuples during verification is intended as an option for backend
developers, since the corruption scenarios that it alone is uniquely
capable of detecting seem fairly far-fetched.  For example, "relocate"
verification is generally the only way of detecting corruption of the
least significant byte of a key from a pivot tuple in the root page,
since only a few tuples on a cousin leaf page are liable to "be
overlooked" by index scans.
---
 contrib/amcheck/Makefile                 |   2 +-
 contrib/amcheck/amcheck--1.1--1.2.sql    |  19 +++
 contrib/amcheck/amcheck.control          |   2 +-
 contrib/amcheck/expected/check_btree.out |   5 +-
 contrib/amcheck/sql/check_btree.sql      |   5 +-
 contrib/amcheck/verify_nbtree.c          | 157 +++++++++++++++++++++--
 doc/src/sgml/amcheck.sgml                |   7 +-
 7 files changed, 181 insertions(+), 16 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.1--1.2.sql
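
For anyone who wants to try the new option out, usage mirrors the updated
regression test.  The index name below is invented, and the amcheck extension
must be at version 1.2 (run ALTER EXTENSION amcheck UPDATE if it was
originally created at 1.0 or 1.1):

-- heapallindexed plus relocate verification; raises an error such as
-- "could not relocate tuple in index ..." only when corruption is detected
SELECT bt_index_parent_check('some_btree_pkey', true, true);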

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index c5764b544f..dcec3b8520 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big	= amcheck
 OBJS		= verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
-DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
+DATA = amcheck--1.1--1.2.sql amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.1--1.2.sql b/contrib/amcheck/amcheck--1.1--1.2.sql
new file mode 100644
index 0000000000..de7b657f2f
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.1--1.2.sql
@@ -0,0 +1,19 @@
+/* contrib/amcheck/amcheck--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.2'" to load this file. \quit
+
+-- In order to avoid issues with dependencies when updating amcheck to 1.2,
+-- create new, overloaded version of the 1.1 function signature
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass,
+    heapallindexed boolean, relocate boolean)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want this to be available to public
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 469048403d..c6e310046d 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
 # amcheck extension
 comment = 'functions for verifying relation integrity'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/amcheck'
 relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 1e6079ddd2..687fde8fce 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -126,7 +126,8 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 (1 row)
 
 --
--- Test for multilevel page deletion/downlink present checks
+-- Test for multilevel page deletion/downlink present checks, and relocate
+-- checks
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
@@ -137,7 +138,7 @@ VACUUM delete_test_table;
 -- root"
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
-SELECT bt_index_parent_check('delete_test_table_pkey', true);
+SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
  bt_index_parent_check 
 -----------------------
  
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index 3f1e0d17ef..d33d3e6682 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -78,7 +78,8 @@ INSERT INTO bttest_multi SELECT i, i%2  FROM generate_series(1, 100000) as i;
 SELECT bt_index_parent_check('bttest_multi_idx', true);
 
 --
--- Test for multilevel page deletion/downlink present checks
+-- Test for multilevel page deletion/downlink present checks, and relocate
+-- checks
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
@@ -89,7 +90,7 @@ VACUUM delete_test_table;
 -- root"
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
-SELECT bt_index_parent_check('delete_test_table_pkey', true);
+SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
 
 --
 -- BUG #15597: must not assume consistent input toasting state when forming
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 0a005afa34..ee97463894 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -74,6 +74,8 @@ typedef struct BtreeCheckState
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
 	bool		heapallindexed;
+	/* Also relocating non-pivot tuples? */
+	bool		relocate;
 	/* Per-page context */
 	MemoryContext targetcontext;
 	/* Buffer access strategy */
@@ -123,10 +125,11 @@ PG_FUNCTION_INFO_V1(bt_index_check);
 PG_FUNCTION_INFO_V1(bt_index_parent_check);
 
 static void bt_index_check_internal(Oid indrelid, bool parentcheck,
-						bool heapallindexed);
+						bool heapallindexed, bool relocate);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool heapkeyspace, bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed,
+					 bool relocate);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -139,6 +142,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 						   IndexTuple itup);
+static bool bt_relocate_from_root(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
@@ -176,7 +180,7 @@ bt_index_check(PG_FUNCTION_ARGS)
 	if (PG_NARGS() == 2)
 		heapallindexed = PG_GETARG_BOOL(1);
 
-	bt_index_check_internal(indrelid, false, heapallindexed);
+	bt_index_check_internal(indrelid, false, heapallindexed, false);
 
 	PG_RETURN_VOID();
 }
@@ -195,11 +199,14 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
 {
 	Oid			indrelid = PG_GETARG_OID(0);
 	bool		heapallindexed = false;
+	bool		relocate = false;
 
-	if (PG_NARGS() == 2)
+	if (PG_NARGS() >= 2)
 		heapallindexed = PG_GETARG_BOOL(1);
+	if (PG_NARGS() == 3)
+		relocate = PG_GETARG_BOOL(2);
 
-	bt_index_check_internal(indrelid, true, heapallindexed);
+	bt_index_check_internal(indrelid, true, heapallindexed, relocate);
 
 	PG_RETURN_VOID();
 }
@@ -208,7 +215,8 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
  * Helper for bt_index_[parent_]check, coordinating the bulk of the work.
  */
 static void
-bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
+						bool relocate)
 {
 	Oid			heapid;
 	Relation	indrel;
@@ -266,7 +274,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	/* Check index, possibly against table it is an index on */
 	heapkeyspace = _bt_heapkeyspace(indrel);
 	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
-						 heapallindexed);
+						 heapallindexed, relocate);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -337,7 +345,7 @@ btree_index_checkable(Relation rel)
  */
 static void
 bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
-					 bool readonly, bool heapallindexed)
+					 bool readonly, bool heapallindexed, bool relocate)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -361,6 +369,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
+	state->relocate = relocate;
 
 	if (state->heapallindexed)
 	{
@@ -429,6 +438,14 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		}
 	}
 
+	Assert(!state->relocate || state->readonly);
+	if (state->relocate && !state->heapkeyspace)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("index \"%s\" does not support relocating tuples",
+						RelationGetRelationName(rel)),
+				 errhint("Only indexes initialized on PostgreSQL 12 support relocation verification.")));
+
 	/* Create context for page */
 	state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
 												 "amcheck context",
@@ -921,6 +938,32 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
+		/*
+		 * Readonly callers may optionally relocate non-pivot tuples for
+		 * heapkeyspace indexes.  A new search starting from the root
+		 * relocates every current entry in turn.
+		 */
+		if (state->relocate && P_ISLEAF(topaque) &&
+			!bt_relocate_from_root(state, itup))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("could not relocate tuple in index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+										itid, htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_minusinfkey(state->rel, itup);
 
@@ -1525,6 +1568,9 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 		 * internal pages.  In more general terms, a negative infinity item is
 		 * only negative infinity with respect to the subtree that the page is
 		 * at the root of.
+		 *
+		 * See also: bt_relocate_from_root(), which can even detect transitive
+		 * inconsistencies on cousin leaf pages.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
@@ -1925,6 +1971,101 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Search for itup in index, starting from fast root page.  itup must be a
+ * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
+ * we rely on having fully unique keys to relocate itup without visiting more
+ * than one page on each level, barring an interrupted page split, where we
+ * may have to move right.  (A concurrent page split is impossible because
+ * caller must be readonly caller.)
+ *
+ * This routine can detect very subtle transitive consistency issues across
+ * more than one level of the tree.  Leaf pages all have a high key (even the
+ * rightmost page has a conceptual positive infinity high key), but not a low
+ * key.  Their downlink in parent is a lower bound, which along with the high
+ * key is almost enough to detect every possible inconsistency.  A downlink
+ * separator key value won't always be available from parent, though, because
+ * the first items of internal pages are negative infinity items, truncated
+ * down to zero attributes during internal page splits.  While it's true that
+ * bt_downlink_check() and the high key check can detect most imaginable key
+ * space problems, there are remaining problems they won't detect with non-pivot
+ * tuples in cousin leaf pages.  Starting a search from the root for every
+ * existing leaf tuple detects small inconsistencies in upper levels of the
+ * tree that cannot be detected any other way.  (Besides all this, it's
+ * probably a useful testing strategy to exhaustively verify that all
+ * non-pivot tuples can be relocated in the index using the same code paths as
+ * those used by index scans.)
+ *
+ * Alternative nbtree design that could be used to perform "cousin verification"
+ * inexpensively/transitively (may make current approach clearer):
+ *
+ * A cousin leaf page has a lower bound that comes from its grandparent page
+ * rather than its parent page, as already discussed (note also that a "second
+ * cousin" leaf page gets its lower bound from its great-grandparent, and so
+ * on).  Any pivot tuple in the root page after the first tuple (which is an
+ * "absolute" negative infinity tuple, since its leftmost on the level) should
+ * separate every leaf page into <= and > pages.  Even with the current
+ * design, there should be an unbroken seam of identical-to-root-pivot high
+ * key separator values at the right edge of the <= pages, all the way down to
+ * (and including) the leaf level.  Recall that page deletion won't delete the
+ * rightmost child of a parent page unless the child page is the only child,
+ * in which case the parent is deleted with the child.
+ *
+ * If we didn't truncate the item at first/negative infinity offset to zero
+ * attributes during internal page splits then there would also be an unbroken
+ * seam of identical-to-root-pivot "low key" separator values on the left edge
+ * of the pages that are > the separator value; this alternative design would
+ * allow us to verify the same invariants directly, without ever having to
+ * cross more than one level of the tree (we'd still have to cross one level
+ * because leaf pages would still not store a low key directly, and we'd still
+ * need bitwise-equality cross checks of downlink separator in parent against
+ * the low keys in their non-leaf children).
+ */
+static bool
+bt_relocate_from_root(BtreeCheckState *state, IndexTuple itup)
+{
+	BTScanInsert key;
+	BTStack		stack;
+	Buffer		lbuf;
+	bool		exists;
+
+	/* No need to use bt_mkscankey_minusinfkey() here */
+	key = _bt_mkscankey(state->rel, itup);
+	Assert(key->heapkeyspace && key->scantid != NULL);
+
+	/*
+	 * Search from root.
+	 *
+	 * Ideally, we would arrange to only move right within _bt_search() when
+	 * an interrupted page split is detected (i.e. when the incomplete split
+	 * bit is found to be set), but for now we accept the possibility that
+	 * that could conceal certain remaining inconsistencies.
+	 */
+	Assert(state->readonly && state->relocate);
+	exists = false;
+	stack = _bt_search(state->rel, key, &lbuf, BT_READ, NULL);
+
+	if (BufferIsValid(lbuf))
+	{
+		OffsetNumber offnum;
+		Page		page;
+
+		/* Get matching tuple on leaf page */
+		offnum = _bt_binsrch(state->rel, key, lbuf);
+		/* Compare first >= matching item on leaf page, if any */
+		page = BufferGetPage(lbuf);
+		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			_bt_compare(state->rel, key, page, offnum) == 0)
+			exists = true;
+		_bt_relbuf(state->rel, lbuf);
+	}
+
+	_bt_freestack(stack);
+	pfree(key);
+
+	return exists;
+}
+
 /*
  * Is particular offset within page (whose special state is passed by caller)
  * the page negative-infinity item?
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 8bb60d5c2d..c638456638 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -112,7 +112,7 @@ ORDER BY c.relpages DESC LIMIT 10;
 
    <varlistentry>
     <term>
-     <function>bt_index_parent_check(index regclass, heapallindexed boolean) returns void</function>
+     <function>bt_index_parent_check(index regclass, heapallindexed boolean, relocate boolean) returns void</function>
      <indexterm>
       <primary>bt_index_parent_check</primary>
      </indexterm>
@@ -126,7 +126,10 @@ ORDER BY c.relpages DESC LIMIT 10;
       argument is <literal>true</literal>, the function verifies the
       presence of all heap tuples that should be found within the
       index, and that there are no missing downlinks in the index
-      structure.  The checks that can be performed by
+      structure.  When the optional <parameter>relocate</parameter>
+      argument is <literal>true</literal>, verification relocates
+      tuples on the leaf level by performing a new search from the
+      root page.  The checks that can be performed by
       <function>bt_index_parent_check</function> are a superset of the
       checks that can be performed by <function>bt_index_check</function>.
       <function>bt_index_parent_check</function> can be thought of as
-- 
2.17.1
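
For reference, the new relocate argument documented above could be exercised
roughly like this (an illustrative sketch only; the index name is borrowed
from amcheck's own regression tests, and this is not part of the patch):

  -- verify parent/child relationships and heap presence, and also re-find
  -- each leaf tuple by descending from the root
  SELECT bt_index_parent_check('delete_test_table_pkey', true, true);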

Attachment: v15-0002-Make-heap-TID-a-tie-breaker-nbtree-index-column.patch (application/octet-stream)
From 6ddc30d63d6eceab33f9d97047c934062cddafef Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v15 2/7] Make heap TID a tie-breaker nbtree index column.

Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique.  This general approach has numerous
benefits for performance, and is a prerequisite to teaching VACUUM to
perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tie-breaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing the newer/more strict invariants with
version 4 indexes.

A later patch will enhance the logic used by nbtree to pick a split
point.  Without smarter choices about precisely where to split leaf
pages, this patch is likely to regress performance.  Keeping these two
mostly-distinct sets of enhancements in separate commits should clarify
their design, even though neither commit is particularly useful on its
own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d item pointer in a new
high key during leaf page splits.  The user-facing definition of the
"1/3 of a page" restriction is already imprecise, and so does not need
to be revised.  However, there should be a compatibility note in the v12
release notes.  The new maximum allowed size is 2704 bytes on 64-bit
systems, down from 2712 bytes.
---
---
 contrib/amcheck/expected/check_btree.out     |   5 +-
 contrib/amcheck/sql/check_btree.sql          |   5 +-
 contrib/amcheck/verify_nbtree.c              | 344 +++++++++++++--
 contrib/pageinspect/btreefuncs.c             |   2 +-
 contrib/pageinspect/expected/btree.out       |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out |  10 +-
 doc/src/sgml/indices.sgml                    |  24 +-
 src/backend/access/common/indextuple.c       |   6 +-
 src/backend/access/nbtree/README             | 136 +++---
 src/backend/access/nbtree/nbtinsert.c        | 298 ++++++++-----
 src/backend/access/nbtree/nbtpage.c          | 196 ++++++---
 src/backend/access/nbtree/nbtree.c           |   2 +-
 src/backend/access/nbtree/nbtsearch.c        | 136 +++++-
 src/backend/access/nbtree/nbtsort.c          |  91 ++--
 src/backend/access/nbtree/nbtutils.c         | 430 +++++++++++++++++--
 src/backend/access/nbtree/nbtxlog.c          |  43 +-
 src/backend/access/rmgrdesc/nbtdesc.c        |   8 -
 src/backend/utils/sort/tuplesort.c           |  13 +-
 src/include/access/nbtree.h                  | 211 +++++++--
 src/include/access/nbtxlog.h                 |  35 +-
 src/test/regress/expected/btree_index.out    |  34 +-
 src/test/regress/expected/create_index.out   |  13 +-
 src/test/regress/expected/dependency.out     |   4 +-
 src/test/regress/expected/event_trigger.out  |   4 +-
 src/test/regress/expected/foreign_data.out   |   8 +-
 src/test/regress/expected/rowsecurity.out    |   4 +-
 src/test/regress/sql/btree_index.sql         |  37 +-
 src/test/regress/sql/create_index.sql        |  14 +-
 28 files changed, 1601 insertions(+), 514 deletions(-)

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index ef5c9e1a1c..1e6079ddd2 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -130,9 +130,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true);
  bt_index_parent_check 
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index 0ad1631476..3f1e0d17ef 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -82,9 +82,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true);
 
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 053ac9d192..0a005afa34 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -45,6 +45,8 @@ PG_MODULE_MAGIC;
  * block per level, which is bound by the range of BlockNumber:
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
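+/*
+ * Get the number of key attributes physically present in an index tuple,
+ * ignoring any non-key (INCLUDE) attributes stored in non-pivot tuples
+ */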
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
 
 /*
  * State associated with verifying a B-Tree index
@@ -66,6 +68,8 @@ typedef struct BtreeCheckState
 	/* B-Tree Index Relation and associated heap relation */
 	Relation	rel;
 	Relation	heaprel;
+	/* rel is heapkeyspace index? */
+	bool		heapkeyspace;
 	/* ShareLock held on heap/index, rather than AccessShareLock? */
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
@@ -122,7 +126,7 @@ static void bt_index_check_internal(Oid indrelid, bool parentcheck,
 						bool heapallindexed);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -137,17 +141,22 @@ static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 						   IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
+static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
 					 BTScanInsert key,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 BTScanInsert key,
-					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   BTScanInsert key,
-							   Page nontarget,
-							   OffsetNumber upperbound);
+static inline bool invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							 BTScanInsert key,
+							 Page nontarget,
+							 OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline BTScanInsert bt_mkscankey_minusinfkey(Relation rel,
+													IndexTuple itup);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+							IndexTuple itup, bool nonpivot);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -204,6 +213,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	Oid			heapid;
 	Relation	indrel;
 	Relation	heaprel;
+	bool		heapkeyspace;
 	LOCKMODE	lockmode;
 
 	if (parentcheck)
@@ -254,7 +264,9 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	btree_index_checkable(indrel);
 
 	/* Check index, possibly against table it is an index on */
-	bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
+	heapkeyspace = _bt_heapkeyspace(indrel);
+	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
+						 heapallindexed);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -324,8 +336,8 @@ btree_index_checkable(Relation rel)
  * parent/child check cannot be affected.)
  */
 static void
-bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
-					 bool heapallindexed)
+bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
+					 bool readonly, bool heapallindexed)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -346,6 +358,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
 	state = palloc0(sizeof(BtreeCheckState));
 	state->rel = rel;
 	state->heaprel = heaprel;
+	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
 
@@ -806,7 +819,8 @@ bt_target_page_check(BtreeCheckState *state)
 	 * doesn't contain a high key, so nothing to check
 	 */
 	if (!P_RIGHTMOST(topaque) &&
-		!_bt_check_natts(state->rel, state->target, P_HIKEY))
+		!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+						 P_HIKEY))
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
@@ -839,6 +853,7 @@ bt_target_page_check(BtreeCheckState *state)
 		IndexTuple	itup;
 		size_t		tupsize;
 		BTScanInsert skey;
+		bool		lowersizelimit;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -865,7 +880,8 @@ bt_target_page_check(BtreeCheckState *state)
 					 errhint("This could be a torn page problem.")));
 
 		/* Check the number of index tuple attributes */
-		if (!_bt_check_natts(state->rel, state->target, offset))
+		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+							 offset))
 		{
 			char	   *itid,
 					   *htid;
@@ -906,7 +922,56 @@ bt_target_page_check(BtreeCheckState *state)
 			continue;
 
 		/* Build insertion scankey for current page offset */
-		skey = _bt_mkscankey(state->rel, itup);
+		skey = bt_mkscankey_minusinfkey(state->rel, itup);
+
+		/*
+		 * Make sure tuple size does not exceed the relevant BTREE_VERSION
+		 * specific limit.
+		 *
+		 * BTREE_VERSION 4 (which introduced heapkeyspace rules) requisitioned
+		 * a small amount of space from BTMaxItemSize() in order to ensure
+		 * that suffix truncation always has enough space to add an explicit
+		 * heap TID back to a tuple -- we pessimistically assume that every
+		 * newly inserted tuple will eventually need to have a heap TID
+		 * appended during a future leaf page split, when the tuple becomes
+		 * the basis of the new high key (pivot tuple) for the leaf page.
+		 *
+		 * Since the reclaimed space is reserved for that purpose, we must not
+		 * enforce the slightly lower limit when the extra space has been used
+		 * as intended.  In other words, there is only a cross-version
+		 * difference in the limit on tuple size within leaf pages.
+		 *
+		 * Still, we're particular about the details within BTREE_VERSION 4
+		 * internal pages.  Pivot tuples may only use the extra space for its
+		 * designated purpose.  Enforce the lower limit for pivot tuples when
+		 * an explicit heap TID isn't actually present. (In all other cases
+		 * suffix truncation is guaranteed to generate a pivot tuple that's no
+		 * larger than the first right tuple provided to it by its caller.)
+		 */
+		lowersizelimit = skey->heapkeyspace &&
+			(P_ISLEAF(topaque) || BTreeTupleGetHeapTID(itup) == NULL);
+		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
+					   BTMaxItemSizeNoHeapTid(state->target)))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
+							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index row size %zu exceeds maximum for index \"%s\"",
+							tupsize, RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap" : "index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -940,9 +1005,35 @@ bt_target_page_check(BtreeCheckState *state)
 		 * grandparents (as well as great-grandparents, and so on).  We don't
 		 * go to those lengths because that would be prohibitively expensive,
 		 * and probably not markedly more effective in practice.
+		 *
+		 * On the leaf level, we check that the key is <= the highkey.
+		 * However, on non-leaf levels we check that the key is < the highkey,
+		 * because the high key is "just another separator" rather than a copy
+		 * of some existing key item; we expect it to be unique among all keys
+		 * on the same level.  (Suffix truncation will sometimes produce a
+		 * leaf highkey that is an untruncated copy of the lastleft item, but
+		 * never any other item, which necessitates weakening the leaf level
+		 * check to <=.)
+		 *
+		 * Full explanation for why an internal page's highkey is never truly
+		 * a copy of another item on the same level:
+		 *
+		 * While the new left page's high key is copied from the first offset
+		 * on the right page during an internal page split, that's not the
+		 * full story.  In effect, internal pages are split in the middle of
+		 * the firstright tuple, not between the would-be lastleft and
+		 * firstright tuples: the firstright key ends up on the left side as
+		 * left's new highkey, and the firstright downlink ends up on the
+		 * right side as right's new "negative infinity" item.  The negative
+		 * infinity tuple is truncated to zero attributes, so we're only left
+		 * with the downlink.  In other words, the copying is just an
+		 * implementation detail of splitting in the middle of a (pivot)
+		 * tuple. (See also: "Notes About Data Representation" in the nbtree
+		 * README.)
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
+			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
 			char	   *itid,
 					   *htid;
@@ -968,11 +1059,10 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1035,7 +1125,7 @@ bt_target_page_check(BtreeCheckState *state)
 			rightkey = bt_right_page_check_scankey(state);
 
 			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+				!invariant_g_offset(state, rightkey, max))
 			{
 				/*
 				 * As explained at length in bt_right_page_check_scankey(),
@@ -1213,9 +1303,9 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * continued existence of target block as non-ignorable (not half-dead or
 	 * deleted) implies that target page was not merged into from the right by
 	 * deletion; the key space at or after target never moved left.  Target's
-	 * parent either has the same downlink to target as before, or a <=
+	 * parent either has the same downlink to target as before, or a <
 	 * downlink due to deletion at the left of target.  Target either has the
-	 * same highkey as before, or a highkey <= before when there is a page
+	 * same highkey as before, or a highkey < before when there is a page
 	 * split. (The rightmost concurrently-split-from-target-page page will
 	 * still have the same highkey as target was originally found to have,
 	 * which for our purposes is equivalent to target's highkey itself never
@@ -1304,7 +1394,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * memory remaining allocated.
 	 */
 	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
-	return _bt_mkscankey(state->rel, firstitup);
+	return bt_mkscankey_minusinfkey(state->rel, firstitup);
 }
 
 /*
@@ -1367,7 +1457,8 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the
+	 * page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1416,14 +1507,29 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 	{
 		/*
 		 * Skip comparison of target page key against "negative infinity"
-		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * item, if any.  Checking it would indicate that it's not a strict
+		 * lower bound, but that's only because of the hard-coding for
+		 * negative infinity items within _bt_compare().
+		 *
+		 * If nbtree didn't truncate negative infinity tuples during internal
+		 * page splits then we'd expect child's negative infinity key to be
+		 * equal to the scankey/downlink from target/parent (it would be a
+		 * "low key" in this hypothetical scenario, and so it would still need
+		 * to be treated as a special case here).
+		 *
+		 * Negative infinity items can be thought of as a strict lower bound
+		 * that works transitively, with the last non-negative-infinity pivot
+		 * followed during a descent from the root as its "true" strict lower
+		 * bound.  Only a small number of negative infinity items are truly
+		 * negative infinity; those that are the first items of leftmost
+		 * internal pages.  In more general terms, a negative infinity item is
+		 * only negative infinity with respect to the subtree that the page is
+		 * at the root of.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
+		if (!invariant_l_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1855,6 +1961,66 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return invariant_leq_offset(state, key, upperbound);
+
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+
+	/*
+	 * _bt_compare() is capable of determining that a scankey with a
+	 * filled-out attribute is greater than pivot tuples where the comparison
+	 * is resolved at a truncated attribute (value of attribute in pivot is
+	 * minus infinity).  It is even capable of determining that a "minus
+	 * infinity value" from a "minusinfkey" scankey is equal to a pivot's
+	 * truncated attribute.  However, it is not capable of determining that a
+	 * scankey ("minusinfkey" or otherwise) is _less than_ a tuple on the
+	 * basis of a comparison resolved at _scankey_ minus infinity attribute.
+	 *
+	 * Somebody could teach _bt_compare() to handle this on its own, but that
+	 * would add overhead to index scans.  Complete an extra step to make it
+	 * work here instead.
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		nonpivot = P_ISLEAF(topaque) && upperbound >= P_FIRSTDATAKEY(topaque);
+
+		/* Get number of keys + heap TID for item to the right */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup, nonpivot);
+
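+		/*
+		 * The scankey and the item to its right compared equal on every
+		 * attribute that was actually compared.  If they have the same
+		 * number of key attributes, the scankey can only be strictly less
+		 * when its heap TID is truncated (minus infinity) while the item's
+		 * is not.  Otherwise, the scankey must have strictly fewer
+		 * untruncated key attributes than the item to its right.
+		 */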
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && rheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1874,42 +2040,84 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound)
 {
 	int32		cmp;
 
 	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
-	return cmp >= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp >= 0;
+
+	/*
+	 * No need to consider the possibility that scankey has attributes that we
+	 * need to force to be interpreted as negative infinity.  _bt_compare() is
+	 * able to determine that scankey is greater than negative infinity.  The
+	 * distinction between "==" and "<" isn't interesting here, since
+	 * corruption is indicated either way.
+	 */
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e. the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
-							   Page nontarget, OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							 Page nontarget, OffsetNumber upperbound)
 {
 	int32		cmp;
 
 	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
-	return cmp <= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp <= 0;
+
+	/* See invariant_l_offset() for an explanation of this extra step */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+		BTPageOpaque copaque;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		nonpivot = P_ISLEAF(copaque) && upperbound >= P_FIRSTDATAKEY(copaque);
+
+		/* Get number of keys + heap TID for child/non-target item */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+		childheaptid = BTreeTupleGetHeapTIDCareful(state, child, nonpivot);
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && childheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -2065,3 +2273,61 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * _bt_mkscankey() wrapper that automatically sets insertion scankey to have
+ * minus infinity values for truncated attributes from itup (when itup is a
+ * pivot tuple with one or more truncated attributes).
+ *
+ * In a non-corrupt heapkeyspace index, all pivot tuples on a level have
+ * unique keys, so the !minusinfkey optimization correctly guides scans that
+ * aren't interested in relocating a leaf page using leaf page's high key
+ * (i.e. optimization can safely be used by the vast majority of all
+ * _bt_search() calls).  nbtree verification should always use "minusinfkey"
+ * semantics, though, because the !minusinfkey optimization might mask a
+ * problem in a corrupt index.
+ *
+ * For example, invariant_g_offset() might miss a cross-page invariant failure
+ * on an internal level if the scankey built from the first item on the
+ * target's right sibling page happened to be equal to (not greater than) the
+ * last item on target page.  The !minusinfkey tie-breaker might otherwise
+ * cause amcheck to conclude that the scankey is greater, missing index
+ * corruption.  The same problem would probably be caught some other way, but
+ * the !minusinfkey optimization has no upside for amcheck, so it seems
+ * sensible to always avoid it.
+ */
+static inline BTScanInsert
+bt_mkscankey_minusinfkey(Relation rel, IndexTuple itup)
+{
+	BTScanInsert skey;
+
+	skey = _bt_mkscankey(rel, itup);
+	skey->minusinfkey = true;
+
+	return skey;
+}
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
+ * be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	BlockNumber targetblock = state->targetblock;
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index bfa0c04c2f..8d27c9b0f6 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -561,7 +561,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	 * Get values of extended metadata if available, use default values
 	 * otherwise.
 	 */
-	if (metad->btm_version == BTREE_VERSION)
+	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b..07c2dcd771 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d4..9920dbfd40 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f427b312..21c978503a 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -504,8 +504,9 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
 
   <para>
    By default, B-tree indexes store their entries in ascending order
-   with nulls last.  This means that a forward scan of an index on
-   column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
+   with nulls last (table TID is treated as a tiebreaker column among
+   otherwise equal entries).  This means that a forward scan of an
+   index on column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
    (or more verbosely, <literal>ORDER BY x ASC NULLS LAST</literal>).  The
    index can also be scanned backward, producing output satisfying
    <literal>ORDER BY x DESC</literal>
@@ -1162,10 +1163,21 @@ CREATE INDEX tab_x_y ON tab(x, y);
    the extra columns are trailing columns; making them be leading columns is
    unwise for the reasons explained in <xref linkend="indexes-multicolumn"/>.
    However, this method doesn't support the case where you want the index to
-   enforce uniqueness on the key column(s).  Also, explicitly marking
-   non-searchable columns as <literal>INCLUDE</literal> columns makes the
-   index slightly smaller, because such columns need not be stored in upper
-   B-tree levels.
+   enforce uniqueness on the key column(s).
+  </para>
+
+  <para>
+   <firstterm>Suffix truncation</firstterm> always removes non-key
+   columns from upper B-Tree levels.  As payload columns, they are
+   never used to guide index scans.  The truncation process also
+   removes one or more trailing key column(s) when the remaining
+   prefix of key column(s) happens to be sufficient to describe tuples
+   on the lowest B-Tree level.  In practice, covering indexes without
+   an <literal>INCLUDE</literal> clause often avoid storing columns
+   that are effectively payload in the upper levels.  However,
+   explicitly defining payload columns as non-key columns
+   <emphasis>reliably</emphasis> keeps the tuples in upper levels
+   small.
   </para>
 
   <para>
diff --git a/src/backend/access/common/indextuple.c b/src/backend/access/common/indextuple.c
index 32c0ebb93a..cb23be859d 100644
--- a/src/backend/access/common/indextuple.c
+++ b/src/backend/access/common/indextuple.c
@@ -536,7 +536,11 @@ index_truncate_tuple(TupleDesc sourceDescriptor, IndexTuple source,
 	bool		isnull[INDEX_MAX_KEYS];
 	IndexTuple	truncated;
 
-	Assert(leavenatts < sourceDescriptor->natts);
+	Assert(leavenatts <= sourceDescriptor->natts);
+
+	/* Easy case: no truncation actually required */
+	if (leavenatts == sourceDescriptor->natts)
+		return CopyIndexTuple(source);
 
 	/* Create temporary descriptor to scribble on */
 	truncdesc = palloc(TupleDescSize(sourceDescriptor));
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index a295a7a286..40ff25fe06 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -28,37 +28,50 @@ right-link to find the new page containing the key range you're looking
 for.  This might need to be repeated, if the page has been split more than
 once.
 
+Lehman and Yao talk about pairs of "separator" keys and downlinks in
+internal pages rather than tuples or records.  We use the term "pivot"
+tuple to refer to tuples which don't point to heap tuples, that are used
+only for tree navigation.  All tuples on non-leaf pages and high keys on
+leaf pages are pivot tuples.  Since pivot tuples are only used to represent
+which part of the key space belongs on each page, they can have attribute
+values copied from non-pivot tuples that were deleted and killed by VACUUM
+some time ago.  A pivot tuple may contain a "separator" key and downlink,
+just a separator key (i.e. the downlink value is implicitly undefined), or
+just a downlink (i.e. all attributes are truncated away).  We aren't always
+clear on which case applies, but it should be obvious from context.
+
+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute.  Logical duplicates are sorted in heap TID
+order.  This is necessary because Lehman and Yao also require that the key
+range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are
+the adjacent keys in the parent page (Ki must be _strictly_ less than v,
+which can be assured by having reliably unique keys).
+
+A search where the key is equal to a pivot tuple in an upper tree level
+must descend to the left of that pivot to ensure it finds any equal keys.
+The equal item(s) being searched for must therefore be to the left of that
+downlink page on the next level down.  A handy property of this design is
+that a scan where all attributes/keys are used behaves just the same as a
+scan where only some prefix of attributes are used; equality never needs to
+be treated as a special case.
+
+In practice, exact equality with pivot tuples on internal pages is
+extremely rare when all attributes (including even the heap TID attribute)
+are used in a search.  This is due to suffix truncation: truncated
+attributes are treated as having the value negative infinity, and
+truncation almost always manages to at least truncate away the trailing
+heap TID attribute.  While Lehman and Yao don't have anything to say about
+suffix truncation, the design used by nbtree is perfectly complementary.
+The later section on suffix truncation will be helpful if it's unclear how
+the Lehman & Yao invariants work with a real world example involving
+suffix truncation.
+
 Differences to the Lehman & Yao algorithm
 -----------------------------------------
 
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
-
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
-
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
 among backends.  As a result, we do page-level read locking on btree
@@ -194,9 +207,7 @@ be prepared for the possibility that the item it wants is to the left of
 the recorded position (but it can't have moved left out of the recorded
 page).  Since we hold a lock on the lower page (per L&Y) until we have
 re-found the parent item that links to it, we can be assured that the
-parent item does still exist and can't have been deleted.  Also, because
-we are matching downlink page numbers and not data keys, we don't have any
-problem with possibly misidentifying the parent item.
+parent item does still exist and can't have been deleted.
 
 Page Deletion
 -------------
@@ -615,22 +626,40 @@ scankey is consulted as each index entry is sequentially scanned to decide
 whether to return the entry and whether the scan can stop (see
 _bt_checkkeys()).
 
-We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+Notes about suffix truncation
+-----------------------------
+
+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split.  The remaining attributes must distinguish
+the last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page.  Tuples logically retain truncated key attributes,
+though they implicitly have "negative infinity" as their value, and have no
+storage overhead.  Since the high key is subsequently reused as the
+downlink in the parent page for the new right page, suffix truncation makes
+pivot tuples short.  INCLUDE indexes are guaranteed to have non-key
+attributes truncated at the time of a leaf page split, but may also have
+some key attributes truncated away, based on the usual criteria for key
+attributes.  They are not a special case, since non-key attributes are
+merely payload to B-Tree searches.
+
+The goal of suffix truncation of key attributes is to improve index
+fan-out.  The technique was first described by Bayer and Unterauer (R.Bayer
+and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
+2, No. 1, March 1977, pp 11-26).  The Postgres implementation is loosely
+based on their paper.  Note that Postgres only implements what the paper
+refers to as simple prefix B-Trees.  Note also that the paper assumes that
+the tree has keys that consist of single strings that maintain the "prefix
+property", much like strings that are stored in a suffix tree (comparisons
+of earlier bytes must always be more significant than comparisons of later
+bytes, and, in general, the strings must compare in a way that doesn't
+break transitive consistency as they're split into pieces).  Suffix
+truncation in Postgres currently only works at the whole-attribute
+granularity, but it would be straightforward to invent opclass
+infrastructure that manufactures a smaller attribute value in the case of
+variable-length types, such as text.  An opclass support function could
+manufacture the shortest possible key value that still correctly separates
+each half of a leaf page split.
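+
+As a sketch of how this interacts with the Lehman & Yao invariants,
+consider a hypothetical two-column index on (country, city).  Suppose a
+leaf page split leaves ('France', 'Paris') as the last tuple on the left
+page and ('Germany', 'Berlin') as the first tuple on the right page.  A
+new high key of just ('Germany'), with the city and heap TID attributes
+truncated away (implicitly negative infinity), is enough: every tuple on
+the left page is <= ('Germany', -inf, -inf), and that same pivot, reused
+as the downlink to the right page, is a strict lower bound for every
+tuple on the right page.  Had the last tuple on the left page been
+('Germany', 'Aachen') instead, the city attribute could not have been
+truncated away.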
 
 Notes About Data Representation
 -------------------------------
@@ -643,20 +672,26 @@ don't need to renumber any existing pages when splitting the root.)
 
 The Postgres disk block data format (an array of items) doesn't fit
 Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
-so we have to play some games.
+so we have to play some games.  (Presumably things are explained this
+way because of internal page splits, which conceptually split at the
+middle of an existing pivot tuple -- the tuple's "separator" key goes on
+the left side of the split as the left side's new high key, while the
+tuple's pointer/downlink goes on the right side as the first/minus
+infinity downlink.)
 
 On a page that is not rightmost in its tree level, the "high key" is
 kept in the page's first item, and real data items start at item 2.
 The link portion of the "high key" item goes unused.  A page that is
-rightmost has no "high key", so data items start with the first item.
-Putting the high key at the left, rather than the right, may seem odd,
-but it avoids moving the high key as we add data items.
+rightmost has no "high key" (it's implicitly positive infinity), so
+data items start with the first item.  Putting the high key at the
+left, rather than the right, may seem odd, but it avoids moving the
+high key as we add data items.
 
 On a leaf page, the data items are simply links to (TIDs of) tuples
 in the relation being indexed, with the associated key values.
 
 On a non-leaf page, the data items are down-links to child pages with
-bounding keys.  The key in each data item is the *lower* bound for
+bounding keys.  The key in each data item is a strict lower bound for
 keys on that child page, so logically the key is to the left of that
 downlink.  The high key (if present) is the upper bound for the last
 downlink.  The first data item on each such page has no lower bound
@@ -664,4 +699,5 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index ccd4549806..0b01fa3d83 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -64,14 +64,16 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
 				  Relation heapRel);
 static bool _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
 					 bool *restorebinsrch, Size itemsz);
-static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
+static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
+			   Buffer buf,
+			   Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
 			   bool split_only_page);
-static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
-		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
-		  IndexTuple newitem, bool newitemonleft);
+static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
+		  Buffer cbuf, OffsetNumber firstright, OffsetNumber newitemoff,
+		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
@@ -118,6 +120,9 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_key = _bt_mkscankey(rel, itup);
+	/* No scantid until uniqueness established in checkingunique case */
+	if (checkingunique && itup_key->heapkeyspace)
+		itup_key->scantid = NULL;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -223,12 +228,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tie-breaker attribute.  Any other would-be inserter
+	 * of the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -265,6 +271,10 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/* Uniqueness is established -- restore heap tid as scantid */
+		if (itup_key->heapkeyspace)
+			itup_key->scantid = &itup->t_tid;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
@@ -273,12 +283,12 @@ top:
 
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
-		 * an index tuple insert conflicts with an existing lock.  Since the
-		 * actual location of the insert is hard to predict because of the
-		 * random search used to prevent O(N^2) performance when there are
-		 * many duplicate entries, we can just use the "first valid" page.
-		 * This reasoning also applies to INCLUDE indexes, whose extra
-		 * attributes are not considered part of the key space.
+		 * an index tuple insert conflicts with an existing lock.  The actual
+		 * location of the insert is unsettled in the checkingunique case
+		 * because scantid was not filled in initially, but it's okay to use
+		 * the "first valid" page instead.  This reasoning also applies to
+		 * INCLUDE indexes, whose extra attributes are not considered part of
+		 * the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
 
@@ -290,8 +300,8 @@ top:
 		 */
 		newitemoff = _bt_findinsertloc(rel, itup_key, &buf, checkingunique,
 									   itup, stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, newitemoff,
-					   false);
+		_bt_insertonpg(rel, itup_key, buf, InvalidBuffer, stack, itup,
+					   newitemoff, false);
 	}
 	else
 	{
@@ -361,6 +371,7 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
 	Assert(itup_key->low == offset);
+	Assert(itup_key->scantid == NULL);
 	for (;;)
 	{
 		ItemId		curitemid;
@@ -630,16 +641,16 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 /*
  *	_bt_findinsertloc() -- Finds an insert location for a tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
+ *		On entry, *bufptr contains the page that the new tuple belongs on.
+ *		Occasionally, this won't be exactly right for callers that just
+ *		called _bt_check_unique(), and did initial search without using a
+ *		scantid.  They'll have to insert into a page somewhere to the right
+ *		in rare cases where there are many physical duplicates in a unique
+ *		index, and their scantid directs us to some page full of duplicates
+ *		to the right, where the new tuple must go.  (Actually, since
+ *		!heapkeyspace pg_upgrade'd non-unique indexes never get a scantid,
+ *		they too may require that we move right.  We treat them somewhat like
+ *		unique indexes.)
  *
  *		_bt_check_unique() saves the progress of the binary search it
  *		performs, in the insertion scan key.  In the common case that there
@@ -682,28 +693,26 @@ _bt_findinsertloc(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itemsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
-	 */
-	if (itemsz > BTMaxItemSize(page))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itemsz, BTMaxItemSize(page),
-						RelationGetRelationName(rel)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(heapRel,
-									RelationGetRelationName(rel))));
+	/* Check 1/3 of a page restriction */
+	if (unlikely(itemsz > BTMaxItemSize(page)))
+		_bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+							 newtup);
 
+	/*
+	 * We may have to walk right through leaf pages to find the one leaf page
+	 * that we must insert onto, though only when inserting into unique
+	 * indexes.  This is necessary because a scantid is not used by the
+	 * insertion scan key initially in the case of unique indexes -- a scantid
+	 * is only set after the absence of duplicates (whose heap tuples are not
+	 * dead or recently dead) has been established by _bt_check_unique().
+	 * Non-unique index insertions will break out of the loop immediately.
+	 *
+	 * (Actually, non-unique indexes may still need to grovel through leaf
+	 * pages full of duplicates with a pg_upgrade'd !heapkeyspace index.)
+	 */
 	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+	Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
 	for (;;)
 	{
 		int			cmpval;
@@ -711,6 +720,13 @@ _bt_findinsertloc(Relation rel,
 		BlockNumber rblkno;
 
 		/*
+		 * Fastpaths that avoid extra high key check.
+		 *
+		 * No need to check high key when inserting into a non-unique index;
+		 * _bt_search() already checked this when it checked if a move to the
+		 * right was required for leaf page.  Insertion scankey's scantid
+		 * would have been filled out at the time.
+		 *
 		 * An earlier _bt_check_unique() call may well have established bounds
 		 * that we can use to skip the high key check for checkingunique
 		 * callers.  This fastpath cannot be used when there are no items on
@@ -718,8 +734,10 @@ _bt_findinsertloc(Relation rel,
 		 * new item belongs last on the page, but it might go on a later page
 		 * instead.
 		 */
-		if (restorebinsrch && itup_key->low <= itup_key->stricthigh &&
-			itup_key->stricthigh <= PageGetMaxOffsetNumber(page))
+		if (!checkingunique && itup_key->heapkeyspace)
+			break;
+		else if (restorebinsrch && itup_key->low <= itup_key->stricthigh &&
+				 itup_key->stricthigh <= PageGetMaxOffsetNumber(page))
 			break;
 
 		/*
@@ -729,15 +747,24 @@ _bt_findinsertloc(Relation rel,
 		if (P_RIGHTMOST(lpageop))
 			break;
 		cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
-
-		/*
-		 * May have to handle case where there is a choice of which page to
-		 * place new tuple on, and we must balance space utilization as best
-		 * we can.
-		 */
-		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
-												&restorebinsrch, itemsz))
-			break;
+		if (itup_key->heapkeyspace)
+		{
+			if (cmpval <= 0)
+				break;
+		}
+		else
+		{
+			/*
+			 * pg_upgrade'd !heapkeyspace index.
+			 *
+			 * May have to handle legacy case where there is a choice of which
+			 * page to place new tuple on, and we must balance space
+			 * utilization as best we can.
+			 */
+			if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
+													&restorebinsrch, itemsz))
+				break;
+		}
 
 		/*
 		 * step right to next non-dead page
@@ -746,6 +773,8 @@ _bt_findinsertloc(Relation rel,
 		 * page; else someone else's _bt_check_unique scan could fail to see
 		 * our insertion.  write locks on intermediate dead pages won't do
 		 * because we don't know when they will get de-linked from the tree.
+		 * (This is more aggressive than it needs to be for non-unique
+		 * !heapkeyspace indexes.)
 		 */
 		rbuf = InvalidBuffer;
 
@@ -816,9 +845,16 @@ _bt_findinsertloc(Relation rel,
 /*
  *	_bt_useduplicatepage() -- Settle for this page of duplicates?
  *
- *		This function handles the question of whether or not an insertion of
- *		a duplicate into an index should insert on the page contained in buf
- *		when a choice must be made.
+ *		Prior to PostgreSQL 12/Btree version 4, heap TID was never treated
+ *		as a part of the keyspace.  If there were many tuples of the same
+ *		value spanning more than one leaf page, a new tuple of that same
+ *		value could legally be placed on any one of the pages.
+ *
+ *		This function handles the question of whether or not an insertion
+ *		of a duplicate into a pg_upgrade'd !heapkeyspace index should insert
+ *		on the page contained in buf when a choice must be made.  It is only
+ *		used with pg_upgrade'd version 2 and version 3 indexes (!heapkeyspace
+ *		indexes).
  *
  *		If the current page doesn't have enough free space for the new tuple
  *		we "microvacuum" the page, removing LP_DEAD items, in the hope that it
@@ -911,6 +947,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
  */
 static void
 _bt_insertonpg(Relation rel,
+			   BTScanInsert itup_key,
 			   Buffer buf,
 			   Buffer cbuf,
 			   BTStack stack,
@@ -933,7 +970,7 @@ _bt_insertonpg(Relation rel,
 		   BTreeTupleGetNAtts(itup, rel) ==
 		   IndexRelationGetNumberOfAttributes(rel));
 	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
+		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/* The caller should've finished any incomplete splits already. */
@@ -983,8 +1020,8 @@ _bt_insertonpg(Relation rel,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, buf, cbuf, firstright,
-						 newitemoff, itemsz, itup, newitemonleft);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, firstright, newitemoff,
+						 itemsz, itup, newitemonleft);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1066,7 +1103,7 @@ _bt_insertonpg(Relation rel,
 		if (BufferIsValid(metabuf))
 		{
 			/* upgrade meta-page if needed */
-			if (metad->btm_version < BTREE_VERSION)
+			if (metad->btm_version < BTREE_NOVAC_VERSION)
 				_bt_upgrademetapage(metapg);
 			metad->btm_fastroot = itup_blkno;
 			metad->btm_fastlevel = lpageop->btpo.level;
@@ -1121,6 +1158,8 @@ _bt_insertonpg(Relation rel,
 
 			if (BufferIsValid(metabuf))
 			{
+				Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+				xlmeta.version = metad->btm_version;
 				xlmeta.root = metad->btm_root;
 				xlmeta.level = metad->btm_level;
 				xlmeta.fastroot = metad->btm_fastroot;
@@ -1188,17 +1227,19 @@ _bt_insertonpg(Relation rel,
  *		new right page.  newitemoff etc. tell us about the new item that
  *		must be inserted along with the data from the old page.
  *
- *		When splitting a non-leaf page, 'cbuf' is the left-sibling of the
- *		page we're inserting the downlink for.  This function will clear the
- *		INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *		itup_key is used for suffix truncation on leaf pages (internal
+ *		page callers pass NULL).  When splitting a non-leaf page, 'cbuf'
+ *		is the left-sibling of the page we're inserting the downlink for.
+ *		This function will clear the INCOMPLETE_SPLIT flag on it, and
+ *		release the buffer.
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
-_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
-		  bool newitemonleft)
+_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
+		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
+		  IndexTuple newitem, bool newitemonleft)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1292,7 +1333,8 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1306,8 +1348,29 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 
 	/*
 	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
-	 * item at position firstright, or the incoming tuple.
+	 * to go into the new right page, or possibly a truncated version if this
+	 * is a leaf page split.  This might be either the existing data item at
+	 * position firstright, or the incoming tuple.
+	 *
+	 * The high key for the left page is formed using the first item on the
+	 * right page, which may seem to be contrary to Lehman & Yao's approach of
+	 * using the left page's last item as its new high key when splitting on
+	 * the leaf level.  It isn't, though: suffix truncation will leave the
+	 * left page's high key fully equal to the last item on the left page when
+	 * two tuples with equal key values (excluding heap TID) enclose the split
+	 * point.  It isn't actually necessary for a new leaf high key to be equal
+	 * to the last item on the left for the L&Y "subtree" invariant to hold.
+	 * It's sufficient to make sure that the new leaf high key is strictly
+	 * less than the first item on the right leaf page, and greater than or
+	 * equal to (not necessarily equal to) the last item on the left leaf
+	 * page.
+	 *
+	 * In other words, when suffix truncation isn't possible, L&Y's exact
+	 * approach to leaf splits is taken.  (Actually, even that is slightly
+	 * inaccurate.  A tuple with all the keys from firstright but the heap TID
+	 * from lastleft will be used as the new high key, since the last left
+	 * tuple could be physically larger despite being opclass-equal in respect
+	 * of all attributes prior to the heap TID attribute.)
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1325,25 +1388,50 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
+	 * Truncate unneeded key attributes of the high key item before
+	 * inserting it on the left page.  This can only happen at the leaf
 	 * level, since in general all pivot tuple values originate from leaf
 	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * though; truncating unneeded key suffix attributes can only be
+	 * performed at the leaf level anyway.  This is because a pivot tuple in
+	 * a grandparent page must guide a search not only to the correct parent
+	 * page, but also to the correct leaf page.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (isleaf && (itup_key->heapkeyspace || indnatts != indnkeyatts))
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		IndexTuple	lastleft;
+
+		/*
+		 * Determine which tuple will become the last on the left page.  This
+		 * is needed to decide how many attributes from the first item on the
+		 * right page must remain in the new high key for the left page.
+		 */
+		if (newitemonleft && newitemoff == firstright)
+		{
+			/* incoming tuple will become last on left page */
+			lastleft = newitem;
+		}
+		else
+		{
+			OffsetNumber lastleftoff;
+
+			/* item just before firstright will become last on left page */
+			lastleftoff = OffsetNumberPrev(firstright);
+			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		Assert(lastleft != item);
+		lefthikey = _bt_truncate(rel, lastleft, item, itup_key);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1536,7 +1624,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1565,22 +1652,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log the left page's new high key */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1596,9 +1671,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -1961,7 +2034,7 @@ _bt_insert_parent(Relation rel,
 			_bt_relbuf(rel, pbuf);
 		}
 
-		/* get high key from left page == lower bound for new right page */
+		/* get high key from left, a strict lower bound for new right page */
 		ritem = (IndexTuple) PageGetItem(page,
 										 PageGetItemId(page, P_HIKEY));
 
@@ -1991,7 +2064,7 @@ _bt_insert_parent(Relation rel,
 				 RelationGetRelationName(rel), bknum, rbknum);
 
 		/* Recursively update the parent */
-		_bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
+		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
 					   new_item, stack->bts_offset + 1,
 					   is_only);
 
@@ -2252,7 +2325,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	START_CRIT_SECTION();
 
 	/* upgrade metapage if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* set btree special data */
@@ -2287,7 +2360,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2320,6 +2394,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
 		XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = rootblknum;
 		md.level = metad->btm_level;
 		md.fastroot = rootblknum;
@@ -2384,6 +2460,7 @@ _bt_pgaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -2399,8 +2476,8 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
@@ -2413,6 +2490,7 @@ _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
 	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
 	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(itup_key->scantid == NULL);
 
 	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
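
The nbtinsert.c changes above all follow from one idea: every leaf entry is ordered as though its heap TID were appended as one final key attribute. As a rough standalone illustration (the ToyTid/ToyEntry types below are invented for this sketch and are not the patch's BTScanInsert machinery), the resulting ordering is just an ordinary lexicographic comparison with the TID as the last column:

#include <stdio.h>
#include <stdint.h>

/* Toy stand-ins for a heap TID (block, offset) and a single-column entry */
typedef struct ToyTid { uint32_t block; uint16_t offset; } ToyTid;
typedef struct ToyEntry { int key; ToyTid tid; } ToyEntry;

static int toy_tid_cmp(ToyTid a, ToyTid b)
{
	if (a.block != b.block)
		return (a.block < b.block) ? -1 : 1;
	if (a.offset != b.offset)
		return (a.offset < b.offset) ? -1 : 1;
	return 0;
}

/* Heap TID breaks the tie only when all user key attributes are equal */
static int toy_entry_cmp(ToyEntry a, ToyEntry b)
{
	if (a.key != b.key)
		return (a.key < b.key) ? -1 : 1;
	return toy_tid_cmp(a.tid, b.tid);
}

int main(void)
{
	ToyEntry x = {42, {10, 3}};
	ToyEntry y = {42, {10, 7}};	/* same key value, later heap TID */

	/* Prints -1: equal keys fall back to heap TID order, so no two entries tie */
	printf("%d\n", toy_entry_cmp(x, y));
	return 0;
}

Because the tiebreaker only runs once all user key attributes compare equal, searches that never see duplicates don't pay anything extra for it.
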
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 56041c3d38..72af1ef3c1 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -34,6 +34,7 @@
 #include "utils/snapmgr.h"
 
 static void _bt_cachemetadata(Relation rel, BTMetaPageData *metad);
+static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 						 bool *rightsib_empty);
@@ -77,7 +78,9 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 }
 
 /*
- *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to the new.
+ *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to version
+ *		3, the last version that can be updated without broadly affecting
+ *		on-disk compatibility.  (A REINDEX is required to upgrade to v4.)
  *
  *		This routine does purely in-memory image upgrade.  Caller is
  *		responsible for locking, WAL-logging etc.
@@ -93,11 +96,11 @@ _bt_upgrademetapage(Page page)
 
 	/* It must be really a meta page of upgradable version */
 	Assert(metaopaque->btpo_flags & BTP_META);
-	Assert(metad->btm_version < BTREE_VERSION);
+	Assert(metad->btm_version < BTREE_NOVAC_VERSION);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 
 	/* Set version number and fill extra fields added into version 3 */
-	metad->btm_version = BTREE_VERSION;
+	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 
@@ -107,43 +110,79 @@ _bt_upgrademetapage(Page page)
 }
 
 /*
- * Cache metadata from meta page to rel->rd_amcache.
+ * Cache metadata from input meta page to rel->rd_amcache.
  */
 static void
-_bt_cachemetadata(Relation rel, BTMetaPageData *metad)
+_bt_cachemetadata(Relation rel, BTMetaPageData *input)
 {
+	BTMetaPageData *cached_metad;
+
 	/* We assume rel->rd_amcache was already freed by caller */
 	Assert(rel->rd_amcache == NULL);
 	rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
 										 sizeof(BTMetaPageData));
 
-	/*
-	 * Meta page should be of supported version (should be already checked by
-	 * caller).
-	 */
-	Assert(metad->btm_version >= BTREE_MIN_VERSION &&
-		   metad->btm_version <= BTREE_VERSION);
+	/* Meta page should be of supported version */
+	Assert(input->btm_version >= BTREE_MIN_VERSION &&
+		   input->btm_version <= BTREE_VERSION);
 
-	if (metad->btm_version == BTREE_VERSION)
+	cached_metad = (BTMetaPageData *) rel->rd_amcache;
+	if (input->btm_version >= BTREE_NOVAC_VERSION)
 	{
-		/* Last version of meta-data, no need to upgrade */
-		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		/* Version with compatible meta-data, no need to upgrade */
+		memcpy(cached_metad, input, sizeof(BTMetaPageData));
 	}
 	else
 	{
-		BTMetaPageData *cached_metad = (BTMetaPageData *) rel->rd_amcache;
-
 		/*
 		 * Upgrade meta-data: copy available information from meta-page and
 		 * fill new fields with default values.
+		 *
+		 * Note that we cannot upgrade to version 4+ without a REINDEX, since
+		 * extensive on-disk changes are required.
 		 */
-		memcpy(rel->rd_amcache, metad, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
-		cached_metad->btm_version = BTREE_VERSION;
+		memcpy(cached_metad, input, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
+		cached_metad->btm_version = BTREE_NOVAC_VERSION;
 		cached_metad->btm_oldest_btpo_xact = InvalidTransactionId;
 		cached_metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	}
 }
 
+/*
+ * Get metadata from share-locked buffer containing metapage, while performing
+ * standard sanity checks.  Sanity checks here must match _bt_getroot().
+ */
+static BTMetaPageData *
+_bt_getmeta(Relation rel, Buffer metabuf)
+{
+	Page		metapg;
+	BTPageOpaque metaopaque;
+	BTMetaPageData *metad;
+
+	metapg = BufferGetPage(metabuf);
+	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
+	metad = BTPageGetMeta(metapg);
+
+	/* sanity-check the metapage */
+	if (!P_ISMETA(metaopaque) ||
+		metad->btm_magic != BTREE_MAGIC)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("index \"%s\" is not a btree",
+						RelationGetRelationName(rel))));
+
+	if (metad->btm_version < BTREE_MIN_VERSION ||
+		metad->btm_version > BTREE_VERSION)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("version mismatch in index \"%s\": file version %d, "
+						"current version %d, minimal supported version %d",
+						RelationGetRelationName(rel),
+						metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+
+	return metad;
+}
+
 /*
  *	_bt_update_meta_cleanup_info() -- Update cleanup-related information in
  *									  the metapage.
@@ -167,7 +206,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	metad = BTPageGetMeta(metapg);
 
 	/* outdated version of metapage always needs rewrite */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		needsRewrite = true;
 	else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
 			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
@@ -186,7 +225,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* update cleanup-related information */
@@ -202,6 +241,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = metad->btm_root;
 		md.level = metad->btm_level;
 		md.fastroot = metad->btm_fastroot;
@@ -376,7 +417,7 @@ _bt_getroot(Relation rel, int access)
 		START_CRIT_SECTION();
 
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 
 		metad->btm_root = rootblkno;
@@ -400,6 +441,8 @@ _bt_getroot(Relation rel, int access)
 			XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
 			XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			md.version = metad->btm_version;
 			md.root = rootblkno;
 			md.level = 0;
 			md.fastroot = rootblkno;
@@ -595,37 +638,12 @@ _bt_getrootheight(Relation rel)
 {
 	BTMetaPageData *metad;
 
-	/*
-	 * We can get what we need from the cached metapage data.  If it's not
-	 * cached yet, load it.  Sanity checks here must match _bt_getroot().
-	 */
 	if (rel->rd_amcache == NULL)
 	{
 		Buffer		metabuf;
-		Page		metapg;
-		BTPageOpaque metaopaque;
 
 		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
-		metapg = BufferGetPage(metabuf);
-		metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-		metad = BTPageGetMeta(metapg);
-
-		/* sanity-check the metapage */
-		if (!P_ISMETA(metaopaque) ||
-			metad->btm_magic != BTREE_MAGIC)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("index \"%s\" is not a btree",
-							RelationGetRelationName(rel))));
-
-		if (metad->btm_version < BTREE_MIN_VERSION ||
-			metad->btm_version > BTREE_VERSION)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("version mismatch in index \"%s\": file version %d, "
-							"current version %d, minimal supported version %d",
-							RelationGetRelationName(rel),
-							metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+		metad = _bt_getmeta(rel, metabuf);
 
 		/*
 		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
@@ -642,19 +660,70 @@ _bt_getrootheight(Relation rel)
 		 * Cache the metapage data for next time
 		 */
 		_bt_cachemetadata(rel, metad);
-
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
 		_bt_relbuf(rel, metabuf);
 	}
 
+	/* Get cached page */
 	metad = (BTMetaPageData *) rel->rd_amcache;
-	/* We shouldn't have cached it if any of these fail */
-	Assert(metad->btm_magic == BTREE_MAGIC);
-	Assert(metad->btm_version == BTREE_VERSION);
-	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
+/*
+ *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *
+ *		This is used to determine the rules that must be used to descend a
+ *		btree.  Version 4 indexes treat heap TID as a tie-breaker attribute.
+ *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
+ *		performance when inserting a new BTScanInsert-wise duplicate tuple
+ *		among many leaf pages already full of such duplicates.
+ */
+bool
+_bt_heapkeyspace(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			uint32		btm_version = metad->btm_version;
+
+			_bt_relbuf(rel, metabuf);
+			return btm_version > BTREE_NOVAC_VERSION;
+		}
+
+		/*
+		 * Cache the metapage data for next time
+		 */
+		_bt_cachemetadata(rel, metad);
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached page */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+
+	return metad->btm_version > BTREE_NOVAC_VERSION;
+}
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -1123,11 +1192,12 @@ _bt_is_page_halfdead(Relation rel, BlockNumber blk)
  * right sibling.
  *
  * "child" is the leaf page we wish to delete, and "stack" is a search stack
- * leading to it (approximately).  Note that we will update the stack
- * entry(s) to reflect current downlink positions --- this is essentially the
- * same as the corresponding step of splitting, and is not expected to affect
- * caller.  The caller should initialize *target and *rightsib to the leaf
- * page and its right sibling.
+ * leading to it (it actually leads to the leftmost leaf page with a high key
+ * matching that of the page to be deleted in !heapkeyspace indexes).  Note
+ * that we will update the stack entry(s) to reflect current downlink
+ * positions --- this is essentially the same as the corresponding step of
+ * splitting, and is not expected to affect caller.  The caller should
+ * initialize *target and *rightsib to the leaf page and its right sibling.
  *
  * Note: it's OK to release page locks on any internal pages between the leaf
  * and *topparent, because a safe deletion can't become unsafe due to
@@ -1149,8 +1219,10 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
 	BlockNumber leftsib;
 
 	/*
-	 * Locate the downlink of "child" in the parent (updating the stack entry
-	 * if needed)
+	 * Locate the downlink of "child" in the parent, updating the stack entry
+	 * if needed.  This is how !heapkeyspace indexes deal with having
+	 * non-unique high keys in leaf level pages.  Even heapkeyspace indexes
+	 * can have a stale stack due to insertions into the parent.
 	 */
 	stack->bts_btentry = child;
 	pbuf = _bt_getstackbuf(rel, stack);
@@ -1422,6 +1494,8 @@ _bt_pagedel(Relation rel, Buffer buf)
 
 				/* we need an insertion scan key for the search, so build one */
 				itup_key = _bt_mkscankey(rel, targetkey);
+				/* absent attributes need explicit minus infinity values */
+				itup_key->minusinfkey = true;
 				/* get stack to leaf page by searching index */
 				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
 				/* don't need a lock or second pin on the page */
@@ -1969,7 +2043,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	if (BufferIsValid(metabuf))
 	{
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 		metad->btm_fastroot = rightsib;
 		metad->btm_fastlevel = targetlevel;
@@ -2017,6 +2091,8 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 		{
 			XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			xlmeta.version = metad->btm_version;
 			xlmeta.root = metad->btm_root;
 			xlmeta.level = metad->btm_level;
 			xlmeta.fastroot = metad->btm_fastroot;
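
Since many hunks above switch version tests from BTREE_VERSION over to BTREE_NOVAC_VERSION, it may help to spell out the gating as a standalone sketch. The numeric values below are my assumption (2 = oldest readable format, 3 = the format that adds the vacuum-cleanup metapage fields, 4 = heapkeyspace); the real constants live in nbtree.h:

#include <stdbool.h>
#include <stdio.h>

/* Assumed version numbering, for illustration only */
#define TOY_BTREE_MIN_VERSION	2	/* oldest on-disk format we can read */
#define TOY_BTREE_NOVAC_VERSION	3	/* cleanup fields; reachable by in-place upgrade */
#define TOY_BTREE_VERSION	4	/* heapkeyspace; only reachable via REINDEX */

/* In-place metapage upgrades only ever bring an index up to version 3 */
static bool toy_needs_metapage_upgrade(unsigned version)
{
	return version < TOY_BTREE_NOVAC_VERSION;
}

/* Heap TID participates in the key space for version 4+ indexes only */
static bool toy_heapkeyspace(unsigned version)
{
	return version > TOY_BTREE_NOVAC_VERSION;
}

int main(void)
{
	for (unsigned v = TOY_BTREE_MIN_VERSION; v <= TOY_BTREE_VERSION; v++)
		printf("v%u: metapage upgrade=%d heapkeyspace=%d\n",
			   v, toy_needs_metapage_upgrade(v), toy_heapkeyspace(v));
	return 0;
}

That's all _bt_heapkeyspace() is really doing: an in-place upgrade can only take a pg_upgrade'd index to version 3, so testing strictly greater than BTREE_NOVAC_VERSION decides whether heap TID is part of the key space.
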
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..ec2edae850 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -794,7 +794,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 	{
 		/*
 		 * Do cleanup if metapage needs upgrade, because we don't have
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 30be88eb82..0a6cf96bc9 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -152,8 +152,12 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link during the second half of a page split -- if caller ends
+		 * up splitting the child it usually ends up inserting a new pivot
+		 * tuple for child's new right sibling immediately after the original
+		 * bts_offset offset recorded here.  The downlink block will be needed
+		 * to check if bts_offset remains the position of this same pivot
+		 * tuple.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -251,11 +255,13 @@ _bt_moveright(Relation rel,
 	/*
 	 * When nextkey = false (normal case): if the scan key that brought us to
 	 * this page is > the high key stored on the page, then the page has split
-	 * and we need to move right.  (If the scan key is equal to the high key,
-	 * we might or might not need to move right; have to scan the page first
-	 * anyway.)
+	 * and we need to move right.  (pg_upgrade'd !heapkeyspace indexes could
+	 * have some duplicates to the right as well as the left, but that's
+	 * something that's only ever dealt with on the leaf level, after
+	 * _bt_search has found an initial leaf page.)
 	 *
 	 * When nextkey = true: move right if the scan key is >= page's high key.
+	 * (Note that key.scantid cannot be set in this case.)
 	 *
 	 * The page could even have split more than once, so scan as far as
 	 * needed.
@@ -359,6 +365,9 @@ _bt_binsrch(Relation rel,
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+
 	if (!key->restorebinsrch)
 	{
 		low = P_FIRSTDATAKEY(opaque);
@@ -368,6 +377,7 @@ _bt_binsrch(Relation rel,
 	else
 	{
 		/* Restore result of previous binary search against same page */
+		Assert(!key->heapkeyspace || key->scantid != NULL);
 		Assert(P_ISLEAF(opaque));
 		low = key->low;
 		high = key->stricthigh;
@@ -447,6 +457,7 @@ _bt_binsrch(Relation rel,
 	if (key->savebinsrch)
 	{
 		Assert(isleaf);
+		Assert(key->scantid == NULL);
 		key->low = low;
 		key->stricthigh = stricthigh;
 		key->savebinsrch = false;
@@ -478,10 +489,11 @@ _bt_binsrch(Relation rel,
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
- * scankey.  The actual key value stored (if any, which there probably isn't)
- * does not matter.  This convention allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first key.
- * See backend/access/nbtree/README for details.
+ * scankey.  The actual key value stored is physically truncated to 0
+ * attributes (explicit minus infinity) in version 3+ indexes, but
+ * that isn't relied upon.  This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key.  See backend/access/nbtree/README for details.
  *----------
  */
 int32
@@ -493,10 +505,15 @@ _bt_compare(Relation rel,
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	ItemPointer heapTid;
 	ScanKey		scankey;
+	int			ncmpkey;
+	int			ntupatts;
 
-	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(key->heapkeyspace || key->scantid == NULL);
+	Assert(key->heapkeyspace || key->minusinfkey);
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
@@ -506,6 +523,7 @@ _bt_compare(Relation rel,
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -519,8 +537,10 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
+	ncmpkey = Min(ntupatts, key->keysz);
+	Assert(key->heapkeyspace || ncmpkey == key->keysz);
 	scankey = key->scankeys;
-	for (int i = 1; i <= key->keysz; i++)
+	for (int i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -571,8 +591,95 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * All non-truncated attributes (other than heap TID) were found to be
+	 * equal.  Treat truncated attributes as minus infinity when scankey has a
+	 * key attribute value that would otherwise be compared directly.
+	 *
+	 * Note: it doesn't matter if ntupatts includes non-key attributes;
+	 * scankey won't, so explicitly excluding non-key attributes isn't
+	 * necessary.
+	 */
+	if (key->keysz > ntupatts)
+		return 1;
+
+	/*
+	 * Use the heap TID attribute and scantid to try to break the tie.  The
+	 * rules are the same as any other key attribute -- only the
+	 * representation differs.  (This is also a convenient point to check if
+	 * the !minusinfkey optimization can be used.)
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (key->scantid == NULL)
+	{
+		/*
+		 * Most searches (all !minusinfkey searches) can safely have their
+		 * scankey considered greater than a truncated pivot tuple if and
+		 * when the scankey has equal values for attributes up to and
+		 * including the tuple's least significant untruncated attribute.
+		 * Rather than considering the scankey perfectly equal, and having
+		 * caller move left of the separator key in the pivot when
+		 * descending, caller can instead be told to descend right, avoiding
+		 * an unnecessary leaf page visit.  _bt_first() might otherwise
+		 * needlessly apply its "end-of-page case" when searching for a key
+		 * value that happens to be "near page boundary" (most other
+		 * _bt_search() callers have similar handling).  It isn't truly
+		 * necessary for caller to descend left, provided we're sure that the
+		 * search isn't interested in finding the leaf page whose high key is
+		 * an exact match to the pivot in its parent page (the pivot that we
+		 * force a tiebreaker for here).
+		 *
+		 * Fully filled-in scankeys (scankeys that provide keys for
+		 * attributes up to and including scantid) never need to use this
+		 * optimization.  There is always some key value that must be greater
+		 * than minus infinity, so minus infinity comparisons can be resolved
+		 * using "key->keysz > ntupatts" instead (or at equivalent point for
+		 * scantid).  You can think of a !minusinfkey insertion scankey as
+		 * providing sentinel values for attributes that are not explicitly
+		 * filled-in.  Keys/attributes beyond keysz compare greater than
+		 * minus infinity, but less than any other possible value.
+		 * (_bt_binsrch() returns first match < scankey on non-leaf pages, so
+		 * it doesn't matter that we'll return 0 instead of -1).
+		 *
+		 * _bt_search() callers that consider leaf high keys to be a match
+		 * for the search (currently limited to VACUUM/page deletion) set the
+		 * minusinfkey field true, so that they'll always be returned the
+		 * leaf page with a matching high key.  You can think of a
+		 * minusinfkey insertion scankey as requiring that keys beyond keysz
+		 * be equal to minus infinity. (This works because no value can
+		 * possibly be less than minus infinity, and because page deletion
+		 * doesn't search within the leaf page it will delete.)
+		 *
+		 * Note: the heap TID part of this test ensures that scankey is being
+		 * compared to a pivot tuple with one or more truncated key
+		 * attributes (often though not necessarily just the heap TID
+		 * attribute).
+		 *
+		 * Note: pg_upgrade'd !heapkeyspace scans always set minusinfkey
+		 * true, and scantid to NULL, to ensure that searches never behave as
+		 * if heap TID is part of the keyspace.  Truncation of key attributes
+		 * is never performed in !heapkeyspace indexes, and they cannot have
+		 * minus infinity values, so this is an abuse of the notation.  It
+		 * allows us to avoid introducing heapkeyspace branches in
+		 * performance critical code.
+		 */
+		if (!key->minusinfkey && key->keysz == ntupatts && heapTid == NULL)
+			return 1;
+
+		/* All provided scankey arguments found to be equal */
+		return 0;
+	}
+
+	/*
+	 * Treat truncated heap TID as minus infinity, since scankey has a key
+	 * attribute value (scantid) that would otherwise be compared directly
+	 */
+	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+	if (heapTid == NULL)
+		return 1;
+
+	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+	return ItemPointerCompare(key->scantid, heapTid);
 }
 
 /*
@@ -1089,7 +1196,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/* Initialize remaining insertion scan key fields */
 	inskey.savebinsrch = inskey.restorebinsrch = false;
 	inskey.low = inskey.stricthigh = InvalidOffsetNumber;
+	inskey.heapkeyspace = _bt_heapkeyspace(rel);
+	inskey.minusinfkey = !inskey.heapkeyspace;
 	inskey.nextkey = nextkey;
+	inskey.scantid = NULL;
 	inskey.keysz = keysCount;
 
 	/*
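
The "minus infinity" convention that _bt_compare() now relies on reduces to a simple rule, shown here as a toy standalone function (plain integers stand in for the real scankey datums and comparator procs): compare against however many attributes the pivot actually has, and if that prefix is equal while the pivot is shorter, report the scankey as greater.

#include <stdio.h>

/*
 * Toy model of truncated pivot attributes acting as "minus infinity".
 * pivot_natts < nscankeys means the pivot was suffix truncated; any
 * concrete scankey value then compares greater than the missing suffix.
 */
static int toy_pivot_cmp(const int *scankey, int nscankeys,
						 const int *pivot, int pivot_natts)
{
	int	ncmp = (pivot_natts < nscankeys) ? pivot_natts : nscankeys;

	for (int i = 0; i < ncmp; i++)
	{
		if (scankey[i] != pivot[i])
			return (scankey[i] < pivot[i]) ? -1 : 1;
	}

	/* Untruncated attributes all equal: truncated ones are minus infinity */
	if (nscankeys > pivot_natts)
		return 1;

	return 0;
}

int main(void)
{
	int	scankey[] = {7, 5};
	int	full_pivot[] = {7, 5};
	int	trunc_pivot[] = {7};	/* second attribute truncated away */

	printf("%d\n", toy_pivot_cmp(scankey, 2, full_pivot, 2));	/* prints 0 */
	printf("%d\n", toy_pivot_cmp(scankey, 2, trunc_pivot, 1));	/* prints 1 */
	return 0;
}

This is also what makes the !minusinfkey optimization safe to describe in those terms: a search that doesn't care about relocating a leaf page by its high key can be sent to the right of a truncated pivot rather than fruitlessly visiting the page to its left.
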
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 759859c302..67cdb44cf5 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -746,6 +746,7 @@ _bt_sortaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -799,8 +800,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -817,27 +816,21 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	itupsz = MAXALIGN(itupsz);
 
 	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itupsz doesn't
-	 * include the ItemId.
+	 * Check whether the item can fit on a btree page at all.
 	 *
-	 * NOTE: similar code appears in _bt_insertonpg() to defend against
-	 * oversize items being inserted into an already-existing index. But
-	 * during creation of an index, we don't go through there.
+	 * Every newly built index will treat heap TID as part of the keyspace,
+	 * which imposes the requirement that new high keys must occasionally have
+	 * a heap TID appended within _bt_truncate().  That may leave a new pivot
+	 * tuple one or two MAXALIGN() quantums larger than the original first
+	 * right tuple it's derived from.  v4 deals with the problem by decreasing
+	 * the limit on the size of tuples inserted on the leaf level by the same
+	 * small amount.  Enforce the new v4+ limit on the leaf level, and the old
+	 * limit on internal levels, since pivot tuples may need to make use of
+	 * the reserved space.  This should never fail on internal pages.
 	 */
-	if (itupsz > BTMaxItemSize(npage))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itupsz, BTMaxItemSize(npage),
-						RelationGetRelationName(wstate->index)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(wstate->heap,
-									RelationGetRelationName(wstate->index))));
+	if (unlikely(itupsz > BTMaxItemSize(npage)))
+		_bt_check_third_page(wstate->index, wstate->heap,
+							 state->btps_level == 0, npage, itup);
 
 	/*
 	 * Check to see if page is "full".  It's definitely full if the item won't
@@ -883,24 +876,35 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
+			IndexTuple	lastleft;
 			IndexTuple	truncated;
 			Size		truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks
 			 * in internal pages are either negative infinity items, or get
 			 * their contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it more
+			 * likely that _bt_truncate() can truncate away more attributes,
+			 * whereas the split point passed to _bt_split() is chosen much
+			 * more delicately.  Suffix truncation is mostly useful because it
+			 * improves space utilization for workloads with random
+			 * insertions.  It doesn't seem worthwhile to add logic for
+			 * choosing a split point here for a benefit that is bound to be
+			 * much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
 			 * original high key, and add our own truncated high key at the
-			 * same offset.
+			 * same offset.  It's okay if the truncated tuple is slightly
+			 * larger due to containing a heap TID value, since this case is
+			 * known to _bt_check_third_page(), which reserves space.
 			 *
 			 * Note that the page layout won't be changed very much.  oitup is
 			 * already located at the physical beginning of tuple space, so we
@@ -908,7 +912,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_truncate(wstate->index, lastleft, oitup,
+									 wstate->inskey);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -927,8 +935,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -973,7 +982,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1032,8 +1041,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1126,6 +1136,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1133,7 +1145,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1150,6 +1161,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all
+				 * keys in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
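
One knock-on effect of appending heap TIDs to pivot tuples shows up in the size check above: the leaf-level tuple size limit has to come down by roughly one MAXALIGN()'d item pointer, so that _bt_truncate() can always append a heap TID to a new high key without breaking the 1/3-of-a-page restriction. A back-of-the-envelope version (the 2712-byte figure is only a stand-in for the old 8KB-page BTMaxItemSize, and 8-byte alignment is an assumption):

#include <stdio.h>

#define TOY_MAXALIGN(len)	(((len) + 7) & ~((unsigned long) 7))	/* assumed 8-byte alignment */
#define TOY_ITEMPOINTER_SZ	6	/* block number (4 bytes) + offset (2 bytes) */

int main(void)
{
	unsigned long	internal_limit = 2712;	/* stand-in for the old per-item limit */
	unsigned long	leaf_limit = internal_limit - TOY_MAXALIGN(TOY_ITEMPOINTER_SZ);

	printf("internal-level limit: %lu bytes\n", internal_limit);
	printf("leaf-level limit:     %lu bytes (%lu bytes reserved for a pivot heap TID)\n",
		   leaf_limit, TOY_MAXALIGN(TOY_ITEMPOINTER_SZ));
	return 0;
}

That's also why the _bt_check_third_page() calls above pass a heapkeyspace/leaf-level flag: the stricter limit only applies where a heap TID might later need to be appended, while internal levels keep the old limit so that the slightly enlarged pivots themselves always fit.
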
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index acd122aa53..5d53f58f09 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,6 +49,8 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+			   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -56,9 +58,25 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		Result is intended for use with _bt_compare().  Callers that don't
- *		need to fill out the insertion scankey arguments (e.g. they use an own
- *		ad-hoc comparison routine) can pass a NULL index tuple.
+ *		When itup is a non-pivot tuple, the returned insertion scan key is
+ *		suitable for finding a place for it to go on the leaf level.  Pivot
+ *		tuples can be used to relocate leaf page with matching high key, but
+ *		tuples can be used to relocate a leaf page by its high key, but
+ *		thought of as explicitly representing that absent attributes in scan
+ *		key have minus infinity values.
+ *
+ *		Result is intended for use with _bt_compare() and _bt_truncate().
+ *		Callers that don't need to fill out the insertion scankey arguments
+ *		(e.g. they use their own ad-hoc comparison routine, or only need a
+ *		scankey for _bt_truncate()) can pass a NULL index tuple.  The
+ *		scankey will be initialized as if an "all truncated" pivot tuple
+ *		was passed instead.
+ *
+ *		Note that we may occasionally have to share lock the metapage to
+ *		determine whether or not the keys in the index are expected to be
+ *		unique (i.e. if this is a "heapkeyspace" index).  We assume a
+ *		heapkeyspace index when caller passes a NULL tuple, allowing index
+ *		build callers to avoid accessing the non-existent metapage.
  */
 BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
@@ -79,15 +97,39 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
-	 * We'll execute search using scan key constructed on key columns. Non-key
-	 * (INCLUDE index) columns are always omitted from scan keys.
+	 * We'll execute search using scan key constructed on key columns.
+	 * Truncated attributes and non-key attributes are omitted from the final
+	 * scan key.
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
+	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+
+	/*
+	 * Only heapkeyspace indexes support the "no minus infinity keys"
+	 * optimization.  !heapkeyspace indexes don't actually have minus infinity
+	 * attributes, but this allows us to avoid checking heapkeyspace
+	 * separately (explicit representation of number of key attributes in v3
+	 * indexes shouldn't confuse tie-breaker logic).
+	 *
+	 * There is never a need to explicitly represent truncated attributes as
+	 * having minus infinity values.  The only caller that may truly need to
+	 * search for negative infinity is the page deletion code.  It is
+	 * sufficient to omit trailing truncated attributes from the scankey
+	 * returned to that caller because caller relies on the fact that there
+	 * cannot be duplicate pivot tuples on a level within heapkeyspace
+	 * indexes.  Caller also opts out of the "no minus infinity key"
+	 * optimization, so search moves left on scankey-equal downlink in parent,
+	 * allowing VACUUM caller to reliably relocate leaf page undergoing
+	 * allowing VACUUM caller to reliably relocate leaf page undergoing
+	 * deletion.  See also: _bt_compare()'s handling of minusinfkey.
+	 */
+	key->minusinfkey = !key->heapkeyspace;
 	key->savebinsrch = key->restorebinsrch = false;
 	key->low = key->stricthigh = InvalidOffsetNumber;
 	key->nextkey = false;
 	key->keysz = Min(indnkeyatts, tupnatts);
+	key->scantid = key->heapkeyspace && itup ?
+		BTreeTupleGetHeapTID(itup) : NULL;
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -103,9 +145,9 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
 
 		/*
-		 * Key arguments built when caller provides no tuple are
-		 * defensively represented as NULL values.  They should never be
-		 * used.
+		 * Key arguments built from truncated attributes (or when caller
+		 * provides no tuple) are defensively represented as NULL values.
+		 * They should never be used.
 		 */
 		if (i < tupnatts)
 			arg = index_getattr(itup, i + 1, itupdesc, &null);
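
Looking ahead to the _bt_truncate()/_bt_keep_natts() additions below: the number of attributes a new pivot keeps is "everything up to and including the first attribute where lastleft and firstright differ", with nkeyatts + 1 signalling that no key attribute distinguishes the tuples and a heap TID tie-breaker must be appended. A standalone sketch of that rule (plain integer comparisons stand in for the insertion scankey's comparator functions):

#include <stdio.h>

/*
 * Toy version of the keep-attributes rule.  Returns how many leading
 * attributes of firstright the new pivot needs; nkeyatts + 1 means the
 * tuples are equal on every key attribute and only a heap TID can
 * distinguish them.
 */
static int toy_keep_natts(const int *lastleft, const int *firstright, int nkeyatts)
{
	int	keepnatts = 1;

	for (int attnum = 1; attnum <= nkeyatts; attnum++)
	{
		if (lastleft[attnum - 1] != firstright[attnum - 1])
			break;
		keepnatts++;
	}

	return keepnatts;
}

int main(void)
{
	int	lastleft[] = {7, 5, 9};
	int	firstright1[] = {7, 6, 1};	/* differs on the second attribute */
	int	firstright2[] = {7, 5, 9};	/* equal on every key attribute */

	printf("%d\n", toy_keep_natts(lastleft, firstright1, 3));	/* prints 2 */
	printf("%d\n", toy_keep_natts(lastleft, firstright2, 3));	/* prints 4 */
	return 0;
}

When the loop never breaks, the tuples are opclass-equal on every key attribute, which is exactly the case where _bt_truncate() falls back to appending lastleft's heap TID to the new pivot.
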
@@ -2043,38 +2085,234 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  We only need to keep firstright
+ * need one or more suffix key attributes.  We only need to keep firstright's
+ * attributes up to and including the first one that differs from lastleft.
+ * argument values are not considered here.
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that takes up more
+ * space than firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
  * Note that returned tuple's t_tid offset will hold the number of attributes
  * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * should only change truncated tuple's downlink.  Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID is appended) the size of the returned
+ * tuple is the size of the first right tuple plus an additional MAXALIGN()'d
+ * item pointer.  This guarantee is important, since callers need to stay
+ * under the 1/3 of a page restriction on tuple size.  If this routine is ever
+ * taught to truncate within an attribute/datum, it will need to avoid
+ * returning an enlarged tuple to caller when truncation + TOAST compression
+ * ends up enlarging the final datum.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			 BTScanInsert itup_key)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int16		natts = IndexRelationGetNumberOfAttributes(rel);
+	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+	IndexTuple	pivot;
+	ItemPointer pivotheaptid;
+	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples.  It's never okay to
+	 * truncate a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be kept in truncated tuple */
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
 
-	return truncated;
+#ifdef DEBUG_NO_TRUNCATE
+	/* Force truncation to be ineffective for testing purposes */
+	keepnatts = nkeyatts + 1;
+#endif
+
+	if (keepnatts <= natts)
+	{
+		IndexTuple	tidpivot;
+
+		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+
+		/*
+		 * If there is a distinguishing key attribute within new pivot tuple,
+		 * there is no need to add an explicit heap TID attribute
+		 */
+		if (keepnatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			return pivot;
+		}
+
+		/*
+		 * Only truncation of non-key attributes was possible, since key
+		 * attributes are all equal.  It's necessary to add a heap TID
+		 * attribute to the new pivot tuple.
+		 */
+		Assert(natts != nkeyatts);
+		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		/* cannot leak memory here */
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * We have to use heap TID as a unique-ifier in the new pivot tuple, since
+	 * no non-TID key attribute in the right item readily distinguishes the
+	 * right side of the split from the left side.  Use enlarged space that
+	 * holds a copy of first right tuple; place a heap TID value within the
+	 * extra space that remains at the end.
+	 *
+	 * nbtree conceptualizes this case as an inability to truncate away any
+	 * key attribute.  We must use an alternative representation of heap TID
+	 * within pivots because heap TID is only treated as an attribute within
+	 * nbtree (e.g., there is no pg_attribute entry).
+	 */
+	Assert(itup_key->heapkeyspace);
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/*
+	 * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+	 * consider suffix truncation.  It seems like a good idea to follow that
+	 * example in cases where no truncation takes place -- use lastleft's heap
+	 * TID.  (This is also the closest value to negative infinity that's
+	 * legally usable.)
+	 */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  sizeof(ItemPointerData));
+	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+
+	/*
+	 * Lehman and Yao require that the downlink to the right page, which is
+	 * to be inserted into the parent page in the second phase of a page
+	 * split, be a strict lower bound on items on the right page, and a
+	 * non-strict upper bound for items on the left page.  Assert that heap
+	 * TIDs follow these invariants, since a heap TID value is apparently
+	 * needed as a tiebreaker.
+	 */
+#ifndef DEBUG_NO_TRUNCATE
+	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#else
+
+	/*
+	 * Those invariants aren't guaranteed to hold for lastleft + firstright
+	 * heap TID attribute values when they're considered here only because
+	 * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+	 * needed as a tiebreaker).  DEBUG_NO_TRUNCATE must therefore use a heap
+	 * TID value that always works as a strict lower bound for items to the
+	 * right.  In particular, it must avoid using firstright's leading key
+	 * attribute values along with lastleft's heap TID value when lastleft's
+	 * TID happens to be greater than firstright's TID.
+	 */
+	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+
+	/*
+	 * Pivot heap TID should never be fully equal to firstright.  Note that
+	 * the pivot heap TID will still end up equal to lastleft's heap TID when
+	 * that's the only usable value.
+	 */
+	ItemPointerSetOffsetNumber(pivotheaptid,
+							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#endif
+
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point.  Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in the new pivot tuple.
+ */
+static int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			   BTScanInsert itup_key)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keepnatts;
+	ScanKey		scankey;
+
+	/*
+	 * Be consistent about the representation of BTREE_VERSION 2/3 tuples
+	 * across Postgres versions; don't allow new pivot tuples to have
+	 * truncated key attributes there.  _bt_compare() treats truncated key
+	 * attributes as having the value minus infinity, which would break
+	 * searches within !heapkeyspace indexes.
+	 */
+	if (!itup_key->heapkeyspace)
+	{
+		Assert(nkeyatts != IndexRelationGetNumberOfAttributes(rel));
+		return nkeyatts;
+	}
+
+	scankey = itup_key->scankeys;
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+											scankey->sk_collation,
+											datum1,
+											datum2)) != 0)
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
 }
 
 /*
@@ -2088,15 +2326,17 @@ _bt_nonkey_truncate(Relation rel, IndexTuple itup)
  * preferred to calling here.  That's usually more convenient, and is always
  * more explicit.  Call here instead when offnum's tuple may be a negative
  * infinity tuple that uses the pre-v11 on-disk representation, or when a low
- * context check is appropriate.
+ * context check is appropriate.  This routine is as strict as possible about
+ * what is expected on each version of btree.
  */
 bool
-_bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
+_bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 {
 	int16		natts = IndexRelationGetNumberOfAttributes(rel);
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2116,16 +2356,26 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Leaf tuples that are not the page high key (non-pivot tuples)
-			 * should never be truncated
+			 * Non-pivot tuples currently never use alternative heap TID
+			 * representation -- even those within heapkeyspace indexes
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+				return false;
+
+			/*
+			 * Leaf tuples that are not the page high key (non-pivot tuples)
+			 * should never be truncated.  (Note that tupnatts must have been
+			 * inferred, rather than coming from an explicit on-disk
+			 * representation.)
+			 */
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2135,8 +2385,15 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 */
 			Assert(!P_RIGHTMOST(opaque));
 
-			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			/*
+			 * !heapkeyspace high key tuple contains only key attributes. Note
+			 * that tupnatts will only have been explicitly represented in
+			 * !heapkeyspace indexes that happen to have non-key attributes.
+			 */
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2148,7 +2405,11 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * its high key) is its negative infinity tuple.  Negative
 			 * infinity tuples are always truncated to zero attributes.  They
 			 * are a particular kind of pivot tuple.
-			 *
+			 */
+			if (heapkeyspace)
+				return tupnatts == 0;
+
+			/*
 			 * The number of attributes won't be explicitly represented if the
 			 * negative infinity tuple was generated during a page split that
 			 * occurred with a version of Postgres before v11.  There must be
@@ -2159,18 +2420,109 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == 0 ||
+			return tupnatts == 0 ||
 				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
 		{
 			/*
-			 * Tuple contains only key attributes despite on is it page high
-			 * key or not
+			 * !heapkeyspace downlink tuple with separator key contains only
+			 * key attributes.  Note that tupnatts will only have been
+			 * explicitly represented in !heapkeyspace indexes that happen to
+			 * have non-key attributes.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 
 	}
+
+	/* Handle heapkeyspace pivot tuples (excluding minus infinity items) */
+	Assert(heapkeyspace);
+
+	/*
+	 * Explicit representation of the number of attributes is mandatory with
+	 * heapkeyspace index pivot tuples, regardless of whether or not there are
+	 * non-key attributes.
+	 */
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+
+	/*
+	 * Heap TID is a tie-breaker key attribute, so it cannot be untruncated
+	 * when any other key attribute is truncated
+	 */
+	if (BTreeTupleGetHeapTID(itup) != NULL && tupnatts != nkeyatts)
+		return false;
+
+	/*
+	 * Pivot tuple must have at least one untruncated key attribute (minus
+	 * infinity pivot tuples are the only exception).  Pivot tuples can never
+	 * represent that there is a value present for a key attribute that
+	 * exceeds pg_index.indnkeyatts for the index.
+	 */
+	return tupnatts > 0 && tupnatts <= nkeyatts;
+}
+
+/*
+ * _bt_check_third_page() -- check whether tuple fits on a btree page at all.
+ *
+ * We actually need to be able to fit three items on every page, so restrict
+ * any one item to 1/3 the per-page available space.  Note that itemsz should
+ * not include the ItemId overhead.
+ *
+ * It might be useful to apply TOAST methods rather than throw an error here.
+ * Using out of line storage would break assumptions made by suffix truncation
+ * and by contrib/amcheck, though.
+ */
+void
+_bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
+					 Page page, IndexTuple newtup)
+{
+	Size		itemsz;
+	BTPageOpaque opaque;
+
+	itemsz = MAXALIGN(IndexTupleSize(newtup));
+
+	/* Double check item size against limit */
+	if (itemsz <= BTMaxItemSize(page))
+		return;
+
+	/*
+	 * Tuple is probably too large to fit on page, but it's possible that the
+	 * index uses version 2 or version 3, or that page is an internal page, in
+	 * which case a slightly higher limit applies.
+	 */
+	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
+		return;
+
+	/*
+	 * Internal page insertions cannot fail here, because that would mean that
+	 * an earlier leaf level insertion that should have failed didn't
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (!P_ISLEAF(opaque))
+		elog(ERROR, "cannot insert oversized tuple of size %zu on internal page of index \"%s\"",
+			 itemsz, RelationGetRelationName(rel));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+			 errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
+					itemsz,
+					needheaptidspace ? BTREE_VERSION : BTREE_NOVAC_VERSION,
+					needheaptidspace ? BTMaxItemSize(page) :
+					BTMaxItemSizeNoHeapTid(page),
+					RelationGetRelationName(rel)),
+			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+					   ItemPointerGetBlockNumber(&newtup->t_tid),
+					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   RelationGetRelationName(heap)),
+			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+					 "Consider a function index of an MD5 hash of the value, "
+					 "or use full text indexing."),
+			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index b0666b42df..876ff0c40f 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -103,7 +103,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 
 	md = BTPageGetMeta(metapg);
 	md->btm_magic = BTREE_MAGIC;
-	md->btm_version = BTREE_VERSION;
+	md->btm_version = xlrec->version;
 	md->btm_root = xlrec->root;
 	md->btm_level = xlrec->level;
 	md->btm_fastroot = xlrec->fastroot;
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -284,6 +268,8 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		OffsetNumber off;
 		IndexTuple	newitem = NULL;
 		Size		newitemsz = 0;
+		IndexTuple	left_hikey = NULL;
+		Size		left_hikeysz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 8d5c6ae0ab..fcac0cd8a9 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index f97a82ae7b..5b7637883e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4057,9 +4057,10 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required for
+	 * btree indexes, since heap TID is treated as an implicit last key
+	 * attribute in order to ensure that all keys in the index are physically
+	 * unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
@@ -4076,6 +4077,9 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
@@ -4128,6 +4132,9 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7870f282d2..167b79f2f4 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -112,18 +112,45 @@ typedef struct BTMetaPageData
 #define BTPageGetMeta(p) \
 	((BTMetaPageData *) PageGetContents(p))
 
+/*
+ * The current Btree version is 4.  That's what you'll get when you create
+ * a new index.
+ *
+ * Btree version 3 was used in PostgreSQL v11.  It is mostly the same as
+ * version 4, but heap TIDs were not part of the keyspace.  Index tuples
+ * with duplicate keys could be stored in any order.  We continue to
+ * support reading and writing Btree versions 2 and 3, so that they don't
+ * need to be immediately re-indexed at pg_upgrade.  In order to get the
+ * new heapkeyspace semantics, however, a REINDEX is needed.
+ *
+ * Btree version 2 is mostly the same as version 3.  There are two new
+ * fields in the metapage that were introduced in version 3.  A version 2
+ * metapage will be automatically upgraded to version 3 on the first
+ * insert to it.  INCLUDE indexes cannot use version 2.
+ */
 #define BTREE_METAPAGE	0		/* first page is meta */
-#define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
+#define BTREE_MAGIC		0x053162	/* magic number in metapage */
+#define BTREE_VERSION	4		/* current version number */
 #define BTREE_MIN_VERSION	2	/* minimal supported version number */
+#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_truncate() will need to enlarge
+ * a heap index tuple to make space for a tie-breaker heap TID
+ * attribute, which we account for here.
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*sizeof(ItemPointerData)) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeNoHeapTid(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
@@ -187,38 +214,73 @@ typedef struct BTMetaPageData
 #define P_FIRSTDATAKEY(opaque)	(P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)
 
 /*
+ * Notes on B-Tree tuple format, and key and non-key attributes:
+ *
  * INCLUDE B-Tree indexes have non-key attributes.  These are extra
  * attributes that may be returned by index-only scans, but do not influence
  * the order of items in the index (formally, non-key attributes are not
  * considered to be part of the key space).  Non-key attributes are only
  * present in leaf index tuples whose item pointers actually point to heap
- * tuples.  All other types of index tuples (collectively, "pivot" tuples)
- * only have key attributes, since pivot tuples only ever need to represent
- * how the key space is separated.  In general, any B-Tree index that has
- * more than one level (i.e. any index that does not just consist of a
- * metapage and a single leaf root page) must have some number of pivot
- * tuples, since pivot tuples are used for traversing the tree.
+ * tuples (non-pivot tuples).
  *
- * We store the number of attributes present inside pivot tuples by abusing
- * their item pointer offset field, since pivot tuples never need to store a
- * real offset (downlinks only need to store a block number).  The offset
- * field only stores the number of attributes when the INDEX_ALT_TID_MASK
- * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * Non-pivot tuple format:
  *
- * The 12 least significant offset bits are used to represent the number of
- * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
- * for future use (BT_RESERVED_OFFSET_MASK bits). BT_N_KEYS_OFFSET_MASK should
- * be large enough to store any number <= INDEX_MAX_KEYS.
+ *  t_tid | t_info | key values | INCLUDE columns, if any
+ *
+ * t_tid points to the heap TID, which is a tie-breaker key column as of
+ * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
+ * set.
+ *
+ * All other types of index tuples (collectively, "pivot" tuples) only
+ * have key columns, since pivot tuples only ever need to represent how
+ * the key space is separated.  In general, any B-Tree index that has more
+ * than one level (i.e. any index that does not just consist of a metapage
+ * and a single leaf root page) must have some number of pivot tuples,
+ * since pivot tuples are used for traversing the tree.  Suffix truncation
+ * can omit trailing key columns when a new pivot is formed, which makes
+ * minus infinity their logical value.  Since BTREE_VERSION 4 indexes
+ * treat heap TID as a trailing key column that ensures that all index
+ * tuples are unique, it is necessary to represent heap TID as a trailing
+ * key column in pivot tuples, though very often this can be truncated
+ * away, just like any other key column. (Actually, the heap TID is
+ * omitted rather than truncated, since its representation is different to
+ * the non-pivot representation.)
+ *
+ * Pivot tuple format:
+ *
+ *  t_tid | t_info | key values | [heap TID]
+ *
+ * We store the number of columns present inside pivot tuples by abusing
+ * their t_tid offset field, since pivot tuples never need to store a real
+ * offset (downlinks only need to store a block number in t_tid).  The
+ * offset field only stores the number of columns/attributes when the
+ * INDEX_ALT_TID_MASK bit is set, which doesn't count the trailing heap
+ * TID column sometimes stored in pivot tuples -- that's represented by
+ * the presence of BT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in t_info
+ * is always set on BTREE_VERSION 4.  BT_HEAP_TID_ATTR can only be set on
+ * BTREE_VERSION 4.
+ *
+ * In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set.  In
+ * that case, the number of key columns is implicitly the same as the
+ * number of key columns in the index.  It is never set on version 2
+ * indexes, which predate the introduction of INCLUDE indexes. (Only pivot
+ * tuples that actually had attributes truncated away explicitly represent
+ * the number of key columns on version 3, whereas all pivot tuples are
+ * formed using truncation on version 4.)
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of columns in INDEX_ALT_TID_MASK tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
+ * number of columns/attributes <= INDEX_MAX_KEYS.
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
+
+/* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +303,16 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.  Note also that the number of key attributes must be
+ * explicitly represented in heapkeyspace pivot tuples.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +321,52 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * Only BTREE_VERSION 4 indexes treat heap TID as a tie-breaker key attribute.
+ * This macro can be used with tuples from indexes that use earlier versions,
+ * even though the result won't be meaningful.  The expectation is that higher
+ * level code will ensure that the result is never used, for example by never
+ * providing a scantid that the result is compared against.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
+ * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
+ * (since all pivot tuples must have it set as of BTREE_VERSION 4).  We
+ * currently assume that a tuple with INDEX_ALT_TID_MASK set is a pivot
+ * tuple within heapkeyspace indexes, and that a tuple without it set must
+ * be a non-pivot tuple.  If non-pivot tuples ever use the
+ * INDEX_ALT_TID_MASK representation in the future, they'll probably also
+ * store a heap TID at the end of the tuple.  pg_upgrade'd !heapkeyspace
+ * indexes only set INDEX_ALT_TID_MASK in pivot tuples that actually
+ * originated with the truncation of one or more attributes.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   sizeof(ItemPointerData)) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -325,26 +431,47 @@ typedef BTStackData *BTStack;
  * be confused with a search scankey).  It's used to descend a B-Tree using
  * _bt_search.
  *
+ * heapkeyspace indicates if we expect all keys in the index to be unique by
+ * treating heap TID as a tie-breaker attribute (i.e. the index is
+ * BTREE_VERSION 4+).  scantid should never be set when index is not a
+ * heapkeyspace index.
+ *
+ * minusinfkey controls an optimization used by heapkeyspace indexes.
+ * Searches that are not specifically interested in keys with the value minus
+ * infinity (all searches bar those performed by VACUUM for page deletion)
+ * apply the optimization by setting the field to false.  See _bt_compare()
+ * for a full explanation.
+ *
  * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
  * locate the first item >= scankey.  When nextkey is true, they will locate
  * the first item > scan key.
  *
- * keysz is the number of insertion scankeys present.
+ * scantid is the heap TID that is used as a final tie-breaker attribute,
+ * which may be set to NULL to indicate its absence.  When inserting new
+ * tuples, it must be set, since every tuple in the tree unambiguously belongs
+ * in one exact position, even when there are entries in the tree that are
+ * considered duplicates by external code.  Unique insertions set scantid only
+ * after unique checking indicates that it's safe to insert.  Despite the
+ * representational difference, scantid is just another insertion scankey to
+ * routines like _bt_search.
  *
- * scankeys is an array of scan key entries for attributes that are compared.
- * During insertion, there must be a scan key for every attribute, but when
- * starting a regular index scan some can be omitted.  The array is used as a
- * flexible array member, though it's sized in a way that makes it possible to
- * use stack allocations.  See nbtree/README for full details.
+ * keysz is the number of insertion scankeys present, not including scantid.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared
+ * before scantid (user-visible attributes).  During insertion, there must be
+ * a scan key for every attribute, but when starting a regular index scan some
+ * can be omitted.  The array is used as a flexible array member, though it's
+ * sized in a way that makes it possible to use stack allocations.  See
+ * nbtree/README for full details.
  */
 
 typedef struct BTScanInsertData
 {
 	/*
 	 * Mutable state used by _bt_binsrch to inexpensively repeat a binary
-	 * search on the leaf level.  Only used for insertions where
-	 * _bt_check_unique is called.  See _bt_binsrch and _bt_findinsertloc for
-	 * details.
+	 * search on the leaf level when only scantid has changed.  Only used for
+	 * insertions where _bt_check_unique is called.  See _bt_binsrch and
+	 * _bt_findinsertloc for details.
 	 */
 	bool		savebinsrch;
 	bool		restorebinsrch;
@@ -352,7 +479,10 @@ typedef struct BTScanInsertData
 	OffsetNumber stricthigh;
 
 	/* State used to locate a position at the leaf level */
+	bool		heapkeyspace;
+	bool		minusinfkey;
 	bool		nextkey;
+	ItemPointer scantid;		/* tiebreaker for scankeys */
 	int			keysz;			/* Size of scankeys */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
 } BTScanInsertData;
@@ -582,6 +712,7 @@ extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
+extern bool _bt_heapkeyspace(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -635,8 +766,12 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
-extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+			 IndexTuple firstright, BTScanInsert itup_key);
+extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
+				OffsetNumber offnum);
+extern void _bt_check_third_page(Relation rel, Relation heap,
+					 bool needheaptidspace, Page page, IndexTuple newtup);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index a605851c98..6320a0098f 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,7 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50 and 0x60 are unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -47,6 +46,7 @@
  */
 typedef struct xl_btree_metadata
 {
+	uint32		version;
 	BlockNumber root;
 	uint32		level;
 	BlockNumber fastroot;
@@ -80,27 +80,30 @@ typedef struct xl_btree_insert
  * whole page image.  The left page, however, is handled in the normal
  * incremental-update fashion.
  *
- * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
- * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * Note: XLOG_BTREE_SPLIT_L and XLOG_BTREE_SPLIT_R share this data record.
+ * There are two variants to indicate whether the inserted tuple went into the
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always log the left page high key because suffix
+ * truncation can generate a new leaf high key using user-defined code.  This
+ * is also necessary on internal pages, since the first right item that the
+ * left page's high key was based on will have been truncated to zero
+ * attributes in the right page (the original is unavailable from the right
+ * page).
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * An IndexTuple representing the high key of the left page must follow with
+ * either variant.
  *
  * Backup Blk 1: new right page
  *
- * The right page's data portion contains the right page's tuples in the
- * form used by _bt_restore_page.
+ * The right page's data portion contains the right page's tuples in the form
+ * used by _bt_restore_page.  This includes the new item, if it's the _R
+ * variant.  The right page's tuples also include the right page's high key
+ * with either variant (moved from the left/original page during the split),
+ * unless the split happened to be of the rightmost page on its level, in
+ * which case the new right page has no high key.
  *
  * Backup Blk 2: next block (orig page's rightlink), if any
  * Backup Blk 3: child's left sibling, if non-leaf split
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index b21298a2a6..ff443a476c 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -199,28 +199,22 @@ reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
 --
--- Test B-tree page deletion. In particular, deleting a non-leaf page.
+-- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
--- First create a tree that's at least four levels deep. The text inserted
--- is long and poorly compressible. That way only a few index tuples fit on
--- each page, allowing us to get a tall tree with fewer pages.
+-- First create a tree that's at least three levels deep (i.e. has one level
+-- between the root and leaf levels). The text inserted is long.  It won't be
+-- compressed because we use plain storage in the table.  Only a few index
+-- tuples fit on each internal page, allowing us to get a tall tree with few
+-- pages.  (A tall tree is required to trigger caching.)
+--
+-- The text column must be the leading column in the index, since suffix
+-- truncation would otherwise truncate tuples on internal pages, leaving us
+-- with a short tree.
 create table btree_tall_tbl(id int4, t text);
-create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
-insert into btree_tall_tbl
-  select g, g::text || '_' ||
-          (select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
-from generate_series(1, 100) g;
--- Delete most entries, and vacuum. This causes page deletions.
-delete from btree_tall_tbl where id < 950;
-vacuum btree_tall_tbl;
---
--- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
--- WAL record type). This happens when a "fast root" page is split.
---
--- The vacuum above should've turned the leaf page into a fast root. We just
--- need to insert some rows to cause the fast root page to split.
-insert into btree_tall_tbl (id, t)
-  select g, repeat('x', 100) from generate_series(1, 500) g;
+alter table btree_tall_tbl alter COLUMN t set storage plain;
+create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
+insert into btree_tall_tbl select g, repeat('x', 250)
+from generate_series(1, 130) g;
 --
 -- Test vacuum_cleanup_index_scale_factor
 --
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 5d4eb59a0c..54d3eee197 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3225,11 +3225,22 @@ explain (costs off)
 CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 --
+-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
+-- WAL record type). This happens when a "fast root" page is split.  This
+-- also creates coverage for nbtree FSM page recycling.
+--
+-- The vacuum above should've turned the leaf page into a fast root. We just
+-- need to insert some rows to cause the fast root page to split.
+INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
+--
 -- REINDEX (VERBOSE)
 --
 CREATE TABLE reindex_verbose(id integer primary key);
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8e50f8ffbb..8d31110b87 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 -- doesn't work: grant still exists
 DROP USER regress_dep_user1;
 ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
-DETAIL:  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
+DETAIL:  privileges for table deptest1
 privileges for database regression
-privileges for table deptest1
+owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
 DROP OWNED BY regress_dep_user1;
 DROP USER regress_dep_user1;
 \set VERBOSITY terse
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index 0e32d5c427..ac41419c7b 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -187,9 +187,9 @@ ERROR:  event trigger "regress_event_trigger" does not exist
 -- should fail, regress_evt_user owns some objects
 drop role regress_evt_user;
 ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
-DETAIL:  owner of event trigger regress_event_trigger3
+DETAIL:  owner of user mapping for regress_evt_user on server useless_server
 owner of default privileges on new relations belonging to role regress_evt_user
-owner of user mapping for regress_evt_user on server useless_server
+owner of event trigger regress_event_trigger3
 -- cleanup before next test
 -- these are all OK; the second one should emit a NOTICE
 drop event trigger if exists regress_event_trigger2;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 4d82d3a7e8..9c763ec184 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
 RESET ROLE;
 DROP ROLE regress_test_indirect;                            -- ERROR
 ERROR:  role "regress_test_indirect" cannot be dropped because some objects depend on it
-DETAIL:  owner of server s1
-privileges for foreign-data wrapper foo
+DETAIL:  privileges for foreign-data wrapper foo
+owner of server s1
 \des+
                                                                                  List of foreign servers
  Name |           Owner           | Foreign-data wrapper |                   Access privileges                   |  Type  | Version |             FDW options              | Description 
@@ -1998,9 +1998,9 @@ DROP TABLE temp_parted;
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 ERROR:  role "regress_test_role" cannot be dropped because some objects depend on it
-DETAIL:  privileges for server s4
+DETAIL:  owner of user mapping for regress_test_role on server s6
 privileges for foreign-data wrapper foo
-owner of user mapping for regress_test_role on server s6
+privileges for server s4
 DROP SERVER t1 CASCADE;
 NOTICE:  drop cascades to user mapping for public on server t1
 DROP USER MAPPING FOR regress_test_role SERVER s6;
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index 2e170497c9..bad5199d9e 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3503,8 +3503,8 @@ SELECT refclassid::regclass, deptype
 SAVEPOINT q;
 DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
 ERROR:  role "regress_rls_eve" cannot be dropped because some objects depend on it
-DETAIL:  target of policy p on table tbl1
-privileges for table tbl1
+DETAIL:  privileges for table tbl1
+target of policy p on table tbl1
 ROLLBACK TO q;
 ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
 SAVEPOINT q;
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 2b087be796..19fbfa8b72 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -84,32 +84,23 @@ reset enable_indexscan;
 reset enable_bitmapscan;
 
 --
--- Test B-tree page deletion. In particular, deleting a non-leaf page.
+-- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
 
--- First create a tree that's at least four levels deep. The text inserted
--- is long and poorly compressible. That way only a few index tuples fit on
--- each page, allowing us to get a tall tree with fewer pages.
+-- First create a tree that's at least three levels deep (i.e. has one level
+-- between the root and leaf levels). The text inserted is long.  It won't be
+-- compressed because we use plain storage in the table.  Only a few index
+-- tuples fit on each internal page, allowing us to get a tall tree with few
+-- pages.  (A tall tree is required to trigger caching.)
+--
+-- The text column must be the leading column in the index, since suffix
+-- truncation would otherwise truncate tuples on internal pages, leaving us
+-- with a short tree.
 create table btree_tall_tbl(id int4, t text);
-create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
-insert into btree_tall_tbl
-  select g, g::text || '_' ||
-          (select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
-from generate_series(1, 100) g;
-
--- Delete most entries, and vacuum. This causes page deletions.
-delete from btree_tall_tbl where id < 950;
-vacuum btree_tall_tbl;
-
---
--- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
--- WAL record type). This happens when a "fast root" page is split.
---
-
--- The vacuum above should've turned the leaf page into a fast root. We just
--- need to insert some rows to cause the fast root page to split.
-insert into btree_tall_tbl (id, t)
-  select g, repeat('x', 100) from generate_series(1, 500) g;
+alter table btree_tall_tbl alter COLUMN t set storage plain;
+create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
+insert into btree_tall_tbl select g, repeat('x', 250)
+from generate_series(1, 130) g;
 
 --
 -- Test vacuum_cleanup_index_scale_factor
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 67ecad8dd5..4487421ef3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1146,11 +1146,23 @@ explain (costs off)
 CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 
+--
+-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
+-- WAL record type). This happens when a "fast root" page is split.  This
+-- also creates coverage for nbtree FSM page recycling.
+--
+-- The vacuum above should've turned the leaf page into a fast root. We just
+-- need to insert some rows to cause the fast root page to split.
+INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
+
 --
 -- REINDEX (VERBOSE)
 --
-- 
2.17.1
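
For anyone who wants to poke at the preceding patch: a minimal sketch of
spot-checking the new heap TID ordering with contrib/amcheck, assuming the
BTREE_VERSION 4 primary key index built by the create_index test above (the
index name simply follows the usual <table>_pkey convention, and the amcheck
call is the stock contrib/amcheck entry point -- nothing below is part of the
patch):

"""
-- rough sketch only
CREATE EXTENSION IF NOT EXISTS amcheck;

-- verifies key ordering across the whole tree; on a freshly built
-- BTREE_VERSION 4 index, equal keys are additionally expected to appear
-- in heap TID order
SELECT bt_index_parent_check('delete_test_table_pkey');
"""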

Attachment: v15-0006-Add-high-key-continuescan-optimization.patch (application/octet-stream)
From 3f2d592076de4e4f7366d78771ee0c546d35b181 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 12 Nov 2018 13:11:21 -0800
Subject: [PATCH v15 6/7] Add high key "continuescan" optimization.

Teach B-Tree forward index scans to check the high key before moving to
the next page in the hopes of finding that it isn't actually necessary
to move to the next page.  We already opportunistically force a key
check of the last item on leaf pages, even when it's clear that it
cannot be returned to the scan due to being dead-to-all, for the same
reason.  Since forcing the last item to be key checked no longer makes
any difference in the case of forward scans, the existing extra key
check is now only used for backwards scans.  Like the existing check,
the new check won't always work out, but that seems like an acceptable
price to pay.

The new approach is more effective than just checking non-pivot tuples,
especially with composite indexes and non-unique indexes.  The high key
represents an upper bound on all values that can appear on the page,
which is often greater than whatever tuple happens to appear last at the
time of the check.  Also, suffix truncation's new logic for picking a
split point will often result in high keys that are relatively
dissimilar to the other (non-pivot) tuples on the page, and therefore
more likely to indicate that the scan need not proceed to the next page.

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.
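
To make the idea concrete, a purely illustrative sketch of the access pattern
that benefits (the table, index, and constant below are hypothetical and not
part of the patch):

"""
-- hypothetical schema, for illustration only
CREATE TABLE orders (customer_id int4, order_date date, amount numeric);
CREATE INDEX orders_cust_date_idx ON orders (customer_id, order_date);

-- Forward scan on the leading column.  When the matching tuples end partway
-- through a leaf page, comparing the scan keys against that page's high key
-- can show that the right sibling holds no further matches, so the scan
-- ends without visiting it.
SELECT count(*) FROM orders WHERE customer_id = 42;
"""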
---
 src/backend/access/nbtree/nbtsearch.c | 23 +++++++--
 src/backend/access/nbtree/nbtutils.c  | 70 +++++++++++++++++++--------
 2 files changed, 68 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 0a6cf96bc9..653546c135 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1390,6 +1390,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
 	}
 
+	continuescan = true;		/* default assumption */
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1438,16 +1439,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 				_bt_saveitem(so, itemIndex, offnum, itup);
 				itemIndex++;
 			}
+			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
-			{
-				/* there can't be any more matches, so stop */
-				so->currPos.moreRight = false;
 				break;
-			}
 
 			offnum = OffsetNumberNext(offnum);
 		}
 
+		/*
+		 * Forward scans need not visit page to the right when high key
+		 * indicates no more matches will be found there.
+		 *
+		 * Checking the high key like this works out more often than you might
+		 * think.  Leaf page splits pick a split point between the two most
+		 * dissimilar tuples (this is weighed against the need to evenly share
+		 * free space).  Leaf pages with high key attribute values that can
+		 * only appear on non-pivot tuples on the right sibling page are
+		 * common.
+		 */
+		if (continuescan && !P_RIGHTMOST(opaque))
+			_bt_checkkeys(scan, page, P_HIKEY, dir, &continuescan);
+
+		if (!continuescan)
+			so->currPos.moreRight = false;
+
 		Assert(itemIndex <= MaxIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 2f0124d289..8e2841d94a 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -48,7 +48,7 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
 static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
-					 IndexTuple tuple, TupleDesc tupdesc,
+					 IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
 static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
 			   IndexTuple firstright, BTScanInsert itup_key);
@@ -1363,11 +1363,14 @@ _bt_mark_scankey_required(ScanKey skey)
  *
  * scan: index scan descriptor (containing a search-type scankey)
  * page: buffer page containing index tuple
- * offnum: offset number of index tuple (must be a valid item!)
+ * offnum: offset number of index tuple (must be hikey or a valid item!)
  * dir: direction we are scanning in
  * continuescan: output parameter (will be set correctly in all cases)
  *
- * Caller must hold pin and lock on the index page.
+ * Caller must hold pin and lock on the index page.  Caller can pass a high
+ * key offnum in the hopes of discovering that the scan need not continue on
+ * to a page to the right.  We don't currently bother limiting high key
+ * comparisons to SK_BT_REQFWD scan keys.
  */
 IndexTuple
 _bt_checkkeys(IndexScanDesc scan,
@@ -1377,6 +1380,7 @@ _bt_checkkeys(IndexScanDesc scan,
 	ItemId		iid = PageGetItemId(page, offnum);
 	bool		tuple_alive;
 	IndexTuple	tuple;
+	int			tupnatts;
 	TupleDesc	tupdesc;
 	BTScanOpaque so;
 	int			keysz;
@@ -1390,24 +1394,21 @@ _bt_checkkeys(IndexScanDesc scan,
 	 * killed tuple as not passing the qual.  Most of the time, it's a win to
 	 * not bother examining the tuple's index keys, but just return
 	 * immediately with continuescan = true to proceed to the next tuple.
-	 * However, if this is the last tuple on the page, we should check the
-	 * index keys to prevent uselessly advancing to the next page.
+	 * However, if this is the first tuple on the page, and we're doing a
+	 * backward scan, we should check the index keys to prevent uselessly
+	 * advancing to the page to the left.  This is similar to the high key
+	 * optimization used by forward scan callers.
 	 */
 	if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
 	{
-		/* return immediately if there are more tuples on the page */
-		if (ScanDirectionIsForward(dir))
-		{
-			if (offnum < PageGetMaxOffsetNumber(page))
-				return NULL;
-		}
-		else
-		{
-			BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-			if (offnum > P_FIRSTDATAKEY(opaque))
-				return NULL;
-		}
+		/* forward scan callers check high key instead */
+		Assert(offnum >= P_FIRSTDATAKEY(opaque));
+		if (ScanDirectionIsForward(dir))
+			return NULL;
+		else if (offnum > P_FIRSTDATAKEY(opaque))
+			return NULL;
 
 		/*
 		 * OK, we want to check the keys so we can set continuescan correctly,
@@ -1419,6 +1420,7 @@ _bt_checkkeys(IndexScanDesc scan,
 		tuple_alive = true;
 
 	tuple = (IndexTuple) PageGetItem(page, iid);
+	tupnatts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
 
 	tupdesc = RelationGetDescr(scan->indexRelation);
 	so = (BTScanOpaque) scan->opaque;
@@ -1430,11 +1432,24 @@ _bt_checkkeys(IndexScanDesc scan,
 		bool		isNull;
 		Datum		test;
 
-		Assert(key->sk_attno <= BTreeTupleGetNAtts(tuple, scan->indexRelation));
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (key->sk_attno > tupnatts)
+		{
+			Assert(offnum == P_HIKEY);
+			Assert(ScanDirectionIsForward(dir));
+			continue;
+		}
+
 		/* row-comparison keys need special processing */
 		if (key->sk_flags & SK_ROW_HEADER)
 		{
-			if (_bt_check_rowcompare(key, tuple, tupdesc, dir, continuescan))
+			if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+									 continuescan))
 				continue;
 			return NULL;
 		}
@@ -1565,8 +1580,8 @@ _bt_checkkeys(IndexScanDesc scan,
  * This is a subroutine for _bt_checkkeys, which see for more info.
  */
 static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
-					 ScanDirection dir, bool *continuescan)
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
 {
 	ScanKey		subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
 	int32		cmpresult = 0;
@@ -1583,6 +1598,19 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
 
 		Assert(subkey->sk_flags & SK_ROW_MEMBER);
 
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (subkey->sk_attno > tupnatts)
+		{
+			Assert(ScanDirectionIsForward(dir));
+			cmpresult = 0;
+			continue;
+		}
+
 		datum = index_getattr(tuple,
 							  subkey->sk_attno,
 							  tupdesc,
-- 
2.17.1

Attachment: v15-0007-DEBUG-Add-pageinspect-instrumentation.patch (application/octet-stream)
From 9d5dc36128e0bff6cc2fd758dcec7c2d249b1dc2 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v15 7/7] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 68 +++++++++++++++----
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 ++++++
 3 files changed, 79 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..95c81c0808 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -254,9 +256,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[7];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -265,6 +267,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer htid;
+	BTPageOpaque opaque;
 
 	id = PageGetItemId(page, offset);
 
@@ -283,16 +287,53 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = natts;
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		indnkeyatts = rel->rd_index->indnkeyatts;
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (P_ISLEAF(opaque) && offset >= P_FIRSTDATAKEY(opaque))
+		htid = &itup->t_tid;
+	else if (_bt_heapkeyspace(rel))
+		htid = BTreeTupleGetHeapTID(itup);
+	else
+		htid = NULL;
+
+	if (htid)
+		values[j] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(htid),
+							 ItemPointerGetOffsetNumberNoCheck(htid));
+	else
+		values[j] = NULL;
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -366,11 +407,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -397,12 +438,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -482,7 +524,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

#78Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#77)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 08/03/2019 12:22, Peter Geoghegan wrote:

I would like to work through these other items with you
(_bt_binsrch_insert() and so on), but I think that it would be helpful
if you made an effort to understand the minusinfkey stuff first. I
spent a lot of time improving the explanation of that within
_bt_compare(). It's important.

Ok, after thinking about it for a while, I think I understand the minus
infinity stuff now. Let me try to explain it in my own words:

Imagine that you have an index with two key columns, A and B. The index
has two leaf pages, with the following items:

+--------+    +--------+
| Page 1 |    | Page 2 |
|        |    |        |
| 1 1    |    | 2 1    |
| 1 2    |    | 2 2    |
| 1 3    |    | 2 3    |
| 1 4    |    | 2 4    |
| 1 5    |    | 2 5    |
+--------+    +--------+

The key space is neatly split on the first key column - probably thanks
to the new magic in the page split code.

Now, what do we have as the high key of page 1? Answer: "2 -inf". The
"-inf" is not stored in the key itself, the second key column is just
omitted, and the search code knows to treat it implicitly as a value
that's lower than any real value. Hence, "minus infinity".

However, during page deletion, we need to perform a search to find the
downlink pointing to a leaf page. We do that by using the leaf page's
high key as the search key. But the search needs to treat it slightly
differently in that case. Normally, searching with a single key value,
"2", we would land on page 2, because any real value beginning with "2"
would be on that page, but in the page deletion case, we want to find
page 1. Setting the BTScanInsert.minusinfkey flag tells the search code
to do that.
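
To make that concrete, here is a toy sketch of the comparison rule being
described, with plain ints standing in for key attribute values. The
names are made up for illustration, this is not the patch's actual
_bt_compare, and it assumes the pivot is a suffix-truncated separator
(so at least its heap TID attribute is absent):

#include <stdbool.h>

static int
toy_compare(const int *skey, int nskeyatts,
            const int *pivot, int npivotatts, bool minusinfkey)
{
    int     ncmp = (nskeyatts < npivotatts) ? nskeyatts : npivotatts;
    int     i;

    for (i = 0; i < ncmp; i++)
    {
        if (skey[i] != pivot[i])
            return (skey[i] < pivot[i]) ? -1 : 1;   /* >0 means "move right" */
    }

    /* Scan key has values for attributes that the pivot truncated away */
    if (nskeyatts > npivotatts)
        return 1;               /* truncated attributes behave as -inf */

    /*
     * All untruncated attributes are equal.  An ordinary search for "2"
     * moves right past the ("2", <omitted>) high key, since any real tuple
     * with that prefix must sort after it.  A page deletion search
     * (minusinfkey) wants the page that this is the high key of, so it
     * reports equality and does not step past it.
     */
    return minusinfkey ? 0 : 1;
}

With minusinfkey unset, a search for "2" compares greater than page 1's
high key and lands on page 2; with it set, the page deletion search
stops at page 1.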

Question: Wouldn't it be more straightforward to use "1 +inf" as page
1's high key? I.e. treat any missing columns as positive infinity. That
way, the search for page deletion wouldn't need to be treated
differently. That's also how this used to work: all tuples on a page
used to be <= high key, not strictly < high key. And it would also make
the rightmost page less of a special case: you could pretend the
rightmost page has a pivot tuple with all columns truncated away, which
means positive infinity.

You have this comment in _bt_split which touches the subject:

/*
* The "high key" for the new left page will be the first key that's going
* to go into the new right page, or possibly a truncated version if this
* is a leaf page split. This might be either the existing data item at
* position firstright, or the incoming tuple.
*
* The high key for the left page is formed using the first item on the
* right page, which may seem to be contrary to Lehman & Yao's approach of
* using the left page's last item as its new high key when splitting on
* the leaf level. It isn't, though: suffix truncation will leave the
* left page's high key fully equal to the last item on the left page when
* two tuples with equal key values (excluding heap TID) enclose the split
* point. It isn't actually necessary for a new leaf high key to be equal
* to the last item on the left for the L&Y "subtree" invariant to hold.
* It's sufficient to make sure that the new leaf high key is strictly
* less than the first item on the right leaf page, and greater than or
* equal to (not necessarily equal to) the last item on the left leaf
* page.
*
* In other words, when suffix truncation isn't possible, L&Y's exact
* approach to leaf splits is taken. (Actually, even that is slightly
* inaccurate. A tuple with all the keys from firstright but the heap TID
* from lastleft will be used as the new high key, since the last left
* tuple could be physically larger despite being opclass-equal in respect
* of all attributes prior to the heap TID attribute.)
*/

But it doesn't explain why it's done like that.

- Heikki

In reply to: Heikki Linnakangas (#78)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Fri, Mar 8, 2019 at 2:12 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Now, what do we have as the high key of page 1? Answer: "2 -inf". The
"-inf" is not stored in the key itself, the second key column is just
omitted, and the search code knows to treat it implicitly as a value
that's lower than any real value. Hence, "minus infinity".

Right.

However, during page deletion, we need to perform a search to find the
downlink pointing to a leaf page. We do that by using the leaf page's
high key as the search key. But the search needs to treat it slightly
differently in that case. Normally, searching with a single key value,
"2", we would land on page 2, because any real value beginning with "2"
would be on that page, but in the page deletion case, we want to find
page 1. Setting the BTScanInsert.minusinfkey flag tells the search code
to do that.

Right.

Question: Wouldn't it be more straightforward to use "1 +inf" as page
1's high key? I.e. treat any missing columns as positive infinity.

That might also work, but it wouldn't be more straightforward on
balance. This is because:

* We have always taken the new high key from the firstright item, and
we also continue to do that on internal pages -- same as before. It
would certainly complicate the nbtsplitloc.c code to have to deal with
this new special case now (leaf and internal pages would have to have
far different handling, not just slightly different handling).

* We have always had "-inf" values as the first item on an internal
page, which is explicitly truncated to zero attributes as of Postgres
v11. It seems ugly to me to make truncated attributes mean negative
infinity in that context, but positive infinity in every other
context.

* Another reason that I prefer "-inf" to "+inf" is that you can
imagine an implementation that makes pivot tuples into normalized
binary keys that are truncated using generic/opclass-agnostic logic,
and compared using strcmp(). If the scankey binary string is longer
than the pivot tuple, then it's greater according to strcmp() -- that
just works. And, you can truncate the original binary strings built
using opclass infrastructure without having to understand where
attributes begin and end (though this relies on encoding things like
NULL-ness a certain way). If we define truncation to be "+inf" now,
then none of this works.
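
As a minimal sketch of that last point (a hypothetical helper, not
something in the patch): truncating a normalized key by simply cutting
its byte string short gives it "minus infinity" behavior for free,
because a strict prefix always sorts lower under memcmp()/strcmp()
rules:

#include <string.h>

static int
toy_normkey_compare(const unsigned char *skey, size_t skeylen,
                    const unsigned char *pivot, size_t pivotlen)
{
    size_t  n = (skeylen < pivotlen) ? skeylen : pivotlen;
    int     c = memcmp(skey, pivot, n);

    if (c != 0)
        return c;

    /* A truncated pivot is a prefix of the search key, and so sorts lower */
    return (skeylen > pivotlen) - (skeylen < pivotlen);
}

Defining truncation as "+inf" would forfeit exactly that property.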

All of that said, maybe it would be clearer if page deletion was not
the special case that has to opt in to minusinfkey semantics. Perhaps
it would make more sense for everyone else to opt out of minusinfkey
semantics, and to get the !minusinfkey optimization as a result of
that. I only did it the other way around because that meant that only
nbtpage.c had to acknowledge the special case.

Even calling it minusinfkey is misleading in one way, because we're
not so much searching for "-inf" values as we are searching for the
first page that could have tuples for the untruncated attributes. But
isn't that how this has always worked, given that we've had to deal
with duplicate pivot tuples on the same level before now? As I said,
we're not doing an extra thing when minusinfkey is true (during page
deletion) -- it's the other way around. Saying that we're searching
for minus infinity values for the truncated attributes is kind of a
lie, although the search does behave that way.

That way, the search for page deletion wouldn't need to be treated
differently. That's also how this used to work: all tuples on a page
used to be <= high key, not strictly < high key.

That isn't accurate -- it still works that way on the leaf level. The
alternative that you've described is possible, I think, but the key
space works just the same with either of our approaches. You've merely
thought of an alternative way of generating new high keys that satisfy
the same invariants as my own scheme. Provided the new separator used
as the high key is >= the last item on the left and < the first item on
the right, everything works.
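
Stated directly, with ints standing in for index tuples (illustrative
only), the invariant that either scheme has to satisfy is just:

#include <stdbool.h>

static bool
toy_valid_left_high_key(int newhighkey, int lastleft, int firstright)
{
    return newhighkey >= lastleft && newhighkey < firstright;
}

A suffix-truncated separator taken from the firstright item satisfies
this just as well as a copy of the lastleft item does.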

As you point out, the original Lehman and Yao rule for leaf pages
(which Postgres kinda followed before) is that the high key is <=
items on the leaf level. But this patch makes nbtree follow that rule
fully and properly.

Maybe you noticed that amcheck tests < on internal pages, and only
checks <= on leaf pages. Perhaps it led you to believe that I did
things differently. Actually, this is classic Lehman and Yao. The keys
in internal pages are all "separators" as far as Lehman and Yao are
concerned, so the high key is less of a special case on internal
pages. We check < on internal pages because all separators are
supposed to be unique on a level. But, as I said, we do check <= on
the leaf level.

Take a look at "Fig. 7 A B-Link Tree" in the Lehman and Yao paper if
this is unclear. That shows that internal pages have unique keys -- we
can therefore expect the high key to be < items in internal pages. It
also shows that leaf pages copy the high key from the last item on the
left page -- we can expect the high key to be <= items there. Just
like with the patch, in effect. The comment from _bt_split() that you
quoted explains why what we do is like what Lehman and Yao do when
suffix truncation cannot truncate anything -- the new high key on the
left page comes from the last item on the left page.

And it would also make
the rightmost page less of a special case: you could pretend the
rightmost page has a pivot tuple with all columns truncated away, which
means positive infinity.

But we do already pretend that. How is that not the case already?

But it doesn't explain why it's done like that.

It's done this way because that's equivalent to what Lehman and Yao
do, while also avoiding adding the special cases that I mentioned (in
nbtsplitloc.c, and so on).

--
Peter Geoghegan

In reply to: Peter Geoghegan (#79)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan <pg@bowt.ie> wrote:

Question: Wouldn't it be more straightforward to use "1 +inf" as page
1's high key? I.e. treat any missing columns as positive infinity.

That might also work, but it wouldn't be more straightforward on
balance. This is because:

I thought of another reason:

* The 'Add high key "continuescan" optimization' is effective because
the high key of a leaf page tends to look relatively dissimilar to
other items on the page. The optimization would almost never help if
the high key was derived from the lastleft item at the time of a split
-- that's no more informative than the lastleft item itself.

As things stand with the patch, a high key usually has a value for its
last untruncated attribute that can only appear on the page to the
right, and never the current page. We'd quite like to be able to
conclude that the page to the right can't be interesting there and
then, without needing to visit it. Making new leaf high keys "as close
as possible to items on the right, without actually touching them"
makes the optimization quite likely to work out with the TPC-C
indexes, when we search for orderline items for an order that is
rightmost on a leaf page in the orderlines primary key.
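
A toy model of what that buys (illustrative names; two key columns
reduced to just the leading one, with an equality qual on it): a
truncated high key attribute has to be assumed to pass the qual, as in
the _bt_checkkeys change, but an untruncated leading value that already
exceeds the qual constant proves that the right sibling cannot contain
matches:

#include <stdbool.h>

static bool
toy_highkey_continuescan(const int *highkey, int nhkatts, int leading_qual)
{
    if (nhkatts < 1)
        return true;        /* attribute truncated: must visit right page */

    /* forward scan, qual "order_id = N": high key (N+1, <omitted>) stops us */
    return highkey[0] <= leading_qual;
}

With the TPC-C orderlines index, suffix truncation tends to leave the
high key as just the next order id with the line number truncated away,
so a scan for the order whose items end at that boundary stops there
instead of visiting the right sibling.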

And another reason:

* This makes it likely that any new items that would go between the
original lastleft and firstright items end up on the right page when
they're inserted after the lastleft/firstright split. This is
generally a good thing, because we've optimized WAL-logging for new
pages that go on the right. (You pointed this out to me in Lisbon, in
fact.)

--
Peter Geoghegan

In reply to: Peter Geoghegan (#79)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan <pg@bowt.ie> wrote:

All of that said, maybe it would be clearer if page deletion was not
the special case that has to opt in to minusinfkey semantics. Perhaps
it would make more sense for everyone else to opt out of minusinfkey
semantics, and to get the !minusinfkey optimization as a result of
that. I only did it the other way around because that meant that only
nbtpage.c had to acknowledge the special case.

This seems like a good idea -- we should reframe the !minusinfkey
optimization, without actually changing the behavior. Flip it around.

The minusinfkey field within the insertion scankey struct would be
called something like "descendrighttrunc" instead. Same idea, but with
the definition inverted. Most _bt_search() callers (all of those
outside of nbtpage.c and amcheck) would be required to opt in to that
optimization to get it.

Under this arrangement, nbtpage.c/page deletion would not ask for the
"descendrighttrunc" optimization, and would therefore continue to do
what it has always done: find the first leaf page that its insertion
scankey values could be on (we don't lie about searching for negative
infinity, or having a negative infinity sentinel value in scan key).
The only difference for page deletion between v3 indexes and v4
indexes is that with v4 indexes we'll relocate the same leaf page
reliably, since every separator key value is guaranteed to be unique
on its level (including the leaf level/leaf high keys). This is just a
detail, though, and not one that's even worth pointing out; we're not
*relying* on that being true on v4 indexes anyway (we check that the
block number is a match too, which is strictly necessary for v3
indexes and seems like a good idea for v4 indexes).

This is also good because it makes it clear that the unique index code
within _bt_doinsert() (that temporarily sets scantid to NULL) benefits
from the descendrighttrunc/!minusinfkey optimization -- it should be
"honest" and ask for it explicitly. We can make _bt_doinsert() opt in
to the optimization for unique indexes, but not for other indexes,
where scantid is set from the start. The
descendrighttrunc/!minusinfkey optimization cannot help when scantid
is set from the start, because we'll always have an attribute value in
insertion scankey that breaks the tie for us instead. We'll always
move right of a heap-TID-truncated separator key whose untruncated
attributes are all equal to a prefix of our insertion scankey values.

(This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for
unique indexes matters more than you might think -- we do really badly
with things like Zipfian distributions currently, and reducing the
contention goes some way towards helping with that. Postgres pro
noticed this a couple of years back, and analyzed it in detail at that
time. It's really nice that we very rarely have to move right within
code like _bt_check_unique() and _bt_findsplitloc() with the patch.)

Does that make sense to you? Can you live with the name
"descendrighttrunc", or do you have a better one?

Thanks
--
Peter Geoghegan

#82Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#81)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 08/03/2019 23:21, Peter Geoghegan wrote:

On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan <pg@bowt.ie> wrote:

All of that said, maybe it would be clearer if page deletion was not
the special case that has to opt in to minusinfkey semantics. Perhaps
it would make more sense for everyone else to opt out of minusinfkey
semantics, and to get the !minusinfkey optimization as a result of
that. I only did it the other way around because that meant that only
nbtpage.c had to acknowledge the special case.

This seems like a good idea -- we should reframe the !minusinfkey
optimization, without actually changing the behavior. Flip it around.

The minusinfkey field within the insertion scankey struct would be
called something like "descendrighttrunc" instead. Same idea, but with
the definition inverted. Most _bt_search() callers (all of those
outside of nbtpage.c and amcheck) would be required to opt in to that
optimization to get it.

Under this arrangement, nbtpage.c/page deletion would not ask for the
"descendrighttrunc" optimization, and would therefore continue to do
what it has always done: find the first leaf page that its insertion
scankey values could be on (we don't lie about searching for negative
infinity, or having a negative infinity sentinel value in scan key).
The only difference for page deletion between v3 indexes and v4
indexes is that with v4 indexes we'll relocate the same leaf page
reliably, since every separator key value is guaranteed to be unique
on its level (including the leaf level/leaf high keys). This is just a
detail, though, and not one that's even worth pointing out; we're not
*relying* on that being true on v4 indexes anyway (we check that the
block number is a match too, which is strictly necessary for v3
indexes and seems like a good idea for v4 indexes).

This is also good because it makes it clear that the unique index code
within _bt_doinsert() (that temporarily sets scantid to NULL) benefits
from the descendrighttrunc/!minusinfkey optimization -- it should be
"honest" and ask for it explicitly. We can make _bt_doinsert() opt in
to the optimization for unique indexes, but not for other indexes,
where scantid is set from the start. The
descendrighttrunc/!minusinfkey optimization cannot help when scantid
is set from the start, because we'll always have an attribute value in
insertion scankey that breaks the tie for us instead. We'll always
move right of a heap-TID-truncated separator key whose untruncated
attributes are all equal to a prefix of our insertion scankey values.

(This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for
unique indexes matters more than you might think -- we do really badly
with things like Zipfian distributions currently, and reducing the
contention goes some way towards helping with that. Postgres pro
noticed this a couple of years back, and analyzed it in detail at that
time. It's really nice that we very rarely have to move right within
code like _bt_check_unique() and _bt_findsplitloc() with the patch.)

Does that make sense to you? Can you live with the name
"descendrighttrunc", or do you have a better one?

"descendrighttrunc" doesn't make much sense to me, either. I don't
understand it. Maybe a comment would make it clear, though.

I don't feel like this is an optimization. It's a natural consequence of
what the high key means. I guess you can think of it as an optimization,
in the same way that not fully scanning the whole index for every search
is an optimization, but that's not how I think of it :-).

If we don't flip the meaning of the flag, then maybe calling it
something like "searching_for_leaf_page" would make sense:

/*
* When set, we're searching for the leaf page with the given high key,
* rather than leaf tuples matching the search keys.
*
* Normally, when !searching_for_pivot_tuple, if a page's high key
* has truncated columns, and the columns that are present are equal to
* the search key, the search will not descend to that page. For
* example, if an index has two columns, and a page's high key is
* ("foo", <omitted>), and the search key is also ("foo," <omitted>),
* the search will not descend to that page, but its right sibling. The
* omitted column in the high key means that all tuples on the page must
* be strictly < "foo", so we don't need to visit it. However, sometimes
* we perform a search to find the parent of a leaf page, using the leaf
* page's high key as the search key. In that case, when we search for
* ("foo", <omitted>), we do want to land on that page, not its right
* sibling.
*/
bool searching_for_leaf_page;

As the patch stands, you're also setting minusinfkey when dealing with
v3 indexes. I think it would be better to only set
searching_for_leaf_page in nbtpage.c. In general, I think BTScanInsert
should describe the search key, regardless of whether it's a V3 or V4
index. Properties of the index belong elsewhere. (We're violating that
by storing the 'heapkeyspace' flag in BTScanInsert. That wart is
probably OK, it is pretty convenient to have it there. But in principle...)

- Heikki

In reply to: Heikki Linnakangas (#82)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sun, Mar 10, 2019 at 7:09 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

"descendrighttrunc" doesn't make much sense to me, either. I don't
understand it. Maybe a comment would make it clear, though.

It's not an easily grasped concept. I don't think that any name will
easily convey the idea to the reader, though. I'm happy to go with
whatever name you prefer.

I don't feel like this is an optimization. It's a natural consequence of
what the high key means. I guess you can think of it as an optimization,
in the same way that not fully scanning the whole index for every search
is an optimization, but that's not how I think of it :-).

I would agree with this in a green field situation, where we don't
have to consider the legacy of v3 indexes. But that's not the case
here.

If we don't flip the meaning of the flag, then maybe calling it
something like "searching_for_leaf_page" would make sense:

/*
* When set, we're searching for the leaf page with the given high key,
* rather than leaf tuples matching the search keys.
*
* Normally, when !searching_for_pivot_tuple, if a page's high key

I guess you meant to say "searching_for_pivot_tuple" both times (not
"searching_for_leaf_page"). After all, we always search for a leaf
page. :-)

I'm fine with "searching_for_pivot_tuple", I think. The underscores
are not really stylistically consistent with other stuff in nbtree.h,
but I can use something very similar to your suggestion that is
consistent.

* has truncated columns, and the columns that are present are equal to
* the search key, the search will not descend to that page. For
* example, if an index has two columns, and a page's high key is
* ("foo", <omitted>), and the search key is also ("foo," <omitted>),
* the search will not descend to that page, but its right sibling. The
* omitted column in the high key means that all tuples on the page must
* be strictly < "foo", so we don't need to visit it. However, sometimes
* we perform a search to find the parent of a leaf page, using the leaf
* page's high key as the search key. In that case, when we search for
* ("foo", <omitted>), we do want to land on that page, not its right
* sibling.
*/
bool searching_for_leaf_page;

That works for me (assuming you meant searching_for_pivot_tuple).

As the patch stands, you're also setting minusinfkey when dealing with
v3 indexes. I think it would be better to only set
searching_for_leaf_page in nbtpage.c.

That would mean I would have to check both heapkeyspace and
minusinfkey within _bt_compare(). I would rather just keep the
assertion that makes sure that !heapkeyspace callers are also
minusinfkey callers, and the comments that explain why that is. It
might even matter to performance -- having an extra condition in
_bt_compare() is something we should avoid.

In general, I think BTScanInsert
should describe the search key, regardless of whether it's a V3 or V4
index. Properties of the index belong elsewhere. (We're violating that
by storing the 'heapkeyspace' flag in BTScanInsert. That wart is
probably OK, it is pretty convenient to have it there. But in principle...)

The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
they too have a heap TID attribute. nbtsearch.c code is not allowed to
rely on its value, though, and must use
minusinfkey/searching_for_pivot_tuple semantics (relying on its value
being minus infinity is still relying on its value being something).

Now, it's also true that there are a number of things that we have to
do within nbtinsert.c for !heapkeyspace that don't really concern the
key space as such. Even still, thinking about everything with
reference to the keyspace, and keeping that as similar as possible
between v3 and v4 is a good thing. It is up to high level code (such
as _bt_first()) to not allow a !heapkeyspace index scan to do
something that won't work for it. It is not up to low level code like
_bt_compare() to worry about these differences (beyond asserting that
caller got it right). If page deletion didn't need minusinfkey
semantics (if nobody but v3 indexes needed that), then I would just
make the "move right of separator" !minusinfkey code within
_bt_compare() test heapkeyspace. But we do have a general need for
minusinfkey semantics, so it seems simpler and more future-proof to
keep heapkeyspace out of low-level nbtsearch.c code.

--
Peter Geoghegan

#84Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#83)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 10/03/2019 20:53, Peter Geoghegan wrote:

On Sun, Mar 10, 2019 at 7:09 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

If we don't flip the meaning of the flag, then maybe calling it
something like "searching_for_leaf_page" would make sense:

/*
* When set, we're searching for the leaf page with the given high key,
* rather than leaf tuples matching the search keys.
*
* Normally, when !searching_for_pivot_tuple, if a page's high key

I guess you meant to say "searching_for_pivot_tuple" both times (not
"searching_for_leaf_page"). After all, we always search for a leaf
page. :-)

Ah, yeah. Not sure. I wrote it as "searching_for_pivot_tuple" first, but
changed to "searching_for_leaf_page" at the last minute. My thinking was
that in the page-deletion case, you're trying to re-locate a particular
leaf page. Otherwise, you're searching for matching tuples, not a
particular page. Although during insertion, I guess you are also
searching for a particular page, the page to insert to.

As the patch stands, you're also setting minusinfkey when dealing with
v3 indexes. I think it would be better to only set
searching_for_leaf_page in nbtpage.c.

That would mean I would have to check both heapkeyspace and
minusinfkey within _bt_compare().

Yeah.

I would rather just keep the
assertion that makes sure that !heapkeyspace callers are also
minusinfkey callers, and the comments that explain why that is. It
might even matter to performance -- having an extra condition in
_bt_compare() is something we should avoid.

It's a hot codepath, but I doubt it's *that* hot that it matters,
performance-wise...

In general, I think BTScanInsert
should describe the search key, regardless of whether it's a V3 or V4
index. Properties of the index belong elsewhere. (We're violating that
by storing the 'heapkeyspace' flag in BTScanInsert. That wart is
probably OK, it is pretty convenient to have it there. But in principle...)

The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
they too have a heap TID attribute. nbtsearch.c code is not allowed to
rely on its value, though, and must use
minusinfkey/searching_for_pivot_tuple semantics (relying on its value
being minus infinity is still relying on its value being something).

Yeah. I find that's a complicated way to think about it. My mental model
is that v4 indexes store heap TIDs, and every tuple is unique thanks to
that. But on v3, we don't store heap TIDs, and duplicates are possible.

- Heikki

In reply to: Heikki Linnakangas (#84)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sun, Mar 10, 2019 at 12:53 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Ah, yeah. Not sure. I wrote it as "searching_for_pivot_tuple" first, but
changed to "searching_for_leaf_page" at the last minute. My thinking was
that in the page-deletion case, you're trying to re-locate a particular
leaf page. Otherwise, you're searching for matching tuples, not a
particular page. Although during insertion, I guess you are also
searching for a particular page, the page to insert to.

I prefer something like "searching_for_pivot_tuple", because it's
unambiguous. Okay with that?

It's a hot codepath, but I doubt it's *that* hot that it matters,
performance-wise...

I'll figure that out, although I am currently looking into a
regression in workloads that fit in shared_buffers, which my
micro-benchmarks didn't catch initially. Indexes are still much
smaller, but we get a ~2% regression all the same. OTOH, we get a
7.5%+ increase in throughput when the workload is I/O bound, and
latency is generally no worse and even better with any workload.

I suspect that the nice top-down approach to nbtsplitloc.c has its
costs...will let you know more when I know more.

The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
they too have a heap TID attribute. nbtsearch.c code is not allowed to
rely on its value, though, and must use
minusinfkey/searching_for_pivot_tuple semantics (relying on its value
being minus infinity is still relying on its value being something).

Yeah. I find that's a complicated way to think about it. My mental model
is that v4 indexes store heap TIDs, and every tuple is unique thanks to
that. But on v3, we don't store heap TIDs, and duplicates are possible.

I'll try it that way, then.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#85)
7 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sun, Mar 10, 2019 at 1:11 PM Peter Geoghegan <pg@bowt.ie> wrote:

The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
they too have a heap TID attribute. nbtsearch.c code is not allowed to
rely on its value, though, and must use
minusinfkey/searching_for_pivot_tuple semantics (relying on its value
being minus infinity is still relying on its value being something).

Yeah. I find that's a complicated way to think about it. My mental model
is that v4 indexes store heap TIDs, and every tuple is unique thanks to
that. But on v3, we don't store heap TIDs, and duplicates are possible.

I'll try it that way, then.

Attached is v16, which does it that way instead. There are simpler
comments, still located within _bt_compare(). These are based on your
suggested wording, with some changes. I think that I prefer it this
way too. Please let me know what you think.

Other changes:

* nbtsplitloc.c failed to consider the full range of values in the
split interval when deciding perfect penalty. It considered from the
middle to the left or right edge, rather than from the left edge to
the right edge. This didn't seem to really effect the quality of its
decisions very much, but it was still wrong. This is fixed by a new
function that determines the left and right edges of the split
interval -- _bt_interval_edges().

* We now record the smallest observed tuple during our pass over the
page to record split points. This is used by internal page splits, to
get a more useful "perfect penalty", saving cycles in the common case
where there isn't much variability in the size of tuples on the page
being split. The same field is used within the "split after new item"
optimization as a further crosscheck -- it's now impossible to fool it
into thinking that the page has equisized tuples.

The regression that I mentioned earlier isn't in pgbench type
workloads (even when the distribution is something more interesting
that the uniform distribution default). It is only in workloads with
lots of page splits and lots of index churn, where we get most of the
benefit of the patch, but also where the costs are most apparent.
Hopefully it can be fixed, but if not I'm inclined to think that it's
a price worth paying. This certainly still needs further analysis and
discussion, though. This revision of the patch does not attempt to
address that problem in any way.

--
Peter Geoghegan

Attachments:

v16-0007-DEBUG-Add-pageinspect-instrumentation.patch (application/octet-stream)
From b5ef0ae9a8b3ece4e639863920a994c9eb9a2019 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v16 7/7] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 68 +++++++++++++++----
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 ++++++
 3 files changed, 79 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..95c81c0808 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -254,9 +256,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[7];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -265,6 +267,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer htid;
+	BTPageOpaque opaque;
 
 	id = PageGetItemId(page, offset);
 
@@ -283,16 +287,53 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = natts;
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		indnkeyatts = rel->rd_index->indnkeyatts;
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (P_ISLEAF(opaque) && offset >= P_FIRSTDATAKEY(opaque))
+		htid = &itup->t_tid;
+	else if (_bt_heapkeyspace(rel))
+		htid = BTreeTupleGetHeapTID(itup);
+	else
+		htid = NULL;
+
+	if (htid)
+		values[j] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(htid),
+							 ItemPointerGetOffsetNumberNoCheck(htid));
+	else
+		values[j] = NULL;
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -366,11 +407,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -397,12 +438,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -482,7 +524,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

v16-0004-Allow-tuples-to-be-relocated-from-root-by-amchec.patch (application/octet-stream)
From 6a0f2f3945caeacb4c99992d8bfa7f7d7b027fc7 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 31 Jan 2019 17:40:00 -0800
Subject: [PATCH v16 4/7] Allow tuples to be relocated from root by amcheck.

Teach contrib/amcheck's bt_index_parent_check() function to take
advantage of the uniqueness property of heapkeyspace indexes in support
of a new verification option: non-pivot tuples (non-highkey tuples on
the leaf level) can optionally be relocated using a new search that
starts from the root page.

The new "relocate" verification option is exhaustive, and can therefore
make a call to bt_index_parent_check() take a lot longer.  Relocating
tuples during verification is intended as an option for backend
developers, since the corruption scenarios that it alone is uniquely
capable of detecting seem fairly far-fetched.  For example, "relocate"
verification is generally the only way of detecting corruption of the
least significant byte of a key from a pivot tuple in the root page,
since only a few tuples on a cousin leaf page are liable to "be
overlooked" by index scans.
---
 contrib/amcheck/Makefile                 |   2 +-
 contrib/amcheck/amcheck--1.1--1.2.sql    |  19 ++++
 contrib/amcheck/amcheck.control          |   2 +-
 contrib/amcheck/expected/check_btree.out |   5 +-
 contrib/amcheck/sql/check_btree.sql      |   5 +-
 contrib/amcheck/verify_nbtree.c          | 131 +++++++++++++++++++++--
 doc/src/sgml/amcheck.sgml                |   7 +-
 7 files changed, 155 insertions(+), 16 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.1--1.2.sql

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index c5764b544f..dcec3b8520 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big	= amcheck
 OBJS		= verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
-DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
+DATA = amcheck--1.1--1.2.sql amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.1--1.2.sql b/contrib/amcheck/amcheck--1.1--1.2.sql
new file mode 100644
index 0000000000..de7b657f2f
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.1--1.2.sql
@@ -0,0 +1,19 @@
+/* contrib/amcheck/amcheck--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.2'" to load this file. \quit
+
+-- In order to avoid issues with dependencies when updating amcheck to 1.2,
+-- create new, overloaded version of the 1.1 function signature
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass,
+    heapallindexed boolean, relocate boolean)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want this to be available to public
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 469048403d..c6e310046d 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
 # amcheck extension
 comment = 'functions for verifying relation integrity'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/amcheck'
 relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 1e6079ddd2..687fde8fce 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -126,7 +126,8 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 (1 row)
 
 --
--- Test for multilevel page deletion/downlink present checks
+-- Test for multilevel page deletion/downlink present checks, and relocate
+-- checks
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
@@ -137,7 +138,7 @@ VACUUM delete_test_table;
 -- root"
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
-SELECT bt_index_parent_check('delete_test_table_pkey', true);
+SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
  bt_index_parent_check 
 -----------------------
  
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index 3f1e0d17ef..d33d3e6682 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -78,7 +78,8 @@ INSERT INTO bttest_multi SELECT i, i%2  FROM generate_series(1, 100000) as i;
 SELECT bt_index_parent_check('bttest_multi_idx', true);
 
 --
--- Test for multilevel page deletion/downlink present checks
+-- Test for multilevel page deletion/downlink present checks, and relocate
+-- checks
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
@@ -89,7 +90,7 @@ VACUUM delete_test_table;
 -- root"
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
-SELECT bt_index_parent_check('delete_test_table_pkey', true);
+SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
 
 --
 -- BUG #15597: must not assume consistent input toasting state when forming
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index e769de3690..aa4b0c9049 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -74,6 +74,8 @@ typedef struct BtreeCheckState
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
 	bool		heapallindexed;
+	/* Also relocating non-pivot tuples? */
+	bool		relocate;
 	/* Per-page context */
 	MemoryContext targetcontext;
 	/* Buffer access strategy */
@@ -123,10 +125,11 @@ PG_FUNCTION_INFO_V1(bt_index_check);
 PG_FUNCTION_INFO_V1(bt_index_parent_check);
 
 static void bt_index_check_internal(Oid indrelid, bool parentcheck,
-						bool heapallindexed);
+						bool heapallindexed, bool relocate);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool heapkeyspace, bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed,
+					 bool relocate);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -139,6 +142,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 						   IndexTuple itup);
+static bool bt_relocate_from_root(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
@@ -176,7 +180,7 @@ bt_index_check(PG_FUNCTION_ARGS)
 	if (PG_NARGS() == 2)
 		heapallindexed = PG_GETARG_BOOL(1);
 
-	bt_index_check_internal(indrelid, false, heapallindexed);
+	bt_index_check_internal(indrelid, false, heapallindexed, false);
 
 	PG_RETURN_VOID();
 }
@@ -195,11 +199,14 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
 {
 	Oid			indrelid = PG_GETARG_OID(0);
 	bool		heapallindexed = false;
+	bool		relocate = false;
 
-	if (PG_NARGS() == 2)
+	if (PG_NARGS() >= 2)
 		heapallindexed = PG_GETARG_BOOL(1);
+	if (PG_NARGS() == 3)
+		relocate = PG_GETARG_BOOL(2);
 
-	bt_index_check_internal(indrelid, true, heapallindexed);
+	bt_index_check_internal(indrelid, true, heapallindexed, relocate);
 
 	PG_RETURN_VOID();
 }
@@ -208,7 +215,8 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
  * Helper for bt_index_[parent_]check, coordinating the bulk of the work.
  */
 static void
-bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
+						bool relocate)
 {
 	Oid			heapid;
 	Relation	indrel;
@@ -266,7 +274,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	/* Check index, possibly against table it is an index on */
 	heapkeyspace = _bt_heapkeyspace(indrel);
 	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
-						 heapallindexed);
+						 heapallindexed, relocate);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -337,7 +345,7 @@ btree_index_checkable(Relation rel)
  */
 static void
 bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
-					 bool readonly, bool heapallindexed)
+					 bool readonly, bool heapallindexed, bool relocate)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -361,6 +369,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
+	state->relocate = relocate;
 
 	if (state->heapallindexed)
 	{
@@ -429,6 +438,14 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		}
 	}
 
+	Assert(!state->relocate || state->readonly);
+	if (state->relocate && !state->heapkeyspace)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("index \"%s\" does not support relocating tuples",
+						RelationGetRelationName(rel)),
+				 errhint("Only indexes initialized on PostgreSQL 12 support relocation verification.")));
+
 	/* Create context for page */
 	state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
 												 "amcheck context",
@@ -921,6 +938,32 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
+		/*
+		 * Readonly callers may optionally relocate non-pivot tuples for
+		 * heapkeyspace indexes.  A new search starting from the root
+		 * relocates every current entry in turn.
+		 */
+		if (state->relocate && P_ISLEAF(topaque) &&
+			!bt_relocate_from_root(state, itup))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("could not relocate tuple in index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+										itid, htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1525,6 +1568,9 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 		 * internal pages.  In more general terms, a negative infinity item is
 		 * only negative infinity with respect to the subtree that the page is
 		 * at the root of.
+		 *
+		 * See also: bt_relocate_from_root(), which can even detect transitive
+		 * inconsistencies on cousin leaf pages.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
@@ -1925,6 +1971,75 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Search for itup in index, starting from fast root page.  itup must be a
+ * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
+ * we rely on having fully unique keys to relocate itup without visiting more
+ * than one page on each level, barring an interrupted page split, where we
+ * may have to move right.  (A concurrent page split is impossible because
+ * the caller must be a readonly caller.)
+ *
+ * This routine can detect very subtle transitive consistency issues across
+ * more than one level of the tree.  Leaf pages all have a high key (even the
+ * rightmost page has a conceptual positive infinity high key), but not a low
+ * key.  Their downlink in parent is a lower bound, which along with the high
+ * key is almost enough to detect every possible inconsistency.  A downlink
+ * separator key value won't always be available from parent, though, because
+ * the first items of internal pages are negative infinity items, truncated
+ * down to zero attributes during internal page splits.  While it's true that
+ * bt_downlink_check() and the high key check can detect most imaginable key
+ * space problems, there are remaining problems it won't detect with non-pivot
+ * tuples in cousin leaf pages.  Starting a search from the root for every
+ * existing leaf tuple detects small inconsistencies in upper levels of the
+ * tree that cannot be detected any other way.  (Besides all this, it's
+ * probably a useful testing strategy to exhaustively verify that all
+ * non-pivot tuples can be relocated in the index using the same code paths as
+ * those used by index scans.)
+ */
+static bool
+bt_relocate_from_root(BtreeCheckState *state, IndexTuple itup)
+{
+	BTScanInsert key;
+	BTStack		stack;
+	Buffer		lbuf;
+	bool		exists;
+
+	key = _bt_mkscankey(state->rel, itup);
+	Assert(key->heapkeyspace && key->scantid != NULL);
+
+	/*
+	 * Search from root.
+	 *
+	 * Ideally, we would arrange to only move right within _bt_search() when
+	 * an interrupted page split is detected (i.e. when the incomplete split
+	 * bit is found to be set), but for now we accept the possibility that
+	 * that could conceal an inconsistency.
+	 */
+	Assert(state->readonly && state->relocate);
+	exists = false;
+	stack = _bt_search(state->rel, key, &lbuf, BT_READ, NULL);
+
+	if (BufferIsValid(lbuf))
+	{
+		OffsetNumber offnum;
+		Page		page;
+
+		/* Get matching tuple on leaf page */
+		offnum = _bt_binsrch(state->rel, key, lbuf);
+		/* Compare first >= matching item on leaf page, if any */
+		page = BufferGetPage(lbuf);
+		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			_bt_compare(state->rel, key, page, offnum) == 0)
+			exists = true;
+		_bt_relbuf(state->rel, lbuf);
+	}
+
+	_bt_freestack(stack);
+	pfree(key);
+
+	return exists;
+}
+
 /*
  * Is particular offset within page (whose special state is passed by caller)
  * the page negative-infinity item?
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 8bb60d5c2d..c638456638 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -112,7 +112,7 @@ ORDER BY c.relpages DESC LIMIT 10;
 
    <varlistentry>
     <term>
-     <function>bt_index_parent_check(index regclass, heapallindexed boolean) returns void</function>
+     <function>bt_index_parent_check(index regclass, heapallindexed boolean, relocate boolean) returns void</function>
      <indexterm>
       <primary>bt_index_parent_check</primary>
      </indexterm>
@@ -126,7 +126,10 @@ ORDER BY c.relpages DESC LIMIT 10;
       argument is <literal>true</literal>, the function verifies the
       presence of all heap tuples that should be found within the
       index, and that there are no missing downlinks in the index
-      structure.  The checks that can be performed by
+      structure.  When the optional <parameter>relocate</parameter>
+      argument is <literal>true</literal>, verification relocates each
+      tuple on the leaf level by performing a new search from the
+      root page.  The checks that can be performed by
       <function>bt_index_parent_check</function> are a superset of the
       checks that can be performed by <function>bt_index_check</function>.
       <function>bt_index_parent_check</function> can be thought of as
-- 
2.17.1
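
To make it concrete why a root-to-leaf search can always pin down one
specific leaf tuple here, the following toy standalone program (not code
from the patch; the struct, field names, and values are invented purely
for illustration) mimics the heapkeyspace comparison rule: user key
attributes are compared first, and the heap TID acts as a trailing
tie-breaker attribute, so even exact duplicates have exactly one legal
position that a binary search can find.

#include <stdio.h>
#include <stdlib.h>

/* Toy model of a leaf entry: one user key plus its heap TID */
typedef struct ToyEntry
{
	int			key;			/* user-visible key attribute */
	unsigned	tid_block;		/* heap TID block number */
	unsigned	tid_offset;		/* heap TID offset number */
} ToyEntry;

/*
 * Compare the way a heapkeyspace index would: user key first, heap TID
 * as an implicit last attribute.  No two entries ever compare as equal.
 */
static int
toy_compare(const void *a, const void *b)
{
	const ToyEntry *e1 = a;
	const ToyEntry *e2 = b;

	if (e1->key != e2->key)
		return (e1->key < e2->key) ? -1 : 1;
	if (e1->tid_block != e2->tid_block)
		return (e1->tid_block < e2->tid_block) ? -1 : 1;
	if (e1->tid_offset != e2->tid_offset)
		return (e1->tid_offset < e2->tid_offset) ? -1 : 1;
	return 0;
}

int
main(void)
{
	/* Three "duplicates" of key 42, distinguished only by heap TID */
	ToyEntry	leaf[] = {{42, 10, 3}, {42, 10, 7}, {42, 11, 1}};
	ToyEntry	probe = {42, 10, 7};
	ToyEntry   *hit;

	qsort(leaf, 3, sizeof(ToyEntry), toy_compare);
	hit = bsearch(&probe, leaf, 3, sizeof(ToyEntry), toy_compare);
	if (hit != NULL)
		printf("relocated duplicate at heap tid (%u,%u)\n",
			   hit->tid_block, hit->tid_offset);
	return 0;
}

The same uniqueness property is what lets bt_relocate_from_root() visit
only one page per level, barring an interrupted page split.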

v16-0003-Consider-secondary-factors-during-nbtree-splits.patch
From fbc19392835551dc4e5edeb84beb96cf46be3c0e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 15:51:53 -0700
Subject: [PATCH v16 3/7] Consider secondary factors during nbtree splits.

Teach nbtree to give some consideration to how "distinguishing"
candidate leaf page split points are.  This should not noticeably affect
the balance of free space within each half of the split, while still
making suffix truncation truncate away significantly more attributes on
average.

The logic for choosing a leaf split point now uses a fallback mode in
the case where the page is full of duplicates and it isn't possible to
find even a minimally distinguishing split point.  When the page is full
of duplicates, the split should pack the left half very tightly, while
leaving the right half mostly empty.  Our assumption is that logical
duplicates will almost always be inserted in ascending heap TID order
with v4 indexes.  This strategy leaves most of the free space on the
half of the split that will likely be where future logical duplicates of
the same value need to be placed.

The number of cycles added is not very noticeable.  This is important
because deciding on a split point takes place while at least one
exclusive buffer lock is held.  We avoid using authoritative insertion
scankey comparisons to save cycles, unlike suffix truncation proper.  We
use a faster binary comparison instead.

Note that even pre-pg_upgrade'd v3 indexes make use of these
optimizations.  Benchmarking has shown that even v3 indexes benefit,
despite the fact that suffix truncation will only truncate non-key
attributes in INCLUDE indexes.  Grouping relatively similar tuples
together is beneficial in and of itself, since it reduces the number of
leaf pages that must be accessed by subsequent index scans.
---
 src/backend/access/nbtree/Makefile      |   2 +-
 src/backend/access/nbtree/README        |  47 +-
 src/backend/access/nbtree/nbtinsert.c   | 290 +-------
 src/backend/access/nbtree/nbtsplitloc.c | 846 ++++++++++++++++++++++++
 src/backend/access/nbtree/nbtutils.c    |  49 ++
 src/include/access/nbtree.h             |  15 +-
 6 files changed, 956 insertions(+), 293 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtsplitloc.c

diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bbb21d235c..9aab9cf64a 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = nbtcompare.o nbtinsert.o nbtpage.o nbtree.o nbtsearch.o \
-       nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
+       nbtsplitloc.o nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 40ff25fe06..ca4fdf7ac4 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -155,9 +155,9 @@ Lehman and Yao assume fixed-size keys, but we must deal with
 variable-size keys.  Therefore there is not a fixed maximum number of
 keys per page; we just stuff in as many as will fit.  When we split a
 page, we try to equalize the number of bytes, not items, assigned to
-each of the resulting pages.  Note we must include the incoming item in
-this calculation, otherwise it is possible to find that the incoming
-item doesn't fit on the split page where it needs to go!
+pages (though suffix truncation is also considered).  Note we must include
+the incoming item in this calculation, otherwise it is possible to find
+that the incoming item doesn't fit on the split page where it needs to go!
 
 The Deletion Algorithm
 ----------------------
@@ -661,6 +661,47 @@ variable-length types, such as text.  An opclass support function could
 manufacture the shortest possible key value that still correctly separates
 each half of a leaf page split.
 
+There are sophisticated criteria for choosing a leaf page split point.  The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split.  The
+choice of leaf split point can be thought of as a choice among points
+*between* items on the page to be split, at least if you pretend that the
+incoming tuple was placed on the page already (you have to pretend because
+there won't actually be enough space for it on the page).  Choosing the
+split point between two index tuples where the first non-equal attribute
+appears as early as possible results in truncating away as many suffix
+attributes as possible.  Evenly balancing space between the two halves of the
+split is usually the first concern, but even small adjustments in the
+precise split point can allow truncation to be far more effective.
+
+Suffix truncation is primarily valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective.  Even truncation that doesn't make pivot tuples
+smaller due to alignment still prevents pivot tuples from being more
+restrictive than truly necessary in how they describe which values belong
+on which pages.
+
+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page.  Within an acceptable range of the fillfactor-wise optimal
+split point, the candidate chosen is the one that implies the smallest
+downlink to be inserted in the parent.  This idea also comes
+from the Prefix B-Tree paper.  This process has much in common with what
+happens at the leaf level to make suffix truncation effective.  The overall
+effect is that suffix truncation tends to produce smaller, more
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, smaller pivot tuples
+end up in the root page, delaying root page splits.
+
+Logical duplicates are given special consideration.  The logic for
+selecting a split point goes to great lengths to avoid having duplicates
+span more than one page, and almost always manages to pick a split point
+between two user-key-distinct tuples, accepting a completely lopsided split
+if it must.  When a page that's already full of duplicates must be split,
+the fallback strategy assumes that duplicates are mostly inserted in
+ascending heap TID order.  The page is split in a way that leaves the left
+half of the page mostly full, and the right half of the page mostly empty.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 80f695229b..1367050718 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,26 +28,6 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
-typedef struct
-{
-	/* context data for _bt_checksplitloc */
-	Size		newitemsz;		/* size of new item to be inserted */
-	int			fillfactor;		/* needed when splitting rightmost page */
-	bool		is_leaf;		/* T if splitting a leaf page */
-	bool		is_rightmost;	/* T if splitting a rightmost page */
-	OffsetNumber newitemoff;	/* where the new item is to be inserted */
-	int			leftspace;		/* space available for items on left page */
-	int			rightspace;		/* space available for items on right page */
-	int			olddataitemstotal;	/* space taken by old items */
-
-	bool		have_split;		/* found a valid split? */
-
-	/* these fields valid only if have_split is true */
-	bool		newitemonleft;	/* new item on left or right of best split */
-	OffsetNumber firstright;	/* best split point */
-	int			best_delta;		/* best size delta so far */
-} FindSplitData;
-
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -76,13 +56,6 @@ static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
-static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft);
-static void _bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright, bool newitemonleft,
-				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key,
@@ -920,8 +893,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
  *
  *		This recursive procedure does the following things:
  *
- *			+  if necessary, splits the target page (making sure that the
- *			   split is equitable as far as post-insert free space goes).
+ *			+  if necessary, splits the target page.
  *			+  inserts the tuple.
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
@@ -1016,7 +988,7 @@ _bt_insertonpg(Relation rel,
 
 		/* Choose the split point */
 		firstright = _bt_findsplitloc(rel, page,
-									  newitemoff, itemsz,
+									  newitemoff, itemsz, itup,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
@@ -1700,264 +1672,6 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	return rbuf;
 }
 
-/*
- *	_bt_findsplitloc() -- find an appropriate place to split a page.
- *
- * The idea here is to equalize the free space that will be on each split
- * page, *after accounting for the inserted tuple*.  (If we fail to account
- * for it, we might find ourselves with too little room on the page that
- * it needs to go into!)
- *
- * If the page is the rightmost page on its level, we instead try to arrange
- * to leave the left split page fillfactor% full.  In this way, when we are
- * inserting successively increasing keys (consider sequences, timestamps,
- * etc) we will end up with a tree whose pages are about fillfactor% full,
- * instead of the 50% full result that we'd get without this special case.
- * This is the same as nbtsort.c produces for a newly-created tree.  Note
- * that leaf and nonleaf pages use different fillfactors.
- *
- * We are passed the intended insert position of the new tuple, expressed as
- * the offsetnumber of the tuple it must go in front of.  (This could be
- * maxoff+1 if the tuple is to go at the end.)
- *
- * We return the index of the first existing tuple that should go on the
- * righthand page, plus a boolean indicating whether the new tuple goes on
- * the left or right page.  The bool is necessary to disambiguate the case
- * where firstright == newitemoff.
- */
-static OffsetNumber
-_bt_findsplitloc(Relation rel,
-				 Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft)
-{
-	BTPageOpaque opaque;
-	OffsetNumber offnum;
-	OffsetNumber maxoff;
-	ItemId		itemid;
-	FindSplitData state;
-	int			leftspace,
-				rightspace,
-				goodenough,
-				olddataitemstotal,
-				olddataitemstoleft;
-	bool		goodenoughfound;
-
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
-	newitemsz += sizeof(ItemIdData);
-
-	/* Total free space available on a btree page, after fixed overhead */
-	leftspace = rightspace =
-		PageGetPageSize(page) - SizeOfPageHeaderData -
-		MAXALIGN(sizeof(BTPageOpaqueData));
-
-	/* The right page will have the same high key as the old page */
-	if (!P_RIGHTMOST(opaque))
-	{
-		itemid = PageGetItemId(page, P_HIKEY);
-		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
-							 sizeof(ItemIdData));
-	}
-
-	/* Count up total space in data items without actually scanning 'em */
-	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
-
-	state.newitemsz = newitemsz;
-	state.is_leaf = P_ISLEAF(opaque);
-	state.is_rightmost = P_RIGHTMOST(opaque);
-	state.have_split = false;
-	if (state.is_leaf)
-		state.fillfactor = RelationGetFillFactor(rel,
-												 BTREE_DEFAULT_FILLFACTOR);
-	else
-		state.fillfactor = BTREE_NONLEAF_FILLFACTOR;
-	state.newitemonleft = false;	/* these just to keep compiler quiet */
-	state.firstright = 0;
-	state.best_delta = 0;
-	state.leftspace = leftspace;
-	state.rightspace = rightspace;
-	state.olddataitemstotal = olddataitemstotal;
-	state.newitemoff = newitemoff;
-
-	/*
-	 * Finding the best possible split would require checking all the possible
-	 * split points, because of the high-key and left-key special cases.
-	 * That's probably more work than it's worth; instead, stop as soon as we
-	 * find a "good-enough" split, where good-enough is defined as an
-	 * imbalance in free space of no more than pagesize/16 (arbitrary...) This
-	 * should let us stop near the middle on most pages, instead of plowing to
-	 * the end.
-	 */
-	goodenough = leftspace / 16;
-
-	/*
-	 * Scan through the data items and calculate space usage for a split at
-	 * each possible position.
-	 */
-	olddataitemstoleft = 0;
-	goodenoughfound = false;
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	for (offnum = P_FIRSTDATAKEY(opaque);
-		 offnum <= maxoff;
-		 offnum = OffsetNumberNext(offnum))
-	{
-		Size		itemsz;
-
-		itemid = PageGetItemId(page, offnum);
-		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
-
-		/*
-		 * Will the new item go to left or right of split?
-		 */
-		if (offnum > newitemoff)
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-		else if (offnum < newitemoff)
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		else
-		{
-			/* need to try it both ways! */
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		}
-
-		/* Abort scan once we find a good-enough choice */
-		if (state.have_split && state.best_delta <= goodenough)
-		{
-			goodenoughfound = true;
-			break;
-		}
-
-		olddataitemstoleft += itemsz;
-	}
-
-	/*
-	 * If the new item goes as the last item, check for splitting so that all
-	 * the old items go to the left page and the new item goes to the right
-	 * page.
-	 */
-	if (newitemoff > maxoff && !goodenoughfound)
-		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
-
-	/*
-	 * I believe it is not possible to fail to find a feasible split, but just
-	 * in case ...
-	 */
-	if (!state.have_split)
-		elog(ERROR, "could not find a feasible split point for index \"%s\"",
-			 RelationGetRelationName(rel));
-
-	*newitemonleft = state.newitemonleft;
-	return state.firstright;
-}
-
-/*
- * Subroutine to analyze a particular possible split choice (ie, firstright
- * and newitemonleft settings), and record the best split so far in *state.
- *
- * firstoldonright is the offset of the first item on the original page
- * that goes to the right page, and firstoldonrightsz is the size of that
- * tuple. firstoldonright can be > max offset, which means that all the old
- * items go to the left page and only the new item goes to the right page.
- * In that case, firstoldonrightsz is not used.
- *
- * olddataitemstoleft is the total size of all old items to the left of
- * firstoldonright.
- */
-static void
-_bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright,
-				  bool newitemonleft,
-				  int olddataitemstoleft,
-				  Size firstoldonrightsz)
-{
-	int			leftfree,
-				rightfree;
-	Size		firstrightitemsz;
-	bool		newitemisfirstonright;
-
-	/* Is the new item going to be the first item on the right page? */
-	newitemisfirstonright = (firstoldonright == state->newitemoff
-							 && !newitemonleft);
-
-	if (newitemisfirstonright)
-		firstrightitemsz = state->newitemsz;
-	else
-		firstrightitemsz = firstoldonrightsz;
-
-	/* Account for all the old tuples */
-	leftfree = state->leftspace - olddataitemstoleft;
-	rightfree = state->rightspace -
-		(state->olddataitemstotal - olddataitemstoleft);
-
-	/*
-	 * The first item on the right page becomes the high key of the left page;
-	 * therefore it counts against left space as well as right space. When
-	 * index has included attributes, then those attributes of left page high
-	 * key will be truncated leaving that page with slightly more free space.
-	 * However, that shouldn't affect our ability to find valid split
-	 * location, because anyway split location should exists even without high
-	 * key truncation.
-	 */
-	leftfree -= firstrightitemsz;
-
-	/* account for the new item */
-	if (newitemonleft)
-		leftfree -= (int) state->newitemsz;
-	else
-		rightfree -= (int) state->newitemsz;
-
-	/*
-	 * If we are not on the leaf level, we will be able to discard the key
-	 * data from the first item that winds up on the right page.
-	 */
-	if (!state->is_leaf)
-		rightfree += (int) firstrightitemsz -
-			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
-
-	/*
-	 * If feasible split point, remember best delta.
-	 */
-	if (leftfree >= 0 && rightfree >= 0)
-	{
-		int			delta;
-
-		if (state->is_rightmost)
-		{
-			/*
-			 * If splitting a rightmost page, try to put (100-fillfactor)% of
-			 * free space on left page. See comments for _bt_findsplitloc.
-			 */
-			delta = (state->fillfactor * leftfree)
-				- ((100 - state->fillfactor) * rightfree);
-		}
-		else
-		{
-			/* Otherwise, aim for equal free space on both sides */
-			delta = leftfree - rightfree;
-		}
-
-		if (delta < 0)
-			delta = -delta;
-		if (!state->have_split || delta < state->best_delta)
-		{
-			state->have_split = true;
-			state->newitemonleft = newitemonleft;
-			state->firstright = firstoldonright;
-			state->best_delta = delta;
-		}
-	}
-}
-
 /*
  * _bt_insert_parent() -- Insert downlink into parent after a page split.
  *
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
new file mode 100644
index 0000000000..ead218d916
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -0,0 +1,846 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsplitloc.c
+ *	  Choose split point code for Postgres btree implementation.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtsplitloc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "storage/lmgr.h"
+
+/* limits on split interval (default strategy only) */
+#define MAX_LEAF_INTERVAL			9
+#define MAX_INTERNAL_INTERVAL		18
+
+typedef enum
+{
+	/* strategy for searching through materialized list of split points */
+	SPLIT_DEFAULT,				/* give some weight to truncation */
+	SPLIT_MANY_DUPLICATES,		/* find minimally distinguishing point */
+	SPLIT_SINGLE_VALUE			/* leave left page almost full */
+} FindSplitStrat;
+
+typedef struct
+{
+	/* details of free space left by split */
+	int16		curdelta;		/* current leftfree/rightfree delta */
+	int16		leftfree;		/* space left on left page post-split */
+	int16		rightfree;		/* space left on right page post-split */
+
+	/* split point identifying fields (returned by _bt_findsplitloc) */
+	OffsetNumber firstoldonright;	/* first item on new right page */
+	bool		newitemonleft;	/* new item goes on left, or right? */
+
+} SplitPoint;
+
+typedef struct
+{
+	/* context data for _bt_recsplitloc */
+	Relation	rel;			/* index relation */
+	Page		page;			/* page undergoing split */
+	IndexTuple	newitem;		/* new item (cause of page split) */
+	Size		newitemsz;		/* size of newitem (includes line pointer) */
+	bool		is_leaf;		/* T if splitting a leaf page */
+	bool		is_rightmost;	/* T if splitting rightmost page on level */
+	OffsetNumber newitemoff;	/* where the new item is to be inserted */
+	int			leftspace;		/* space available for items on left page */
+	int			rightspace;		/* space available for items on right page */
+	int			olddataitemstotal;	/* space taken by old items */
+	Size		minfirstrightsz;	/* smallest firstoldonright tuple size */
+
+	/* candidate split point data */
+	int			maxsplits;		/* maximum number of splits */
+	int			nsplits;		/* current number of splits */
+	SplitPoint *splits;			/* all candidate split points for page */
+	int			interval;		/* current range of acceptable split points */
+} FindSplitData;
+
+static void _bt_recsplitloc(FindSplitData *state,
+				OffsetNumber firstoldonright, bool newitemonleft,
+				int olddataitemstoleft, Size firstoldonrightsz);
+static void _bt_deltasortsplits(FindSplitData *state, double fillfactormult,
+					bool usemult);
+static int	_bt_splitcmp(const void *arg1, const void *arg2);
+static OffsetNumber _bt_bestsplitloc(FindSplitData *state, int perfectpenalty,
+				 bool *newitemonleft);
+static int _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
+			 SplitPoint *rightpage, FindSplitStrat *strategy);
+static void _bt_interval_edges(FindSplitData *state,
+				   SplitPoint **leftinterval, SplitPoint **rightinterval);
+static inline int _bt_split_penalty(FindSplitData *state, SplitPoint *split);
+static inline IndexTuple _bt_split_lastleft(FindSplitData *state,
+				   SplitPoint *split);
+static inline IndexTuple _bt_split_firstright(FindSplitData *state,
+					 SplitPoint *split);
+
+
+/*
+ *	_bt_findsplitloc() -- find an appropriate place to split a page.
+ *
+ * The main goal here is to equalize the free space that will be on each
+ * split page, *after accounting for the inserted tuple*.  (If we fail to
+ * account for it, we might find ourselves with too little room on the page
+ * that it needs to go into!)
+ *
+ * If the page is the rightmost page on its level, we instead try to arrange
+ * to leave the left split page fillfactor% full.  In this way, when we are
+ * inserting successively increasing keys (consider sequences, timestamps,
+ * etc) we will end up with a tree whose pages are about fillfactor% full,
+ * instead of the 50% full result that we'd get without this special case.
+ * This is the same as nbtsort.c produces for a newly-created tree.  Note
+ * that leaf and nonleaf pages use different fillfactors.  Note also that
+ * there are a number of further special cases where fillfactor is not
+ * applied in the standard way.
+ *
+ * We are passed the intended insert position of the new tuple, expressed as
+ * the offsetnumber of the tuple it must go in front of (this could be
+ * maxoff+1 if the tuple is to go at the end).  The new tuple itself is also
+ * passed, since it's needed to give some weight to how effective suffix
+ * truncation will be.  The implementation picks the split point that
+ * maximizes the effectiveness of suffix truncation from a small list of
+ * alternative candidate split points that leave each side of the split with
+ * about the same share of free space.  Suffix truncation is secondary to
+ * equalizing free space, except in cases with large numbers of duplicates.
+ * Note that it is always assumed that caller goes on to perform truncation,
+ * even with pg_upgrade'd indexes where that isn't actually the case
+ * (!heapkeyspace indexes).  See nbtree/README for more information about
+ * suffix truncation.
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating whether the new tuple goes on
+ * the left or right page.  The bool is necessary to disambiguate the case
+ * where firstright == newitemoff.
+ */
+OffsetNumber
+_bt_findsplitloc(Relation rel,
+				 Page page,
+				 OffsetNumber newitemoff,
+				 Size newitemsz,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	BTPageOpaque opaque;
+	int			leftspace,
+				rightspace,
+				olddataitemstotal,
+				olddataitemstoleft,
+				perfectpenalty,
+				leaffillfactor;
+	FindSplitData state;
+	FindSplitStrat strategy;
+	ItemId		itemid;
+	OffsetNumber offnum,
+				maxoff,
+				foundfirstright;
+	double		fillfactormult;
+	bool		usemult;
+	SplitPoint	leftpage,
+				rightpage;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Total free space available on a btree page, after fixed overhead */
+	leftspace = rightspace =
+		PageGetPageSize(page) - SizeOfPageHeaderData -
+		MAXALIGN(sizeof(BTPageOpaqueData));
+
+	/* The right page will have the same high key as the old page */
+	if (!P_RIGHTMOST(opaque))
+	{
+		itemid = PageGetItemId(page, P_HIKEY);
+		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
+							 sizeof(ItemIdData));
+	}
+
+	/* Count up total space in data items before actually scanning 'em */
+	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
+	leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	state.rel = rel;
+	state.page = page;
+	state.newitem = newitem;
+	state.newitemsz = newitemsz;
+	state.is_leaf = P_ISLEAF(opaque);
+	state.is_rightmost = P_RIGHTMOST(opaque);
+	state.leftspace = leftspace;
+	state.rightspace = rightspace;
+	state.olddataitemstotal = olddataitemstotal;
+	state.minfirstrightsz = SIZE_MAX;
+	state.newitemoff = newitemoff;
+
+	/*
+	 * maxsplits should never exceed maxoff because there will be at most as
+	 * many candidate split points as there are points _between_ tuples, once
+	 * you imagine that the new item is already on the original page (the
+	 * final number of splits may be slightly lower because not all points
+	 * between tuples will be legal).
+	 */
+	state.maxsplits = maxoff;
+	state.splits = palloc(sizeof(SplitPoint) * state.maxsplits);
+	state.nsplits = 0;
+
+	/*
+	 * Scan through the data items and calculate space usage for a split at
+	 * each possible position.  We start at the first data offset rather than
+	 * the second data offset to handle the "newitemoff == first data offset"
+	 * case (any other split whose firstoldonright is the first data offset
+	 * can't be legal, though, and so won't actually end up being recorded in
+	 * the first loop iteration).
+	 */
+	olddataitemstoleft = 0;
+
+	for (offnum = P_FIRSTDATAKEY(opaque);
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		Size		itemsz;
+
+		itemid = PageGetItemId(page, offnum);
+		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
+
+		/*
+		 * Will the new item go to left or right of split?
+		 */
+		if (offnum > newitemoff)
+			_bt_recsplitloc(&state, offnum, true, olddataitemstoleft, itemsz);
+		else if (offnum < newitemoff)
+			_bt_recsplitloc(&state, offnum, false, olddataitemstoleft, itemsz);
+		else
+		{
+			/* may need to record a split on one or both sides of new item */
+			_bt_recsplitloc(&state, offnum, true, olddataitemstoleft, itemsz);
+			_bt_recsplitloc(&state, offnum, false, olddataitemstoleft, itemsz);
+		}
+
+		olddataitemstoleft += itemsz;
+	}
+
+	/*
+	 * If the new item goes as the last item, record the split point that
+	 * leaves all the old items on the left page, and the new item on the
+	 * right page.  This is required because a split that leaves the new item
+	 * as the firstoldonright won't have been reached within the loop.
+	 */
+	Assert(olddataitemstoleft == olddataitemstotal);
+	if (newitemoff > maxoff)
+		_bt_recsplitloc(&state, newitemoff, false, olddataitemstotal, 0);
+
+	/*
+	 * I believe it is not possible to fail to find a feasible split, but just
+	 * in case ...
+	 */
+	if (state.nsplits == 0)
+		elog(ERROR, "could not find a feasible split point for index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	/*
+	 * Start search for a split point among list of legal split points.  Give
+	 * primary consideration to equalizing available free space in each half
+	 * of the split initially (start with default strategy), while applying
+	 * of the split initially (start with default strategy), while applying the
+	 * rightmost-page fillfactor where appropriate.  Either of the two fallback
+	 * strategies may be required for cases with a large number of duplicates
+	 *
+	 * Default strategy gives some weight to suffix truncation in deciding a
+	 * split point on leaf pages.  It attempts to select a split point where a
+	 * distinguishing attribute appears earlier in the new high key for the
+	 * left side of the split, in order to maximize the number of trailing
+	 * attributes that can be truncated away.  Only candidate split points
+	 * that imply an acceptable balance of free space on each side are
+	 * considered.
+	 */
+	if (!state.is_leaf)
+	{
+		/* fillfactormult only used on rightmost page */
+		usemult = state.is_rightmost;
+		fillfactormult = BTREE_NONLEAF_FILLFACTOR / 100.0;
+	}
+	else if (state.is_rightmost)
+	{
+		/* Rightmost leaf page --  fillfactormult always used */
+		usemult = true;
+		fillfactormult = leaffillfactor / 100.0;
+	}
+	else
+	{
+		/* Other leaf page.  50:50 page split. */
+		usemult = false;
+		/* fillfactormult not used, but be tidy */
+		fillfactormult = 0.50;
+	}
+
+	/*
+	 * Set an initial limit on the split interval/number of candidate split
+	 * points as appropriate.  The "Prefix B-Trees" paper refers to this as
+	 * sigma l for leaf splits and sigma b for internal ("branch") splits.
+	 * It's hard to provide a theoretical justification for the initial size
+	 * of the split interval, though it's clear that a small split interval
+	 * makes suffix truncation much more effective without noticeably
+	 * affecting space utilization over time.
+	 */
+	state.interval = Min(Max(1, state.nsplits * 0.05),
+						 state.is_leaf ? MAX_LEAF_INTERVAL :
+						 MAX_INTERNAL_INTERVAL);
+
+	/*
+	 * Save leftmost and rightmost splits for page before original ordinal
+	 * sort order is lost by delta/fillfactormult sort
+	 */
+	leftpage = state.splits[0];
+	rightpage = state.splits[state.nsplits - 1];
+
+	/* Give split points a fillfactormult-wise delta, and sort on deltas */
+	_bt_deltasortsplits(&state, fillfactormult, usemult);
+
+	/*
+	 * Determine if default strategy/split interval will produce a
+	 * sufficiently distinguishing split, or if we should change strategies.
+	 * Alternative strategies change the range of split points that are
+	 * considered acceptable (split interval), and possibly change
+	 * fillfactormult, in order to deal with pages with a large number of
+	 * duplicates gracefully.
+	 *
+	 * Pass low and high splits for the entire page (including even newitem).
+	 * These are used when the initial split interval encloses split points
+	 * that are full of duplicates, and we need to consider if it's even
+	 * possible to avoid appending a heap TID.
+	 */
+	perfectpenalty = _bt_strategy(&state, &leftpage, &rightpage, &strategy);
+
+	if (strategy == SPLIT_DEFAULT)
+	{
+		/*
+		 * Default strategy worked out (always works out with internal page).
+		 * Original split interval still stands.
+		 */
+	}
+
+	/*
+	 * Many duplicates strategy is used when a heap TID would otherwise be
+	 * appended, but the page isn't completely full of logical duplicates.
+	 *
+	 * The split interval is widened to include all legal candidate split
+	 * points.  There may be as few as two distinct values in the whole-page
+	 * split interval.  Many duplicates strategy has no hard requirements for
+	 * space utilization, though it still keeps the use of space balanced as a
+	 * non-binding secondary goal (perfect penalty is set so that the
+	 * first/lowest delta split points that avoids appending a heap TID is
+	 * first/lowest delta split point that avoids appending a heap TID is
+	 *
+	 * Single value strategy is used when it is impossible to avoid appending
+	 * a heap TID.  It arranges to leave the left page very full.  This
+	 * maximizes space utilization in cases where tuples with the same
+	 * attribute values span many pages.  Newly inserted duplicates will tend
+	 * to have higher heap TID values, so we'll end up splitting to the right
+	 * consistently.  (Single value strategy is harmless though not
+	 * particularly useful with !heapkeyspace indexes.)
+	 */
+	else if (strategy == SPLIT_MANY_DUPLICATES)
+	{
+		Assert(state.is_leaf);
+		/* No need to resort splits -- no change in fillfactormult/deltas */
+		state.interval = state.nsplits;
+	}
+	else if (strategy == SPLIT_SINGLE_VALUE)
+	{
+		Assert(state.is_leaf);
+		/* Split near the end of the page */
+		usemult = true;
+		fillfactormult = BTREE_SINGLEVAL_FILLFACTOR / 100.0;
+		/* Resort split points with new delta */
+		_bt_deltasortsplits(&state, fillfactormult, usemult);
+		/* Appending a heap TID is unavoidable, so interval of 1 is fine */
+		state.interval = 1;
+	}
+
+	/*
+	 * Search among acceptable split points (using final split interval) for
+	 * the entry that has the lowest penalty, and is therefore expected to
+	 * maximize fan-out.  Sets *newitemonleft for us.
+	 */
+	foundfirstright = _bt_bestsplitloc(&state, perfectpenalty, newitemonleft);
+	pfree(state.splits);
+
+	return foundfirstright;
+}
+
+/*
+ * Subroutine to record a particular point between two tuples (possibly the
+ * new item) on page (ie, combination of firstright and newitemonleft
+ * settings) in *state for later analysis.  This is also a convenient point
+ * to check if the split is legal (if it isn't, it won't be recorded).
+ *
+ * firstoldonright is the offset of the first item on the original page that
+ * goes to the right page, and firstoldonrightsz is the size of that tuple.
+ * firstoldonright can be > max offset, which means that all the old items go
+ * to the left page and only the new item goes to the right page.  In that
+ * case, firstoldonrightsz is not used.
+ *
+ * olddataitemstoleft is the total size of all old items to the left of the
+ * split point that is recorded here when legal.  Should not include
+ * newitemsz, since that is handled here.
+ */
+static void
+_bt_recsplitloc(FindSplitData *state,
+				OffsetNumber firstoldonright,
+				bool newitemonleft,
+				int olddataitemstoleft,
+				Size firstoldonrightsz)
+{
+	int16		leftfree,
+				rightfree;
+	Size		firstrightitemsz;
+	bool		newitemisfirstonright;
+
+	/* Is the new item going to be the first item on the right page? */
+	newitemisfirstonright = (firstoldonright == state->newitemoff
+							 && !newitemonleft);
+
+	if (newitemisfirstonright)
+		firstrightitemsz = state->newitemsz;
+	else
+		firstrightitemsz = firstoldonrightsz;
+
+	/* Account for all the old tuples */
+	leftfree = state->leftspace - olddataitemstoleft;
+	rightfree = state->rightspace -
+		(state->olddataitemstotal - olddataitemstoleft);
+
+	/*
+	 * The first item on the right page becomes the high key of the left page;
+	 * therefore it counts against left space as well as right space (we
+	 * cannot assume that suffix truncation will make it any smaller).  When
+	 * the index has included attributes, those attributes of the left page's
+	 * high key will be truncated, leaving that page with slightly more free
+	 * space.  However, that shouldn't affect our ability to find a valid split
+	 * location, since we err in the direction of being pessimistic about free
+	 * space on the left half.  Besides, even when suffix truncation of
+	 * non-TID attributes occurs, the new high key often won't even be a
+	 * single MAXALIGN() quantum smaller than the firstright tuple it's based
+	 * on.
+	 *
+	 * If we are on the leaf level, assume that suffix truncation cannot avoid
+	 * adding a heap TID to the left half's new high key.  In practice the new
+	 * high key will often be smaller and will rarely be larger, but
+	 * conservatively assume the worst case.
+	 */
+	if (state->is_leaf)
+		leftfree -= (int16) (firstrightitemsz +
+							 MAXALIGN(sizeof(ItemPointerData)));
+	else
+		leftfree -= (int16) firstrightitemsz;
+
+	/* account for the new item */
+	if (newitemonleft)
+		leftfree -= (int16) state->newitemsz;
+	else
+		rightfree -= (int16) state->newitemsz;
+
+	/*
+	 * If we are not on the leaf level, we will be able to discard the key
+	 * data from the first item that winds up on the right page.
+	 */
+	if (!state->is_leaf)
+		rightfree += (int16) firstrightitemsz -
+			(int16) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
+
+	/* Record split if legal */
+	if (leftfree >= 0 && rightfree >= 0)
+	{
+		Assert(state->nsplits < state->maxsplits);
+
+		/* Determine smallest firstright item size on page */
+		state->minfirstrightsz = Min(state->minfirstrightsz, firstrightitemsz);
+
+		state->splits[state->nsplits].curdelta = 0;
+		state->splits[state->nsplits].leftfree = leftfree;
+		state->splits[state->nsplits].rightfree = rightfree;
+		state->splits[state->nsplits].firstoldonright = firstoldonright;
+		state->splits[state->nsplits].newitemonleft = newitemonleft;
+		state->nsplits++;
+	}
+}
+
+/*
+ * Subroutine to assign space deltas to materialized array of candidate split
+ * points based on current fillfactor, and to sort array using that fillfactor
+ */
+static void
+_bt_deltasortsplits(FindSplitData *state, double fillfactormult,
+					bool usemult)
+{
+	for (int i = 0; i < state->nsplits; i++)
+	{
+		SplitPoint *split = state->splits + i;
+		int16		delta;
+
+		if (usemult)
+			delta = fillfactormult * split->leftfree -
+				(1.0 - fillfactormult) * split->rightfree;
+		else
+			delta = split->leftfree - split->rightfree;
+
+		if (delta < 0)
+			delta = -delta;
+
+		/* Save delta */
+		split->curdelta = delta;
+	}
+
+	qsort(state->splits, state->nsplits, sizeof(SplitPoint), _bt_splitcmp);
+}
+
+/*
+ * qsort-style comparator used by _bt_deltasortsplits()
+ */
+static int
+_bt_splitcmp(const void *arg1, const void *arg2)
+{
+	SplitPoint *split1 = (SplitPoint *) arg1;
+	SplitPoint *split2 = (SplitPoint *) arg2;
+
+	if (split1->curdelta > split2->curdelta)
+		return 1;
+	if (split1->curdelta < split2->curdelta)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * Subroutine to find the "best" split point among an array of acceptable
+ * candidate split points that split without there being an excessively high
+ * delta between the space left free on the left and right halves.  The "best"
+ * split point is the split point with the lowest penalty among split points
+ * that fall within current/final split interval.  Penalty is an abstract
+ * score, with a definition that varies depending on whether we're splitting a
+ * leaf page or an internal page.  See _bt_split_penalty() for details.
+ *
+ * "perfectpenalty" is assumed to be the lowest possible penalty among
+ * candidate split points.  This allows us to return early without wasting
+ * cycles on calculating the first differing attribute for all candidate
+ * splits when that clearly cannot improve our choice (or when we only want a
+ * minimally distinguishing split point, and don't want to make the split any
+ * more unbalanced than is necessary).
+ *
+ * We return the index of the first existing tuple that should go on the right
+ * page, plus a boolean indicating if new item is on left of split point.
+ */
+static OffsetNumber
+_bt_bestsplitloc(FindSplitData *state, int perfectpenalty, bool *newitemonleft)
+{
+	int			bestpenalty,
+				lowsplit;
+	int			highsplit = Min(state->interval, state->nsplits);
+
+	/* No point in calculating penalty when there's only one choice */
+	if (state->nsplits == 1)
+	{
+		*newitemonleft = state->splits[0].newitemonleft;
+		return state->splits[0].firstoldonright;
+	}
+
+	bestpenalty = INT_MAX;
+	lowsplit = 0;
+	for (int i = lowsplit; i < highsplit; i++)
+	{
+		int			penalty;
+
+		penalty = _bt_split_penalty(state, state->splits + i);
+
+		if (penalty <= perfectpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+			break;
+		}
+
+		if (penalty < bestpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+		}
+	}
+
+	*newitemonleft = state->splits[lowsplit].newitemonleft;
+	return state->splits[lowsplit].firstoldonright;
+}
+
+/*
+ * Subroutine to decide whether split should use default strategy/initial
+ * split interval, or whether it should finish splitting the page using
+ * alternative strategies (this is only possible with leaf pages).
+ *
+ * Caller uses alternative strategy (or sticks with default strategy) based
+ * on how *strategy is set here.  Return value is "perfect penalty", which is
+ * passed to _bt_bestsplitloc() as a final constraint on how far caller is
+ * willing to go to avoid appending a heap TID when using the many duplicates
+ * strategy (it also saves _bt_bestsplitloc() useless cycles).
+ */
+static int
+_bt_strategy(FindSplitData *state, SplitPoint *leftpage,
+			 SplitPoint *rightpage, FindSplitStrat *strategy)
+{
+	IndexTuple	leftmost,
+				rightmost;
+	SplitPoint *leftinterval,
+			   *rightinterval;
+	int			perfectpenalty;
+	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
+
+	/* Assume that alternative strategy won't be used for now */
+	*strategy = SPLIT_DEFAULT;
+
+	/*
+	 * Use smallest observed first right item size for entire page as perfect
+	 * penalty on internal pages.  This can save cycles in the common case
+	 * where most or all splits (not just splits within interval) have first
+	 * right tuples that are the same size.
+	 */
+	if (!state->is_leaf)
+		return state->minfirstrightsz;
+
+	/*
+	 * Use leftmost and rightmost tuples from leftmost and rightmost splits in
+	 * current split interval
+	 */
+	_bt_interval_edges(state, &leftinterval, &rightinterval);
+	leftmost = _bt_split_lastleft(state, leftinterval);
+	rightmost = _bt_split_firstright(state, rightinterval);
+
+	/*
+	 * If initial split interval can produce a split point that will at least
+	 * avoid appending a heap TID in new high key, we're done.  Finish split
+	 * with default strategy and initial split interval.
+	 */
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	if (perfectpenalty <= indnkeyatts)
+		return perfectpenalty;
+
+	/*
+	 * Work out how caller should finish split when even their "perfect"
+	 * penalty for initial/default split interval indicates that the interval
+	 * does not contain even a single split that avoids appending a heap TID.
+	 *
+	 * Use the leftmost split's lastleft tuple and the rightmost split's
+	 * firstright tuple to assess every possible split.
+	 */
+	leftmost = _bt_split_lastleft(state, leftpage);
+	rightmost = _bt_split_firstright(state, rightpage);
+
+	/*
+	 * If page (including new item) has many duplicates but is not entirely
+	 * full of duplicates, a many duplicates strategy split will be performed.
+	 * If page is entirely full of duplicates, a single value strategy split
+	 * will be performed.
+	 */
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	if (perfectpenalty <= indnkeyatts)
+	{
+		*strategy = SPLIT_MANY_DUPLICATES;
+
+		/*
+		 * Caller should choose the lowest delta split that avoids appending a
+		 * heap TID.  Maximizing the number of attributes that can be
+		 * truncated away (returning perfectpenalty when it happens to be less
+		 * than the number of key attributes in index) can result in continual
+		 * unbalanced page splits.
+		 *
+		 * Just avoiding appending a heap TID can still make splits very
+		 * unbalanced, but this is self-limiting.  When final split has a very
+		 * high delta, one side of the split will likely consist of a single
+		 * value.  If that page is split once again, then that split will
+		 * likely use the single value strategy.
+		 */
+		return indnkeyatts;
+	}
+
+	/*
+	 * Single value strategy is only appropriate with ever-increasing heap
+	 * TIDs; otherwise, original default strategy split should proceed to
+	 * avoid pathological performance.  Use page high key to infer if this is
+	 * the rightmost page among pages that store the same duplicate value.
+	 * This should not prevent insertions of heap TIDs that are slightly out
+	 * of order from using single value strategy, since that's expected with
+	 * concurrent inserters of the same duplicate value.
+	 */
+	else if (state->is_rightmost)
+		*strategy = SPLIT_SINGLE_VALUE;
+	else
+	{
+		ItemId		itemid;
+		IndexTuple	hikey;
+
+		itemid = PageGetItemId(state->page, P_HIKEY);
+		hikey = (IndexTuple) PageGetItem(state->page, itemid);
+		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
+											 state->newitem);
+		if (perfectpenalty <= indnkeyatts)
+			*strategy = SPLIT_SINGLE_VALUE;
+		else
+		{
+			/*
+			 * Have caller finish split using default strategy, since page
+			 * does not appear to be the rightmost page for duplicates of the
+			 * value the page is filled with
+			 */
+		}
+	}
+
+	return perfectpenalty;
+}
+
+/*
+ * Subroutine to locate leftmost and rightmost splits for current/default
+ * split interval.  Note that it will be the same split iff there is only one
+ * split in interval.
+ */
+static void
+_bt_interval_edges(FindSplitData *state, SplitPoint **leftinterval,
+				   SplitPoint **rightinterval)
+{
+	int			highsplit = Min(state->interval, state->nsplits);
+	SplitPoint *deltaoptimal;
+
+	deltaoptimal = state->splits;
+	*leftinterval = NULL;
+	*rightinterval = NULL;
+
+	/*
+	 * Delta is an absolute distance to optimal split point, so both the
+	 * leftmost and rightmost split points will usually be at the end of the
+	 * array
+	 */
+	for (int i = highsplit - 1; i >= 0; i--)
+	{
+		SplitPoint *distant = state->splits + i;
+
+		if (distant->firstoldonright < deltaoptimal->firstoldonright)
+		{
+			if (*leftinterval == NULL)
+				*leftinterval = distant;
+		}
+		else if (distant->firstoldonright > deltaoptimal->firstoldonright)
+		{
+			if (*rightinterval == NULL)
+				*rightinterval = distant;
+		}
+		else if (!distant->newitemonleft && deltaoptimal->newitemonleft)
+		{
+			/*
+			 * "incoming tuple will become first on right page" (distant) is
+			 * to the left of "incoming tuple will become last on left page"
+			 * (delta-optimal)
+			 */
+			Assert(distant->firstoldonright == state->newitemoff);
+			if (*leftinterval == NULL)
+				*leftinterval = distant;
+		}
+		else if (distant->newitemonleft && !deltaoptimal->newitemonleft)
+		{
+			/*
+			 * "incoming tuple will become last on left page" (distant) is to
+			 * the right of "incoming tuple will become first on right page"
+			 * (delta-optimal)
+			 */
+			Assert(distant->firstoldonright == state->newitemoff);
+			if (*rightinterval == NULL)
+				*rightinterval = distant;
+		}
+		else
+		{
+			/* There were only one or two splits in the initial split interval */
+			Assert(distant == deltaoptimal);
+			if (*leftinterval == NULL)
+				*leftinterval = distant;
+			if (*rightinterval == NULL)
+				*rightinterval = distant;
+		}
+
+		if (*leftinterval && *rightinterval)
+			return;
+	}
+
+	Assert(false);
+}
+
+/*
+ * Subroutine to find penalty for caller's candidate split point.
+ *
+ * On leaf pages, penalty is the attribute number that distinguishes each side
+ * of a split.  It's the last attribute that needs to be included in new high
+ * key for left page.  It can be greater than the number of key attributes in
+ * cases where a heap TID will need to be appended during truncation.
+ *
+ * On internal pages, penalty is simply the size of the first item on the
+ * right half of the split (including item pointer overhead).  This tuple will
+ * become the new high key for the left page.
+ */
+static inline int
+_bt_split_penalty(FindSplitData *state, SplitPoint *split)
+{
+	IndexTuple	lastleftuple;
+	IndexTuple	firstrighttuple;
+
+	if (!state->is_leaf)
+	{
+		ItemId		itemid;
+
+		if (!split->newitemonleft &&
+			split->firstoldonright == state->newitemoff)
+			return state->newitemsz;
+
+		itemid = PageGetItemId(state->page, split->firstoldonright);
+
+		return MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
+	}
+
+	lastleftuple = _bt_split_lastleft(state, split);
+	firstrighttuple = _bt_split_firstright(state, split);
+
+	Assert(lastleftuple != firstrighttuple);
+	return _bt_keep_natts_fast(state->rel, lastleftuple, firstrighttuple);
+}
+
+/*
+ * Subroutine to get a lastleft IndexTuple for a split point from page
+ */
+static inline IndexTuple
+_bt_split_lastleft(FindSplitData *state, SplitPoint *split)
+{
+	ItemId		itemid;
+
+	if (split->newitemonleft && split->firstoldonright == state->newitemoff)
+		return state->newitem;
+
+	itemid = PageGetItemId(state->page,
+						   OffsetNumberPrev(split->firstoldonright));
+	return (IndexTuple) PageGetItem(state->page, itemid);
+}
+
+/*
+ * Subroutine to get a firstright IndexTuple for a split point from page
+ */
+static inline IndexTuple
+_bt_split_firstright(FindSplitData *state, SplitPoint *split)
+{
+	ItemId		itemid;
+
+	if (!split->newitemonleft && split->firstoldonright == state->newitemoff)
+		return state->newitem;
+
+	itemid = PageGetItemId(state->page, split->firstoldonright);
+	return (IndexTuple) PageGetItem(state->page, itemid);
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c5ec8e132d..1b09ab8d6a 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -22,6 +22,7 @@
 #include "access/relscan.h"
 #include "miscadmin.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -2297,6 +2298,54 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	return keepnatts;
 }
 
+/*
+ * _bt_keep_natts_fast - fast, approximate variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location.  A naive bitwise approach to datum comparisons is used to
+ * save cycles.  This is inherently approximate, but usually provides the same
+ * answer as the authoritative approach that _bt_keep_natts takes, since the
+ * vast majority of types in Postgres cannot be equal according to any
+ * available opclass unless they're bitwise equal.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in the new pivot tuple.
+ */
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= keysz; attnum++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		att = TupleDescAttr(itupdesc, attnum - 1);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
+}
+
 /*
  *  _bt_check_natts() -- Verify tuple has expected number of attributes.
  *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 197bd6cf52..f7f8011e33 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -160,11 +160,15 @@ typedef struct BTMetaPageData
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
  * a rightmost page; when splitting non-rightmost pages we try to
- * divide the data equally.
+ * divide the data equally.  When splitting a page that's entirely
+ * filled with a single value (duplicates), the effective leaf-page
+ * fillfactor is 96%, regardless of whether the page is a rightmost
+ * page.
  */
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#define BTREE_SINGLEVAL_FILLFACTOR	96
 
 /*
  *	In general, the btree code tries to localize its knowledge about
@@ -700,6 +704,13 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 
+/*
+ * prototypes for functions in nbtsplitloc.c
+ */
+extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
+				 OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+				 bool *newitemonleft);
+
 /*
  * prototypes for functions in nbtpage.c
  */
@@ -766,6 +777,8 @@ extern bool btproperty(Oid index_oid, int attno,
 		   bool *res, bool *isnull);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 			 IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
+					IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 				OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
-- 
2.17.1
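
As a toy illustration of what the _bt_keep_natts_fast() added above boils down
to, here is a standalone sketch (keep_natts_fast_toy is an invented name, plain
ints stand in for datums, and none of the real nbtree machinery is involved):

#include <stdio.h>

/*
 * Count how many leading attributes a new pivot tuple must keep so that it
 * still distinguishes lastleft from firstright.  As in the fast variant,
 * "equal" just means bitwise equal.  Returning natts + 1 means that every
 * key attribute is equal, so the heap TID tiebreaker has to be kept too.
 */
static int
keep_natts_fast_toy(const int *lastleft, const int *firstright, int natts)
{
	int			keepnatts = 1;

	for (int attnum = 0; attnum < natts; attnum++)
	{
		if (lastleft[attnum] != firstright[attnum])
			break;
		keepnatts++;
	}

	return keepnatts;
}

int
main(void)
{
	int			lastleft[] = {42, 7, 3};
	int			firstright[] = {42, 9, 1};

	/* prints 2: the first attribute is equal, the second already differs */
	printf("%d\n", keep_natts_fast_toy(lastleft, firstright, 3));
	return 0;
}

The whole point of the cheap bitwise test is to let the split point logic
estimate how much truncation each candidate split would allow, without paying
for full opclass comparisons.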

Attachment: v16-0002-Make-heap-TID-a-tie-breaker-nbtree-index-column.patch (application/octet-stream)
From fa60d386a11eef4cb46adb6b0335bc6003b06fd0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v16 2/7] Make heap TID a tie-breaker nbtree index column.

Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique.  This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tie-breaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing the newer/more strict invariants with
version 4 indexes.

A later patch will enhance the logic used by nbtree to pick a split
point.  This patch is likely to negatively impact performance without
smarter choices about precisely where to split leaf pages.  Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d item pointer in a new
high key during leaf page splits.  The user-facing definition of the
"1/3 of a page" restriction is already imprecise, and so does not need
to be revised.  However, there should be a compatibility note in the v12
release notes.  The new maximum allowed size is 2704 bytes on 64-bit
systems, down from 2712 bytes.
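
The arithmetic behind those two numbers is just the MAXALIGN()'d width of one
item pointer.  A quick standalone sketch, assuming 8 byte alignment as on
typical 64-bit platforms (the macro below is only a local stand-in for the real
MAXALIGN()):

#include <stdio.h>

#define MAXALIGN(LEN)	(((LEN) + 7) & ~((size_t) 7))	/* assumes 8 byte alignment */

int
main(void)
{
	size_t		itemptr = 6;	/* item pointer: 4 byte block number + 2 byte offset */
	size_t		oldlimit = 2712;	/* old leaf tuple size limit on 64-bit systems */

	printf("reserved for heap TID: %zu\n", MAXALIGN(itemptr));	/* 8 */
	printf("new limit: %zu\n", oldlimit - MAXALIGN(itemptr));	/* 2704 */
	return 0;
}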
---
 contrib/amcheck/expected/check_btree.out     |   5 +-
 contrib/amcheck/sql/check_btree.sql          |   5 +-
 contrib/amcheck/verify_nbtree.c              | 331 +++++++++++++--
 contrib/pageinspect/btreefuncs.c             |   2 +-
 contrib/pageinspect/expected/btree.out       |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out |  10 +-
 doc/src/sgml/indices.sgml                    |  24 +-
 src/backend/access/common/indextuple.c       |   6 +-
 src/backend/access/nbtree/README             | 136 +++---
 src/backend/access/nbtree/nbtinsert.c        | 302 +++++++++-----
 src/backend/access/nbtree/nbtpage.c          | 200 ++++++---
 src/backend/access/nbtree/nbtree.c           |   2 +-
 src/backend/access/nbtree/nbtsearch.c        | 117 +++++-
 src/backend/access/nbtree/nbtsort.c          |  91 ++--
 src/backend/access/nbtree/nbtutils.c         | 412 +++++++++++++++++--
 src/backend/access/nbtree/nbtxlog.c          |  43 +-
 src/backend/access/rmgrdesc/nbtdesc.c        |   8 -
 src/backend/utils/sort/tuplesort.c           |  13 +-
 src/include/access/nbtree.h                  | 209 ++++++++--
 src/include/access/nbtxlog.h                 |  35 +-
 src/test/regress/expected/btree_index.out    |  34 +-
 src/test/regress/expected/create_index.out   |  13 +-
 src/test/regress/expected/dependency.out     |   4 +-
 src/test/regress/expected/event_trigger.out  |   4 +-
 src/test/regress/expected/foreign_data.out   |   8 +-
 src/test/regress/expected/rowsecurity.out    |   4 +-
 src/test/regress/sql/btree_index.sql         |  37 +-
 src/test/regress/sql/create_index.sql        |  14 +-
 28 files changed, 1553 insertions(+), 518 deletions(-)

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index ef5c9e1a1c..1e6079ddd2 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -130,9 +130,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true);
  bt_index_parent_check 
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index 0ad1631476..3f1e0d17ef 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -82,9 +82,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true);
 
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 053ac9d192..e769de3690 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -45,6 +45,8 @@ PG_MODULE_MAGIC;
  * block per level, which is bound by the range of BlockNumber:
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
 
 /*
  * State associated with verifying a B-Tree index
@@ -66,6 +68,8 @@ typedef struct BtreeCheckState
 	/* B-Tree Index Relation and associated heap relation */
 	Relation	rel;
 	Relation	heaprel;
+	/* rel is heapkeyspace index? */
+	bool		heapkeyspace;
 	/* ShareLock held on heap/index, rather than AccessShareLock? */
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
@@ -122,7 +126,7 @@ static void bt_index_check_internal(Oid indrelid, bool parentcheck,
 						bool heapallindexed);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -137,17 +141,22 @@ static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 						   IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
+static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
 					 BTScanInsert key,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 BTScanInsert key,
-					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   BTScanInsert key,
-							   Page nontarget,
-							   OffsetNumber upperbound);
+static inline bool invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							 BTScanInsert key,
+							 Page nontarget,
+							 OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline BTScanInsert bt_mkscankey_pivotsearch(Relation rel,
+													IndexTuple itup);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+							IndexTuple itup, bool nonpivot);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -204,6 +213,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	Oid			heapid;
 	Relation	indrel;
 	Relation	heaprel;
+	bool		heapkeyspace;
 	LOCKMODE	lockmode;
 
 	if (parentcheck)
@@ -254,7 +264,9 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	btree_index_checkable(indrel);
 
 	/* Check index, possibly against table it is an index on */
-	bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
+	heapkeyspace = _bt_heapkeyspace(indrel);
+	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
+						 heapallindexed);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -324,8 +336,8 @@ btree_index_checkable(Relation rel)
  * parent/child check cannot be affected.)
  */
 static void
-bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
-					 bool heapallindexed)
+bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
+					 bool readonly, bool heapallindexed)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -346,6 +358,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
 	state = palloc0(sizeof(BtreeCheckState));
 	state->rel = rel;
 	state->heaprel = heaprel;
+	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
 
@@ -806,7 +819,8 @@ bt_target_page_check(BtreeCheckState *state)
 	 * doesn't contain a high key, so nothing to check
 	 */
 	if (!P_RIGHTMOST(topaque) &&
-		!_bt_check_natts(state->rel, state->target, P_HIKEY))
+		!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+						 P_HIKEY))
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
@@ -839,6 +853,7 @@ bt_target_page_check(BtreeCheckState *state)
 		IndexTuple	itup;
 		size_t		tupsize;
 		BTScanInsert skey;
+		bool		lowersizelimit;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -865,7 +880,8 @@ bt_target_page_check(BtreeCheckState *state)
 					 errhint("This could be a torn page problem.")));
 
 		/* Check the number of index tuple attributes */
-		if (!_bt_check_natts(state->rel, state->target, offset))
+		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+							 offset))
 		{
 			char	   *itid,
 					   *htid;
@@ -906,7 +922,56 @@ bt_target_page_check(BtreeCheckState *state)
 			continue;
 
 		/* Build insertion scankey for current page offset */
-		skey = _bt_mkscankey(state->rel, itup);
+		skey = bt_mkscankey_pivotsearch(state->rel, itup);
+
+		/*
+		 * Make sure tuple size does not exceed the relevant BTREE_VERSION
+		 * specific limit.
+		 *
+		 * BTREE_VERSION 4 (which introduced heapkeyspace rules) requisitioned
+		 * a small amount of space from BTMaxItemSize() in order to ensure
+		 * that suffix truncation always has enough space to add an explicit
+		 * heap TID back to a tuple -- we pessimistically assume that every
+		 * newly inserted tuple will eventually need to have a heap TID
+		 * appended during a future leaf page split, when the tuple becomes
+		 * the basis of the new high key (pivot tuple) for the leaf page.
+		 *
+		 * Since the reclaimed space is reserved for that purpose, we must not
+		 * enforce the slightly lower limit when the extra space has been used
+		 * as intended.  In other words, there is only a cross-version
+		 * difference in the limit on tuple size within leaf pages.
+		 *
+		 * Still, we're particular about the details within BTREE_VERSION 4
+		 * internal pages.  Pivot tuples may only use the extra space for their
+		 * designated purpose.  Enforce the lower limit for pivot tuples when
+		 * an explicit heap TID isn't actually present. (In all other cases
+		 * suffix truncation is guaranteed to generate a pivot tuple that's no
+		 * larger than the first right tuple provided to it by its caller.)
+		 */
+		lowersizelimit = skey->heapkeyspace &&
+			(P_ISLEAF(topaque) || BTreeTupleGetHeapTID(itup) == NULL);
+		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
+					   BTMaxItemSizeNoHeapTid(state->target)))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
+							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index row size %zu exceeds maximum for index \"%s\"",
+							tupsize, RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap" : "index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -940,9 +1005,35 @@ bt_target_page_check(BtreeCheckState *state)
 		 * grandparents (as well as great-grandparents, and so on).  We don't
 		 * go to those lengths because that would be prohibitively expensive,
 		 * and probably not markedly more effective in practice.
+		 *
+		 * On the leaf level, we check that the key is <= the highkey.
+		 * However, on non-leaf levels we check that the key is < the highkey,
+		 * because the high key is "just another separator" rather than a copy
+		 * of some existing key item; we expect it to be unique among all keys
+		 * on the same level.  (Suffix truncation will sometimes produce a
+		 * leaf highkey that is an untruncated copy of the lastleft item, but
+		 * never any other item, which necessitates weakening the leaf level
+		 * check to <=.)
+		 *
+		 * Full explanation for why a highkey is never truly a copy of another
+		 * item from the same level on internal levels:
+		 *
+		 * While the new left page's high key is copied from the first offset
+		 * on the right page during an internal page split, that's not the
+		 * full story.  In effect, internal pages are split in the middle of
+		 * the firstright tuple, not between the would-be lastleft and
+		 * firstright tuples: the firstright key ends up on the left side as
+		 * left's new highkey, and the firstright downlink ends up on the
+		 * right side as right's new "negative infinity" item.  The negative
+		 * infinity tuple is truncated to zero attributes, so we're only left
+		 * with the downlink.  In other words, the copying is just an
+		 * implementation detail of splitting in the middle of a (pivot)
+		 * tuple. (See also: "Notes About Data Representation" in the nbtree
+		 * README.)
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
+			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
 			char	   *itid,
 					   *htid;
@@ -968,11 +1059,10 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1035,7 +1125,7 @@ bt_target_page_check(BtreeCheckState *state)
 			rightkey = bt_right_page_check_scankey(state);
 
 			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+				!invariant_g_offset(state, rightkey, max))
 			{
 				/*
 				 * As explained at length in bt_right_page_check_scankey(),
@@ -1213,9 +1303,9 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * continued existence of target block as non-ignorable (not half-dead or
 	 * deleted) implies that target page was not merged into from the right by
 	 * deletion; the key space at or after target never moved left.  Target's
-	 * parent either has the same downlink to target as before, or a <=
+	 * parent either has the same downlink to target as before, or a <
 	 * downlink due to deletion at the left of target.  Target either has the
-	 * same highkey as before, or a highkey <= before when there is a page
+	 * same highkey as before, or a highkey < before when there is a page
 	 * split. (The rightmost concurrently-split-from-target-page page will
 	 * still have the same highkey as target was originally found to have,
 	 * which for our purposes is equivalent to target's highkey itself never
@@ -1304,7 +1394,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * memory remaining allocated.
 	 */
 	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
-	return _bt_mkscankey(state->rel, firstitup);
+	return bt_mkscankey_pivotsearch(state->rel, firstitup);
 }
 
 /*
@@ -1367,7 +1457,8 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the
+	 * page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1416,14 +1507,29 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 	{
 		/*
 		 * Skip comparison of target page key against "negative infinity"
-		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * item, if any.  Checking it would indicate that it's not a strict
+		 * lower bound, but that's only because of the hard-coding for
+		 * negative infinity items within _bt_compare().
+		 *
+		 * If nbtree didn't truncate negative infinity tuples during internal
+		 * page splits then we'd expect child's negative infinity key to be
+		 * equal to the scankey/downlink from target/parent (it would be a
+		 * "low key" in this hypothetical scenario, and so it would still need
+		 * to be treated as a special case here).
+		 *
+		 * Negative infinity items can be thought of as a strict lower bound
+		 * that works transitively, with the last non-negative-infinity pivot
+		 * followed during a descent from the root as its "true" strict lower
+		 * bound.  Only a small number of negative infinity items are truly
+		 * negative infinity; those that are the first items of leftmost
+		 * internal pages.  In more general terms, a negative infinity item is
+		 * only negative infinity with respect to the subtree that the page is
+		 * at the root of.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
+		if (!invariant_l_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1855,6 +1961,61 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return invariant_leq_offset(state, key, upperbound);
+
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+
+	/*
+	 * _bt_compare() is capable of determining that a scankey with a
+	 * filled-out attribute is greater than pivot tuples where the comparison
+	 * is resolved at a truncated attribute (value of attribute in pivot is
+	 * minus infinity).  However, it is not capable of determining that a
+	 * scankey is _less than_ a tuple on the basis of a comparison resolved at
+	 * _scankey_ minus infinity attribute.  Complete an extra step to make it
+	 * work here instead.
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		nonpivot = P_ISLEAF(topaque) && upperbound >= P_FIRSTDATAKEY(topaque);
+
+		/* Get number of keys + heap TID for item to the right */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup, nonpivot);
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && rheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1874,42 +2035,84 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound)
 {
 	int32		cmp;
 
 	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
-	return cmp >= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp >= 0;
+
+	/*
+	 * No need to consider the possibility that scankey has attributes that we
+	 * need to force to be interpreted as negative infinity.  _bt_compare() is
+	 * able to determine that scankey is greater than negative infinity.  The
+	 * distinction between "==" and "<" isn't interesting here, since
+	 * corruption is indicated either way.
+	 */
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e. the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
-							   Page nontarget, OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							 Page nontarget, OffsetNumber upperbound)
 {
 	int32		cmp;
 
 	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
-	return cmp <= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp <= 0;
+
+	/* See invariant_l_offset() for an explanation of this extra step */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+		BTPageOpaque copaque;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		nonpivot = P_ISLEAF(copaque) && upperbound >= P_FIRSTDATAKEY(copaque);
+
+		/* Get number of keys + heap TID for child/non-target item */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+		childheaptid = BTreeTupleGetHeapTIDCareful(state, child, nonpivot);
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && childheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -2065,3 +2268,53 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * _bt_mkscankey() wrapper that automatically prevents insertion scankey from
+ * being considered greater than the pivot tuple that its values originated
+ * from (or some other identical pivot tuple) in the common case where there
+ * are truncated/minus infinity attributes.  Without this extra step, there
+ * are forms of corruption that amcheck could theoretically fail to report.
+ *
+ * For example, invariant_g_offset() might miss a cross-page invariant failure
+ * on an internal level if the scankey built from the first item on the
+ * target's right sibling page happened to be equal to (not greater than) the
+ * last item on target page.  The !pivotsearch tie-breaker in _bt_compare()
+ * might otherwise cause amcheck to assume (rather than actually verify) that
+ * the scankey is greater.
+ */
+static inline BTScanInsert
+bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
+{
+	BTScanInsert skey;
+
+	skey = _bt_mkscankey(rel, itup);
+	skey->pivotsearch = true;
+
+	return skey;
+}
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
+ * be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	BlockNumber targetblock = state->targetblock;
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
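
As an aside, the ordering rule that the cmp == 0 "extra step" in
invariant_l_offset() relies on can be shown with a toy standalone model (ints
only, invented names, nothing like the real code): when all of the attributes
that both sides have in common are equal, the side with fewer attributes sorts
strictly lower, because its missing attributes count as minus infinity.

#include <stdio.h>

static int
toy_truncated_compare(const int *a, int a_natts, const int *b, int b_natts)
{
	int			natts = (a_natts < b_natts) ? a_natts : b_natts;

	for (int i = 0; i < natts; i++)
	{
		if (a[i] != b[i])
			return (a[i] < b[i]) ? -1 : 1;
	}
	/* shared prefix is equal: the more heavily truncated side sorts lower */
	if (a_natts == b_natts)
		return 0;
	return (a_natts < b_natts) ? -1 : 1;
}

int
main(void)
{
	int			scankey[] = {7, 3};
	int			pivot[] = {7};	/* second attribute truncated away */

	/* prints -1: the truncated pivot is "minus infinity" on attribute two */
	printf("%d\n", toy_truncated_compare(pivot, 1, scankey, 2));
	return 0;
}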
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index bfa0c04c2f..8d27c9b0f6 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -561,7 +561,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	 * Get values of extended metadata if available, use default values
 	 * otherwise.
 	 */
-	if (metad->btm_version == BTREE_VERSION)
+	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b..07c2dcd771 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d4..9920dbfd40 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f427b312..21c978503a 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -504,8 +504,9 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
 
   <para>
    By default, B-tree indexes store their entries in ascending order
-   with nulls last.  This means that a forward scan of an index on
-   column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
+   with nulls last (table TID is treated as a tiebreaker column among
+   otherwise equal entries).  This means that a forward scan of an
+   index on column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
    (or more verbosely, <literal>ORDER BY x ASC NULLS LAST</literal>).  The
    index can also be scanned backward, producing output satisfying
    <literal>ORDER BY x DESC</literal>
@@ -1162,10 +1163,21 @@ CREATE INDEX tab_x_y ON tab(x, y);
    the extra columns are trailing columns; making them be leading columns is
    unwise for the reasons explained in <xref linkend="indexes-multicolumn"/>.
    However, this method doesn't support the case where you want the index to
-   enforce uniqueness on the key column(s).  Also, explicitly marking
-   non-searchable columns as <literal>INCLUDE</literal> columns makes the
-   index slightly smaller, because such columns need not be stored in upper
-   B-tree levels.
+   enforce uniqueness on the key column(s).
+  </para>
+
+  <para>
+   <firstterm>Suffix truncation</firstterm> always removes non-key
+   columns from upper B-Tree levels.  As payload columns, they are
+   never used to guide index scans.  The truncation process also
+   removes one or more trailing key column(s) when the remaining
+   prefix of key column(s) happens to be sufficient to describe tuples
+   on the lowest B-Tree level.  In practice, covering indexes without
+   an <literal>INCLUDE</literal> clause often avoid storing columns
+   that are effectively payload in the upper levels.  However,
+   explicitly defining payload columns as non-key columns
+   <emphasis>reliably</emphasis> keeps the tuples in upper levels
+   small.
   </para>
 
   <para>
diff --git a/src/backend/access/common/indextuple.c b/src/backend/access/common/indextuple.c
index 32c0ebb93a..cb23be859d 100644
--- a/src/backend/access/common/indextuple.c
+++ b/src/backend/access/common/indextuple.c
@@ -536,7 +536,11 @@ index_truncate_tuple(TupleDesc sourceDescriptor, IndexTuple source,
 	bool		isnull[INDEX_MAX_KEYS];
 	IndexTuple	truncated;
 
-	Assert(leavenatts < sourceDescriptor->natts);
+	Assert(leavenatts <= sourceDescriptor->natts);
+
+	/* Easy case: no truncation actually required */
+	if (leavenatts == sourceDescriptor->natts)
+		return CopyIndexTuple(source);
 
 	/* Create temporary descriptor to scribble on */
 	truncdesc = palloc(TupleDescSize(sourceDescriptor));
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index a295a7a286..40ff25fe06 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -28,37 +28,50 @@ right-link to find the new page containing the key range you're looking
 for.  This might need to be repeated, if the page has been split more than
 once.
 
+Lehman and Yao talk about pairs of "separator" keys and downlinks in
+internal pages rather than tuples or records.  We use the term "pivot"
+tuple to refer to tuples that don't point to heap tuples and are used
+only for tree navigation.  All tuples on non-leaf pages and high keys on
+leaf pages are pivot tuples.  Since pivot tuples are only used to represent
+which part of the key space belongs on each page, they can have attribute
+values copied from non-pivot tuples that were deleted and killed by VACUUM
+some time ago.  A pivot tuple may contain a "separator" key and downlink,
+just a separator key (i.e. the downlink value is implicitly undefined), or
+just a downlink (i.e. all attributes are truncated away).  We aren't always
+clear on which case applies, but it should be obvious from context.
+
+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute.  Logical duplicates are sorted in heap TID
+order.  This is necessary because Lehman and Yao also require that the key
+range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are
+the adjacent keys in the parent page (Ki must be _strictly_ less than v,
+which can be assured by having reliably unique keys).
+
+A search where the key is equal to a pivot tuple in an upper tree level
+must descend to the left of that pivot to ensure it finds any equal keys.
+The equal item(s) being searched for must therefore be to the left of that
+downlink page on the next level down.  A handy property of this design is
+that a scan where all attributes/keys are used behaves just the same as a
+scan where only some prefix of attributes are used; equality never needs to
+be treated as a special case.
+
+In practice, exact equality with pivot tuples on internal pages is
+extremely rare when all attributes (including even the heap TID attribute)
+are used in a search.  This is due to suffix truncation: truncated
+attributes are treated as having the value negative infinity, and
+truncation almost always manages to at least truncate away the trailing
+heap TID attribute.  While Lehman and Yao don't have anything to say about
+suffix truncation, the design used by nbtree is perfectly complementary.
+The later section on suffix truncation will be helpful if it's unclear how
+the Lehman & Yao invariants work with a real world example involving
+suffix truncation.
+
 Differences to the Lehman & Yao algorithm
 -----------------------------------------
 
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
-
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
-
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
 among backends.  As a result, we do page-level read locking on btree
@@ -194,9 +207,7 @@ be prepared for the possibility that the item it wants is to the left of
 the recorded position (but it can't have moved left out of the recorded
 page).  Since we hold a lock on the lower page (per L&Y) until we have
 re-found the parent item that links to it, we can be assured that the
-parent item does still exist and can't have been deleted.  Also, because
-we are matching downlink page numbers and not data keys, we don't have any
-problem with possibly misidentifying the parent item.
+parent item does still exist and can't have been deleted.
 
 Page Deletion
 -------------
@@ -615,22 +626,40 @@ scankey is consulted as each index entry is sequentially scanned to decide
 whether to return the entry and whether the scan can stop (see
 _bt_checkkeys()).
 
-We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+Notes about suffix truncation
+-----------------------------
+
+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split.  The remaining attributes must distinguish
+the last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page.  Tuples logically retain truncated key attributes,
+though they implicitly have "negative infinity" as their value, and have no
+storage overhead.  Since the high key is subsequently reused as the
+downlink in the parent page for the new right page, suffix truncation makes
+pivot tuples short.  INCLUDE indexes are guaranteed to have non-key
+attributes truncated at the time of a leaf page split, but may also have
+some key attributes truncated away, based on the usual criteria for key
+attributes.  They are not a special case, since non-key attributes are
+merely payload to B-Tree searches.
+
+The goal of suffix truncation of key attributes is to improve index
+fan-out.  The technique was first described by Bayer and Unterauer (R.Bayer
+and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
+2, No. 1, March 1977, pp 11-26).  The Postgres implementation is loosely
+based on their paper.  Note that Postgres only implements what the paper
+refers to as simple prefix B-Trees.  Note also that the paper assumes that
+the tree has keys that consist of single strings that maintain the "prefix
+property", much like strings that are stored in a suffix tree (comparisons
+of earlier bytes must always be more significant than comparisons of later
+bytes, and, in general, the strings must compare in a way that doesn't
+break transitive consistency as they're split into pieces).  Suffix
+truncation in Postgres currently only works at the whole-attribute
+granularity, but it would be straightforward to invent opclass
+infrastructure that manufactures a smaller attribute value in the case of
+variable-length types, such as text.  An opclass support function could
+manufacture the shortest possible key value that still correctly separates
+each half of a leaf page split.
 
 Notes About Data Representation
 -------------------------------
@@ -643,20 +672,26 @@ don't need to renumber any existing pages when splitting the root.)
 
 The Postgres disk block data format (an array of items) doesn't fit
 Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
-so we have to play some games.
+so we have to play some games.  (Presumably L&Y describe things this
+way because of internal page splits, which conceptually split at the
+middle of an existing pivot tuple -- the tuple's "separator" key goes on
+the left side of the split as the left side's new high key, while the
+tuple's pointer/downlink goes on the right side as the first/minus
+infinity downlink.)
 
 On a page that is not rightmost in its tree level, the "high key" is
 kept in the page's first item, and real data items start at item 2.
 The link portion of the "high key" item goes unused.  A page that is
-rightmost has no "high key", so data items start with the first item.
-Putting the high key at the left, rather than the right, may seem odd,
-but it avoids moving the high key as we add data items.
+rightmost has no "high key" (it's implicitly positive infinity), so
+data items start with the first item.  Putting the high key at the
+left, rather than the right, may seem odd, but it avoids moving the
+high key as we add data items.
 
 On a leaf page, the data items are simply links to (TIDs of) tuples
 in the relation being indexed, with the associated key values.
 
 On a non-leaf page, the data items are down-links to child pages with
-bounding keys.  The key in each data item is the *lower* bound for
+bounding keys.  The key in each data item is a strict lower bound for
 keys on that child page, so logically the key is to the left of that
 downlink.  The high key (if present) is the upper bound for the last
 downlink.  The first data item on each such page has no lower bound
@@ -664,4 +699,5 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
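
On the README's closing point about opclass infrastructure that manufactures
the shortest possible separating key, the byte-string version of the idea is
tiny.  A toy sketch only (plain byte order, invented function name, and it
assumes lastleft sorts before firstright; not a proposal for the actual opclass
interface):

#include <stdio.h>

/*
 * Length of the shortest prefix of firstright that still sorts strictly
 * after lastleft in plain byte order.  Precondition: lastleft < firstright.
 */
static size_t
separator_prefix_len(const char *lastleft, const char *firstright)
{
	size_t		common = 0;

	while (lastleft[common] != '\0' && lastleft[common] == firstright[common])
		common++;

	return common + 1;
}

int
main(void)
{
	const char *lastleft = "Smith, John";
	const char *firstright = "Smythe, Anne";
	size_t		len = separator_prefix_len(lastleft, firstright);

	printf("%.*s\n", (int) len, firstright);	/* "Smy" separates the two halves */
	return 0;
}

Keeping only the shortest prefix of firstright that still sorts after lastleft
is what keeps the resulting pivot small without losing the separation between
the two halves of the split.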
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 4284a5ff42..80f695229b 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -64,14 +64,16 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
 				  Relation heapRel);
 static bool _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
 					 bool *restorebinsrch, Size itemsz);
-static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
+static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
+			   Buffer buf,
+			   Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
 			   bool split_only_page);
-static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
-		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
-		  IndexTuple newitem, bool newitemonleft);
+static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
+		  Buffer cbuf, OffsetNumber firstright, OffsetNumber newitemoff,
+		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
@@ -118,6 +120,9 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_key = _bt_mkscankey(rel, itup);
+	/* No scantid until uniqueness established in checkingunique case */
+	if (checkingunique && itup_key->heapkeyspace)
+		itup_key->scantid = NULL;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -223,12 +228,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tie-breaker attribute.  Any other would-be inserter
+	 * of the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -265,6 +271,10 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/* Uniqueness is established -- restore heap tid as scantid */
+		if (itup_key->heapkeyspace)
+			itup_key->scantid = &itup->t_tid;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
@@ -273,12 +283,12 @@ top:
 
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
-		 * an index tuple insert conflicts with an existing lock.  Since the
-		 * actual location of the insert is hard to predict because of the
-		 * random search used to prevent O(N^2) performance when there are
-		 * many duplicate entries, we can just use the "first valid" page.
-		 * This reasoning also applies to INCLUDE indexes, whose extra
-		 * attributes are not considered part of the key space.
+		 * an index tuple insert conflicts with an existing lock.  The actual
+		 * location of the insert is unsettled in the checkingunique case
+		 * because scantid was not filled in initially, but it's okay to use
+		 * the "first valid" page instead.  This reasoning also applies to
+		 * INCLUDE indexes, whose extra attributes are not considered part of
+		 * the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
 
@@ -290,8 +300,8 @@ top:
 		 */
 		newitemoff = _bt_findinsertloc(rel, itup_key, &buf, checkingunique,
 									   itup, stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, newitemoff,
-					   false);
+		_bt_insertonpg(rel, itup_key, buf, InvalidBuffer, stack, itup,
+					   newitemoff, false);
 	}
 	else
 	{
@@ -361,6 +371,7 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
 	Assert(itup_key->low == offset);
+	Assert(itup_key->scantid == NULL);
 	for (;;)
 	{
 		ItemId		curitemid;
@@ -630,16 +641,16 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
 /*
  *	_bt_findinsertloc() -- Finds an insert location for a tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
+ *		On entry, *bufptr contains the page that the new tuple belongs on.
+ *		Occasionally, this won't be exactly right for callers that just
+ *		called _bt_check_unique(), and did initial search without using a
+ *		scantid.  They'll have to insert into a page somewhere to the right
+ *		in rare cases where there are many physical duplicates in a unique
+ *		index, and their scantid directs us to some page full of duplicates
+ *		to the right, where the new tuple must go.  (Actually, since
+ *		!heapkeyspace pg_upgrade'd non-unique indexes never get a scantid,
+ *		they too may require that we move right.  We treat them somewhat like
+ *		unique indexes.)
  *
  *		_bt_check_unique() saves the progress of the binary search it
  *		performs, in the insertion scan key.  In the common case that there
@@ -682,28 +693,26 @@ _bt_findinsertloc(Relation rel,
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itemsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
-	 */
-	if (itemsz > BTMaxItemSize(page))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itemsz, BTMaxItemSize(page),
-						RelationGetRelationName(rel)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(heapRel,
-									RelationGetRelationName(rel))));
+	/* Check 1/3 of a page restriction */
+	if (unlikely(itemsz > BTMaxItemSize(page)))
+		_bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+							 newtup);
 
+	/*
+	 * We may have to walk right through leaf pages to find the one leaf page
+	 * that we must insert onto, though only when inserting into unique
+	 * indexes.  This is necessary because a scantid is not used by the
+	 * insertion scan key initially in the case of unique indexes -- a scantid
+	 * is only set after the absence of duplicates (whose heap tuples are not
+	 * dead or recently dead) has been established by _bt_check_unique().
+	 * Non-unique index insertions will break out of the loop immediately.
+	 *
+	 * (Actually, non-unique indexes may still need to grovel through leaf
+	 * pages full of duplicates with a pg_upgrade'd !heapkeyspace index.)
+	 */
 	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+	Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
 	for (;;)
 	{
 		int			cmpval;
@@ -711,6 +720,13 @@ _bt_findinsertloc(Relation rel,
 		BlockNumber rblkno;
 
 		/*
+		 * Fastpaths that avoid extra high key check.
+		 *
+		 * No need to check high key when inserting into a non-unique index;
+		 * _bt_search() already checked this when it checked if a move to the
+		 * right was required for leaf page.  Insertion scankey's scantid
+		 * would have been filled out at the time.
+		 *
 		 * An earlier _bt_check_unique() call may well have established bounds
 		 * that we can use to skip the high key check for checkingunique
 		 * callers.  This fastpath cannot be used when there are no items on
@@ -718,8 +734,10 @@ _bt_findinsertloc(Relation rel,
 		 * new item belongs last on the page, but it might go on a later page
 		 * instead.
 		 */
-		if (restorebinsrch && itup_key->low <= itup_key->stricthigh &&
-			itup_key->stricthigh <= PageGetMaxOffsetNumber(page))
+		if (!checkingunique && itup_key->heapkeyspace)
+			break;
+		else if (restorebinsrch && itup_key->low <= itup_key->stricthigh &&
+				 itup_key->stricthigh <= PageGetMaxOffsetNumber(page))
 			break;
 
 		/*
@@ -729,15 +747,24 @@ _bt_findinsertloc(Relation rel,
 		if (P_RIGHTMOST(lpageop))
 			break;
 		cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
-
-		/*
-		 * May have to handle case where there is a choice of which page to
-		 * place new tuple on, and we must balance space utilization as best
-		 * we can.
-		 */
-		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
-												&restorebinsrch, itemsz))
-			break;
+		if (itup_key->heapkeyspace)
+		{
+			if (cmpval <= 0)
+				break;
+		}
+		else
+		{
+			/*
+			 * pg_upgrade'd !heapkeyspace index.
+			 *
+			 * May have to handle legacy case where there is a choice of which
+			 * page to place new tuple on, and we must balance space
+			 * utilization as best we can.
+			 */
+			if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
+													&restorebinsrch, itemsz))
+				break;
+		}
 
 		/*
 		 * step right to next non-dead page
@@ -746,6 +773,8 @@ _bt_findinsertloc(Relation rel,
 		 * page; else someone else's _bt_check_unique scan could fail to see
 		 * our insertion.  write locks on intermediate dead pages won't do
 		 * because we don't know when they will get de-linked from the tree.
+		 * (this is more aggressive than it needs to be for non-unique
+		 * !heapkeyspace indexes.)
 		 */
 		rbuf = InvalidBuffer;
 
@@ -816,9 +845,16 @@ _bt_findinsertloc(Relation rel,
 /*
  *	_bt_useduplicatepage() -- Settle for this page of duplicates?
  *
- *		This function handles the question of whether or not an insertion of
- *		a duplicate into an index should insert on the page contained in buf
- *		when a choice must be made.
+ *		Prior to PostgreSQL 12/Btree version 4, heap TID was never treated
+ *		as a part of the keyspace.  If there were many tuples of the same
+ *		value spanning more than one leaf page, a new tuple of that same
+ *		value could legally be placed on any one of the pages.
+ *
+ *		This function handles the question of whether or not an insertion
+ *		of a duplicate into a pg_upgrade'd !heapkeyspace index should insert
+ *		on the page contained in buf when a choice must be made.  It is only
+ *		used with pg_upgrade'd version 2 and version 3 indexes (!heapkeyspace
+ *		indexes).
  *
  *		If the current page doesn't have enough free space for the new tuple
  *		we "microvacuum" the page, removing LP_DEAD items, in the hope that it
@@ -911,6 +947,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
  */
 static void
 _bt_insertonpg(Relation rel,
+			   BTScanInsert itup_key,
 			   Buffer buf,
 			   Buffer cbuf,
 			   BTStack stack,
@@ -933,7 +970,7 @@ _bt_insertonpg(Relation rel,
 		   BTreeTupleGetNAtts(itup, rel) ==
 		   IndexRelationGetNumberOfAttributes(rel));
 	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
+		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/* The caller should've finished any incomplete splits already. */
@@ -983,8 +1020,8 @@ _bt_insertonpg(Relation rel,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, buf, cbuf, firstright,
-						 newitemoff, itemsz, itup, newitemonleft);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, firstright, newitemoff,
+						 itemsz, itup, newitemonleft);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1066,7 +1103,7 @@ _bt_insertonpg(Relation rel,
 		if (BufferIsValid(metabuf))
 		{
 			/* upgrade meta-page if needed */
-			if (metad->btm_version < BTREE_VERSION)
+			if (metad->btm_version < BTREE_NOVAC_VERSION)
 				_bt_upgrademetapage(metapg);
 			metad->btm_fastroot = itup_blkno;
 			metad->btm_fastlevel = lpageop->btpo.level;
@@ -1121,6 +1158,8 @@ _bt_insertonpg(Relation rel,
 
 			if (BufferIsValid(metabuf))
 			{
+				Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+				xlmeta.version = metad->btm_version;
 				xlmeta.root = metad->btm_root;
 				xlmeta.level = metad->btm_level;
 				xlmeta.fastroot = metad->btm_fastroot;
@@ -1188,17 +1227,19 @@ _bt_insertonpg(Relation rel,
  *		new right page.  newitemoff etc. tell us about the new item that
  *		must be inserted along with the data from the old page.
  *
- *		When splitting a non-leaf page, 'cbuf' is the left-sibling of the
- *		page we're inserting the downlink for.  This function will clear the
- *		INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *		itup_key is used for suffix truncation on leaf pages (internal
+ *		page callers pass NULL).  When splitting a non-leaf page, 'cbuf'
+ *		is the left-sibling of the page we're inserting the downlink for.
+ *		This function will clear the INCOMPLETE_SPLIT flag on it, and
+ *		release the buffer.
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
-_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
-		  bool newitemonleft)
+_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
+		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
+		  IndexTuple newitem, bool newitemonleft)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1292,7 +1333,8 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1306,8 +1348,29 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 
 	/*
 	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
-	 * item at position firstright, or the incoming tuple.
+	 * to go into the new right page, or possibly a truncated version if this
+	 * is a leaf page split.  This might be either the existing data item at
+	 * position firstright, or the incoming tuple.
+	 *
+	 * The high key for the left page is formed using the first item on the
+	 * right page, which may seem to be contrary to Lehman & Yao's approach of
+	 * using the left page's last item as its new high key when splitting on
+	 * the leaf level.  It isn't, though: suffix truncation will leave the
+	 * left page's high key fully equal to the last item on the left page when
+	 * two tuples with equal key values (excluding heap TID) enclose the split
+	 * point.  It isn't actually necessary for a new leaf high key to be equal
+	 * to the last item on the left for the L&Y "subtree" invariant to hold.
+	 * It's sufficient to make sure that the new leaf high key is strictly
+	 * less than the first item on the right leaf page, and greater than or
+	 * equal to (not necessarily equal to) the last item on the left leaf
+	 * page.
+	 *
+	 * In other words, when suffix truncation isn't possible, L&Y's exact
+	 * approach to leaf splits is taken.  (Actually, even that is slightly
+	 * inaccurate.  A tuple with all the keys from firstright but the heap TID
+	 * from lastleft will be used as the new high key, since the last left
+	 * tuple could be physically larger despite being opclass-equal in respect
+	 * of all attributes prior to the heap TID attribute.)
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1325,25 +1388,50 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
-	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * Truncate unneeded key attributes of the high key item before inserting
+	 * it on the left page.  This can only happen at the leaf level, since in
+	 * general all pivot tuple values originate from leaf level high keys.
+	 * This isn't just about avoiding unnecessary work, though; truncating
+	 * unneeded key suffix attributes can only be performed at the leaf level
+	 * anyway.  This is because a pivot tuple in a grandparent page must guide
+	 * a search not only to the correct parent page, but also to the correct
+	 * leaf page.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (isleaf && (itup_key->heapkeyspace || indnatts != indnkeyatts))
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		IndexTuple	lastleft;
+
+		/*
+		 * Determine which tuple will become the last on the left page. This
+		 * is needed to decide how many attributes from the first item on the
+		 * right page must remain in new high key for left page.
+		 */
+		if (newitemonleft && newitemoff == firstright)
+		{
+			/* incoming tuple will become last on left page */
+			lastleft = newitem;
+		}
+		else
+		{
+			OffsetNumber lastleftoff;
+
+			/* item just before firstright will become last on left page */
+			lastleftoff = OffsetNumberPrev(firstright);
+			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		Assert(lastleft != item);
+		lefthikey = _bt_truncate(rel, lastleft, item, itup_key);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1536,7 +1624,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1565,22 +1652,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log the left page's new high key */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1596,9 +1671,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -1961,7 +2034,7 @@ _bt_insert_parent(Relation rel,
 			_bt_relbuf(rel, pbuf);
 		}
 
-		/* get high key from left page == lower bound for new right page */
+		/* get high key from left page: strict lower bound for new right page */
 		ritem = (IndexTuple) PageGetItem(page,
 										 PageGetItemId(page, P_HIKEY));
 
@@ -1991,7 +2064,7 @@ _bt_insert_parent(Relation rel,
 				 RelationGetRelationName(rel), bknum, rbknum);
 
 		/* Recursively update the parent */
-		_bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
+		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
 					   new_item, stack->bts_offset + 1,
 					   is_only);
 
@@ -2252,7 +2325,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	START_CRIT_SECTION();
 
 	/* upgrade metapage if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* set btree special data */
@@ -2287,7 +2360,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2320,6 +2394,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
 		XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = rootblknum;
 		md.level = metad->btm_level;
 		md.fastroot = rootblknum;
@@ -2384,6 +2460,7 @@ _bt_pgaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -2399,8 +2476,8 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
@@ -2413,6 +2490,7 @@ _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
 	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
 	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(itup_key->scantid == NULL);
 
 	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 56041c3d38..3e37d71316 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -34,6 +34,7 @@
 #include "utils/snapmgr.h"
 
 static void _bt_cachemetadata(Relation rel, BTMetaPageData *metad);
+static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 						 bool *rightsib_empty);
@@ -77,7 +78,9 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 }
 
 /*
- *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to the new.
+ *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to version
+ *		3, the last version that can be updated without broadly affecting
+ *		on-disk compatibility.  (A REINDEX is required to upgrade to v4.)
  *
  *		This routine does purely in-memory image upgrade.  Caller is
  *		responsible for locking, WAL-logging etc.
@@ -93,11 +96,11 @@ _bt_upgrademetapage(Page page)
 
 	/* It must be really a meta page of upgradable version */
 	Assert(metaopaque->btpo_flags & BTP_META);
-	Assert(metad->btm_version < BTREE_VERSION);
+	Assert(metad->btm_version < BTREE_NOVAC_VERSION);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 
 	/* Set version number and fill extra fields added into version 3 */
-	metad->btm_version = BTREE_VERSION;
+	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 
@@ -107,43 +110,79 @@ _bt_upgrademetapage(Page page)
 }
 
 /*
- * Cache metadata from meta page to rel->rd_amcache.
+ * Cache metadata from input meta page to rel->rd_amcache.
  */
 static void
-_bt_cachemetadata(Relation rel, BTMetaPageData *metad)
+_bt_cachemetadata(Relation rel, BTMetaPageData *input)
 {
+	BTMetaPageData *cached_metad;
+
 	/* We assume rel->rd_amcache was already freed by caller */
 	Assert(rel->rd_amcache == NULL);
 	rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
 										 sizeof(BTMetaPageData));
 
-	/*
-	 * Meta page should be of supported version (should be already checked by
-	 * caller).
-	 */
-	Assert(metad->btm_version >= BTREE_MIN_VERSION &&
-		   metad->btm_version <= BTREE_VERSION);
+	/* Meta page should be of supported version */
+	Assert(input->btm_version >= BTREE_MIN_VERSION &&
+		   input->btm_version <= BTREE_VERSION);
 
-	if (metad->btm_version == BTREE_VERSION)
+	cached_metad = (BTMetaPageData *) rel->rd_amcache;
+	if (input->btm_version >= BTREE_NOVAC_VERSION)
 	{
-		/* Last version of meta-data, no need to upgrade */
-		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		/* Version with compatible meta-data, no need to upgrade */
+		memcpy(cached_metad, input, sizeof(BTMetaPageData));
 	}
 	else
 	{
-		BTMetaPageData *cached_metad = (BTMetaPageData *) rel->rd_amcache;
-
 		/*
 		 * Upgrade meta-data: copy available information from meta-page and
 		 * fill new fields with default values.
+		 *
+		 * Note that we cannot upgrade to version 4+ without a REINDEX, since
+		 * extensive on-disk changes are required.
 		 */
-		memcpy(rel->rd_amcache, metad, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
-		cached_metad->btm_version = BTREE_VERSION;
+		memcpy(cached_metad, input, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
+		cached_metad->btm_version = BTREE_NOVAC_VERSION;
 		cached_metad->btm_oldest_btpo_xact = InvalidTransactionId;
 		cached_metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	}
 }
 
+/*
+ * Get metadata from share-locked buffer containing metapage, while performing
+ * standard sanity checks.  Sanity checks here must match _bt_getroot().
+ */
+static BTMetaPageData *
+_bt_getmeta(Relation rel, Buffer metabuf)
+{
+	Page		metapg;
+	BTPageOpaque metaopaque;
+	BTMetaPageData *metad;
+
+	metapg = BufferGetPage(metabuf);
+	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
+	metad = BTPageGetMeta(metapg);
+
+	/* sanity-check the metapage */
+	if (!P_ISMETA(metaopaque) ||
+		metad->btm_magic != BTREE_MAGIC)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("index \"%s\" is not a btree",
+						RelationGetRelationName(rel))));
+
+	if (metad->btm_version < BTREE_MIN_VERSION ||
+		metad->btm_version > BTREE_VERSION)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("version mismatch in index \"%s\": file version %d, "
+						"current version %d, minimal supported version %d",
+						RelationGetRelationName(rel),
+						metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+
+	return metad;
+}
+
 /*
  *	_bt_update_meta_cleanup_info() -- Update cleanup-related information in
  *									  the metapage.
@@ -167,7 +206,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	metad = BTPageGetMeta(metapg);
 
 	/* outdated version of metapage always needs rewrite */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		needsRewrite = true;
 	else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
 			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
@@ -186,7 +225,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* update cleanup-related information */
@@ -202,6 +241,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = metad->btm_root;
 		md.level = metad->btm_level;
 		md.fastroot = metad->btm_fastroot;
@@ -376,7 +417,7 @@ _bt_getroot(Relation rel, int access)
 		START_CRIT_SECTION();
 
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 
 		metad->btm_root = rootblkno;
@@ -400,6 +441,8 @@ _bt_getroot(Relation rel, int access)
 			XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
 			XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			md.version = metad->btm_version;
 			md.root = rootblkno;
 			md.level = 0;
 			md.fastroot = rootblkno;
@@ -595,37 +638,12 @@ _bt_getrootheight(Relation rel)
 {
 	BTMetaPageData *metad;
 
-	/*
-	 * We can get what we need from the cached metapage data.  If it's not
-	 * cached yet, load it.  Sanity checks here must match _bt_getroot().
-	 */
 	if (rel->rd_amcache == NULL)
 	{
 		Buffer		metabuf;
-		Page		metapg;
-		BTPageOpaque metaopaque;
 
 		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
-		metapg = BufferGetPage(metabuf);
-		metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-		metad = BTPageGetMeta(metapg);
-
-		/* sanity-check the metapage */
-		if (!P_ISMETA(metaopaque) ||
-			metad->btm_magic != BTREE_MAGIC)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("index \"%s\" is not a btree",
-							RelationGetRelationName(rel))));
-
-		if (metad->btm_version < BTREE_MIN_VERSION ||
-			metad->btm_version > BTREE_VERSION)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("version mismatch in index \"%s\": file version %d, "
-							"current version %d, minimal supported version %d",
-							RelationGetRelationName(rel),
-							metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+		metad = _bt_getmeta(rel, metabuf);
 
 		/*
 		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
@@ -642,19 +660,70 @@ _bt_getrootheight(Relation rel)
 		 * Cache the metapage data for next time
 		 */
 		_bt_cachemetadata(rel, metad);
-
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
 		_bt_relbuf(rel, metabuf);
 	}
 
+	/* Get cached metapage data */
 	metad = (BTMetaPageData *) rel->rd_amcache;
-	/* We shouldn't have cached it if any of these fail */
-	Assert(metad->btm_magic == BTREE_MAGIC);
-	Assert(metad->btm_version == BTREE_VERSION);
-	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
+/*
+ *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *
+ *		This is used to determine the rules that must be used to descend a
+ *		btree.  Version 4 indexes treat heap TID as a tie-breaker attribute.
+ *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
+ *		performance when inserting a new BTScanInsert-wise duplicate tuple
+ *		among many leaf pages already full of such duplicates.
+ */
+bool
+_bt_heapkeyspace(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			uint32		btm_version = metad->btm_version;
+
+			_bt_relbuf(rel, metabuf);
+			return btm_version > BTREE_NOVAC_VERSION;
+		}
+
+		/*
+		 * Cache the metapage data for next time
+		 */
+		_bt_cachemetadata(rel, metad);
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached metapage data */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+
+	return metad->btm_version > BTREE_NOVAC_VERSION;
+}
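
/*
 * A minimal standalone sketch (not from the patch) of the version test that
 * _bt_heapkeyspace() boils down to once the metapage data is cached.  The
 * concrete version numbers (2 = minimum supported, 3 = last !heapkeyspace
 * format, 4 = first heapkeyspace format) are assumptions made for this
 * example rather than values read from nbtree.h.
 */
#include <stdbool.h>

static bool
model_heapkeyspace(unsigned btm_version)
{
	/* only the newest format treats heap TID as a key attribute */
	return btm_version > 3;
}
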
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -1123,11 +1192,12 @@ _bt_is_page_halfdead(Relation rel, BlockNumber blk)
  * right sibling.
  *
  * "child" is the leaf page we wish to delete, and "stack" is a search stack
- * leading to it (approximately).  Note that we will update the stack
- * entry(s) to reflect current downlink positions --- this is essentially the
- * same as the corresponding step of splitting, and is not expected to affect
- * caller.  The caller should initialize *target and *rightsib to the leaf
- * page and its right sibling.
+ * leading to it (in !heapkeyspace indexes it may actually lead to the
+ * leftmost leaf page whose high key matches that of the page to be
+ * deleted).  Note
+ * that we will update the stack entry(s) to reflect current downlink
+ * positions --- this is essentially the same as the corresponding step of
+ * splitting, and is not expected to affect caller.  The caller should
+ * initialize *target and *rightsib to the leaf page and its right sibling.
  *
  * Note: it's OK to release page locks on any internal pages between the leaf
  * and *topparent, because a safe deletion can't become unsafe due to
@@ -1149,8 +1219,10 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
 	BlockNumber leftsib;
 
 	/*
-	 * Locate the downlink of "child" in the parent (updating the stack entry
-	 * if needed)
+	 * Locate the downlink of "child" in the parent, updating the stack entry
+	 * if needed.  This is how !heapkeyspace indexes deal with having
+	 * non-unique high keys in leaf level pages.  Even heapkeyspace indexes
+	 * can have a stale stack due to insertions into the parent.
 	 */
 	stack->bts_btentry = child;
 	pbuf = _bt_getstackbuf(rel, stack);
@@ -1364,7 +1436,8 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 * We need an approximate pointer to the page's parent page.  We
 			 * use the standard search mechanism to search for the page's high
 			 * key; this will give us a link to either the current parent or
-			 * someplace to its left (if there are multiple equal high keys).
+			 * someplace to its left (if there are multiple equal high keys,
+			 * which is possible with !heapkeyspace indexes).
 			 *
 			 * Also check if this is the right-half of an incomplete split
 			 * (see comment above).
@@ -1422,7 +1495,8 @@ _bt_pagedel(Relation rel, Buffer buf)
 
 				/* we need an insertion scan key for the search, so build one */
 				itup_key = _bt_mkscankey(rel, targetkey);
-				/* get stack to leaf page by searching index */
+				/* Search should not locate page with first non-pivot match */
+				itup_key->pivotsearch = true;
 				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
 				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
@@ -1969,7 +2043,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	if (BufferIsValid(metabuf))
 	{
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 		metad->btm_fastroot = rightsib;
 		metad->btm_fastlevel = targetlevel;
@@ -2017,6 +2091,8 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 		{
 			XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			xlmeta.version = metad->btm_version;
 			xlmeta.root = metad->btm_root;
 			xlmeta.level = metad->btm_level;
 			xlmeta.fastroot = metad->btm_fastroot;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..ec2edae850 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -794,7 +794,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 	{
 		/*
 		 * Do cleanup if metapage needs upgrade, because we don't have
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 30be88eb82..1e3c2f638c 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -152,8 +152,12 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link during the second half of a page split -- if the caller
+		 * ends up splitting the child, it usually inserts a new pivot tuple
+		 * for the child's new right sibling immediately after the original
+		 * bts_offset recorded here.  The downlink block will be needed to
+		 * check whether bts_offset remains the position of this same pivot
+		 * tuple.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -251,11 +255,13 @@ _bt_moveright(Relation rel,
 	/*
 	 * When nextkey = false (normal case): if the scan key that brought us to
 	 * this page is > the high key stored on the page, then the page has split
-	 * and we need to move right.  (If the scan key is equal to the high key,
-	 * we might or might not need to move right; have to scan the page first
-	 * anyway.)
+	 * and we need to move right.  (pg_upgrade'd !heapkeyspace indexes could
+	 * have some duplicates to the right as well as the left, but that's
+	 * something that's only ever dealt with on the leaf level, after
+	 * _bt_search has found an initial leaf page.)
 	 *
 	 * When nextkey = true: move right if the scan key is >= page's high key.
+	 * (Note that key.scantid cannot be set in this case.)
 	 *
 	 * The page could even have split more than once, so scan as far as
 	 * needed.
@@ -359,6 +365,9 @@ _bt_binsrch(Relation rel,
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+
 	if (!key->restorebinsrch)
 	{
 		low = P_FIRSTDATAKEY(opaque);
@@ -368,6 +377,7 @@ _bt_binsrch(Relation rel,
 	else
 	{
 		/* Restore result of previous binary search against same page */
+		Assert(!key->heapkeyspace || key->scantid != NULL);
 		Assert(P_ISLEAF(opaque));
 		low = key->low;
 		high = key->stricthigh;
@@ -447,6 +457,7 @@ _bt_binsrch(Relation rel,
 	if (key->savebinsrch)
 	{
 		Assert(isleaf);
+		Assert(key->scantid == NULL);
 		key->low = low;
 		key->stricthigh = stricthigh;
 		key->savebinsrch = false;
@@ -478,10 +489,11 @@ _bt_binsrch(Relation rel,
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
- * scankey.  The actual key value stored (if any, which there probably isn't)
- * does not matter.  This convention allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first key.
- * See backend/access/nbtree/README for details.
+ * scankey.  The actual key value stored is explicitly truncated to 0
+ * attributes (an explicit minus infinity) in version 3+ indexes, but
+ * that isn't relied upon.  This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key.  See backend/access/nbtree/README for details.
  *----------
  */
 int32
@@ -493,10 +505,14 @@ _bt_compare(Relation rel,
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	ItemPointer heapTid;
 	ScanKey		scankey;
+	int			ncmpkey;
+	int			ntupatts;
 
-	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(key->heapkeyspace || key->scantid == NULL);
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
@@ -506,6 +522,7 @@ _bt_compare(Relation rel,
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -519,8 +536,10 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
+	ncmpkey = Min(ntupatts, key->keysz);
+	Assert(key->heapkeyspace || ncmpkey == key->keysz);
 	scankey = key->scankeys;
-	for (int i = 1; i <= key->keysz; i++)
+	for (int i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -571,8 +590,77 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * All non-truncated attributes (other than heap TID) were found to be
+	 * equal.  Treat truncated attributes as minus infinity when scankey has a
+	 * key attribute value that would otherwise be compared directly.
+	 *
+	 * Note: it doesn't matter if ntupatts includes non-key attributes;
+	 * scankey won't, so explicitly excluding non-key attributes isn't
+	 * necessary.
+	 */
+	if (key->keysz > ntupatts)
+		return 1;
+
+	/*
+	 * Use the heap TID attribute and scantid to try to break the tie.  The
+	 * rules are the same as any other key attribute -- only the
+	 * representation differs.
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (key->scantid == NULL)
+	{
+		/*
+		 * Most searches have a scankey that is considered greater than a
+		 * truncated pivot tuple if and when the scankey has equal values for
+		 * attributes up to and including the least significant untruncated
+		 * attribute in tuple.
+		 *
+		 * For example, if an index has the minimum two attributes (single
+		 * user key attribute, plus heap TID attribute), and a page's high key
+		 * is ("foo", -inf), and scankey is ("foo", <omitted>), the search
+		 * will not descend to the page to the left.  The search will descend
+		 * right instead.  The truncated attribute in pivot tuple means that
+		 * all non-pivot tuples on the page to the left must be strictly <
+		 * "foo", so it isn't necessary to visit it.  In other words, caller
+		 * doesn't have to descend to the left because it isn't interested in
+		 * a match that has a heap TID value of -inf.
+		 *
+		 * However, some callers (pivotsearch callers) actually require that
+		 * we descend left when this happens.  Minus infinity is treated as a
+		 * possible match for omitted scankey attributes.  This is useful for
+		 * page deletion, which needs to relocate a leaf page using its high
+		 * key, rather than relocating its right sibling (the right sibling is
+		 * the first page a non-pivot match can be found on).
+		 *
+		 * Note: the heap TID part of this test ensures that scankey is being
+		 * compared to a pivot tuple with one or more truncated key attributes
+		 * (often though not necessarily just the heap TID attribute).
+		 *
+		 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+		 * left here, since they have no heap TID attribute (and cannot have
+		 * truncated attributes in any case).  They must always be prepared to
+		 * deal with matches on both sides of the pivot once the leaf level is
+		 * reached.
+		 */
+		if (key->heapkeyspace && !key->pivotsearch &&
+			key->keysz == ntupatts && heapTid == NULL)
+			return 1;
+
+		/* All provided scankey arguments found to be equal */
+		return 0;
+	}
+
+	/*
+	 * Treat truncated heap TID as minus infinity, since scankey has a key
+	 * attribute value (scantid) that would otherwise be compared directly
+	 */
+	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+	if (heapTid == NULL)
+		return 1;
+
+	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+	return ItemPointerCompare(key->scantid, heapTid);
 }
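
/*
 * A minimal standalone sketch (not from the patch) of the comparison rules
 * above, using plain ints for key attributes and invented types/names
 * (ModelTid, ModelTuple, model_compare).  Truncated trailing attributes and
 * a truncated heap TID compare as "minus infinity"; the heap TID only
 * participates when the caller supplies a scantid.  The return convention
 * follows _bt_compare(): <0, 0, >0 as the scankey is <, =, > the tuple.
 * The !heapkeyspace legacy behavior is not modelled.
 */
#include <stdbool.h>
#include <stddef.h>

typedef struct ModelTid
{
	unsigned	block;
	unsigned	offset;
} ModelTid;

typedef struct ModelTuple
{
	int			natts;			/* key attributes not truncated away */
	int			keys[4];		/* keys[0..natts-1] are valid */
	bool		hastid;			/* false for pivots with truncated TID */
	ModelTid	tid;
} ModelTuple;

static int
model_tid_compare(const ModelTid *a, const ModelTid *b)
{
	if (a->block != b->block)
		return a->block < b->block ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

static int
model_compare(const int *scankeys, int keysz, const ModelTid *scantid,
			  bool pivotsearch, const ModelTuple *tup)
{
	int			ncmpkey = keysz < tup->natts ? keysz : tup->natts;

	for (int i = 0; i < ncmpkey; i++)
	{
		if (scankeys[i] != tup->keys[i])
			return scankeys[i] < tup->keys[i] ? -1 : 1;
	}

	/* truncated key attributes are minus infinity: scankey is larger */
	if (keysz > tup->natts)
		return 1;

	if (scantid == NULL)
	{
		/*
		 * Ordinary (non-pivotsearch) callers with a full-length scankey
		 * treat an all-equal pivot whose heap TID is truncated as smaller,
		 * so the search descends to the right of it.
		 */
		if (!pivotsearch && keysz == tup->natts && !tup->hastid)
			return 1;
		return 0;				/* all supplied attributes are equal */
	}

	/* heap TID tiebreaker; a truncated heap TID is also minus infinity */
	if (!tup->hastid)
		return 1;
	return model_tid_compare(scantid, &tup->tid);
}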
 
 /*
@@ -1089,7 +1177,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/* Initialize remaining insertion scan key fields */
 	inskey.savebinsrch = inskey.restorebinsrch = false;
 	inskey.low = inskey.stricthigh = InvalidOffsetNumber;
+	inskey.heapkeyspace = _bt_heapkeyspace(rel);
 	inskey.nextkey = nextkey;
+	inskey.pivotsearch = false;
+	inskey.scantid = NULL;
 	inskey.keysz = keysCount;
 
 	/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 759859c302..67cdb44cf5 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -746,6 +746,7 @@ _bt_sortaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -799,8 +800,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -817,27 +816,21 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	itupsz = MAXALIGN(itupsz);
 
 	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itupsz doesn't
-	 * include the ItemId.
+	 * Check whether the item can fit on a btree page at all.
 	 *
-	 * NOTE: similar code appears in _bt_insertonpg() to defend against
-	 * oversize items being inserted into an already-existing index. But
-	 * during creation of an index, we don't go through there.
+	 * Every newly built index will treat heap TID as part of the keyspace,
+	 * which imposes the requirement that new high keys must occasionally have
+	 * a heap TID appended within _bt_truncate().  That may leave a new pivot
+	 * tuple one or two MAXALIGN() quantums larger than the original first
+	 * right tuple it's derived from.  v4 deals with the problem by decreasing
+	 * the limit on the size of tuples inserted on the leaf level by the same
+	 * small amount.  Enforce the new v4+ limit on the leaf level, and the old
+	 * limit on internal levels, since pivot tuples may need to make use of
+	 * the reserved space.  This should never fail on internal pages.
 	 */
-	if (itupsz > BTMaxItemSize(npage))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itupsz, BTMaxItemSize(npage),
-						RelationGetRelationName(wstate->index)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(wstate->heap,
-									RelationGetRelationName(wstate->index))));
+	if (unlikely(itupsz > BTMaxItemSize(npage)))
+		_bt_check_third_page(wstate->index, wstate->heap,
+							 state->btps_level == 0, npage, itup);
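
/*
 * A minimal standalone sketch (not from the patch) of the arithmetic behind
 * reserving room for a pivot heap TID on the leaf level.  All of the sizes
 * below (8 kB block, 24-byte page header, 4-byte line pointer, 16-byte
 * btree special space, 6-byte item pointer, 8-byte MAXALIGN boundary) are
 * assumptions made for the example, not values taken from the patch.
 */
#include <stdio.h>

#define MODEL_BLCKSZ	8192
#define MODEL_ALIGN(x)	(((x) + 7) & ~(size_t) 7)

int
main(void)
{
	size_t		pagehdr = 24;	/* page header */
	size_t		linep = 4;		/* one line pointer */
	size_t		special = 16;	/* btree special space */
	size_t		itemptr = 6;	/* heap TID (ItemPointerData) */
	size_t		oldmax;
	size_t		leafmax;

	/* "fit three items on every page": at most 1/3 of the usable space */
	oldmax = (MODEL_BLCKSZ - MODEL_ALIGN(pagehdr + 3 * linep) -
			  MODEL_ALIGN(special)) / 3;
	/* leaf tuples additionally leave room for an appended heap TID */
	leafmax = oldmax - MODEL_ALIGN(itemptr);

	printf("internal-level max %zu, leaf-level max %zu\n", oldmax, leafmax);
	/* with these assumptions: 2712 and 2704 */
	return 0;
}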
 
 	/*
 	 * Check to see if page is "full".  It's definitely full if the item won't
@@ -883,24 +876,35 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
+			IndexTuple	lastleft;
 			IndexTuple	truncated;
 			Size		truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks
 			 * in internal pages are either negative infinity items, or get
 			 * their contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it more
+			 * likely that _bt_truncate() can truncate away more attributes,
+			 * whereas the split point passed to _bt_split() is chosen much
+			 * more delicately.  Suffix truncation is mostly useful because it
+			 * improves space utilization for workloads with random
+			 * insertions.  It doesn't seem worthwhile to add logic for
+			 * choosing a split point here for a benefit that is bound to be
+			 * much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
 			 * original high key, and add our own truncated high key at the
-			 * same offset.
+			 * same offset.  It's okay if the truncated tuple is slightly
+			 * larger due to containing a heap TID value, since this case is
+			 * known to _bt_check_third_page(), which reserves space.
 			 *
 			 * Note that the page layout won't be changed very much.  oitup is
 			 * already located at the physical beginning of tuple space, so we
@@ -908,7 +912,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_truncate(wstate->index, lastleft, oitup,
+									 wstate->inskey);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -927,8 +935,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -973,7 +982,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1032,8 +1041,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1126,6 +1136,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1133,7 +1145,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1150,6 +1161,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all
+				 * keys in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
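
/*
 * A minimal standalone sketch (not from the patch) of the same tiebreak
 * rule, expressed as a qsort() comparator over a simplified record type so
 * that records with equal user keys still sort into a deterministic,
 * physically unique order.  The types and names are invented for the
 * example.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct ModelRecord
{
	int			key;			/* stands in for the indexed attributes */
	unsigned	block;			/* heap TID block number */
	unsigned	offset;			/* heap TID offset number */
} ModelRecord;

static int
model_record_cmp(const void *a, const void *b)
{
	const ModelRecord *ra = (const ModelRecord *) a;
	const ModelRecord *rb = (const ModelRecord *) b;

	if (ra->key != rb->key)
		return ra->key < rb->key ? -1 : 1;
	/* equal keys: fall back on heap TID, as in the merge loop above */
	if (ra->block != rb->block)
		return ra->block < rb->block ? -1 : 1;
	if (ra->offset != rb->offset)
		return ra->offset < rb->offset ? -1 : 1;
	return 0;
}

int
main(void)
{
	ModelRecord recs[] = {
		{42, 7, 3}, {42, 2, 1}, {17, 9, 9}
	};

	qsort(recs, 3, sizeof(ModelRecord), model_record_cmp);
	for (int i = 0; i < 3; i++)
		printf("(%d, (%u,%u))\n", recs[i].key, recs[i].block, recs[i].offset);
	/* prints (17, (9,9)) then (42, (2,1)) then (42, (7,3)) */
	return 0;
}
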
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index acd122aa53..c5ec8e132d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,6 +49,8 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+			   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -56,9 +58,26 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		Result is intended for use with _bt_compare().  Callers that don't
- *		need to fill out the insertion scankey arguments (e.g. they use an own
- *		ad-hoc comparison routine) can pass a NULL index tuple.
+ *		When itup is a non-pivot tuple, the returned insertion scan key is
+ *		suitable for finding a place for it to go on the leaf level.  Pivot
+ *		tuples can be used to relocate leaf page with matching high key, but
+ *		tuples can be used to relocate the leaf page with a matching high
+ *		key, but then the caller needs to set the scan key's pivotsearch
+ *		field to true.  This allows the caller to search for a leaf page
+ *		with a matching high key,
+ *		might appear on.
+ *
+ *		Result is intended for use with _bt_compare() and _bt_truncate().
+ *		Callers that don't need to fill out the insertion scankey arguments
+ *		(e.g. they use their own ad-hoc comparison routine, or only need a
+ *		scankey for _bt_truncate()) can pass a NULL index tuple.  The
+ *		scankey will be initialized as if an "all truncated" pivot tuple
+ *		was passed instead.
+ *
+ *		Note that we may occasionally have to share lock the metapage to
+ *		determine whether or not the keys in the index are expected to be
+ *		unique (i.e. if this is a "heapkeyspace" index).  We assume a
+ *		heapkeyspace index when caller passes a NULL tuple, allowing index
+ *		build callers to avoid accessing the non-existent metapage.
  */
 BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
@@ -79,15 +98,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
-	 * We'll execute search using scan key constructed on key columns. Non-key
-	 * (INCLUDE index) columns are always omitted from scan keys.
+	 * We'll execute search using scan key constructed on key columns.
+	 * Truncated attributes and non-key attributes are omitted from the final
+	 * scan key.
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
+	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
 	key->savebinsrch = key->restorebinsrch = false;
 	key->low = key->stricthigh = InvalidOffsetNumber;
 	key->nextkey = false;
+	key->pivotsearch = false;
 	key->keysz = Min(indnkeyatts, tupnatts);
+	key->scantid = key->heapkeyspace && itup ?
+		BTreeTupleGetHeapTID(itup) : NULL;
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -103,9 +127,9 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
 
 		/*
-		 * Key arguments built when caller provides no tuple are
-		 * defensively represented as NULL values.  They should never be
-		 * used.
+		 * Key arguments built from truncated attributes (or when caller
+		 * provides no tuple) are defensively represented as NULL values. They
+		 * should never be used.
 		 */
 		if (i < tupnatts)
 			arg = index_getattr(itup, i + 1, itupdesc, &null);
@@ -2043,38 +2067,234 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that takes up more
+ * space than firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
  * Note that returned tuple's t_tid offset will hold the number of attributes
  * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * should only change truncated tuple's downlink.  Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID is appended) the size of the returned
+ * tuple is the size of the first right tuple plus an additional MAXALIGN()'d
+ * item pointer.  This guarantee is important, since callers need to stay
+ * under the 1/3 of a page restriction on tuple size.  If this routine is ever
+ * taught to truncate within an attribute/datum, it will need to avoid
+ * returning an enlarged tuple to caller when truncation + TOAST compression
+ * ends up enlarging the final datum.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			 BTScanInsert itup_key)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int16		natts = IndexRelationGetNumberOfAttributes(rel);
+	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+	IndexTuple	pivot;
+	ItemPointer pivotheaptid;
+	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples.  It's never okay to
+	 * truncate a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be kept in truncated tuple */
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
 
-	return truncated;
+#ifdef DEBUG_NO_TRUNCATE
+	/* Force truncation to be ineffective for testing purposes */
+	keepnatts = nkeyatts + 1;
+#endif
+
+	if (keepnatts <= natts)
+	{
+		IndexTuple	tidpivot;
+
+		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+
+		/*
+		 * If there is a distinguishing key attribute within new pivot tuple,
+		 * there is no need to add an explicit heap TID attribute
+		 */
+		if (keepnatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			return pivot;
+		}
+
+		/*
+		 * Only truncation of non-key attributes was possible, since key
+		 * attributes are all equal.  It's necessary to add a heap TID
+		 * attribute to the new pivot tuple.
+		 */
+		Assert(natts != nkeyatts);
+		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		/* cannot leak memory here */
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * We have to use heap TID as a unique-ifier in the new pivot tuple, since
+	 * no non-TID key attribute in the right item readily distinguishes the
+	 * right side of the split from the left side.  Use enlarged space that
+	 * holds a copy of first right tuple; place a heap TID value within the
+	 * extra space that remains at the end.
+	 *
+	 * nbtree conceptualizes this case as an inability to truncate away any
+	 * key attribute.  We must use an alternative representation of heap TID
+	 * within pivots because heap TID is only treated as an attribute within
+	 * nbtree (e.g., there is no pg_attribute entry).
+	 */
+	Assert(itup_key->heapkeyspace);
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/*
+	 * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+	 * consider suffix truncation.  It seems like a good idea to follow that
+	 * example in cases where no truncation takes place -- use lastleft's heap
+	 * TID.  (This is also the closest value to negative infinity that's
+	 * legally usable.)
+	 */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  sizeof(ItemPointerData));
+	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+
+	/*
+	 * Lehman and Yao require that the downlink to the right page, which is to
+	 * be inserted into the parent page in the second phase of a page split, be
+	 * a strict lower bound on items on the right page, and a non-strict upper
+	 * bound for items on the left page.  Assert that heap TIDs follow these
+	 * invariants, since a heap TID value is apparently needed as a
+	 * tiebreaker.
+	 */
+#ifndef DEBUG_NO_TRUNCATE
+	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#else
+
+	/*
+	 * Those invariants aren't guaranteed to hold for lastleft + firstright
+	 * heap TID attribute values when they're considered here only because
+	 * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+	 * needed as a tiebreaker).  DEBUG_NO_TRUNCATE must therefore use a heap
+	 * TID value that always works as a strict lower bound for items to the
+	 * right.  In particular, it must avoid using firstright's leading key
+	 * attribute values along with lastleft's heap TID value when lastleft's
+	 * TID happens to be greater than firstright's TID.
+	 */
+	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+
+	/*
+	 * Pivot heap TID should never be fully equal to firstright.  Note that
+	 * the pivot heap TID will still end up equal to lastleft's heap TID when
+	 * that's the only usable value.
+	 */
+	ItemPointerSetOffsetNumber(pivotheaptid,
+							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#endif
+
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point.  Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in the new pivot tuple.
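+ *
+ * For example, in an index with two key columns, lastleft ('foo', 5) and
+ * firstright ('foo', 6) yield 2, since the second attribute still
+ * distinguishes the two sides of the split; lastleft ('foo', 5) and
+ * firstright ('foo', 5) yield 3, telling the caller that only the heap TID
+ * tie-breaker can distinguish them.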
+ */
+static int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			   BTScanInsert itup_key)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keepnatts;
+	ScanKey		scankey;
+
+	/*
+	 * Be consistent about the representation of BTREE_VERSION 2/3 tuples
+	 * across Postgres versions; don't allow new pivot tuples to have
+	 * truncated key attributes there.  _bt_compare() treats truncated key
+	 * attributes as having the value minus infinity, which would break
+	 * searches within !heapkeyspace indexes.
+	 */
+	if (!itup_key->heapkeyspace)
+	{
+		Assert(nkeyatts != IndexRelationGetNumberOfAttributes(rel));
+		return nkeyatts;
+	}
+
+	scankey = itup_key->scankeys;
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+											scankey->sk_collation,
+											datum1,
+											datum2)) != 0)
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
 }
 
 /*
@@ -2088,15 +2308,17 @@ _bt_nonkey_truncate(Relation rel, IndexTuple itup)
  * preferred to calling here.  That's usually more convenient, and is always
  * more explicit.  Call here instead when offnum's tuple may be a negative
  * infinity tuple that uses the pre-v11 on-disk representation, or when a low
- * context check is appropriate.
+ * context check is appropriate.  This routine is as strict as possible about
+ * what is expected on each version of btree.
  */
 bool
-_bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
+_bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 {
 	int16		natts = IndexRelationGetNumberOfAttributes(rel);
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2116,16 +2338,26 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Leaf tuples that are not the page high key (non-pivot tuples)
-			 * should never be truncated
+			 * Non-pivot tuples currently never use alternative heap TID
+			 * representation -- even those within heapkeyspace indexes
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+				return false;
+
+			/*
+			 * Leaf tuples that are not the page high key (non-pivot tuples)
+			 * should never be truncated.  (Note that tupnatts must have been
+			 * inferred, rather than coming from an explicit on-disk
+			 * representation.)
+			 */
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2135,8 +2367,15 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 */
 			Assert(!P_RIGHTMOST(opaque));
 
-			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			/*
+			 * !heapkeyspace high key tuple contains only key attributes. Note
+			 * that tupnatts will only have been explicitly represented in
+			 * !heapkeyspace indexes that happen to have non-key attributes.
+			 */
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2148,7 +2387,11 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * its high key) is its negative infinity tuple.  Negative
 			 * infinity tuples are always truncated to zero attributes.  They
 			 * are a particular kind of pivot tuple.
-			 *
+			 */
+			if (heapkeyspace)
+				return tupnatts == 0;
+
+			/*
 			 * The number of attributes won't be explicitly represented if the
 			 * negative infinity tuple was generated during a page split that
 			 * occurred with a version of Postgres before v11.  There must be
@@ -2159,18 +2402,109 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == 0 ||
+			return tupnatts == 0 ||
 				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
 		{
 			/*
-			 * Tuple contains only key attributes despite on is it page high
-			 * key or not
+			 * !heapkeyspace downlink tuple with separator key contains only
+			 * key attributes.  Note that tupnatts will only have been
+			 * explicitly represented in !heapkeyspace indexes that happen to
+			 * have non-key attributes.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 
 	}
+
+	/* Handle heapkeyspace pivot tuples (excluding minus infinity items) */
+	Assert(heapkeyspace);
+
+	/*
+	 * Explicit representation of the number of attributes is mandatory with
+	 * heapkeyspace index pivot tuples, regardless of whether or not there are
+	 * non-key attributes.
+	 */
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+
+	/*
+	 * Heap TID is a tie-breaker key attribute, so it cannot be untruncated
+	 * when any other key attribute is truncated
+	 */
+	if (BTreeTupleGetHeapTID(itup) != NULL && tupnatts != nkeyatts)
+		return false;
+
+	/*
+	 * Pivot tuple must have at least one untruncated key attribute (minus
+	 * infinity pivot tuples are the only exception).  Pivot tuples can never
+	 * represent that there is a value present for a key attribute that
+	 * exceeds pg_index.indnkeyatts for the index.
+	 */
+	return tupnatts > 0 && tupnatts <= nkeyatts;
+}
+
+/*
+ * _bt_check_third_page() -- check whether tuple fits on a btree page at all.
+ *
+ * We actually need to be able to fit three items on every page, so restrict
+ * any one item to 1/3 the per-page available space.  Note that itemsz should
+ * not include the ItemId overhead.
+ *
+ * It might be useful to apply TOAST methods rather than throw an error here.
+ * Using out of line storage would break assumptions made by suffix truncation
+ * and by contrib/amcheck, though.
+ */
+void
+_bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
+					 Page page, IndexTuple newtup)
+{
+	Size		itemsz;
+	BTPageOpaque opaque;
+
+	itemsz = MAXALIGN(IndexTupleSize(newtup));
+
+	/* Double check item size against limit */
+	if (itemsz <= BTMaxItemSize(page))
+		return;
+
+	/*
+	 * Tuple is probably too large to fit on the page, but it's possible that
+	 * the index uses version 2 or version 3, or that the page is an internal
+	 * page, in which case a slightly higher limit applies.
+	 */
+	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
+		return;
+
+	/*
+	 * Internal page insertions cannot fail here, because that would mean that
+	 * an earlier leaf level insertion that should have failed didn't
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (!P_ISLEAF(opaque))
+		elog(ERROR, "cannot insert oversized tuple of size %zu on internal page of index \"%s\"",
+			 itemsz, RelationGetRelationName(rel));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+			 errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
+					itemsz,
+					needheaptidspace ? BTREE_VERSION : BTREE_NOVAC_VERSION,
+					needheaptidspace ? BTMaxItemSize(page) :
+					BTMaxItemSizeNoHeapTid(page),
+					RelationGetRelationName(rel)),
+			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+					   ItemPointerGetBlockNumber(&newtup->t_tid),
+					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   RelationGetRelationName(heap)),
+			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+					 "Consider a function index of an MD5 hash of the value, "
+					 "or use full text indexing."),
+			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index b0666b42df..876ff0c40f 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -103,7 +103,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 
 	md = BTPageGetMeta(metapg);
 	md->btm_magic = BTREE_MAGIC;
-	md->btm_version = BTREE_VERSION;
+	md->btm_version = xlrec->version;
 	md->btm_root = xlrec->root;
 	md->btm_level = xlrec->level;
 	md->btm_fastroot = xlrec->fastroot;
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -284,6 +268,8 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		OffsetNumber off;
 		IndexTuple	newitem = NULL;
 		Size		newitemsz = 0;
+		IndexTuple	left_hikey = NULL;
+		Size		left_hikeysz = 0;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 8d5c6ae0ab..fcac0cd8a9 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index f97a82ae7b..5b7637883e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4057,9 +4057,10 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required for
+	 * btree indexes, since heap TID is treated as an implicit last key
+	 * attribute in order to ensure that all keys in the index are physically
+	 * unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
@@ -4076,6 +4077,9 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
@@ -4128,6 +4132,9 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7870f282d2..197bd6cf52 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -112,18 +112,45 @@ typedef struct BTMetaPageData
 #define BTPageGetMeta(p) \
 	((BTMetaPageData *) PageGetContents(p))
 
+/*
+ * The current Btree version is 4.  That's what you'll get when you create
+ * a new index.
+ *
+ * Btree version 3 was used in PostgreSQL v11.  It is mostly the same as
+ * version 4, but heap TIDs were not part of the keyspace.  Index tuples
+ * with duplicate keys could be stored in any order.  We continue to
+ * support reading and writing Btree versions 2 and 3, so that they don't
+ * need to be immediately re-indexed at pg_upgrade.  In order to get the
+ * new heapkeyspace semantics, however, a REINDEX is needed.
+ *
+ * Btree version 2 is mostly the same as version 3.  There are two new
+ * fields in the metapage that were introduced in version 3.  A version 2
+ * metapage will be automatically upgraded to version 3 on the first
+ * insert to it.  INCLUDE indexes cannot use version 2.
+ */
 #define BTREE_METAPAGE	0		/* first page is meta */
-#define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
+#define BTREE_MAGIC		0x053162	/* magic number in metapage */
+#define BTREE_VERSION	4		/* current version number */
 #define BTREE_MIN_VERSION	2	/* minimal supported version number */
+#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_truncate() will need to enlarge
+ * an index tuple to make space for a tie-breaker heap TID
+ * attribute, which we account for here.
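+ * (With the default 8KB block size and 8 byte MAXALIGN, this works out to
+ * a limit of 2704 bytes per index tuple, versus 2712 bytes when the heap
+ * TID space doesn't have to be reserved.)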
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*sizeof(ItemPointerData)) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeNoHeapTid(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
@@ -187,38 +214,73 @@ typedef struct BTMetaPageData
 #define P_FIRSTDATAKEY(opaque)	(P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)
 
 /*
+ *
+ * Notes on B-Tree tuple format, and key and non-key attributes:
+ *
  * INCLUDE B-Tree indexes have non-key attributes.  These are extra
  * attributes that may be returned by index-only scans, but do not influence
  * the order of items in the index (formally, non-key attributes are not
  * considered to be part of the key space).  Non-key attributes are only
  * present in leaf index tuples whose item pointers actually point to heap
- * tuples.  All other types of index tuples (collectively, "pivot" tuples)
- * only have key attributes, since pivot tuples only ever need to represent
- * how the key space is separated.  In general, any B-Tree index that has
- * more than one level (i.e. any index that does not just consist of a
- * metapage and a single leaf root page) must have some number of pivot
- * tuples, since pivot tuples are used for traversing the tree.
+ * tuples (non-pivot tuples).
  *
- * We store the number of attributes present inside pivot tuples by abusing
- * their item pointer offset field, since pivot tuples never need to store a
- * real offset (downlinks only need to store a block number).  The offset
- * field only stores the number of attributes when the INDEX_ALT_TID_MASK
- * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * Non-pivot tuple format:
  *
- * The 12 least significant offset bits are used to represent the number of
- * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
- * for future use (BT_RESERVED_OFFSET_MASK bits). BT_N_KEYS_OFFSET_MASK should
- * be large enough to store any number <= INDEX_MAX_KEYS.
+ *  t_tid | t_info | key values | INCLUDE columns, if any
+ *
+ * t_tid points to the heap TID, which is a tie-breaker key column as of
+ * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
+ * set.
+ *
+ * All other types of index tuples (collectively, "pivot" tuples) only
+ * have key columns, since pivot tuples only ever need to represent how
+ * the key space is separated.  In general, any B-Tree index that has more
+ * than one level (i.e. any index that does not just consist of a metapage
+ * and a single leaf root page) must have some number of pivot tuples,
+ * since pivot tuples are used for traversing the tree.  Suffix truncation
+ * can omit trailing key columns when a new pivot is formed, which makes
+ * minus infinity their logical value.  Since BTREE_VERSION 4 indexes
+ * treat heap TID as a trailing key column that ensures that all index
+ * tuples are unique, it is necessary to represent heap TID as a trailing
+ * key column in pivot tuples, though very often this can be truncated
+ * away, just like any other key column. (Actually, the heap TID is
+ * omitted rather than truncated, since its representation is different to
+ * the non-pivot representation.)
+ *
+ * Pivot tuple format:
+ *
+ *  t_tid | t_info | key values | [heap TID]
+ *
+ * We store the number of columns present inside pivot tuples by abusing
+ * their t_tid offset field, since pivot tuples never need to store a real
+ * offset (downlinks only need to store a block number in t_tid).  The
+ * offset field only stores the number of columns/attributes when the
+ * INDEX_ALT_TID_MASK bit is set, which doesn't count the trailing heap
+ * TID column sometimes stored in pivot tuples -- that's represented by
+ * the presence of BT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in t_info
+ * is always set on BTREE_VERSION 4.  BT_HEAP_TID_ATTR can only be set on
+ * BTREE_VERSION 4.
+ *
+ * In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set in
+ * pivot tuples.  In that case, the number of columns in the pivot tuple is
+ * implicitly the same as the number of key columns in the index.  The flag
+ * is never set in version 2 indexes, which predate the introduction of
+ * INCLUDE indexes.  (Only pivot tuples that actually had attributes
+ * truncated away explicitly represent the number of key columns on version
+ * 3, whereas all pivot tuples are formed using truncation on version 4.)
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of columns in INDEX_ALT_TID_MASK tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
+ * number of columns/attributes <= INDEX_MAX_KEYS.
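+ *
+ * For example, a pivot tuple with two untruncated key columns and a
+ * trailing heap TID representation stores (2 | BT_HEAP_TID_ATTR), i.e.
+ * 0x1002, in its t_tid offset field.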
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
+
+/* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +303,16 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.  Note also that the number of key attributes must be
+ * explicitly represented in heapkeyspace pivot tuples.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +321,52 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * Only BTREE_VERSION 4 indexes treat heap TID as a tie-breaker key attribute.
+ * This macro can be used with tuples from indexes that use earlier versions,
+ * even though the result won't be meaningful.  The expectation is that higher
+ * level code will ensure that the result is never used, for example by never
+ * providing a scantid that the result is compared against.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
+ * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
+ * (since all pivot tuples must have it set as of BTREE_VERSION 4).  We
+ * currently assume that a tuple with INDEX_ALT_TID_MASK set is a pivot
+ * tuple within heapkeyspace indexes (and that a tuple without it set must
+ * be a non-pivot tuple), though non-pivot tuples may also use the
+ * INDEX_ALT_TID_MASK representation in the future; they'll probably also
+ * contain a heap TID at the end of the tuple if they do.
+ * pg_upgrade'd !heapkeyspace indexes only set INDEX_ALT_TID_MASK in pivot
+ * tuples that actually originated with the truncation of one or more
+ * attributes.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   sizeof(ItemPointerData)) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -325,26 +431,45 @@ typedef BTStackData *BTStack;
  * be confused with a search scankey).  It's used to descend a B-Tree using
  * _bt_search.
  *
+ * heapkeyspace indicates whether we expect all keys in the index to be
+ * unique by treating heap TID as a tie-breaker attribute (i.e. the index
+ * is BTREE_VERSION 4+).  scantid should never be set when the index is not
+ * a heapkeyspace index.
+ *
  * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
  * locate the first item >= scankey.  When nextkey is true, they will locate
  * the first item > scan key.
  *
- * keysz is the number of insertion scankeys present.
+ * pivotsearch is set to true by callers that want to relocate a leaf page
+ * using a scankey built from the leaf page's high key.  Most callers set
+ * this to false.
  *
- * scankeys is an array of scan key entries for attributes that are compared.
- * During insertion, there must be a scan key for every attribute, but when
- * starting a regular index scan some can be omitted.  The array is used as a
- * flexible array member, though it's sized in a way that makes it possible to
- * use stack allocations.  See nbtree/README for full details.
+ * scantid is the heap TID that is used as a final tie-breaker attribute,
+ * which may be set to NULL to indicate its absence.  When inserting new
+ * tuples, it must be set, since every tuple in the tree unambiguously belongs
+ * in one exact position, even when there are entries in the tree that are
+ * considered duplicates by external code.  Unique insertions set scantid only
+ * after unique checking indicates that it's safe to insert.  Despite the
+ * representational difference, scantid is just another insertion scankey to
+ * routines like _bt_search.
+ *
+ * keysz is the number of insertion scankeys present, not including scantid.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared
+ * before scantid (user-visible attributes).  During insertion, there must be
+ * a scan key for every attribute, but when starting a regular index scan some
+ * can be omitted.  The array is used as a flexible array member, though it's
+ * sized in a way that makes it possible to use stack allocations.  See
+ * nbtree/README for full details.
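+ *
+ * Sketch of how an insertion caller might fill in and use the struct (see
+ * _bt_doinsert() for the real code path; unique insertions delay setting
+ * scantid until unique checking is done):
+ *
+ *		itup_key = _bt_mkscankey(rel, itup);
+ *		itup_key->scantid = &itup->t_tid;
+ *		stack = _bt_search(rel, itup_key, &buf, BT_WRITE, NULL);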
  */
 
 typedef struct BTScanInsertData
 {
 	/*
 	 * Mutable state used by _bt_binsrch to inexpensively repeat a binary
-	 * search on the leaf level.  Only used for insertions where
-	 * _bt_check_unique is called.  See _bt_binsrch and _bt_findinsertloc for
-	 * details.
+	 * search on the leaf level when only scantid has changed.  Only used for
+	 * insertions where _bt_check_unique is called.  See _bt_binsrch and
+	 * _bt_findinsertloc for details.
 	 */
 	bool		savebinsrch;
 	bool		restorebinsrch;
@@ -352,7 +477,10 @@ typedef struct BTScanInsertData
 	OffsetNumber stricthigh;
 
 	/* State used to locate a position at the leaf level */
+	bool		heapkeyspace;
 	bool		nextkey;
+	bool		pivotsearch;	/* Searching for pivot tuple? */
+	ItemPointer scantid;		/* tiebreaker for scankeys */
 	int			keysz;			/* Size of scankeys */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
 } BTScanInsertData;
@@ -582,6 +710,7 @@ extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
+extern bool _bt_heapkeyspace(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -635,8 +764,12 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
-extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+			 IndexTuple firstright, BTScanInsert itup_key);
+extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
+				OffsetNumber offnum);
+extern void _bt_check_third_page(Relation rel, Relation heap,
+					 bool needheaptidspace, Page page, IndexTuple newtup);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index a605851c98..6320a0098f 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,7 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50 and 0x60 are unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -47,6 +46,7 @@
  */
 typedef struct xl_btree_metadata
 {
+	uint32		version;
 	BlockNumber root;
 	uint32		level;
 	BlockNumber fastroot;
@@ -80,27 +80,30 @@ typedef struct xl_btree_insert
  * whole page image.  The left page, however, is handled in the normal
  * incremental-update fashion.
  *
- * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
- * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * Note: XLOG_BTREE_SPLIT_L and XLOG_BTREE_SPLIT_R share this data record.
+ * There are two variants to indicate whether the inserted tuple went into the
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always log the left page high key because suffix
+ * truncation can generate a new leaf high key using user-defined code.  This
+ * is also necessary on internal pages, since the first right item that the
+ * left page's high key was based on will have been truncated to zero
+ * attributes in the right page (the original is unavailable from the right
+ * page).
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * An IndexTuple representing the high key of the left page must follow with
+ * either variant.
  *
  * Backup Blk 1: new right page
  *
- * The right page's data portion contains the right page's tuples in the
- * form used by _bt_restore_page.
+ * The right page's data portion contains the right page's tuples in the form
+ * used by _bt_restore_page.  This includes the new item, if it's the _R
+ * variant.  The right page's tuples also include the right page's high key
+ * with either variant (moved from the left/original page during the split),
+ * unless the split happened to be of the rightmost page on its level, where
+ * there is no high key for the new right page.
  *
  * Backup Blk 2: next block (orig page's rightlink), if any
  * Backup Blk 3: child's left sibling, if non-leaf split
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index b21298a2a6..ff443a476c 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -199,28 +199,22 @@ reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
 --
--- Test B-tree page deletion. In particular, deleting a non-leaf page.
+-- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
--- First create a tree that's at least four levels deep. The text inserted
--- is long and poorly compressible. That way only a few index tuples fit on
--- each page, allowing us to get a tall tree with fewer pages.
+-- First create a tree that's at least three levels deep (i.e. has one level
+-- between the root and leaf levels). The text inserted is long.  It won't be
+-- compressed because we use plain storage in the table.  Only a few index
+-- tuples fit on each internal page, allowing us to get a tall tree with few
+-- pages.  (A tall tree is required to trigger caching.)
+--
+-- The text column must be the leading column in the index, since suffix
+-- truncation would otherwise truncate tuples on internal pages, leaving us
+-- with a short tree.
 create table btree_tall_tbl(id int4, t text);
-create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
-insert into btree_tall_tbl
-  select g, g::text || '_' ||
-          (select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
-from generate_series(1, 100) g;
--- Delete most entries, and vacuum. This causes page deletions.
-delete from btree_tall_tbl where id < 950;
-vacuum btree_tall_tbl;
---
--- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
--- WAL record type). This happens when a "fast root" page is split.
---
--- The vacuum above should've turned the leaf page into a fast root. We just
--- need to insert some rows to cause the fast root page to split.
-insert into btree_tall_tbl (id, t)
-  select g, repeat('x', 100) from generate_series(1, 500) g;
+alter table btree_tall_tbl alter COLUMN t set storage plain;
+create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
+insert into btree_tall_tbl select g, repeat('x', 250)
+from generate_series(1, 130) g;
 --
 -- Test vacuum_cleanup_index_scale_factor
 --
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 5d4eb59a0c..54d3eee197 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3225,11 +3225,22 @@ explain (costs off)
 CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 --
+-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
+-- WAL record type). This happens when a "fast root" page is split.  This
+-- also creates coverage for nbtree FSM page recycling.
+--
+-- The vacuum above should've turned the leaf page into a fast root. We just
+-- need to insert some rows to cause the fast root page to split.
+INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
+--
 -- REINDEX (VERBOSE)
 --
 CREATE TABLE reindex_verbose(id integer primary key);
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8e50f8ffbb..8d31110b87 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 -- doesn't work: grant still exists
 DROP USER regress_dep_user1;
 ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
-DETAIL:  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
+DETAIL:  privileges for table deptest1
 privileges for database regression
-privileges for table deptest1
+owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
 DROP OWNED BY regress_dep_user1;
 DROP USER regress_dep_user1;
 \set VERBOSITY terse
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index 0e32d5c427..ac41419c7b 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -187,9 +187,9 @@ ERROR:  event trigger "regress_event_trigger" does not exist
 -- should fail, regress_evt_user owns some objects
 drop role regress_evt_user;
 ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
-DETAIL:  owner of event trigger regress_event_trigger3
+DETAIL:  owner of user mapping for regress_evt_user on server useless_server
 owner of default privileges on new relations belonging to role regress_evt_user
-owner of user mapping for regress_evt_user on server useless_server
+owner of event trigger regress_event_trigger3
 -- cleanup before next test
 -- these are all OK; the second one should emit a NOTICE
 drop event trigger if exists regress_event_trigger2;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 4d82d3a7e8..9c763ec184 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
 RESET ROLE;
 DROP ROLE regress_test_indirect;                            -- ERROR
 ERROR:  role "regress_test_indirect" cannot be dropped because some objects depend on it
-DETAIL:  owner of server s1
-privileges for foreign-data wrapper foo
+DETAIL:  privileges for foreign-data wrapper foo
+owner of server s1
 \des+
                                                                                  List of foreign servers
  Name |           Owner           | Foreign-data wrapper |                   Access privileges                   |  Type  | Version |             FDW options              | Description 
@@ -1998,9 +1998,9 @@ DROP TABLE temp_parted;
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 ERROR:  role "regress_test_role" cannot be dropped because some objects depend on it
-DETAIL:  privileges for server s4
+DETAIL:  owner of user mapping for regress_test_role on server s6
 privileges for foreign-data wrapper foo
-owner of user mapping for regress_test_role on server s6
+privileges for server s4
 DROP SERVER t1 CASCADE;
 NOTICE:  drop cascades to user mapping for public on server t1
 DROP USER MAPPING FOR regress_test_role SERVER s6;
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index 2e170497c9..bad5199d9e 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3503,8 +3503,8 @@ SELECT refclassid::regclass, deptype
 SAVEPOINT q;
 DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
 ERROR:  role "regress_rls_eve" cannot be dropped because some objects depend on it
-DETAIL:  target of policy p on table tbl1
-privileges for table tbl1
+DETAIL:  privileges for table tbl1
+target of policy p on table tbl1
 ROLLBACK TO q;
 ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
 SAVEPOINT q;
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 2b087be796..19fbfa8b72 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -84,32 +84,23 @@ reset enable_indexscan;
 reset enable_bitmapscan;
 
 --
--- Test B-tree page deletion. In particular, deleting a non-leaf page.
+-- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
 
--- First create a tree that's at least four levels deep. The text inserted
--- is long and poorly compressible. That way only a few index tuples fit on
--- each page, allowing us to get a tall tree with fewer pages.
+-- First create a tree that's at least three levels deep (i.e. has one level
+-- between the root and leaf levels). The text inserted is long.  It won't be
+-- compressed because we use plain storage in the table.  Only a few index
+-- tuples fit on each internal page, allowing us to get a tall tree with few
+-- pages.  (A tall tree is required to trigger caching.)
+--
+-- The text column must be the leading column in the index, since suffix
+-- truncation would otherwise truncate tuples on internal pages, leaving us
+-- with a short tree.
 create table btree_tall_tbl(id int4, t text);
-create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
-insert into btree_tall_tbl
-  select g, g::text || '_' ||
-          (select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
-from generate_series(1, 100) g;
-
--- Delete most entries, and vacuum. This causes page deletions.
-delete from btree_tall_tbl where id < 950;
-vacuum btree_tall_tbl;
-
---
--- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
--- WAL record type). This happens when a "fast root" page is split.
---
-
--- The vacuum above should've turned the leaf page into a fast root. We just
--- need to insert some rows to cause the fast root page to split.
-insert into btree_tall_tbl (id, t)
-  select g, repeat('x', 100) from generate_series(1, 500) g;
+alter table btree_tall_tbl alter COLUMN t set storage plain;
+create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
+insert into btree_tall_tbl select g, repeat('x', 250)
+from generate_series(1, 130) g;
 
 --
 -- Test vacuum_cleanup_index_scale_factor
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 67ecad8dd5..4487421ef3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1146,11 +1146,23 @@ explain (costs off)
 CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 
+--
+-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
+-- WAL record type). This happens when a "fast root" page is split.  This
+-- also creates coverage for nbtree FSM page recycling.
+--
+-- The vacuum above should've turned the leaf page into a fast root. We just
+-- need to insert some rows to cause the fast root page to split.
+INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
+
 --
 -- REINDEX (VERBOSE)
 --
-- 
2.17.1

Attachment: v16-0001-Refactor-nbtree-insertion-scankeys.patch (application/octet-stream)
From 2da03646878c41237ec55af704e31dea850eb616 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 29 Dec 2018 15:34:48 -0800
Subject: [PATCH v16 1/7] Refactor nbtree insertion scankeys.

Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.

Use the new struct to store mutable state about an in-progress binary
search, rather than having _bt_check_unique() callers cache
_bt_binsrch() effort in an ad-hoc manner.  This makes it easy to add a
new optimization: _bt_check_unique() now falls out of its loop
immediately in the common case where it's already clear that there
couldn't possibly be a duplicate.  More importantly, the new
_bt_check_unique() scheme makes it a lot easier to manage cached binary
search effort afterwards, from within _bt_findinsertloc().  This is
needed for the upcoming patch to make nbtree tuples unique by treating
heap TID as a final tie-breaker column.
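
Roughly, the new calling convention inside _bt_doinsert() looks like this
(sketch only; see the diff below for the real signatures and surrounding
logic):

    itup_key = _bt_mkscankey(rel, itup);
    xwait = _bt_check_unique(rel, itup_key, itup, heapRel, buf,
                             checkUnique, &is_unique, &speculativeToken);
    newitemoff = _bt_findinsertloc(rel, itup_key, &buf, checkingunique,
                                   itup, stack, heapRel);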

Based on a suggestion by Andrey Lepikhov.
---
 contrib/amcheck/verify_nbtree.c       |  52 ++--
 src/backend/access/nbtree/README      |  29 ++-
 src/backend/access/nbtree/nbtinsert.c | 354 ++++++++++++++++----------
 src/backend/access/nbtree/nbtpage.c   |  12 +-
 src/backend/access/nbtree/nbtsearch.c | 167 +++++++-----
 src/backend/access/nbtree/nbtsort.c   |   8 +-
 src/backend/access/nbtree/nbtutils.c  |  98 +++----
 src/backend/utils/sort/tuplesort.c    |  16 +-
 src/include/access/nbtree.h           |  61 ++++-
 9 files changed, 458 insertions(+), 339 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 964200a767..053ac9d192 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -126,9 +126,9 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
-static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+static BTScanInsert bt_right_page_check_scankey(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
@@ -138,14 +138,14 @@ static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber upperbound);
 static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber lowerbound);
 static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
+							   BTScanInsert key,
+							   Page nontarget,
 							   OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
 
@@ -837,8 +837,8 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		BTScanInsert skey;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -1029,7 +1029,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
-			ScanKey		rightkey;
+			BTScanInsert	rightkey;
 
 			/* Get item in next/right page */
 			rightkey = bt_right_page_check_scankey(state);
@@ -1081,7 +1081,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, skey, childblock);
 		}
 	}
 
@@ -1110,11 +1110,12 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
+static BTScanInsert
 bt_right_page_check_scankey(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
+	IndexTuple	firstitup;
 	BlockNumber targetnext;
 	Page		rightpage;
 	OffsetNumber nline;
@@ -1302,8 +1303,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * Return first real item scankey.  Note that this relies on right page
 	 * memory remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
+	return _bt_mkscankey(state->rel, firstitup);
 }
 
 /*
@@ -1316,8 +1317,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  * verification this way around is much more practical.
  */
 static void
-bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1422,8 +1423,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1863,13 +1863,12 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
+invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
@@ -1882,13 +1881,12 @@ invariant_leq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
+invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
 	return cmp >= 0;
 }
@@ -1904,14 +1902,12 @@ invariant_geq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							   Page nontarget, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
 	return cmp <= 0;
 }
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index b0b4ab8b76..a295a7a286 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -598,19 +598,22 @@ scankey point to comparison functions that return boolean, such as int4lt.
 There might be more than one scankey entry for a given index column, or
 none at all.  (We require the keys to appear in index column order, but
 the order of multiple keys for a given column is unspecified.)  An
-insertion scankey uses the same array-of-ScanKey data structure, but the
-sk_func pointers point to btree comparison support functions (ie, 3-way
-comparators that return int4 values interpreted as <0, =0, >0).  In an
-insertion scankey there is exactly one entry per index column.  Insertion
-scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
-used to locate the starting point of a scan, as well as for locating the
-place to insert a new index tuple.  (Note: in the case of an insertion
-scankey built from a search scankey, there might be fewer keys than
-index columns, indicating that we have no constraints for the remaining
-index columns.)  After we have located the starting point of a scan, the
-original search scankey is consulted as each index entry is sequentially
-scanned to decide whether to return the entry and whether the scan can
-stop (see _bt_checkkeys()).
+insertion scankey ("BTScanInsert" data structure) uses a similar
+array-of-ScanKey data structure, but the sk_func pointers point to btree
+comparison support functions (ie, 3-way comparators that return int4 values
+interpreted as <0, =0, >0).  In an insertion scankey there is at most one
+entry per index column.  There is also other data about the rules used to
+locate where to begin the scan, such as whether or not the scan is a
+"nextkey" scan.  Insertion scankeys are built within the btree code (eg, by
+_bt_mkscankey()) and are used to locate the starting point of a scan, as
+well as for locating the place to insert a new index tuple.  (Note: in the
+case of an insertion scankey built from a search scankey or built from a
+truncated pivot tuple, there might be fewer keys than index columns,
+indicating that we have no constraints for the remaining index columns.)
+After we have located the starting point of a scan, the original search
+scankey is consulted as each index entry is sequentially scanned to decide
+whether to return the entry and whether the scan can stop (see
+_bt_checkkeys()).
 
 We use term "pivot" index tuples to distinguish tuples which don't point
 to heap tuples, but rather used for tree navigation.  Pivot tuples includes
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 2b18028823..4284a5ff42 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -51,19 +51,19 @@ typedef struct
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
-static TransactionId _bt_check_unique(Relation rel, IndexTuple itup,
-				 Relation heapRel, Buffer buf, OffsetNumber offset,
-				 ScanKey itup_scankey,
+static TransactionId _bt_check_unique(Relation rel, BTScanInsert itup_key,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken);
-static void _bt_findinsertloc(Relation rel,
+static OffsetNumber _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_key,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool checkingunique,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel);
+static bool _bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -83,8 +83,8 @@ static void _bt_checksplitloc(FindSplitData *state,
 				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey);
+static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key,
+			Page page, OffsetNumber offnum);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
@@ -110,18 +110,14 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel)
 {
 	bool		is_unique = false;
-	int			indnkeyatts;
-	ScanKey		itup_scankey;
+	BTScanInsert itup_key;
 	BTStack		stack = NULL;
 	Buffer		buf;
-	OffsetNumber offset;
 	bool		fastpath;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	Assert(indnkeyatts != 0);
+	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);
 
 	/* we need an insertion scan key to do our search, so build one */
-	itup_scankey = _bt_mkscankey(rel, itup);
+	itup_key = _bt_mkscankey(rel, itup);
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -144,7 +140,6 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	 */
 top:
 	fastpath = false;
-	offset = InvalidOffsetNumber;
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
 		Size		itemsz;
@@ -179,8 +174,7 @@ top:
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, itup_key, page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -219,8 +213,7 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, itup_key, &buf, BT_WRITE, NULL);
 	}
 
 	/*
@@ -244,13 +237,12 @@ top:
 	 * let the tuple in and return false for possibly non-unique, or true for
 	 * definitely unique.
 	 */
-	if (checkUnique != UNIQUE_CHECK_NO)
+	if (checkingunique)
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
-		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
+		xwait = _bt_check_unique(rel, itup_key, itup, heapRel, buf,
 								 checkUnique, &is_unique, &speculativeToken);
 
 		if (TransactionIdIsValid(xwait))
@@ -277,6 +269,8 @@ top:
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		OffsetNumber newitemoff;
+
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -287,10 +281,17 @@ top:
 		 * attributes are not considered part of the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
-		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+
+		/*
+		 * Do the insertion.  Note that itup_key contains state filled in by
+		 * _bt_check_unique to help _bt_findinsertloc avoid repeating its
+		 * binary search.  The !checkingunique case must start its own binary
+		 * search.
+		 */
+		newitemoff = _bt_findinsertloc(rel, itup_key, &buf, checkingunique,
+									   itup, stack, heapRel);
+		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, newitemoff,
+					   false);
 	}
 	else
 	{
@@ -301,7 +302,7 @@ top:
 	/* be tidy */
 	if (stack)
 		_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
+	pfree(itup_key);
 
 	return is_unique;
 }
@@ -309,10 +310,6 @@ top:
 /*
  *	_bt_check_unique() -- Check for violation of unique index constraint
  *
- * offset points to the first possible item that could conflict. It can
- * also point to end-of-page, which means that the first tuple to check
- * is the first tuple on the next page.
- *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
  * conflict is detected, no return --- just ereport().  If an xact ID is
@@ -324,16 +321,20 @@ top:
  * InvalidTransactionId because we don't want to wait.  In this case we
  * set *is_unique to false if there is a potential conflict, and the
  * core code must redo the uniqueness check later.
+ *
+ * As a side-effect, sets state in itup_key that can later be used by
+ * _bt_findinsertloc() to reuse most of the binary search work we do
+ * here.
  */
 static TransactionId
-_bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
-				 Buffer buf, OffsetNumber offset, ScanKey itup_scankey,
+_bt_check_unique(Relation rel, BTScanInsert itup_key,
+				 IndexTuple itup, Relation heapRel, Buffer buf,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	SnapshotData SnapshotDirty;
+	OffsetNumber offset;
 	OffsetNumber maxoff;
 	Page		page;
 	BTPageOpaque opaque;
@@ -349,9 +350,17 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
+	/*
+	 * Save binary search bounds.  We use them in the fastpath below, but also
+	 * in the _bt_findinsertloc() call later.
+	 */
+	itup_key->savebinsrch = true;
+	offset = _bt_binsrch(rel, itup_key, buf);
+
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
+	Assert(itup_key->low == offset);
 	for (;;)
 	{
 		ItemId		curitemid;
@@ -364,21 +373,39 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 		 */
 		if (offset <= maxoff)
 		{
+			/*
+			 * Fastpath: In most cases, we can use _bt_binsrch search bounds
+			 * to limit our consideration to items that are definitely
+			 * duplicates.  This fastpath doesn't apply when the original
+			 * page is empty, or when the initial offset is past the end of
+			 * the original page, which may indicate that we need to examine
+			 * a second or subsequent page.
+			 *
+			 * Note that this optimization avoids calling _bt_isequal()
+			 * entirely when there are no duplicates, as long as the location
+			 * where the key would belong is not at the end of the page.
+			 */
+			if (nbuf == InvalidBuffer && offset == itup_key->stricthigh)
+			{
+				Assert(itup_key->low >= P_FIRSTDATAKEY(opaque));
+				Assert(itup_key->low <= itup_key->stricthigh);
+				Assert(!_bt_isequal(itupdesc, itup_key, page, offset));
+				break;
+			}
+
 			curitemid = PageGetItemId(page, offset);
 
 			/*
 			 * We can skip items that are marked killed.
 			 *
-			 * Formerly, we applied _bt_isequal() before checking the kill
-			 * flag, so as to fall out of the item loop as soon as possible.
-			 * However, in the presence of heavy update activity an index may
-			 * contain many killed items with the same key; running
-			 * _bt_isequal() on each killed item gets expensive. Furthermore
-			 * it is likely that the non-killed version of each key appears
-			 * first, so that we didn't actually get to exit any sooner
-			 * anyway. So now we just advance over killed items as quickly as
-			 * we can. We only apply _bt_isequal() when we get to a non-killed
-			 * item or the end of the page.
+			 * In the presence of heavy update activity an index may contain
+			 * many killed items with the same key; running _bt_isequal() on
+			 * each killed item gets expensive.  Just advance over killed
+			 * items as quickly as we can.  We only apply _bt_isequal() when
+			 * we get to a non-killed item.  Even those comparisons could be
+			 * avoided (in the common case where there is only one page to
+			 * visit) by reusing bounds, but just skipping dead items is
+			 * sufficiently effective.
 			 */
 			if (!ItemIdIsDead(curitemid))
 			{
@@ -391,7 +418,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 				 * in real comparison, but only for ordering/finding items on
 				 * pages. - vadim 03/24/97
 				 */
-				if (!_bt_isequal(itupdesc, page, offset, indnkeyatts, itup_scankey))
+				if (!_bt_isequal(itupdesc, itup_key, page, offset))
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
@@ -552,11 +579,14 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
+			int			highkeycmp;
+
 			/* If scankey == hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+			Assert(highkeycmp <= 0);
+			if (highkeycmp != 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -611,39 +641,40 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
  *		Once we have chosen the page to put the key on, we'll insert it before
  *		any existing equal keys because of the way _bt_binsrch() works.
  *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
+ *		_bt_check_unique() saves the progress of the binary search it
+ *		performs in the insertion scan key.  In the common case that there
+ *		were no duplicates, we don't need to do any additional binary search
+ *		comparisons here.  Occasionally we may still be unable to reuse the
+ *		saved state, such as when microvacuuming ran or we had to move right.
  *
- *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
+ *		On entry, *bufptr points to the first legal page where the new tuple
+ *		could be inserted.  The caller must hold an exclusive lock on *bufptr.
+ *
+ *		On exit, *bufptr points to the chosen insertion page, and the offset
+ *		within that page is returned.  If _bt_findinsertloc decides to move
+ *		right, the lock and pin on the original page are released, and the new
  *		page returned to the caller is exclusively locked instead.
  *
- *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.  It is convenient to make it happen here, since microvacuuming
+ *		may invalidate a _bt_check_unique() caller's cached binary search
+ *		work.
  */
-static void
+static OffsetNumber
 _bt_findinsertloc(Relation rel,
+				  BTScanInsert itup_key,
 				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
+				  bool checkingunique,
 				  IndexTuple newtup,
 				  BTStack stack,
 				  Relation heapRel)
 {
 	Buffer		buf = *bufptr;
 	Page		page = BufferGetPage(buf);
+	bool		restorebinsrch = checkingunique;
 	Size		itemsz;
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
 	OffsetNumber newitemoff;
-	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -672,55 +703,40 @@ _bt_findinsertloc(Relation rel,
 				 errtableconstraint(heapRel,
 									RelationGetRelationName(rel))));
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
-	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	for (;;)
 	{
+		int			cmpval;
 		Buffer		rbuf;
 		BlockNumber rblkno;
 
 		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
+		 * An earlier _bt_check_unique() call may well have established bounds
+		 * that we can use to skip the high key check for checkingunique
+		 * callers.  This fastpath cannot be used when there are no items on
+		 * the existing page (other than the high key), or when it looks like the
+		 * new item belongs last on the page, but it might go on a later page
+		 * instead.
 		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
-		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
-
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
-		}
+		if (restorebinsrch && itup_key->low <= itup_key->stricthigh &&
+			itup_key->stricthigh <= PageGetMaxOffsetNumber(page))
+			break;
 
 		/*
-		 * nope, so check conditions (b) and (c) enumerated above
+		 * If this is the last page that the tuple can legally go on, stop
+		 * here
 		 */
-		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+		if (P_RIGHTMOST(lpageop))
+			break;
+		cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
+
+		/*
+		 * We may have to handle the case where there is a choice of which
+		 * page to place the new tuple on, and we must balance space
+		 * utilization as best we can.
+		 */
+		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, buf,
+												&restorebinsrch, itemsz))
 			break;
 
 		/*
@@ -763,27 +779,104 @@ _bt_findinsertloc(Relation rel,
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		restorebinsrch = false;
+	}
+
+	/* Loop should not break until correct page located */
+	Assert(P_RIGHTMOST(lpageop) ||
+		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+
+	/*
+	 * If the page we're about to insert to doesn't have enough room for the
+	 * new tuple, we will have to split it.  If it looks like the page has
+	 * LP_DEAD items, try to remove them, in hope of making room for the new
+	 * item and avoiding the split.
+	 */
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < itemsz)
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		restorebinsrch = false;
 	}
 
 	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
+	 * Reuse the binary search bounds established within _bt_check_unique if
+	 * this is a checkingunique caller and the first leaf page it locked is
+	 * still locked, since that is the page the caller will insert onto
 	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
-	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	itup_key->restorebinsrch = restorebinsrch;
+	newitemoff = _bt_binsrch(rel, itup_key, buf);
+	Assert(!itup_key->restorebinsrch);
+	Assert(!restorebinsrch || newitemoff == _bt_binsrch(rel, itup_key, buf));
 
 	*bufptr = buf;
-	*offsetptr = newitemoff;
+	return newitemoff;
+}
+
+/*
+ *	_bt_useduplicatepage() -- Settle for this page of duplicates?
+ *
+ *		This function decides whether an insertion of a duplicate into an
+ *		index should go on the page contained in buf, in cases where a
+ *		choice must be made.
+ *
+ *		If the current page doesn't have enough free space for the new tuple
+ *		we "microvacuum" the page, removing LP_DEAD items, in the hope that it
+ *		will make enough room.
+ *
+ *		Returns true if caller should proceed with insert on buf's page.
+ *		Otherwise, caller should move on to the page to the right.
+ */
+static bool
+_bt_useduplicatepage(Relation rel, Relation heapRel, Buffer buf,
+					 bool *restorebinsrch, Size itemsz)
+{
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque lpageop;
+
+	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	Assert(P_ISLEAF(lpageop) && !P_RIGHTMOST(lpageop));
+
+	/* Easy case -- there is space free on this page already */
+	if (PageGetFreeSpace(page) >= itemsz)
+		return true;
+
+	/*
+	 * Before considering moving right, see if we can obtain enough space by
+	 * erasing LP_DEAD items
+	 */
+	if (P_HAS_GARBAGE(lpageop))
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+
+		*restorebinsrch = false;
+		if (PageGetFreeSpace(page) >= itemsz)
+			return true;		/* OK, now we have enough space */
+	}
+
+	/*----------
+	 * It's now clear that the _bt_findinsertloc() caller will need to split
+	 * the page if it is to insert the new item onto it.  The choice to move
+	 * right to the next page remains open to it, but we should not search
+	 * for free space exhaustively when there are many pages to look through.
+	 *
+	 *	_bt_findinsertloc() keeps scanning right until it:
+	 *		(a) reaches the last page where the tuple can legally go
+	 *	Or until we:
+	 *		(b) find a page with enough free space, or
+	 *		(c) get tired of searching.
+	 * (c) is not flippant; it is important because if there are many
+	 * pages' worth of equal keys, it's better to split one of the early
+	 * pages than to scan all the way to the end of the run of equal keys
+	 * on every insert.  We implement "get tired" as a random choice,
+	 * since stopping after scanning a fixed number of pages wouldn't work
+	 * well (we'd never reach the right-hand side of previously split
+	 * pages).  The probability of moving right is set at 0.99, which may
+	 * seem too high to change the behavior much, but it does an excellent
+	 * job of preventing O(N^2) behavior with many equal keys.
+	 *----------
+	 */
+	return random() <= (MAX_RANDOM_VALUE / 100);
 }
 
 /*----------
@@ -2310,24 +2403,21 @@ _bt_pgaddtup(Page page,
  * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
-_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey)
+_bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
+			OffsetNumber offnum)
 {
 	IndexTuple	itup;
+	ScanKey		scankey;
 	int			i;
 
-	/* Better be comparing to a leaf item */
+	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
 
+	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
-	for (i = 1; i <= keysz; i++)
+	for (i = 1; i <= itup_key->keysz; i++)
 	{
 		AttrNumber	attno;
 		Datum		datum;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c785bca95..56041c3d38 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1371,7 +1371,7 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 */
 			if (!stack)
 			{
-				ScanKey		itup_scankey;
+				BTScanInsert itup_key;
 				ItemId		itemid;
 				IndexTuple	targetkey;
 				Buffer		lbuf;
@@ -1421,12 +1421,10 @@ _bt_pagedel(Relation rel, Buffer buf)
 				}
 
 				/* we need an insertion scan key for the search, so build one */
-				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
-				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
-				/* don't need a pin on the page */
+				itup_key = _bt_mkscankey(rel, targetkey);
+				/* get stack to leaf page by searching index */
+				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
+				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
 				/*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 92832237a8..30be88eb82 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -71,13 +71,9 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
  * but it can omit the rightmost column(s) of the index.
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * Return value is a stack of parent-page pointers.  *bufP is set to the
  * address of the leaf-page buffer, which is read-locked and pinned.
  * No locks are held on the parent pages, however!
@@ -93,8 +89,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+		   Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -130,8 +126,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  (access == BT_WRITE), stack_in,
+		*bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
 		/* if this is a leaf page, we're done */
@@ -144,7 +139,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -198,8 +193,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  true, stack_in, BT_WRITE, snapshot);
+		*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+							  snapshot);
 	}
 
 	return stack_in;
@@ -215,16 +210,17 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * or strictly to the right of it.
  *
  * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page.  If that entry
- * is strictly less than the scankey, or <= the scankey in the nextkey=true
- * case, then we followed the wrong link and we need to move right.
+ * tree by examining the high key entry on the page.  If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index (see nbtree/README).
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key.  When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
  *
  * If forupdate is true, we will attempt to finish any incomplete splits
  * that we encounter.  This is required when locking a target page for an
@@ -241,10 +237,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  */
 Buffer
 _bt_moveright(Relation rel,
+			  BTScanInsert key,
 			  Buffer buf,
-			  int keysz,
-			  ScanKey scankey,
-			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
 			  int access,
@@ -269,7 +263,7 @@ _bt_moveright(Relation rel,
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
 	 */
-	cmpval = nextkey ? 0 : 1;
+	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
@@ -304,7 +298,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -324,13 +318,6 @@ _bt_moveright(Relation rel,
 /*
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
  * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
  * particular, this means it is possible to return a value 1 greater than the
@@ -347,26 +334,45 @@ _bt_moveright(Relation rel,
  * This procedure is not responsible for walking right, it just examines
  * the given page.  _bt_binsrch() has no lock or refcount side effects
  * on the buffer.
+ *
+ * When key.savebinsrch is set, modifies mutable fields of the insertion
+ * scan key, so that a subsequent call where caller sets key.restorebinsrch
+ * can reuse the low and strict high bounds of that binary search.  Callers
+ * that use these fields directly must be prepared for the case where
+ * stricthigh isn't on the same page (it exceeds maxoff for the page), and
+ * the case where there are no items on the page (high < low).
  */
 OffsetNumber
 _bt_binsrch(Relation rel,
-			Buffer buf,
-			int keysz,
-			ScanKey scankey,
-			bool nextkey)
+			BTScanInsert key,
+			Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
 	OffsetNumber low,
-				high;
+				high,
+				stricthigh;
 	int32		result,
 				cmpval;
+	bool		isleaf;
 
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	low = P_FIRSTDATAKEY(opaque);
-	high = PageGetMaxOffsetNumber(page);
+	if (!key->restorebinsrch)
+	{
+		low = P_FIRSTDATAKEY(opaque);
+		high = PageGetMaxOffsetNumber(page);
+		isleaf = P_ISLEAF(opaque);
+	}
+	else
+	{
+		/* Restore result of previous binary search against same page */
+		Assert(P_ISLEAF(opaque));
+		low = key->low;
+		high = key->stricthigh;
+		isleaf = true;
+	}
 
 	/*
 	 * If there are no keys on the page, return the first available slot. Note
@@ -375,8 +381,19 @@ _bt_binsrch(Relation rel,
 	 * This can never happen on an internal page, however, since they are
 	 * never empty (an internal page must have children).
 	 */
-	if (high < low)
+	if (unlikely(high < low))
+	{
+		if (key->savebinsrch)
+		{
+			Assert(isleaf);
+			/* Caller can't use stricthigh */
+			key->low = low;
+			key->stricthigh = high;
+		}
+		key->savebinsrch = false;
+		key->restorebinsrch = false;
 		return low;
+	}
 
 	/*
 	 * Binary search to find the first key on the page >= scan key, or first
@@ -390,9 +407,12 @@ _bt_binsrch(Relation rel,
 	 *
 	 * We can fall out when high == low.
 	 */
-	high++;						/* establish the loop invariant for high */
+	if (!key->restorebinsrch)
+		high++;					/* establish the loop invariant for high */
+	key->restorebinsrch = false;
+	stricthigh = high;			/* high initially strictly higher */
 
-	cmpval = nextkey ? 0 : 1;	/* select comparison value */
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
 
 	while (high > low)
 	{
@@ -400,12 +420,21 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
 		else
+		{
 			high = mid;
+
+			/*
+			 * high can only be reused by more restrictive binary search when
+			 * it's known to be strictly greater than the original scankey
+			 */
+			if (result != 0)
+				stricthigh = high;
+		}
 	}
 
 	/*
@@ -415,7 +444,14 @@ _bt_binsrch(Relation rel,
 	 * On a leaf page, we always return the first key >= scan key (resp. >
 	 * scan key), which could be the last slot + 1.
 	 */
-	if (P_ISLEAF(opaque))
+	if (key->savebinsrch)
+	{
+		Assert(isleaf);
+		key->low = low;
+		key->stricthigh = stricthigh;
+		key->savebinsrch = false;
+	}
+	if (isleaf)
 		return low;
 
 	/*
@@ -428,13 +464,8 @@ _bt_binsrch(Relation rel,
 }
 
 /*----------
- *	_bt_compare() -- Compare scankey to a particular tuple on the page.
+ *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- *	keysz: number of key conditions to be checked (might be less than the
- *		number of index columns!)
  *	page/offnum: location of btree item to be compared to.
  *
  *		This routine returns:
@@ -455,17 +486,17 @@ _bt_binsrch(Relation rel,
  */
 int32
 _bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
+			BTScanInsert key,
 			Page page,
 			OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
-	int			i;
+	ScanKey		scankey;
 
 	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
@@ -488,7 +519,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	scankey = key->scankeys;
+	for (int i = 1; i <= key->keysz; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -574,8 +606,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat;
 	bool		nextkey;
 	bool		goback;
+	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
-	ScanKeyData scankeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -821,8 +853,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
 	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built in the local
-	 * scankeys[] array, using the keys identified by startKeys[].
+	 * identified above.  The insertion scankey is built using the keys
+	 * identified by startKeys[].  (Remaining insertion scankey fields are
+	 * initialized after initial-positioning strategy is finalized.)
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
 	for (i = 0; i < keysCount; i++)
@@ -850,7 +883,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				_bt_parallel_done(scan);
 				return false;
 			}
-			memcpy(scankeys + i, subkey, sizeof(ScanKeyData));
+			memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
 
 			/*
 			 * If the row comparison is the last positioning key we accepted,
@@ -882,7 +915,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					if (subkey->sk_flags & SK_ISNULL)
 						break;	/* can't use null keys */
 					Assert(keysCount < INDEX_MAX_KEYS);
-					memcpy(scankeys + keysCount, subkey, sizeof(ScanKeyData));
+					memcpy(inskey.scankeys + keysCount, subkey,
+						   sizeof(ScanKeyData));
 					keysCount++;
 					if (subkey->sk_flags & SK_ROW_END)
 					{
@@ -928,7 +962,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				FmgrInfo   *procinfo;
 
 				procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
-				ScanKeyEntryInitializeWithInfo(scankeys + i,
+				ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
 											   cur->sk_flags,
 											   cur->sk_attno,
 											   InvalidStrategy,
@@ -949,7 +983,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
 						 BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
 						 cur->sk_attno, RelationGetRelationName(rel));
-				ScanKeyEntryInitialize(scankeys + i,
+				ScanKeyEntryInitialize(inskey.scankeys + i,
 									   cur->sk_flags,
 									   cur->sk_attno,
 									   InvalidStrategy,
@@ -1052,12 +1086,17 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
+	/* Initialize remaining insertion scan key fields */
+	inskey.savebinsrch = inskey.restorebinsrch = false;
+	inskey.low = inskey.stricthigh = InvalidOffsetNumber;
+	inskey.nextkey = nextkey;
+	inskey.keysz = keysCount;
+
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
@@ -1086,7 +1125,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, &inskey, buf);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e1186..759859c302 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -254,6 +254,7 @@ typedef struct BTWriteState
 {
 	Relation	heap;
 	Relation	index;
+	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
@@ -531,6 +532,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
+	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 
 	/*
 	 * We need to log index creation in WAL iff WAL archiving/streaming is
@@ -1076,7 +1078,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
-	ScanKey		indexScanKey = NULL;
 	SortSupport sortKeys;
 
 	if (merge)
@@ -1089,7 +1090,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		/* the preparation of merge */
 		itup = tuplesort_getindextuple(btspool->sortstate, true);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-		indexScanKey = _bt_mkscankey_nodata(wstate->index);
 
 		/* Prepare SortSupport data for each column */
 		sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
@@ -1097,7 +1097,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		for (i = 0; i < keysz; i++)
 		{
 			SortSupport sortKey = sortKeys + i;
-			ScanKey		scanKey = indexScanKey + i;
+			ScanKey		scanKey = wstate->inskey->scankeys + i;
 			int16		strategy;
 
 			sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1116,8 +1116,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
 		}
 
-		_bt_freeskey(indexScanKey);
-
 		for (;;)
 		{
 			load1 = true;		/* load BTSpool next ? */
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 2c05fb5e45..acd122aa53 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -56,34 +56,39 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().
+ *		Result is intended for use with _bt_compare().  Callers that don't
+ *		need to fill out the insertion scankey arguments (e.g. they use their own
+ *		ad-hoc comparison routine) can pass a NULL index tuple.
  */
-ScanKey
+BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
+	BTScanInsert key;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
-	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
+	int			tupnatts;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
-	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
+	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
 	 * (INCLUDE index) columns are always omitted from scan keys.
 	 */
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
+	key = palloc(offsetof(BTScanInsertData, scankeys) +
+				 sizeof(ScanKeyData) * indnkeyatts);
+	key->savebinsrch = key->restorebinsrch = false;
+	key->low = key->stricthigh = InvalidOffsetNumber;
+	key->nextkey = false;
+	key->keysz = Min(indnkeyatts, tupnatts);
+	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
 		FmgrInfo   *procinfo;
@@ -96,7 +101,19 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Key arguments built when caller provides no tuple are
+		 * defensively represented as NULL values.  They should never be
+		 * used.
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -108,64 +125,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 									   arg);
 	}
 
-	return skey;
-}
-
-/*
- * _bt_mkscankey_nodata
- *		Build an insertion scan key that contains 3-way comparator routines
- *		appropriate to the key datatypes, but no comparison data.  The
- *		comparison data ultimately used must match the key datatypes.
- *
- *		The result cannot be used with _bt_compare(), unless comparison
- *		data is first stored into the key entries.  Currently this
- *		routine is only called by nbtsort.c and tuplesort.c, which have
- *		their own comparison routines.
- */
-ScanKey
-_bt_mkscankey_nodata(Relation rel)
-{
-	ScanKey		skey;
-	int			indnkeyatts;
-	int16	   *indoption;
-	int			i;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	indoption = rel->rd_indoption;
-
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
-	for (i = 0; i < indnkeyatts; i++)
-	{
-		FmgrInfo   *procinfo;
-		int			flags;
-
-		/*
-		 * We can use the cached (default) support procs since no cross-type
-		 * comparison can be needed.
-		 */
-		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		flags = SK_ISNULL | (indoption[i] << SK_BT_INDOPTION_SHIFT);
-		ScanKeyEntryInitializeWithInfo(&skey[i],
-									   flags,
-									   (AttrNumber) (i + 1),
-									   InvalidStrategy,
-									   InvalidOid,
-									   rel->rd_indcollation[i],
-									   procinfo,
-									   (Datum) 0);
-	}
-
-	return skey;
-}
-
-/*
- * free a scan key made by either _bt_mkscankey or _bt_mkscankey_nodata.
- */
-void
-_bt_freeskey(ScanKey skey)
-{
-	pfree(skey);
+	return key;
 }
 
 /*
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7b10fd2974..f97a82ae7b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -884,7 +884,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -919,7 +919,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	if (state->indexInfo->ii_Expressions != NULL)
 	{
@@ -945,7 +945,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -964,7 +964,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -981,7 +981,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -1014,7 +1014,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->indexRel = indexRel;
 	state->enforceUnique = enforceUnique;
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	/* Prepare SortSupport data for each column */
 	state->sortKeys = (SortSupport) palloc0(state->nKeys *
@@ -1023,7 +1023,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1042,7 +1042,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 60622ea790..7870f282d2 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -319,6 +319,47 @@ typedef struct BTStackData
 
 typedef BTStackData *BTStack;
 
+/*
+ * BTScanInsert is the btree-private state needed to find an initial position
+ * for an indexscan, or to insert new tuples -- an "insertion scankey" (not to
+ * be confused with a search scankey).  It's used to descend a B-Tree using
+ * _bt_search.
+ *
+ * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
+ * locate the first item >= scankey.  When nextkey is true, they will locate
+ * the first item > scan key.
+ *
+ * keysz is the number of insertion scankeys present.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared.
+ * During insertion, there must be a scan key for every attribute, but when
+ * starting a regular index scan some can be omitted.  The array is used as a
+ * flexible array member, though it's sized in a way that makes it possible to
+ * use stack allocations.  See nbtree/README for full details.
+ */
+
+typedef struct BTScanInsertData
+{
+	/*
+	 * Mutable state used by _bt_binsrch to inexpensively repeat a binary
+	 * search on the leaf level.  Only used for insertions where
+	 * _bt_check_unique is called.  See _bt_binsrch and _bt_findinsertloc for
+	 * details.
+	 */
+	bool		savebinsrch;
+	bool		restorebinsrch;
+	OffsetNumber low;
+	OffsetNumber stricthigh;
+
+	/* State used to locate a position at the leaf level */
+	bool		nextkey;
+	int			keysz;			/* Size of scankeys */
+	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
+} BTScanInsertData;
+
+typedef BTScanInsertData *BTScanInsert;
+
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -558,16 +599,12 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
 /*
  * prototypes for functions in nbtsearch.c
  */
-extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
-extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
+		   int access, Snapshot snapshot);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -576,9 +613,7 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 /*
  * prototypes for functions in nbtutils.c
  */
-extern ScanKey _bt_mkscankey(Relation rel, IndexTuple itup);
-extern ScanKey _bt_mkscankey_nodata(Relation rel);
-extern void _bt_freeskey(ScanKey skey);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-- 
2.17.1
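
[Aside, not part of the patch: the savebinsrch/restorebinsrch handshake above is spread across _bt_check_unique() and _bt_findinsertloc().  The following condensed sketch shows the same protocol in one hypothetical function, assuming a leaf page that stays locked between the two searches; only the _bt_binsrch() behavior is taken from the patch.]

/*
 * Hypothetical condensed caller, for illustration only.  The real flow is
 * split across _bt_check_unique() and _bt_findinsertloc(); see those
 * functions in the patch above.
 */
#include "postgres.h"
#include "access/nbtree.h"

static OffsetNumber
reuse_leaf_binsrch_bounds(Relation rel, BTScanInsert itup_key, Buffer buf)
{
	OffsetNumber firstdup;
	OffsetNumber newitemoff;

	/* First search: ask _bt_binsrch() to remember low/stricthigh bounds */
	itup_key->savebinsrch = true;
	firstdup = _bt_binsrch(rel, itup_key, buf);	/* consumes savebinsrch */

	/*
	 * A duplicate scan in the style of _bt_check_unique() would start at
	 * firstdup; itup_key->low and itup_key->stricthigh now bracket the
	 * equal items on this leaf page.
	 */
	(void) firstdup;			/* sketch only; real code scans duplicates */

	/*
	 * Second search against the same, still-locked leaf page: restore the
	 * saved bounds instead of searching the whole page again.  _bt_binsrch()
	 * consumes the flag, so any later call reverts to a full-page search.
	 */
	itup_key->restorebinsrch = true;
	newitemoff = _bt_binsrch(rel, itup_key, buf);

	return newitemoff;
}
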

Attachment: v16-0005-Add-split-after-new-tuple-optimization.patch (application/octet-stream)
From 763d44d83a6206c4a8da6f835abb29fd913e639d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 16:48:08 -0700
Subject: [PATCH v16 5/7] Add "split after new tuple" optimization.

Add additional heuristics to the algorithm for locating an optimal split
location.  New logic identifies localized monotonically increasing
values in indexes with multiple columns.  When this insertion pattern is
detected, the page is split just after the new item that provoked the page
split (or leaf fillfactor is applied, in the style of a rightmost page split).
The optimization is very similar to the long-established fillfactor
optimization used during rightmost page splits, where leaf pages are
similarly left about 90% full.  It should be considered a variation of
the same optimization.

50/50 page splits are only appropriate with a pattern of truly random
insertions, where the average space utilization ends up at 65% - 70%.
Without this patch, affected cases have leaf pages that are no more than
about 50% full on average.  Future insertions can never make use of the
free space left behind.  With this patch, affected cases have leaf pages
that are about 90% full on average (with a fillfactor of 90).  Localized
monotonically increasing insertion patterns are presumed to be fairly
common in real-world applications.

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.
---
 src/backend/access/nbtree/nbtsplitloc.c | 235 +++++++++++++++++++++++-
 1 file changed, 232 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index ead218d916..bc7ecb986d 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -70,6 +70,9 @@ static void _bt_recsplitloc(FindSplitData *state,
 static void _bt_deltasortsplits(FindSplitData *state, double fillfactormult,
 					bool usemult);
 static int	_bt_splitcmp(const void *arg1, const void *arg2);
+static bool _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
+					int leaffillfactor, bool *usemult);
+static bool _bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid);
 static OffsetNumber _bt_bestsplitloc(FindSplitData *state, int perfectpenalty,
 				 bool *newitemonleft);
 static int _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
@@ -249,9 +252,10 @@ _bt_findsplitloc(Relation rel,
 	 * Start search for a split point among list of legal split points.  Give
 	 * primary consideration to equalizing available free space in each half
 	 * of the split initially (start with default strategy), while applying
-	 * rightmost where appropriate.  Either of the two other fallback
-	 * strategies may be required for cases with a large number of duplicates
-	 * around the original/space-optimal split point.
+	 * rightmost and split-after-new-item optimizations where appropriate.
+	 * Either of the two other fallback strategies may be required for cases
+	 * with a large number of duplicates around the original/space-optimal
+	 * split point.
 	 *
 	 * Default strategy gives some weight to suffix truncation in deciding a
 	 * split point on leaf pages.  It attempts to select a split point where a
@@ -273,6 +277,44 @@ _bt_findsplitloc(Relation rel,
 		usemult = true;
 		fillfactormult = leaffillfactor / 100.0;
 	}
+	else if (_bt_afternewitemoff(&state, maxoff, leaffillfactor, &usemult))
+	{
+		/*
+		 * New item inserted at rightmost point among a localized grouping on
+		 * a leaf page -- apply "split after new item" optimization, either by
+		 * applying leaf fillfactor multiplier, or by choosing the exact split
+		 * point that leaves the new item as last on the left. (usemult is set
+		 * for us.)
+		 */
+		if (usemult)
+		{
+			/* fillfactormult should be set based on leaf fillfactor */
+			fillfactormult = leaffillfactor / 100.0;
+		}
+		else
+		{
+			/* find precise split point after newitemoff */
+			for (int i = 0; i < state.nsplits; i++)
+			{
+				SplitPoint *split = state.splits + i;
+
+				if (split->newitemonleft &&
+					newitemoff == split->firstoldonright)
+				{
+					pfree(state.splits);
+					*newitemonleft = true;
+					return newitemoff;
+				}
+			}
+
+			/*
+			 * Cannot legally split after newitemoff; proceed with split
+			 * without using fillfactor multiplier.  This is defensive, and
+			 * should never be needed in practice.
+			 */
+			fillfactormult = 0.50;
+		}
+	}
 	else
 	{
 		/* Other leaf page.  50:50 page split. */
@@ -519,6 +561,193 @@ _bt_splitcmp(const void *arg1, const void *arg2)
 	return 0;
 }
 
+/*
+ * Subroutine to determine whether or not the page should be split immediately
+ * after the would-be original page offset for the new/incoming tuple.  This
+ * is appropriate when there is a pattern of localized monotonically
+ * increasing insertions into a composite index, grouped by one or more
+ * leading attribute values.  This is prevalent in many real world
+ * applications.  Consider the example of a composite index on '(invoice_id,
+ * item_no)', where the item_no for each invoice is an identifier assigned in
+ * ascending order (invoice_id could itself be assigned in monotonically
+ * increasing order, but that shouldn't matter).  Without this optimization,
+ * approximately 50% of space in leaf pages will be wasted by 50:50/!usemult
+ * page splits.  With this optimization, space utilization will be close to
+ * that of a similar index where all tuple insertions modify the current
+ * rightmost leaf page in the index (i.e. typically 90% for leaf pages).
+ *
+ * When the optimization is applied, the new/incoming tuple becomes the last
+ * tuple on the new left page.  (Actually, newitemoff > maxoff cases often use
+ * this optimization within indexes where monotonically increasing insertions
+ * of each grouping come in multiple "bursts" over time, such as a composite
+ * index on '(supplier_id, invoice_id, item_no)'.  Caller applies leaf
+ * fillfactor in the style of a rightmost leaf page split when newitemoff is
+ * at or very near the end of the original page.)
+ *
+ * This optimization may leave extra free space remaining on the rightmost
+ * page of a "most significant column" grouping of tuples if that grouping
+ * never ends up having future insertions that use the free space.  That
+ * effect is self-limiting; a future grouping that becomes the "nearest on the
+ * right" grouping of the affected grouping usually puts the extra free space
+ * to good use.  In general, it's important to avoid a pattern of pathological
+ * page splits that consistently do the wrong thing.
+ *
+ * Caller uses the optimization when this routine returns true, though the
+ * exact action taken by caller varies.  Caller uses the original leaf page
+ * fillfactor in the standard way rather than using the new item offset
+ * directly when *usemult was also set to true here.  Otherwise, caller
+ * applies the optimization by locating the legal split point that makes the
+ * new tuple the very last tuple on the left side of the split.
+ */
+static bool
+_bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
+					int leaffillfactor, bool *usemult)
+{
+	int16		nkeyatts;
+	ItemId		itemid;
+	IndexTuple	tup;
+	int			keepnatts;
+
+	Assert(state->is_leaf && !state->is_rightmost);
+
+	nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
+
+	/* Assume leaffillfactor will be used by caller for now */
+	*usemult = true;
+
+	/* Single key indexes not considered here */
+	if (nkeyatts == 1)
+		return false;
+
+	/* Ascending insertion pattern never inferred when new item is first */
+	if (state->newitemoff == P_FIRSTKEY)
+		return false;
+
+	/*
+	 * Only apply optimization on pages with equisized tuples, since ordinal
+	 * keys are likely to be fixed-width.  Testing if the new tuple is
+	 * variable width directly might also work, but that fails to apply the
+	 * optimization to indexes with a numeric_ops attribute.
+	 *
+	 * Conclude that page has equisized tuples when the new item is the same
+	 * width as the smallest item observed during pass over page, and other
+	 * non-pivot tuples must be the same width as well.  (Note that the
+	 * possibly-truncated existing high key isn't counted in
+	 * olddataitemstotal, and must be subtracted from maxoff.)
+	 */
+	if (state->newitemsz != state->minfirstrightsz)
+		return false;
+	if (state->newitemsz * (maxoff - 1) != state->olddataitemstotal)
+		return false;
+
+	/*
+	 * Avoid applying optimization when tuples are wider than a tuple
+	 * consisting of two non-NULL int8/int64 attributes (or four non-NULL
+	 * int4/int32 attributes)
+	 */
+	if (state->newitemsz >
+		MAXALIGN(sizeof(IndexTupleData) + sizeof(int64) * 2) +
+		sizeof(ItemIdData))
+		return false;
+
+	/*
+	 * At least the first attribute's value must be equal to the corresponding
+	 * value in previous tuple to apply optimization.  New item cannot be a
+	 * duplicate, either.
+	 *
+	 * Handle case where new item is to the right of all items on the existing
+	 * page.  This is suggestive of monotonically increasing insertions in
+	 * itself, so the "heap TID adjacency" test is not applied here.
+	 * Concurrent insertions from backends associated with the same grouping
+	 * or sub-grouping should still have the optimization applied; if the
+	 * grouping is rather large, splits will consistently end up here.
+	 */
+	if (state->newitemoff > maxoff)
+	{
+		itemid = PageGetItemId(state->page, maxoff);
+		tup = (IndexTuple) PageGetItem(state->page, itemid);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+
+		if (keepnatts > 1 && keepnatts <= nkeyatts)
+			return true;
+
+		return false;
+	}
+
+	/*
+	 * When item isn't last (or first) on page, but is deemed suitable for the
+	 * optimization, caller splits at the point immediately after the would-be
+	 * position of the new item, and immediately before the item after the new
+	 * item.
+	 *
+	 * "Low cardinality leading column, high cardinality suffix column"
+	 * indexes with a random insertion pattern (e.g. an index with a boolean
+	 * column, such as an index on '(book_is_in_print, book_isbn)') present us
+	 * with a risk of consistently misapplying the optimization.  We're
+	 * willing to accept very occasional misapplication of the optimization,
+	 * provided the cases where we get it wrong are rare and self-limiting.
+	 * Heap TID adjacency strongly suggests that the item just to the left was
+	 * inserted very recently, which prevents most misfirings.  Besides, all
+	 * inappropriate cases triggered at this point will still split in the
+	 * middle of the page on average.
+	 */
+	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
+	tup = (IndexTuple) PageGetItem(state->page, itemid);
+	/* Do cheaper test first */
+	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+		return false;
+	/* Check same conditions as rightmost item case, too */
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+
+	/*
+	 * Don't allow caller to split after a new item when it will result in a
+	 * split point to the right of the point that a leaf fillfactor split
+	 * would use -- have caller apply leaf fillfactor instead.  There is no
+	 * advantage to being very aggressive in any case.  It may not be legal to
+	 * split very close to maxoff.
+	 */
+	if (keepnatts > 1 && keepnatts <= nkeyatts)
+	{
+		double		interp = (double) state->newitemoff / ((double) maxoff + 1);
+		double		leaffillfactormult = (double) leaffillfactor / 100.0;
+
+		if (interp <= leaffillfactormult)
+			*usemult = false;
+
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Subroutine for determining if two heap TIDs are "adjacent".
+ *
+ * Adjacent means that the high TID is very likely to have been inserted into
+ * the heap relation immediately after the low TID, probably by the same
+ * transaction.
+ */
+static bool
+_bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid)
+{
+	BlockNumber lowblk,
+				highblk;
+
+	lowblk = ItemPointerGetBlockNumber(lowhtid);
+	highblk = ItemPointerGetBlockNumber(highhtid);
+
+	/* Make optimistic assumption of adjacency when heap blocks match */
+	if (lowblk == highblk)
+		return true;
+
+	/* When heap block is one up, second offset should be FirstOffsetNumber */
+	if (lowblk + 1 == highblk &&
+		ItemPointerGetOffsetNumber(highhtid) == FirstOffsetNumber)
+		return true;
+
+	return false;
+}
+
 /*
  * Subroutine to find the "best" split point among an array of acceptable
  * candidate split points that split without there being an excessively high
-- 
2.17.1

v16-0006-Add-high-key-continuescan-optimization.patch (application/octet-stream)
From d05c2491a542926cb878f2ce08e8dec108123195 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 12 Nov 2018 13:11:21 -0800
Subject: [PATCH v16 6/7] Add high key "continuescan" optimization.

Teach B-Tree forward index scans to check the high key before moving to
the next page in the hopes of finding that it isn't actually necessary
to move to the next page.  We already opportunistically force a key
check of the last item on leaf pages, even when it's clear that it
cannot be returned to the scan due to being dead-to-all, for the same
reason.  Since forcing the last item to be key checked no longer makes
any difference in the case of forward scans, the existing extra key
check is now only used for backwards scans.  Like the existing check,
the new check won't always work out, but that seems like an acceptable
price to pay.

The new approach is more effective than just checking non-pivot tuples,
especially with composite indexes and non-unique indexes.  The high key
represents an upper bound on all values that can appear on the page,
which is often greater than whatever tuple happens to appear last at the
time of the check.  Also, suffix truncation's new logic for picking a
split point will often result in high keys that are relatively
dissimilar to the other (non-pivot) tuples on the page, and therefore
more likely to indicate that the scan need not proceed to the next page.

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.
---
 src/backend/access/nbtree/nbtsearch.c | 23 +++++++--
 src/backend/access/nbtree/nbtutils.c  | 70 +++++++++++++++++++--------
 2 files changed, 68 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 1e3c2f638c..9d5c9a9149 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1371,6 +1371,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
 	}
 
+	continuescan = true;		/* default assumption */
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1419,16 +1420,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 				_bt_saveitem(so, itemIndex, offnum, itup);
 				itemIndex++;
 			}
+			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
-			{
-				/* there can't be any more matches, so stop */
-				so->currPos.moreRight = false;
 				break;
-			}
 
 			offnum = OffsetNumberNext(offnum);
 		}
 
+		/*
+		 * Forward scans need not visit page to the right when high key
+		 * indicates no more matches will be found there.
+		 *
+		 * Checking the high key like this works out more often than you might
+		 * think.  Leaf page splits pick a split point between the two most
+		 * dissimilar tuples (this is weighed against the need to evenly share
+		 * free space).  Leaf pages with high key attribute values that can
+		 * only appear on non-pivot tuples on the right sibling page are
+		 * common.
+		 */
+		if (continuescan && !P_RIGHTMOST(opaque))
+			_bt_checkkeys(scan, page, P_HIKEY, dir, &continuescan);
+
+		if (!continuescan)
+			so->currPos.moreRight = false;
+
 		Assert(itemIndex <= MaxIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1b09ab8d6a..ece82b44f9 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -48,7 +48,7 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
 static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
-					 IndexTuple tuple, TupleDesc tupdesc,
+					 IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
 static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
 			   IndexTuple firstright, BTScanInsert itup_key);
@@ -1345,11 +1345,14 @@ _bt_mark_scankey_required(ScanKey skey)
  *
  * scan: index scan descriptor (containing a search-type scankey)
  * page: buffer page containing index tuple
- * offnum: offset number of index tuple (must be a valid item!)
+ * offnum: offset number of index tuple (must be hikey or a valid item!)
  * dir: direction we are scanning in
  * continuescan: output parameter (will be set correctly in all cases)
  *
- * Caller must hold pin and lock on the index page.
+ * Caller must hold pin and lock on the index page.  Caller can pass a high
+ * key offnum in the hopes of discovering that the scan need not continue on
+ * to a page to the right.  We don't currently bother limiting high key
+ * comparisons to SK_BT_REQFWD scan keys.
  */
 IndexTuple
 _bt_checkkeys(IndexScanDesc scan,
@@ -1359,6 +1362,7 @@ _bt_checkkeys(IndexScanDesc scan,
 	ItemId		iid = PageGetItemId(page, offnum);
 	bool		tuple_alive;
 	IndexTuple	tuple;
+	int			tupnatts;
 	TupleDesc	tupdesc;
 	BTScanOpaque so;
 	int			keysz;
@@ -1372,24 +1376,21 @@ _bt_checkkeys(IndexScanDesc scan,
 	 * killed tuple as not passing the qual.  Most of the time, it's a win to
 	 * not bother examining the tuple's index keys, but just return
 	 * immediately with continuescan = true to proceed to the next tuple.
-	 * However, if this is the last tuple on the page, we should check the
-	 * index keys to prevent uselessly advancing to the next page.
+	 * However, if this is the first tuple on the page, and we're doing a
+	 * backward scan, we should check the index keys to prevent uselessly
+	 * advancing to the page to the left.  This is similar to the high key
+	 * optimization used by forward scan callers.
 	 */
 	if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
 	{
-		/* return immediately if there are more tuples on the page */
-		if (ScanDirectionIsForward(dir))
-		{
-			if (offnum < PageGetMaxOffsetNumber(page))
-				return NULL;
-		}
-		else
-		{
-			BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-			if (offnum > P_FIRSTDATAKEY(opaque))
-				return NULL;
-		}
+		/* forward scan callers check high key instead */
+		Assert(offnum >= P_FIRSTDATAKEY(opaque));
+		if (ScanDirectionIsForward(dir))
+			return NULL;
+		else if (offnum > P_FIRSTDATAKEY(opaque))
+			return NULL;
 
 		/*
 		 * OK, we want to check the keys so we can set continuescan correctly,
@@ -1401,6 +1402,7 @@ _bt_checkkeys(IndexScanDesc scan,
 		tuple_alive = true;
 
 	tuple = (IndexTuple) PageGetItem(page, iid);
+	tupnatts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
 
 	tupdesc = RelationGetDescr(scan->indexRelation);
 	so = (BTScanOpaque) scan->opaque;
@@ -1412,11 +1414,24 @@ _bt_checkkeys(IndexScanDesc scan,
 		bool		isNull;
 		Datum		test;
 
-		Assert(key->sk_attno <= BTreeTupleGetNAtts(tuple, scan->indexRelation));
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (key->sk_attno > tupnatts)
+		{
+			Assert(offnum == P_HIKEY);
+			Assert(ScanDirectionIsForward(dir));
+			continue;
+		}
+
 		/* row-comparison keys need special processing */
 		if (key->sk_flags & SK_ROW_HEADER)
 		{
-			if (_bt_check_rowcompare(key, tuple, tupdesc, dir, continuescan))
+			if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+									 continuescan))
 				continue;
 			return NULL;
 		}
@@ -1547,8 +1562,8 @@ _bt_checkkeys(IndexScanDesc scan,
  * This is a subroutine for _bt_checkkeys, which see for more info.
  */
 static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
-					 ScanDirection dir, bool *continuescan)
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
 {
 	ScanKey		subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
 	int32		cmpresult = 0;
@@ -1565,6 +1580,19 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
 
 		Assert(subkey->sk_flags & SK_ROW_MEMBER);
 
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (subkey->sk_attno > tupnatts)
+		{
+			Assert(ScanDirectionIsForward(dir));
+			cmpresult = 0;
+			continue;
+		}
+
 		datum = index_getattr(tuple,
 							  subkey->sk_attno,
 							  tupdesc,
-- 
2.17.1

In reply to: Peter Geoghegan (#86)
2 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sun, Mar 10, 2019 at 5:17 PM Peter Geoghegan <pg@bowt.ie> wrote:

The regression that I mentioned earlier isn't in pgbench type
workloads (even when the distribution is something more interesting
that the uniform distribution default). It is only in workloads with
lots of page splits and lots of index churn, where we get most of the
benefit of the patch, but also where the costs are most apparent.
Hopefully it can be fixed, but if not I'm inclined to think that it's
a price worth paying. This certainly still needs further analysis and
discussion, though. This revision of the patch does not attempt to
address that problem in any way.

I believe that I've figured out what's going on here.

At first, I thought that this regression was due to the cycles that
have been added to page splits, but that doesn't seem to be the case
at all. Nothing that I did to make page splits faster helped (e.g.
temporarily go back to doing them "bottom up" made no difference). CPU
utilization was consistently slightly *higher* with the master branch
(patch spent slightly more CPU time idle). I now believe that the
problem is with LWLock/buffer lock contention on index pages, and that
that's an inherent cost with a minority of write-heavy high contention
workloads. A cost that we should just accept.

Making the orderline primary key about 40% smaller increases
contention when BenchmarkSQL is run with this particular
configuration. The latency for the NEW_ORDER transaction went from
~4ms average on master to ~5ms average with the patch, while the
latency for other types of transactions was either unchanged or
improved. It's noticeable, but not that noticeable. This needs to be
put in context. The final transactions per minute for this
configuration was 250,000, with a total of only 100 warehouses. What
this boils down to is that the throughput per warehouse is about 8000%
of the maximum valid NOPM specified by the TPC-C spec [1]. In other
words, the database is too small relative to the machine, by a huge
amount, making the result totally and utterly invalid if you go on
what the TPC-C spec says. This exaggerates the LWLock/buffer lock
contention on index pages.
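
To spell out the arithmetic (a back-of-the-envelope check, taking the
spec's usual ceiling of roughly 12.86 new orders per minute per
warehouse):

  250,000 TPM * ~45% NEW_ORDER share  ~= 112,500 NOPM
  112,500 NOPM / 100 warehouses       ~= 1,125 NOPM per warehouse
  1,125 / 12.86                       ~= 87x the ceiling, i.e. roughly 8,000%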

TPC-C is supposed to simulate a real use case with a plausible
configuration, but the details here are totally unrealistic. For
example, there are 3 million customers here (there are always 30k
customers per warehouse). 250k TPM means that there were about 112k
new orders per minute. It's hard to imagine a population of 3 million
customers making 112k orders per minute. That's over 20 million orders
in the first 3 hour long run that I got these numbers from. Each of
these orders has an average of about 10 line items. These people must
be very busy, and must have an awful lot of storage space in their
homes! (There are various other factors here, such as skew, and the
details will never be completely realistic anyway, but you take my
point. TPC-C is *designed* to be a realistic distillation of a real
use case, going so far as to require usable GUI input terminals when
evaluating a formal benchmark submission.)

The benchmark that I posted in mid-February [2] (which showed better
performance across the board) was much closer to what the TPC-C spec
requires -- that was only ~400% of maximum valid NOPM (the
BenchmarkSQL html reports will tell you this if you download the
archive I posted), and had 2,000 warehouses. TPC-C is *supposed* to be
I/O bound, and I/O bound workloads are what the patch helps with the
most. The general idea with TPC-C's NOPM is that you're required to
increase the number of warehouses as throughput increases. This stops
you from getting an unrealistically favorable result by churning
through a small amount of data, from the same few warehouses.

The only benchmark that I ran that actually satisfied TPC-C's NOPM
requirements had a total of 7,000 warehouses, and was almost a full
terabyte in size on the master branch. This was run on an i3.4xlarge
high I/O AWS ec2 instance. That was substantially I/O bound, and had
an improvement in throughput that was very similar to the mid-February
results which came from my home server -- we see a ~7.5% increase in
transaction throughput after a few hours. I attach a graph of block
device reads/writes for the second 4 hour run for this same 7,000
warehouse benchmark (master and patch). This shows a substantial
reduction in I/O according to OS-level instrumentation. (Note that the
same FS/logical block device was used for both WAL and database
files.)

In conclusion: I think that this regression is a cost worth accepting.
The regression in throughput is relatively small (2% - 3%), and the
NEW_ORDER transaction seems to be the only problem (NEW_ORDER happens
to be used for 45% of all transactions with TPC-C, and inserts into
the largest index by far, without reading much). The patch overtakes
master after a few hours anyway -- the patch will still win after
about 6 hours, once the database gets big enough, despite all the
contention. As I said, I think that we see a regression *because* the
indexes are much smaller, not in spite of the fact that they're
smaller. The TPC-C/BenchmarkSQL indexes never fail to be about 40%
smaller than they are on master, no matter the details, even after
many hours.

I'm not seeing the problem when pgbench is run with a small scale
factor but with a high client count. pgbench doesn't have the benefit
of much smaller indexes, so it also doesn't bear any cost when
contention is ramped up. The pgbench_accounts primary key (which is by
far the largest index) is *precisely* the same size as it is on
master, though the other indexes do seem to be a lot smaller. They
were already tiny, though. OTOH, the TPC-C NEW_ORDER transaction does
a lot of straight inserts, localized by warehouse, with skewed access.

[1]: https://youtu.be/qYeRHK6oq7g?t=1340
[2]: /messages/by-id/CAH2-WzmsK-1qVR8xC86DXv8U0cHwfPcuH6hhA740fCeEu3XsVg@mail.gmail.com

--
Peter Geoghegan

Attachments:

7000_wh_4_hours_patch_2.svg (image/svg+xml)
7000_wh_4_hours_master_2.svg (image/svg+xml)
#88 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#87)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 12/03/2019 04:47, Peter Geoghegan wrote:

In conclusion: I think that this regression is a cost worth accepting.
The regression in throughput is relatively small (2% - 3%), and the
NEW_ORDER transaction seems to be the only problem (NEW_ORDER happens
to be used for 45% of all transactions with TPC-C, and inserts into
the largest index by far, without reading much). The patch overtakes
master after a few hours anyway -- the patch will still win after
about 6 hours, once the database gets big enough, despite all the
contention. As I said, I think that we see a regression*because* the
indexes are much smaller, not in spite of the fact that they're
smaller. The TPC-C/BenchmarkSQL indexes never fail to be about 40%
smaller than they are on master, no matter the details, even after
many hours.

Yeah, that's fine. I'm curious, though, could you bloat the indexes back
to the old size by setting the fillfactor?

- Heikki

In reply to: Heikki Linnakangas (#88)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Mon, Mar 11, 2019 at 11:30 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Yeah, that's fine. I'm curious, though, could you bloat the indexes back
to the old size by setting the fillfactor?

I think that that might work, though it's hard to say for sure offhand.

The "split after new item" optimization is supposed to be a variation
of rightmost splits, of course. We apply fillfactor in the same way
much of the time. You would still literally split immediately after
the new item some of the time, though, which makes it unclear how much
bloat there would be without testing it.

Some indexes mostly apply fillfactor in non-rightmost pages, while
other indexes mostly split at the exact point past the new item,
depending on details like the size of the groupings.

I am currently doing a multi-day 6,000 warehouse benchmark, since I
want to be sure that the bloat resistance will hold up over days. I
think that it will, because there aren't that many updates, and
they're almost all HOT-safe. I'll put the idea of a 50/50 fillfactor
benchmark with the high-contention/regressed workload on my TODO list,
though.

--
Peter Geoghegan

#90 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#87)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Mon, Mar 11, 2019 at 10:47 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Mar 10, 2019 at 5:17 PM Peter Geoghegan <pg@bowt.ie> wrote:

The regression that I mentioned earlier isn't in pgbench type
workloads (even when the distribution is something more interesting
that the uniform distribution default). It is only in workloads with
lots of page splits and lots of index churn, where we get most of the
benefit of the patch, but also where the costs are most apparent.
Hopefully it can be fixed, but if not I'm inclined to think that it's
a price worth paying. This certainly still needs further analysis and
discussion, though. This revision of the patch does not attempt to
address that problem in any way.

I believe that I've figured out what's going on here.

At first, I thought that this regression was due to the cycles that
have been added to page splits, but that doesn't seem to be the case
at all. Nothing that I did to make page splits faster helped (e.g.
temporarily go back to doing them "bottom up" made no difference). CPU
utilization was consistently slightly *higher* with the master branch
(patch spent slightly more CPU time idle). I now believe that the
problem is with LWLock/buffer lock contention on index pages, and that
that's an inherent cost with a minority of write-heavy high contention
workloads. A cost that we should just accept.

If I wanted to try to say this in fewer words, would it be fair to say
that reducing the size of an index by 40% without changing anything
else can increase contention on the remaining pages?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In reply to: Robert Haas (#90)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Mar 12, 2019 at 11:32 AM Robert Haas <robertmhaas@gmail.com> wrote:

If I wanted to try to say this in fewer words, would it be fair to say
that reducing the size of an index by 40% without changing anything
else can increase contention on the remaining pages?

Yes.

--
Peter Geoghegan

#92 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#91)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Mar 12, 2019 at 2:34 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Mar 12, 2019 at 11:32 AM Robert Haas <robertmhaas@gmail.com> wrote:

If I wanted to try to say this in fewer words, would it be fair to say
that reducing the size of an index by 40% without changing anything
else can increase contention on the remaining pages?

Yes.

Hey, I understood something today!

I think it's pretty clear that we have to view that as acceptable. I
mean, we could reduce contention even further by finding a way to make
indexes 40% larger, but I think it's clear that nobody wants that.
Now, maybe in the future we'll want to work on other techniques for
reducing contention, but I don't think we should make that the problem
of this patch, especially because the regressions are small and go
away after a few hours of heavy use. We should optimize for the case
where the user intends to keep the database around for years, not
hours.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In reply to: Robert Haas (#92)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Mar 12, 2019 at 11:40 AM Robert Haas <robertmhaas@gmail.com> wrote:

Hey, I understood something today!

And I said something that could be understood!

I think it's pretty clear that we have to view that as acceptable. I
mean, we could reduce contention even further by finding a way to make
indexes 40% larger, but I think it's clear that nobody wants that.
Now, maybe in the future we'll want to work on other techniques for
reducing contention, but I don't think we should make that the problem
of this patch, especially because the regressions are small and go
away after a few hours of heavy use. We should optimize for the case
where the user intends to keep the database around for years, not
hours.

I think so too. There is a feature in other database systems called
"reverse key indexes", which deals with this problem in a rather
extreme way. This situation is very similar to the situation with
rightmost page splits, where fillfactor is applied to pack leaf pages
full. The only difference is that there are multiple groupings, not
just one single implicit grouping (everything in the index). You could
probably make very similar observations about rightmost page splits
that apply leaf fillfactor.

Here is an example of how the largest index looks for master with the
100 warehouse case that was slightly regressed:

    table_name    |      index_name       | page_type |  npages   | avg_live_items | avg_dead_items | avg_item_size
------------------+-----------------------+-----------+-----------+----------------+----------------+---------------
 bmsql_order_line | bmsql_order_line_pkey | R         |         1 |         54.000 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | I         |    11,482 |        143.200 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | L         | 1,621,316 |        139.458 |          0.003 |        24.000

Here is what we see with the patch:

    table_name    |      index_name       | page_type |  npages   | avg_live_items | avg_dead_items | avg_item_size
------------------+-----------------------+-----------+-----------+----------------+----------------+---------------
 bmsql_order_line | bmsql_order_line_pkey | R         |         1 |         29.000 |          0.000 |        22.000
 bmsql_order_line | bmsql_order_line_pkey | I         |     5,957 |        159.149 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | L         |   936,170 |        233.496 |          0.052 |        23.999

REINDEX would leave bmsql_order_line_pkey with 262 items, and we see
here that it has 233 after several hours, which is pretty good given
the amount of contention. The index actually looks very much like it
was just REINDEXED when initial bulk loading finishes, before we get
any updates, even though that happens using retail insertions.
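
As a rough cross-check of that 262 figure (assuming an 8192 byte page, a
24 byte page header, a 16 byte nbtree special area, and the 24 byte leaf
items plus 4 byte line pointers shown above):

  usable space   ~= (8192 - 24 - 16) * 0.90 (default leaf fillfactor) ~= 7,337 bytes
  items per leaf ~= 7,337 / (24 + 4) ~= 262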

Notice that the number of internal pages is reduced by almost a full
50% -- it's somewhat better than the reduction in the number of leaf
pages, because the benefits compound (items in the root are even a bit
smaller, because of this compounding effect, despite alignment
effects). Internal pages are the most important pages to have cached,
but also potentially the biggest points of contention.

--
Peter Geoghegan

#94 Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#87)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

Hi,

On 2019-03-11 19:47:29 -0700, Peter Geoghegan wrote:

I now believe that the problem is with LWLock/buffer lock contention
on index pages, and that that's an inherent cost with a minority of
write-heavy high contention workloads. A cost that we should just
accept.

Have you looked at an offwake or lwlock wait graph (bcc tools) or
something in that vein? Would be interesting to see what is waiting for
what most often...

Greetings,

Andres Freund

In reply to: Andres Freund (#94)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Mar 12, 2019 at 12:40 PM Andres Freund <andres@anarazel.de> wrote:

Have you looked at an offwake or lwlock wait graph (bcc tools) or
something in that vein? Would be interesting to see what is waiting for
what most often...

Not recently, though I did use your BCC script for this very purpose
quite a few months ago. I don't remember it helping that much at the
time, but then that was with a version of the patch that lacked a
couple of important optimizations that we have now. We're now very
careful to not descend to the left with an equal pivot tuple. We
descend right instead when that's definitely the only place we'll find
matches (a high key doesn't count as a match in almost all cases!).
Edge-cases where we unnecessarily move left then right, or
unnecessarily move right a second time once on the leaf level have
been fixed. I fixed the regression I was worried about at the time,
without getting much benefit from the BCC script, and moved on.

This kind of minutiae is more important than it sounds. I have used
EXPLAIN(ANALYZE, BUFFERS) instrumentation to make sure that I
understand where every single block access comes from with these
edge-cases, paying close attention to the structure of the index, and
how the key space is broken up (the values of pivot tuples in internal
pages). It is one thing to make the index smaller, and another thing
to take full advantage of that -- I have both. This is one of the
reasons why I believe that this minor regression cannot be avoided,
short of simply allowing the index to get bloated: I'm simply not
doing things that differently outside of the page split code, and what
I am doing differently is clearly superior. Both in general, and for
the NEW_ORDER transaction in particular.
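
(For anyone following along, this is the kind of instrumentation I mean
-- column names are from memory of the BenchmarkSQL schema, so treat
this as illustrative only:)

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM bmsql_order_line
WHERE ol_w_id = 1 AND ol_d_id = 1 AND ol_o_id = 1;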

I'll make that another TODO item -- this regression will be revisited
using BCC instrumentation. I am currently performing a multi-day
benchmark on a very large TPC-C/BenchmarkSQL database, and it will
have to wait for that. (I would like to use the same environment as
before.)

--
Peter Geoghegan

#96 Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#95)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 2019-03-12 14:15:06 -0700, Peter Geoghegan wrote:

On Tue, Mar 12, 2019 at 12:40 PM Andres Freund <andres@anarazel.de> wrote:

Have you looked at an offwake or lwlock wait graph (bcc tools) or
something in that vein? Would be interesting to see what is waiting for
what most often...

Not recently, though I did use your BCC script for this very purpose
quite a few months ago. I don't remember it helping that much at the
time, but then that was with a version of the patch that lacked a
couple of important optimizations that we have now. We're now very
careful to not descend to the left with an equal pivot tuple. We
descend right instead when that's definitely the only place we'll find
matches (a high key doesn't count as a match in almost all cases!).
Edge-cases where we unnecessarily move left then right, or
unnecessarily move right a second time once on the leaf level have
been fixed. I fixed the regression I was worried about at the time,
without getting much benefit from the BCC script, and moved on.

This kind of minutiae is more important than it sounds. I have used
EXPLAIN(ANALYZE, BUFFERS) instrumentation to make sure that I
understand where every single block access comes from with these
edge-cases, paying close attention to the structure of the index, and
how the key space is broken up (the values of pivot tuples in internal
pages). It is one thing to make the index smaller, and another thing
to take full advantage of that -- I have both. This is one of the
reasons why I believe that this minor regression cannot be avoided,
short of simply allowing the index to get bloated: I'm simply not
doing things that differently outside of the page split code, and what
I am doing differently is clearly superior. Both in general, and for
the NEW_ORDER transaction in particular.

I'll make that another TODO item -- this regression will be revisited
using BCC instrumentation. I am currently performing a multi-day
benchmark on a very large TPC-C/BenchmarkSQL database, and it will
have to wait for that. (I would like to use the same environment as
before.)

I'm basically just curious which buffers have most of the additional
contention. Is it the lower number of leaf pages, the inner pages, or
(somewhat inexplicably) the meta page, or ...? I was thinking that the
callstack that e.g. my lwlock tool gives should be able to explain what
callstack most of the waits are occurring on.

(I should work a bit on that script, I locally had a version that showed
both waiters and the waking up callstack, but I don't find it anymore)

Greetings,

Andres Freund

In reply to: Andres Freund (#96)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Mar 12, 2019 at 2:22 PM Andres Freund <andres@anarazel.de> wrote:

I'm basically just curious which buffers have most of the additional
contention. Is it the lower number of leaf pages, the inner pages, or
(somewhat unexplicably) the meta page, or ...? I was thinking that the
callstack that e.g. my lwlock tool gives should be able to explain what
callstack most of the waits are occuring on.

Right -- that's exactly what I'm interested in, too. If we can
characterize the contention in terms of the types of nbtree blocks
that are involved (their level), that could be really helpful. There
are 200x+ more leaf blocks than internal blocks, so the internal
blocks are a lot hotter. But, there are also a lot fewer splits of
internal pages, because you need hundreds of leaf page splits to get
one internal split.

Is the problem contention caused by internal page splits, or is it
contention in internal pages that can be traced back to leaf splits,
that insert a downlink in to their parent page? It would be really
cool to have some idea of the answers to questions like these.

--
Peter Geoghegan

In reply to: Heikki Linnakangas (#67)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary
search like _bt_binsrch does, but the bounds caching is only done in
_bt_binsrch_insert. Seems more clear to have separate functions for them
now, even though there's some duplication.

/*
* Do the insertion. First move right to find the correct page to
* insert to, if necessary. If we're inserting to a non-unique index,
* _bt_search() already did this when it checked if a move to the
* right was required for leaf page. Insertion scankey's scantid
* would have been filled out at the time. On a unique index, the
* current buffer is the first buffer containing duplicates, however,
* so we may need to move right to the correct location for this
* tuple.
*/
if (checkingunique || itup_key->heapkeyspace)
_bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

The attached new version simplifies this, IMHO. The bounds and the
current buffer go together in the same struct, so it's easier to keep
track whether the bounds are valid or not.

Now that you have a full understanding of how the negative infinity
sentinel values work, and how page deletion's leaf page search and
!heapkeyspace indexes need to be considered, I think that we should
come back to this _bt_binsrch()/_bt_findsplitloc() stuff. My sense is
that you now have a full understanding of all the subtleties of the
patch, including those that affect unique index insertion. That
will make it much easier to talk about these unresolved questions.

My current sense is that it isn't useful to store the current buffer
alongside the binary search bounds/hint. It'll hardly ever need to be
invalidated, because we'll hardly ever have to move right within
_bt_findsplitloc() when doing unique index insertion (as I said
before, the regression tests *never* have to do this according to
gcov). We're talking about a very specific set of conditions here, so
I'd like something that's lightweight and specialized. I agree that
the savebinsrch/restorebinsrch fields are a bit ugly, though. I can't
think of anything that's better offhand. Perhaps you can suggest
something that is both lightweight, and an improvement on
savebinsrch/restorebinsrch.

I'm of the opinion that having a separate _bt_binsrch_insert() does
not make anything clearer. Actually, I think that saving the bounds
within the original _bt_binsrch() makes the design of that function
clearer, not less clear. It's all quite confusing at the moment, given
the rightmost/!leaf/page empty special cases. Seeing how the bounds
are reused (or not reused) outside of _bt_binsrch() helps with that.

The first 3 patches seem commitable now, but I think that it's
important to be sure that I've addressed everything you raised
satisfactorily before pushing. Or that everything works in a way that
you can live with, at least.

It would be great if you could take a look at the 'Add high key
"continuescan" optimization' patch, which is the only one you haven't
commented on so far (excluding the amcheck "relocate" patch, which is
less important). I can put that one off for a while after the first 3
go in. I will also put off the "split after new item" commit for at
least a week or two. I'm sure that the idea behind the "continuescan"
patch will now seem pretty obvious to you -- it's just taking
advantage of the high key when an index scan on the leaf level (which
uses a search style scankey, not an insertion style scankey) looks
like it may have to move to the next leaf page, but we'd like to avoid
it where possible. Checking the high key there is now much more likely
to result in the index scan not going to the next page, since we're
more careful when considering a leaf split point these days. The high
key often looks like the items on the page to the right, not the items
on the same page.
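
To make that concrete with a made-up example (table and column names are
purely illustrative):

CREATE TABLE orders (supplier_id int4, invoice_id int4, item_no int4);
CREATE INDEX orders_supp_inv ON orders (supplier_id, invoice_id);

SELECT * FROM orders WHERE supplier_id = 42;

-- With suffix truncation picking the split point, the last leaf page
-- holding supplier_id = 42 entries will often end up with a high key
-- like (43, -inf).  A forward scan for supplier_id = 42 can then see
-- from the high key alone that the right sibling cannot contain any
-- further matches, and so never visits it.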

Thanks
--
Peter Geoghegan

In reply to: Robert Haas (#92)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Mar 12, 2019 at 11:40 AM Robert Haas <robertmhaas@gmail.com> wrote:

I think it's pretty clear that we have to view that as acceptable. I
mean, we could reduce contention even further by finding a way to make
indexes 40% larger, but I think it's clear that nobody wants that.

I found this analysis of bloat in the production database of Gitlab in
January 2019 fascinating:

https://about.gitlab.com/handbook/engineering/infrastructure/blueprint/201901-postgres-bloat/

They determined that their tables consisted of about 2% bloat, whereas
indexes were 51% bloat (determined by running VACUUM FULL, and
measuring how much smaller indexes and tables were afterwards). That
in itself may not be that telling. What is telling is the index bloat
disproportionately affects certain kinds of indexes. As they put it,
"Indexes that do not serve a primary key constraint make up 95% of the
overall index bloat". In other words, the vast majority of all bloat
occurs within non-unique indexes, with most remaining bloat in unique
indexes.
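
(Their measurement approach boils down to something like the following
-- relation names are just placeholders:)

SELECT pg_size_pretty(pg_relation_size('some_table_pkey'));  -- before
VACUUM FULL some_table;  -- rewrites the table and rebuilds its indexes
SELECT pg_size_pretty(pg_relation_size('some_table_pkey'));  -- after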

One factor that could be relevant is that unique indexes get a lot
more opportunistic LP_DEAD killing. Unique indexes don't rely on the
similar-but-distinct kill_prior_tuple optimization -- a lot more
tuples can be killed within _bt_check_unique() than with
kill_prior_tuple in realistic cases. That's probably not really that
big a factor, though. It seems almost certain that "getting tired" is
the single biggest problem.

The blog post drills down further, and cites the examples of several
*extremely* bloated indexes on a single-column, which is obviously low
cardinality. This includes an index on a boolean field, and an index
on an enum-like text field. In my experience, having many indexes like
that is very common in real world applications, though not at all
common in popular benchmarks (with the exception of TPC-E).

It also looks like they may benefit from the "split after new item"
optimization, at least among the few unique indexes that were very
bloated, such as merge_requests_pkey:

https://gitlab.com/snippets/1812014

Gitlab is open source, so it should be possible to confirm my theory
about the "split after new item" optimization (I am certain about
"getting tired", though). Their schema is defined here:

https://gitlab.com/gitlab-org/gitlab-ce/blob/master/db/schema.rb

I don't have time to confirm all this right now, but I am pretty
confident that they have both problems. And almost as confident that
they'd notice substantial benefits from this patch series.
--
Peter Geoghegan

#100 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#98)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 13/03/2019 03:28, Peter Geoghegan wrote:

On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary
search like _bt_binsrch does, but the bounds caching is only done in
_bt_binsrch_insert. Seems more clear to have separate functions for them
now, even though there's some duplication.

/*
* Do the insertion. First move right to find the correct page to
* insert to, if necessary. If we're inserting to a non-unique index,
* _bt_search() already did this when it checked if a move to the
* right was required for leaf page. Insertion scankey's scantid
* would have been filled out at the time. On a unique index, the
* current buffer is the first buffer containing duplicates, however,
* so we may need to move right to the correct location for this
* tuple.
*/
if (checkingunique || itup_key->heapkeyspace)
_bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

The attached new version simplifies this, IMHO. The bounds and the
current buffer go together in the same struct, so it's easier to keep
track whether the bounds are valid or not.

Now that you have a full understanding of how the negative infinity
sentinel values work, and how page deletion's leaf page search and
!heapkeyspace indexes need to be considered, I think that we should
come back to this _bt_binsrch()/_bt_findsplitloc() stuff. My sense is
that you now have a full understanding of all the subtleties of the
patch, including those that affect unique index insertion. That
will make it much easier to talk about these unresolved questions.

My current sense is that it isn't useful to store the current buffer
alongside the binary search bounds/hint. It'll hardly ever need to be
invalidated, because we'll hardly ever have to move right within
_bt_findsplitloc() when doing unique index insertion (as I said
before, the regression tests *never* have to do this according to
gcov).

It doesn't matter how often it happens; the code still needs to deal
with it. So let's try to make it as readable as possible.

We're talking about a very specific set of conditions here, so
I'd like something that's lightweight and specialized. I agree that
the savebinsrch/restorebinsrch fields are a bit ugly, though. I can't
think of anything that's better offhand. Perhaps you can suggest
something that is both lightweight, and an improvement on
savebinsrch/restorebinsrch.

Well, IMHO holding the buffer and the bounds in the new struct is more
clean than the savebinsrc/restorebinsrch flags. That's exactly why I
suggested it. I don't know what else to suggest. I haven't done any
benchmarking, but I doubt there's any measurable difference.

I'm of the opinion that having a separate _bt_binsrch_insert() does
not make anything clearer. Actually, I think that saving the bounds
within the original _bt_binsrch() makes the design of that function
clearer, not less clear. It's all quite confusing at the moment, given
the rightmost/!leaf/page empty special cases. Seeing how the bounds
are reused (or not reused) outside of _bt_binsrch() helps with that.

Ok. I think having some code duplication is better than one function
that tries to do many things, but I'm not wedded to that.

- Heikki

#101 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#98)
1 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 13/03/2019 03:28, Peter Geoghegan wrote:

It would be great if you could take a look at the 'Add high key
"continuescan" optimization' patch, which is the only one you haven't
commented on so far (excluding the amcheck "relocate" patch, which is
less important). I can put that one off for a while after the first 3
go in. I will also put off the "split after new item" commit for at
least a week or two. I'm sure that the idea behind the "continuescan"
patch will now seem pretty obvious to you -- it's just taking
advantage of the high key when an index scan on the leaf level (which
uses a search style scankey, not an insertion style scankey) looks
like it may have to move to the next leaf page, but we'd like to avoid
it where possible. Checking the high key there is now much more likely
to result in the index scan not going to the next page, since we're
more careful when considering a leaf split point these days. The high
key often looks like the items on the page to the right, not the items
on the same page.

Oh yeah, that makes perfect sense. I wonder why we haven't done it like
that before? The new page split logic makes it more likely to help, but
even without that, I don't see any downside.

I find it a bit confusing, that the logic is now split between
_bt_checkkeys() and _bt_readpage(). For a forward scan, _bt_readpage()
does the high-key check, but the corresponding "first-key" check in a
backward scan is done in _bt_checkkeys(). I'd suggest moving the logic
completely to _bt_readpage(), so that it's in one place. With that,
_bt_checkkeys() can always check the keys as it's told, without looking
at the LP_DEAD flag. Like the attached.

- Heikki

Attachments:

v16-heikki-0001-Add-high-key-continuescan-optimization.patch (text/x-patch)
From 4b5ea0f361e3feda93852bd084fb0d325e808e4c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 12 Nov 2018 13:11:21 -0800
Subject: [PATCH 1/1] Add high key "continuescan" optimization.

Teach B-Tree forward index scans to check the high key before moving to
the next page in the hopes of finding that it isn't actually necessary
to move to the next page.  We already opportunistically force a key
check of the last item on leaf pages, even when it's clear that it
cannot be returned to the scan due to being dead-to-all, for the same
reason.  Since forcing the last item to be key checked no longer makes
any difference in the case of forward scans, the existing extra key
check is now only used for backwards scans.  Like the existing check,
the new check won't always work out, but that seems like an acceptable
price to pay.

The new approach is more effective than just checking non-pivot tuples,
especially with composite indexes and non-unique indexes.  The high key
represents an upper bound on all values that can appear on the page,
which is often greater than whatever tuple happens to appear last at the
time of the check.  Also, suffix truncation's new logic for picking a
split point will often result in high keys that are relatively
dissimilar to the other (non-pivot) tuples on the page, and therefore
more likely to indicate that the scan need not proceed to the next page.

Note that even pre-pg_upgrade'd v3 indexes make use of this
optimization.

(This is Heikki's refactored version)
---
 src/backend/access/nbtree/nbtsearch.c |  86 ++++++++++++++++++---
 src/backend/access/nbtree/nbtutils.c  | 103 +++++++++++---------------
 src/include/access/nbtree.h           |   3 +-
 3 files changed, 122 insertions(+), 70 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index af3da3aa5b6..243be6f410d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1220,7 +1220,6 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	OffsetNumber minoff;
 	OffsetNumber maxoff;
 	int			itemIndex;
-	IndexTuple	itup;
 	bool		continuescan;
 
 	/*
@@ -1241,6 +1240,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
 	}
 
+	continuescan = true;		/* default assumption */
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1282,23 +1282,58 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		while (offnum <= maxoff)
 		{
-			itup = _bt_checkkeys(scan, page, offnum, dir, &continuescan);
-			if (itup != NULL)
+			ItemId		iid = PageGetItemId(page, offnum);
+			IndexTuple	itup;
+
+			/*
+			 * If the scan specifies not to return killed tuples, then we
+			 * treat a killed tuple as not passing the qual.  Most of the
+			 * time, it's a win to not bother examining the tuple's index
+			 * keys, but just skip to the next tuple.
+			 */
+			if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+			{
+				offnum = OffsetNumberNext(offnum);
+				continue;
+			}
+
+			itup = (IndexTuple) PageGetItem(page, iid);
+
+			if (_bt_checkkeys(scan, itup, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
 				_bt_saveitem(so, itemIndex, offnum, itup);
 				itemIndex++;
 			}
+			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
-			{
-				/* there can't be any more matches, so stop */
-				so->currPos.moreRight = false;
 				break;
-			}
 
 			offnum = OffsetNumberNext(offnum);
 		}
 
+		/*
+		 * We don't need to visit page to the right when the high key
+		 * indicates that no more matches will be found there.
+		 *
+		 * Checking the high key like this works out more often than you might
+		 * think.  Leaf page splits pick a split point between the two most
+		 * dissimilar tuples (this is weighed against the need to evenly share
+		 * free space).  Leaf pages with high key attribute values that can
+		 * only appear on non-pivot tuples on the right sibling page are
+		 * common.
+		 */
+		if (continuescan && !P_RIGHTMOST(opaque))
+		{
+			ItemId		iid = PageGetItemId(page, P_HIKEY);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, iid);
+
+			_bt_checkkeys(scan, itup, dir, &continuescan);
+		}
+
+		if (!continuescan)
+			so->currPos.moreRight = false;
+
 		Assert(itemIndex <= MaxIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
@@ -1313,8 +1348,41 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		while (offnum >= minoff)
 		{
-			itup = _bt_checkkeys(scan, page, offnum, dir, &continuescan);
-			if (itup != NULL)
+			ItemId		iid = PageGetItemId(page, offnum);
+			IndexTuple	itup;
+			bool		tuple_alive;
+			bool		passes_quals;
+
+			/*
+			 * If the scan specifies not to return killed tuples, then we
+			 * treat a killed tuple as not passing the qual.  Most of the
+			 * time, it's a win to not bother examining the tuple's index
+			 * keys, but just skip to the next tuple (previous, actually,
+			 * since we're scanning backwards).  However, if this is the
+			 * first tuple on the page, we do check the index keys, to prevent
+			 * uselessly advancing to the page to the left.  This is similar
+			 * to the high key optimization used by forward scans.
+			 */
+			if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+			{
+				BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+				Assert(offnum >= P_FIRSTDATAKEY(opaque));
+				if (offnum > P_FIRSTDATAKEY(opaque))
+				{
+					offnum = OffsetNumberPrev(offnum);
+					continue;
+				}
+
+				tuple_alive = false;
+			}
+			else
+				tuple_alive = true;
+
+			itup = (IndexTuple) PageGetItem(page, iid);
+
+			passes_quals = _bt_checkkeys(scan, itup, dir, &continuescan);
+			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
 				itemIndex--;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 2c05fb5e451..1b1d889bba8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -47,7 +47,7 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
 static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
-					 IndexTuple tuple, TupleDesc tupdesc,
+					 IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
 
 
@@ -1350,8 +1350,7 @@ _bt_mark_scankey_required(ScanKey skey)
 /*
  * Test whether an indextuple satisfies all the scankey conditions.
  *
- * If so, return the address of the index tuple on the index page.
- * If not, return NULL.
+ * Returns true, if so.
  *
  * If the tuple fails to pass the qual, we also determine whether there's
  * any need to continue the scan beyond this tuple, and set *continuescan
@@ -1359,21 +1358,19 @@ _bt_mark_scankey_required(ScanKey skey)
  * this is done.
  *
  * scan: index scan descriptor (containing a search-type scankey)
- * page: buffer page containing index tuple
- * offnum: offset number of index tuple (must be a valid item!)
+ * tuple: index tuple to test
  * dir: direction we are scanning in
  * continuescan: output parameter (will be set correctly in all cases)
  *
- * Caller must hold pin and lock on the index page.
+ * Caller can pass a high key tuple in the hopes of discovering that the scan
+ * need not continue on to a page to the right.  We don't currently bother
+ * limiting high key comparisons to SK_BT_REQFWD scan keys.
  */
-IndexTuple
-_bt_checkkeys(IndexScanDesc scan,
-			  Page page, OffsetNumber offnum,
+bool
+_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
 			  ScanDirection dir, bool *continuescan)
 {
-	ItemId		iid = PageGetItemId(page, offnum);
-	bool		tuple_alive;
-	IndexTuple	tuple;
+	int			tupnatts;
 	TupleDesc	tupdesc;
 	BTScanOpaque so;
 	int			keysz;
@@ -1382,40 +1379,7 @@ _bt_checkkeys(IndexScanDesc scan,
 
 	*continuescan = true;		/* default assumption */
 
-	/*
-	 * If the scan specifies not to return killed tuples, then we treat a
-	 * killed tuple as not passing the qual.  Most of the time, it's a win to
-	 * not bother examining the tuple's index keys, but just return
-	 * immediately with continuescan = true to proceed to the next tuple.
-	 * However, if this is the last tuple on the page, we should check the
-	 * index keys to prevent uselessly advancing to the next page.
-	 */
-	if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
-	{
-		/* return immediately if there are more tuples on the page */
-		if (ScanDirectionIsForward(dir))
-		{
-			if (offnum < PageGetMaxOffsetNumber(page))
-				return NULL;
-		}
-		else
-		{
-			BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-			if (offnum > P_FIRSTDATAKEY(opaque))
-				return NULL;
-		}
-
-		/*
-		 * OK, we want to check the keys so we can set continuescan correctly,
-		 * but we'll return NULL even if the tuple passes the key tests.
-		 */
-		tuple_alive = false;
-	}
-	else
-		tuple_alive = true;
-
-	tuple = (IndexTuple) PageGetItem(page, iid);
+	tupnatts = BTreeTupleGetNAtts(tuple, scan->indexRelation);
 
 	tupdesc = RelationGetDescr(scan->indexRelation);
 	so = (BTScanOpaque) scan->opaque;
@@ -1427,13 +1391,25 @@ _bt_checkkeys(IndexScanDesc scan,
 		bool		isNull;
 		Datum		test;
 
-		Assert(key->sk_attno <= BTreeTupleGetNAtts(tuple, scan->indexRelation));
+		/*
+		 * Assume that truncated attribute (from high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (key->sk_attno > tupnatts)
+		{
+			Assert(ScanDirectionIsForward(dir));
+			continue;
+		}
+
 		/* row-comparison keys need special processing */
 		if (key->sk_flags & SK_ROW_HEADER)
 		{
-			if (_bt_check_rowcompare(key, tuple, tupdesc, dir, continuescan))
+			if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+									 continuescan))
 				continue;
-			return NULL;
+			return false;
 		}
 
 		datum = index_getattr(tuple,
@@ -1471,7 +1447,7 @@ _bt_checkkeys(IndexScanDesc scan,
 			/*
 			 * In any case, this indextuple doesn't match the qual.
 			 */
-			return NULL;
+			return false;
 		}
 
 		if (isNull)
@@ -1512,7 +1488,7 @@ _bt_checkkeys(IndexScanDesc scan,
 			/*
 			 * In any case, this indextuple doesn't match the qual.
 			 */
-			return NULL;
+			return false;
 		}
 
 		test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
@@ -1540,16 +1516,12 @@ _bt_checkkeys(IndexScanDesc scan,
 			/*
 			 * In any case, this indextuple doesn't match the qual.
 			 */
-			return NULL;
+			return false;
 		}
 	}
 
-	/* Check for failure due to it being a killed tuple. */
-	if (!tuple_alive)
-		return NULL;
-
 	/* If we get here, the tuple passes all index quals. */
-	return tuple;
+	return true;
 }
 
 /*
@@ -1562,8 +1534,8 @@ _bt_checkkeys(IndexScanDesc scan,
  * This is a subroutine for _bt_checkkeys, which see for more info.
  */
 static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
-					 ScanDirection dir, bool *continuescan)
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
 {
 	ScanKey		subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
 	int32		cmpresult = 0;
@@ -1580,6 +1552,19 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
 
 		Assert(subkey->sk_flags & SK_ROW_MEMBER);
 
+		/*
+		 * Assume that a truncated attribute (from the high key) passes the qual.
+		 * The value of a truncated attribute for the first tuple on the right
+		 * page could be any possible value, so we may have to visit the next
+		 * page.
+		 */
+		if (subkey->sk_attno > tupnatts)
+		{
+			Assert(ScanDirectionIsForward(dir));
+			cmpresult = 0;
+			continue;
+		}
+
 		datum = index_getattr(tuple,
 							  subkey->sk_attno,
 							  tupdesc,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 60622ea7906..53e4e6d194d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -586,8 +586,7 @@ extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
 extern void _bt_mark_array_keys(IndexScanDesc scan);
 extern void _bt_restore_array_keys(IndexScanDesc scan);
 extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern IndexTuple _bt_checkkeys(IndexScanDesc scan,
-			  Page page, OffsetNumber offnum,
+extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
 			  ScanDirection dir, bool *continuescan);
 extern void _bt_killitems(IndexScanDesc scan);
 extern BTCycleId _bt_vacuum_cycleid(Relation rel);
-- 
2.20.1

In reply to: Heikki Linnakangas (#101)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Thu, Mar 14, 2019 at 4:00 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Oh yeah, that makes perfect sense. I wonder why we haven't done it like
that before? The new page split logic makes it more likely to help, but
even without that, I don't see any downside.

The only downside is that we spend a few extra cycles, and that might
not work out. This optimization would have always worked, though. The
new page split logic clearly makes checking the high key in the
"continuescan" path an easy win.

I find it a bit confusing that the logic is now split between
_bt_checkkeys() and _bt_readpage(). For a forward scan, _bt_readpage()
does the high-key check, but the corresponding "first-key" check in a
backward scan is done in _bt_checkkeys(). I'd suggest moving the logic
completely to _bt_readpage(), so that it's in one place. With that,
_bt_checkkeys() can always check the keys as it's told, without looking
at the LP_DEAD flag. Like the attached.

I'm convinced. I'd like to go a bit further, and also pass tupnatts to
_bt_checkkeys(). That makes it closer to the similar
_bt_check_rowcompare() function that _bt_checkkeys() must sometimes
call. It also allows us to only call BTreeTupleGetNAtts() for the high
key, while passing down a generic, loop-invariant
IndexRelationGetNumberOfAttributes() value for non-pivot tuples.

I'll do it that way in the next revision.
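
To make that concrete, here's a minimal sketch of how the revised
arrangement could look, assuming a _bt_checkkeys() signature that also
takes tupnatts (the helper name is invented for illustration; the real
revision may simply inline this in _bt_readpage()):

static void
_bt_checkkeys_against_highkey(IndexScanDesc scan, Page page,
							  ScanDirection dir, bool *continuescan)
{
	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	IndexTuple	highkey;
	int			tupnatts;

	/* Only a forward scan can stop early based on the page high key */
	if (!ScanDirectionIsForward(dir) || P_RIGHTMOST(opaque))
		return;

	highkey = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));

	/* The high key may be truncated, so pass its own attribute count */
	tupnatts = BTreeTupleGetNAtts(highkey, scan->indexRelation);

	/* Return value doesn't matter here; we only care about *continuescan */
	(void) _bt_checkkeys(scan, highkey, tupnatts, dir, continuescan);
}

Non-pivot tuples would instead be passed the loop-invariant
IndexRelationGetNumberOfAttributes() value, so BTreeTupleGetNAtts() only
ever has to run for the high key.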

Thanks
--
Peter Geoghegan

In reply to: Heikki Linnakangas (#100)
7 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Thu, Mar 14, 2019 at 2:21 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

It doesn't matter how often it happens, the code still needs to deal
with it. So let's try to make it as readable as possible.

Well, IMHO holding the buffer and the bounds in the new struct is
cleaner than the savebinsrc/restorebinsrch flags. That's exactly why I
suggested it. I don't know what else to suggest. I haven't done any
benchmarking, but I doubt there's any measurable difference.

Fair enough. Attached is v17, which does it using the approach taken
in your earlier prototype. I even came around to your view on
_bt_binsrch_insert() -- I kept that part, too. Note, however, that I
still pass checkingunique to _bt_findinsertloc(), because that's a
distinct condition from whether or not bounds were cached (they happen
to be the same thing right now, but I don't want to assume that).
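
For reference, the insertion working area under discussion looks roughly
like this -- the field list is inferred from how v17 uses it, so treat it
as an illustrative sketch rather than the definitive declaration in
nbtree.h:

typedef struct BTInsertStateData
{
	IndexTuple	itup;			/* tuple to insert */
	Size		itemsz;			/* MAXALIGN()'d size of itup */
	BTScanInsert itup_key;		/* insertion scankey built from itup */

	Buffer		buf;			/* page the new tuple should go on */

	/*
	 * Binary search bounds cached by _bt_binsrch_insert(); only to be
	 * trusted while bounds_valid is set and buf's page is unchanged
	 */
	bool		bounds_valid;
	OffsetNumber low;
	OffsetNumber stricthigh;
} BTInsertStateData;

typedef BTInsertStateData *BTInsertState;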

This revision also integrates your approach to the "continuescan"
optimization patch, with the small tweak I mentioned yesterday (we
also pass tupnatts). I also prefer this approach.

I plan on committing the first few patches early next week, barring
any objections, or any performance problems noticed during an
additional, final round of performance validation. I won't expect
feedback from you until Monday at the earliest. It would be nice if
you could take a look at the amcheck "relocate" patch. My intention is
to push patches up to and including the amcheck "relocate" patch on
the same day (I'll leave a few hours between the first two patches, to
confirm that the first patch doesn't break the buildfarm).

BTW, my multi-day, large BenchmarkSQL benchmark continues, with some
interesting results. The first round of 12-hour runs showed the patch
nearly 6% ahead in terms of transaction throughput, with a database
that's almost 1 terabyte. The second round, which completed yesterday
and reused the database initialized for the first round, showed that
the patch had 10.7% higher throughput. That's a new record for the
patch. I'm going to leave this benchmark running for a few more days,
at least until it stops being interesting. I wonder how long it will
be before the master branch's throughput stops declining relative to
throughput with the patched version. I expect that the master branch
will reach an "index bloat saturation point" sooner or later. Indexes
in the patch's data directory continue to get larger, as expected, but
the amount of bloat accumulated over time is barely noticeable (i.e.
the pages are packed tight with tuples, and that density barely
declines over time).

This version of the patch series has attributions/credits at the end
of the commit messages. I have listed you as a secondary author on a
couple of the patches, where code was lifted from your feedback
patches. Let me know if you think that I have it right.

Thanks
--
Peter Geoghegan

Attachments:

v17-0001-Refactor-nbtree-insertion-scankeys.patch (application/octet-stream)
From 5ee2189028b8d58ccf8f0bbd965c244c2d72580d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 29 Dec 2018 15:34:48 -0800
Subject: [PATCH v17 1/7] Refactor nbtree insertion scankeys.

Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.

Use the new struct to store mutable state about an in-progress binary
search, rather than having _bt_check_unique() callers cache their binary
search in an ad-hoc manner.  This makes it easy to add a new
optimization: _bt_check_unique() now falls out of its loop immediately
in the common case where it's already clear that there couldn't possibly
be a duplicate.  More importantly, the new _bt_check_unique() scheme
makes it a lot easier to manage cached binary search effort afterwards,
from within _bt_findinsertloc().  This is needed for the upcoming patch
to make nbtree tuples unique by treating heap TID as a final tiebreaker
column.

Based on a suggestion by Andrey Lepikhov.

Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas, Andrey Lepikhov
Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com
---
 contrib/amcheck/verify_nbtree.c       |  52 ++--
 src/backend/access/nbtree/README      |  29 +-
 src/backend/access/nbtree/nbtinsert.c | 421 ++++++++++++++++----------
 src/backend/access/nbtree/nbtpage.c   |  12 +-
 src/backend/access/nbtree/nbtsearch.c | 224 ++++++++++----
 src/backend/access/nbtree/nbtsort.c   |   8 +-
 src/backend/access/nbtree/nbtutils.c  |  96 ++----
 src/backend/utils/sort/tuplesort.c    |  16 +-
 src/include/access/nbtree.h           |  80 ++++-
 9 files changed, 563 insertions(+), 375 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index bb6442de82..5426bfd8d8 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -127,9 +127,9 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
-static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+static BTScanInsert bt_right_page_check_scankey(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
@@ -139,14 +139,14 @@ static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber upperbound);
 static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber lowerbound);
 static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
+							   BTScanInsert key,
+							   Page nontarget,
 							   OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
 
@@ -838,8 +838,8 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		BTScanInsert skey;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -1030,7 +1030,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
-			ScanKey		rightkey;
+			BTScanInsert	rightkey;
 
 			/* Get item in next/right page */
 			rightkey = bt_right_page_check_scankey(state);
@@ -1082,7 +1082,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, skey, childblock);
 		}
 	}
 
@@ -1111,11 +1111,12 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
+static BTScanInsert
 bt_right_page_check_scankey(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
+	IndexTuple	firstitup;
 	BlockNumber targetnext;
 	Page		rightpage;
 	OffsetNumber nline;
@@ -1303,8 +1304,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * Return first real item scankey.  Note that this relies on right page
 	 * memory remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
+	return _bt_mkscankey(state->rel, firstitup);
 }
 
 /*
@@ -1317,8 +1318,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  * verification this way around is much more practical.
  */
 static void
-bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1423,8 +1424,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1864,13 +1864,12 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
+invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
@@ -1883,13 +1882,12 @@ invariant_leq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
+invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
 	return cmp >= 0;
 }
@@ -1905,14 +1903,12 @@ invariant_geq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							   Page nontarget, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
 	return cmp <= 0;
 }
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index b0b4ab8b76..a295a7a286 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -598,19 +598,22 @@ scankey point to comparison functions that return boolean, such as int4lt.
 There might be more than one scankey entry for a given index column, or
 none at all.  (We require the keys to appear in index column order, but
 the order of multiple keys for a given column is unspecified.)  An
-insertion scankey uses the same array-of-ScanKey data structure, but the
-sk_func pointers point to btree comparison support functions (ie, 3-way
-comparators that return int4 values interpreted as <0, =0, >0).  In an
-insertion scankey there is exactly one entry per index column.  Insertion
-scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
-used to locate the starting point of a scan, as well as for locating the
-place to insert a new index tuple.  (Note: in the case of an insertion
-scankey built from a search scankey, there might be fewer keys than
-index columns, indicating that we have no constraints for the remaining
-index columns.)  After we have located the starting point of a scan, the
-original search scankey is consulted as each index entry is sequentially
-scanned to decide whether to return the entry and whether the scan can
-stop (see _bt_checkkeys()).
+insertion scankey ("BTScanInsert" data structure) uses a similar
+array-of-ScanKey data structure, but the sk_func pointers point to btree
+comparison support functions (ie, 3-way comparators that return int4 values
+interpreted as <0, =0, >0).  In an insertion scankey there is at most one
+entry per index column.  There is also other data about the rules used to
+locate where to begin the scan, such as whether or not the scan is a
+"nextkey" scan.  Insertion scankeys are built within the btree code (eg, by
+_bt_mkscankey()) and are used to locate the starting point of a scan, as
+well as for locating the place to insert a new index tuple.  (Note: in the
+case of an insertion scankey built from a search scankey or built from a
+truncated pivot tuple, there might be fewer keys than index columns,
+indicating that we have no constraints for the remaining index columns.)
+After we have located the starting point of a scan, the original search
+scankey is consulted as each index entry is sequentially scanned to decide
+whether to return the entry and whether the scan can stop (see
+_bt_checkkeys()).
 
 We use term "pivot" index tuples to distinguish tuples which don't point
 to heap tuples, but rather used for tree navigation.  Pivot tuples includes
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 2997b1111a..54dfa0a5e1 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -51,19 +51,17 @@ typedef struct
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
-static TransactionId _bt_check_unique(Relation rel, IndexTuple itup,
-				 Relation heapRel, Buffer buf, OffsetNumber offset,
-				 ScanKey itup_scankey,
+static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
+				 Relation heapRel,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken);
-static void _bt_findinsertloc(Relation rel,
-				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
-				  IndexTuple newtup,
+static OffsetNumber _bt_findinsertloc(Relation rel,
+				  BTInsertState insertstate,
+				  bool checkingunique,
 				  BTStack stack,
 				  Relation heapRel);
+static bool _bt_useduplicatepage(Relation rel, Relation heapRel,
+					 BTInsertState insertstate);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -83,8 +81,8 @@ static void _bt_checksplitloc(FindSplitData *state,
 				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey);
+static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key,
+			Page page, OffsetNumber offnum);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
@@ -110,18 +108,26 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel)
 {
 	bool		is_unique = false;
-	int			indnkeyatts;
-	ScanKey		itup_scankey;
+	BTInsertStateData insertstate;
+	BTScanInsert itup_key;
 	BTStack		stack = NULL;
 	Buffer		buf;
-	OffsetNumber offset;
 	bool		fastpath;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	Assert(indnkeyatts != 0);
+	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);
 
 	/* we need an insertion scan key to do our search, so build one */
-	itup_scankey = _bt_mkscankey(rel, itup);
+	itup_key = _bt_mkscankey(rel, itup);
+
+	/*
+	 * Fill in the BTInsertState working area, to track the current page and
+	 * position within the page, to insert to.
+	 */
+	insertstate.itup = itup;
+	/* PageAddItem will MAXALIGN(), but be consistent */
+	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+	insertstate.itup_key = itup_key;
+	insertstate.bounds_valid = false;
+	insertstate.buf = InvalidBuffer;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -144,10 +150,8 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	 */
 top:
 	fastpath = false;
-	offset = InvalidOffsetNumber;
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
-		Size		itemsz;
 		Page		page;
 		BTPageOpaque lpageop;
 
@@ -166,9 +170,6 @@ top:
 			page = BufferGetPage(buf);
 
 			lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
-			itemsz = IndexTupleSize(itup);
-			itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this
-										 * but we need to be consistent */
 
 			/*
 			 * Check if the page is still the rightmost leaf page, has enough
@@ -177,10 +178,9 @@ top:
 			 */
 			if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
 				!P_IGNORE(lpageop) &&
-				(PageGetFreeSpace(page) > itemsz) &&
+				(PageGetFreeSpace(page) > insertstate.itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, itup_key, page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -219,10 +219,12 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, itup_key, &buf, BT_WRITE, NULL);
 	}
 
+	insertstate.buf = buf;
+	buf = InvalidBuffer;		/* insertstate.buf now owns the buffer */
+
 	/*
 	 * If we're not allowing duplicates, make sure the key isn't already in
 	 * the index.
@@ -244,19 +246,19 @@ top:
 	 * let the tuple in and return false for possibly non-unique, or true for
 	 * definitely unique.
 	 */
-	if (checkUnique != UNIQUE_CHECK_NO)
+	if (checkingunique)
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
-		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
+		xwait = _bt_check_unique(rel, &insertstate, heapRel,
 								 checkUnique, &is_unique, &speculativeToken);
 
 		if (TransactionIdIsValid(xwait))
 		{
 			/* Have to wait for the other guy ... */
-			_bt_relbuf(rel, buf);
+			_bt_relbuf(rel, insertstate.buf);
+			insertstate.buf = InvalidBuffer;
 
 			/*
 			 * If it's a speculative insertion, wait for it to finish (ie. to
@@ -277,6 +279,8 @@ top:
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		OffsetNumber newitemoff;
+
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -286,22 +290,29 @@ top:
 		 * This reasoning also applies to INCLUDE indexes, whose extra
 		 * attributes are not considered part of the key space.
 		 */
-		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
-		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+		CheckForSerializableConflictIn(rel, NULL, insertstate.buf);
+
+		/*
+		 * Do the insertion.  Note that itup_key contains state filled in by
+		 * _bt_check_unique to help _bt_findinsertloc avoid repeating its
+		 * binary search.  !checkingunique case must start its own binary
+		 * search.
+		 */
+		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
+									   stack, heapRel);
+		_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
+					   newitemoff, false);
 	}
 	else
 	{
 		/* just release the buffer */
-		_bt_relbuf(rel, buf);
+		_bt_relbuf(rel, insertstate.buf);
 	}
 
 	/* be tidy */
 	if (stack)
 		_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
+	pfree(itup_key);
 
 	return is_unique;
 }
@@ -309,10 +320,6 @@ top:
 /*
  *	_bt_check_unique() -- Check for violation of unique index constraint
  *
- * offset points to the first possible item that could conflict. It can
- * also point to end-of-page, which means that the first tuple to check
- * is the first tuple on the next page.
- *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
  * conflict is detected, no return --- just ereport().  If an xact ID is
@@ -324,16 +331,22 @@ top:
  * InvalidTransactionId because we don't want to wait.  In this case we
  * set *is_unique to false if there is a potential conflict, and the
  * core code must redo the uniqueness check later.
+ *
+ * As a side-effect, sets state in insertstate that can later be used by
+ * _bt_findinsertloc() to reuse most of the binary search work we do
+ * here.
  */
 static TransactionId
-_bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
-				 Buffer buf, OffsetNumber offset, ScanKey itup_scankey,
+_bt_check_unique(Relation rel, BTInsertState insertstate,
+				 Relation heapRel,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	IndexTuple	itup = insertstate->itup;
+	BTScanInsert itup_key = insertstate->itup_key;
 	SnapshotData SnapshotDirty;
+	OffsetNumber offset;
 	OffsetNumber maxoff;
 	Page		page;
 	BTPageOpaque opaque;
@@ -345,13 +358,22 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 
 	InitDirtySnapshot(SnapshotDirty);
 
-	page = BufferGetPage(buf);
+	page = BufferGetPage(insertstate->buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
+	/*
+	 * Find the first tuple with the same key.
+	 *
+	 * This also saves the binary search bounds in insertstate.  We use them
+	 * in the fastpath below, but also in the _bt_findinsertloc() call later.
+	 */
+	offset = _bt_binsrch_insert(rel, insertstate);
+
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
+	Assert(!insertstate->bounds_valid || insertstate->low == offset);
 	for (;;)
 	{
 		ItemId		curitemid;
@@ -364,21 +386,39 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 		 */
 		if (offset <= maxoff)
 		{
+			/*
+			 * Fastpath: In most cases, we can use cached search bounds to
+			 * limit our consideration to items that are definitely
+			 * duplicates.  This fastpath doesn't apply when the original page
+			 * is empty, or when initial offset is past the end of the
+			 * original page, which may indicate that we need to examine a
+			 * second or subsequent page.
+			 *
+			 * Note that this optimization avoids calling _bt_isequal()
+			 * entirely when there are no duplicates, as long as the location
+			 * where the key would belong is not at the end of the page.
+			 */
+			if (nbuf == InvalidBuffer && offset == insertstate->stricthigh)
+			{
+				Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
+				Assert(insertstate->low <= insertstate->stricthigh);
+				Assert(!_bt_isequal(itupdesc, itup_key, page, offset));
+				break;
+			}
+
 			curitemid = PageGetItemId(page, offset);
 
 			/*
 			 * We can skip items that are marked killed.
 			 *
-			 * Formerly, we applied _bt_isequal() before checking the kill
-			 * flag, so as to fall out of the item loop as soon as possible.
-			 * However, in the presence of heavy update activity an index may
-			 * contain many killed items with the same key; running
-			 * _bt_isequal() on each killed item gets expensive. Furthermore
-			 * it is likely that the non-killed version of each key appears
-			 * first, so that we didn't actually get to exit any sooner
-			 * anyway. So now we just advance over killed items as quickly as
-			 * we can. We only apply _bt_isequal() when we get to a non-killed
-			 * item or the end of the page.
+			 * In the presence of heavy update activity an index may contain
+			 * many killed items with the same key; running _bt_isequal() on
+			 * each killed item gets expensive.  Just advance over killed
+			 * items as quickly as we can.  We only apply _bt_isequal() when
+			 * we get to a non-killed item.  Even those comparisons could be
+			 * avoided (in the common case where there is only one page to
+			 * visit) by reusing bounds, but just skipping dead items is
+			 * sufficiently effective.
 			 */
 			if (!ItemIdIsDead(curitemid))
 			{
@@ -391,7 +431,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 				 * in real comparison, but only for ordering/finding items on
 				 * pages. - vadim 03/24/97
 				 */
-				if (!_bt_isequal(itupdesc, page, offset, indnkeyatts, itup_scankey))
+				if (!_bt_isequal(itupdesc, itup_key, page, offset))
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
@@ -488,7 +528,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 					 * otherwise be masked by this unique constraint
 					 * violation.
 					 */
-					CheckForSerializableConflictIn(rel, NULL, buf);
+					CheckForSerializableConflictIn(rel, NULL, insertstate->buf);
 
 					/*
 					 * This is a definite conflict.  Break the tuple down into
@@ -500,7 +540,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 					 */
 					if (nbuf != InvalidBuffer)
 						_bt_relbuf(rel, nbuf);
-					_bt_relbuf(rel, buf);
+					_bt_relbuf(rel, insertstate->buf);
+					insertstate->buf = InvalidBuffer;
 
 					{
 						Datum		values[INDEX_MAX_KEYS];
@@ -540,7 +581,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 					if (nbuf != InvalidBuffer)
 						MarkBufferDirtyHint(nbuf, true);
 					else
-						MarkBufferDirtyHint(buf, true);
+						MarkBufferDirtyHint(insertstate->buf, true);
 				}
 			}
 		}
@@ -552,11 +593,14 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
+			int			highkeycmp;
+
 			/* If scankey == hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+			Assert(highkeycmp <= 0);
+			if (highkeycmp != 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -600,57 +644,42 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 /*
  *	_bt_findinsertloc() -- Finds an insert location for a tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
+ *		On entry, insertstate buffer contains the page that the new tuple
+ *		belongs on, which is exclusive-locked and pinned by caller.  This
+ *		won't be exactly the right page for some callers to insert on to.
+ *		They'll have to insert into a page somewhere to the right.
  *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
+ *		On exit, insertstate buffer contains the chosen insertion page, and
+ *		the offset within that page is returned.  The lock and pin on the
+ *		original page are released in cases where initial page is not where
+ *		tuple belongs.  New buffer/page passed back to the caller is
+ *		exclusively locked, just like initial page was.
  *
- *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
- *		page returned to the caller is exclusively locked instead.
+ *		_bt_check_unique() saves the progress of the binary search it
+ *		performs, also in the insertion state.  We don't need to do any
+ *		additional binary search comparisons here most of the time, provided
+ *		caller is to insert on original page.
  *
- *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.  It is convenient to make it happen here, since microvacuuming
+ *		may invalidate a _bt_check_unique() caller's cached binary search
+ *		work.
  */
-static void
+static OffsetNumber
 _bt_findinsertloc(Relation rel,
-				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
-				  IndexTuple newtup,
+				  BTInsertState insertstate,
+				  bool checkingunique,
 				  BTStack stack,
 				  Relation heapRel)
 {
-	Buffer		buf = *bufptr;
+	Buffer		buf = insertstate->buf;
+	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(buf);
-	Size		itemsz;
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
 	OffsetNumber newitemoff;
-	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	itemsz = IndexTupleSize(newtup);
-	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
-								 * need to be consistent */
-
 	/*
 	 * Check whether the item can fit on a btree page at all. (Eventually, we
 	 * ought to try to apply TOAST methods if not.) We actually need to be
@@ -660,11 +689,11 @@ _bt_findinsertloc(Relation rel,
 	 *
 	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
 	 */
-	if (itemsz > BTMaxItemSize(page))
+	if (insertstate->itemsz > BTMaxItemSize(page))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itemsz, BTMaxItemSize(page),
+						insertstate->itemsz, BTMaxItemSize(page),
 						RelationGetRelationName(rel)),
 				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 						 "Consider a function index of an MD5 hash of the value, "
@@ -672,55 +701,40 @@ _bt_findinsertloc(Relation rel,
 				 errtableconstraint(heapRel,
 									RelationGetRelationName(rel))));
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
-	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	for (;;)
 	{
+		int			cmpval;
 		Buffer		rbuf;
 		BlockNumber rblkno;
 
 		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
+		 * An earlier _bt_check_unique() call may well have established bounds
+		 * that we can use to skip the high key check for checkingunique
+		 * callers.  This fastpath cannot be used when there are no items on
+		 * the existing page (other than high key), or when it looks like the
+		 * new item belongs last on the page, but it might go on a later page
+		 * instead.
 		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
-		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
-
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
-		}
+		if (insertstate->bounds_valid &&
+			insertstate->low <= insertstate->stricthigh &&
+			insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+			break;
 
 		/*
-		 * nope, so check conditions (b) and (c) enumerated above
+		 * If this is the last page that the tuple can legally go on, stop
+		 * here
 		 */
-		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+		if (P_RIGHTMOST(lpageop))
+			break;
+		cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
+
+		/*
+		 * May have to handle case where there is a choice of which page to
+		 * place new tuple on, and we must balance space utilization as best
+		 * we can.  Note that this may invalidate cached bounds for us.
+		 */
+		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, insertstate))
 			break;
 
 		/*
@@ -763,27 +777,101 @@ _bt_findinsertloc(Relation rel,
 		}
 		_bt_relbuf(rel, buf);
 		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		insertstate->buf = buf;
+		insertstate->bounds_valid = false;
+	}
+
+	/* Loop should not break until correct page located */
+	Assert(P_RIGHTMOST(lpageop) ||
+		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
+
+	/*
+	 * If the page we're about to insert to doesn't have enough room for the
+	 * new tuple, we will have to split it.  If it looks like the page has
+	 * LP_DEAD items, try to remove them, in hope of making room for the new
+	 * item and avoiding the split.
+	 */
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < insertstate->itemsz)
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+		insertstate->bounds_valid = false;
 	}
 
 	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
+	 * Reuse the binary search bounds established within _bt_check_unique()
+	 * when caller is a checkingunique caller, and the first leaf page it
+	 * locked is still the page that it will insert on to
 	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
-	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	newitemoff = _bt_binsrch_insert(rel, insertstate);
 
-	*bufptr = buf;
-	*offsetptr = newitemoff;
+	return newitemoff;
+}
+
+/*
+ *	_bt_useduplicatepage() -- Settle for this page of duplicates?
+ *
+ *		This function handles the question of whether or not an insertion of
+ *		a duplicate into an index should insert on the page contained in buf
+ *		when a choice must be made.
+ *
+ *		If the current page doesn't have enough free space for the new tuple
+ *		we "microvacuum" the page, removing LP_DEAD items, in the hope that it
+ *		will make enough room.
+ *
+ *		Returns true if caller should proceed with insert on buf's page.
+ *		Otherwise, caller should move on to the page to the right.
+ */
+static bool
+_bt_useduplicatepage(Relation rel, Relation heapRel,
+					 BTInsertState insertstate)
+{
+	Buffer		buf = insertstate->buf;
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque lpageop;
+
+	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	Assert(P_ISLEAF(lpageop) && !P_RIGHTMOST(lpageop));
+
+	/* Easy case -- there is space free on this page already */
+	if (PageGetFreeSpace(page) >= insertstate->itemsz)
+		return true;
+
+	/*
+	 * Before considering moving right, see if we can obtain enough space by
+	 * erasing LP_DEAD items
+	 */
+	if (P_HAS_GARBAGE(lpageop))
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+		insertstate->bounds_valid = false;
+
+		if (PageGetFreeSpace(page) >= insertstate->itemsz)
+			return true;		/* OK, now we have enough space */
+	}
+
+	/*----------
+	 * It's now clear that _bt_findinsertloc() caller will need to split
+	 * the page if it is to insert new item on to it.  The choice to move
+	 * right to the next page remains open to it, but we should not search
+	 * for free space exhaustively when there are many pages to look through.
+	 *
+	 *	_bt_findinsertloc() keeps scanning right until it:
+	 *		(a) reaches the last page where the tuple can legally go
+	 *	Or until we:
+	 *		(b) find a page with enough free space, or
+	 *		(c) get tired of searching.
+	 * (c) is not flippant; it is important because if there are many
+	 * pages' worth of equal keys, it's better to split one of the early
+	 * pages than to scan all the way to the end of the run of equal keys
+	 * on every insert.  We implement "get tired" as a random choice,
+	 * since stopping after scanning a fixed number of pages wouldn't work
+	 * well (we'd never reach the right-hand side of previously split
+	 * pages).  The probability of moving right is set at 0.99, which may
+	 * seem too high to change the behavior much, but it does an excellent
+	 * job of preventing O(N^2) behavior with many equal keys.
+	 *----------
+	 */
+	return random() <= (MAX_RANDOM_VALUE / 100);
 }
 
 /*----------
@@ -2312,24 +2400,21 @@ _bt_pgaddtup(Page page,
  * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
-_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey)
+_bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
+			OffsetNumber offnum)
 {
 	IndexTuple	itup;
+	ScanKey		scankey;
 	int			i;
 
-	/* Better be comparing to a leaf item */
+	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
 
+	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
-	for (i = 1; i <= keysz; i++)
+	for (i = 1; i <= itup_key->keysz; i++)
 	{
 		AttrNumber	attno;
 		Datum		datum;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c785bca95..56041c3d38 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1371,7 +1371,7 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 */
 			if (!stack)
 			{
-				ScanKey		itup_scankey;
+				BTScanInsert itup_key;
 				ItemId		itemid;
 				IndexTuple	targetkey;
 				Buffer		lbuf;
@@ -1421,12 +1421,10 @@ _bt_pagedel(Relation rel, Buffer buf)
 				}
 
 				/* we need an insertion scan key for the search, so build one */
-				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
-				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
-				/* don't need a pin on the page */
+				itup_key = _bt_mkscankey(rel, targetkey);
+				/* get stack to leaf page by searching index */
+				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
+				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
 				/*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index af3da3aa5b..2d26d1f0dc 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -24,6 +24,7 @@
 #include "utils/rel.h"
 
 
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 			 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -71,13 +72,9 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
  * but it can omit the rightmost column(s) of the index.
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * Return value is a stack of parent-page pointers.  *bufP is set to the
  * address of the leaf-page buffer, which is read-locked and pinned.
  * No locks are held on the parent pages, however!
@@ -93,8 +90,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+		   Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -130,8 +127,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  (access == BT_WRITE), stack_in,
+		*bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
 		/* if this is a leaf page, we're done */
@@ -144,7 +140,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -198,8 +194,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  true, stack_in, BT_WRITE, snapshot);
+		*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+							  snapshot);
 	}
 
 	return stack_in;
@@ -215,16 +211,17 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * or strictly to the right of it.
  *
  * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page.  If that entry
- * is strictly less than the scankey, or <= the scankey in the nextkey=true
- * case, then we followed the wrong link and we need to move right.
+ * tree by examining the high key entry on the page.  If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key.  When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
  *
  * If forupdate is true, we will attempt to finish any incomplete splits
  * that we encounter.  This is required when locking a target page for an
@@ -241,10 +238,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  */
 Buffer
 _bt_moveright(Relation rel,
+			  BTScanInsert key,
 			  Buffer buf,
-			  int keysz,
-			  ScanKey scankey,
-			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
 			  int access,
@@ -269,7 +264,7 @@ _bt_moveright(Relation rel,
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
 	 */
-	cmpval = nextkey ? 0 : 1;
+	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
@@ -304,7 +299,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -324,13 +319,6 @@ _bt_moveright(Relation rel,
 /*
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
  * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
  * particular, this means it is possible to return a value 1 greater than the
@@ -348,12 +336,10 @@ _bt_moveright(Relation rel,
  * the given page.  _bt_binsrch() has no lock or refcount side effects
  * on the buffer.
  */
-OffsetNumber
+static OffsetNumber
 _bt_binsrch(Relation rel,
-			Buffer buf,
-			int keysz,
-			ScanKey scankey,
-			bool nextkey)
+			BTScanInsert key,
+			Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
@@ -375,7 +361,7 @@ _bt_binsrch(Relation rel,
 	 * This can never happen on an internal page, however, since they are
 	 * never empty (an internal page must have children).
 	 */
-	if (high < low)
+	if (unlikely(high < low))
 		return low;
 
 	/*
@@ -392,7 +378,7 @@ _bt_binsrch(Relation rel,
 	 */
 	high++;						/* establish the loop invariant for high */
 
-	cmpval = nextkey ? 0 : 1;	/* select comparison value */
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
 
 	while (high > low)
 	{
@@ -400,7 +386,7 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -427,14 +413,117 @@ _bt_binsrch(Relation rel,
 	return OffsetNumberPrev(low);
 }
 
+/*
+ * Like _bt_binsrch(), but with support for caching the binary search bounds.
+ * Used during insertion, and only on the leaf level.
+ *
+ * Caches the bounds fields in insertstate, so that a subsequent call can
+ * reuse the low and strict high bound of original binary search.  Callers
+ * that use these fields directly must be prepared for the case where
+ * stricthigh isn't on the same page (it exceeds maxoff for the page), and
+ * the case where there are no items on the page (high < low).
+ */
+OffsetNumber
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
+{
+	BTScanInsert key = insertstate->itup_key;
+	Page		page;
+	BTPageOpaque opaque;
+	OffsetNumber low,
+				high,
+				stricthigh;
+	int32		result,
+				cmpval;
+
+	page = BufferGetPage(insertstate->buf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	Assert(P_ISLEAF(opaque));
+	Assert(!key->nextkey);
+
+	if (!insertstate->bounds_valid)
+	{
+		low = P_FIRSTDATAKEY(opaque);
+		high = PageGetMaxOffsetNumber(page);
+	}
+	else
+	{
+		/* Restore result of previous binary search against same page */
+		low = insertstate->low;
+		high = insertstate->stricthigh;
+	}
+
+	/*
+	 * If there are no keys on the page, return the first available slot. Note
+	 * this covers two cases: the page is really empty (no keys), or it
+	 * contains only a high key.  The latter case is possible after vacuuming.
+	 * This can never happen on an internal page, however, since they are
+	 * never empty (an internal page must have children).
+	 */
+	if (unlikely(high < low))
+	{
+		/* Caller can't use stricthigh */
+		insertstate->bounds_valid = false;
+		return low;
+	}
+
+	/*
+	 * Binary search to find the first key on the page >= scan key. (nextkey
+	 * is always false when inserting)
+	 *
+	 * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+	 * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+	 *
+	 * We can fall out when high == low.
+	 */
+	if (!insertstate->bounds_valid)
+		high++;					/* establish the loop invariant for high */
+	stricthigh = high;			/* high initially strictly higher */
+
+	cmpval = 1;					/* select comparison value */
+
+	while (high > low)
+	{
+		OffsetNumber mid = low + ((high - low) / 2);
+
+		/* We have low <= mid < high, so mid points at a real slot */
+
+		result = _bt_compare(rel, key, page, mid);
+
+		if (result >= cmpval)
+			low = mid + 1;
+		else
+		{
+			high = mid;
+
+			/*
+			 * high can only be reused by more restrictive binary search when
+			 * it's known to be strictly greater than the original scankey
+			 */
+			if (result != 0)
+				stricthigh = high;
+		}
+	}
+
+	/*
+	 * At this point we have high == low, but be careful: they could point
+	 * past the last slot on the page.
+	 *
+	 * On a leaf page, we always return the first key >= scan key (resp. >
+	 * scan key), which could be the last slot + 1.
+	 */
+	insertstate->low = low;
+	insertstate->stricthigh = stricthigh;
+	insertstate->bounds_valid = true;
+
+	return low;
+}
+
+
+
 /*----------
- *	_bt_compare() -- Compare scankey to a particular tuple on the page.
+ *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- *	keysz: number of key conditions to be checked (might be less than the
- *		number of index columns!)
  *	page/offnum: location of btree item to be compared to.
  *
  *		This routine returns:
@@ -447,25 +536,26 @@ _bt_binsrch(Relation rel,
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
- * scankey.  The actual key value stored (if any, which there probably isn't)
- * does not matter.  This convention allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first key.
- * See backend/access/nbtree/README for details.
+ * scankey.  The actual key value stored is explicitly truncated to 0
+ * attributes (explicitly minus infinity) with version 3+ indexes, but
+ * that isn't relied upon.  This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key.  See backend/access/nbtree/README for details.
  *----------
  */
 int32
 _bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
+			BTScanInsert key,
 			Page page,
 			OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
-	int			i;
+	ScanKey		scankey;
 
 	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
@@ -488,7 +578,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	scankey = key->scankeys;
+	for (int i = 1; i <= key->keysz; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -574,8 +665,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat;
 	bool		nextkey;
 	bool		goback;
+	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
-	ScanKeyData scankeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -821,8 +912,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
 	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built in the local
-	 * scankeys[] array, using the keys identified by startKeys[].
+	 * identified above.  The insertion scankey is built using the keys
+	 * identified by startKeys[].  (Remaining insertion scankey fields are
+	 * initialized after initial-positioning strategy is finalized.)
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
 	for (i = 0; i < keysCount; i++)
@@ -850,7 +942,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				_bt_parallel_done(scan);
 				return false;
 			}
-			memcpy(scankeys + i, subkey, sizeof(ScanKeyData));
+			memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
 
 			/*
 			 * If the row comparison is the last positioning key we accepted,
@@ -882,7 +974,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					if (subkey->sk_flags & SK_ISNULL)
 						break;	/* can't use null keys */
 					Assert(keysCount < INDEX_MAX_KEYS);
-					memcpy(scankeys + keysCount, subkey, sizeof(ScanKeyData));
+					memcpy(inskey.scankeys + keysCount, subkey,
+						   sizeof(ScanKeyData));
 					keysCount++;
 					if (subkey->sk_flags & SK_ROW_END)
 					{
@@ -928,7 +1021,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				FmgrInfo   *procinfo;
 
 				procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
-				ScanKeyEntryInitializeWithInfo(scankeys + i,
+				ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
 											   cur->sk_flags,
 											   cur->sk_attno,
 											   InvalidStrategy,
@@ -949,7 +1042,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
 						 BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
 						 cur->sk_attno, RelationGetRelationName(rel));
-				ScanKeyEntryInitialize(scankeys + i,
+				ScanKeyEntryInitialize(inskey.scankeys + i,
 									   cur->sk_flags,
 									   cur->sk_attno,
 									   InvalidStrategy,
@@ -1052,12 +1145,15 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
+	/* Initialize remaining insertion scan key fields */
+	inskey.nextkey = nextkey;
+	inskey.keysz = keysCount;
+
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
@@ -1086,7 +1182,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, &inskey, buf);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 363dceb5b1..a0e2e70cef 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -263,6 +263,7 @@ typedef struct BTWriteState
 {
 	Relation	heap;
 	Relation	index;
+	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
@@ -540,6 +541,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
+	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 
 	/*
 	 * We need to log index creation in WAL iff WAL archiving/streaming is
@@ -1085,7 +1087,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
-	ScanKey		indexScanKey = NULL;
 	SortSupport sortKeys;
 
 	if (merge)
@@ -1098,7 +1099,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		/* the preparation of merge */
 		itup = tuplesort_getindextuple(btspool->sortstate, true);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-		indexScanKey = _bt_mkscankey_nodata(wstate->index);
 
 		/* Prepare SortSupport data for each column */
 		sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
@@ -1106,7 +1106,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		for (i = 0; i < keysz; i++)
 		{
 			SortSupport sortKey = sortKeys + i;
-			ScanKey		scanKey = indexScanKey + i;
+			ScanKey		scanKey = wstate->inskey->scankeys + i;
 			int16		strategy;
 
 			sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1125,8 +1125,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
 		}
 
-		_bt_freeskey(indexScanKey);
-
 		for (;;)
 		{
 			load1 = true;		/* load BTSpool next ? */
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 2c05fb5e45..0250e089a6 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -56,34 +56,37 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().
+ *		Result is intended for use with _bt_compare().  Callers that don't
+ *		need to fill out the insertion scankey arguments (e.g. they use an
+ *		ad-hoc comparison routine) can pass a NULL index tuple.
  */
-ScanKey
+BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
+	BTScanInsert key;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
-	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
+	int			tupnatts;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
-	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
+	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
 	 * (INCLUDE index) columns are always omitted from scan keys.
 	 */
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
+	key = palloc(offsetof(BTScanInsertData, scankeys) +
+				 sizeof(ScanKeyData) * indnkeyatts);
+	key->nextkey = false;
+	key->keysz = Min(indnkeyatts, tupnatts);
+	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
 		FmgrInfo   *procinfo;
@@ -96,7 +99,19 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Key arguments built when caller provides no tuple are
+		 * defensively represented as NULL values.  They should never be
+		 * used.
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -108,64 +123,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 									   arg);
 	}
 
-	return skey;
-}
-
-/*
- * _bt_mkscankey_nodata
- *		Build an insertion scan key that contains 3-way comparator routines
- *		appropriate to the key datatypes, but no comparison data.  The
- *		comparison data ultimately used must match the key datatypes.
- *
- *		The result cannot be used with _bt_compare(), unless comparison
- *		data is first stored into the key entries.  Currently this
- *		routine is only called by nbtsort.c and tuplesort.c, which have
- *		their own comparison routines.
- */
-ScanKey
-_bt_mkscankey_nodata(Relation rel)
-{
-	ScanKey		skey;
-	int			indnkeyatts;
-	int16	   *indoption;
-	int			i;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	indoption = rel->rd_indoption;
-
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
-	for (i = 0; i < indnkeyatts; i++)
-	{
-		FmgrInfo   *procinfo;
-		int			flags;
-
-		/*
-		 * We can use the cached (default) support procs since no cross-type
-		 * comparison can be needed.
-		 */
-		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		flags = SK_ISNULL | (indoption[i] << SK_BT_INDOPTION_SHIFT);
-		ScanKeyEntryInitializeWithInfo(&skey[i],
-									   flags,
-									   (AttrNumber) (i + 1),
-									   InvalidStrategy,
-									   InvalidOid,
-									   rel->rd_indcollation[i],
-									   procinfo,
-									   (Datum) 0);
-	}
-
-	return skey;
-}
-
-/*
- * free a scan key made by either _bt_mkscankey or _bt_mkscankey_nodata.
- */
-void
-_bt_freeskey(ScanKey skey)
-{
-	pfree(skey);
+	return key;
 }
 
 /*
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 2946b47b46..16bda5c586 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -884,7 +884,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -919,7 +919,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	if (state->indexInfo->ii_Expressions != NULL)
 	{
@@ -945,7 +945,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -964,7 +964,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -981,7 +981,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -1014,7 +1014,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->indexRel = indexRel;
 	state->enforceUnique = enforceUnique;
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	/* Prepare SortSupport data for each column */
 	state->sortKeys = (SortSupport) palloc0(state->nKeys *
@@ -1023,7 +1023,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1042,7 +1042,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 60622ea790..7d7faa59c2 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -319,6 +319,66 @@ typedef struct BTStackData
 
 typedef BTStackData *BTStack;
 
+/*
+ * BTScanInsert is the btree-private state needed to find an initial position
+ * for an indexscan, or to insert new tuples -- an "insertion scankey" (not to
+ * be confused with a search scankey).  It's used to descend a B-Tree using
+ * _bt_search.
+ *
+ * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
+ * locate the first item >= scankey.  When nextkey is true, they will locate
+ * the first item > scan key.
+ *
+ * keysz is the number of insertion scankeys present.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared.
+ * During insertion, there must be a scan key for every attribute, but when
+ * starting a regular index scan some can be omitted.  The array is allocated
+ * as though it were a flexible array member (callers only allocate space for
+ * the keys actually used), though it's declared with a fixed size so that
+ * insertion scan keys can also be allocated on the stack.  See nbtree/README
+ * for full details.
+ */
+
+typedef struct BTScanInsertData
+{
+	/* State used to locate a position at the leaf level */
+	bool		nextkey;
+	int			keysz;			/* Size of scankeys */
+	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
+} BTScanInsertData;
+
+typedef BTScanInsertData *BTScanInsert;
+
+/*
+ * Working area used during insertion, to track where the new tuple will go.
+ *
+ * _bt_doinsert fills this in, after descending the tree to the (first legal)
+ * leaf page the new tuple belongs to.  It is used to track the insertion
+ * target while we perform uniqueness checks and decide on the final page to
+ * insert to.
+ *
+ * (This should be private to nbtinsert.c, but it's also used by
+ * _bt_binsrch_insert)
+ */
+typedef struct BTInsertStateData
+{
+	IndexTuple	itup;			/* Item we're inserting */
+	Size		itemsz;			/* Size of itup (excluding item pointer) */
+	BTScanInsert itup_key;		/* Insertion scankey */
+
+	/* Buffer containing leaf page to insert on to */
+	Buffer		buf;
+
+	/*
+	 * Cache of bounds within the current buffer, where the new key value
+	 * belongs.  Only used for insertions where _bt_check_unique is called.
+	 * See _bt_binsrch_insert and _bt_findinsertloc for details.
+	 */
+	bool		bounds_valid;
+	OffsetNumber low;
+	OffsetNumber stricthigh;
+} BTInsertStateData;
+
+typedef BTInsertStateData *BTInsertState;
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -558,16 +618,12 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
 /*
  * prototypes for functions in nbtsearch.c
  */
-extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
-extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
+		   int access, Snapshot snapshot);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -576,9 +632,7 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 /*
  * prototypes for functions in nbtutils.c
  */
-extern ScanKey _bt_mkscankey(Relation rel, IndexTuple itup);
-extern ScanKey _bt_mkscankey_nodata(Relation rel);
-extern void _bt_freeskey(ScanKey skey);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-- 
2.17.1
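
To make it easier to see how the pieces above fit together, here is a rough
sketch of the expected usage from an insertion caller such as _bt_doinsert().
This is my own illustration rather than code taken from the patch; the
variable names (rel, itup, and so on) are assumed to already be in scope:

    BTScanInsert itup_key = _bt_mkscankey(rel, itup);
    BTInsertStateData insertstate;
    BTStack     stack;
    Buffer      buf;
    OffsetNumber newitemoff;

    insertstate.itup = itup;
    insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
    insertstate.itup_key = itup_key;
    insertstate.bounds_valid = false;   /* no cached binary search bounds yet */

    /* Descend to the first leaf page the new tuple could go on */
    stack = _bt_search(rel, itup_key, &buf, BT_WRITE, NULL);
    insertstate.buf = buf;              /* stack is kept for a possible split */

    /*
     * First call establishes insertstate.low/stricthigh; a later call against
     * the same page (e.g. after _bt_check_unique) can reuse the cached bounds.
     */
    newitemoff = _bt_binsrch_insert(rel, &insertstate);

Callers that only need the comparators (nbtsort.c, tuplesort.c) simply pass a
NULL tuple to _bt_mkscankey().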

Attachment: v17-0005-Add-split-after-new-tuple-optimization.patch (application/octet-stream)
From 1a4e551010923073872ee02dd50990fbec02ec93 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 16:48:08 -0700
Subject: [PATCH v17 5/7] Add "split after new tuple" optimization.

Add additional heuristics to the algorithm for locating an optimal split
location.  New logic identifies localized monotonically increasing
values in indexes with multiple columns.  When this insertion pattern is
detected, page splits split just after the new item that provoked a page
split (or apply leaf fillfactor in the style of a rightmost page split).
This optimization is a variation of the long established leaf fillfactor
optimization used during rightmost page splits.

50/50 page splits are only appropriate with a pattern of truly random
insertions, where the average space utilization ends up at 65% - 70%.
Without this patch, affected cases have leaf pages that are no more than
about 50% full on average.  Future insertions can never make use of the
free space left behind.  With this patch, affected cases have leaf pages
that are about 90% full on average (assuming a fillfactor of 90).
Localized monotonically increasing insertion patterns are presumed to be
fairly common in real-world applications.

Note that even pg_upgrade'd v3 indexes make use of this optimization.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzkpKeZJrXvR_p7VSY1b-s85E3gHyTbZQzR0BkJ5LrWF_A@mail.gmail.com
---
 src/backend/access/nbtree/nbtsplitloc.c | 234 +++++++++++++++++++++++-
 1 file changed, 231 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index ead218d916..a25c41df68 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -70,6 +70,9 @@ static void _bt_recsplitloc(FindSplitData *state,
 static void _bt_deltasortsplits(FindSplitData *state, double fillfactormult,
 					bool usemult);
 static int	_bt_splitcmp(const void *arg1, const void *arg2);
+static bool _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
+					int leaffillfactor, bool *usemult);
+static bool _bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid);
 static OffsetNumber _bt_bestsplitloc(FindSplitData *state, int perfectpenalty,
 				 bool *newitemonleft);
 static int _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
@@ -249,9 +252,10 @@ _bt_findsplitloc(Relation rel,
 	 * Start search for a split point among list of legal split points.  Give
 	 * primary consideration to equalizing available free space in each half
 	 * of the split initially (start with default strategy), while applying
-	 * rightmost where appropriate.  Either of the two other fallback
-	 * strategies may be required for cases with a large number of duplicates
-	 * around the original/space-optimal split point.
+	 * rightmost and split-after-new-item optimizations where appropriate.
+	 * Either of the two other fallback strategies may be required for cases
+	 * with a large number of duplicates around the original/space-optimal
+	 * split point.
 	 *
 	 * Default strategy gives some weight to suffix truncation in deciding a
 	 * split point on leaf pages.  It attempts to select a split point where a
@@ -273,6 +277,44 @@ _bt_findsplitloc(Relation rel,
 		usemult = true;
 		fillfactormult = leaffillfactor / 100.0;
 	}
+	else if (_bt_afternewitemoff(&state, maxoff, leaffillfactor, &usemult))
+	{
+		/*
+		 * New item inserted at rightmost point among a localized grouping on
+		 * a leaf page -- apply "split after new item" optimization, either by
+		 * applying leaf fillfactor multiplier, or by choosing the exact split
+		 * point that leaves the new item as last on the left. (usemult is set
+		 * for us.)
+		 */
+		if (usemult)
+		{
+			/* fillfactormult should be set based on leaf fillfactor */
+			fillfactormult = leaffillfactor / 100.0;
+		}
+		else
+		{
+			/* find precise split point after newitemoff */
+			for (int i = 0; i < state.nsplits; i++)
+			{
+				SplitPoint *split = state.splits + i;
+
+				if (split->newitemonleft &&
+					newitemoff == split->firstoldonright)
+				{
+					pfree(state.splits);
+					*newitemonleft = true;
+					return newitemoff;
+				}
+			}
+
+			/*
+			 * Cannot legally split after newitemoff; proceed with split
+			 * without using fillfactor multiplier.  This is defensive, and
+			 * should never be needed in practice.
+			 */
+			fillfactormult = 0.50;
+		}
+	}
 	else
 	{
 		/* Other leaf page.  50:50 page split. */
@@ -519,6 +561,192 @@ _bt_splitcmp(const void *arg1, const void *arg2)
 	return 0;
 }
 
+/*
+ * Subroutine to determine whether or not the page should be split immediately
+ * after the would-be original page offset for the new/incoming tuple.  This
+ * is appropriate when there is a pattern of localized monotonically
+ * increasing insertions into a composite index, grouped by one or more
+ * leading attribute values.  This is prevalent in many real-world
+ * applications.  Consider the example of a composite index on '(invoice_id,
+ * item_no)', where the item_no for each invoice is an identifier assigned in
+ * ascending order (invoice_id could itself be assigned in monotonically
+ * increasing order, but that shouldn't matter).  Without this optimization,
+ * approximately 50% of space in leaf pages will be wasted by 50:50/!usemult
+ * page splits.  With this optimization, space utilization will be close to
+ * that of a similar index where all tuple insertions modify the current
+ * rightmost leaf page in the index (i.e. typically 90% for leaf pages).
+ *
+ * When the optimization is applied, the new/incoming tuple becomes the last
+ * tuple on the new left page.  (Actually, newitemoff > maxoff cases often use
+ * this optimization within indexes where monotonically increasing insertions
+ * of each grouping come in multiple "bursts" over time, such as a composite
+ * index on '(supplier_id, invoice_id, item_no)'.  Caller applies leaf
+ * fillfactor in the style of a rightmost leaf page split when netitemoff is
+ * at or very near the end of the original page.)
+ *
+ * This optimization may leave extra free space remaining on the rightmost
+ * page of a "most significant column" grouping of tuples if that grouping
+ * never ends up having future insertions that use the free space.  That
+ * effect is self-limiting; a future grouping that becomes the "nearest on the
+ * right" grouping of the affected grouping usually puts the extra free space
+ * to good use.  In general, it's important to avoid a pattern of pathological
+ * page splits that consistently do the wrong thing.
+ *
+ * Caller uses optimization when routine returns true, though the exact action
+ * taken by caller varies.  Caller uses original leaf page fillfactor in
+ * standard way rather than using the new item offset directly when *usemult
+ * was also set to true here.  Otherwise, caller applies optimization by
+ * locating the legal split point that makes the new tuple the very last tuple
+ * on the left side of the split.
+ */
+static bool
+_bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
+					int leaffillfactor, bool *usemult)
+{
+	int16		nkeyatts;
+	ItemId		itemid;
+	IndexTuple	tup;
+	int			keepnatts;
+
+	Assert(state->is_leaf && !state->is_rightmost);
+
+	nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
+
+	/* Assume leaffillfactor will be used by caller for now */
+	*usemult = true;
+
+	/* Single key indexes not considered here */
+	if (nkeyatts == 1)
+		return false;
+
+	/* Ascending insertion pattern never inferred when new item is first */
+	if (state->newitemoff == P_FIRSTKEY)
+		return false;
+
+	/*
+	 * Only apply optimization on pages with equisized tuples, since ordinal
+	 * keys are likely to be fixed-width.  Testing if the new tuple is
+	 * variable width directly might also work, but that fails to apply the
+	 * optimization to indexes with a numeric_ops attribute.
+	 *
+	 * Conclude that page has equisized tuples when the new item is the same
+	 * width as the smallest item observed during pass over page, and other
+	 * non-pivot tuples must be the same width as well.  (Note that the
+	 * possibly-truncated existing high key isn't counted in
+	 * olddataitemstotal, and must be subtracted from maxoff.)
+	 */
+	if (state->newitemsz != state->minfirstrightsz)
+		return false;
+	if (state->newitemsz * (maxoff - 1) != state->olddataitemstotal)
+		return false;
+
+	/*
+	 * Avoid applying optimization when tuples are wider than a tuple
+	 * consisting of two non-NULL int8/int64 attributes (or four non-NULL
+	 * int4/int32 attributes)
+	 */
+	if (state->newitemsz >
+		MAXALIGN(sizeof(IndexTupleData) + sizeof(int64) * 2) +
+		sizeof(ItemIdData))
+		return false;
+
+	/*
+	 * At least the first attribute's value must be equal to the corresponding
+	 * value in the previous tuple to apply the optimization.  The new item
+	 * cannot be a duplicate, either.
+	 *
+	 * Handle case where new item is to the right of all items on the existing
+	 * page.  This is suggestive of monotonically increasing insertions in
+	 * itself, so the "heap TID adjacency" test is not applied here.
+	 * Concurrent insertions from backends associated with the same grouping
+	 * or sub-grouping should still have the optimization applied; if the
+	 * grouping is rather large, splits will consistently end up here.
+	 */
+	if (state->newitemoff > maxoff)
+	{
+		itemid = PageGetItemId(state->page, maxoff);
+		tup = (IndexTuple) PageGetItem(state->page, itemid);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+
+		if (keepnatts > 1 && keepnatts <= nkeyatts)
+			return true;
+
+		return false;
+	}
+
+	/*
+	 * When item isn't last (or first) on page, but is deemed suitable for the
+	 * optimization, caller splits at the point immediately after the would-be
+	 * position of the new item, and immediately before the item after the new
+	 * item.
+	 *
+	 * "Low cardinality leading column, high cardinality suffix column"
+	 * indexes with a random insertion pattern (e.g. an index with a boolean
+	 * column, such as an index on '(book_is_in_print, book_isbn)') present us
+	 * with a risk of consistently misapplying the optimization.  We're
+	 * willing to accept very occasional misapplication of the optimization,
+	 * provided the cases where we get it wrong are rare and self-limiting.
+	 * Heap TID adjacency strongly suggests that the item just to the left was
+	 * inserted very recently, which prevents most misfirings.  Besides, all
+	 * inappropriate cases triggered at this point will still split in the
+	 * middle of the page on average.
+	 */
+	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
+	tup = (IndexTuple) PageGetItem(state->page, itemid);
+	/* Do cheaper test first */
+	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+		return false;
+	/* Check same conditions as rightmost item case, too */
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+
+	/*
+	 * Don't allow caller to split after a new item when it will result in a
+	 * split point to the right of the point that a leaf fillfactor split
+	 * would use -- have caller apply leaf fillfactor instead.  There is no
+	 * advantage to being very aggressive in any case.
+	 */
+	if (keepnatts > 1 && keepnatts <= nkeyatts)
+	{
+		double		interp = (double) state->newitemoff / ((double) maxoff + 1);
+		double		leaffillfactormult = (double) leaffillfactor / 100.0;
+
+		if (interp <= leaffillfactormult)
+			*usemult = false;
+
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Subroutine for determining whether two heap TIDs are "adjacent".
+ *
+ * Adjacent means that the high TID is very likely to have been inserted into
+ * the heap relation immediately after the low TID, probably by the same
+ * transaction.
+ */
+static bool
+_bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid)
+{
+	BlockNumber lowblk,
+				highblk;
+
+	lowblk = ItemPointerGetBlockNumber(lowhtid);
+	highblk = ItemPointerGetBlockNumber(highhtid);
+
+	/* Make optimistic assumption of adjacency when heap blocks match */
+	if (lowblk == highblk)
+		return true;
+
+	/* When heap block one up, second offset should be FirstOffsetNumber */
+	if (lowblk + 1 == highblk &&
+		ItemPointerGetOffsetNumber(highhtid) == FirstOffsetNumber)
+		return true;
+
+	return false;
+}
+
 /*
  * Subroutine to find the "best" split point among an array of acceptable
  * candidate split points that split without there being an excessively high
-- 
2.17.1
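
One detail of _bt_afternewitemoff() that is easy to miss: when the new item's
would-be position is already to the right of where a leaf-fillfactor split
would land, the caller falls back to the fillfactor multiplier rather than
splitting even further to the right.  A standalone sketch of that rule (my
illustration, not code from the patch; use_fillfactor_instead is a made-up
name):

    /*
     * Sketch only: returns true when the caller should apply the leaf
     * fillfactor multiplier ("usemult"), false when it is safe to split the
     * page exactly after the new item.
     */
    static bool
    use_fillfactor_instead(OffsetNumber newitemoff, OffsetNumber maxoff,
                           int leaffillfactor)
    {
        double      interp = (double) newitemoff / ((double) maxoff + 1);
        double      leaffillfactormult = (double) leaffillfactor / 100.0;

        /*
         * With the default leaf fillfactor of 90, a new item that lands in
         * the last ~10% of the page is close enough to the end that a
         * rightmost-style fillfactor split is already at least as good.
         */
        return interp > leaffillfactormult;
    }

Either way the optimization is applied; the difference is only in how far to
the right the split point is allowed to go.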

Attachment: v17-0002-Make-heap-TID-a-tiebreaker-nbtree-index-column.patch (application/octet-stream)
From 042601ffd258a0f7cf3dc101696558adff0ddc08 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v17 2/7] Make heap TID a tiebreaker nbtree index column.

Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique.  This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.  A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes.  These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.

A later patch will enhance the logic used by nbtree to pick a split
point.  This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at.  Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d item pointer in a new
high key during leaf page splits.  The user-facing definition of the
"1/3 of a page" restriction is already imprecise, and so does not need
to be revised.  However, there should be a compatibility note in the v12
release notes.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
---
 contrib/amcheck/expected/check_btree.out     |   5 +-
 contrib/amcheck/sql/check_btree.sql          |   5 +-
 contrib/amcheck/verify_nbtree.c              | 331 +++++++++++++--
 contrib/pageinspect/btreefuncs.c             |   2 +-
 contrib/pageinspect/expected/btree.out       |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out |  10 +-
 doc/src/sgml/indices.sgml                    |  24 +-
 src/backend/access/common/indextuple.c       |   6 +-
 src/backend/access/nbtree/README             | 136 +++---
 src/backend/access/nbtree/nbtinsert.c        | 305 +++++++++-----
 src/backend/access/nbtree/nbtpage.c          | 200 ++++++---
 src/backend/access/nbtree/nbtree.c           |   2 +-
 src/backend/access/nbtree/nbtsearch.c        | 106 ++++-
 src/backend/access/nbtree/nbtsort.c          |  91 ++--
 src/backend/access/nbtree/nbtutils.c         | 412 +++++++++++++++++--
 src/backend/access/nbtree/nbtxlog.c          |  47 +--
 src/backend/access/rmgrdesc/nbtdesc.c        |   8 -
 src/backend/utils/sort/tuplesort.c           |  13 +-
 src/include/access/nbtree.h                  | 203 +++++++--
 src/include/access/nbtxlog.h                 |  35 +-
 src/test/regress/expected/btree_index.out    |  34 +-
 src/test/regress/expected/create_index.out   |  13 +-
 src/test/regress/expected/dependency.out     |   4 +-
 src/test/regress/expected/event_trigger.out  |   4 +-
 src/test/regress/expected/foreign_data.out   |   9 +-
 src/test/regress/expected/rowsecurity.out    |   4 +-
 src/test/regress/sql/btree_index.sql         |  37 +-
 src/test/regress/sql/create_index.sql        |  14 +-
 src/test/regress/sql/foreign_data.sql        |   2 +-
 29 files changed, 1550 insertions(+), 514 deletions(-)
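
Before reading the amcheck changes below, it may help to restate the on-disk
ordering rule that the commit message describes: tuples are ordered by all
user-visible key attributes first, and heap TID is the final tiebreaker for
otherwise-equal keys.  Conceptually (this is only an illustration of the rule,
not the patch's actual _bt_compare() code, which also has to deal with
truncated pivot attributes):

    /*
     * Sketch only: once every user-visible key attribute compares as equal,
     * heap TID decides the order.  ItemPointerCompare() already gives the
     * block-then-offset ordering we want.
     */
    static int32
    leaf_tuple_tiebreak(int32 userattcmp, IndexTuple a, IndexTuple b)
    {
        if (userattcmp != 0)
            return userattcmp;  /* user attributes decide, as before */

        return ItemPointerCompare(&a->t_tid, &b->t_tid);
    }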

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index ef5c9e1a1c..1e6079ddd2 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -130,9 +130,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true);
  bt_index_parent_check 
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index 0ad1631476..3f1e0d17ef 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -82,9 +82,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true);
 
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 5426bfd8d8..027dfce78a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -46,6 +46,8 @@ PG_MODULE_MAGIC;
  * block per level, which is bound by the range of BlockNumber:
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
 
 /*
  * State associated with verifying a B-Tree index
@@ -67,6 +69,8 @@ typedef struct BtreeCheckState
 	/* B-Tree Index Relation and associated heap relation */
 	Relation	rel;
 	Relation	heaprel;
+	/* rel is heapkeyspace index? */
+	bool		heapkeyspace;
 	/* ShareLock held on heap/index, rather than AccessShareLock? */
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
@@ -123,7 +127,7 @@ static void bt_index_check_internal(Oid indrelid, bool parentcheck,
 						bool heapallindexed);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -138,17 +142,22 @@ static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 						   IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
+static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
 					 BTScanInsert key,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 BTScanInsert key,
-					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   BTScanInsert key,
-							   Page nontarget,
-							   OffsetNumber upperbound);
+static inline bool invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							 BTScanInsert key,
+							 Page nontarget,
+							 OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline BTScanInsert bt_mkscankey_pivotsearch(Relation rel,
+													IndexTuple itup);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+							IndexTuple itup, bool nonpivot);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -205,6 +214,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	Oid			heapid;
 	Relation	indrel;
 	Relation	heaprel;
+	bool		heapkeyspace;
 	LOCKMODE	lockmode;
 
 	if (parentcheck)
@@ -255,7 +265,9 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	btree_index_checkable(indrel);
 
 	/* Check index, possibly against table it is an index on */
-	bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
+	heapkeyspace = _bt_heapkeyspace(indrel);
+	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
+						 heapallindexed);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -325,8 +337,8 @@ btree_index_checkable(Relation rel)
  * parent/child check cannot be affected.)
  */
 static void
-bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
-					 bool heapallindexed)
+bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
+					 bool readonly, bool heapallindexed)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -347,6 +359,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
 	state = palloc0(sizeof(BtreeCheckState));
 	state->rel = rel;
 	state->heaprel = heaprel;
+	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
 
@@ -807,7 +820,8 @@ bt_target_page_check(BtreeCheckState *state)
 	 * doesn't contain a high key, so nothing to check
 	 */
 	if (!P_RIGHTMOST(topaque) &&
-		!_bt_check_natts(state->rel, state->target, P_HIKEY))
+		!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+						 P_HIKEY))
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
@@ -840,6 +854,7 @@ bt_target_page_check(BtreeCheckState *state)
 		IndexTuple	itup;
 		size_t		tupsize;
 		BTScanInsert skey;
+		bool		lowersizelimit;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -866,7 +881,8 @@ bt_target_page_check(BtreeCheckState *state)
 					 errhint("This could be a torn page problem.")));
 
 		/* Check the number of index tuple attributes */
-		if (!_bt_check_natts(state->rel, state->target, offset))
+		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+							 offset))
 		{
 			char	   *itid,
 					   *htid;
@@ -907,7 +923,56 @@ bt_target_page_check(BtreeCheckState *state)
 			continue;
 
 		/* Build insertion scankey for current page offset */
-		skey = _bt_mkscankey(state->rel, itup);
+		skey = bt_mkscankey_pivotsearch(state->rel, itup);
+
+		/*
+		 * Make sure tuple size does not exceed the relevant BTREE_VERSION
+		 * specific limit.
+		 *
+		 * BTREE_VERSION 4 (which introduced heapkeyspace rules) requisitioned
+		 * a small amount of space from BTMaxItemSize() in order to ensure
+		 * that suffix truncation always has enough space to add an explicit
+		 * heap TID back to a tuple -- we pessimistically assume that every
+		 * newly inserted tuple will eventually need to have a heap TID
+		 * appended during a future leaf page split, when the tuple becomes
+		 * the basis of the new high key (pivot tuple) for the leaf page.
+		 *
+		 * Since the reclaimed space is reserved for that purpose, we must not
+		 * enforce the slightly lower limit when the extra space has been used
+		 * as intended.  In other words, there is only a cross-version
+		 * difference in the limit on tuple size within leaf pages.
+		 *
+		 * Still, we're particular about the details within BTREE_VERSION 4
+		 * internal pages.  Pivot tuples may only use the extra space for its
+		 * designated purpose.  Enforce the lower limit for pivot tuples when
+		 * an explicit heap TID isn't actually present. (In all other cases
+		 * suffix truncation is guaranteed to generate a pivot tuple that's no
+		 * larger than the first right tuple provided to it by its caller.)
+		 */
+		lowersizelimit = skey->heapkeyspace &&
+			(P_ISLEAF(topaque) || BTreeTupleGetHeapTID(itup) == NULL);
+		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
+					   BTMaxItemSizeNoHeapTid(state->target)))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
+							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index row size %zu exceeds maximum for index \"%s\"",
+							tupsize, RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap" : "index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -941,9 +1006,35 @@ bt_target_page_check(BtreeCheckState *state)
 		 * grandparents (as well as great-grandparents, and so on).  We don't
 		 * go to those lengths because that would be prohibitively expensive,
 		 * and probably not markedly more effective in practice.
+		 *
+		 * On the leaf level, we check that the key is <= the highkey.
+		 * However, on non-leaf levels we check that the key is < the highkey,
+		 * because the high key is "just another separator" rather than a copy
+		 * of some existing key item; we expect it to be unique among all keys
+		 * on the same level.  (Suffix truncation will sometimes produce a
+		 * leaf highkey that is an untruncated copy of the lastleft item, but
+		 * never any other item, which necessitates weakening the leaf level
+		 * check to <=.)
+		 *
+		 * Full explanation for why a highkey is never truly a copy of another
+		 * item from the same level on internal levels:
+		 *
+		 * While the new left page's high key is copied from the first offset
+		 * on the right page during an internal page split, that's not the
+		 * full story.  In effect, internal pages are split in the middle of
+		 * the firstright tuple, not between the would-be lastleft and
+		 * firstright tuples: the firstright key ends up on the left side as
+		 * left's new highkey, and the firstright downlink ends up on the
+		 * right side as right's new "negative infinity" item.  The negative
+		 * infinity tuple is truncated to zero attributes, so we're only left
+		 * with the downlink.  In other words, the copying is just an
+		 * implementation detail of splitting in the middle of a (pivot)
+		 * tuple. (See also: "Notes About Data Representation" in the nbtree
+		 * README.)
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
+			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
 			char	   *itid,
 					   *htid;
@@ -969,11 +1060,10 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1036,7 +1126,7 @@ bt_target_page_check(BtreeCheckState *state)
 			rightkey = bt_right_page_check_scankey(state);
 
 			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+				!invariant_g_offset(state, rightkey, max))
 			{
 				/*
 				 * As explained at length in bt_right_page_check_scankey(),
@@ -1214,9 +1304,9 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * continued existence of target block as non-ignorable (not half-dead or
 	 * deleted) implies that target page was not merged into from the right by
 	 * deletion; the key space at or after target never moved left.  Target's
-	 * parent either has the same downlink to target as before, or a <=
+	 * parent either has the same downlink to target as before, or a <
 	 * downlink due to deletion at the left of target.  Target either has the
-	 * same highkey as before, or a highkey <= before when there is a page
+	 * same highkey as before, or a highkey < before when there is a page
 	 * split. (The rightmost concurrently-split-from-target-page page will
 	 * still have the same highkey as target was originally found to have,
 	 * which for our purposes is equivalent to target's highkey itself never
@@ -1305,7 +1395,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * memory remaining allocated.
 	 */
 	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
-	return _bt_mkscankey(state->rel, firstitup);
+	return bt_mkscankey_pivotsearch(state->rel, firstitup);
 }
 
 /*
@@ -1368,7 +1458,8 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the
+	 * page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1417,14 +1508,29 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 	{
 		/*
 		 * Skip comparison of target page key against "negative infinity"
-		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * item, if any.  Checking it would indicate that it's not a strict
+		 * lower bound, but that's only because of the hard-coding for
+		 * negative infinity items within _bt_compare().
+		 *
+		 * If nbtree didn't truncate negative infinity tuples during internal
+		 * page splits then we'd expect child's negative infinity key to be
+		 * equal to the scankey/downlink from target/parent (it would be a
+		 * "low key" in this hypothetical scenario, and so it would still need
+		 * to be treated as a special case here).
+		 *
+		 * Negative infinity items can be thought of as a strict lower bound
+		 * that works transitively, with the last non-negative-infinity pivot
+		 * followed during a descent from the root as its "true" strict lower
+		 * bound.  Only a small number of negative infinity items are truly
+		 * negative infinity; those that are the first items of leftmost
+		 * internal pages.  In more general terms, a negative infinity item is
+		 * only negative infinity with respect to the subtree that the page is
+		 * at the root of.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
+		if (!invariant_l_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1856,6 +1962,61 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return invariant_leq_offset(state, key, upperbound);
+
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+
+	/*
+	 * _bt_compare() is capable of determining that a scankey with a
+	 * filled-out attribute is greater than pivot tuples where the comparison
+	 * is resolved at a truncated attribute (value of attribute in pivot is
+	 * minus infinity).  However, it is not capable of determining that a
+	 * scankey is _less than_ a tuple on the basis of a comparison resolved at
+	 * _scankey_ minus infinity attribute.  Complete an extra step to make it
+	 * work here instead.
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		nonpivot = P_ISLEAF(topaque) && upperbound >= P_FIRSTDATAKEY(topaque);
+
+		/* Get number of keys + heap TID for item to the right */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup, nonpivot);
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && rheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1875,42 +2036,84 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound)
 {
 	int32		cmp;
 
 	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
-	return cmp >= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp >= 0;
+
+	/*
+	 * No need to consider the possibility that scankey has attributes that we
+	 * need to force to be interpreted as negative infinity.  _bt_compare() is
+	 * able to determine that scankey is greater than negative infinity.  The
+	 * distinction between "==" and "<" isn't interesting here, since
+	 * corruption is indicated either way.
+	 */
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e. the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
-							   Page nontarget, OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							 Page nontarget, OffsetNumber upperbound)
 {
 	int32		cmp;
 
 	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
-	return cmp <= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp <= 0;
+
+	/* See invariant_l_offset() for an explanation of this extra step */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+		BTPageOpaque copaque;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		nonpivot = P_ISLEAF(copaque) && upperbound >= P_FIRSTDATAKEY(copaque);
+
+		/* Get number of keys + heap TID for child/non-target item */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+		childheaptid = BTreeTupleGetHeapTIDCareful(state, child, nonpivot);
+
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && childheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -2066,3 +2269,53 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * _bt_mkscankey() wrapper that automatically prevents insertion scankey from
+ * being considered greater than the pivot tuple that its values originated
+ * from (or some other identical pivot tuple) in the common case where there
+ * are truncated/minus infinity attributes.  Without this extra step, there
+ * are forms of corruption that amcheck could theoretically fail to report.
+ *
+ * For example, invariant_g_offset() might miss a cross-page invariant failure
+ * on an internal level if the scankey built from the first item on the
+ * target's right sibling page happened to be equal to (not greater than) the
+ * last item on target page.  The !pivotsearch tiebreaker in _bt_compare()
+ * might otherwise cause amcheck to assume (rather than actually verify) that
+ * the scankey is greater.
+ */
+static inline BTScanInsert
+bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
+{
+	BTScanInsert skey;
+
+	skey = _bt_mkscankey(rel, itup);
+	skey->pivotsearch = true;
+
+	return skey;
+}
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
+ * be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	BlockNumber targetblock = state->targetblock;
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
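
To make the cmp == 0 handling above easier to follow, here is a standalone
sketch (not part of the patch, and not PostgreSQL code): it models the rule
that a tuple whose trailing attributes were truncated away, or whose heap TID
is absent, sorts strictly before an otherwise-equal tuple that still has them.
The SketchTuple type and sketch_* names are hypothetical simplifications of
the real index tuple representation.

#include <stdbool.h>
#include <stdio.h>

typedef struct SketchTuple
{
    int   nkeyatts;      /* number of untruncated key attributes */
    bool  has_heap_tid;  /* does a heap TID tiebreaker remain? */
} SketchTuple;

/* returns true when 'a' is strictly less than 'b', given equal key prefixes */
static bool
sketch_tiebreak_lt(const SketchTuple *a, const SketchTuple *b)
{
    if (a->nkeyatts != b->nkeyatts)
        return a->nkeyatts < b->nkeyatts;       /* truncated attrs are -inf */
    return !a->has_heap_tid && b->has_heap_tid; /* missing TID is -inf too */
}

int
main(void)
{
    SketchTuple pivot = {.nkeyatts = 1, .has_heap_tid = false};
    SketchTuple leaf = {.nkeyatts = 2, .has_heap_tid = true};

    printf("pivot < leaf: %d\n", sketch_tiebreak_lt(&pivot, &leaf)); /* 1 */
    return 0;
}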
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index bfa0c04c2f..8d27c9b0f6 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -561,7 +561,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	 * Get values of extended metadata if available, use default values
 	 * otherwise.
 	 */
-	if (metad->btm_version == BTREE_VERSION)
+	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b..07c2dcd771 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d4..9920dbfd40 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 9943e8ecd4..3493f482b8 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -504,8 +504,9 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
 
   <para>
    By default, B-tree indexes store their entries in ascending order
-   with nulls last.  This means that a forward scan of an index on
-   column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
+   with nulls last (table TID is treated as a tiebreaker column among
+   otherwise equal entries).  This means that a forward scan of an
+   index on column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
    (or more verbosely, <literal>ORDER BY x ASC NULLS LAST</literal>).  The
    index can also be scanned backward, producing output satisfying
    <literal>ORDER BY x DESC</literal>
@@ -1162,10 +1163,21 @@ CREATE INDEX tab_x_y ON tab(x, y);
    the extra columns are trailing columns; making them be leading columns is
    unwise for the reasons explained in <xref linkend="indexes-multicolumn"/>.
    However, this method doesn't support the case where you want the index to
-   enforce uniqueness on the key column(s).  Also, explicitly marking
-   non-searchable columns as <literal>INCLUDE</literal> columns makes the
-   index slightly smaller, because such columns need not be stored in upper
-   tree levels.
+   enforce uniqueness on the key column(s).
+  </para>
+
+  <para>
+   <firstterm>Suffix truncation</firstterm> always removes non-key
+   columns from upper B-Tree levels.  As payload columns, they are
+   never used to guide index scans.  The truncation process also
+   removes one or more trailing key column(s) when the remaining
+   prefix of key column(s) happens to be sufficient to describe tuples
+   on the lowest B-Tree level.  In practice, covering indexes without
+   an <literal>INCLUDE</literal> clause often avoid storing columns
+   that are effectively payload in the upper levels.  However,
+   explicitly defining payload columns as non-key columns
+   <emphasis>reliably</emphasis> keeps the tuples in upper levels
+   small.
   </para>
 
   <para>
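
As a rough illustration of why the documentation above stresses keeping
upper-level tuples small, the following standalone sketch (not part of the
patch) estimates internal-page fan-out for two pivot tuple sizes.  The
8192-byte page size is real; the overhead constants are assumptions chosen
only for illustration.

#include <stdio.h>

static int
sketch_fanout(int pivot_tuple_size)
{
    const int page_size = 8192;
    const int page_overhead = 200;  /* assumed page header + special space */
    const int line_pointer = 4;     /* per-item line pointer overhead */

    return (page_size - page_overhead) / (pivot_tuple_size + line_pointer);
}

int
main(void)
{
    /* e.g. a 40-byte untruncated pivot vs. a 16-byte truncated one */
    printf("fan-out with 40-byte pivots: %d\n", sketch_fanout(40));
    printf("fan-out with 16-byte pivots: %d\n", sketch_fanout(16));
    return 0;
}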
diff --git a/src/backend/access/common/indextuple.c b/src/backend/access/common/indextuple.c
index 32c0ebb93a..cb23be859d 100644
--- a/src/backend/access/common/indextuple.c
+++ b/src/backend/access/common/indextuple.c
@@ -536,7 +536,11 @@ index_truncate_tuple(TupleDesc sourceDescriptor, IndexTuple source,
 	bool		isnull[INDEX_MAX_KEYS];
 	IndexTuple	truncated;
 
-	Assert(leavenatts < sourceDescriptor->natts);
+	Assert(leavenatts <= sourceDescriptor->natts);
+
+	/* Easy case: no truncation actually required */
+	if (leavenatts == sourceDescriptor->natts)
+		return CopyIndexTuple(source);
 
 	/* Create temporary descriptor to scribble on */
 	truncdesc = palloc(TupleDescSize(sourceDescriptor));
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index a295a7a286..40ff25fe06 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -28,37 +28,50 @@ right-link to find the new page containing the key range you're looking
 for.  This might need to be repeated, if the page has been split more than
 once.
 
+Lehman and Yao talk about pairs of "separator" keys and downlinks in
+internal pages rather than tuples or records.  We use the term "pivot"
+tuple to refer to tuples that don't point to heap tuples and are used
+only for tree navigation.  All tuples on non-leaf pages and high keys on
+leaf pages are pivot tuples.  Since pivot tuples are only used to represent
+which part of the key space belongs on each page, they can have attribute
+values copied from non-pivot tuples that were deleted and killed by VACUUM
+some time ago.  A pivot tuple may contain a "separator" key and downlink,
+just a separator key (i.e. the downlink value is implicitly undefined), or
+just a downlink (i.e. all attributes are truncated away).  We aren't always
+clear on which case applies, but it should be obvious from context.
+
+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute.  Logical duplicates are sorted in heap TID
+order.  This is necessary because Lehman and Yao also require that the key
+range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are
+the adjacent keys in the parent page (Ki must be _strictly_ less than v,
+which can be assured by having reliably unique keys).
+
+A search where the key is equal to a pivot tuple in an upper tree level
+must descend to the left of that pivot to ensure it finds any equal keys.
+The equal item(s) being searched for must therefore be to the left of that
+downlink page on the next level down.  A handy property of this design is
+that a scan where all attributes/keys are used behaves just the same as a
+scan where only some prefix of attributes are used; equality never needs to
+be treated as a special case.
+
+In practice, exact equality with pivot tuples on internal pages is
+extremely rare when all attributes (including even the heap TID attribute)
+are used in a search.  This is due to suffix truncation: truncated
+attributes are treated as having the value negative infinity, and
+truncation almost always manages to at least truncate away the trailing
+heap TID attribute.  While Lehman and Yao don't have anything to say about
+suffix truncation, the design used by nbtree is perfectly complementary.
+The later section on suffix truncation will be helpful if it's unclear how
+the Lehman & Yao invariants work with a real world example involving
+suffix truncation.
+
 Differences to the Lehman & Yao algorithm
 -----------------------------------------
 
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
-
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
-
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
 among backends.  As a result, we do page-level read locking on btree
@@ -194,9 +207,7 @@ be prepared for the possibility that the item it wants is to the left of
 the recorded position (but it can't have moved left out of the recorded
 page).  Since we hold a lock on the lower page (per L&Y) until we have
 re-found the parent item that links to it, we can be assured that the
-parent item does still exist and can't have been deleted.  Also, because
-we are matching downlink page numbers and not data keys, we don't have any
-problem with possibly misidentifying the parent item.
+parent item does still exist and can't have been deleted.
 
 Page Deletion
 -------------
@@ -615,22 +626,40 @@ scankey is consulted as each index entry is sequentially scanned to decide
 whether to return the entry and whether the scan can stop (see
 _bt_checkkeys()).
 
-We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+Notes about suffix truncation
+-----------------------------
+
+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split.  The remaining attributes must distinguish
+the last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page.  Tuples logically retain truncated key attributes,
+though they implicitly have "negative infinity" as their value, and have no
+storage overhead.  Since the high key is subsequently reused as the
+downlink in the parent page for the new right page, suffix truncation makes
+pivot tuples short.  INCLUDE indexes are guaranteed to have non-key
+attributes truncated at the time of a leaf page split, but may also have
+some key attributes truncated away, based on the usual criteria for key
+attributes.  They are not a special case, since non-key attributes are
+merely payload to B-Tree searches.
+
+The goal of suffix truncation of key attributes is to improve index
+fan-out.  The technique was first described by Bayer and Unterauer (R.Bayer
+and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
+2, No. 1, March 1977, pp 11-26).  The Postgres implementation is loosely
+based on their paper.  Note that Postgres only implements what the paper
+refers to as simple prefix B-Trees.  Note also that the paper assumes that
+the tree has keys that consist of single strings that maintain the "prefix
+property", much like strings that are stored in a suffix tree (comparisons
+of earlier bytes must always be more significant than comparisons of later
+bytes, and, in general, the strings must compare in a way that doesn't
+break transitive consistency as they're split into pieces).  Suffix
+truncation in Postgres currently only works at the whole-attribute
+granularity, but it would be straightforward to invent opclass
+infrastructure that manufactures a smaller attribute value in the case of
+variable-length types, such as text.  An opclass support function could
+manufacture the shortest possible key value that still correctly separates
+each half of a leaf page split.
 
 Notes About Data Representation
 -------------------------------
@@ -643,20 +672,26 @@ don't need to renumber any existing pages when splitting the root.)
 
 The Postgres disk block data format (an array of items) doesn't fit
 Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
-so we have to play some games.
+so we have to play some games.  (Presumably things are explained this
+way because of internal page splits, which conceptually split at the
+middle of an existing pivot tuple -- the tuple's "separator" key goes on
+the left side of the split as the left side's new high key, while the
+tuple's pointer/downlink goes on the right side as the first/minus
+infinity downlink.)
 
 On a page that is not rightmost in its tree level, the "high key" is
 kept in the page's first item, and real data items start at item 2.
 The link portion of the "high key" item goes unused.  A page that is
-rightmost has no "high key", so data items start with the first item.
-Putting the high key at the left, rather than the right, may seem odd,
-but it avoids moving the high key as we add data items.
+rightmost has no "high key" (it's implicitly positive infinity), so
+data items start with the first item.  Putting the high key at the
+left, rather than the right, may seem odd, but it avoids moving the
+high key as we add data items.
 
 On a leaf page, the data items are simply links to (TIDs of) tuples
 in the relation being indexed, with the associated key values.
 
 On a non-leaf page, the data items are down-links to child pages with
-bounding keys.  The key in each data item is the *lower* bound for
+bounding keys.  The key in each data item is a strict lower bound for
 keys on that child page, so logically the key is to the left of that
 downlink.  The high key (if present) is the upper bound for the last
 downlink.  The first data item on each such page has no lower bound
@@ -664,4 +699,5 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
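
The README text above leans on the idea that adding heap TID as a trailing
tiebreaker makes every key unique, which is what lets the Ki < v <= Ki+1
invariant hold with a strict lower bound.  A standalone sketch of that
comparison rule follows (not part of the patch; SketchEntry is a hypothetical
stand-in for a single-attribute index entry):

#include <stdint.h>
#include <stdio.h>

typedef struct SketchEntry
{
    int      key;         /* single integer key attribute */
    uint32_t heap_block;  /* heap TID: block number ... */
    uint16_t heap_off;    /* ... and offset number */
} SketchEntry;

static int
sketch_compare(const SketchEntry *a, const SketchEntry *b)
{
    if (a->key != b->key)
        return (a->key < b->key) ? -1 : 1;
    if (a->heap_block != b->heap_block)
        return (a->heap_block < b->heap_block) ? -1 : 1;
    if (a->heap_off != b->heap_off)
        return (a->heap_off < b->heap_off) ? -1 : 1;
    return 0;               /* only possible for the very same entry */
}

int
main(void)
{
    SketchEntry a = {42, 10, 1};
    SketchEntry b = {42, 10, 2};    /* logical duplicate, later heap TID */

    printf("%d\n", sketch_compare(&a, &b));  /* -1: duplicates stay ordered */
    return 0;
}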
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 54dfa0a5e1..04cce2404a 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -62,14 +62,16 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
 				  Relation heapRel);
 static bool _bt_useduplicatepage(Relation rel, Relation heapRel,
 					 BTInsertState insertstate);
-static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
+static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
+			   Buffer buf,
+			   Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
 			   bool split_only_page);
-static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
-		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
-		  IndexTuple newitem, bool newitemonleft);
+static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
+		  Buffer cbuf, OffsetNumber firstright, OffsetNumber newitemoff,
+		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
@@ -117,6 +119,9 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_key = _bt_mkscankey(rel, itup);
+	/* No scantid until uniqueness established in checkingunique case */
+	if (checkingunique && itup_key->heapkeyspace)
+		itup_key->scantid = NULL;
 
 	/*
 	 * Fill in the BTInsertState working area, to track the current page and
@@ -232,12 +237,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tiebreaker attribute.  Any other would-be inserter of
+	 * the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -275,6 +281,10 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/* Uniqueness is established -- restore heap tid as scantid */
+		if (itup_key->heapkeyspace)
+			itup_key->scantid = &itup->t_tid;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
@@ -283,12 +293,12 @@ top:
 
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
-		 * an index tuple insert conflicts with an existing lock.  Since the
-		 * actual location of the insert is hard to predict because of the
-		 * random search used to prevent O(N^2) performance when there are
-		 * many duplicate entries, we can just use the "first valid" page.
-		 * This reasoning also applies to INCLUDE indexes, whose extra
-		 * attributes are not considered part of the key space.
+		 * an index tuple insert conflicts with an existing lock.  The actual
+		 * location of the insert is unsettled in the checkingunique case
+		 * because scantid was not filled in initially, but it's okay to use
+		 * the "first valid" page instead.  This reasoning also applies to
+		 * INCLUDE indexes, whose extra attributes are not considered part of
+		 * the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, insertstate.buf);
 
@@ -300,8 +310,8 @@ top:
 		 */
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
-		_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
-					   newitemoff, false);
+		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
+					   itup, newitemoff, false);
 	}
 	else
 	{
@@ -374,6 +384,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate,
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
 	Assert(!insertstate->bounds_valid || insertstate->low == offset);
+	Assert(itup_key->scantid == NULL);
 	for (;;)
 	{
 		ItemId		curitemid;
@@ -645,20 +656,26 @@ _bt_check_unique(Relation rel, BTInsertState insertstate,
  *	_bt_findinsertloc() -- Finds an insert location for a tuple
  *
  *		On entry, insertstate buffer contains the page that the new tuple
- *		belongs on, which is exclusive-locked and pinned by caller.  This
- *		won't be exactly the right page for some callers to insert on to.
- *		They'll have to insert into a page somewhere to the right.
+ *		belongs on, which is exclusive-locked and pinned by caller.
+ *		Occasionally, this won't be exactly right for callers that just called
+ *		_bt_check_unique(), and did initial search without using a scantid.
+ *		They'll have to insert into a page somewhere to the right in rare
+ *		cases where there are many physical duplicates in a unique index, and
+ *		their scantid directs us to some page full of duplicates to the right,
+ *		where the new tuple must go.  (Actually, since !heapkeyspace
+ *		pg_upgrade'd non-unique indexes never get a scantid, they too may
+ *		require that we move right.  We treat them somewhat like unique
+ *		indexes.)
  *
  *		On exit, insertstate buffer contains the chosen insertion page, and
  *		the offset within that page is returned.  The lock and pin on the
- *		original page are released in cases where initial page is not where
- *		tuple belongs.  New buffer/page passed back to the caller is
+ *		original page are released in rare cases where initial page is not
+ *		where tuple belongs.  New page passed back to the caller is
  *		exclusively locked, just like initial page was.
  *
  *		_bt_check_unique() saves the progress of the binary search it
  *		performs, also in the insertion state.  We don't need to do any
- *		additional binary search comparisons here most of the time, provided
- *		caller is to insert on original page.
+ *		additional binary search comparisons here much of the time.
  *
  *		This is also where opportunistic microvacuuming of LP_DEAD tuples
 *		occurs.  It is convenient to make it happen here, since microvacuuming
@@ -680,28 +697,26 @@ _bt_findinsertloc(Relation rel,
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itemsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
-	 */
-	if (insertstate->itemsz > BTMaxItemSize(page))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						insertstate->itemsz, BTMaxItemSize(page),
-						RelationGetRelationName(rel)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(heapRel,
-									RelationGetRelationName(rel))));
+	/* Check 1/3 of a page restriction */
+	if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
+		_bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+							 insertstate->itup);
 
+	/*
+	 * We may have to walk right through leaf pages to find the one leaf page
+	 * that we must insert on to, though only when inserting into unique
+	 * indexes.  This is necessary because a scantid is not used by the
+	 * insertion scan key initially in the case of unique indexes -- a scantid
+	 * is only set after the absence of duplicates (whose heap tuples are not
+	 * dead or recently dead) has been established by _bt_check_unique().
+	 * Non-unique index insertions will break out of the loop immediately.
+	 *
+	 * (Actually, non-unique indexes may still need to grovel through leaf
+	 * pages full of duplicates with a pg_upgrade'd !heapkeyspace index.)
+	 */
 	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+	Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
 	for (;;)
 	{
 		int			cmpval;
@@ -709,6 +724,13 @@ _bt_findinsertloc(Relation rel,
 		BlockNumber rblkno;
 
 		/*
+		 * Fastpaths that avoid extra high key check.
+		 *
+		 * No need to check high key when inserting into a non-unique index;
+		 * _bt_search() already checked this when it checked if a move to the
+		 * right was required for leaf page.  Insertion scankey's scantid
+		 * would have been filled out at the time.
+		 *
 		 * An earlier _bt_check_unique() call may well have established bounds
 		 * that we can use to skip the high key check for checkingunique
 		 * callers.  This fastpath cannot be used when there are no items on
@@ -716,9 +738,11 @@ _bt_findinsertloc(Relation rel,
 		 * new item belongs last on the page, but it might go on a later page
 		 * instead.
 		 */
-		if (insertstate->bounds_valid &&
-			insertstate->low <= insertstate->stricthigh &&
-			insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+		if (!checkingunique && itup_key->heapkeyspace)
+			break;
+		else if (insertstate->bounds_valid &&
+				 insertstate->low <= insertstate->stricthigh &&
+				 insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
 			break;
 
 		/*
@@ -728,14 +752,24 @@ _bt_findinsertloc(Relation rel,
 		if (P_RIGHTMOST(lpageop))
 			break;
 		cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
-
-		/*
-		 * May have to handle case where there is a choice of which page to
-		 * place new tuple on, and we must balance space utilization as best
-		 * we can.  Note that this may invalidate cached bounds for us.
-		 */
-		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, insertstate))
-			break;
+		if (itup_key->heapkeyspace)
+		{
+			if (cmpval <= 0)
+				break;
+		}
+		else
+		{
+			/*
+			 * pg_upgrade'd !heapkeyspace index.
+			 *
+			 * May have to handle legacy case where there is a choice of which
+			 * page to place new tuple on, and we must balance space
+			 * utilization as best we can.  Note that this may invalidate
+			 * cached bounds for us.
+			 */
+			if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, insertstate))
+				break;
+		}
 
 		/*
 		 * step right to next non-dead page
@@ -744,6 +778,8 @@ _bt_findinsertloc(Relation rel,
 		 * page; else someone else's _bt_check_unique scan could fail to see
 		 * our insertion.  write locks on intermediate dead pages won't do
 		 * because we don't know when they will get de-linked from the tree.
+		 * (This is more aggressive than it needs to be for non-unique
+		 * !heapkeyspace indexes.)
 		 */
 		rbuf = InvalidBuffer;
 
@@ -810,9 +846,16 @@ _bt_findinsertloc(Relation rel,
 /*
  *	_bt_useduplicatepage() -- Settle for this page of duplicates?
  *
- *		This function handles the question of whether or not an insertion of
- *		a duplicate into an index should insert on the page contained in buf
- *		when a choice must be made.
+ *		Prior to PostgreSQL 12/Btree version 4, heap TID was never treated
+ *		as a part of the keyspace.  If there were many tuples of the same
+ *		value spanning more than one leaf page, a new tuple of that same
+ *		value could legally be placed on any one of the pages.
+ *
+ *		This function handles the question of whether or not an insertion
+ *		of a duplicate into a pg_upgrade'd !heapkeyspace index should insert
+ *		on the page contained in buf when a choice must be made.  It is only
+ *		used with pg_upgrade'd version 2 and version 3 indexes (!heapkeyspace
+ *		indexes).
  *
  *		If the current page doesn't have enough free space for the new tuple
  *		we "microvacuum" the page, removing LP_DEAD items, in the hope that it
@@ -831,6 +874,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel,
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 	Assert(P_ISLEAF(lpageop) && !P_RIGHTMOST(lpageop));
+	Assert(!insertstate->itup_key->heapkeyspace);
 
 	/* Easy case -- there is space free on this page already */
 	if (PageGetFreeSpace(page) >= insertstate->itemsz)
@@ -879,8 +923,9 @@ _bt_useduplicatepage(Relation rel, Relation heapRel,
  *
  *		This recursive procedure does the following things:
  *
- *			+  if necessary, splits the target page (making sure that the
- *			   split is equitable as far as post-insert free space goes).
+ *			+  if necessary, splits the target page, using 'itup_key' for
+ *			   suffix truncation on leaf pages (caller passes NULL for
+ *			   non-leaf pages).
  *			+  inserts the tuple.
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
@@ -906,6 +951,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel,
  */
 static void
 _bt_insertonpg(Relation rel,
+			   BTScanInsert itup_key,
 			   Buffer buf,
 			   Buffer cbuf,
 			   BTStack stack,
@@ -928,7 +974,7 @@ _bt_insertonpg(Relation rel,
 		   BTreeTupleGetNAtts(itup, rel) ==
 		   IndexRelationGetNumberOfAttributes(rel));
 	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
+		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/* The caller should've finished any incomplete splits already. */
@@ -978,8 +1024,8 @@ _bt_insertonpg(Relation rel,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, buf, cbuf, firstright,
-						 newitemoff, itemsz, itup, newitemonleft);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, firstright, newitemoff,
+						 itemsz, itup, newitemonleft);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1063,7 +1109,7 @@ _bt_insertonpg(Relation rel,
 		if (BufferIsValid(metabuf))
 		{
 			/* upgrade meta-page if needed */
-			if (metad->btm_version < BTREE_VERSION)
+			if (metad->btm_version < BTREE_NOVAC_VERSION)
 				_bt_upgrademetapage(metapg);
 			metad->btm_fastroot = itup_blkno;
 			metad->btm_fastlevel = lpageop->btpo.level;
@@ -1118,6 +1164,8 @@ _bt_insertonpg(Relation rel,
 
 			if (BufferIsValid(metabuf))
 			{
+				Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+				xlmeta.version = metad->btm_version;
 				xlmeta.root = metad->btm_root;
 				xlmeta.level = metad->btm_level;
 				xlmeta.fastroot = metad->btm_fastroot;
@@ -1185,17 +1233,19 @@ _bt_insertonpg(Relation rel,
  *		new right page.  newitemoff etc. tell us about the new item that
  *		must be inserted along with the data from the old page.
  *
- *		When splitting a non-leaf page, 'cbuf' is the left-sibling of the
- *		page we're inserting the downlink for.  This function will clear the
- *		INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *		itup_key is used for suffix truncation on leaf pages (internal
+ *		page callers pass NULL).  When splitting a non-leaf page, 'cbuf'
+ *		is the left-sibling of the page we're inserting the downlink for.
+ *		This function will clear the INCOMPLETE_SPLIT flag on it, and
+ *		release the buffer.
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
-_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
-		  bool newitemonleft)
+_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
+		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
+		  IndexTuple newitem, bool newitemonleft)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1289,7 +1339,8 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1303,8 +1354,29 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 
 	/*
 	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
-	 * item at position firstright, or the incoming tuple.
+	 * to go into the new right page, or possibly a truncated version if this
+	 * is a leaf page split.  This might be either the existing data item at
+	 * position firstright, or the incoming tuple.
+	 *
+	 * The high key for the left page is formed using the first item on the
+	 * right page, which may seem to be contrary to Lehman & Yao's approach of
+	 * using the left page's last item as its new high key when splitting on
+	 * the leaf level.  It isn't, though: suffix truncation will leave the
+	 * left page's high key fully equal to the last item on the left page when
+	 * two tuples with equal key values (excluding heap TID) enclose the split
+	 * point.  It isn't actually necessary for a new leaf high key to be equal
+	 * to the last item on the left for the L&Y "subtree" invariant to hold.
+	 * It's sufficient to make sure that the new leaf high key is strictly
+	 * less than the first item on the right leaf page, and greater than or
+	 * equal to (not necessarily equal to) the last item on the left leaf
+	 * page.
+	 *
+	 * In other words, when suffix truncation isn't possible, L&Y's exact
+	 * approach to leaf splits is taken.  (Actually, even that is slightly
+	 * inaccurate.  A tuple with all the keys from firstright but the heap TID
+	 * from lastleft will be used as the new high key, since the last left
+	 * tuple could be physically larger despite being opclass-equal in respect
+	 * of all attributes prior to the heap TID attribute.)
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1322,25 +1394,48 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
+	 * Truncate unneeded key and non-key attributes of the high key item
+	 * before inserting it on the left page.  This can only happen at the leaf
 	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * level high keys.  A pivot tuple in a grandparent page must guide a
+	 * search not only to the correct parent page, but also to the correct
+	 * leaf page.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (isleaf && (itup_key->heapkeyspace || indnatts != indnkeyatts))
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		IndexTuple	lastleft;
+
+		/*
+		 * Determine which tuple will become the last on the left page.  This
+		 * is needed to decide how many attributes from the first item on the
+		 * right page must remain in new high key for left page.
+		 */
+		if (newitemonleft && newitemoff == firstright)
+		{
+			/* incoming tuple will become last on left page */
+			lastleft = newitem;
+		}
+		else
+		{
+			OffsetNumber lastleftoff;
+
+			/* item just before firstright will become last on left page */
+			lastleftoff = OffsetNumberPrev(firstright);
+			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		Assert(lastleft != item);
+		lefthikey = _bt_truncate(rel, lastleft, item, itup_key);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1533,7 +1628,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1562,22 +1656,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log the left page's new high key */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1593,9 +1675,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -1958,7 +2038,7 @@ _bt_insert_parent(Relation rel,
 			_bt_relbuf(rel, pbuf);
 		}
 
-		/* get high key from left page == lower bound for new right page */
+		/* get high key from left, a strict lower bound for new right page */
 		ritem = (IndexTuple) PageGetItem(page,
 										 PageGetItemId(page, P_HIKEY));
 
@@ -1988,7 +2068,7 @@ _bt_insert_parent(Relation rel,
 				 RelationGetRelationName(rel), bknum, rbknum);
 
 		/* Recursively update the parent */
-		_bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
+		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
 					   new_item, stack->bts_offset + 1,
 					   is_only);
 
@@ -2249,7 +2329,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	START_CRIT_SECTION();
 
 	/* upgrade metapage if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* set btree special data */
@@ -2284,7 +2364,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2317,6 +2398,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
 		XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = rootblknum;
 		md.level = metad->btm_level;
 		md.fastroot = rootblknum;
@@ -2381,6 +2464,7 @@ _bt_pgaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -2396,8 +2480,8 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
@@ -2410,6 +2494,7 @@ _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
 	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
 	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(itup_key->scantid == NULL);
 
 	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
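
Since the new _bt_split() logic above decides how much of firstright to keep
by looking at lastleft, here is a standalone sketch (not part of the patch) of
that decision: keep the smallest attribute prefix that distinguishes the two
tuples, and fall back to keeping the heap TID tiebreaker only when every key
attribute is equal.  The fixed three-attribute integer tuples are hypothetical
simplifications.

#include <stdbool.h>
#include <stdio.h>

#define SKETCH_NKEYATTS 3

typedef struct SketchLeafTuple
{
    int  keys[SKETCH_NKEYATTS];
    long heap_tid;              /* stand-in for the real block/offset pair */
} SketchLeafTuple;

/* Returns attributes to keep; *keep_tid reports whether heap TID is needed */
static int
sketch_truncate_natts(const SketchLeafTuple *lastleft,
                      const SketchLeafTuple *firstright,
                      bool *keep_tid)
{
    for (int natts = 1; natts <= SKETCH_NKEYATTS; natts++)
    {
        if (lastleft->keys[natts - 1] != firstright->keys[natts - 1])
        {
            *keep_tid = false;  /* prefix already separates the two halves */
            return natts;
        }
    }
    *keep_tid = true;           /* all key attributes equal: keep heap TID */
    return SKETCH_NKEYATTS;
}

int
main(void)
{
    SketchLeafTuple lastleft = {{1, 7, 3}, 100};
    SketchLeafTuple firstright = {{1, 8, 0}, 101};
    bool keep_tid;
    int  natts = sketch_truncate_natts(&lastleft, &firstright, &keep_tid);

    printf("keep %d attribute(s), heap TID: %s\n",
           natts, keep_tid ? "yes" : "no");  /* keep 2 attribute(s), no */
    return 0;
}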
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 56041c3d38..aa1a2c0bbd 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -34,6 +34,7 @@
 #include "utils/snapmgr.h"
 
 static void _bt_cachemetadata(Relation rel, BTMetaPageData *metad);
+static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 						 bool *rightsib_empty);
@@ -77,7 +78,9 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 }
 
 /*
- *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to the new.
+ *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to version
+ *		3, the last version that can be updated without broadly affecting
+ *		on-disk compatibility.  (A REINDEX is required to upgrade to v4.)
  *
  *		This routine does purely in-memory image upgrade.  Caller is
  *		responsible for locking, WAL-logging etc.
@@ -93,11 +96,11 @@ _bt_upgrademetapage(Page page)
 
 	/* It must be really a meta page of upgradable version */
 	Assert(metaopaque->btpo_flags & BTP_META);
-	Assert(metad->btm_version < BTREE_VERSION);
+	Assert(metad->btm_version < BTREE_NOVAC_VERSION);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 
 	/* Set version number and fill extra fields added into version 3 */
-	metad->btm_version = BTREE_VERSION;
+	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 
@@ -107,43 +110,79 @@ _bt_upgrademetapage(Page page)
 }
 
 /*
- * Cache metadata from meta page to rel->rd_amcache.
+ * Cache metadata from input meta page to rel->rd_amcache.
  */
 static void
-_bt_cachemetadata(Relation rel, BTMetaPageData *metad)
+_bt_cachemetadata(Relation rel, BTMetaPageData *input)
 {
+	BTMetaPageData *cached_metad;
+
 	/* We assume rel->rd_amcache was already freed by caller */
 	Assert(rel->rd_amcache == NULL);
 	rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
 										 sizeof(BTMetaPageData));
 
-	/*
-	 * Meta page should be of supported version (should be already checked by
-	 * caller).
-	 */
-	Assert(metad->btm_version >= BTREE_MIN_VERSION &&
-		   metad->btm_version <= BTREE_VERSION);
+	/* Meta page should be of supported version */
+	Assert(input->btm_version >= BTREE_MIN_VERSION &&
+		   input->btm_version <= BTREE_VERSION);
 
-	if (metad->btm_version == BTREE_VERSION)
+	cached_metad = (BTMetaPageData *) rel->rd_amcache;
+	if (input->btm_version >= BTREE_NOVAC_VERSION)
 	{
-		/* Last version of meta-data, no need to upgrade */
-		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		/* Version with compatible meta-data, no need to upgrade */
+		memcpy(cached_metad, input, sizeof(BTMetaPageData));
 	}
 	else
 	{
-		BTMetaPageData *cached_metad = (BTMetaPageData *) rel->rd_amcache;
-
 		/*
 		 * Upgrade meta-data: copy available information from meta-page and
 		 * fill new fields with default values.
+		 *
+		 * Note that we cannot upgrade to version 4+ without a REINDEX, since
+		 * extensive on-disk changes are required.
 		 */
-		memcpy(rel->rd_amcache, metad, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
-		cached_metad->btm_version = BTREE_VERSION;
+		memcpy(cached_metad, input, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
+		cached_metad->btm_version = BTREE_NOVAC_VERSION;
 		cached_metad->btm_oldest_btpo_xact = InvalidTransactionId;
 		cached_metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	}
 }
 
+/*
+ * Get metadata from share-locked buffer containing metapage, while performing
+ * standard sanity checks.  Sanity checks here must match _bt_getroot().
+ */
+static BTMetaPageData *
+_bt_getmeta(Relation rel, Buffer metabuf)
+{
+	Page		metapg;
+	BTPageOpaque metaopaque;
+	BTMetaPageData *metad;
+
+	metapg = BufferGetPage(metabuf);
+	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
+	metad = BTPageGetMeta(metapg);
+
+	/* sanity-check the metapage */
+	if (!P_ISMETA(metaopaque) ||
+		metad->btm_magic != BTREE_MAGIC)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("index \"%s\" is not a btree",
+						RelationGetRelationName(rel))));
+
+	if (metad->btm_version < BTREE_MIN_VERSION ||
+		metad->btm_version > BTREE_VERSION)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("version mismatch in index \"%s\": file version %d, "
+						"current version %d, minimal supported version %d",
+						RelationGetRelationName(rel),
+						metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+
+	return metad;
+}
+
 /*
  *	_bt_update_meta_cleanup_info() -- Update cleanup-related information in
  *									  the metapage.
@@ -167,7 +206,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	metad = BTPageGetMeta(metapg);
 
 	/* outdated version of metapage always needs rewrite */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		needsRewrite = true;
 	else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
 			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
@@ -186,7 +225,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* update cleanup-related information */
@@ -202,6 +241,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = metad->btm_root;
 		md.level = metad->btm_level;
 		md.fastroot = metad->btm_fastroot;
@@ -376,7 +417,7 @@ _bt_getroot(Relation rel, int access)
 		START_CRIT_SECTION();
 
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 
 		metad->btm_root = rootblkno;
@@ -400,6 +441,8 @@ _bt_getroot(Relation rel, int access)
 			XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
 			XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			md.version = metad->btm_version;
 			md.root = rootblkno;
 			md.level = 0;
 			md.fastroot = rootblkno;
@@ -595,37 +638,12 @@ _bt_getrootheight(Relation rel)
 {
 	BTMetaPageData *metad;
 
-	/*
-	 * We can get what we need from the cached metapage data.  If it's not
-	 * cached yet, load it.  Sanity checks here must match _bt_getroot().
-	 */
 	if (rel->rd_amcache == NULL)
 	{
 		Buffer		metabuf;
-		Page		metapg;
-		BTPageOpaque metaopaque;
 
 		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
-		metapg = BufferGetPage(metabuf);
-		metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-		metad = BTPageGetMeta(metapg);
-
-		/* sanity-check the metapage */
-		if (!P_ISMETA(metaopaque) ||
-			metad->btm_magic != BTREE_MAGIC)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("index \"%s\" is not a btree",
-							RelationGetRelationName(rel))));
-
-		if (metad->btm_version < BTREE_MIN_VERSION ||
-			metad->btm_version > BTREE_VERSION)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("version mismatch in index \"%s\": file version %d, "
-							"current version %d, minimal supported version %d",
-							RelationGetRelationName(rel),
-							metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+		metad = _bt_getmeta(rel, metabuf);
 
 		/*
 		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
@@ -642,19 +660,70 @@ _bt_getrootheight(Relation rel)
 		 * Cache the metapage data for next time
 		 */
 		_bt_cachemetadata(rel, metad);
-
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
 		_bt_relbuf(rel, metabuf);
 	}
 
+	/* Get cached page */
 	metad = (BTMetaPageData *) rel->rd_amcache;
-	/* We shouldn't have cached it if any of these fail */
-	Assert(metad->btm_magic == BTREE_MAGIC);
-	Assert(metad->btm_version == BTREE_VERSION);
-	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
+/*
+ *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *
+ *		This is used to determine the rules that must be used to descend a
+ *		btree.  Version 4 indexes treat heap TID as a tiebreaker attribute.
+ *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
+ *		performance when inserting a new BTScanInsert-wise duplicate tuple
+ *		among many leaf pages already full of such duplicates.
+ */
+bool
+_bt_heapkeyspace(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			uint32		btm_version = metad->btm_version;
+
+			_bt_relbuf(rel, metabuf);
+			return btm_version > BTREE_NOVAC_VERSION;
+		}
+
+		/*
+		 * Cache the metapage data for next time
+		 */
+		_bt_cachemetadata(rel, metad);
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached page */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+
+	return metad->btm_version > BTREE_NOVAC_VERSION;
+}
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -1123,11 +1192,12 @@ _bt_is_page_halfdead(Relation rel, BlockNumber blk)
  * right sibling.
  *
  * "child" is the leaf page we wish to delete, and "stack" is a search stack
- * leading to it (approximately).  Note that we will update the stack
- * entry(s) to reflect current downlink positions --- this is essentially the
- * same as the corresponding step of splitting, and is not expected to affect
- * caller.  The caller should initialize *target and *rightsib to the leaf
- * page and its right sibling.
+ * leading to it (it actually leads to the leftmost leaf page with a high key
+ * matching that of the page to be deleted in !heapkeyspace indexes).  Note
+ * that we will update the stack entry(s) to reflect current downlink
+ * positions --- this is essentially the same as the corresponding step of
+ * splitting, and is not expected to affect caller.  The caller should
+ * initialize *target and *rightsib to the leaf page and its right sibling.
  *
  * Note: it's OK to release page locks on any internal pages between the leaf
  * and *topparent, because a safe deletion can't become unsafe due to
@@ -1149,8 +1219,10 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
 	BlockNumber leftsib;
 
 	/*
-	 * Locate the downlink of "child" in the parent (updating the stack entry
-	 * if needed)
+	 * Locate the downlink of "child" in the parent, updating the stack entry
+	 * if needed.  This is how !heapkeyspace indexes deal with having
+	 * non-unique high keys in leaf level pages.  Even heapkeyspace indexes
+	 * can have a stale stack due to insertions into the parent.
 	 */
 	stack->bts_btentry = child;
 	pbuf = _bt_getstackbuf(rel, stack);
@@ -1364,7 +1436,8 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 * We need an approximate pointer to the page's parent page.  We
 			 * use the standard search mechanism to search for the page's high
 			 * key; this will give us a link to either the current parent or
-			 * someplace to its left (if there are multiple equal high keys).
+			 * someplace to its left (if there are multiple equal high keys,
+			 * which is possible with !heapkeyspace indexes).
 			 *
 			 * Also check if this is the right-half of an incomplete split
 			 * (see comment above).
@@ -1422,7 +1495,8 @@ _bt_pagedel(Relation rel, Buffer buf)
 
 				/* we need an insertion scan key for the search, so build one */
 				itup_key = _bt_mkscankey(rel, targetkey);
-				/* get stack to leaf page by searching index */
+				/* Relocate the leaf page by its high key, not a non-pivot match */
+				itup_key->pivotsearch = true;
 				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
 				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
@@ -1969,7 +2043,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	if (BufferIsValid(metabuf))
 	{
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 		metad->btm_fastroot = rightsib;
 		metad->btm_fastlevel = targetlevel;
@@ -2017,6 +2091,8 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 		{
 			XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			xlmeta.version = metad->btm_version;
 			xlmeta.root = metad->btm_root;
 			xlmeta.level = metad->btm_level;
 			xlmeta.fastroot = metad->btm_fastroot;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 60e0b90ccf..ac6f1eb342 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -794,7 +794,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 	{
 		/*
 		 * Do cleanup if metapage needs upgrade, because we don't have
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 2d26d1f0dc..3200bbad30 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -153,8 +153,12 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link during the second half of a page split -- if caller
+		 * ends up splitting the child, it will usually insert a new pivot
+		 * tuple for the child's new right sibling immediately after the
+		 * original bts_offset position recorded here.  The downlink block
+		 * will be needed to check whether bts_offset remains the position
+		 * of this same pivot tuple.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -252,11 +256,13 @@ _bt_moveright(Relation rel,
 	/*
 	 * When nextkey = false (normal case): if the scan key that brought us to
 	 * this page is > the high key stored on the page, then the page has split
-	 * and we need to move right.  (If the scan key is equal to the high key,
-	 * we might or might not need to move right; have to scan the page first
-	 * anyway.)
+	 * and we need to move right.  (pg_upgrade'd !heapkeyspace indexes could
+	 * have some duplicates to the right as well as the left, but that's
+	 * something that's only ever dealt with on the leaf level, after
+	 * _bt_search has found an initial leaf page.)
 	 *
 	 * When nextkey = true: move right if the scan key is >= page's high key.
+	 * (Note that key.scantid cannot be set in this case.)
 	 *
 	 * The page could even have split more than once, so scan as far as
 	 * needed.
@@ -348,6 +354,9 @@ _bt_binsrch(Relation rel,
 	int32		result,
 				cmpval;
 
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -552,10 +561,14 @@ _bt_compare(Relation rel,
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	ItemPointer heapTid;
 	ScanKey		scankey;
+	int			ncmpkey;
+	int			ntupatts;
 
-	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(key->heapkeyspace || key->scantid == NULL);
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
@@ -565,6 +578,7 @@ _bt_compare(Relation rel,
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -578,8 +592,10 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
+	ncmpkey = Min(ntupatts, key->keysz);
+	Assert(key->heapkeyspace || ncmpkey == key->keysz);
 	scankey = key->scankeys;
-	for (int i = 1; i <= key->keysz; i++)
+	for (int i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -630,8 +646,77 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * All non-truncated attributes (other than heap TID) were found to be
+	 * equal.  Treat truncated attributes as minus infinity when scankey has a
+	 * key attribute value that would otherwise be compared directly.
+	 *
+	 * Note: it doesn't matter if ntupatts includes non-key attributes;
+	 * scankey won't, so explicitly excluding non-key attributes isn't
+	 * necessary.
+	 */
+	if (key->keysz > ntupatts)
+		return 1;
+
+	/*
+	 * Use the heap TID attribute and scantid to try to break the tie.  The
+	 * rules are the same as any other key attribute -- only the
+	 * representation differs.
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (key->scantid == NULL)
+	{
+		/*
+		 * Most searches have a scankey that is considered greater than a
+		 * truncated pivot tuple if and when the scankey has equal values for
+		 * attributes up to and including the least significant untruncated
+		 * attribute in tuple.
+		 *
+		 * For example, if an index has the minimum two attributes (single
+		 * user key attribute, plus heap TID attribute), and a page's high key
+		 * is ("foo", -inf), and scankey is ("foo", <omitted>), the search
+		 * will not descend to the page to the left.  The search will descend
+		 * right instead.  The truncated attribute in pivot tuple means that
+		 * all non-pivot tuples on the page to the left must be strictly <
+		 * "foo", so it isn't necessary to visit it.  In other words, caller
+		 * doesn't have to descend to the left because it isn't interested in
+		 * a match that has a heap TID value of -inf.
+		 *
+		 * However, some callers (pivotsearch callers) actually require that
+		 * we descend left when this happens.  Minus infinity is treated as a
+		 * possible match for omitted scankey attributes.  This is useful for
+		 * page deletion, which needs to relocate a leaf page using its high
+		 * key, rather than relocating its right sibling (the right sibling is
+		 * the first page a non-pivot match can be found on).
+		 *
+		 * Note: the heap TID part of this test ensures that scankey is being
+		 * compared to a pivot tuple with one or more truncated key attributes
+		 * (often though not necessarily just the heap TID attribute).
+		 *
+		 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+		 * left here, since they have no heap TID attribute (and cannot have
+		 * truncated attributes in any case).  They must always be prepared to
+		 * deal with matches on both sides of the pivot once the leaf level is
+		 * reached.
+		 */
+		if (key->heapkeyspace && !key->pivotsearch &&
+			key->keysz == ntupatts && heapTid == NULL)
+			return 1;
+
+		/* All provided scankey arguments found to be equal */
+		return 0;
+	}
+
+	/*
+	 * Treat truncated heap TID as minus infinity, since scankey has a key
+	 * attribute value (scantid) that would otherwise be compared directly
+	 */
+	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+	if (heapTid == NULL)
+		return 1;
+
+	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+	return ItemPointerCompare(key->scantid, heapTid);
 }
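
To summarize the comparison rules the block above implements, here is a toy
model (my code, not the patch's): int arrays stand in for key attributes, a
plain integer stands in for an ItemPointer, and "natts < keysz" stands in for
a pivot tuple whose suffix attributes were truncated away.  It deliberately
leaves out the pivotsearch refinement and NULL handling.

#include <stdbool.h>
#include <stdint.h>

typedef struct ToyTuple
{
	int			natts;			/* attributes actually present */
	const int  *atts;
	bool		has_tid;		/* pivots may have a truncated heap TID */
	uint64_t	tid;
} ToyTuple;

static int
toy_compare(const int *scankey, int keysz, bool have_scantid,
			uint64_t scantid, const ToyTuple *tup)
{
	int			ncmp = keysz < tup->natts ? keysz : tup->natts;

	for (int i = 0; i < ncmp; i++)
	{
		if (scankey[i] != tup->atts[i])
			return scankey[i] < tup->atts[i] ? -1 : 1;
	}

	/* truncated attributes compare as "minus infinity" */
	if (keysz > tup->natts)
		return 1;

	if (!have_scantid)
		return 0;

	/* a truncated heap TID is also "minus infinity" */
	if (!tup->has_tid)
		return 1;

	if (scantid != tup->tid)
		return scantid < tup->tid ? -1 : 1;
	return 0;
}
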
 
 /*
@@ -1146,7 +1231,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 
 	/* Initialize remaining insertion scan key fields */
+	inskey.heapkeyspace = _bt_heapkeyspace(rel);
 	inskey.nextkey = nextkey;
+	inskey.pivotsearch = false;
+	inskey.scantid = NULL;
 	inskey.keysz = keysCount;
 
 	/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index a0e2e70cef..2762a2d548 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -755,6 +755,7 @@ _bt_sortaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -808,8 +809,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -826,27 +825,21 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	itupsz = MAXALIGN(itupsz);
 
 	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itupsz doesn't
-	 * include the ItemId.
+	 * Check whether the item can fit on a btree page at all.
 	 *
-	 * NOTE: similar code appears in _bt_insertonpg() to defend against
-	 * oversize items being inserted into an already-existing index. But
-	 * during creation of an index, we don't go through there.
+	 * Every newly built index will treat heap TID as part of the keyspace,
+	 * which imposes the requirement that new high keys must occasionally have
+	 * a heap TID appended within _bt_truncate().  That may leave a new pivot
+	 * tuple one or two MAXALIGN() quantums larger than the original first
+	 * right tuple it's derived from.  v4 deals with the problem by decreasing
+	 * the limit on the size of tuples inserted on the leaf level by the same
+	 * small amount.  Enforce the new v4+ limit on the leaf level, and the old
+	 * limit on internal levels, since pivot tuples may need to make use of
+	 * the reserved space.  This should never fail on internal pages.
 	 */
-	if (itupsz > BTMaxItemSize(npage))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itupsz, BTMaxItemSize(npage),
-						RelationGetRelationName(wstate->index)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(wstate->heap,
-									RelationGetRelationName(wstate->index))));
+	if (unlikely(itupsz > BTMaxItemSize(npage)))
+		_bt_check_third_page(wstate->index, wstate->heap,
+							 state->btps_level == 0, npage, itup);
 
 	/*
 	 * Check to see if page is "full".  It's definitely full if the item won't
@@ -892,24 +885,35 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
+			IndexTuple	lastleft;
 			IndexTuple	truncated;
 			Size		truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks
 			 * in internal pages are either negative infinity items, or get
 			 * their contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it more
+			 * likely that _bt_truncate() can truncate away more attributes,
+			 * whereas the split point passed to _bt_split() is chosen much
+			 * more delicately.  Suffix truncation is mostly useful because it
+			 * improves space utilization for workloads with random
+			 * insertions.  It doesn't seem worthwhile to add logic for
+			 * choosing a split point here for a benefit that is bound to be
+			 * much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
 			 * original high key, and add our own truncated high key at the
-			 * same offset.
+			 * same offset.  It's okay if the truncated tuple is slightly
+			 * larger due to containing a heap TID value, since this case is
+			 * known to _bt_check_third_page(), which reserves space.
 			 *
 			 * Note that the page layout won't be changed very much.  oitup is
 			 * already located at the physical beginning of tuple space, so we
@@ -917,7 +921,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_truncate(wstate->index, lastleft, oitup,
+									 wstate->inskey);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -936,8 +944,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -982,7 +991,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1041,8 +1050,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1135,6 +1145,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1142,7 +1154,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1159,6 +1170,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all
+				 * keys in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
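
The ordering that the merge step (and the tuplesort change further down) now
relies on is simply (key columns, block number, offset number).  A
self-contained illustration with made-up entries, so reviewers can see that
"duplicates" end up in heap TID order:

#include <stdio.h>
#include <stdlib.h>

typedef struct Entry
{
	int			key;			/* stands in for all user key columns */
	unsigned	block;			/* heap TID block number */
	unsigned	offset;			/* heap TID offset number */
} Entry;

static int
entry_cmp(const void *a, const void *b)
{
	const Entry *ea = (const Entry *) a;
	const Entry *eb = (const Entry *) b;

	if (ea->key != eb->key)
		return ea->key < eb->key ? -1 : 1;
	if (ea->block != eb->block)
		return ea->block < eb->block ? -1 : 1;
	if (ea->offset != eb->offset)
		return ea->offset < eb->offset ? -1 : 1;
	return 0;					/* unreachable for distinct heap TIDs */
}

int
main(void)
{
	Entry		entries[] = {{7, 10, 3}, {7, 2, 5}, {3, 99, 1}, {7, 2, 1}};

	qsort(entries, 4, sizeof(Entry), entry_cmp);
	for (int i = 0; i < 4; i++)
		printf("key=%d tid=(%u,%u)\n", entries[i].key,
			   entries[i].block, entries[i].offset);
	/* prints 3 (99,1), then the three 7s in TID order: (2,1), (2,5), (10,3) */
	return 0;
}
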
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 0250e089a6..12fd0a7b0d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,6 +49,8 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+			   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -56,9 +58,26 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		Result is intended for use with _bt_compare().  Callers that don't
- *		need to fill out the insertion scankey arguments (e.g. they use an
- *		ad-hoc comparison routine) can pass a NULL index tuple.
+ *		When itup is a non-pivot tuple, the returned insertion scan key is
+ *		suitable for finding a place for it to go on the leaf level.  Pivot
+ *		tuples can be used to relocate a leaf page with a matching high key, but
+ *		then caller needs to set scan key's pivotsearch field to true.  This
+ *		allows caller to search for a leaf page with a matching high key,
+ *		which is usually to the left of the first leaf page a non-pivot match
+ *		might appear on.
+ *
+ *		Result is intended for use with _bt_compare() and _bt_truncate().
+ *		Callers that don't need to fill out the insertion scankey arguments
+ *		(e.g. they use an ad-hoc comparison routine, or only need a scankey
+ *		for _bt_truncate()) can pass a NULL index tuple.  The scankey will
+ *		be initialized as if an "all truncated" pivot tuple was passed
+ *		instead.
+ *
+ *		Note that we may occasionally have to share lock the metapage to
+ *		determine whether or not the keys in the index are expected to be
+ *		unique (i.e. if this is a "heapkeyspace" index).  We assume a
+ *		heapkeyspace index when caller passes a NULL tuple, allowing index
+ *		build callers to avoid accessing the non-existent metapage.
  */
 BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
@@ -79,13 +98,18 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
-	 * We'll execute search using scan key constructed on key columns. Non-key
-	 * (INCLUDE index) columns are always omitted from scan keys.
+	 * We'll execute search using scan key constructed on key columns.
+	 * Truncated attributes and non-key attributes are omitted from the final
+	 * scan key.
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
+	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
 	key->nextkey = false;
+	key->pivotsearch = false;
 	key->keysz = Min(indnkeyatts, tupnatts);
+	key->scantid = key->heapkeyspace && itup ?
+		BTreeTupleGetHeapTID(itup) : NULL;
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -101,9 +125,9 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
 
 		/*
-		 * Key arguments built when caller provides no tuple are
-		 * defensively represented as NULL values.  They should never be
-		 * used.
+		 * Key arguments built from truncated attributes (or when caller
+		 * provides no tuple) are defensively represented as NULL values. They
+		 * should never be used.
 		 */
 		if (i < tupnatts)
 			arg = index_getattr(itup, i + 1, itupdesc, &null);
@@ -2041,38 +2065,234 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that takes up more
+ * space than firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
  * Note that returned tuple's t_tid offset will hold the number of attributes
  * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * should only change truncated tuple's downlink.  Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID is appended) the size of the returned
+ * tuple is the size of the first right tuple plus an additional MAXALIGN()'d
+ * item pointer.  This guarantee is important, since callers need to stay
+ * under the 1/3 of a page restriction on tuple size.  If this routine is ever
+ * taught to truncate within an attribute/datum, it will need to avoid
+ * returning an enlarged tuple to caller when truncation + TOAST compression
+ * ends up enlarging the final datum.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			 BTScanInsert itup_key)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int16		natts = IndexRelationGetNumberOfAttributes(rel);
+	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+	IndexTuple	pivot;
+	ItemPointer pivotheaptid;
+	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples.  It's never okay to
+	 * truncate a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be kept in truncated tuple */
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
 
-	return truncated;
+#ifdef DEBUG_NO_TRUNCATE
+	/* Force truncation to be ineffective for testing purposes */
+	keepnatts = nkeyatts + 1;
+#endif
+
+	if (keepnatts <= natts)
+	{
+		IndexTuple	tidpivot;
+
+		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+
+		/*
+		 * If there is a distinguishing key attribute within new pivot tuple,
+		 * there is no need to add an explicit heap TID attribute
+		 */
+		if (keepnatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			return pivot;
+		}
+
+		/*
+		 * Only truncation of non-key attributes was possible, since key
+		 * attributes are all equal.  It's necessary to add a heap TID
+		 * attribute to the new pivot tuple.
+		 */
+		Assert(natts != nkeyatts);
+		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		/* cannot leak memory here */
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * We have to use heap TID as a unique-ifier in the new pivot tuple, since
+	 * no non-TID key attribute in the right item readily distinguishes the
+	 * right side of the split from the left side.  Use enlarged space that
+	 * holds a copy of first right tuple; place a heap TID value within the
+	 * extra space that remains at the end.
+	 *
+	 * nbtree conceptualizes this case as an inability to truncate away any
+	 * key attribute.  We must use an alternative representation of heap TID
+	 * within pivots because heap TID is only treated as an attribute within
+	 * nbtree (e.g., there is no pg_attribute entry).
+	 */
+	Assert(itup_key->heapkeyspace);
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/*
+	 * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+	 * consider suffix truncation.  It seems like a good idea to follow that
+	 * example in cases where no truncation takes place -- use lastleft's heap
+	 * TID.  (This is also the closest value to negative infinity that's
+	 * legally usable.)
+	 */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  sizeof(ItemPointerData));
+	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+
+	/*
+	 * Lehman and Yao require that the downlink to the right page, which is to
+	 * be inserted into the parent page in the second phase of a page split, be
+	 * a strict lower bound on items on the right page, and a non-strict upper
+	 * bound for items on the left page.  Assert that heap TIDs follow these
+	 * invariants, since a heap TID value is apparently needed as a
+	 * tiebreaker.
+	 */
+#ifndef DEBUG_NO_TRUNCATE
+	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#else
+
+	/*
+	 * Those invariants aren't guaranteed to hold for lastleft + firstright
+	 * heap TID attribute values when they're considered here only because
+	 * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+	 * needed as a tiebreaker).  DEBUG_NO_TRUNCATE must therefore use a heap
+	 * TID value that always works as a strict lower bound for items to the
+	 * right.  In particular, it must avoid using firstright's leading key
+	 * attribute values along with lastleft's heap TID value when lastleft's
+	 * TID happens to be greater than firstright's TID.
+	 */
+	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+
+	/*
+	 * Pivot heap TID should never be fully equal to firstright.  Note that
+	 * the pivot heap TID will still end up equal to lastleft's heap TID when
+	 * that's the only usable value.
+	 */
+	ItemPointerSetOffsetNumber(pivotheaptid,
+							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#endif
+
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point.  Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			   BTScanInsert itup_key)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keepnatts;
+	ScanKey		scankey;
+
+	/*
+	 * Be consistent about the representation of BTREE_VERSION 2/3 tuples
+	 * across Postgres versions; don't allow new pivot tuples to have
+	 * truncated key attributes there.  _bt_compare() treats truncated key
+	 * attributes as having the value minus infinity, which would break
+	 * searches within !heapkeyspace indexes.
+	 */
+	if (!itup_key->heapkeyspace)
+	{
+		Assert(nkeyatts != IndexRelationGetNumberOfAttributes(rel));
+		return nkeyatts;
+	}
+
+	scankey = itup_key->scankeys;
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+											scankey->sk_collation,
+											datum1,
+											datum2)) != 0)
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
 }
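
A worked example of that contract, using plain string equality in place of the
opclass comparators: with lastleft = ("usa", "ny") and firstright =
("usa", "pa"), 2 is returned and both key columns are kept in the new pivot;
with two fully equal tuples, nkeyatts + 1 = 3 is returned, telling
_bt_truncate() to append a heap TID tiebreaker.  In sketch form (my
simplification, not the patch's code):

#include <string.h>

static int
toy_keep_natts(const char **lastleft, const char **firstright, int nkeyatts)
{
	int			keepnatts = 1;

	/* keep attributes up to and including the first distinguishing one */
	for (int attnum = 1; attnum <= nkeyatts; attnum++)
	{
		if (strcmp(lastleft[attnum - 1], firstright[attnum - 1]) != 0)
			break;
		keepnatts++;
	}

	return keepnatts;
}
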
 
 /*
@@ -2086,15 +2306,17 @@ _bt_nonkey_truncate(Relation rel, IndexTuple itup)
  * preferred to calling here.  That's usually more convenient, and is always
  * more explicit.  Call here instead when offnum's tuple may be a negative
  * infinity tuple that uses the pre-v11 on-disk representation, or when a low
- * context check is appropriate.
+ * context check is appropriate.  This routine is as strict as possible about
+ * what is expected on each version of btree.
  */
 bool
-_bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
+_bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 {
 	int16		natts = IndexRelationGetNumberOfAttributes(rel);
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2114,16 +2336,26 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Leaf tuples that are not the page high key (non-pivot tuples)
-			 * should never be truncated
+			 * Non-pivot tuples currently never use alternative heap TID
+			 * representation -- even those within heapkeyspace indexes
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+				return false;
+
+			/*
+			 * Leaf tuples that are not the page high key (non-pivot tuples)
+			 * should never be truncated.  (Note that tupnatts must have been
+			 * inferred, rather than coming from an explicit on-disk
+			 * representation.)
+			 */
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2133,8 +2365,15 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 */
 			Assert(!P_RIGHTMOST(opaque));
 
-			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			/*
+			 * !heapkeyspace high key tuple contains only key attributes. Note
+			 * that tupnatts will only have been explicitly represented in
+			 * !heapkeyspace indexes that happen to have non-key attributes.
+			 */
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2146,7 +2385,11 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * its high key) is its negative infinity tuple.  Negative
 			 * infinity tuples are always truncated to zero attributes.  They
 			 * are a particular kind of pivot tuple.
-			 *
+			 */
+			if (heapkeyspace)
+				return tupnatts == 0;
+
+			/*
 			 * The number of attributes won't be explicitly represented if the
 			 * negative infinity tuple was generated during a page split that
 			 * occurred with a version of Postgres before v11.  There must be
@@ -2157,18 +2400,109 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == 0 ||
+			return tupnatts == 0 ||
 				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
 		{
 			/*
-			 * Tuple contains only key attributes despite on is it page high
-			 * key or not
+			 * !heapkeyspace downlink tuple with separator key contains only
+			 * key attributes.  Note that tupnatts will only have been
+			 * explicitly represented in !heapkeyspace indexes that happen to
+			 * have non-key attributes.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 
 	}
+
+	/* Handle heapkeyspace pivot tuples (excluding minus infinity items) */
+	Assert(heapkeyspace);
+
+	/*
+	 * Explicit representation of the number of attributes is mandatory with
+	 * heapkeyspace index pivot tuples, regardless of whether or not there are
+	 * non-key attributes.
+	 */
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+
+	/*
+	 * Heap TID is a tiebreaker key attribute, so it cannot be untruncated
+	 * when any other key attribute is truncated
+	 */
+	if (BTreeTupleGetHeapTID(itup) != NULL && tupnatts != nkeyatts)
+		return false;
+
+	/*
+	 * Pivot tuple must have at least one untruncated key attribute (minus
+	 * infinity pivot tuples are the only exception).  Pivot tuples can never
+	 * represent that there is a value present for a key attribute that
+	 * exceeds pg_index.indnkeyatts for the index.
+	 */
+	return tupnatts > 0 && tupnatts <= nkeyatts;
+}
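
The heapkeyspace pivot rules at the end of the function (which exclude the
minus infinity case handled earlier) can be restated compactly.  Here is an
equivalent standalone predicate (my restatement, not code from the patch):

#include <stdbool.h>

static bool
toy_check_pivot_natts(int tupnatts, int nkeyatts, bool has_heap_tid,
					  bool alt_tid_set)
{
	if (!alt_tid_set)
		return false;			/* v4 pivots always set INDEX_ALT_TID_MASK */
	if (has_heap_tid && tupnatts != nkeyatts)
		return false;			/* TID tiebreaker implies no truncated key */
	return tupnatts > 0 && tupnatts <= nkeyatts;
}
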
+
+/*
+ *	_bt_check_third_page() -- check whether tuple fits on a btree page at all.
+ *
+ * We actually need to be able to fit three items on every page, so restrict
+ * any one item to 1/3 the per-page available space.  Note that itemsz should
+ * not include the ItemId overhead.
+ *
+ * It might be useful to apply TOAST methods rather than throw an error here.
+ * Using out of line storage would break assumptions made by suffix truncation
+ * and by contrib/amcheck, though.
+ */
+void
+_bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
+					 Page page, IndexTuple newtup)
+{
+	Size		itemsz;
+	BTPageOpaque opaque;
+
+	itemsz = MAXALIGN(IndexTupleSize(newtup));
+
+	/* Double check item size against limit */
+	if (itemsz <= BTMaxItemSize(page))
+		return;
+
+	/*
+	 * Tuple is probably too large to fit on page, but it's possible that the
+	 * index uses version 2 or version 3, or that page is an internal page, in
+	 * which case a slightly higher limit applies.
+	 */
+	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
+		return;
+
+	/*
+	 * Internal page insertions cannot fail here, because that would mean that
+	 * an earlier leaf level insertion that should have failed didn't
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (!P_ISLEAF(opaque))
+		elog(ERROR, "cannot insert oversized tuple of size %zu on internal page of index \"%s\"",
+			 itemsz, RelationGetRelationName(rel));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+			 errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
+					itemsz,
+					needheaptidspace ? BTREE_VERSION : BTREE_NOVAC_VERSION,
+					needheaptidspace ? BTMaxItemSize(page) :
+					BTMaxItemSizeNoHeapTid(page),
+					RelationGetRelationName(rel)),
+			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+					   ItemPointerGetBlockNumber(&newtup->t_tid),
+					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   RelationGetRelationName(heap)),
+			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+					 "Consider a function index of an MD5 hash of the value, "
+					 "or use full text indexing."),
+			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index b0666b42df..7f261db901 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -103,7 +103,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 
 	md = BTPageGetMeta(metapg);
 	md->btm_magic = BTREE_MAGIC;
-	md->btm_version = BTREE_VERSION;
+	md->btm_version = xlrec->version;
 	md->btm_root = xlrec->root;
 	md->btm_level = xlrec->level;
 	md->btm_fastroot = xlrec->fastroot;
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -282,8 +266,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		Page		lpage = (Page) BufferGetPage(lbuf);
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
-		IndexTuple	newitem = NULL;
-		Size		newitemsz = 0;
+		IndexTuple	newitem,
+					left_hikey;
+		Size		newitemsz,
+					left_hikeysz;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 8d5c6ae0ab..fcac0cd8a9 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 16bda5c586..3eebd9ef51 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4057,9 +4057,10 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required for
+	 * btree indexes, since heap TID is treated as an implicit last key
+	 * attribute in order to ensure that all keys in the index are physically
+	 * unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
@@ -4076,6 +4077,9 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
@@ -4128,6 +4132,9 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7d7faa59c2..a4bce7a13c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -112,18 +112,45 @@ typedef struct BTMetaPageData
 #define BTPageGetMeta(p) \
 	((BTMetaPageData *) PageGetContents(p))
 
+/*
+ * The current Btree version is 4.  That's what you'll get when you create
+ * a new index.
+ *
+ * Btree version 3 was used in PostgreSQL v11.  It is mostly the same as
+ * version 4, but heap TIDs were not part of the keyspace.  Index tuples
+ * with duplicate keys could be stored in any order.  We continue to
+ * support reading and writing Btree versions 2 and 3, so that they don't
+ * need to be immediately re-indexed at pg_upgrade.  In order to get the
+ * new heapkeyspace semantics, however, a REINDEX is needed.
+ *
+ * Btree version 2 is mostly the same as version 3.  There are two new
+ * fields in the metapage that were introduced in version 3.  A version 2
+ * metapage will be automatically upgraded to version 3 on the first
+ * insert to it.  INCLUDE indexes cannot use version 2.
+ */
 #define BTREE_METAPAGE	0		/* first page is meta */
-#define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
+#define BTREE_MAGIC		0x053162	/* magic number in metapage */
+#define BTREE_VERSION	4		/* current version number */
 #define BTREE_MIN_VERSION	2	/* minimal supported version number */
+#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_truncate() will need to enlarge the
+ * new pivot tuple relative to the leaf tuple it is derived from, to make
+ * space for a tiebreaker heap TID attribute, which we account for here.
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*sizeof(ItemPointerData)) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeNoHeapTid(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
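
For concreteness, the arithmetic for the default 8192-byte block size works
out as follows (my numbers, assuming the usual 24-byte page header, 4-byte
line pointers, 6-byte item pointers, a 16-byte BTPageOpaqueData special area,
and 8-byte MAXALIGN):

  BTMaxItemSizeNoHeapTid = MAXALIGN_DOWN((8192 - MAXALIGN(24 + 3*4) - 16) / 3)
                         = MAXALIGN_DOWN((8192 - 40 - 16) / 3) = 2712 bytes

  BTMaxItemSize          = MAXALIGN_DOWN((8192 - MAXALIGN(24 + 3*4 + 3*6) - 16) / 3)
                         = MAXALIGN_DOWN((8192 - 56 - 16) / 3) = 2704 bytes

So the v4 leaf-level limit comes out 8 bytes below the old one, i.e. exactly
one MAXALIGN()'d ItemPointerData per item, which is the space _bt_truncate()
may need when it appends a heap TID to a new high key.
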
@@ -187,38 +214,73 @@ typedef struct BTMetaPageData
 #define P_FIRSTDATAKEY(opaque)	(P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)
 
 /*
+ *
+ * Notes on B-Tree tuple format, and key and non-key attributes:
+ *
  * INCLUDE B-Tree indexes have non-key attributes.  These are extra
  * attributes that may be returned by index-only scans, but do not influence
  * the order of items in the index (formally, non-key attributes are not
  * considered to be part of the key space).  Non-key attributes are only
  * present in leaf index tuples whose item pointers actually point to heap
- * tuples.  All other types of index tuples (collectively, "pivot" tuples)
- * only have key attributes, since pivot tuples only ever need to represent
- * how the key space is separated.  In general, any B-Tree index that has
- * more than one level (i.e. any index that does not just consist of a
- * metapage and a single leaf root page) must have some number of pivot
- * tuples, since pivot tuples are used for traversing the tree.
+ * tuples (non-pivot tuples).
  *
- * We store the number of attributes present inside pivot tuples by abusing
- * their item pointer offset field, since pivot tuples never need to store a
- * real offset (downlinks only need to store a block number).  The offset
- * field only stores the number of attributes when the INDEX_ALT_TID_MASK
- * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * Non-pivot tuple format:
  *
- * The 12 least significant offset bits are used to represent the number of
- * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
- * for future use (BT_RESERVED_OFFSET_MASK bits). BT_N_KEYS_OFFSET_MASK should
- * be large enough to store any number <= INDEX_MAX_KEYS.
+ *  t_tid | t_info | key values | INCLUDE columns, if any
+ *
+ * t_tid points to the heap TID, which is a tiebreaker key column as of
+ * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
+ * set.
+ *
+ * All other types of index tuples (collectively, "pivot" tuples) only
+ * have key columns, since pivot tuples only ever need to represent how
+ * the key space is separated.  In general, any B-Tree index that has more
+ * than one level (i.e. any index that does not just consist of a metapage
+ * and a single leaf root page) must have some number of pivot tuples,
+ * since pivot tuples are used for traversing the tree.  Suffix truncation
+ * can omit trailing key columns when a new pivot is formed, which makes
+ * minus infinity their logical value.  Since BTREE_VERSION 4 indexes
+ * treat heap TID as a trailing key column that ensures that all index
+ * tuples are unique, it is necessary to represent heap TID as a trailing
+ * key column in pivot tuples, though very often this can be truncated
+ * away, just like any other key column. (Actually, the heap TID is
+ * omitted rather than truncated, since its representation is different to
+ * the non-pivot representation.)
+ *
+ * Pivot tuple format:
+ *
+ *  t_tid | t_info | key values | [heap TID]
+ *
+ * We store the number of columns present inside pivot tuples by abusing
+ * their t_tid offset field, since pivot tuples never need to store a real
+ * offset (downlinks only need to store a block number in t_tid).  The
+ * offset field only stores the number of columns/attributes when the
+ * INDEX_ALT_TID_MASK bit is set, which doesn't count the trailing heap
+ * TID column sometimes stored in pivot tuples -- that's represented by
+ * the presence of BT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in t_info
+ * is always set on BTREE_VERSION 4.  BT_HEAP_TID_ATTR can only be set on
+ * BTREE_VERSION 4.
+ *
+ * In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set.  In
+ * that case, the number of key columns is implicitly the same as the number
+ * of key columns in the index.  It is never set on version 2 indexes,
+ * which predate the introduction of INCLUDE indexes. (Only explicitly
+ * truncated pivot tuples explicitly represent the number of key columns
+ * on version 3, whereas all pivot tuples are formed using truncation on
+ * version 4.)
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of columns in INDEX_ALT_TID_MASK tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
+ * number of columns/attributes <= INDEX_MAX_KEYS.
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
+
+/* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_HEAP_TID_ATTR			0x1000
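
For anyone reading the pivot tuple code for the first time, the offset word of
t_tid in an INDEX_ALT_TID_MASK tuple is carved up as sketched below (my
illustration, not part of the patch; a bare uint16 stands in for the
ItemPointer offset field):

#include <stdbool.h>
#include <stdint.h>

#define TOY_RESERVED_OFFSET_MASK	0xF000	/* mirrors BT_RESERVED_OFFSET_MASK */
#define TOY_N_KEYS_OFFSET_MASK		0x0FFF	/* mirrors BT_N_KEYS_OFFSET_MASK */
#define TOY_HEAP_TID_ATTR			0x1000	/* mirrors BT_HEAP_TID_ATTR */

static inline uint16_t
toy_pivot_offset(uint16_t nkeyatts, bool has_heap_tid)
{
	/* low 12 bits: number of untruncated key columns; bit 12: heap TID flag */
	return (uint16_t) ((nkeyatts & TOY_N_KEYS_OFFSET_MASK) |
					   (has_heap_tid ? TOY_HEAP_TID_ATTR : 0));
}

static inline uint16_t
toy_pivot_natts(uint16_t offset)
{
	return offset & TOY_N_KEYS_OFFSET_MASK;
}

static inline bool
toy_pivot_has_heap_tid(uint16_t offset)
{
	return (offset & TOY_HEAP_TID_ATTR) != 0;
}
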
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +303,16 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tiebreaker heap-TID
+ * attribute, if any.  Note also that the number of key attributes must be
+ * explicitly represented in heapkeyspace pivot tuples.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +321,52 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * Only BTREE_VERSION 4 indexes treat heap TID as a tiebreaker key attribute.
+ * This macro can be used with tuples from indexes that use earlier versions,
+ * even though the result won't be meaningful.  The expectation is that higher
+ * level code will ensure that the result is never used, for example by never
+ * providing a scantid that the result is compared against.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that
+ * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set
+ * (since all pivot tuples must as of BTREE_VERSION 4).  When non-pivot
+ * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll
+ * probably also contain a heap TID at the end of the tuple.  We currently
+ * assume that a tuple with INDEX_ALT_TID_MASK set is a pivot tuple within
+ * heapkeyspace indexes (and that a tuple without it set must be a non-pivot
+ * tuple), but it might also be used by non-pivot tuples in the future.
+ * pg_upgrade'd !heapkeyspace indexes only set INDEX_ALT_TID_MASK in pivot
+ * tuples that actually originated with the truncation of one or more
+ * attributes.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   sizeof(ItemPointerData)) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -325,23 +431,45 @@ typedef BTStackData *BTStack;
  * be confused with a search scankey).  It's used to descend a B-Tree using
  * _bt_search.
  *
+ * heapkeyspace indicates if we expect all keys in the index to be unique by
+ * treating heap TID as a tiebreaker attribute (i.e. the index is
+ * BTREE_VERSION 4+).  scantid should never be set when index is not a
+ * heapkeyspace index.
+ *
  * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
  * locate the first item >= scankey.  When nextkey is true, they will locate
  * the first item > scan key.
  *
- * keysz is the number of insertion scankeys present.
+ * pivotsearch is set to true by callers that want to relocate a leaf page
+ * using a scankey built from a leaf page's high key.  Most callers set this to
+ * false.
  *
- * scankeys is an array of scan key entries for attributes that are compared.
- * During insertion, there must be a scan key for every attribute, but when
- * starting a regular index scan some can be omitted.  The array is used as a
- * flexible array member, though it's sized in a way that makes it possible to
- * use stack allocations.  See nbtree/README for full details.
+ * scantid is the heap TID that is used as a final tiebreaker attribute,
+ * which may be set to NULL to indicate its absence.  When inserting new
+ * tuples, it must be set, since every tuple in the tree unambiguously belongs
+ * in one exact position, even when there are entries in the tree that are
+ * considered duplicates by external code.  Unique insertions set scantid only
+ * after unique checking indicates that it's safe to insert.  Despite the
+ * representational difference, scantid is just another insertion scankey to
+ * routines like _bt_search.
+ *
+ * keysz is the number of insertion scankeys present, not including scantid.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared
+ * before scantid (user-visible attributes).  During insertion, there must be
+ * a scan key for every attribute, but when starting a regular index scan some
+ * can be omitted.  The array is used as a flexible array member, though it's
+ * sized in a way that makes it possible to use stack allocations.  See
+ * nbtree/README for full details.
  */
 
 typedef struct BTScanInsertData
 {
 	/* State used to locate a position at the leaf level */
+	bool		heapkeyspace;
 	bool		nextkey;
+	bool		pivotsearch;	/* Searching for pivot tuple? */
+	ItemPointer scantid;		/* tiebreaker for scankeys */
 	int			keysz;			/* Size of scankeys */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
 } BTScanInsertData;
@@ -601,6 +729,7 @@ extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
+extern bool _bt_heapkeyspace(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -654,8 +783,12 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
-extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+			 IndexTuple firstright, BTScanInsert itup_key);
+extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
+				OffsetNumber offnum);
+extern void _bt_check_third_page(Relation rel, Relation heap,
+					 bool needheaptidspace, Page page, IndexTuple newtup);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index a605851c98..6320a0098f 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,7 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50 and 0x60 are unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -47,6 +46,7 @@
  */
 typedef struct xl_btree_metadata
 {
+	uint32		version;
 	BlockNumber root;
 	uint32		level;
 	BlockNumber fastroot;
@@ -80,27 +80,30 @@ typedef struct xl_btree_insert
  * whole page image.  The left page, however, is handled in the normal
  * incremental-update fashion.
  *
- * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
- * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * Note: XLOG_BTREE_SPLIT_L and XLOG_BTREE_SPLIT_R share this data record.
+ * There are two variants to indicate whether the inserted tuple went into the
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always log the left page high key because suffix
+ * truncation can generate a new leaf high key using user-defined code.  This
+ * is also necessary on internal pages, since the first right item that the
+ * left page's high key was based on will have been truncated to zero
+ * attributes in the right page (the original is unavailable from the right
+ * page).
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * An IndexTuple representing the high key of the left page must follow with
+ * either variant.
  *
  * Backup Blk 1: new right page
  *
- * The right page's data portion contains the right page's tuples in the
- * form used by _bt_restore_page.
+ * The right page's data portion contains the right page's tuples in the form
+ * used by _bt_restore_page.  This includes the new item, if it's the _R
+ * variant.  The right page's tuples also include the right page's high key
+ * with either variant (moved from the left/original page during the split),
+ * unless the split happened to be of the rightmost page on its level, in
+ * which case there is no high key for the new right page.
  *
  * Backup Blk 2: next block (orig page's rightlink), if any
  * Backup Blk 3: child's left sibling, if non-leaf split
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index b21298a2a6..ff443a476c 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -199,28 +199,22 @@ reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
 --
--- Test B-tree page deletion. In particular, deleting a non-leaf page.
+-- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
--- First create a tree that's at least four levels deep. The text inserted
--- is long and poorly compressible. That way only a few index tuples fit on
--- each page, allowing us to get a tall tree with fewer pages.
+-- First create a tree that's at least three levels deep (i.e. has one level
+-- between the root and leaf levels). The text inserted is long.  It won't be
+-- compressed because we use plain storage in the table.  Only a few index
+-- tuples fit on each internal page, allowing us to get a tall tree with few
+-- pages.  (A tall tree is required to trigger caching.)
+--
+-- The text column must be the leading column in the index, since suffix
+-- truncation would otherwise truncate tuples on internal pages, leaving us
+-- with a short tree.
 create table btree_tall_tbl(id int4, t text);
-create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
-insert into btree_tall_tbl
-  select g, g::text || '_' ||
-          (select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
-from generate_series(1, 100) g;
--- Delete most entries, and vacuum. This causes page deletions.
-delete from btree_tall_tbl where id < 950;
-vacuum btree_tall_tbl;
---
--- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
--- WAL record type). This happens when a "fast root" page is split.
---
--- The vacuum above should've turned the leaf page into a fast root. We just
--- need to insert some rows to cause the fast root page to split.
-insert into btree_tall_tbl (id, t)
-  select g, repeat('x', 100) from generate_series(1, 500) g;
+alter table btree_tall_tbl alter COLUMN t set storage plain;
+create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
+insert into btree_tall_tbl select g, repeat('x', 250)
+from generate_series(1, 130) g;
 --
 -- Test vacuum_cleanup_index_scale_factor
 --
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 5d4eb59a0c..54d3eee197 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3225,11 +3225,22 @@ explain (costs off)
 CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 --
+-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
+-- WAL record type). This happens when a "fast root" page is split.  This
+-- also creates coverage for nbtree FSM page recycling.
+--
+-- The vacuum above should've turned the leaf page into a fast root. We just
+-- need to insert some rows to cause the fast root page to split.
+INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
+--
 -- REINDEX (VERBOSE)
 --
 CREATE TABLE reindex_verbose(id integer primary key);
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8e50f8ffbb..8d31110b87 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 -- doesn't work: grant still exists
 DROP USER regress_dep_user1;
 ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
-DETAIL:  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
+DETAIL:  privileges for table deptest1
 privileges for database regression
-privileges for table deptest1
+owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
 DROP OWNED BY regress_dep_user1;
 DROP USER regress_dep_user1;
 \set VERBOSITY terse
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index d0c9f9a67f..f7891faa23 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -187,9 +187,9 @@ ERROR:  event trigger "regress_event_trigger" does not exist
 -- should fail, regress_evt_user owns some objects
 drop role regress_evt_user;
 ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
-DETAIL:  owner of event trigger regress_event_trigger3
+DETAIL:  owner of user mapping for regress_evt_user on server useless_server
 owner of default privileges on new relations belonging to role regress_evt_user
-owner of user mapping for regress_evt_user on server useless_server
+owner of event trigger regress_event_trigger3
 -- cleanup before next test
 -- these are all OK; the second one should emit a NOTICE
 drop event trigger if exists regress_event_trigger2;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 4d82d3a7e8..0b7582accb 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
 RESET ROLE;
 DROP ROLE regress_test_indirect;                            -- ERROR
 ERROR:  role "regress_test_indirect" cannot be dropped because some objects depend on it
-DETAIL:  owner of server s1
-privileges for foreign-data wrapper foo
+DETAIL:  privileges for foreign-data wrapper foo
+owner of server s1
 \des+
                                                                                  List of foreign servers
  Name |           Owner           | Foreign-data wrapper |                   Access privileges                   |  Type  | Version |             FDW options              | Description 
@@ -1995,16 +1995,13 @@ ERROR:  cannot attach a permanent relation as partition of temporary relation "t
 DROP FOREIGN TABLE foreign_part;
 DROP TABLE temp_parted;
 -- Cleanup
+\set VERBOSITY terse
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 ERROR:  role "regress_test_role" cannot be dropped because some objects depend on it
-DETAIL:  privileges for server s4
-privileges for foreign-data wrapper foo
-owner of user mapping for regress_test_role on server s6
 DROP SERVER t1 CASCADE;
 NOTICE:  drop cascades to user mapping for public on server t1
 DROP USER MAPPING FOR regress_test_role SERVER s6;
-\set VERBOSITY terse
 DROP FOREIGN DATA WRAPPER foo CASCADE;
 NOTICE:  drop cascades to 5 other objects
 DROP SERVER s8 CASCADE;
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index 2e170497c9..bad5199d9e 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3503,8 +3503,8 @@ SELECT refclassid::regclass, deptype
 SAVEPOINT q;
 DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
 ERROR:  role "regress_rls_eve" cannot be dropped because some objects depend on it
-DETAIL:  target of policy p on table tbl1
-privileges for table tbl1
+DETAIL:  privileges for table tbl1
+target of policy p on table tbl1
 ROLLBACK TO q;
 ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
 SAVEPOINT q;
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 2b087be796..19fbfa8b72 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -84,32 +84,23 @@ reset enable_indexscan;
 reset enable_bitmapscan;
 
 --
--- Test B-tree page deletion. In particular, deleting a non-leaf page.
+-- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
 
--- First create a tree that's at least four levels deep. The text inserted
--- is long and poorly compressible. That way only a few index tuples fit on
--- each page, allowing us to get a tall tree with fewer pages.
+-- First create a tree that's at least three levels deep (i.e. has one level
+-- between the root and leaf levels). The text inserted is long.  It won't be
+-- compressed because we use plain storage in the table.  Only a few index
+-- tuples fit on each internal page, allowing us to get a tall tree with few
+-- pages.  (A tall tree is required to trigger caching.)
+--
+-- The text column must be the leading column in the index, since suffix
+-- truncation would otherwise truncate tuples on internal pages, leaving us
+-- with a short tree.
 create table btree_tall_tbl(id int4, t text);
-create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
-insert into btree_tall_tbl
-  select g, g::text || '_' ||
-          (select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
-from generate_series(1, 100) g;
-
--- Delete most entries, and vacuum. This causes page deletions.
-delete from btree_tall_tbl where id < 950;
-vacuum btree_tall_tbl;
-
---
--- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
--- WAL record type). This happens when a "fast root" page is split.
---
-
--- The vacuum above should've turned the leaf page into a fast root. We just
--- need to insert some rows to cause the fast root page to split.
-insert into btree_tall_tbl (id, t)
-  select g, repeat('x', 100) from generate_series(1, 500) g;
+alter table btree_tall_tbl alter COLUMN t set storage plain;
+create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
+insert into btree_tall_tbl select g, repeat('x', 250)
+from generate_series(1, 130) g;
 
 --
 -- Test vacuum_cleanup_index_scale_factor
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 67ecad8dd5..4487421ef3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1146,11 +1146,23 @@ explain (costs off)
 CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 
+--
+-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
+-- WAL record type). This happens when a "fast root" page is split.  This
+-- also creates coverage for nbtree FSM page recycling.
+--
+-- The vacuum above should've turned the leaf page into a fast root. We just
+-- need to insert some rows to cause the fast root page to split.
+INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
+
 --
 -- REINDEX (VERBOSE)
 --
diff --git a/src/test/regress/sql/foreign_data.sql b/src/test/regress/sql/foreign_data.sql
index d6fb3fae4e..1cc1f6e012 100644
--- a/src/test/regress/sql/foreign_data.sql
+++ b/src/test/regress/sql/foreign_data.sql
@@ -805,11 +805,11 @@ DROP FOREIGN TABLE foreign_part;
 DROP TABLE temp_parted;
 
 -- Cleanup
+\set VERBOSITY terse
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 DROP SERVER t1 CASCADE;
 DROP USER MAPPING FOR regress_test_role SERVER s6;
-\set VERBOSITY terse
 DROP FOREIGN DATA WRAPPER foo CASCADE;
 DROP SERVER s8 CASCADE;
 \set VERBOSITY default
-- 
2.17.1
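
To make the tiebreaker rule above concrete before moving on to the next
patch: below is a small standalone C sketch (illustration only -- it does not
use nbtree's real ItemPointerData/_bt_compare machinery, and every name in it
is a made-up stand-in) of how a comparison that exhausts the user-visible key
attributes can fall through to a heap TID comparison, with a missing
(truncated) TID treated as "minus infinity".  This mirrors the convention
that scantid is just another (optional, final) insertion scankey attribute.

/*
 * Standalone model of "heap TID as an implicit last key attribute".
 * Illustrative sketch only; simplified stand-ins for nbtree types.
 */
#include <stdio.h>

typedef struct
{
	unsigned int	block;		/* stand-in for a block number */
	unsigned short	offset;		/* stand-in for an offset number */
} TidModel;

/* Compare two TIDs; NULL means "truncated away", which sorts lowest */
static int
tid_compare(const TidModel *a, const TidModel *b)
{
	if (a == NULL && b == NULL)
		return 0;
	if (a == NULL)
		return -1;				/* minus infinity */
	if (b == NULL)
		return 1;
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Compare the user-visible key attribute first; the (optional) TID
 * tiebreaker only participates when every earlier attribute compares as
 * equal.
 */
static int
keyspace_compare(int userkey1, const TidModel *tid1,
				 int userkey2, const TidModel *tid2)
{
	if (userkey1 != userkey2)
		return (userkey1 < userkey2) ? -1 : 1;
	return tid_compare(tid1, tid2);
}

int
main(void)
{
	TidModel	a = {10, 3};
	TidModel	b = {10, 7};

	printf("%d\n", keyspace_compare(42, &a, 42, &b));	/* -1: TID decides */
	printf("%d\n", keyspace_compare(42, NULL, 42, &a)); /* -1: NULL is lowest */
	return 0;
}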

Attachment: v17-0003-Consider-secondary-factors-during-nbtree-splits.patch (application/octet-stream)
From 18d110d3e7c5fdacd297103623a03b493d1a1cb6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 15:51:53 -0700
Subject: [PATCH v17 3/7] Consider secondary factors during nbtree splits.

Teach nbtree to give some consideration to how "distinguishing"
candidate leaf page split points are.  This should not noticeably affect
the balance of free space within each half of the split, while still
making suffix truncation truncate away significantly more attributes on
average.

The logic for choosing a leaf split point now uses a fallback mode in
the case where the page is full of duplicates and it isn't possible to
find even a minimally distinguishing split point.  When the page is full
of duplicates, the split should pack the left half very tightly, while
leaving the right half mostly empty.  Our assumption is that logical
duplicates will almost always be inserted in ascending heap TID order
with v4 indexes.  This strategy leaves most of the free space on the
half of the split that will likely be where future logical duplicates of
the same value need to be placed.

The number of cycles added is not very noticeable.  This is important
because deciding on a split point takes place while at least one
exclusive buffer lock is held.  We avoid using authoritative insertion
scankey comparisons to save cycles, unlike suffix truncation proper.  We
use a faster binary comparison instead.

Note that even pg_upgrade'd v3 indexes make use of these optimizations.
Benchmarking has shown that even v3 indexes benefit, despite the fact
that suffix truncation will only truncate non-key attributes in INCLUDE
indexes.  Grouping relatively similar tuples together is beneficial in
and of itself, since it reduces the number of leaf pages that must be
accessed by subsequent index scans.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzmmoLNQOj9mAD78iQHfWLJDszHEDrAzGTUMG3mVh5xWPw@mail.gmail.com
---
 src/backend/access/nbtree/Makefile      |   2 +-
 src/backend/access/nbtree/README        |  47 +-
 src/backend/access/nbtree/nbtinsert.c   | 287 +-------
 src/backend/access/nbtree/nbtsplitloc.c | 846 ++++++++++++++++++++++++
 src/backend/access/nbtree/nbtutils.c    |  49 ++
 src/include/access/nbtree.h             |  15 +-
 6 files changed, 955 insertions(+), 291 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtsplitloc.c

diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bbb21d235c..9aab9cf64a 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = nbtcompare.o nbtinsert.o nbtpage.o nbtree.o nbtsearch.o \
-       nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
+       nbtsplitloc.o nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 40ff25fe06..ca4fdf7ac4 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -155,9 +155,9 @@ Lehman and Yao assume fixed-size keys, but we must deal with
 variable-size keys.  Therefore there is not a fixed maximum number of
 keys per page; we just stuff in as many as will fit.  When we split a
 page, we try to equalize the number of bytes, not items, assigned to
-each of the resulting pages.  Note we must include the incoming item in
-this calculation, otherwise it is possible to find that the incoming
-item doesn't fit on the split page where it needs to go!
+pages (though suffix truncation is also considered).  Note we must include
+the incoming item in this calculation, otherwise it is possible to find
+that the incoming item doesn't fit on the split page where it needs to go!
 
 The Deletion Algorithm
 ----------------------
@@ -661,6 +661,47 @@ variable-length types, such as text.  An opclass support function could
 manufacture the shortest possible key value that still correctly separates
 each half of a leaf page split.
 
+There are sophisticated criteria for choosing a leaf page split point.  The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split.  The
+choice of leaf split point can be thought of as a choice among points
+*between* items on the page to be split, at least if you pretend that the
+incoming tuple was placed on the page already (you have to pretend because
+there won't actually be enough space for it on the page).  Choosing the
+split point between two index tuples where the first non-equal attribute
+appears as early as possible results in truncating away as many suffix
+attributes as possible.  Evenly balancing space between the two halves of
+the split is usually the first concern, but even small adjustments in the
+precise split point can allow truncation to be far more effective.
+
+Suffix truncation is primarily valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective.  Even truncation that doesn't make pivot tuples
+smaller due to alignment still prevents pivot tuples from being more
+restrictive than truly necessary in how they describe which values belong
+on which pages.
+
+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page.  The chosen split point is the one that implies the
+smallest downlink to be inserted in the parent, among the split points that
+fall within an acceptable range of the fillfactor-wise optimal split point.
+This idea also comes from the Prefix B-Tree paper.  This process has much
+in common with what happens at the leaf level to make suffix truncation
+effective.  The overall
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, smaller pivot tuples
+end up in the root page, delaying root page splits.
+
+Logical duplicates are given special consideration.  The logic for
+selecting a split point goes to great lengths to avoid having duplicates
+span more than one page, and almost always manages to pick a split point
+between two user-key-distinct tuples, accepting a completely lopsided split
+if it must.  When a page that's already full of duplicates must be split,
+the fallback strategy assumes that duplicates are mostly inserted in
+ascending heap TID order.  The page is split in a way that leaves the left
+half of the page mostly full, and the right half of the page mostly empty.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 04cce2404a..82b32b2a2e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,26 +28,6 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
-typedef struct
-{
-	/* context data for _bt_checksplitloc */
-	Size		newitemsz;		/* size of new item to be inserted */
-	int			fillfactor;		/* needed when splitting rightmost page */
-	bool		is_leaf;		/* T if splitting a leaf page */
-	bool		is_rightmost;	/* T if splitting a rightmost page */
-	OffsetNumber newitemoff;	/* where the new item is to be inserted */
-	int			leftspace;		/* space available for items on left page */
-	int			rightspace;		/* space available for items on right page */
-	int			olddataitemstotal;	/* space taken by old items */
-
-	bool		have_split;		/* found a valid split? */
-
-	/* these fields valid only if have_split is true */
-	bool		newitemonleft;	/* new item on left or right of best split */
-	OffsetNumber firstright;	/* best split point */
-	int			best_delta;		/* best size delta so far */
-} FindSplitData;
-
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -74,13 +54,6 @@ static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
-static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft);
-static void _bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright, bool newitemonleft,
-				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key,
@@ -1020,7 +993,7 @@ _bt_insertonpg(Relation rel,
 
 		/* Choose the split point */
 		firstright = _bt_findsplitloc(rel, page,
-									  newitemoff, itemsz,
+									  newitemoff, itemsz, itup,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
@@ -1704,264 +1677,6 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	return rbuf;
 }
 
-/*
- *	_bt_findsplitloc() -- find an appropriate place to split a page.
- *
- * The idea here is to equalize the free space that will be on each split
- * page, *after accounting for the inserted tuple*.  (If we fail to account
- * for it, we might find ourselves with too little room on the page that
- * it needs to go into!)
- *
- * If the page is the rightmost page on its level, we instead try to arrange
- * to leave the left split page fillfactor% full.  In this way, when we are
- * inserting successively increasing keys (consider sequences, timestamps,
- * etc) we will end up with a tree whose pages are about fillfactor% full,
- * instead of the 50% full result that we'd get without this special case.
- * This is the same as nbtsort.c produces for a newly-created tree.  Note
- * that leaf and nonleaf pages use different fillfactors.
- *
- * We are passed the intended insert position of the new tuple, expressed as
- * the offsetnumber of the tuple it must go in front of.  (This could be
- * maxoff+1 if the tuple is to go at the end.)
- *
- * We return the index of the first existing tuple that should go on the
- * righthand page, plus a boolean indicating whether the new tuple goes on
- * the left or right page.  The bool is necessary to disambiguate the case
- * where firstright == newitemoff.
- */
-static OffsetNumber
-_bt_findsplitloc(Relation rel,
-				 Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft)
-{
-	BTPageOpaque opaque;
-	OffsetNumber offnum;
-	OffsetNumber maxoff;
-	ItemId		itemid;
-	FindSplitData state;
-	int			leftspace,
-				rightspace,
-				goodenough,
-				olddataitemstotal,
-				olddataitemstoleft;
-	bool		goodenoughfound;
-
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
-	newitemsz += sizeof(ItemIdData);
-
-	/* Total free space available on a btree page, after fixed overhead */
-	leftspace = rightspace =
-		PageGetPageSize(page) - SizeOfPageHeaderData -
-		MAXALIGN(sizeof(BTPageOpaqueData));
-
-	/* The right page will have the same high key as the old page */
-	if (!P_RIGHTMOST(opaque))
-	{
-		itemid = PageGetItemId(page, P_HIKEY);
-		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
-							 sizeof(ItemIdData));
-	}
-
-	/* Count up total space in data items without actually scanning 'em */
-	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
-
-	state.newitemsz = newitemsz;
-	state.is_leaf = P_ISLEAF(opaque);
-	state.is_rightmost = P_RIGHTMOST(opaque);
-	state.have_split = false;
-	if (state.is_leaf)
-		state.fillfactor = RelationGetFillFactor(rel,
-												 BTREE_DEFAULT_FILLFACTOR);
-	else
-		state.fillfactor = BTREE_NONLEAF_FILLFACTOR;
-	state.newitemonleft = false;	/* these just to keep compiler quiet */
-	state.firstright = 0;
-	state.best_delta = 0;
-	state.leftspace = leftspace;
-	state.rightspace = rightspace;
-	state.olddataitemstotal = olddataitemstotal;
-	state.newitemoff = newitemoff;
-
-	/*
-	 * Finding the best possible split would require checking all the possible
-	 * split points, because of the high-key and left-key special cases.
-	 * That's probably more work than it's worth; instead, stop as soon as we
-	 * find a "good-enough" split, where good-enough is defined as an
-	 * imbalance in free space of no more than pagesize/16 (arbitrary...) This
-	 * should let us stop near the middle on most pages, instead of plowing to
-	 * the end.
-	 */
-	goodenough = leftspace / 16;
-
-	/*
-	 * Scan through the data items and calculate space usage for a split at
-	 * each possible position.
-	 */
-	olddataitemstoleft = 0;
-	goodenoughfound = false;
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	for (offnum = P_FIRSTDATAKEY(opaque);
-		 offnum <= maxoff;
-		 offnum = OffsetNumberNext(offnum))
-	{
-		Size		itemsz;
-
-		itemid = PageGetItemId(page, offnum);
-		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
-
-		/*
-		 * Will the new item go to left or right of split?
-		 */
-		if (offnum > newitemoff)
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-		else if (offnum < newitemoff)
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		else
-		{
-			/* need to try it both ways! */
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		}
-
-		/* Abort scan once we find a good-enough choice */
-		if (state.have_split && state.best_delta <= goodenough)
-		{
-			goodenoughfound = true;
-			break;
-		}
-
-		olddataitemstoleft += itemsz;
-	}
-
-	/*
-	 * If the new item goes as the last item, check for splitting so that all
-	 * the old items go to the left page and the new item goes to the right
-	 * page.
-	 */
-	if (newitemoff > maxoff && !goodenoughfound)
-		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
-
-	/*
-	 * I believe it is not possible to fail to find a feasible split, but just
-	 * in case ...
-	 */
-	if (!state.have_split)
-		elog(ERROR, "could not find a feasible split point for index \"%s\"",
-			 RelationGetRelationName(rel));
-
-	*newitemonleft = state.newitemonleft;
-	return state.firstright;
-}
-
-/*
- * Subroutine to analyze a particular possible split choice (ie, firstright
- * and newitemonleft settings), and record the best split so far in *state.
- *
- * firstoldonright is the offset of the first item on the original page
- * that goes to the right page, and firstoldonrightsz is the size of that
- * tuple. firstoldonright can be > max offset, which means that all the old
- * items go to the left page and only the new item goes to the right page.
- * In that case, firstoldonrightsz is not used.
- *
- * olddataitemstoleft is the total size of all old items to the left of
- * firstoldonright.
- */
-static void
-_bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright,
-				  bool newitemonleft,
-				  int olddataitemstoleft,
-				  Size firstoldonrightsz)
-{
-	int			leftfree,
-				rightfree;
-	Size		firstrightitemsz;
-	bool		newitemisfirstonright;
-
-	/* Is the new item going to be the first item on the right page? */
-	newitemisfirstonright = (firstoldonright == state->newitemoff
-							 && !newitemonleft);
-
-	if (newitemisfirstonright)
-		firstrightitemsz = state->newitemsz;
-	else
-		firstrightitemsz = firstoldonrightsz;
-
-	/* Account for all the old tuples */
-	leftfree = state->leftspace - olddataitemstoleft;
-	rightfree = state->rightspace -
-		(state->olddataitemstotal - olddataitemstoleft);
-
-	/*
-	 * The first item on the right page becomes the high key of the left page;
-	 * therefore it counts against left space as well as right space. When
-	 * index has included attributes, then those attributes of left page high
-	 * key will be truncated leaving that page with slightly more free space.
-	 * However, that shouldn't affect our ability to find valid split
-	 * location, because anyway split location should exists even without high
-	 * key truncation.
-	 */
-	leftfree -= firstrightitemsz;
-
-	/* account for the new item */
-	if (newitemonleft)
-		leftfree -= (int) state->newitemsz;
-	else
-		rightfree -= (int) state->newitemsz;
-
-	/*
-	 * If we are not on the leaf level, we will be able to discard the key
-	 * data from the first item that winds up on the right page.
-	 */
-	if (!state->is_leaf)
-		rightfree += (int) firstrightitemsz -
-			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
-
-	/*
-	 * If feasible split point, remember best delta.
-	 */
-	if (leftfree >= 0 && rightfree >= 0)
-	{
-		int			delta;
-
-		if (state->is_rightmost)
-		{
-			/*
-			 * If splitting a rightmost page, try to put (100-fillfactor)% of
-			 * free space on left page. See comments for _bt_findsplitloc.
-			 */
-			delta = (state->fillfactor * leftfree)
-				- ((100 - state->fillfactor) * rightfree);
-		}
-		else
-		{
-			/* Otherwise, aim for equal free space on both sides */
-			delta = leftfree - rightfree;
-		}
-
-		if (delta < 0)
-			delta = -delta;
-		if (!state->have_split || delta < state->best_delta)
-		{
-			state->have_split = true;
-			state->newitemonleft = newitemonleft;
-			state->firstright = firstoldonright;
-			state->best_delta = delta;
-		}
-	}
-}
-
 /*
  * _bt_insert_parent() -- Insert downlink into parent after a page split.
  *
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
new file mode 100644
index 0000000000..ead218d916
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -0,0 +1,846 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsplitloc.c
+ *	  Choose split point code for Postgres btree implementation.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtsplitloc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "storage/lmgr.h"
+
+/* limits on split interval (default strategy only) */
+#define MAX_LEAF_INTERVAL			9
+#define MAX_INTERNAL_INTERVAL		18
+
+typedef enum
+{
+	/* strategy for searching through materialized list of split points */
+	SPLIT_DEFAULT,				/* give some weight to truncation */
+	SPLIT_MANY_DUPLICATES,		/* find minimally distinguishing point */
+	SPLIT_SINGLE_VALUE			/* leave left page almost full */
+} FindSplitStrat;
+
+typedef struct
+{
+	/* details of free space left by split */
+	int16		curdelta;		/* current leftfree/rightfree delta */
+	int16		leftfree;		/* space left on left page post-split */
+	int16		rightfree;		/* space left on right page post-split */
+
+	/* split point identifying fields (returned by _bt_findsplitloc) */
+	OffsetNumber firstoldonright;	/* first item on new right page */
+	bool		newitemonleft;	/* new item goes on left, or right? */
+
+} SplitPoint;
+
+typedef struct
+{
+	/* context data for _bt_recsplitloc */
+	Relation	rel;			/* index relation */
+	Page		page;			/* page undergoing split */
+	IndexTuple	newitem;		/* new item (cause of page split) */
+	Size		newitemsz;		/* size of newitem (includes line pointer) */
+	bool		is_leaf;		/* T if splitting a leaf page */
+	bool		is_rightmost;	/* T if splitting rightmost page on level */
+	OffsetNumber newitemoff;	/* where the new item is to be inserted */
+	int			leftspace;		/* space available for items on left page */
+	int			rightspace;		/* space available for items on right page */
+	int			olddataitemstotal;	/* space taken by old items */
+	Size		minfirstrightsz;	/* smallest firstoldonright tuple size */
+
+	/* candidate split point data */
+	int			maxsplits;		/* maximum number of splits */
+	int			nsplits;		/* current number of splits */
+	SplitPoint *splits;			/* all candidate split points for page */
+	int			interval;		/* current range of acceptable split points */
+} FindSplitData;
+
+static void _bt_recsplitloc(FindSplitData *state,
+				OffsetNumber firstoldonright, bool newitemonleft,
+				int olddataitemstoleft, Size firstoldonrightsz);
+static void _bt_deltasortsplits(FindSplitData *state, double fillfactormult,
+					bool usemult);
+static int	_bt_splitcmp(const void *arg1, const void *arg2);
+static OffsetNumber _bt_bestsplitloc(FindSplitData *state, int perfectpenalty,
+				 bool *newitemonleft);
+static int _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
+			 SplitPoint *rightpage, FindSplitStrat *strategy);
+static void _bt_interval_edges(FindSplitData *state,
+				   SplitPoint **leftinterval, SplitPoint **rightinterval);
+static inline int _bt_split_penalty(FindSplitData *state, SplitPoint *split);
+static inline IndexTuple _bt_split_lastleft(FindSplitData *state,
+				   SplitPoint *split);
+static inline IndexTuple _bt_split_firstright(FindSplitData *state,
+					 SplitPoint *split);
+
+
+/*
+ *	_bt_findsplitloc() -- find an appropriate place to split a page.
+ *
+ * The main goal here is to equalize the free space that will be on each
+ * split page, *after accounting for the inserted tuple*.  (If we fail to
+ * account for it, we might find ourselves with too little room on the page
+ * that it needs to go into!)
+ *
+ * If the page is the rightmost page on its level, we instead try to arrange
+ * to leave the left split page fillfactor% full.  In this way, when we are
+ * inserting successively increasing keys (consider sequences, timestamps,
+ * etc) we will end up with a tree whose pages are about fillfactor% full,
+ * instead of the 50% full result that we'd get without this special case.
+ * This is the same as nbtsort.c produces for a newly-created tree.  Note
+ * that leaf and nonleaf pages use different fillfactors.  Note also that
+ * there are a number of further special cases where fillfactor is not
+ * applied in the standard way.
+ *
+ * We are passed the intended insert position of the new tuple, expressed as
+ * the offsetnumber of the tuple it must go in front of (this could be
+ * maxoff+1 if the tuple is to go at the end).  The new tuple itself is also
+ * passed, since it's needed to give some weight to how effective suffix
+ * truncation will be.  The implementation picks the split point that
+ * maximizes the effectiveness of suffix truncation from a small list of
+ * alternative candidate split points that leave each side of the split with
+ * about the same share of free space.  Suffix truncation is secondary to
+ * equalizing free space, except in cases with large numbers of duplicates.
+ * Note that it is always assumed that the caller goes on to perform
+ * truncation, even with pg_upgrade'd indexes where that isn't actually the
+ * case (!heapkeyspace indexes).  See nbtree/README for more information about
+ * suffix truncation.
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating whether the new tuple goes on
+ * the left or right page.  The bool is necessary to disambiguate the case
+ * where firstright == newitemoff.
+ */
+OffsetNumber
+_bt_findsplitloc(Relation rel,
+				 Page page,
+				 OffsetNumber newitemoff,
+				 Size newitemsz,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	BTPageOpaque opaque;
+	int			leftspace,
+				rightspace,
+				olddataitemstotal,
+				olddataitemstoleft,
+				perfectpenalty,
+				leaffillfactor;
+	FindSplitData state;
+	FindSplitStrat strategy;
+	ItemId		itemid;
+	OffsetNumber offnum,
+				maxoff,
+				foundfirstright;
+	double		fillfactormult;
+	bool		usemult;
+	SplitPoint	leftpage,
+				rightpage;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Total free space available on a btree page, after fixed overhead */
+	leftspace = rightspace =
+		PageGetPageSize(page) - SizeOfPageHeaderData -
+		MAXALIGN(sizeof(BTPageOpaqueData));
+
+	/* The right page will have the same high key as the old page */
+	if (!P_RIGHTMOST(opaque))
+	{
+		itemid = PageGetItemId(page, P_HIKEY);
+		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
+							 sizeof(ItemIdData));
+	}
+
+	/* Count up total space in data items before actually scanning 'em */
+	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
+	leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	state.rel = rel;
+	state.page = page;
+	state.newitem = newitem;
+	state.newitemsz = newitemsz;
+	state.is_leaf = P_ISLEAF(opaque);
+	state.is_rightmost = P_RIGHTMOST(opaque);
+	state.leftspace = leftspace;
+	state.rightspace = rightspace;
+	state.olddataitemstotal = olddataitemstotal;
+	state.minfirstrightsz = SIZE_MAX;
+	state.newitemoff = newitemoff;
+
+	/*
+	 * maxsplits should never exceed maxoff because there will be at most as
+	 * many candidate split points as there are points _between_ tuples, once
+	 * you imagine that the new item is already on the original page (the
+	 * final number of splits may be slightly lower because not all points
+	 * between tuples will be legal).
+	 */
+	state.maxsplits = maxoff;
+	state.splits = palloc(sizeof(SplitPoint) * state.maxsplits);
+	state.nsplits = 0;
+
+	/*
+	 * Scan through the data items and calculate space usage for a split at
+	 * each possible position.  We start at the first data offset rather than
+	 * the second data offset to handle the "newitemoff == first data offset"
+	 * case (any other split whose firstoldonright is the first data offset
+	 * can't be legal, though, and so won't actually end up being recorded in
+	 * first loop iteration).
+	 */
+	olddataitemstoleft = 0;
+
+	for (offnum = P_FIRSTDATAKEY(opaque);
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		Size		itemsz;
+
+		itemid = PageGetItemId(page, offnum);
+		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
+
+		/*
+		 * Will the new item go to left or right of split?
+		 */
+		if (offnum > newitemoff)
+			_bt_recsplitloc(&state, offnum, true, olddataitemstoleft, itemsz);
+		else if (offnum < newitemoff)
+			_bt_recsplitloc(&state, offnum, false, olddataitemstoleft, itemsz);
+		else
+		{
+			/* may need to record a split on one or both sides of new item */
+			_bt_recsplitloc(&state, offnum, true, olddataitemstoleft, itemsz);
+			_bt_recsplitloc(&state, offnum, false, olddataitemstoleft, itemsz);
+		}
+
+		olddataitemstoleft += itemsz;
+	}
+
+	/*
+	 * If the new item goes as the last item, record the split point that
+	 * leaves all the old items on the left page, and the new item on the
+	 * right page.  This is required because a split that leaves the new item
+	 * as the firstoldonright won't have been reached within the loop.
+	 */
+	Assert(olddataitemstoleft == olddataitemstotal);
+	if (newitemoff > maxoff)
+		_bt_recsplitloc(&state, newitemoff, false, olddataitemstotal, 0);
+
+	/*
+	 * I believe it is not possible to fail to find a feasible split, but just
+	 * in case ...
+	 */
+	if (state.nsplits == 0)
+		elog(ERROR, "could not find a feasible split point for index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	/*
+	 * Start the search for a split point among the list of legal split
+	 * points.  Give primary consideration to equalizing available free space
+	 * in each half of the split initially (start with default strategy),
+	 * while applying the rightmost-page fillfactor where appropriate.  Either
+	 * of the two other fallback strategies may be required for cases with a
+	 * large number of duplicates around the original/space-optimal split point.
+	 *
+	 * Default strategy gives some weight to suffix truncation in deciding a
+	 * split point on leaf pages.  It attempts to select a split point where a
+	 * distinguishing attribute appears earlier in the new high key for the
+	 * left side of the split, in order to maximize the number of trailing
+	 * attributes that can be truncated away.  Only candidate split points
+	 * that imply an acceptable balance of free space on each side are
+	 * considered.
+	 */
+	if (!state.is_leaf)
+	{
+		/* fillfactormult only used on rightmost page */
+		usemult = state.is_rightmost;
+		fillfactormult = BTREE_NONLEAF_FILLFACTOR / 100.0;
+	}
+	else if (state.is_rightmost)
+	{
+		/* Rightmost leaf page --  fillfactormult always used */
+		usemult = true;
+		fillfactormult = leaffillfactor / 100.0;
+	}
+	else
+	{
+		/* Other leaf page.  50:50 page split. */
+		usemult = false;
+		/* fillfactormult not used, but be tidy */
+		fillfactormult = 0.50;
+	}
+
+	/*
+	 * Set an initial limit on the split interval/number of candidate split
+	 * points as appropriate.  The "Prefix B-Trees" paper refers to this as
+	 * sigma l for leaf splits and sigma b for internal ("branch") splits.
+	 * It's hard to provide a theoretical justification for the initial size
+	 * of the split interval, though it's clear that a small split interval
+	 * makes suffix truncation much more effective without noticeably
+	 * affecting space utilization over time.
+	 */
+	state.interval = Min(Max(1, state.nsplits * 0.05),
+						 state.is_leaf ? MAX_LEAF_INTERVAL :
+						 MAX_INTERNAL_INTERVAL);
+
+	/*
+	 * Save leftmost and rightmost splits for page before original ordinal
+	 * sort order is lost by delta/fillfactormult sort
+	 */
+	leftpage = state.splits[0];
+	rightpage = state.splits[state.nsplits - 1];
+
+	/* Give split points a fillfactormult-wise delta, and sort on deltas */
+	_bt_deltasortsplits(&state, fillfactormult, usemult);
+
+	/*
+	 * Determine if default strategy/split interval will produce a
+	 * sufficiently distinguishing split, or if we should change strategies.
+	 * Alternative strategies change the range of split points that are
+	 * considered acceptable (split interval), and possibly change
+	 * fillfactormult, in order to deal with pages with a large number of
+	 * duplicates gracefully.
+	 *
+	 * Pass low and high splits for the entire page (including even newitem).
+	 * These are used when the initial split interval encloses split points
+	 * that are full of duplicates, and we need to consider if it's even
+	 * possible to avoid appending a heap TID.
+	 */
+	perfectpenalty = _bt_strategy(&state, &leftpage, &rightpage, &strategy);
+
+	if (strategy == SPLIT_DEFAULT)
+	{
+		/*
+		 * Default strategy worked out (always works out with internal page).
+		 * Original split interval still stands.
+		 */
+	}
+
+	/*
+	 * Many duplicates strategy is used when a heap TID would otherwise be
+	 * appended, but the page isn't completely full of logical duplicates.
+	 *
+	 * The split interval is widened to include all legal candidate split
+	 * points.  There may be as few as two distinct values in the whole-page
+	 * split interval.  Many duplicates strategy has no hard requirements for
+	 * space utilization, though it still keeps the use of space balanced as a
+	 * non-binding secondary goal (perfect penalty is set so that the
+	 * first/lowest delta split point that avoids appending a heap TID is
+	 * used).
+	 *
+	 * Single value strategy is used when it is impossible to avoid appending
+	 * a heap TID.  It arranges to leave the left page very full.  This
+	 * maximizes space utilization in cases where tuples with the same
+	 * attribute values span many pages.  Newly inserted duplicates will tend
+	 * to have higher heap TID values, so we'll end up splitting to the right
+	 * consistently.  (Single value strategy is harmless though not
+	 * particularly useful with !heapkeyspace indexes.)
+	 */
+	else if (strategy == SPLIT_MANY_DUPLICATES)
+	{
+		Assert(state.is_leaf);
+		/* No need to resort splits -- no change in fillfactormult/deltas */
+		state.interval = state.nsplits;
+	}
+	else if (strategy == SPLIT_SINGLE_VALUE)
+	{
+		Assert(state.is_leaf);
+		/* Split near the end of the page */
+		usemult = true;
+		fillfactormult = BTREE_SINGLEVAL_FILLFACTOR / 100.0;
+		/* Resort split points with new delta */
+		_bt_deltasortsplits(&state, fillfactormult, usemult);
+		/* Appending a heap TID is unavoidable, so interval of 1 is fine */
+		state.interval = 1;
+	}
+
+	/*
+	 * Search among acceptable split points (using final split interval) for
+	 * the entry that has the lowest penalty, and is therefore expected to
+	 * maximize fan-out.  Sets *newitemonleft for us.
+	 */
+	foundfirstright = _bt_bestsplitloc(&state, perfectpenalty, newitemonleft);
+	pfree(state.splits);
+
+	return foundfirstright;
+}
+
+/*
+ * Subroutine to record a particular point between two tuples (possibly the
+ * new item) on page (ie, combination of firstright and newitemonleft
+ * settings) in *state for later analysis.  This is also a convenient point
+ * to check if the split is legal (if it isn't, it won't be recorded).
+ *
+ * firstoldonright is the offset of the first item on the original page that
+ * goes to the right page, and firstoldonrightsz is the size of that tuple.
+ * firstoldonright can be > max offset, which means that all the old items go
+ * to the left page and only the new item goes to the right page.  In that
+ * case, firstoldonrightsz is not used.
+ *
+ * olddataitemstoleft is the total size of all old items to the left of the
+ * split point that is recorded here when legal.  Should not include
+ * newitemsz, since that is handled here.
+ */
+static void
+_bt_recsplitloc(FindSplitData *state,
+				OffsetNumber firstoldonright,
+				bool newitemonleft,
+				int olddataitemstoleft,
+				Size firstoldonrightsz)
+{
+	int16		leftfree,
+				rightfree;
+	Size		firstrightitemsz;
+	bool		newitemisfirstonright;
+
+	/* Is the new item going to be the first item on the right page? */
+	newitemisfirstonright = (firstoldonright == state->newitemoff
+							 && !newitemonleft);
+
+	if (newitemisfirstonright)
+		firstrightitemsz = state->newitemsz;
+	else
+		firstrightitemsz = firstoldonrightsz;
+
+	/* Account for all the old tuples */
+	leftfree = state->leftspace - olddataitemstoleft;
+	rightfree = state->rightspace -
+		(state->olddataitemstotal - olddataitemstoleft);
+
+	/*
+	 * The first item on the right page becomes the high key of the left page;
+	 * therefore it counts against left space as well as right space (we
+	 * cannot assume that suffix truncation will make it any smaller).  When
+	 * the index has included attributes, those attributes of the left page's
+	 * high key will be truncated, leaving that page with slightly more free
+	 * space.  However, that shouldn't affect our ability to find a valid
+	 * split location, since we err in the direction of being pessimistic
+	 * about free space on the left half.  Besides, even when suffix
+	 * truncation of non-TID attributes occurs, the new high key often won't
+	 * even be a single MAXALIGN() quantum smaller than the firstright tuple
+	 * it's based on.
+	 *
+	 * If we are on the leaf level, assume that suffix truncation cannot
+	 * avoid adding a heap TID to the left half's new high key.  In practice
+	 * the new high key will often be smaller and will rarely be larger, but
+	 * conservatively assume the worst case.
+	 */
+	if (state->is_leaf)
+		leftfree -= (int16) (firstrightitemsz +
+							 MAXALIGN(sizeof(ItemPointerData)));
+	else
+		leftfree -= (int16) firstrightitemsz;
+
+	/* account for the new item */
+	if (newitemonleft)
+		leftfree -= (int16) state->newitemsz;
+	else
+		rightfree -= (int16) state->newitemsz;
+
+	/*
+	 * If we are not on the leaf level, we will be able to discard the key
+	 * data from the first item that winds up on the right page.
+	 */
+	if (!state->is_leaf)
+		rightfree += (int16) firstrightitemsz -
+			(int16) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
+
+	/* Record split if legal */
+	if (leftfree >= 0 && rightfree >= 0)
+	{
+		Assert(state->nsplits < state->maxsplits);
+
+		/* Determine smallest firstright item size on page */
+		state->minfirstrightsz = Min(state->minfirstrightsz, firstrightitemsz);
+
+		state->splits[state->nsplits].curdelta = 0;
+		state->splits[state->nsplits].leftfree = leftfree;
+		state->splits[state->nsplits].rightfree = rightfree;
+		state->splits[state->nsplits].firstoldonright = firstoldonright;
+		state->splits[state->nsplits].newitemonleft = newitemonleft;
+		state->nsplits++;
+	}
+}
+
+/*
+ * Subroutine to assign space deltas to materialized array of candidate split
+ * points based on current fillfactor, and to sort array using that fillfactor
+ */
+static void
+_bt_deltasortsplits(FindSplitData *state, double fillfactormult,
+					bool usemult)
+{
+	for (int i = 0; i < state->nsplits; i++)
+	{
+		SplitPoint *split = state->splits + i;
+		int16		delta;
+
+		if (usemult)
+			delta = fillfactormult * split->leftfree -
+				(1.0 - fillfactormult) * split->rightfree;
+		else
+			delta = split->leftfree - split->rightfree;
+
+		if (delta < 0)
+			delta = -delta;
+
+		/* Save delta */
+		split->curdelta = delta;
+	}
+
+	qsort(state->splits, state->nsplits, sizeof(SplitPoint), _bt_splitcmp);
+}
+
+/*
+ * qsort-style comparator used by _bt_deltasortsplits()
+ */
+static int
+_bt_splitcmp(const void *arg1, const void *arg2)
+{
+	SplitPoint *split1 = (SplitPoint *) arg1;
+	SplitPoint *split2 = (SplitPoint *) arg2;
+
+	if (split1->curdelta > split2->curdelta)
+		return 1;
+	if (split1->curdelta < split2->curdelta)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * Subroutine to find the "best" split point among an array of acceptable
+ * candidate split points that split the page without leaving an excessively
+ * high delta between the space left free on the left and right halves.  The
+ * "best" split point is the one with the lowest penalty among split points
+ * that fall within the current/final split interval.  Penalty is an abstract
+ * score, with a definition that varies depending on whether we're splitting a
+ * leaf page or an internal page.  See _bt_split_penalty() for details.
+ *
+ * "perfectpenalty" is assumed to be the lowest possible penalty among
+ * candidate split points.  This allows us to return early without wasting
+ * cycles on calculating the first differing attribute for all candidate
+ * splits when that clearly cannot improve our choice (or when we only want a
+ * minimally distinguishing split point, and don't want to make the split any
+ * more unbalanced than is necessary).
+ *
+ * We return the index of the first existing tuple that should go on the right
+ * page, plus a boolean indicating if new item is on left of split point.
+ */
+static OffsetNumber
+_bt_bestsplitloc(FindSplitData *state, int perfectpenalty, bool *newitemonleft)
+{
+	int			bestpenalty,
+				lowsplit;
+	int			highsplit = Min(state->interval, state->nsplits);
+
+	/* No point in calculating penalty when there's only one choice */
+	if (state->nsplits == 1)
+	{
+		*newitemonleft = state->splits[0].newitemonleft;
+		return state->splits[0].firstoldonright;
+	}
+
+	bestpenalty = INT_MAX;
+	lowsplit = 0;
+	for (int i = lowsplit; i < highsplit; i++)
+	{
+		int			penalty;
+
+		penalty = _bt_split_penalty(state, state->splits + i);
+
+		if (penalty <= perfectpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+			break;
+		}
+
+		if (penalty < bestpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+		}
+	}
+
+	*newitemonleft = state->splits[lowsplit].newitemonleft;
+	return state->splits[lowsplit].firstoldonright;
+}
+
+/*
+ * Subroutine to decide whether split should use default strategy/initial
+ * split interval, or whether it should finish splitting the page using
+ * alternative strategies (this is only possible with leaf pages).
+ *
+ * Caller uses alternative strategy (or sticks with default strategy) based
+ * on how *strategy is set here.  Return value is "perfect penalty", which is
+ * passed to _bt_bestsplitloc() as a final constraint on how far caller is
+ * willing to go to avoid appending a heap TID when using the many duplicates
+ * strategy (it also saves _bt_bestsplitloc() useless cycles).
+ */
+static int
+_bt_strategy(FindSplitData *state, SplitPoint *leftpage,
+			 SplitPoint *rightpage, FindSplitStrat *strategy)
+{
+	IndexTuple	leftmost,
+				rightmost;
+	SplitPoint *leftinterval,
+			   *rightinterval;
+	int			perfectpenalty;
+	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
+
+	/* Assume that alternative strategy won't be used for now */
+	*strategy = SPLIT_DEFAULT;
+
+	/*
+	 * Use smallest observed first right item size for entire page as perfect
+	 * penalty on internal pages.  This can save cycles in the common case
+	 * where most or all splits (not just splits within interval) have first
+	 * right tuples that are the same size.
+	 */
+	if (!state->is_leaf)
+		return state->minfirstrightsz;
+
+	/*
+	 * Use leftmost and rightmost tuples from leftmost and rightmost splits in
+	 * current split interval
+	 */
+	_bt_interval_edges(state, &leftinterval, &rightinterval);
+	leftmost = _bt_split_lastleft(state, leftinterval);
+	rightmost = _bt_split_firstright(state, rightinterval);
+
+	/*
+	 * If initial split interval can produce a split point that will at least
+	 * avoid appending a heap TID in new high key, we're done.  Finish split
+	 * with default strategy and initial split interval.
+	 */
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	if (perfectpenalty <= indnkeyatts)
+		return perfectpenalty;
+
+	/*
+	 * Work out how caller should finish split when even their "perfect"
+	 * penalty for initial/default split interval indicates that the interval
+	 * does not contain even a single split that avoids appending a heap TID.
+	 *
+	 * Use the leftmost split's lastleft tuple and the rightmost split's
+	 * firstright tuple to assess every possible split.
+	 */
+	leftmost = _bt_split_lastleft(state, leftpage);
+	rightmost = _bt_split_firstright(state, rightpage);
+
+	/*
+	 * If page (including new item) has many duplicates but is not entirely
+	 * full of duplicates, a many duplicates strategy split will be performed.
+	 * If page is entirely full of duplicates, a single value strategy split
+	 * will be performed.
+	 */
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	if (perfectpenalty <= indnkeyatts)
+	{
+		*strategy = SPLIT_MANY_DUPLICATES;
+
+		/*
+		 * Caller should choose the lowest delta split that avoids appending a
+		 * heap TID.  Maximizing the number of attributes that can be
+		 * truncated away (returning perfectpenalty when it happens to be less
+		 * than the number of key attributes in index) can result in continual
+		 * unbalanced page splits.
+		 *
+		 * Just avoiding appending a heap TID can still make splits very
+		 * unbalanced, but this is self-limiting.  When final split has a very
+		 * high delta, one side of the split will likely consist of a single
+		 * value.  If that page is split once again, then that split will
+		 * likely use the single value strategy.
+		 */
+		return indnkeyatts;
+	}
+
+	/*
+	 * Single value strategy is only appropriate with ever-increasing heap
+	 * TIDs; otherwise, original default strategy split should proceed to
+	 * avoid pathological performance.  Use page high key to infer if this is
+	 * the rightmost page among pages that store the same duplicate value.
+	 * This should not prevent insertions of heap TIDs that are slightly out
+	 * of order from using single value strategy, since that's expected with
+	 * concurrent inserters of the same duplicate value.
+	 */
+	else if (state->is_rightmost)
+		*strategy = SPLIT_SINGLE_VALUE;
+	else
+	{
+		ItemId		itemid;
+		IndexTuple	hikey;
+
+		itemid = PageGetItemId(state->page, P_HIKEY);
+		hikey = (IndexTuple) PageGetItem(state->page, itemid);
+		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
+											 state->newitem);
+		if (perfectpenalty <= indnkeyatts)
+			*strategy = SPLIT_SINGLE_VALUE;
+		else
+		{
+			/*
+			 * Have caller finish split using default strategy, since page
+			 * does not appear to be the rightmost page for duplicates of the
+			 * value the page is filled with
+			 */
+		}
+	}
+
+	return perfectpenalty;
+}
+
+/*
+ * Subroutine to locate leftmost and rightmost splits for current/default
+ * split interval.  Note that it will be the same split iff there is only one
+ * split in interval.
+ */
+static void
+_bt_interval_edges(FindSplitData *state, SplitPoint **leftinterval,
+				   SplitPoint **rightinterval)
+{
+	int			highsplit = Min(state->interval, state->nsplits);
+	SplitPoint *deltaoptimal;
+
+	deltaoptimal = state->splits;
+	*leftinterval = NULL;
+	*rightinterval = NULL;
+
+	/*
+	 * Delta is an absolute distance to optimal split point, so both the
+	 * leftmost and rightmost split point will usually be at the end of the
+	 * array
+	 */
+	for (int i = highsplit - 1; i >= 0; i--)
+	{
+		SplitPoint *distant = state->splits + i;
+
+		if (distant->firstoldonright < deltaoptimal->firstoldonright)
+		{
+			if (*leftinterval == NULL)
+				*leftinterval = distant;
+		}
+		else if (distant->firstoldonright > deltaoptimal->firstoldonright)
+		{
+			if (*rightinterval == NULL)
+				*rightinterval = distant;
+		}
+		else if (!distant->newitemonleft && deltaoptimal->newitemonleft)
+		{
+			/*
+			 * "incoming tuple will become first on right page" (distant) is
+			 * to the left of "incoming tuple will become last on left page"
+			 * (delta-optimal)
+			 */
+			Assert(distant->firstoldonright == state->newitemoff);
+			if (*leftinterval == NULL)
+				*leftinterval = distant;
+		}
+		else if (distant->newitemonleft && !deltaoptimal->newitemonleft)
+		{
+			/*
+			 * "incoming tuple will become last on left page" (distant) is to
+			 * the right of "incoming tuple will become first on right page"
+			 * (delta-optimal)
+			 */
+			Assert(distant->firstoldonright == state->newitemoff);
+			if (*rightinterval == NULL)
+				*rightinterval = distant;
+		}
+		else
+		{
+			/* There was only one or two splits in initial split interval */
+			Assert(distant == deltaoptimal);
+			if (*leftinterval == NULL)
+				*leftinterval = distant;
+			if (*rightinterval == NULL)
+				*rightinterval = distant;
+		}
+
+		if (*leftinterval && *rightinterval)
+			return;
+	}
+
+	Assert(false);
+}
+
+/*
+ * Subroutine to find penalty for caller's candidate split point.
+ *
+ * On leaf pages, penalty is the attribute number that distinguishes each side
+ * of a split.  It's the last attribute that needs to be included in new high
+ * key for left page.  It can be greater than the number of key attributes in
+ * cases where a heap TID will need to be appended during truncation.
+ *
+ * On internal pages, penalty is simply the size of the first item on the
+ * right half of the split (including item pointer overhead).  This tuple will
+ * become the new high key for the left page.
+ */
+static inline int
+_bt_split_penalty(FindSplitData *state, SplitPoint *split)
+{
+	IndexTuple	lastleftuple;
+	IndexTuple	firstrighttuple;
+
+	if (!state->is_leaf)
+	{
+		ItemId		itemid;
+
+		if (!split->newitemonleft &&
+			split->firstoldonright == state->newitemoff)
+			return state->newitemsz;
+
+		itemid = PageGetItemId(state->page, split->firstoldonright);
+
+		return MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
+	}
+
+	lastleftuple = _bt_split_lastleft(state, split);
+	firstrighttuple = _bt_split_firstright(state, split);
+
+	Assert(lastleftuple != firstrighttuple);
+	return _bt_keep_natts_fast(state->rel, lastleftuple, firstrighttuple);
+}
+
+/*
+ * Subroutine to get a lastleft IndexTuple for a split point from page
+ */
+static inline IndexTuple
+_bt_split_lastleft(FindSplitData *state, SplitPoint *split)
+{
+	ItemId		itemid;
+
+	if (split->newitemonleft && split->firstoldonright == state->newitemoff)
+		return state->newitem;
+
+	itemid = PageGetItemId(state->page,
+						   OffsetNumberPrev(split->firstoldonright));
+	return (IndexTuple) PageGetItem(state->page, itemid);
+}
+
+/*
+ * Subroutine to get a firstright IndexTuple for a split point from page
+ */
+static inline IndexTuple
+_bt_split_firstright(FindSplitData *state, SplitPoint *split)
+{
+	ItemId		itemid;
+
+	if (!split->newitemonleft && split->firstoldonright == state->newitemoff)
+		return state->newitem;
+
+	itemid = PageGetItemId(state->page, split->firstoldonright);
+	return (IndexTuple) PageGetItem(state->page, itemid);
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 12fd0a7b0d..6e4f907d9a 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -22,6 +22,7 @@
 #include "access/relscan.h"
 #include "miscadmin.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -2295,6 +2296,54 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	return keepnatts;
 }
 
+/*
+ * _bt_keep_natts_fast - fast, approximate variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location.  A naive bitwise approach to datum comparisons is used to
+ * save cycles.  This is inherently approximate, but usually provides the same
+ * answer as the authoritative approach that _bt_keep_natts takes, since the
+ * vast majority of types in Postgres cannot be equal according to any
+ * available opclass unless they're bitwise equal.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= keysz; attnum++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		att = TupleDescAttr(itupdesc, attnum - 1);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
+}
+
 /*
  *  _bt_check_natts() -- Verify tuple has expected number of attributes.
  *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a4bce7a13c..a2a5888568 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -160,11 +160,15 @@ typedef struct BTMetaPageData
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
  * a rightmost page; when splitting non-rightmost pages we try to
- * divide the data equally.
+ * divide the data equally.  When splitting a page that's entirely
+ * filled with a single value (duplicates), the effective leaf-page
+ * fillfactor is 96%, regardless of whether the page is a rightmost
+ * page.
  */
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#define BTREE_SINGLEVAL_FILLFACTOR	96
 
 /*
  *	In general, the btree code tries to localize its knowledge about
@@ -719,6 +723,13 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 
+/*
+ * prototypes for functions in nbtsplitloc.c
+ */
+extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
+				 OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+				 bool *newitemonleft);
+
 /*
  * prototypes for functions in nbtpage.c
  */
@@ -785,6 +796,8 @@ extern bool btproperty(Oid index_oid, int attno,
 		   bool *res, bool *isnull);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 			 IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
+					IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 				OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
-- 
2.17.1

v17-0004-Allow-tuples-to-be-relocated-from-root-by-amchec.patch (application/octet-stream)
From 348cae2ac9fd7732b4bf3296c1727bb257e096ee Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 31 Jan 2019 17:40:00 -0800
Subject: [PATCH v17 4/7] Allow tuples to be relocated from root by amcheck.

Teach contrib/amcheck's bt_index_parent_check() function to take
advantage of the uniqueness property of heapkeyspace indexes in support
of a new verification option: non-pivot tuples (non-highkey tuples on
the leaf level) can optionally be relocated using a new search that
starts from the root page.

The new "relocate" verification option is exhaustive, and can therefore
make a call to bt_index_parent_check() take a lot longer.  Relocating
tuples during verification is intended as an option for backend
developers, since the corruption scenarios that it alone is uniquely
capable of detecting seem fairly far-fetched.  For example, "relocate"
verification is generally the only way of detecting corruption of the
least significant byte of a key from a pivot tuple in the root page,
since only a few tuples on a cousin leaf page are liable to "be
overlooked" by index scans.

Author: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-Wz=yTWnVu+HeHGKb2AGiADL9eprn-cKYAto4MkKOuiGtRQ@mail.gmail.com
---
 contrib/amcheck/Makefile                 |   2 +-
 contrib/amcheck/amcheck--1.1--1.2.sql    |  19 ++++
 contrib/amcheck/amcheck.control          |   2 +-
 contrib/amcheck/expected/check_btree.out |   5 +-
 contrib/amcheck/sql/check_btree.sql      |   5 +-
 contrib/amcheck/verify_nbtree.c          | 138 +++++++++++++++++++++--
 doc/src/sgml/amcheck.sgml                |   7 +-
 7 files changed, 162 insertions(+), 16 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.1--1.2.sql

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index c5764b544f..dcec3b8520 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big	= amcheck
 OBJS		= verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
-DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
+DATA = amcheck--1.1--1.2.sql amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.1--1.2.sql b/contrib/amcheck/amcheck--1.1--1.2.sql
new file mode 100644
index 0000000000..de7b657f2f
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.1--1.2.sql
@@ -0,0 +1,19 @@
+/* contrib/amcheck/amcheck--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.2'" to load this file. \quit
+
+-- In order to avoid issues with dependencies when updating amcheck to 1.2,
+-- create new, overloaded version of the 1.1 function signature
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass,
+    heapallindexed boolean, relocate boolean)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want this to be available to public
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 469048403d..c6e310046d 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
 # amcheck extension
 comment = 'functions for verifying relation integrity'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/amcheck'
 relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 1e6079ddd2..687fde8fce 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -126,7 +126,8 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 (1 row)
 
 --
--- Test for multilevel page deletion/downlink present checks
+-- Test for multilevel page deletion/downlink present checks, and relocate
+-- checks
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
@@ -137,7 +138,7 @@ VACUUM delete_test_table;
 -- root"
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
-SELECT bt_index_parent_check('delete_test_table_pkey', true);
+SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
  bt_index_parent_check 
 -----------------------
  
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index 3f1e0d17ef..d33d3e6682 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -78,7 +78,8 @@ INSERT INTO bttest_multi SELECT i, i%2  FROM generate_series(1, 100000) as i;
 SELECT bt_index_parent_check('bttest_multi_idx', true);
 
 --
--- Test for multilevel page deletion/downlink present checks
+-- Test for multilevel page deletion/downlink present checks, and relocate
+-- checks
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
@@ -89,7 +90,7 @@ VACUUM delete_test_table;
 -- root"
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
-SELECT bt_index_parent_check('delete_test_table_pkey', true);
+SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
 
 --
 -- BUG #15597: must not assume consistent input toasting state when forming
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 027dfce78a..fa2f0eab9a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -75,6 +75,8 @@ typedef struct BtreeCheckState
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
 	bool		heapallindexed;
+	/* Also relocating non-pivot tuples? */
+	bool		relocate;
 	/* Per-page context */
 	MemoryContext targetcontext;
 	/* Buffer access strategy */
@@ -124,10 +126,11 @@ PG_FUNCTION_INFO_V1(bt_index_check);
 PG_FUNCTION_INFO_V1(bt_index_parent_check);
 
 static void bt_index_check_internal(Oid indrelid, bool parentcheck,
-						bool heapallindexed);
+						bool heapallindexed, bool relocate);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool heapkeyspace, bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed,
+					 bool relocate);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -140,6 +143,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 						   IndexTuple itup);
+static bool bt_relocate_from_root(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
@@ -177,7 +181,7 @@ bt_index_check(PG_FUNCTION_ARGS)
 	if (PG_NARGS() == 2)
 		heapallindexed = PG_GETARG_BOOL(1);
 
-	bt_index_check_internal(indrelid, false, heapallindexed);
+	bt_index_check_internal(indrelid, false, heapallindexed, false);
 
 	PG_RETURN_VOID();
 }
@@ -196,11 +200,14 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
 {
 	Oid			indrelid = PG_GETARG_OID(0);
 	bool		heapallindexed = false;
+	bool		relocate = false;
 
-	if (PG_NARGS() == 2)
+	if (PG_NARGS() >= 2)
 		heapallindexed = PG_GETARG_BOOL(1);
+	if (PG_NARGS() == 3)
+		relocate = PG_GETARG_BOOL(2);
 
-	bt_index_check_internal(indrelid, true, heapallindexed);
+	bt_index_check_internal(indrelid, true, heapallindexed, relocate);
 
 	PG_RETURN_VOID();
 }
@@ -209,7 +216,8 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
  * Helper for bt_index_[parent_]check, coordinating the bulk of the work.
  */
 static void
-bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
+						bool relocate)
 {
 	Oid			heapid;
 	Relation	indrel;
@@ -267,7 +275,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	/* Check index, possibly against table it is an index on */
 	heapkeyspace = _bt_heapkeyspace(indrel);
 	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
-						 heapallindexed);
+						 heapallindexed, relocate);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -338,7 +346,7 @@ btree_index_checkable(Relation rel)
  */
 static void
 bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
-					 bool readonly, bool heapallindexed)
+					 bool readonly, bool heapallindexed, bool relocate)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -362,6 +370,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
+	state->relocate = relocate;
 
 	if (state->heapallindexed)
 	{
@@ -430,6 +439,14 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		}
 	}
 
+	Assert(!state->relocate || state->readonly);
+	if (state->relocate && !state->heapkeyspace)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("index \"%s\" does not support relocating tuples",
+						RelationGetRelationName(rel)),
+				 errhint("Only indexes initialized on PostgreSQL 12 support relocation verification.")));
+
 	/* Create context for page */
 	state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
 												 "amcheck context",
@@ -922,6 +939,32 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
+		/*
+		 * Readonly callers may optionally relocate non-pivot tuples for
+		 * heapkeyspace indexes.  A new search starting from the root
+		 * relocates every current entry in turn.
+		 */
+		if (state->relocate && P_ISLEAF(topaque) &&
+			!bt_relocate_from_root(state, itup))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("could not relocate tuple in index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+										itid, htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1526,6 +1569,9 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 		 * internal pages.  In more general terms, a negative infinity item is
 		 * only negative infinity with respect to the subtree that the page is
 		 * at the root of.
+		 *
+		 * See also: bt_relocate_from_root(), which can even detect transitive
+		 * inconsistencies on cousin leaf pages.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
@@ -1926,6 +1972,82 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Search for itup in index, starting from fast root page.  itup must be a
+ * non-pivot tuple.  This is only supported with heapkeyspace indexes, since
+ * we rely on having fully unique keys to relocate itup without visiting more
+ * than one page on each level, barring an interrupted page split, where we
+ * may have to move right.  (A concurrent page split is impossible because
+ * caller must be readonly caller.)
+ *
+ * This routine can detect very subtle transitive consistency issues across
+ * more than one level of the tree.  Leaf pages all have a high key (even the
+ * rightmost page has a conceptual positive infinity high key), but not a low
+ * key.  Their downlink in parent is a lower bound, which along with the high
+ * key is almost enough to detect every possible inconsistency.  A downlink
+ * separator key value won't always be available from parent, though, because
+ * the first items of internal pages are negative infinity items, truncated
+ * down to zero attributes during internal page splits.  While it's true that
+ * bt_downlink_check() and the high key check can detect most imaginable key
+ * space problems, there are remaining problems it won't detect with non-pivot
+ * tuples in cousin leaf pages.  Starting a search from the root for every
+ * existing leaf tuple detects small inconsistencies in upper levels of the
+ * tree that cannot be detected any other way.  (Besides all this, it's
+ * probably a useful testing strategy to exhaustively verify that all
+ * non-pivot tuples can be relocated in the index using the same code paths as
+ * those used by index scans.)
+ */
+static bool
+bt_relocate_from_root(BtreeCheckState *state, IndexTuple itup)
+{
+	BTScanInsert key;
+	BTStack		stack;
+	Buffer		lbuf;
+	bool		exists;
+
+	key = _bt_mkscankey(state->rel, itup);
+	Assert(key->heapkeyspace && key->scantid != NULL);
+
+	/*
+	 * Search from root.
+	 *
+	 * Ideally, we would arrange to only move right within _bt_search() when
+	 * an interrupted page split is detected (i.e. when the incomplete split
+	 * bit is found to be set), but for now we accept the possibility that
+	 * that could conceal an inconsistency.
+	 */
+	Assert(state->readonly && state->relocate);
+	exists = false;
+	stack = _bt_search(state->rel, key, &lbuf, BT_READ, NULL);
+
+	if (BufferIsValid(lbuf))
+	{
+		BTInsertStateData insertstate;
+		OffsetNumber offnum;
+		Page		page;
+
+		insertstate.itup = itup;
+		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+		insertstate.itup_key = key;
+		insertstate.bounds_valid = false;
+		insertstate.buf = lbuf;
+
+		/* Get matching tuple on leaf page */
+		offnum = _bt_binsrch_insert(state->rel, &insertstate);
+		/* Compare first >= matching item on leaf page, if any */
+		page = BufferGetPage(lbuf);
+		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			_bt_compare(state->rel, key, page, offnum) == 0)
+			exists = true;
+		_bt_relbuf(state->rel, lbuf);
+	}
+
+	_bt_freestack(stack);
+	pfree(key);
+
+	return exists;
+}
+
 /*
  * Is particular offset within page (whose special state is passed by caller)
  * the page negative-infinity item?
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 8bb60d5c2d..c638456638 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -112,7 +112,7 @@ ORDER BY c.relpages DESC LIMIT 10;
 
    <varlistentry>
     <term>
-     <function>bt_index_parent_check(index regclass, heapallindexed boolean) returns void</function>
+     <function>bt_index_parent_check(index regclass, heapallindexed boolean, relocate boolean) returns void</function>
      <indexterm>
       <primary>bt_index_parent_check</primary>
      </indexterm>
@@ -126,7 +126,10 @@ ORDER BY c.relpages DESC LIMIT 10;
       argument is <literal>true</literal>, the function verifies the
       presence of all heap tuples that should be found within the
       index, and that there are no missing downlinks in the index
-      structure.  The checks that can be performed by
+      structure.  When the optional <parameter>relocate</parameter>
+      argument is <literal>true</literal>, verification relocates
+      tuples on the leaf level by performing a new search from the
+      root page.  The checks that can be performed by
       <function>bt_index_parent_check</function> are a superset of the
       checks that can be performed by <function>bt_index_check</function>.
       <function>bt_index_parent_check</function> can be thought of as
-- 
2.17.1

v17-0006-Add-high-key-continuescan-optimization.patch (application/octet-stream)
From f9667567c2ed63ad2a5588611f77315013a35140 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 14 Mar 2019 12:01:48 -0700
Subject: [PATCH v17 6/7] Add high key "continuescan" optimization.

Teach B-Tree forward index scans to check the high key before moving to
the next page in the hopes of finding that it isn't actually necessary
to move to the next page.  We already opportunistically force a key
check of the final non-pivot item, even when it's clear that it cannot
be returned to the scan due to being dead-to-all, for the same reason.
Since forcing the final item to be key checked no longer makes any
difference in the case of forward scans, the existing extra key check is
now only used for backwards scans (where the final non-pivot tuple is
the first non-pivot tuple on the page).  Like the existing check, the
new check won't always work out, but that seems like an acceptable price
to pay.

The new approach is more effective than just checking non-pivot tuples,
especially with composite indexes and non-unique indexes.  Recent
improvements to the logic for picking a split point make it likely that
relatively dissimilar high keys will appear on a page.  A distinguishing
key value that only appears on non-pivot tuples on the page to the right
will often be present in leaf high keys.

Note that even pg_upgrade'd v3 indexes make use of this optimization.

Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzkOmUduME31QnuTFpimejuQoiZ-HOf0pOWeFZNhTMctvA@mail.gmail.com
---
 src/backend/access/nbtree/nbtsearch.c |  89 ++++++++++++++++++--
 src/backend/access/nbtree/nbtutils.c  | 113 +++++++++++---------------
 src/include/access/nbtree.h           |   5 +-
 3 files changed, 130 insertions(+), 77 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 3200bbad30..740c34092c 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1404,8 +1404,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	OffsetNumber minoff;
 	OffsetNumber maxoff;
 	int			itemIndex;
-	IndexTuple	itup;
 	bool		continuescan;
+	int			indnatts;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1425,6 +1425,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
 	}
 
+	continuescan = true;		/* default assumption */
+	indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1466,23 +1468,60 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		while (offnum <= maxoff)
 		{
-			itup = _bt_checkkeys(scan, page, offnum, dir, &continuescan);
-			if (itup != NULL)
+			ItemId		iid = PageGetItemId(page, offnum);
+			IndexTuple	itup;
+
+			/*
+			 * If the scan specifies not to return killed tuples, then we
+			 * treat a killed tuple as not passing the qual.  It's a win to
+			 * not bother examining the tuple's index keys, but just skip to
+			 * the next tuple.
+			 */
+			if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+			{
+				offnum = OffsetNumberNext(offnum);
+				continue;
+			}
+
+			itup = (IndexTuple) PageGetItem(page, iid);
+
+			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
 				_bt_saveitem(so, itemIndex, offnum, itup);
 				itemIndex++;
 			}
+			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
-			{
-				/* there can't be any more matches, so stop */
-				so->currPos.moreRight = false;
 				break;
-			}
 
 			offnum = OffsetNumberNext(offnum);
 		}
 
+		/*
+		 * We don't need to visit page to the right when the high key
+		 * indicates that no more matches will be found there.
+		 *
+		 * Checking the high key like this works out more often than you might
+		 * think.  Leaf page splits pick a split point between the two most
+		 * dissimilar tuples (this is weighed against the need to evenly share
+		 * free space).  Leaf pages with high key attribute values that can
+		 * only appear on non-pivot tuples on the right sibling page are
+		 * common.
+		 */
+		if (continuescan && !P_RIGHTMOST(opaque))
+		{
+			ItemId		iid = PageGetItemId(page, P_HIKEY);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, iid);
+			int			truncatt;
+
+			truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+			_bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+		}
+
+		if (!continuescan)
+			so->currPos.moreRight = false;
+
 		Assert(itemIndex <= MaxIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
@@ -1497,8 +1536,40 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		while (offnum >= minoff)
 		{
-			itup = _bt_checkkeys(scan, page, offnum, dir, &continuescan);
-			if (itup != NULL)
+			ItemId		iid = PageGetItemId(page, offnum);
+			IndexTuple	itup;
+			bool		tuple_alive;
+			bool		passes_quals;
+
+			/*
+			 * If the scan specifies not to return killed tuples, then we
+			 * treat a killed tuple as not passing the qual.  Most of the
+			 * time, it's a win to not bother examining the tuple's index
+			 * keys, but just skip to the next tuple (previous, actually,
+			 * since we're scanning backwards).  However, if this is the first
+			 * tuple on the page, we do check the index keys, to prevent
+			 * uselessly advancing to the page to the left.  This is similar
+			 * to the high key optimization used by forward scans.
+			 */
+			if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+			{
+				Assert(offnum >= P_FIRSTDATAKEY(opaque));
+				if (offnum > P_FIRSTDATAKEY(opaque))
+				{
+					offnum = OffsetNumberPrev(offnum);
+					continue;
+				}
+
+				tuple_alive = false;
+			}
+			else
+				tuple_alive = true;
+
+			itup = (IndexTuple) PageGetItem(page, iid);
+
+			passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
+										 &continuescan);
+			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
 				itemIndex--;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 6e4f907d9a..5ab55a76b1 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -48,7 +48,7 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
 static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
-					 IndexTuple tuple, TupleDesc tupdesc,
+					 IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
 static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
 			   IndexTuple firstright, BTScanInsert itup_key);
@@ -1333,73 +1333,35 @@ _bt_mark_scankey_required(ScanKey skey)
 /*
  * Test whether an indextuple satisfies all the scankey conditions.
  *
- * If so, return the address of the index tuple on the index page.
- * If not, return NULL.
+ * Return true if so, false if not.  If the tuple fails to pass the qual,
+ * we also determine whether there's any need to continue the scan beyond
+ * this tuple, and set *continuescan accordingly.  See comments for
+ * _bt_preprocess_keys(), above, about how this is done.
  *
- * If the tuple fails to pass the qual, we also determine whether there's
- * any need to continue the scan beyond this tuple, and set *continuescan
- * accordingly.  See comments for _bt_preprocess_keys(), above, about how
- * this is done.
+ * Forward scan callers can pass a high key tuple in the hopes of having
+ * us set *continuescan to false, and avoiding an unnecessary visit to
+ * the page to the right.
  *
  * scan: index scan descriptor (containing a search-type scankey)
- * page: buffer page containing index tuple
- * offnum: offset number of index tuple (must be a valid item!)
+ * tuple: index tuple to test
+ * tupnatts: number of attributes in tuple (high key may be truncated)
  * dir: direction we are scanning in
  * continuescan: output parameter (will be set correctly in all cases)
- *
- * Caller must hold pin and lock on the index page.
  */
-IndexTuple
-_bt_checkkeys(IndexScanDesc scan,
-			  Page page, OffsetNumber offnum,
+bool
+_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			  ScanDirection dir, bool *continuescan)
 {
-	ItemId		iid = PageGetItemId(page, offnum);
-	bool		tuple_alive;
-	IndexTuple	tuple;
 	TupleDesc	tupdesc;
 	BTScanOpaque so;
 	int			keysz;
 	int			ikey;
 	ScanKey		key;
 
+	Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
 	*continuescan = true;		/* default assumption */
 
-	/*
-	 * If the scan specifies not to return killed tuples, then we treat a
-	 * killed tuple as not passing the qual.  Most of the time, it's a win to
-	 * not bother examining the tuple's index keys, but just return
-	 * immediately with continuescan = true to proceed to the next tuple.
-	 * However, if this is the last tuple on the page, we should check the
-	 * index keys to prevent uselessly advancing to the next page.
-	 */
-	if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
-	{
-		/* return immediately if there are more tuples on the page */
-		if (ScanDirectionIsForward(dir))
-		{
-			if (offnum < PageGetMaxOffsetNumber(page))
-				return NULL;
-		}
-		else
-		{
-			BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-			if (offnum > P_FIRSTDATAKEY(opaque))
-				return NULL;
-		}
-
-		/*
-		 * OK, we want to check the keys so we can set continuescan correctly,
-		 * but we'll return NULL even if the tuple passes the key tests.
-		 */
-		tuple_alive = false;
-	}
-	else
-		tuple_alive = true;
-
-	tuple = (IndexTuple) PageGetItem(page, iid);
-
 	tupdesc = RelationGetDescr(scan->indexRelation);
 	so = (BTScanOpaque) scan->opaque;
 	keysz = so->numberOfKeys;
@@ -1410,13 +1372,25 @@ _bt_checkkeys(IndexScanDesc scan,
 		bool		isNull;
 		Datum		test;
 
-		Assert(key->sk_attno <= BTreeTupleGetNAtts(tuple, scan->indexRelation));
+		if (key->sk_attno > tupnatts)
+		{
+			/*
+			 * This attribute is truncated (must be high key).  The value for
+			 * this attribute in the first non-pivot tuple on the page to the
+			 * right could be any possible value.  Assume that truncated
+			 * attribute passes the qual.
+			 */
+			Assert(ScanDirectionIsForward(dir));
+			continue;
+		}
+
 		/* row-comparison keys need special processing */
 		if (key->sk_flags & SK_ROW_HEADER)
 		{
-			if (_bt_check_rowcompare(key, tuple, tupdesc, dir, continuescan))
+			if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+									 continuescan))
 				continue;
-			return NULL;
+			return false;
 		}
 
 		datum = index_getattr(tuple,
@@ -1454,7 +1428,7 @@ _bt_checkkeys(IndexScanDesc scan,
 			/*
 			 * In any case, this indextuple doesn't match the qual.
 			 */
-			return NULL;
+			return false;
 		}
 
 		if (isNull)
@@ -1495,7 +1469,7 @@ _bt_checkkeys(IndexScanDesc scan,
 			/*
 			 * In any case, this indextuple doesn't match the qual.
 			 */
-			return NULL;
+			return false;
 		}
 
 		test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
@@ -1523,16 +1497,12 @@ _bt_checkkeys(IndexScanDesc scan,
 			/*
 			 * In any case, this indextuple doesn't match the qual.
 			 */
-			return NULL;
+			return false;
 		}
 	}
 
-	/* Check for failure due to it being a killed tuple. */
-	if (!tuple_alive)
-		return NULL;
-
 	/* If we get here, the tuple passes all index quals. */
-	return tuple;
+	return true;
 }
 
 /*
@@ -1545,8 +1515,8 @@ _bt_checkkeys(IndexScanDesc scan,
  * This is a subroutine for _bt_checkkeys, which see for more info.
  */
 static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
-					 ScanDirection dir, bool *continuescan)
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
 {
 	ScanKey		subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
 	int32		cmpresult = 0;
@@ -1563,6 +1533,19 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
 
 		Assert(subkey->sk_flags & SK_ROW_MEMBER);
 
+		if (subkey->sk_attno > tupnatts)
+		{
+			/*
+			 * This attribute is truncated (must be high key).  The value for
+			 * this attribute in the first non-pivot tuple on the page to the
+			 * right could be any possible value.  Assume that truncated
+			 * attribute passes the qual.
+			 */
+			Assert(ScanDirectionIsForward(dir));
+			cmpresult = 0;
+			continue;
+		}
+
 		datum = index_getattr(tuple,
 							  subkey->sk_attno,
 							  tupdesc,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a2a5888568..53c1a691f7 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -780,9 +780,8 @@ extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
 extern void _bt_mark_array_keys(IndexScanDesc scan);
 extern void _bt_restore_array_keys(IndexScanDesc scan);
 extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern IndexTuple _bt_checkkeys(IndexScanDesc scan,
-			  Page page, OffsetNumber offnum,
-			  ScanDirection dir, bool *continuescan);
+extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
+			  int tupnatts, ScanDirection dir, bool *continuescan);
 extern void _bt_killitems(IndexScanDesc scan);
 extern BTCycleId _bt_vacuum_cycleid(Relation rel);
 extern BTCycleId _bt_start_vacuum(Relation rel);
-- 
2.17.1

v17-0007-DEBUG-Add-pageinspect-instrumentation.patch (application/octet-stream)
From dfd8a953be512658f0341d247d48679bde294362 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v17 7/7] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query can be used with this hacked pageinspect, which
visualizes the internal pages:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 68 +++++++++++++++----
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 ++++++
 3 files changed, 79 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..95c81c0808 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -254,9 +256,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[7];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -265,6 +267,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer htid;
+	BTPageOpaque opaque;
 
 	id = PageGetItemId(page, offset);
 
@@ -283,16 +287,53 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = natts;
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		indnkeyatts = rel->rd_index->indnkeyatts;
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (P_ISLEAF(opaque) && offset >= P_FIRSTDATAKEY(opaque))
+		htid = &itup->t_tid;
+	else if (_bt_heapkeyspace(rel))
+		htid = BTreeTupleGetHeapTID(itup);
+	else
+		htid = NULL;
+
+	if (htid)
+		values[j] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(htid),
+							 ItemPointerGetOffsetNumberNoCheck(htid));
+	else
+		values[j] = NULL;
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -366,11 +407,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -397,12 +438,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -482,7 +524,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1

#104 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#103)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 16/03/2019 06:16, Peter Geoghegan wrote:

On Thu, Mar 14, 2019 at 2:21 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

It doesn't matter how often it happens, the code still needs to deal
with it. So let's try to make it as readable as possible.

Well, IMHO holding the buffer and the bounds in the new struct is more
clean than the savebinsrch/restorebinsrch flags. That's exactly why I
suggested it. I don't know what else to suggest. I haven't done any
benchmarking, but I doubt there's any measurable difference.

Fair enough. Attached is v17, which does it using the approach taken
in your earlier prototype. I even came around to your view on
_bt_binsrch_insert() -- I kept that part, too. Note, however, that I
still pass checkingunique to _bt_findinsertloc(), because that's a
distinct condition to whether or not bounds were cached (they happen
to be the same thing right now, but I don't want to assume that).

This revision also integrates your approach to the "continuescan"
optimization patch, with the small tweak I mentioned yesterday (we
also pass ntupatts). I also prefer this approach.

Great, thank you!

It would be nice if you could take a look at the amcheck "relocate"
patch

When I started looking at this, I thought that "relocate" means "move".
So I thought that the new mode would actually move tuples, i.e. that it
would modify the index. That sounded horrible. Of course, it doesn't
actually do that. It just checks that each tuple can be re-found, or
"relocated", by descending the tree from the root. I'd suggest changing
the language to avoid that confusion.

It seems like a nice way to catch all kinds of index corruption issues.
Although, we already check that the tuples are in order within the page.
Is it really necessary to traverse the tree for every tuple, as well?
Maybe do it just for the first and last item?

+ * This routine can detect very subtle transitive consistency issues across
+ * more than one level of the tree.  Leaf pages all have a high key (even the
+ * rightmost page has a conceptual positive infinity high key), but not a low
+ * key.  Their downlink in parent is a lower bound, which along with the high
+ * key is almost enough to detect every possible inconsistency.  A downlink
+ * separator key value won't always be available from parent, though, because
+ * the first items of internal pages are negative infinity items, truncated
+ * down to zero attributes during internal page splits.  While it's true that
+ * bt_downlink_check() and the high key check can detect most imaginable key
+ * space problems, there are remaining problems it won't detect with non-pivot
+ * tuples in cousin leaf pages.  Starting a search from the root for every
+ * existing leaf tuple detects small inconsistencies in upper levels of the
+ * tree that cannot be detected any other way.  (Besides all this, it's
+ * probably a useful testing strategy to exhaustively verify that all
+ * non-pivot tuples can be relocated in the index using the same code paths as
+ * those used by index scans.)

I don't understand this. Can you give an example of this kind of
inconsistency?

- Heikki

#105 Peter Geoghegan
pg@bowt.ie
In reply to: Heikki Linnakangas (#104)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

It would be nice if you could take a look at the amcheck "relocate"
patch

When I started looking at this, I thought that "relocate" means "move".
So I thought that the new mode would actually move tuples, i.e. that it
would modify the index. That sounded horrible. Of course, it doesn't
actually do that. It just checks that each tuple can be re-found, or
"relocated", by descending the tree from the root. I'd suggest changing
the language to avoid that confusion.

Okay. What do you suggest? :-)

It seems like a nice way to catch all kinds of index corruption issues.
Although, we already check that the tuples are in order within the page.
Is it really necessary to traverse the tree for every tuple, as well?
Maybe do it just for the first and last item?

It's mainly intended as a developer option. I want it to be possible
to detect any form of corruption, however unlikely. It's an
adversarial mindset that will at least make me less nervous about the
patch.

I don't understand this. Can you give an example of this kind of
inconsistency?

The commit message gives an example, but I suggest trying it out for
yourself. Corrupt the least significant key byte of a root page of a
B-Tree using pg_hexedit. Say it's an index on a text column, then
you'd corrupt the last byte to be something slightly wrong. Then, the
only way to catch it is with "relocate" verification. You'll only miss
a few tuples on a cousin page at the leaf level (those on either side
of the high key that the corrupted separator key in the root was
originally copied from).

The regular checks won't catch this, because the keys are similar
enough one level down. The "minus infinity" item is a kind of a blind
spot, because we cannot do a parent check of its children, because we
don't have the key (it's truncated when the item becomes a right page
minus infinity item, during an internal page split).
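
To make this concrete, here is a rough sketch of the verification calls
involved once that byte has been corrupted (the index name is made up,
and this assumes the 0004 patch is applied with amcheck at version 1.2):

    CREATE EXTENSION IF NOT EXISTS amcheck;
    -- existing check: ordering, downlinks, heapallindexed
    SELECT bt_index_parent_check('my_text_idx', true);
    -- exhaustive check: also re-finds every leaf tuple from the root
    SELECT bt_index_parent_check('my_text_idx', true, true);

Only the second call should report the problem, by raising the new
"could not relocate tuple in index" error.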

--
Peter Geoghegan

#106Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#105)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 16/03/2019 10:51, Peter Geoghegan wrote:

On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

It would be nice if you could take a look at the amcheck "relocate"
patch

When I started looking at this, I thought that "relocate" means "move".
So I thought that the new mode would actually move tuples, i.e. that it
would modify the index. That sounded horrible. Of course, it doesn't
actually do that. It just checks that each tuple can be re-found, or
"relocated", by descending the tree from the root. I'd suggest changing
the language to avoid that confusion.

Okay. What do you suggest? :-)

Hmm. "re-find", maybe? We use that term in a few error messages already,
to mean something similar.

It seems like a nice way to catch all kinds of index corruption issues.
Although, we already check that the tuples are in order within the page.
Is it really necessary to traverse the tree for every tuple, as well?
Maybe do it just for the first and last item?

It's mainly intended as a developer option. I want it to be possible
to detect any form of corruption, however unlikely. It's an
adversarial mindset that will at least make me less nervous about the
patch.

Fair enough.

At first, I thought this would be horrendously expensive, but thinking
about it a bit more, nearby tuples will always follow the same path
through the upper nodes, so it'll all be cached. So maybe it's not quite
so bad.

I don't understand this. Can you give an example of this kind of
inconsistency?

The commit message gives an example, but I suggest trying it out for
yourself. Corrupt the least significant key byte of a root page of a
B-Tree using pg_hexedit. Say it's an index on a text column, then
you'd corrupt the last byte to be something slightly wrong. Then, the
only way to catch it is with "relocate" verification. You'll only miss
a few tuples on a cousin page at the leaf level (those on either side
of the high key that the corrupted separator key in the root was
originally copied from).

The regular checks won't catch this, because the keys are similar
enough one level down. The "minus infinity" item is a kind of a blind
spot, because we cannot do a parent check of its children, because we
don't have the key (it's truncated when the item becomes a right page
minus infinity item, during an internal page split).

Hmm. So, the initial situation would be something like this:

+-----------+
| 1: root   |
|           |
| -inf -> 2 |
| 20 -> 3   |
|           |
+-----------+

+-------------+ +-------------+
| 2: internal | | 3: internal |
|             | |             |
| -inf -> 4   | | -inf -> 6   |
| 10 -> 5     | | 30 -> 7     |
|             | |             |
| hi: 20      | |             |
+-------------+ +-------------+

+---------+ +---------+ +---------+ +---------+
| 4: leaf | | 5: leaf | | 6: leaf | | 7: leaf |
|         | |         | |         | |         |
| 1       | | 11      | | 21      | | 31      |
| 5       | | 15      | | 25      | | 35      |
| 9       | | 19      | | 29      | | 39      |
|         | |         | |         | |         |
| hi: 10  | | hi: 20  | | hi: 30  | |         |
+---------+ +---------+ +---------+ +---------+

Then, a cosmic ray changes the 20 on the root page to 18. That causes
the leaf tuple 19 to become non-re-findable; if you descend the tree
for 19, you'll incorrectly land on page 6. But it also causes the high key
on page 2 to be different from the downlink on the root page. Wouldn't
the existing checks catch this?

- Heikki

In reply to: Heikki Linnakangas (#106)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Hmm. "re-find", maybe? We use that term in a few error messages already,
to mean something similar.

WFM.

At first, I thought this would be horrendously expensive, but thinking
about it a bit more, nearby tuples will always follow the same path
through the upper nodes, so it'll all be cached. So maybe it's not quite
so bad.

That's deliberate, though you could call bt_relocate_from_root() from
anywhere else if you wanted to. It's a bit like a big nested loop
join, where the inner side has locality.

Then, a cosmic ray changes the 20 on the root page to 18. That causes
the leaf tuple 19 to become non-re-findable; if you descend the tree
for 19, you'll incorrectly land on page 6. But it also causes the high key
on page 2 to be different from the downlink on the root page. Wouldn't
the existing checks catch this?

No, the existing checks will not check that. I suppose something
closer to the existing approach *could* detect this issue, by making
sure that the "seam of identical high keys" from the root to the leaf
is consistent, but we don't use the high key outside of its own page.
Besides, there is something useful about having the code actually rely
on _bt_search().
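
To make that concrete, the per-tuple check boils down to something like
the sketch below. The function name is invented, it leans on the
BTScanInsert / _bt_binsrch_insert() interface from the first patch in
this series (plus the scantid field that the main patch adds), and it
omits the error reporting and interrupted-page-split handling that real
code needs:

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufmgr.h"

/*
 * Sketch only: try to re-find a single leaf tuple by descending from the
 * root, using the same _bt_search() path as ordinary index scans.
 */
static bool
bt_rootdescend_sketch(Relation rel, IndexTuple itup)
{
	BTInsertStateData insertstate;
	BTScanInsert itup_key;
	BTStack		stack;
	Buffer		lbuf;
	bool		found = false;

	/* Build an insertion scankey from the leaf tuple being verified */
	itup_key = _bt_mkscankey(rel, itup);
	/* Heap TID is the final tiebreaker column (main patch may set this) */
	itup_key->scantid = &itup->t_tid;

	/* Descend from the root, just as an index scan or insertion would */
	stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);

	if (BufferIsValid(lbuf))
	{
		Page		page = BufferGetPage(lbuf);
		OffsetNumber offnum;

		/* Binary search on the leaf page that the descent landed on */
		insertstate.itup = itup;
		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
		insertstate.itup_key = itup_key;
		insertstate.bounds_valid = false;
		insertstate.buf = lbuf;
		offnum = _bt_binsrch_insert(rel, &insertstate);

		/* Tuple counts as re-found only if we landed on an equal item */
		if (offnum <= PageGetMaxOffsetNumber(page) &&
			_bt_compare(rel, itup_key, page, offnum) == 0)
			found = true;

		_bt_relbuf(rel, lbuf);
	}

	if (stack)
		_bt_freestack(stack);
	pfree(itup_key);

	return found;
}

The point is that the relocating is done by the very same _bt_search()
code that scans and insertions rely on, so a clean pass says that every
existing leaf tuple really is reachable through the tree as it stands.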

There are other similar cases, where it's not obvious how you can do
verification without either 1) crossing multiple levels, or 2)
retaining a "low key" as well as a high key (this is what Goetz Graefe
calls "retaining fence keys to solve the cousin verification problem"
in Modern B-Tree Techniques). What if the corruption was in the leaf
page 6 from your example, which had a 20 at the start? We wouldn't
have compared the downlink from the parent to the child, because leaf
page 6 is the leftmost child, and so we only have "-inf". The lower
bound actually comes from the root page, because we truncate "-inf"
attributes during page splits, even though we don't have to. Most of
the time they're not "absolute minus infinity" -- they're "minus
infinity in this subtree".

Maybe you could actually do something with the high key from leaf page
5 to detect the stray value "20" in leaf page 6, but again, we don't
do anything like that right now.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#107)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sat, Mar 16, 2019 at 9:55 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Hmm. "re-find", maybe? We use that term in a few error messages already,
to mean something similar.

WFM.

Actually, how about "rootsearch", or "rootdescend"? You're supposed to
hyphenate "re-find", and so it doesn't really work as a function
argument name.

--
Peter Geoghegan

#109Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#108)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 16/03/2019 19:32, Peter Geoghegan wrote:

On Sat, Mar 16, 2019 at 9:55 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Hmm. "re-find", maybe? We use that term in a few error messages already,
to mean something similar.

WFM.

Actually, how about "rootsearch", or "rootdescend"? You're supposed to
hyphenate "re-find", and so it doesn't really work as a function
argument name.

Works for me.

- Heikki

#110Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#107)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 16/03/2019 18:55, Peter Geoghegan wrote:

On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Then, a cosmic ray changes the 20 on the root page to 18. That causes
the leaf tuple 19 to become non-re-findable; if you descend the tree
for 19, you'll incorrectly land on page 6. But it also causes the high key
on page 2 to be different from the downlink on the root page. Wouldn't
the existing checks catch this?

No, the existing checks will not check that. I suppose something
closer to the existing approach *could* detect this issue, by making
sure that the "seam of identical high keys" from the root to the leaf
is consistent, but we don't use the high key outside of its own page.
Besides, there is something useful about having the code actually rely
on _bt_search().

There are other similar cases, where it's not obvious how you can do
verification without either 1) crossing multiple levels, or 2)
retaining a "low key" as well as a high key (this is what Goetz Graefe
calls "retaining fence keys to solve the cousin verification problem"
in Modern B-Tree Techniques). What if the corruption was in leaf
page 6 from your example, so that it had a stray 20 at the start? We wouldn't
have compared the downlink from the parent to the child, because leaf
page 6 is the leftmost child, and so we only have "-inf". The lower
bound actually comes from the root page, because we truncate "-inf"
attributes during page splits, even though we don't have to. Most of
the time they're not "absolute minus infinity" -- they're "minus
infinity in this subtree".

AFAICS, there is a copy of every page's high key in its immediate
parent. Either in the downlink of the right sibling, or as the high key
of the parent page itself. Cross-checking those would catch any
corruption in high keys.

Note that this would potentially catch some corruption that the
descend-from-root check would not. If you have a mismatch between the
high key of a leaf page and its parent or grandparent, all the current
tuples might still pass the rootdescend check. But a tuple might get
inserted into the wrong location later.

Maybe you could actually do something with the high key from leaf page
5 to detect the stray value "20" in leaf page 6, but again, we don't
do anything like that right now.

Hmm, yeah, to check for stray values, you could follow the left-link,
get the high key of the left sibling, and compare against that.

- Heikki

In reply to: Heikki Linnakangas (#110)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sat, Mar 16, 2019 at 1:33 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

AFAICS, there is a copy of every page's high key in its immediate
parent. Either in the downlink of the right sibling, or as the high key
of the parent page itself. Cross-checking those would catch any
corruption in high keys.

I agree that it's always true that the high key is also in the parent,
and we could cross-check that within the child. Actually, I should
probably figure out a way of arranging for the Bloom filter used
within bt_relocate_from_root() (which has been around since PG v11) to
include the key itself when fingerprinting. That would probably
necessitate that we don't truncate "negative infinity" items (it was
actually that way about 18 years ago), just for the benefit of
verification. This is almost the same thing as what Graefe argues for
(though I don't think that you need a low key on the leaf level, since
you can cross a single level there). I wonder if truncating the negative
infinity item in internal pages to zero attributes is actually worth
it, since a low key might be useful for a number of reasons.

Note that this would potentially catch some corruption that the
descend-from-root check would not. If you have a mismatch between the
high key of a leaf page and its parent or grandparent, all the current
tuples might still pass the rootdescend check. But a tuple might get
inserted into the wrong location later.

I also agree with this. However, there will always be some imaginable
cases where the rootdescend check does better than this, for as long
as there are negative infinity items to worry about. (And,
even if we decided not to truncate to support easy verification, there
is still a good argument to be made for involving the authoritative
_bt_search() code at some point).

Maybe you could actually do something with the high key from leaf page
5 to detect the stray value "20" in leaf page 6, but again, we don't
do anything like that right now.

Hmm, yeah, to check for stray values, you could follow the left-link,
get the high key of the left sibling, and compare against that.

Graefe argues for retaining a low key, even in leaf pages (the left
page's old high key becomes the left page's low key during a split,
and the left page's new high key becomes the new right page's low key
at the same time). It's useful for what he calls "write-optimized
B-Trees", and maybe even for optional compression. As I said earlier,
I guess you can just go left on the leaf level if you need to, and you
have all you need. But I'd need to think about it some more.
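
Just to sketch the go-left idea (the function name is made up; this
ignores the left-to-right lock ordering rule and half-dead/ignorable
pages, and real amcheck code would ereport ERRCODE_INDEX_CORRUPTED
rather than just warn):

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufmgr.h"

/*
 * Sketch only: treat the left sibling's high key as the target leaf page's
 * missing "low key", to catch a stray value like the "20" discussed above.
 */
static void
bt_check_low_bound_sketch(Relation rel, Buffer targetbuf)
{
	Page		target = BufferGetPage(targetbuf);
	BTPageOpaque topaque = (BTPageOpaque) PageGetSpecialPointer(target);
	Buffer		leftbuf;
	Page		left;
	IndexTuple	lefthikey;
	BTScanInsert lowkey;
	OffsetNumber firstoffnum;

	if (P_LEFTMOST(topaque))
		return;					/* no left sibling, nothing to compare */

	/* Follow the left-link and grab the left sibling's high key */
	leftbuf = _bt_getbuf(rel, topaque->btpo_prev, BT_READ);
	left = BufferGetPage(leftbuf);
	lefthikey = (IndexTuple) PageGetItem(left, PageGetItemId(left, P_HIKEY));

	/* Build an insertion scankey from the would-be low key */
	lowkey = _bt_mkscankey(rel, lefthikey);

	/*
	 * Every real item on the target page must be >= the low key.  Within-page
	 * ordering is verified separately, so checking the first item suffices.
	 */
	firstoffnum = P_FIRSTDATAKEY(topaque);
	if (firstoffnum <= PageGetMaxOffsetNumber(target) &&
		_bt_compare(rel, lowkey, target, firstoffnum) > 0)
		elog(WARNING, "stray value below low bound in block %u",
			 BufferGetBlockNumber(targetbuf));

	pfree(lowkey);
	_bt_relbuf(rel, leftbuf);
}

The left sibling's high key plays the role of the low key here, which is
why Graefe's retained fence keys make the cousin problem go away.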

Point taken; rootdescend isn't enough to make verification exactly
perfect. But it brings verification very close to perfect, because
you're going to get correct answers to queries as long as it passes (I
think). There could be a problem that only matters for a future
insertion, which a cross-check like yours could detect but rootdescend
can't. But you'd have to be extraordinarily unlucky for that situation
to persist for any length of time. Unlucky even by my own extremely
paranoid standard.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#111)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sat, Mar 16, 2019 at 1:47 PM Peter Geoghegan <pg@bowt.ie> wrote:

I agree that it's always true that the high key is also in the parent,
and we could cross-check that within the child. Actually, I should
probably figure out a way of arranging for the Bloom filter used
within bt_relocate_from_root() (which has been around since PG v11) to
include the key itself when fingerprinting. That would probably
necessitate that we don't truncate "negative infinity" items (it was
actually that way about 18 years ago), just for the benefit of
verification.

Clarification: You'd fingerprint an entire pivot tuple -- key, block
number, everything. Then, you'd probe the Bloom filter for the high
key one level down, with the downlink block in the high key set to
point to the current sibling on the same level (the child level). As I
said, I think that the only reason that that cannot be done at present
is because of the micro-optimization of truncating the first item on
the right page to zero attributes during an internal page split. (We
can retain the key without getting rid of the hard-coded logic for
negative infinity within _bt_compare()).
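
In code, the fingerprinting side might look something like this sketch,
reusing the existing lib/bloomfilter.h infrastructure. The helper names
are made up, and the need to normalize tuple representations before
fingerprinting (as the existing heapallindexed check does) is glossed
over:

#include "postgres.h"
#include "access/itup.h"
#include "access/nbtree.h"
#include "lib/bloomfilter.h"

/*
 * Sketch only: while walking a parent (internal) page, fingerprint each
 * pivot tuple whole -- key and downlink block number included.
 */
static void
bt_fingerprint_pivot(bloom_filter *filter, IndexTuple pivot)
{
	bloom_add_element(filter, (unsigned char *) pivot, IndexTupleSize(pivot));
}

/*
 * Sketch only: one level down, rebuild what the parent's pivot tuple for our
 * right sibling ought to look like -- our own high key, with the downlink
 * stamped with the right sibling's block number -- and probe for it.
 */
static bool
bt_probe_child_highkey(bloom_filter *filter, Page child, BTPageOpaque opaque)
{
	IndexTuple	hikey;
	IndexTuple	probe;
	bool		missing;

	if (P_RIGHTMOST(opaque))
		return true;			/* no high key to cross-check */

	hikey = (IndexTuple) PageGetItem(child, PageGetItemId(child, P_HIKEY));
	probe = CopyIndexTuple(hikey);
	BTreeInnerTupleSetDownLink(probe, opaque->btpo_next);

	missing = bloom_lacks_element(filter, (unsigned char *) probe,
								  IndexTupleSize(probe));
	pfree(probe);

	/* A false positive is possible, but a "missing" answer is definite */
	return !missing;
}

As with heapallindexed, a Bloom filter can only report probable
membership, so a failure to find the probe tuple is hard evidence of
corruption, while success proves nothing on its own.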

bt_relocate_from_root() already has smarts around interrupted page
splits (with the incomplete split bit set).

Finally, you'd also make bt_downlink_check follow every downlink, not
all-but-one downlink (no more excuse for leaving out the first one if
we don't truncate during internal page splits).

--
Peter Geoghegan

In reply to: Peter Geoghegan (#112)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sat, Mar 16, 2019 at 2:01 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Sat, Mar 16, 2019 at 1:47 PM Peter Geoghegan <pg@bowt.ie> wrote:

I agree that it's always true that the high key is also in the parent,
and we could cross-check that within the child. Actually, I should
probably figure out a way of arranging for the Bloom filter used
within bt_relocate_from_root() (which has been around since PG v11) to
include the key itself when fingerprinting. That would probably
necessitate that we don't truncate "negative infinity" items (it was
actually that way about 18 years ago), just for the benefit of
verification.

Clarification: You'd fingerprint an entire pivot tuple -- key, block
number, everything. Then, you'd probe the Bloom filter for the high
key one level down, with the downlink block in the high key set to
point to the current sibling on the same level (the child level). As I
said, I think that the only reason that that cannot be done at present
is because of the micro-optimization of truncating the first item on
the right page to zero attributes during an internal page split. (We
can retain the key without getting rid of the hard-coded logic for
negative infinity within _bt_compare()).

bt_relocate_from_root() already has smarts around interrupted page
splits (with the incomplete split bit set).

Clarification to my clarification: I meant
bt_downlink_missing_check(), not bt_relocate_from_root(). The former
really has been around since v11, unlike the latter, which is part of
this new amcheck patch we're discussing.

--
Peter Geoghegan

In reply to: Heikki Linnakangas (#109)
7 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Sat, Mar 16, 2019 at 1:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Actually, how about "rootsearch", or "rootdescend"? You're supposed to
hyphenate "re-find", and so it doesn't really work as a function
argument name.

Works for me.

Attached is v18 of patch series, which calls the new verification
option "rootdescend" verification.

As previously stated, I intend to commit the first 4 patches (up to
and including this amcheck "rootdescend" patch) during the workday
tomorrow, Pacific time.

Other changes:

* Further consolidation of the nbtree.h comments from the second patch,
so that the on-disk representation overview that you requested a while
back has all the details. A couple of these details were moved from
macro comments (also in nbtree.h) that were missed earlier.

* Tweaks to the comments on _bt_binsrch_insert() and its callers.
Streamlined so that they no longer dwell on cases that only apply to
internal pages. Explicitly stated the requirements for callers.

* Made _bt_binsrch_insert() set InvalidOffsetNumber for bounds in cases
where valid bounds cannot be established initially. This seemed like a
good idea.

* A few more defensive assertions were added to nbtinsert.c (also
related to _bt_binsrch_insert()).

Thanks
--
Peter Geoghegan

Attachments:

v18-0005-Add-split-after-new-tuple-optimization.patch (application/octet-stream)
From e6afa82352a94311e7f2541c33aea292ef015018 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 16:48:08 -0700
Subject: [PATCH v18 5/7] Add "split after new tuple" optimization.

Add additional heuristics to the algorithm for locating an optimal split
location.  New logic identifies localized monotonically increasing
values in indexes with multiple columns.  When this insertion pattern is
detected, page splits split just after the new item that provoked a page
split (or apply leaf fillfactor in the style of a rightmost page split).
This optimization is a variation of the long established leaf fillfactor
optimization used during rightmost page splits.

50/50 page splits are only appropriate with a pattern of truly random
insertions, where the average space utilization ends up at 65% - 70%.
Without this patch, affected cases have leaf pages that are no more than
about 50% full on average.  Future insertions can never make use of the
free space left behind.  With this patch, affected cases have leaf pages
that are about 90% full on average (assuming a fillfactor of 90).
Localized monotonically increasing insertion patterns are presumed to be
fairly common in real-world applications.

Note that even pg_upgrade'd v3 indexes make use of this optimization.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzkpKeZJrXvR_p7VSY1b-s85E3gHyTbZQzR0BkJ5LrWF_A@mail.gmail.com
---
 src/backend/access/nbtree/nbtsplitloc.c | 229 +++++++++++++++++++++++-
 1 file changed, 226 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 23cc593dc8..e148f1f2b8 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -70,6 +70,9 @@ static void _bt_recsplitloc(FindSplitData *state,
 static void _bt_deltasortsplits(FindSplitData *state, double fillfactormult,
 					bool usemult);
 static int	_bt_splitcmp(const void *arg1, const void *arg2);
+static bool _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
+					int leaffillfactor, bool *usemult);
+static bool _bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid);
 static OffsetNumber _bt_bestsplitloc(FindSplitData *state, int perfectpenalty,
 				 bool *newitemonleft);
 static int _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
@@ -249,9 +252,10 @@ _bt_findsplitloc(Relation rel,
 	 * Start search for a split point among list of legal split points.  Give
 	 * primary consideration to equalizing available free space in each half
 	 * of the split initially (start with default strategy), while applying
-	 * rightmost where appropriate.  Either of the two other fallback
-	 * strategies may be required for cases with a large number of duplicates
-	 * around the original/space-optimal split point.
+	 * rightmost and split-after-new-item optimizations where appropriate.
+	 * Either of the two other fallback strategies may be required for cases
+	 * with a large number of duplicates around the original/space-optimal
+	 * split point.
 	 *
 	 * Default strategy gives some weight to suffix truncation in deciding a
 	 * split point on leaf pages.  It attempts to select a split point where a
@@ -273,6 +277,44 @@ _bt_findsplitloc(Relation rel,
 		usemult = true;
 		fillfactormult = leaffillfactor / 100.0;
 	}
+	else if (_bt_afternewitemoff(&state, maxoff, leaffillfactor, &usemult))
+	{
+		/*
+		 * New item inserted at rightmost point among a localized grouping on
+		 * a leaf page -- apply "split after new item" optimization, either by
+		 * applying leaf fillfactor multiplier, or by choosing the exact split
+		 * point that leaves the new item as last on the left. (usemult is set
+		 * for us.)
+		 */
+		if (usemult)
+		{
+			/* fillfactormult should be set based on leaf fillfactor */
+			fillfactormult = leaffillfactor / 100.0;
+		}
+		else
+		{
+			/* find precise split point after newitemoff */
+			for (int i = 0; i < state.nsplits; i++)
+			{
+				SplitPoint *split = state.splits + i;
+
+				if (split->newitemonleft &&
+					newitemoff == split->firstoldonright)
+				{
+					pfree(state.splits);
+					*newitemonleft = true;
+					return newitemoff;
+				}
+			}
+
+			/*
+			 * Cannot legally split after newitemoff; proceed with split
+			 * without using fillfactor multiplier.  This is defensive, and
+			 * should never be needed in practice.
+			 */
+			fillfactormult = 0.50;
+		}
+	}
 	else
 	{
 		/* Other leaf page.  50:50 page split. */
@@ -519,6 +561,187 @@ _bt_splitcmp(const void *arg1, const void *arg2)
 	return 0;
 }
 
+/*
+ * Subroutine to determine whether or not the page should be split immediately
+ * after the would-be original page offset for the new/incoming tuple.  This
+ * is appropriate when there is a pattern of localized monotonically
+ * increasing insertions into a composite index, grouped by one or more
+ * leading attribute values.  This is prevalent in many real world
+ * applications.  Consider the example of a composite index on '(invoice_id,
+ * item_no)', where the item_no for each invoice is an identifier assigned in
+ * ascending order (invoice_id could itself be assigned in approximately
+ * monotonically increasing order, but that shouldn't matter).  Without this
+ * optimization, approximately 50% of space in leaf pages will be wasted by
+ * 50:50/!usemult page splits.  With this optimization, space utilization will
+ * be close to that of a similar index where all tuple insertions modify the
+ * current rightmost leaf page in the index (typically 90% for leaf pages).
+ *
+ * When the optimization is applied, the new/incoming tuple becomes the last
+ * tuple on the new left page.  (Actually, newitemoff > maxoff cases often use
+ * this optimization within indexes where monotonically increasing insertions
+ * of each grouping come in multiple "bursts" over time, such as a composite
+ * index on '(supplier_id, invoice_id, item_no)'.  Caller applies leaf
+ * fillfactor in the style of a rightmost leaf page split when newitemoff is
+ * at or very near the end of the original page.)
+ *
+ * This optimization may leave extra free space remaining on the rightmost
+ * page of a "most significant column" grouping of tuples if that grouping
+ * never ends up having future insertions that use the free space.  That
+ * effect is self-limiting; a future grouping that becomes the "nearest on the
+ * right" grouping of the affected grouping usually puts the extra free space
+ * to good use.  In general, it's important to avoid a pattern of pathological
+ * page splits that consistently do the wrong thing.
+ *
+ * Caller uses optimization when routine returns true, though the exact action
+ * taken by caller varies.  Caller uses original leaf page fillfactor in
+ * standard way rather than using the new item offset directly when *usemult
+ * was also set to true here.  Otherwise, caller applies optimization by
+ * locating the legal split point that makes the new tuple the very last tuple
+ * on the left side of the split.
+ */
+static bool
+_bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
+					int leaffillfactor, bool *usemult)
+{
+	int16		nkeyatts;
+	ItemId		itemid;
+	IndexTuple	tup;
+	int			keepnatts;
+
+	Assert(state->is_leaf && !state->is_rightmost);
+
+	nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
+
+	/* Assume leaffillfactor will be used by caller for now */
+	*usemult = true;
+
+	/* Single key indexes not considered here */
+	if (nkeyatts == 1)
+		return false;
+
+	/* Ascending insertion pattern never inferred when new item is first */
+	if (state->newitemoff == P_FIRSTKEY)
+		return false;
+
+	/*
+	 * Only apply optimization on pages with equisized tuples, since ordinal
+	 * keys are likely to be fixed-width.  Testing if the new tuple is
+	 * variable width directly might also work, but that fails to apply the
+	 * optimization to indexes with a numeric_ops attribute.
+	 *
+	 * Conclude that page has equisized tuples when the new item is the same
+	 * width as the smallest item observed during pass over page, and other
+	 * non-pivot tuples must be the same width as well.  (Note that the
+	 * possibly-truncated existing high key isn't counted in
+	 * olddataitemstotal, and must be subtracted from maxoff.)
+	 */
+	if (state->newitemsz != state->minfirstrightsz)
+		return false;
+	if (state->newitemsz * (maxoff - 1) != state->olddataitemstotal)
+		return false;
+
+	/*
+	 * Avoid applying optimization when tuples are wider than a tuple
+	 * consisting of two non-NULL int8/int64 attributes (or four non-NULL
+	 * int4/int32 attributes)
+	 */
+	if (state->newitemsz >
+		MAXALIGN(sizeof(IndexTupleData) + sizeof(int64) * 2) +
+		sizeof(ItemIdData))
+		return false;
+
+	/*
+	 * At least the first attribute's value must be equal to the corresponding
+	 * value in previous tuple to apply optimization.  New item cannot be a
+	 * duplicate, either.
+	 *
+	 * Handle case where new item is to the right of all items on the existing
+	 * page.  This is suggestive of monotonically increasing insertions in
+	 * itself, so the "heap TID adjacency" test is not applied here.
+	 * Concurrent insertions from backends associated with the same grouping
+	 * or sub-grouping should still have the optimization applied; if the
+	 * grouping is rather large, splits will consistently end up here.
+	 */
+	if (state->newitemoff > maxoff)
+	{
+		itemid = PageGetItemId(state->page, maxoff);
+		tup = (IndexTuple) PageGetItem(state->page, itemid);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+
+		if (keepnatts > 1 && keepnatts <= nkeyatts)
+			return true;
+
+		return false;
+	}
+
+	/*
+	 * "Low cardinality leading column, high cardinality suffix column"
+	 * indexes with a random insertion pattern (e.g., an index with a boolean
+	 * column, such as an index on '(book_is_in_print, book_isbn)') present us
+	 * with a risk of consistently misapplying the optimization.  We're
+	 * willing to accept very occasional misapplication of the optimization,
+	 * provided the cases where we get it wrong are rare and self-limiting.
+	 *
+	 * Heap TID adjacency strongly suggests that the item just to the left was
+	 * inserted very recently, which prevents most misfirings.  Besides, all
+	 * inappropriate cases triggered at this point will still split in the
+	 * middle of the page on average.
+	 */
+	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
+	tup = (IndexTuple) PageGetItem(state->page, itemid);
+	/* Do cheaper test first */
+	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+		return false;
+	/* Check same conditions as rightmost item case, too */
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+
+	if (keepnatts > 1 && keepnatts <= nkeyatts)
+	{
+		double		interp = (double) state->newitemoff / ((double) maxoff + 1);
+		double		leaffillfactormult = (double) leaffillfactor / 100.0;
+
+		/*
+		 * Don't allow caller to split after a new item when it will result in
+		 * a split point to the right of the point that a leaf fillfactor
+		 * split would use -- have caller apply leaf fillfactor instead
+		 */
+		if (interp <= leaffillfactormult)
+			*usemult = false;
+
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Subroutine for determining if two heap TIDs are "adjacent".
+ *
+ * Adjacent means that the high TID is very likely to have been inserted into
+ * heap relation immediately after the low TID, probably by the same
+ * transaction.
+ */
+static bool
+_bt_adjacenthtid(ItemPointer lowhtid, ItemPointer highhtid)
+{
+	BlockNumber lowblk,
+				highblk;
+
+	lowblk = ItemPointerGetBlockNumber(lowhtid);
+	highblk = ItemPointerGetBlockNumber(highhtid);
+
+	/* Make optimistic assumption of adjacency when heap blocks match */
+	if (lowblk == highblk)
+		return true;
+
+	/* When heap block one up, second offset should be FirstOffsetNumber */
+	if (lowblk + 1 == highblk &&
+		ItemPointerGetOffsetNumber(highhtid) == FirstOffsetNumber)
+		return true;
+
+	return false;
+}
+
 /*
  * Subroutine to find the "best" split point among an array of acceptable
  * candidate split points that split without there being an excessively high
-- 
2.17.1

v18-0001-Refactor-nbtree-insertion-scankeys.patch (application/octet-stream)
From c0eadf277c97c5eb2c4d15f25a2a88738bcf7271 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 29 Dec 2018 15:34:48 -0800
Subject: [PATCH v18 1/7] Refactor nbtree insertion scankeys.

Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.

Use the new struct to store mutable state about an in-progress binary
search, rather than having _bt_check_unique() callers cache their binary
search in an ad-hoc manner.  This makes it easy to add a new
optimization: _bt_check_unique() now falls out of its loop immediately
in the common case where it's already clear that there couldn't possibly
be a duplicate.  More importantly, the new _bt_check_unique() scheme
makes it a lot easier to manage cached binary search effort afterwards,
from within _bt_findinsertloc().  This is needed for the upcoming patch
to make nbtree tuples unique by treating heap TID as a final tiebreaker
column.

Based on a suggestion by Andrey Lepikhov.

Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas, Andrey Lepikhov
Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com
---
 contrib/amcheck/verify_nbtree.c       |  52 ++--
 src/backend/access/nbtree/README      |  29 +-
 src/backend/access/nbtree/nbtinsert.c | 422 +++++++++++++++-----------
 src/backend/access/nbtree/nbtpage.c   |  12 +-
 src/backend/access/nbtree/nbtsearch.c | 230 ++++++++++----
 src/backend/access/nbtree/nbtsort.c   |   8 +-
 src/backend/access/nbtree/nbtutils.c  |  96 ++----
 src/backend/utils/sort/tuplesort.c    |  16 +-
 src/include/access/nbtree.h           |  79 ++++-
 9 files changed, 562 insertions(+), 382 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index bb6442de82..5426bfd8d8 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -127,9 +127,9 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
-static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+static BTScanInsert bt_right_page_check_scankey(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
@@ -139,14 +139,14 @@ static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber upperbound);
 static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber lowerbound);
 static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
+							   BTScanInsert key,
+							   Page nontarget,
 							   OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
 
@@ -838,8 +838,8 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		BTScanInsert skey;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -1030,7 +1030,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
-			ScanKey		rightkey;
+			BTScanInsert	rightkey;
 
 			/* Get item in next/right page */
 			rightkey = bt_right_page_check_scankey(state);
@@ -1082,7 +1082,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, skey, childblock);
 		}
 	}
 
@@ -1111,11 +1111,12 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
+static BTScanInsert
 bt_right_page_check_scankey(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
+	IndexTuple	firstitup;
 	BlockNumber targetnext;
 	Page		rightpage;
 	OffsetNumber nline;
@@ -1303,8 +1304,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * Return first real item scankey.  Note that this relies on right page
 	 * memory remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
+	return _bt_mkscankey(state->rel, firstitup);
 }
 
 /*
@@ -1317,8 +1318,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  * verification this way around is much more practical.
  */
 static void
-bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1423,8 +1424,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1864,13 +1864,12 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
+invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
@@ -1883,13 +1882,12 @@ invariant_leq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
+invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
 	return cmp >= 0;
 }
@@ -1905,14 +1903,12 @@ invariant_geq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							   Page nontarget, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
 	return cmp <= 0;
 }
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index b0b4ab8b76..a295a7a286 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -598,19 +598,22 @@ scankey point to comparison functions that return boolean, such as int4lt.
 There might be more than one scankey entry for a given index column, or
 none at all.  (We require the keys to appear in index column order, but
 the order of multiple keys for a given column is unspecified.)  An
-insertion scankey uses the same array-of-ScanKey data structure, but the
-sk_func pointers point to btree comparison support functions (ie, 3-way
-comparators that return int4 values interpreted as <0, =0, >0).  In an
-insertion scankey there is exactly one entry per index column.  Insertion
-scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
-used to locate the starting point of a scan, as well as for locating the
-place to insert a new index tuple.  (Note: in the case of an insertion
-scankey built from a search scankey, there might be fewer keys than
-index columns, indicating that we have no constraints for the remaining
-index columns.)  After we have located the starting point of a scan, the
-original search scankey is consulted as each index entry is sequentially
-scanned to decide whether to return the entry and whether the scan can
-stop (see _bt_checkkeys()).
+insertion scankey ("BTScanInsert" data structure) uses a similar
+array-of-ScanKey data structure, but the sk_func pointers point to btree
+comparison support functions (ie, 3-way comparators that return int4 values
+interpreted as <0, =0, >0).  In an insertion scankey there is at most one
+entry per index column.  There is also other data about the rules used to
+locate where to begin the scan, such as whether or not the scan is a
+"nextkey" scan.  Insertion scankeys are built within the btree code (eg, by
+_bt_mkscankey()) and are used to locate the starting point of a scan, as
+well as for locating the place to insert a new index tuple.  (Note: in the
+case of an insertion scankey built from a search scankey or built from a
+truncated pivot tuple, there might be fewer keys than index columns,
+indicating that we have no constraints for the remaining index columns.)
+After we have located the starting point of a scan, the original search
+scankey is consulted as each index entry is sequentially scanned to decide
+whether to return the entry and whether the scan can stop (see
+_bt_checkkeys()).
 
 We use term "pivot" index tuples to distinguish tuples which don't point
 to heap tuples, but rather used for tree navigation.  Pivot tuples includes
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 2997b1111a..ed6b2692cb 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -51,19 +51,17 @@ typedef struct
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
-static TransactionId _bt_check_unique(Relation rel, IndexTuple itup,
-				 Relation heapRel, Buffer buf, OffsetNumber offset,
-				 ScanKey itup_scankey,
+static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
+				 Relation heapRel,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken);
-static void _bt_findinsertloc(Relation rel,
-				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
-				  IndexTuple newtup,
+static OffsetNumber _bt_findinsertloc(Relation rel,
+				  BTInsertState insertstate,
+				  bool checkingunique,
 				  BTStack stack,
 				  Relation heapRel);
+static bool _bt_useduplicatepage(Relation rel, Relation heapRel,
+					 BTInsertState insertstate);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -83,8 +81,8 @@ static void _bt_checksplitloc(FindSplitData *state,
 				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey);
+static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key,
+			Page page, OffsetNumber offnum);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
@@ -110,18 +108,26 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel)
 {
 	bool		is_unique = false;
-	int			indnkeyatts;
-	ScanKey		itup_scankey;
+	BTInsertStateData insertstate;
+	BTScanInsert itup_key;
 	BTStack		stack = NULL;
 	Buffer		buf;
-	OffsetNumber offset;
 	bool		fastpath;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	Assert(indnkeyatts != 0);
+	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);
 
 	/* we need an insertion scan key to do our search, so build one */
-	itup_scankey = _bt_mkscankey(rel, itup);
+	itup_key = _bt_mkscankey(rel, itup);
+
+	/*
+	 * Fill in the BTInsertState working area, to track the current page and
+	 * position within the page to insert on
+	 */
+	insertstate.itup = itup;
+	/* PageAddItem will MAXALIGN(), but be consistent */
+	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+	insertstate.itup_key = itup_key;
+	insertstate.bounds_valid = false;
+	insertstate.buf = InvalidBuffer;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -144,10 +150,8 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	 */
 top:
 	fastpath = false;
-	offset = InvalidOffsetNumber;
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
-		Size		itemsz;
 		Page		page;
 		BTPageOpaque lpageop;
 
@@ -166,9 +170,6 @@ top:
 			page = BufferGetPage(buf);
 
 			lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
-			itemsz = IndexTupleSize(itup);
-			itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this
-										 * but we need to be consistent */
 
 			/*
 			 * Check if the page is still the rightmost leaf page, has enough
@@ -177,10 +178,9 @@ top:
 			 */
 			if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
 				!P_IGNORE(lpageop) &&
-				(PageGetFreeSpace(page) > itemsz) &&
+				(PageGetFreeSpace(page) > insertstate.itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, itup_key, page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -219,10 +219,12 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, itup_key, &buf, BT_WRITE, NULL);
 	}
 
+	insertstate.buf = buf;
+	buf = InvalidBuffer;		/* insertstate.buf now owns the buffer */
+
 	/*
 	 * If we're not allowing duplicates, make sure the key isn't already in
 	 * the index.
@@ -244,19 +246,19 @@ top:
 	 * let the tuple in and return false for possibly non-unique, or true for
 	 * definitely unique.
 	 */
-	if (checkUnique != UNIQUE_CHECK_NO)
+	if (checkingunique)
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
-		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
-								 checkUnique, &is_unique, &speculativeToken);
+		xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
+								 &is_unique, &speculativeToken);
 
 		if (TransactionIdIsValid(xwait))
 		{
 			/* Have to wait for the other guy ... */
-			_bt_relbuf(rel, buf);
+			_bt_relbuf(rel, insertstate.buf);
+			insertstate.buf = InvalidBuffer;
 
 			/*
 			 * If it's a speculative insertion, wait for it to finish (ie. to
@@ -277,6 +279,8 @@ top:
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		OffsetNumber newitemoff;
+
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -286,22 +290,28 @@ top:
 		 * This reasoning also applies to INCLUDE indexes, whose extra
 		 * attributes are not considered part of the key space.
 		 */
-		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
-		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+		CheckForSerializableConflictIn(rel, NULL, insertstate.buf);
+
+		/*
+		 * Do the insertion.  Note that insertstate contains cached binary
+		 * search bounds established within _bt_check_unique when insertion is
+		 * checkingunique.
+		 */
+		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
+									   stack, heapRel);
+		_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
+					   newitemoff, false);
 	}
 	else
 	{
 		/* just release the buffer */
-		_bt_relbuf(rel, buf);
+		_bt_relbuf(rel, insertstate.buf);
 	}
 
 	/* be tidy */
 	if (stack)
 		_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
+	pfree(itup_key);
 
 	return is_unique;
 }
@@ -309,10 +319,6 @@ top:
 /*
  *	_bt_check_unique() -- Check for violation of unique index constraint
  *
- * offset points to the first possible item that could conflict. It can
- * also point to end-of-page, which means that the first tuple to check
- * is the first tuple on the next page.
- *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
  * conflict is detected, no return --- just ereport().  If an xact ID is
@@ -324,16 +330,21 @@ top:
  * InvalidTransactionId because we don't want to wait.  In this case we
  * set *is_unique to false if there is a potential conflict, and the
  * core code must redo the uniqueness check later.
+ *
+ * As a side-effect, sets state in insertstate that can later be used by
+ * _bt_findinsertloc() to reuse most of the binary search work we do
+ * here.
  */
 static TransactionId
-_bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
-				 Buffer buf, OffsetNumber offset, ScanKey itup_scankey,
+_bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	IndexTuple	itup = insertstate->itup;
+	BTScanInsert itup_key = insertstate->itup_key;
 	SnapshotData SnapshotDirty;
+	OffsetNumber offset;
 	OffsetNumber maxoff;
 	Page		page;
 	BTPageOpaque opaque;
@@ -345,13 +356,22 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 
 	InitDirtySnapshot(SnapshotDirty);
 
-	page = BufferGetPage(buf);
+	page = BufferGetPage(insertstate->buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
+	/*
+	 * Find the first tuple with the same key.
+	 *
+	 * This also saves the binary search bounds in insertstate.  We use them
+	 * in the fastpath below, but also in the _bt_findinsertloc() call later.
+	 */
+	offset = _bt_binsrch_insert(rel, insertstate);
+
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
+	Assert(!insertstate->bounds_valid || insertstate->low == offset);
 	for (;;)
 	{
 		ItemId		curitemid;
@@ -364,21 +384,40 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 		 */
 		if (offset <= maxoff)
 		{
+			/*
+			 * Fastpath: In most cases, we can use cached search bounds to
+			 * limit our consideration to items that are definitely
+			 * duplicates.  This fastpath doesn't apply when the original page
+			 * is empty, or when initial offset is past the end of the
+			 * original page, which may indicate that we need to examine a
+			 * second or subsequent page.
+			 *
+			 * Note that this optimization avoids calling _bt_isequal()
+			 * entirely when there are no duplicates, as long as the offset
+			 * where the key will go is not at the end of the page.
+			 */
+			if (nbuf == InvalidBuffer && offset == insertstate->stricthigh)
+			{
+				Assert(insertstate->bounds_valid);
+				Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
+				Assert(insertstate->low <= insertstate->stricthigh);
+				Assert(!_bt_isequal(itupdesc, itup_key, page, offset));
+				break;
+			}
+
 			curitemid = PageGetItemId(page, offset);
 
 			/*
 			 * We can skip items that are marked killed.
 			 *
-			 * Formerly, we applied _bt_isequal() before checking the kill
-			 * flag, so as to fall out of the item loop as soon as possible.
-			 * However, in the presence of heavy update activity an index may
-			 * contain many killed items with the same key; running
-			 * _bt_isequal() on each killed item gets expensive. Furthermore
-			 * it is likely that the non-killed version of each key appears
-			 * first, so that we didn't actually get to exit any sooner
-			 * anyway. So now we just advance over killed items as quickly as
-			 * we can. We only apply _bt_isequal() when we get to a non-killed
-			 * item or the end of the page.
+			 * In the presence of heavy update activity an index may contain
+			 * many killed items with the same key; running _bt_isequal() on
+			 * each killed item gets expensive.  Just advance over killed
+			 * items as quickly as we can.  We only apply _bt_isequal() when
+			 * we get to a non-killed item.  Even those comparisons could be
+			 * avoided (in the common case where there is only one page to
+			 * visit) by reusing bounds, but just skipping dead items is
+			 * sufficiently effective.
 			 */
 			if (!ItemIdIsDead(curitemid))
 			{
@@ -391,7 +430,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 				 * in real comparison, but only for ordering/finding items on
 				 * pages. - vadim 03/24/97
 				 */
-				if (!_bt_isequal(itupdesc, page, offset, indnkeyatts, itup_scankey))
+				if (!_bt_isequal(itupdesc, itup_key, page, offset))
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
@@ -488,7 +527,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 					 * otherwise be masked by this unique constraint
 					 * violation.
 					 */
-					CheckForSerializableConflictIn(rel, NULL, buf);
+					CheckForSerializableConflictIn(rel, NULL, insertstate->buf);
 
 					/*
 					 * This is a definite conflict.  Break the tuple down into
@@ -500,7 +539,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 					 */
 					if (nbuf != InvalidBuffer)
 						_bt_relbuf(rel, nbuf);
-					_bt_relbuf(rel, buf);
+					_bt_relbuf(rel, insertstate->buf);
+					insertstate->buf = InvalidBuffer;
 
 					{
 						Datum		values[INDEX_MAX_KEYS];
@@ -540,7 +580,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 					if (nbuf != InvalidBuffer)
 						MarkBufferDirtyHint(nbuf, true);
 					else
-						MarkBufferDirtyHint(buf, true);
+						MarkBufferDirtyHint(insertstate->buf, true);
 				}
 			}
 		}
@@ -552,11 +592,14 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
+			int			highkeycmp;
+
 			/* If scankey == hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+			Assert(highkeycmp <= 0);
+			if (highkeycmp != 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -600,57 +643,41 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 /*
  *	_bt_findinsertloc() -- Finds an insert location for a tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
+ *		On entry, insertstate buffer contains the page that the new tuple
+ *		belongs on, which is exclusive-locked and pinned by caller.  This
+ *		won't be exactly the right page for some callers to insert on to.
+ *		They'll have to insert into a page somewhere to the right.
  *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
+ *		On exit, insertstate buffer contains the chosen insertion page, and
+ *		the offset within that page is returned.  The lock and pin on the
+ *		original page are released in cases where initial page is not where
+ *		tuple belongs.  New buffer/page passed back to the caller is
+ *		exclusively locked and pinned, just like initial page was.
  *
- *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
- *		page returned to the caller is exclusively locked instead.
+ *		_bt_check_unique() saves the progress of the binary search it
+ *		performs, also in the insertion state.  We don't need to do any
+ *		additional binary search comparisons here most of the time, provided
+ *		caller is to insert on original page.
  *
- *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		This is also where opportunistic microvacuuming of LP_DEAD tuples
+ *		occurs.  It is convenient to make it happen here, since microvacuuming
+ *		may invalidate a _bt_check_unique() caller's cached binary search
+ *		work.
  */
-static void
+static OffsetNumber
 _bt_findinsertloc(Relation rel,
-				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
-				  IndexTuple newtup,
+				  BTInsertState insertstate,
+				  bool checkingunique,
 				  BTStack stack,
 				  Relation heapRel)
 {
-	Buffer		buf = *bufptr;
-	Page		page = BufferGetPage(buf);
-	Size		itemsz;
+	BTScanInsert itup_key = insertstate->itup_key;
+	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
 	OffsetNumber newitemoff;
-	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	itemsz = IndexTupleSize(newtup);
-	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
-								 * need to be consistent */
-
 	/*
 	 * Check whether the item can fit on a btree page at all. (Eventually, we
 	 * ought to try to apply TOAST methods if not.) We actually need to be
@@ -660,11 +687,11 @@ _bt_findinsertloc(Relation rel,
 	 *
 	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
 	 */
-	if (itemsz > BTMaxItemSize(page))
+	if (insertstate->itemsz > BTMaxItemSize(page))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itemsz, BTMaxItemSize(page),
+						insertstate->itemsz, BTMaxItemSize(page),
 						RelationGetRelationName(rel)),
 				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 						 "Consider a function index of an MD5 hash of the value, "
@@ -672,55 +699,38 @@ _bt_findinsertloc(Relation rel,
 				 errtableconstraint(heapRel,
 									RelationGetRelationName(rel))));
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
-	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	Assert(!insertstate->bounds_valid || checkingunique);
+	for (;;)
 	{
+		int			cmpval;
 		Buffer		rbuf;
 		BlockNumber rblkno;
 
 		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
+		 * An earlier _bt_check_unique() call may well have established bounds
+		 * that we can use to skip the high key check for checkingunique
+		 * callers.  This fastpath cannot be used when there are no items on
+		 * the existing page (other than high key), or when it looks like the
+		 * new item belongs last on the page, but it might go on a later page
+		 * instead.
 		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
-		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
+		if (insertstate->bounds_valid &&
+			insertstate->low <= insertstate->stricthigh &&
+			insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+			break;
 
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
-				break;			/* OK, now we have enough space */
-		}
+		/* If this is the page that the tuple must go on, stop */
+		if (P_RIGHTMOST(lpageop))
+			break;
+		cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
 
 		/*
-		 * nope, so check conditions (b) and (c) enumerated above
+		 * We may have to handle the case where there is a choice of which
+		 * page to place the new tuple on, balancing space utilization as
+		 * best we can.  Note that this may invalidate our cached bounds.
 		 */
-		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, insertstate))
 			break;
 
 		/*
@@ -761,29 +771,98 @@ _bt_findinsertloc(Relation rel,
 
 			rblkno = lpageop->btpo_next;
 		}
-		_bt_relbuf(rel, buf);
-		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
+		/* rbuf locked, local state set up; unlock buf, update other state */
+		_bt_relbuf(rel, insertstate->buf);
+		insertstate->buf = rbuf;
+		insertstate->bounds_valid = false;
 	}
 
-	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
-	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
-	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+	/* Loop should not break until correct page located */
+	Assert(P_RIGHTMOST(lpageop) ||
+		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	*bufptr = buf;
-	*offsetptr = newitemoff;
+	/*
+	 * If the leaf page we're about to insert on doesn't have enough room for
+	 * the new tuple, a page split will be required.  If it looks like the
+	 * page has LP_DEAD items, try to make room by removing them.  It may
+	 * still be possible to avoid a split.
+	 */
+	if (P_HAS_GARBAGE(lpageop) && PageGetFreeSpace(page) < insertstate->itemsz)
+	{
+		_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+		insertstate->bounds_valid = false;
+	}
+
+	/* Find new item offset, possibly reusing earlier search bounds */
+	newitemoff = _bt_binsrch_insert(rel, insertstate);
+
+	return newitemoff;
+}
+
+/*
+ *	_bt_useduplicatepage() -- Settle for this page of duplicates?
+ *
+ *		This function decides whether an insertion of a duplicate into an
+ *		index should go on the page contained in insertstate->buf when a
+ *		choice must be made.
+ *
+ *		If the current page doesn't have enough free space for the new tuple,
+ *		we "microvacuum" the page, removing LP_DEAD items, in the hope that
+ *		this will make enough room.
+ *
+ *		Returns true if caller should proceed with insert on buf's page.
+ *		Otherwise, caller should move on to the page to the right.
+ */
+static bool
+_bt_useduplicatepage(Relation rel, Relation heapRel, BTInsertState insertstate)
+{
+	Buffer		buf = insertstate->buf;
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque lpageop;
+
+	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	Assert(P_ISLEAF(lpageop) && !P_RIGHTMOST(lpageop));
+
+	/* Easy case -- there is space free on this page already */
+	if (PageGetFreeSpace(page) >= insertstate->itemsz)
+		return true;
+
+	/*
+	 * Before considering moving right, see if we can obtain enough space by
+	 * erasing LP_DEAD items
+	 */
+	if (P_HAS_GARBAGE(lpageop))
+	{
+		_bt_vacuum_one_page(rel, buf, heapRel);
+		insertstate->bounds_valid = false;
+
+		if (PageGetFreeSpace(page) >= insertstate->itemsz)
+			return true;		/* OK, now we have enough space */
+	}
+
+	/*----------
+	 * It's now clear that the _bt_findinsertloc() caller will need to split
+	 * the page if it is to insert the new item here.  The choice to move
+	 * right to the next page remains open to it, but we should not search
+	 * for free space exhaustively when there are many pages to look through.
+	 *
+	 *	_bt_findinsertloc() keeps scanning right until it:
+	 *		(a) reaches the last page where the tuple can legally go,
+	 *		(b) finds a page with enough free space, or
+	 *		(c) gets tired of searching.
+	 *
+	 * (c) is not flippant; it is important because if there are many
+	 * pages' worth of equal keys, it's better to split one of the early
+	 * pages than to scan all the way to the end of the run of equal keys
+	 * on every insert.  We implement "get tired" as a random choice,
+	 * since stopping after scanning a fixed number of pages wouldn't work
+	 * well (we'd never reach the right-hand side of previously split
+	 * pages).  The probability of moving right is set at 0.99, which may
+	 * seem too high to change the behavior much, but it does an excellent
+	 * job of preventing O(N^2) behavior with many equal keys.
+	 *----------
+	 */
+	return random() <= (MAX_RANDOM_VALUE / 100);
 }
 
 /*----------
@@ -2312,24 +2391,21 @@ _bt_pgaddtup(Page page,
  * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
-_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey)
+_bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
+			OffsetNumber offnum)
 {
 	IndexTuple	itup;
+	ScanKey		scankey;
 	int			i;
 
-	/* Better be comparing to a leaf item */
+	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
 
+	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
-	for (i = 1; i <= keysz; i++)
+	for (i = 1; i <= itup_key->keysz; i++)
 	{
 		AttrNumber	attno;
 		Datum		datum;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c785bca95..56041c3d38 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1371,7 +1371,7 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 */
 			if (!stack)
 			{
-				ScanKey		itup_scankey;
+				BTScanInsert itup_key;
 				ItemId		itemid;
 				IndexTuple	targetkey;
 				Buffer		lbuf;
@@ -1421,12 +1421,10 @@ _bt_pagedel(Relation rel, Buffer buf)
 				}
 
 				/* we need an insertion scan key for the search, so build one */
-				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
-				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
-				/* don't need a pin on the page */
+				itup_key = _bt_mkscankey(rel, targetkey);
+				/* get stack to leaf page by searching index */
+				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
+				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
 				/*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index af3da3aa5b..0305469ad0 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -24,6 +24,8 @@
 #include "utils/rel.h"
 
 
+static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 			 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -34,7 +36,6 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
 					  ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
-static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
 
@@ -66,18 +67,13 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
 	}
 }
 
-
 /*
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
  * but it can omit the rightmost column(s) of the index.
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * Return value is a stack of parent-page pointers.  *bufP is set to the
  * address of the leaf-page buffer, which is read-locked and pinned.
  * No locks are held on the parent pages, however!
@@ -93,8 +89,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+		   Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -130,8 +126,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  (access == BT_WRITE), stack_in,
+		*bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
 		/* if this is a leaf page, we're done */
@@ -144,7 +139,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -198,8 +193,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  true, stack_in, BT_WRITE, snapshot);
+		*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+							  snapshot);
 	}
 
 	return stack_in;
@@ -215,16 +210,17 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * or strictly to the right of it.
  *
  * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page.  If that entry
- * is strictly less than the scankey, or <= the scankey in the nextkey=true
- * case, then we followed the wrong link and we need to move right.
+ * tree by examining the high key entry on the page.  If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key.  When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
  *
  * If forupdate is true, we will attempt to finish any incomplete splits
  * that we encounter.  This is required when locking a target page for an
@@ -241,10 +237,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  */
 Buffer
 _bt_moveright(Relation rel,
+			  BTScanInsert key,
 			  Buffer buf,
-			  int keysz,
-			  ScanKey scankey,
-			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
 			  int access,
@@ -269,7 +263,7 @@ _bt_moveright(Relation rel,
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
 	 */
-	cmpval = nextkey ? 0 : 1;
+	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
@@ -304,7 +298,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -324,13 +318,6 @@ _bt_moveright(Relation rel,
 /*
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
  * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
  * particular, this means it is possible to return a value 1 greater than the
@@ -348,12 +335,10 @@ _bt_moveright(Relation rel,
  * the given page.  _bt_binsrch() has no lock or refcount side effects
  * on the buffer.
  */
-OffsetNumber
+static OffsetNumber
 _bt_binsrch(Relation rel,
-			Buffer buf,
-			int keysz,
-			ScanKey scankey,
-			bool nextkey)
+			BTScanInsert key,
+			Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
@@ -375,7 +360,7 @@ _bt_binsrch(Relation rel,
 	 * This can never happen on an internal page, however, since they are
 	 * never empty (an internal page must have children).
 	 */
-	if (high < low)
+	if (unlikely(high < low))
 		return low;
 
 	/*
@@ -392,7 +377,7 @@ _bt_binsrch(Relation rel,
 	 */
 	high++;						/* establish the loop invariant for high */
 
-	cmpval = nextkey ? 0 : 1;	/* select comparison value */
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
 
 	while (high > low)
 	{
@@ -400,7 +385,7 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -427,14 +412,120 @@ _bt_binsrch(Relation rel,
 	return OffsetNumberPrev(low);
 }
 
+/*
+ *
+ *	_bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
+ *
+ * Like _bt_binsrch(), but with support for caching the binary search
+ * bounds.  Only used during insertion, and only on the leaf page that it
+ * looks like caller will insert tuple on.  Exclusive-locked and pinned
+ * leaf page is contained within insertstate.
+ *
+ * Caches the bounds fields in insertstate so that a subsequent call can
+ * reuse the low and strict high bounds of the original search.  Callers
+ * that use these fields directly must be prepared for the case where low
+ * and/or stricthigh are not on the same page (one or both exceed maxoff
+ * for the page).  The case where there are no items on the page (high <
+ * low) makes bounds invalid.
+ *
+ * Caller is responsible for invalidating bounds when it modifies the page
+ * before calling here a second time.
+ */
+OffsetNumber
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
+{
+	BTScanInsert key = insertstate->itup_key;
+	Page		page;
+	BTPageOpaque opaque;
+	OffsetNumber low,
+				high,
+				stricthigh;
+	int32		result,
+				cmpval;
+
+	page = BufferGetPage(insertstate->buf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	Assert(P_ISLEAF(opaque));
+	Assert(!key->nextkey);
+
+	if (!insertstate->bounds_valid)
+	{
+		/* Start new binary search */
+		low = P_FIRSTDATAKEY(opaque);
+		high = PageGetMaxOffsetNumber(page);
+	}
+	else
+	{
+		/* Restore result of previous binary search against same page */
+		low = insertstate->low;
+		high = insertstate->stricthigh;
+	}
+
+	/* If there are no keys on the page, return the first available slot */
+	if (unlikely(high < low))
+	{
+		/* Caller can't reuse bounds */
+		insertstate->low = InvalidOffsetNumber;
+		insertstate->stricthigh = InvalidOffsetNumber;
+		insertstate->bounds_valid = false;
+		return low;
+	}
+
+	/*
+	 * Binary search to find the first key on the page >= scan key.
+	 * (nextkey is always false when inserting).
+	 *
+	 * The loop invariant is: all slots before 'low' are < scan key, all
+	 * slots at or after 'high' are >= scan key.  'stricthigh' is > scan
+	 * key, and is maintained to save additional search effort for caller.
+	 *
+	 * We can fall out when high == low.
+	 */
+	if (!insertstate->bounds_valid)
+		high++;					/* establish the loop invariant for high */
+	stricthigh = high;			/* high initially strictly higher */
+
+	cmpval = 1;					/* !nextkey comparison value */
+
+	while (high > low)
+	{
+		OffsetNumber mid = low + ((high - low) / 2);
+
+		/* We have low <= mid < high, so mid points at a real slot */
+
+		result = _bt_compare(rel, key, page, mid);
+
+		if (result >= cmpval)
+			low = mid + 1;
+		else
+		{
+			high = mid;
+			if (result != 0)
+				stricthigh = high;
+		}
+	}
+
+	/*
+	 * On a leaf page, a binary search always returns the first key >= scan
+	 * key (at least in !nextkey case), which could be the last slot + 1.
+	 * This is also the lower bound of cached search.
+	 *
+	 * stricthigh may also be the last slot + 1, which prevents caller from
+	 * using bounds directly, but is still useful to us if we're called a
+	 * second time with cached bounds (cached low will be < stricthigh when
+	 * that happens).
+	 */
+	insertstate->low = low;
+	insertstate->stricthigh = stricthigh;
+	insertstate->bounds_valid = true;
+
+	return low;
+}
+
 /*----------
- *	_bt_compare() -- Compare scankey to a particular tuple on the page.
+ *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- *	keysz: number of key conditions to be checked (might be less than the
- *		number of index columns!)
  *	page/offnum: location of btree item to be compared to.
  *
  *		This routine returns:
@@ -447,25 +538,26 @@ _bt_binsrch(Relation rel,
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
- * scankey.  The actual key value stored (if any, which there probably isn't)
- * does not matter.  This convention allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first key.
- * See backend/access/nbtree/README for details.
+ * scankey.  The actual key value stored is explicitly truncated to 0
+ * attributes (explicitly minus infinity) with version 3+ indexes, but
+ * that isn't relied upon.  This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key.  See backend/access/nbtree/README for details.
  *----------
  */
 int32
 _bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
+			BTScanInsert key,
 			Page page,
 			OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
-	int			i;
+	ScanKey		scankey;
 
 	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
@@ -488,7 +580,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	scankey = key->scankeys;
+	for (int i = 1; i <= key->keysz; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -574,8 +667,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat;
 	bool		nextkey;
 	bool		goback;
+	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
-	ScanKeyData scankeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -821,8 +914,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
 	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built in the local
-	 * scankeys[] array, using the keys identified by startKeys[].
+	 * identified above.  The insertion scankey is built using the keys
+	 * identified by startKeys[].  (Remaining insertion scankey fields are
+	 * initialized after initial-positioning strategy is finalized.)
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
 	for (i = 0; i < keysCount; i++)
@@ -850,7 +944,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				_bt_parallel_done(scan);
 				return false;
 			}
-			memcpy(scankeys + i, subkey, sizeof(ScanKeyData));
+			memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
 
 			/*
 			 * If the row comparison is the last positioning key we accepted,
@@ -882,7 +976,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					if (subkey->sk_flags & SK_ISNULL)
 						break;	/* can't use null keys */
 					Assert(keysCount < INDEX_MAX_KEYS);
-					memcpy(scankeys + keysCount, subkey, sizeof(ScanKeyData));
+					memcpy(inskey.scankeys + keysCount, subkey,
+						   sizeof(ScanKeyData));
 					keysCount++;
 					if (subkey->sk_flags & SK_ROW_END)
 					{
@@ -928,7 +1023,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				FmgrInfo   *procinfo;
 
 				procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
-				ScanKeyEntryInitializeWithInfo(scankeys + i,
+				ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
 											   cur->sk_flags,
 											   cur->sk_attno,
 											   InvalidStrategy,
@@ -949,7 +1044,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
 						 BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
 						 cur->sk_attno, RelationGetRelationName(rel));
-				ScanKeyEntryInitialize(scankeys + i,
+				ScanKeyEntryInitialize(inskey.scankeys + i,
 									   cur->sk_flags,
 									   cur->sk_attno,
 									   InvalidStrategy,
@@ -1052,12 +1147,15 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
+	/* Initialize remaining insertion scan key fields */
+	inskey.nextkey = nextkey;
+	inskey.keysz = keysCount;
+
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
@@ -1086,7 +1184,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, &inskey, buf);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 363dceb5b1..a0e2e70cef 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -263,6 +263,7 @@ typedef struct BTWriteState
 {
 	Relation	heap;
 	Relation	index;
+	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
@@ -540,6 +541,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
+	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 
 	/*
 	 * We need to log index creation in WAL iff WAL archiving/streaming is
@@ -1085,7 +1087,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
-	ScanKey		indexScanKey = NULL;
 	SortSupport sortKeys;
 
 	if (merge)
@@ -1098,7 +1099,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		/* the preparation of merge */
 		itup = tuplesort_getindextuple(btspool->sortstate, true);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-		indexScanKey = _bt_mkscankey_nodata(wstate->index);
 
 		/* Prepare SortSupport data for each column */
 		sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
@@ -1106,7 +1106,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		for (i = 0; i < keysz; i++)
 		{
 			SortSupport sortKey = sortKeys + i;
-			ScanKey		scanKey = indexScanKey + i;
+			ScanKey		scanKey = wstate->inskey->scankeys + i;
 			int16		strategy;
 
 			sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1125,8 +1125,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
 		}
 
-		_bt_freeskey(indexScanKey);
-
 		for (;;)
 		{
 			load1 = true;		/* load BTSpool next ? */
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 2c05fb5e45..0250e089a6 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -56,34 +56,37 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().
+ *		Result is intended for use with _bt_compare().  Callers that don't
+ *		need to fill out the insertion scankey arguments (e.g. they use an
+ *		ad-hoc comparison routine) can pass a NULL index tuple.
  */
-ScanKey
+BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
+	BTScanInsert key;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
-	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
+	int			tupnatts;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
-	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
+	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
 	 * (INCLUDE index) columns are always omitted from scan keys.
 	 */
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
+	key = palloc(offsetof(BTScanInsertData, scankeys) +
+				 sizeof(ScanKeyData) * indnkeyatts);
+	key->nextkey = false;
+	key->keysz = Min(indnkeyatts, tupnatts);
+	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
 		FmgrInfo   *procinfo;
@@ -96,7 +99,19 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Key arguments built when caller provides no tuple are
+		 * defensively represented as NULL values.  They should never be
+		 * used.
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -108,64 +123,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 									   arg);
 	}
 
-	return skey;
-}
-
-/*
- * _bt_mkscankey_nodata
- *		Build an insertion scan key that contains 3-way comparator routines
- *		appropriate to the key datatypes, but no comparison data.  The
- *		comparison data ultimately used must match the key datatypes.
- *
- *		The result cannot be used with _bt_compare(), unless comparison
- *		data is first stored into the key entries.  Currently this
- *		routine is only called by nbtsort.c and tuplesort.c, which have
- *		their own comparison routines.
- */
-ScanKey
-_bt_mkscankey_nodata(Relation rel)
-{
-	ScanKey		skey;
-	int			indnkeyatts;
-	int16	   *indoption;
-	int			i;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	indoption = rel->rd_indoption;
-
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
-	for (i = 0; i < indnkeyatts; i++)
-	{
-		FmgrInfo   *procinfo;
-		int			flags;
-
-		/*
-		 * We can use the cached (default) support procs since no cross-type
-		 * comparison can be needed.
-		 */
-		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		flags = SK_ISNULL | (indoption[i] << SK_BT_INDOPTION_SHIFT);
-		ScanKeyEntryInitializeWithInfo(&skey[i],
-									   flags,
-									   (AttrNumber) (i + 1),
-									   InvalidStrategy,
-									   InvalidOid,
-									   rel->rd_indcollation[i],
-									   procinfo,
-									   (Datum) 0);
-	}
-
-	return skey;
-}
-
-/*
- * free a scan key made by either _bt_mkscankey or _bt_mkscankey_nodata.
- */
-void
-_bt_freeskey(ScanKey skey)
-{
-	pfree(skey);
+	return key;
 }
 
 /*
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 2946b47b46..16bda5c586 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -884,7 +884,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -919,7 +919,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	if (state->indexInfo->ii_Expressions != NULL)
 	{
@@ -945,7 +945,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -964,7 +964,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -981,7 +981,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -1014,7 +1014,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->indexRel = indexRel;
 	state->enforceUnique = enforceUnique;
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	/* Prepare SortSupport data for each column */
 	state->sortKeys = (SortSupport) palloc0(state->nKeys *
@@ -1023,7 +1023,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1042,7 +1042,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 60622ea790..feb66f8f1c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -319,6 +319,65 @@ typedef struct BTStackData
 
 typedef BTStackData *BTStack;
 
+/*
+ * BTScanInsert is the btree-private state needed to find an initial position
+ * for an indexscan, or to insert new tuples -- an "insertion scankey" (not to
+ * be confused with a search scankey).  It's used to descend a B-Tree using
+ * _bt_search.
+ *
+ * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
+ * locate the first item >= scankey.  When nextkey is true, they will locate
+ * the first item > scan key.
+ *
+ * keysz is the number of insertion scankeys present.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared.
+ * During insertion, there must be a scan key for every attribute, but when
+ * starting a regular index scan some can be omitted.  The array is used as a
+ * flexible array member, though it's sized in a way that makes it possible to
+ * use stack allocations.  See nbtree/README for full details.
+ */
+typedef struct BTScanInsertData
+{
+	/* State used to locate a position at the leaf level */
+	bool		nextkey;
+	int			keysz;			/* Size of scankeys */
+	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
+} BTScanInsertData;
+
+typedef BTScanInsertData *BTScanInsert;
+
+/*
+ * BTInsertStateData is the working area used during insertion.
+ *
+ * Filled in after descending the tree to the first leaf page the new tuple
+ * might belong on.  Tracks the current position while performing the
+ * uniqueness check, before the exact page to insert on has been determined.
+ *
+ * (This should be private to nbtinsert.c, but it's also used by
+ * _bt_binsrch_insert)
+ */
+typedef struct BTInsertStateData
+{
+	IndexTuple	itup;			/* Item we're inserting */
+	Size		itemsz;			/* Size of itup -- should be MAXALIGN()'d */
+	BTScanInsert itup_key;		/* Insertion scankey */
+
+	/* Buffer containing leaf page we're likely to insert itup on */
+	Buffer		buf;
+
+	/*
+	 * Cache of bounds within the current buffer.  Only used for insertions
+	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
+	 * _bt_findinsertloc for details.
+	 */
+	bool		bounds_valid;
+	OffsetNumber low;
+	OffsetNumber stricthigh;
+} BTInsertStateData;
+
+typedef BTInsertStateData *BTInsertState;
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -558,16 +617,12 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
 /*
  * prototypes for functions in nbtsearch.c
  */
-extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
-extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
+		   int access, Snapshot snapshot);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -576,9 +631,7 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 /*
  * prototypes for functions in nbtutils.c
  */
-extern ScanKey _bt_mkscankey(Relation rel, IndexTuple itup);
-extern ScanKey _bt_mkscankey_nodata(Relation rel);
-extern void _bt_freeskey(ScanKey skey);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-- 
2.17.1
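
To give a rough feel for how the refactored insertion scankey interface
hangs together, here is a minimal sketch of a caller that checks whether
a tuple with matching key attribute values is present on the leaf page
that the key lands on -- essentially what the amcheck patch further down
does in bt_rootdescend().  The function name refind_index_tuple is
invented for illustration only, it assumes the usual backend includes
(postgres.h, access/nbtree.h), and nothing is actually inserted:

static bool
refind_index_tuple(Relation rel, IndexTuple itup)
{
	BTScanInsert key = _bt_mkscankey(rel, itup);
	BTInsertStateData insertstate;
	BTStack		stack;
	Buffer		lbuf;
	bool		exists = false;

	/* Descend to the leaf page that the key belongs on (read lock only) */
	stack = _bt_search(rel, key, &lbuf, BT_READ, NULL);

	if (BufferIsValid(lbuf))
	{
		Page		page = BufferGetPage(lbuf);
		OffsetNumber offnum;

		/* Set up insertion state; no cached bounds yet */
		insertstate.itup = itup;
		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
		insertstate.itup_key = key;
		insertstate.bounds_valid = false;
		insertstate.buf = lbuf;

		/* Binary search; caches low/stricthigh bounds as a side effect */
		offnum = _bt_binsrch_insert(rel, &insertstate);

		/* Is the first item >= key actually equal to the key? */
		if (offnum <= PageGetMaxOffsetNumber(page) &&
			_bt_compare(rel, key, page, offnum) == 0)
			exists = true;

		_bt_relbuf(rel, lbuf);
	}

	_bt_freestack(stack);
	pfree(key);

	return exists;
}

A real insertion caller takes a BT_WRITE lock instead, and
_bt_check_unique()/_bt_findinsertloc() then reuse the cached
low/stricthigh bounds rather than repeating comparisons.
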

Attachment: v18-0004-Allow-amcheck-to-re-find-tuples-using-new-search.patch (application/octet-stream)
From 317a67e6b179ba45e6fe6de52a4556ee71653e36 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 31 Jan 2019 17:40:00 -0800
Subject: [PATCH v18 4/7] Allow amcheck to re-find tuples using new search.

Teach contrib/amcheck's bt_index_parent_check() function to take
advantage of the uniqueness property of heapkeyspace indexes in support
of a new verification option: non-pivot tuples (non-highkey tuples on
the leaf level) can optionally be re-found using a new search for each,
that starts from the root page.  If a tuple cannot be re-found, report
that the index is corrupt.

The new "rootdescend" verification option is exhaustive, and can
therefore make a call to bt_index_parent_check() take a lot longer.
Re-finding tuples during verification is mostly intended as an option
for backend developers, since the corruption scenarios that only it can
detect seem fairly far-fetched.

For example, "rootdescend" verification is much more likely to detect
corruption of the least significant byte of a key from a pivot tuple in
the root page of a B-Tree that already has at least three levels.
Typically, only a few tuples on a cousin leaf page are at risk of
"getting overlooked" by index scans in this scenario.  The corrupt key
in the root page is only slightly corrupt: corrupt enough to give wrong
answers to some queries, and yet not corrupt enough to allow the problem
to be detected without verifying agreement between the leaf page and the
root page, skipping at least one internal page level.  The existing
bt_index_parent_check() checks never cross more than a single level.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-Wz=yTWnVu+HeHGKb2AGiADL9eprn-cKYAto4MkKOuiGtRQ@mail.gmail.com
---
 contrib/amcheck/Makefile                 |   2 +-
 contrib/amcheck/amcheck--1.1--1.2.sql    |  19 ++++
 contrib/amcheck/amcheck.control          |   2 +-
 contrib/amcheck/expected/check_btree.out |   5 +-
 contrib/amcheck/sql/check_btree.sql      |   5 +-
 contrib/amcheck/verify_nbtree.c          | 136 +++++++++++++++++++++--
 doc/src/sgml/amcheck.sgml                |   7 +-
 7 files changed, 160 insertions(+), 16 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.1--1.2.sql

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index c5764b544f..dcec3b8520 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big	= amcheck
 OBJS		= verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
-DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
+DATA = amcheck--1.1--1.2.sql amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.1--1.2.sql b/contrib/amcheck/amcheck--1.1--1.2.sql
new file mode 100644
index 0000000000..883530dec7
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.1--1.2.sql
@@ -0,0 +1,19 @@
+/* contrib/amcheck/amcheck--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.2'" to load this file. \quit
+
+-- In order to avoid issues with dependencies when updating amcheck to 1.2,
+-- create a new, overloaded version of the 1.1 function signature
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass,
+    heapallindexed boolean, rootdescend boolean)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want this to be available to public
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 469048403d..c6e310046d 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
 # amcheck extension
 comment = 'functions for verifying relation integrity'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/amcheck'
 relocatable = true
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 1e6079ddd2..6103aa4e47 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -126,7 +126,8 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 (1 row)
 
 --
--- Test for multilevel page deletion/downlink present checks
+-- Test for multilevel page deletion/downlink present checks, and rootdescend
+-- checks
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
@@ -137,7 +138,7 @@ VACUUM delete_test_table;
 -- root"
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
-SELECT bt_index_parent_check('delete_test_table_pkey', true);
+SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
  bt_index_parent_check 
 -----------------------
  
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index 3f1e0d17ef..f3e365f4bd 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -78,7 +78,8 @@ INSERT INTO bttest_multi SELECT i, i%2  FROM generate_series(1, 100000) as i;
 SELECT bt_index_parent_check('bttest_multi_idx', true);
 
 --
--- Test for multilevel page deletion/downlink present checks
+-- Test for multilevel page deletion/downlink present checks, and rootdescend
+-- checks
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
@@ -89,7 +90,7 @@ VACUUM delete_test_table;
 -- root"
 DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
-SELECT bt_index_parent_check('delete_test_table_pkey', true);
+SELECT bt_index_parent_check('delete_test_table_pkey', true, true);
 
 --
 -- BUG #15597: must not assume consistent input toasting state when forming
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 4363e6b82e..6ae3bca953 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -75,6 +75,8 @@ typedef struct BtreeCheckState
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
 	bool		heapallindexed;
+	/* Also making sure non-pivot tuples can be found by new search? */
+	bool		rootdescend;
 	/* Per-page context */
 	MemoryContext targetcontext;
 	/* Buffer access strategy */
@@ -124,10 +126,11 @@ PG_FUNCTION_INFO_V1(bt_index_check);
 PG_FUNCTION_INFO_V1(bt_index_parent_check);
 
 static void bt_index_check_internal(Oid indrelid, bool parentcheck,
-						bool heapallindexed);
+						bool heapallindexed, bool rootdescend);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool heapkeyspace, bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed,
+					 bool rootdescend);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -140,6 +143,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  bool tupleIsAlive, void *checkstate);
 static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 						   IndexTuple itup);
+static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
@@ -177,7 +181,7 @@ bt_index_check(PG_FUNCTION_ARGS)
 	if (PG_NARGS() == 2)
 		heapallindexed = PG_GETARG_BOOL(1);
 
-	bt_index_check_internal(indrelid, false, heapallindexed);
+	bt_index_check_internal(indrelid, false, heapallindexed, false);
 
 	PG_RETURN_VOID();
 }
@@ -196,11 +200,14 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
 {
 	Oid			indrelid = PG_GETARG_OID(0);
 	bool		heapallindexed = false;
+	bool		rootdescend = false;
 
-	if (PG_NARGS() == 2)
+	if (PG_NARGS() >= 2)
 		heapallindexed = PG_GETARG_BOOL(1);
+	if (PG_NARGS() == 3)
+		rootdescend = PG_GETARG_BOOL(2);
 
-	bt_index_check_internal(indrelid, true, heapallindexed);
+	bt_index_check_internal(indrelid, true, heapallindexed, rootdescend);
 
 	PG_RETURN_VOID();
 }
@@ -209,7 +216,8 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
  * Helper for bt_index_[parent_]check, coordinating the bulk of the work.
  */
 static void
-bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
+bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
+						bool rootdescend)
 {
 	Oid			heapid;
 	Relation	indrel;
@@ -267,7 +275,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	/* Check index, possibly against table it is an index on */
 	heapkeyspace = _bt_heapkeyspace(indrel);
 	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
-						 heapallindexed);
+						 heapallindexed, rootdescend);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -338,7 +346,7 @@ btree_index_checkable(Relation rel)
  */
 static void
 bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
-					 bool readonly, bool heapallindexed)
+					 bool readonly, bool heapallindexed, bool rootdescend)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -362,6 +370,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
+	state->rootdescend = rootdescend;
 
 	if (state->heapallindexed)
 	{
@@ -430,6 +439,14 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		}
 	}
 
+	Assert(!state->rootdescend || state->readonly);
+	if (state->rootdescend && !state->heapkeyspace)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot verify that tuples from index \"%s\" can each be found by an independent index search",
+						RelationGetRelationName(rel)),
+				 errhint("Only B-Tree version 4 indexes support rootdescend verification.")));
+
 	/* Create context for page */
 	state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
 												 "amcheck context",
@@ -922,6 +939,31 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
+		/*
+		 * Readonly callers may optionally verify that non-pivot tuples can
+		 * each be found by an independent search that starts from the root
+		 */
+		if (state->rootdescend && P_ISLEAF(topaque) &&
+			!bt_rootdescend(state, itup))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("could not find tuple using search from root page in index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+										itid, htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1526,6 +1568,9 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 		 * internal pages.  In more general terms, a negative infinity item is
 		 * only negative infinity with respect to the subtree that the page is
 		 * at the root of.
+		 *
+		 * See also: bt_rootdescend(), which can even detect transitive
+		 * inconsistencies on cousin leaf pages.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
@@ -1926,6 +1971,81 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
 	return reformed;
 }
 
+/*
+ * Search for itup in index, starting from the fast root page.  itup must be
+ * a non-pivot tuple.  This is only supported with heapkeyspace indexes, since
+ * we rely on having fully unique keys to find a match with only a single
+ * visit to a leaf page, barring an interrupted page split, where we may have
+ * to move right.  (A concurrent page split is impossible because our caller
+ * must be a readonly caller.)
+ *
+ * This routine can detect very subtle transitive consistency issues across
+ * more than one level of the tree.  Leaf pages all have a high key (even the
+ * rightmost page has a conceptual positive infinity high key), but not a low
+ * key.  The downlink in the parent is a lower bound, which along with the high
+ * key is almost enough to detect every possible inconsistency.  A downlink
+ * separator key value won't always be available from parent, though, because
+ * the first items of internal pages are negative infinity items, truncated
+ * down to zero attributes during internal page splits.  While it's true that
+ * bt_downlink_check() and the high key check can detect most imaginable key
+ * space problems, there are remaining problems they won't detect with non-pivot
+ * tuples in cousin leaf pages.  Starting a search from the root for every
+ * existing leaf tuple detects small inconsistencies in upper levels of the
+ * tree that cannot be detected any other way.  (Besides all this, this is
+ * probably also useful as a direct test of the code used by index scans
+ * themselves.)
+ */
+static bool
+bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
+{
+	BTScanInsert key;
+	BTStack		stack;
+	Buffer		lbuf;
+	bool		exists;
+
+	key = _bt_mkscankey(state->rel, itup);
+	Assert(key->heapkeyspace && key->scantid != NULL);
+
+	/*
+	 * Search from root.
+	 *
+	 * Ideally, we would arrange to only move right within _bt_search() when
+	 * an interrupted page split is detected (i.e. when the incomplete split
+	 * bit is found to be set), but for now we accept the possibility that
+	 * that could conceal an inconsistency.
+	 */
+	Assert(state->readonly && state->rootdescend);
+	exists = false;
+	stack = _bt_search(state->rel, key, &lbuf, BT_READ, NULL);
+
+	if (BufferIsValid(lbuf))
+	{
+		BTInsertStateData insertstate;
+		OffsetNumber offnum;
+		Page		page;
+
+		insertstate.itup = itup;
+		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+		insertstate.itup_key = key;
+		insertstate.bounds_valid = false;
+		insertstate.buf = lbuf;
+
+		/* Get matching tuple on leaf page */
+		offnum = _bt_binsrch_insert(state->rel, &insertstate);
+		/* Compare first >= matching item on leaf page, if any */
+		page = BufferGetPage(lbuf);
+		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			_bt_compare(state->rel, key, page, offnum) == 0)
+			exists = true;
+		_bt_relbuf(state->rel, lbuf);
+	}
+
+	_bt_freestack(stack);
+	pfree(key);
+
+	return exists;
+}
+
 /*
  * Is particular offset within page (whose special state is passed by caller)
  * the page negative-infinity item?
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 8bb60d5c2d..627651d8d4 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -112,7 +112,7 @@ ORDER BY c.relpages DESC LIMIT 10;
 
    <varlistentry>
     <term>
-     <function>bt_index_parent_check(index regclass, heapallindexed boolean) returns void</function>
+     <function>bt_index_parent_check(index regclass, heapallindexed boolean, rootdescend boolean) returns void</function>
      <indexterm>
       <primary>bt_index_parent_check</primary>
      </indexterm>
@@ -126,7 +126,10 @@ ORDER BY c.relpages DESC LIMIT 10;
       argument is <literal>true</literal>, the function verifies the
       presence of all heap tuples that should be found within the
       index, and that there are no missing downlinks in the index
-      structure.  The checks that can be performed by
+      structure.  When the optional <parameter>rootdescend</parameter>
+      argument is <literal>true</literal>, verification re-finds
+      tuples on the leaf level by performing a new search from the
+      root page for each tuple.  The checks that can be performed by
       <function>bt_index_parent_check</function> are a superset of the
       checks that can be performed by <function>bt_index_check</function>.
       <function>bt_index_parent_check</function> can be thought of as
-- 
2.17.1

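For reference, here is a minimal sketch of how the extended bt_index_parent_check() signature could be exercised from a small libpq harness (the same query can of course simply be run from psql).  The connection string and index name are placeholders rather than anything taken from the patch; only the three-argument call itself comes from the documentation change above, and it assumes the amcheck extension is already installed.

/*
 * Sketch of a libpq verification harness (not part of the patch).  The
 * connection string and index name are placeholders; the query matches
 * the three-argument bt_index_parent_check() documented above, and
 * assumes "CREATE EXTENSION amcheck" has already been run.
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");
    PGresult   *res;
    int         rc = EXIT_SUCCESS;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return EXIT_FAILURE;
    }

    /* heapallindexed and rootdescend verification both enabled */
    res = PQexec(conn,
                 "SELECT bt_index_parent_check('pgbench_accounts_pkey'::regclass, "
                 "true, true)");

    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "verification failed: %s", PQerrorMessage(conn));
        rc = EXIT_FAILURE;
    }

    PQclear(res);
    PQfinish(conn);
    return rc;
}
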
Attachment: v18-0003-Consider-secondary-factors-during-nbtree-splits.patch (application/octet-stream)
From 079f020cd50626f05fe45c3fc4813152f915c0d6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Oct 2018 15:51:53 -0700
Subject: [PATCH v18 3/7] Consider secondary factors during nbtree splits.

Teach nbtree to give some consideration to how "distinguishing"
candidate leaf page split points are.  This should not noticeably affect
the balance of free space within each half of the split, while still
making suffix truncation truncate away significantly more attributes on
average.

The logic for choosing a leaf split point now uses a fallback mode in
the case where the page is full of duplicates and it isn't possible to
find even a minimally distinguishing split point.  When the page is full
of duplicates, the split should pack the left half very tightly, while
leaving the right half mostly empty.  Our assumption is that logical
duplicates will almost always be inserted in ascending heap TID order
with v4 indexes.  This strategy leaves most of the free space on the
half of the split that will likely be where future logical duplicates of
the same value need to be placed.

The number of cycles added is not very noticeable.  This is important
because deciding on a split point takes place while at least one
exclusive buffer lock is held.  We avoid using authoritative insertion
scankey comparisons to save cycles, unlike suffix truncation proper.  We
use a faster binary comparison instead.

Note that even pg_upgrade'd v3 indexes make use of these optimizations.
Benchmarking has shown that even v3 indexes benefit, despite the fact
that suffix truncation will only truncate non-key attributes in INCLUDE
indexes.  Grouping relatively similar tuples together is beneficial in
and of itself, since it reduces the number of leaf pages that must be
accessed by subsequent index scans.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzmmoLNQOj9mAD78iQHfWLJDszHEDrAzGTUMG3mVh5xWPw@mail.gmail.com
---
 src/backend/access/nbtree/Makefile      |   2 +-
 src/backend/access/nbtree/README        |  47 +-
 src/backend/access/nbtree/nbtinsert.c   | 287 +-------
 src/backend/access/nbtree/nbtsplitloc.c | 846 ++++++++++++++++++++++++
 src/backend/access/nbtree/nbtutils.c    |  55 ++
 src/include/access/nbtree.h             |  15 +-
 6 files changed, 961 insertions(+), 291 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtsplitloc.c
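
To make the above easier to follow, here is a small standalone sketch (not PostgreSQL code) of the two-phase idea implemented by nbtsplitloc.c below: candidate split points are first ranked by a fillfactor-weighted free space delta, and then the lowest-penalty candidate within a small interval is chosen, where the penalty is the first attribute that distinguishes the would-be lastleft and firstright tuples.  All key values, free space numbers, and the interval size here are invented for illustration.

/*
 * Toy sketch of the two-phase split point choice (not PostgreSQL code).
 * Tuples are reduced to arrays of integer key attributes; free space
 * numbers are invented.  Phase one sorts candidate splits by a
 * fillfactor-weighted imbalance; phase two picks the candidate within a
 * small interval whose first distinguishing attribute comes earliest.
 */
#include <stdio.h>
#include <stdlib.h>

#define NKEYATTS 3
#define INTERVAL 3

typedef struct
{
    int     firstright;     /* index of first tuple on the right half */
    int     leftfree;       /* invented free space on the left half */
    int     rightfree;      /* invented free space on the right half */
    double  delta;          /* weighted imbalance, filled in below */
} CandidateSplit;

/* keys[i] is the tuple that becomes firstright for a split before item i */
static const int keys[][NKEYATTS] = {
    {1, 7, 4}, {1, 7, 9}, {1, 8, 2}, {2, 1, 1}, {2, 1, 5}, {2, 3, 0}
};

static int
delta_cmp(const void *a, const void *b)
{
    const CandidateSplit *s1 = a;
    const CandidateSplit *s2 = b;

    return (s1->delta > s2->delta) - (s1->delta < s2->delta);
}

/* number of leading attributes needed to tell the two tuples apart */
static int
split_penalty(const int *lastleft, const int *firstright)
{
    int     keep = 1;

    for (int att = 0; att < NKEYATTS; att++)
    {
        if (lastleft[att] != firstright[att])
            break;
        keep++;
    }
    return keep;
}

int
main(void)
{
    CandidateSplit splits[] = {
        {1, 400, 380, 0}, {2, 360, 420, 0}, {3, 390, 390, 0},
        {4, 330, 450, 0}, {5, 300, 480, 0}
    };
    int     nsplits = sizeof(splits) / sizeof(splits[0]);
    double  fillfactormult = 0.5;   /* non-rightmost leaf: 50:50 split */
    int     best = 0;
    int     bestpenalty = NKEYATTS + 1;

    /* Phase one: rank candidates by weighted free space imbalance */
    for (int i = 0; i < nsplits; i++)
    {
        double  d = fillfactormult * splits[i].leftfree -
                    (1.0 - fillfactormult) * splits[i].rightfree;

        splits[i].delta = (d < 0) ? -d : d;
    }
    qsort(splits, nsplits, sizeof(CandidateSplit), delta_cmp);

    /* Phase two: lowest penalty within the first INTERVAL candidates wins */
    for (int i = 0; i < INTERVAL && i < nsplits; i++)
    {
        int     fr = splits[i].firstright;
        int     penalty = split_penalty(keys[fr - 1], keys[fr]);

        if (penalty < bestpenalty)
        {
            bestpenalty = penalty;
            best = i;
        }
    }

    printf("split before item %d (penalty %d)\n",
           splits[best].firstright, bestpenalty);
    return 0;
}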

diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bbb21d235c..9aab9cf64a 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = nbtcompare.o nbtinsert.o nbtpage.o nbtree.o nbtsearch.o \
-       nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
+       nbtsplitloc.o nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 40ff25fe06..ca4fdf7ac4 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -155,9 +155,9 @@ Lehman and Yao assume fixed-size keys, but we must deal with
 variable-size keys.  Therefore there is not a fixed maximum number of
 keys per page; we just stuff in as many as will fit.  When we split a
 page, we try to equalize the number of bytes, not items, assigned to
-each of the resulting pages.  Note we must include the incoming item in
-this calculation, otherwise it is possible to find that the incoming
-item doesn't fit on the split page where it needs to go!
+pages (though suffix truncation is also considered).  Note we must include
+the incoming item in this calculation, otherwise it is possible to find
+that the incoming item doesn't fit on the split page where it needs to go!
 
 The Deletion Algorithm
 ----------------------
@@ -661,6 +661,47 @@ variable-length types, such as text.  An opclass support function could
 manufacture the shortest possible key value that still correctly separates
 each half of a leaf page split.
 
+There are sophisticated criteria for choosing a leaf page split point.  The
+general idea is to make suffix truncation effective without unduly
+influencing the balance of space for each half of the page split.  The
+choice of leaf split point can be thought of as a choice among points
+*between* items on the page to be split, at least if you pretend that the
+incoming tuple was placed on the page already (you have to pretend because
+there won't actually be enough space for it on the page).  Choosing the
+split point between two index tuples where the first non-equal attribute
+appears as early as possible results in truncating away as many suffix
+attributes as possible.  Evenly balancing space between the two halves of the
+split is usually the first concern, but even small adjustments in the
+precise split point can allow truncation to be far more effective.
+
+Suffix truncation is primarily valuable because it makes pivot tuples
+smaller, which delays splits of internal pages, but that isn't the only
+reason why it's effective.  Even truncation that doesn't make pivot tuples
+smaller due to alignment still prevents pivot tuples from being more
+restrictive than truly necessary in how they describe which values belong
+on which pages.
+
+While it's not possible to correctly perform suffix truncation during
+internal page splits, it's still useful to be discriminating when splitting
+an internal page.  The chosen split point is the one implying the smallest
+downlink to insert in the parent, among those within an acceptable range
+of the fillfactor-wise optimal point.  This idea also comes from the
+Prefix B-Tree paper.  This process has much in common with what happens
+at the leaf level to make suffix truncation effective.  The overall
+effect is that suffix truncation tends to produce smaller, more
+discriminating pivot tuples, especially early in the lifetime of the index,
+while biasing internal page splits makes the earlier, smaller pivot tuples
+end up in the root page, delaying root page splits.
+
+Logical duplicates are given special consideration.  The logic for
+selecting a split point goes to great lengths to avoid having duplicates
+span more than one page, and almost always manages to pick a split point
+between two user-key-distinct tuples, accepting a completely lopsided split
+if it must.  When a page that's already full of duplicates must be split,
+the fallback strategy assumes that duplicates are mostly inserted in
+ascending heap TID order.  The page is split in a way that leaves the left
+half of the page mostly full, and the right half of the page mostly empty.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index d28d18496f..022e8d7d84 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,26 +28,6 @@
 /* Minimum tree height for application of fastpath optimization */
 #define BTREE_FASTPATH_MIN_LEVEL	2
 
-typedef struct
-{
-	/* context data for _bt_checksplitloc */
-	Size		newitemsz;		/* size of new item to be inserted */
-	int			fillfactor;		/* needed when splitting rightmost page */
-	bool		is_leaf;		/* T if splitting a leaf page */
-	bool		is_rightmost;	/* T if splitting a rightmost page */
-	OffsetNumber newitemoff;	/* where the new item is to be inserted */
-	int			leftspace;		/* space available for items on left page */
-	int			rightspace;		/* space available for items on right page */
-	int			olddataitemstotal;	/* space taken by old items */
-
-	bool		have_split;		/* found a valid split? */
-
-	/* these fields valid only if have_split is true */
-	bool		newitemonleft;	/* new item on left or right of best split */
-	OffsetNumber firstright;	/* best split point */
-	int			best_delta;		/* best size delta so far */
-} FindSplitData;
-
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
@@ -74,13 +54,6 @@ static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
-static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft);
-static void _bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright, bool newitemonleft,
-				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key,
@@ -1011,7 +984,7 @@ _bt_insertonpg(Relation rel,
 
 		/* Choose the split point */
 		firstright = _bt_findsplitloc(rel, page,
-									  newitemoff, itemsz,
+									  newitemoff, itemsz, itup,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
@@ -1695,264 +1668,6 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	return rbuf;
 }
 
-/*
- *	_bt_findsplitloc() -- find an appropriate place to split a page.
- *
- * The idea here is to equalize the free space that will be on each split
- * page, *after accounting for the inserted tuple*.  (If we fail to account
- * for it, we might find ourselves with too little room on the page that
- * it needs to go into!)
- *
- * If the page is the rightmost page on its level, we instead try to arrange
- * to leave the left split page fillfactor% full.  In this way, when we are
- * inserting successively increasing keys (consider sequences, timestamps,
- * etc) we will end up with a tree whose pages are about fillfactor% full,
- * instead of the 50% full result that we'd get without this special case.
- * This is the same as nbtsort.c produces for a newly-created tree.  Note
- * that leaf and nonleaf pages use different fillfactors.
- *
- * We are passed the intended insert position of the new tuple, expressed as
- * the offsetnumber of the tuple it must go in front of.  (This could be
- * maxoff+1 if the tuple is to go at the end.)
- *
- * We return the index of the first existing tuple that should go on the
- * righthand page, plus a boolean indicating whether the new tuple goes on
- * the left or right page.  The bool is necessary to disambiguate the case
- * where firstright == newitemoff.
- */
-static OffsetNumber
-_bt_findsplitloc(Relation rel,
-				 Page page,
-				 OffsetNumber newitemoff,
-				 Size newitemsz,
-				 bool *newitemonleft)
-{
-	BTPageOpaque opaque;
-	OffsetNumber offnum;
-	OffsetNumber maxoff;
-	ItemId		itemid;
-	FindSplitData state;
-	int			leftspace,
-				rightspace,
-				goodenough,
-				olddataitemstotal,
-				olddataitemstoleft;
-	bool		goodenoughfound;
-
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
-	newitemsz += sizeof(ItemIdData);
-
-	/* Total free space available on a btree page, after fixed overhead */
-	leftspace = rightspace =
-		PageGetPageSize(page) - SizeOfPageHeaderData -
-		MAXALIGN(sizeof(BTPageOpaqueData));
-
-	/* The right page will have the same high key as the old page */
-	if (!P_RIGHTMOST(opaque))
-	{
-		itemid = PageGetItemId(page, P_HIKEY);
-		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
-							 sizeof(ItemIdData));
-	}
-
-	/* Count up total space in data items without actually scanning 'em */
-	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
-
-	state.newitemsz = newitemsz;
-	state.is_leaf = P_ISLEAF(opaque);
-	state.is_rightmost = P_RIGHTMOST(opaque);
-	state.have_split = false;
-	if (state.is_leaf)
-		state.fillfactor = RelationGetFillFactor(rel,
-												 BTREE_DEFAULT_FILLFACTOR);
-	else
-		state.fillfactor = BTREE_NONLEAF_FILLFACTOR;
-	state.newitemonleft = false;	/* these just to keep compiler quiet */
-	state.firstright = 0;
-	state.best_delta = 0;
-	state.leftspace = leftspace;
-	state.rightspace = rightspace;
-	state.olddataitemstotal = olddataitemstotal;
-	state.newitemoff = newitemoff;
-
-	/*
-	 * Finding the best possible split would require checking all the possible
-	 * split points, because of the high-key and left-key special cases.
-	 * That's probably more work than it's worth; instead, stop as soon as we
-	 * find a "good-enough" split, where good-enough is defined as an
-	 * imbalance in free space of no more than pagesize/16 (arbitrary...) This
-	 * should let us stop near the middle on most pages, instead of plowing to
-	 * the end.
-	 */
-	goodenough = leftspace / 16;
-
-	/*
-	 * Scan through the data items and calculate space usage for a split at
-	 * each possible position.
-	 */
-	olddataitemstoleft = 0;
-	goodenoughfound = false;
-	maxoff = PageGetMaxOffsetNumber(page);
-
-	for (offnum = P_FIRSTDATAKEY(opaque);
-		 offnum <= maxoff;
-		 offnum = OffsetNumberNext(offnum))
-	{
-		Size		itemsz;
-
-		itemid = PageGetItemId(page, offnum);
-		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
-
-		/*
-		 * Will the new item go to left or right of split?
-		 */
-		if (offnum > newitemoff)
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-		else if (offnum < newitemoff)
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		else
-		{
-			/* need to try it both ways! */
-			_bt_checksplitloc(&state, offnum, true,
-							  olddataitemstoleft, itemsz);
-
-			_bt_checksplitloc(&state, offnum, false,
-							  olddataitemstoleft, itemsz);
-		}
-
-		/* Abort scan once we find a good-enough choice */
-		if (state.have_split && state.best_delta <= goodenough)
-		{
-			goodenoughfound = true;
-			break;
-		}
-
-		olddataitemstoleft += itemsz;
-	}
-
-	/*
-	 * If the new item goes as the last item, check for splitting so that all
-	 * the old items go to the left page and the new item goes to the right
-	 * page.
-	 */
-	if (newitemoff > maxoff && !goodenoughfound)
-		_bt_checksplitloc(&state, newitemoff, false, olddataitemstotal, 0);
-
-	/*
-	 * I believe it is not possible to fail to find a feasible split, but just
-	 * in case ...
-	 */
-	if (!state.have_split)
-		elog(ERROR, "could not find a feasible split point for index \"%s\"",
-			 RelationGetRelationName(rel));
-
-	*newitemonleft = state.newitemonleft;
-	return state.firstright;
-}
-
-/*
- * Subroutine to analyze a particular possible split choice (ie, firstright
- * and newitemonleft settings), and record the best split so far in *state.
- *
- * firstoldonright is the offset of the first item on the original page
- * that goes to the right page, and firstoldonrightsz is the size of that
- * tuple. firstoldonright can be > max offset, which means that all the old
- * items go to the left page and only the new item goes to the right page.
- * In that case, firstoldonrightsz is not used.
- *
- * olddataitemstoleft is the total size of all old items to the left of
- * firstoldonright.
- */
-static void
-_bt_checksplitloc(FindSplitData *state,
-				  OffsetNumber firstoldonright,
-				  bool newitemonleft,
-				  int olddataitemstoleft,
-				  Size firstoldonrightsz)
-{
-	int			leftfree,
-				rightfree;
-	Size		firstrightitemsz;
-	bool		newitemisfirstonright;
-
-	/* Is the new item going to be the first item on the right page? */
-	newitemisfirstonright = (firstoldonright == state->newitemoff
-							 && !newitemonleft);
-
-	if (newitemisfirstonright)
-		firstrightitemsz = state->newitemsz;
-	else
-		firstrightitemsz = firstoldonrightsz;
-
-	/* Account for all the old tuples */
-	leftfree = state->leftspace - olddataitemstoleft;
-	rightfree = state->rightspace -
-		(state->olddataitemstotal - olddataitemstoleft);
-
-	/*
-	 * The first item on the right page becomes the high key of the left page;
-	 * therefore it counts against left space as well as right space. When
-	 * index has included attributes, then those attributes of left page high
-	 * key will be truncated leaving that page with slightly more free space.
-	 * However, that shouldn't affect our ability to find valid split
-	 * location, because anyway split location should exists even without high
-	 * key truncation.
-	 */
-	leftfree -= firstrightitemsz;
-
-	/* account for the new item */
-	if (newitemonleft)
-		leftfree -= (int) state->newitemsz;
-	else
-		rightfree -= (int) state->newitemsz;
-
-	/*
-	 * If we are not on the leaf level, we will be able to discard the key
-	 * data from the first item that winds up on the right page.
-	 */
-	if (!state->is_leaf)
-		rightfree += (int) firstrightitemsz -
-			(int) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
-
-	/*
-	 * If feasible split point, remember best delta.
-	 */
-	if (leftfree >= 0 && rightfree >= 0)
-	{
-		int			delta;
-
-		if (state->is_rightmost)
-		{
-			/*
-			 * If splitting a rightmost page, try to put (100-fillfactor)% of
-			 * free space on left page. See comments for _bt_findsplitloc.
-			 */
-			delta = (state->fillfactor * leftfree)
-				- ((100 - state->fillfactor) * rightfree);
-		}
-		else
-		{
-			/* Otherwise, aim for equal free space on both sides */
-			delta = leftfree - rightfree;
-		}
-
-		if (delta < 0)
-			delta = -delta;
-		if (!state->have_split || delta < state->best_delta)
-		{
-			state->have_split = true;
-			state->newitemonleft = newitemonleft;
-			state->firstright = firstoldonright;
-			state->best_delta = delta;
-		}
-	}
-}
-
 /*
  * _bt_insert_parent() -- Insert downlink into parent after a page split.
  *
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
new file mode 100644
index 0000000000..23cc593dc8
--- /dev/null
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -0,0 +1,846 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtsplitloc.c
+ *	  Choose split point code for Postgres btree implementation.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtsplitloc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "storage/lmgr.h"
+
+/* limits on split interval (default strategy only) */
+#define MAX_LEAF_INTERVAL			9
+#define MAX_INTERNAL_INTERVAL		18
+
+typedef enum
+{
+	/* strategy for searching through materialized list of split points */
+	SPLIT_DEFAULT,				/* give some weight to truncation */
+	SPLIT_MANY_DUPLICATES,		/* find minimally distinguishing point */
+	SPLIT_SINGLE_VALUE			/* leave left page almost full */
+} FindSplitStrat;
+
+typedef struct
+{
+	/* details of free space left by split */
+	int16		curdelta;		/* current leftfree/rightfree delta */
+	int16		leftfree;		/* space left on left page post-split */
+	int16		rightfree;		/* space left on right page post-split */
+
+	/* split point identifying fields (returned by _bt_findsplitloc) */
+	OffsetNumber firstoldonright;	/* first item on new right page */
+	bool		newitemonleft;	/* new item goes on left, or right? */
+
+} SplitPoint;
+
+typedef struct
+{
+	/* context data for _bt_recsplitloc */
+	Relation	rel;			/* index relation */
+	Page		page;			/* page undergoing split */
+	IndexTuple	newitem;		/* new item (cause of page split) */
+	Size		newitemsz;		/* size of newitem (includes line pointer) */
+	bool		is_leaf;		/* T if splitting a leaf page */
+	bool		is_rightmost;	/* T if splitting rightmost page on level */
+	OffsetNumber newitemoff;	/* where the new item is to be inserted */
+	int			leftspace;		/* space available for items on left page */
+	int			rightspace;		/* space available for items on right page */
+	int			olddataitemstotal;	/* space taken by old items */
+	Size		minfirstrightsz;	/* smallest firstoldonright tuple size */
+
+	/* candidate split point data */
+	int			maxsplits;		/* maximum number of splits */
+	int			nsplits;		/* current number of splits */
+	SplitPoint *splits;			/* all candidate split points for page */
+	int			interval;		/* current range of acceptable split points */
+} FindSplitData;
+
+static void _bt_recsplitloc(FindSplitData *state,
+				OffsetNumber firstoldonright, bool newitemonleft,
+				int olddataitemstoleft, Size firstoldonrightsz);
+static void _bt_deltasortsplits(FindSplitData *state, double fillfactormult,
+					bool usemult);
+static int	_bt_splitcmp(const void *arg1, const void *arg2);
+static OffsetNumber _bt_bestsplitloc(FindSplitData *state, int perfectpenalty,
+				 bool *newitemonleft);
+static int _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
+			 SplitPoint *rightpage, FindSplitStrat *strategy);
+static void _bt_interval_edges(FindSplitData *state,
+				   SplitPoint **leftinterval, SplitPoint **rightinterval);
+static inline int _bt_split_penalty(FindSplitData *state, SplitPoint *split);
+static inline IndexTuple _bt_split_lastleft(FindSplitData *state,
+				   SplitPoint *split);
+static inline IndexTuple _bt_split_firstright(FindSplitData *state,
+					 SplitPoint *split);
+
+
+/*
+ *	_bt_findsplitloc() -- find an appropriate place to split a page.
+ *
+ * The main goal here is to equalize the free space that will be on each
+ * split page, *after accounting for the inserted tuple*.  (If we fail to
+ * account for it, we might find ourselves with too little room on the page
+ * that it needs to go into!)
+ *
+ * If the page is the rightmost page on its level, we instead try to arrange
+ * to leave the left split page fillfactor% full.  In this way, when we are
+ * inserting successively increasing keys (consider sequences, timestamps,
+ * etc) we will end up with a tree whose pages are about fillfactor% full,
+ * instead of the 50% full result that we'd get without this special case.
+ * This is the same as nbtsort.c produces for a newly-created tree.  Note
+ * that leaf and nonleaf pages use different fillfactors.  Note also that
+ * there are a number of further special cases where fillfactor is not
+ * applied in the standard way.
+ *
+ * We are passed the intended insert position of the new tuple, expressed as
+ * the offsetnumber of the tuple it must go in front of (this could be
+ * maxoff+1 if the tuple is to go at the end).  The new tuple itself is also
+ * passed, since it's needed to give some weight to how effective suffix
+ * truncation will be.  The implementation picks the split point that
+ * maximizes the effectiveness of suffix truncation from a small list of
+ * alternative candidate split points that leave each side of the split with
+ * about the same share of free space.  Suffix truncation is secondary to
+ * equalizing free space, except in cases with large numbers of duplicates.
+ * Note that it is always assumed that caller goes on to perform truncation,
+ * even with pg_upgrade'd indexes where that isn't actually the case
+ * (!heapkeyspace indexes).  See nbtree/README for more information about
+ * suffix truncation.
+ *
+ * We return the index of the first existing tuple that should go on the
+ * righthand page, plus a boolean indicating whether the new tuple goes on
+ * the left or right page.  The bool is necessary to disambiguate the case
+ * where firstright == newitemoff.
+ */
+OffsetNumber
+_bt_findsplitloc(Relation rel,
+				 Page page,
+				 OffsetNumber newitemoff,
+				 Size newitemsz,
+				 IndexTuple newitem,
+				 bool *newitemonleft)
+{
+	BTPageOpaque opaque;
+	int			leftspace,
+				rightspace,
+				olddataitemstotal,
+				olddataitemstoleft,
+				perfectpenalty,
+				leaffillfactor;
+	FindSplitData state;
+	FindSplitStrat strategy;
+	ItemId		itemid;
+	OffsetNumber offnum,
+				maxoff,
+				foundfirstright;
+	double		fillfactormult;
+	bool		usemult;
+	SplitPoint	leftpage,
+				rightpage;
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Total free space available on a btree page, after fixed overhead */
+	leftspace = rightspace =
+		PageGetPageSize(page) - SizeOfPageHeaderData -
+		MAXALIGN(sizeof(BTPageOpaqueData));
+
+	/* The right page will have the same high key as the old page */
+	if (!P_RIGHTMOST(opaque))
+	{
+		itemid = PageGetItemId(page, P_HIKEY);
+		rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
+							 sizeof(ItemIdData));
+	}
+
+	/* Count up total space in data items before actually scanning 'em */
+	olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
+	leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+
+	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+	newitemsz += sizeof(ItemIdData);
+	state.rel = rel;
+	state.page = page;
+	state.newitem = newitem;
+	state.newitemsz = newitemsz;
+	state.is_leaf = P_ISLEAF(opaque);
+	state.is_rightmost = P_RIGHTMOST(opaque);
+	state.leftspace = leftspace;
+	state.rightspace = rightspace;
+	state.olddataitemstotal = olddataitemstotal;
+	state.minfirstrightsz = SIZE_MAX;
+	state.newitemoff = newitemoff;
+
+	/*
+	 * maxsplits should never exceed maxoff because there will be at most as
+	 * many candidate split points as there are points _between_ tuples, once
+	 * you imagine that the new item is already on the original page (the
+	 * final number of splits may be slightly lower because not all points
+	 * between tuples will be legal).
+	 */
+	state.maxsplits = maxoff;
+	state.splits = palloc(sizeof(SplitPoint) * state.maxsplits);
+	state.nsplits = 0;
+
+	/*
+	 * Scan through the data items and calculate space usage for a split at
+	 * each possible position.  We start at the first data offset rather than
+	 * the second data offset to handle the "newitemoff == first data offset"
+	 * case (any other split whose firstoldonright is the first data offset
+	 * can't be legal, though, and so won't actually end up being recorded in
+	 * the first loop iteration).
+	 */
+	olddataitemstoleft = 0;
+
+	for (offnum = P_FIRSTDATAKEY(opaque);
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		Size		itemsz;
+
+		itemid = PageGetItemId(page, offnum);
+		itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
+
+		/*
+		 * Will the new item go to left or right of split?
+		 */
+		if (offnum > newitemoff)
+			_bt_recsplitloc(&state, offnum, true, olddataitemstoleft, itemsz);
+		else if (offnum < newitemoff)
+			_bt_recsplitloc(&state, offnum, false, olddataitemstoleft, itemsz);
+		else
+		{
+			/* may need to record a split on one or both sides of new item */
+			_bt_recsplitloc(&state, offnum, true, olddataitemstoleft, itemsz);
+			_bt_recsplitloc(&state, offnum, false, olddataitemstoleft, itemsz);
+		}
+
+		olddataitemstoleft += itemsz;
+	}
+
+	/*
+	 * If the new item goes as the last item, record the split point that
+	 * leaves all the old items on the left page, and the new item on the
+	 * right page.  This is required because a split that leaves the new item
+	 * as the firstoldonright won't have been reached within the loop.
+	 */
+	Assert(olddataitemstoleft == olddataitemstotal);
+	if (newitemoff > maxoff)
+		_bt_recsplitloc(&state, newitemoff, false, olddataitemstotal, 0);
+
+	/*
+	 * I believe it is not possible to fail to find a feasible split, but just
+	 * in case ...
+	 */
+	if (state.nsplits == 0)
+		elog(ERROR, "could not find a feasible split point for index \"%s\"",
+			 RelationGetRelationName(rel));
+
+	/*
+	 * Start search for a split point among list of legal split points.  Give
+	 * primary consideration to equalizing available free space in each half
+	 * of the split initially (start with default strategy), while applying
+	 * the rightmost-page fillfactor where appropriate.  Either of the two
+	 * other fallback strategies may be required for cases with a large
+	 * number of duplicates around the original/space-optimal split point.
+	 *
+	 * Default strategy gives some weight to suffix truncation in deciding a
+	 * split point on leaf pages.  It attempts to select a split point where a
+	 * distinguishing attribute appears earlier in the new high key for the
+	 * left side of the split, in order to maximize the number of trailing
+	 * attributes that can be truncated away.  Only candidate split points
+	 * that imply an acceptable balance of free space on each side are
+	 * considered.
+	 */
+	if (!state.is_leaf)
+	{
+		/* fillfactormult only used on rightmost page */
+		usemult = state.is_rightmost;
+		fillfactormult = BTREE_NONLEAF_FILLFACTOR / 100.0;
+	}
+	else if (state.is_rightmost)
+	{
+		/* Rightmost leaf page --  fillfactormult always used */
+		usemult = true;
+		fillfactormult = leaffillfactor / 100.0;
+	}
+	else
+	{
+		/* Other leaf page.  50:50 page split. */
+		usemult = false;
+		/* fillfactormult not used, but be tidy */
+		fillfactormult = 0.50;
+	}
+
+	/*
+	 * Set an initial limit on the split interval/number of candidate split
+	 * points as appropriate.  The "Prefix B-Trees" paper refers to this as
+	 * sigma l for leaf splits and sigma b for internal ("branch") splits.
+	 * It's hard to provide a theoretical justification for the initial size
+	 * of the split interval, though it's clear that a small split interval
+	 * makes suffix truncation much more effective without noticeably
+	 * affecting space utilization over time.
+	 */
+	state.interval = Min(Max(1, state.nsplits * 0.05),
+						 state.is_leaf ? MAX_LEAF_INTERVAL :
+						 MAX_INTERNAL_INTERVAL);
+
+	/*
+	 * Save leftmost and rightmost splits for page before original ordinal
+	 * sort order is lost by delta/fillfactormult sort
+	 */
+	leftpage = state.splits[0];
+	rightpage = state.splits[state.nsplits - 1];
+
+	/* Give split points a fillfactormult-wise delta, and sort on deltas */
+	_bt_deltasortsplits(&state, fillfactormult, usemult);
+
+	/*
+	 * Determine if default strategy/split interval will produce a
+	 * sufficiently distinguishing split, or if we should change strategies.
+	 * Alternative strategies change the range of split points that are
+	 * considered acceptable (split interval), and possibly change
+	 * fillfactormult, in order to deal with pages with a large number of
+	 * duplicates gracefully.
+	 *
+	 * Pass low and high splits for the entire page (including even newitem).
+	 * These are used when the initial split interval encloses split points
+	 * that are full of duplicates, and we need to consider if it's even
+	 * possible to avoid appending a heap TID.
+	 */
+	perfectpenalty = _bt_strategy(&state, &leftpage, &rightpage, &strategy);
+
+	if (strategy == SPLIT_DEFAULT)
+	{
+		/*
+		 * Default strategy worked out (always works out with internal page).
+		 * Original split interval still stands.
+		 */
+	}
+
+	/*
+	 * Many duplicates strategy is used when a heap TID would otherwise be
+	 * appended, but the page isn't completely full of logical duplicates.
+	 *
+	 * The split interval is widened to include all legal candidate split
+	 * points.  There may be a few as two distinct values in the whole-page
+	 * points.  There may be as few as two distinct values in the whole-page
+	 * space utilization, though it still keeps the use of space balanced as a
+	 * non-binding secondary goal (perfect penalty is set so that the
+	 * first/lowest delta split point that avoids appending a heap TID is
+	 * used).
+	 *
+	 * Single value strategy is used when it is impossible to avoid appending
+	 * a heap TID.  It arranges to leave the left page very full.  This
+	 * maximizes space utilization in cases where tuples with the same
+	 * attribute values span many pages.  Newly inserted duplicates will tend
+	 * to have higher heap TID values, so we'll end up splitting to the right
+	 * consistently.  (Single value strategy is harmless though not
+	 * particularly useful with !heapkeyspace indexes.)
+	 */
+	else if (strategy == SPLIT_MANY_DUPLICATES)
+	{
+		Assert(state.is_leaf);
+		/* No need to resort splits -- no change in fillfactormult/deltas */
+		state.interval = state.nsplits;
+	}
+	else if (strategy == SPLIT_SINGLE_VALUE)
+	{
+		Assert(state.is_leaf);
+		/* Split near the end of the page */
+		usemult = true;
+		fillfactormult = BTREE_SINGLEVAL_FILLFACTOR / 100.0;
+		/* Resort split points with new delta */
+		_bt_deltasortsplits(&state, fillfactormult, usemult);
+		/* Appending a heap TID is unavoidable, so interval of 1 is fine */
+		state.interval = 1;
+	}
+
+	/*
+	 * Search among acceptable split points (using final split interval) for
+	 * the entry that has the lowest penalty, and is therefore expected to
+	 * maximize fan-out.  Sets *newitemonleft for us.
+	 */
+	foundfirstright = _bt_bestsplitloc(&state, perfectpenalty, newitemonleft);
+	pfree(state.splits);
+
+	return foundfirstright;
+}
+
+/*
+ * Subroutine to record a particular point between two tuples (possibly the
+ * new item) on page (ie, combination of firstright and newitemonleft
+ * settings) in *state for later analysis.  This is also a convenient point
+ * to check if the split is legal (if it isn't, it won't be recorded).
+ *
+ * firstoldonright is the offset of the first item on the original page that
+ * goes to the right page, and firstoldonrightsz is the size of that tuple.
+ * firstoldonright can be > max offset, which means that all the old items go
+ * to the left page and only the new item goes to the right page.  In that
+ * case, firstoldonrightsz is not used.
+ *
+ * olddataitemstoleft is the total size of all old items to the left of the
+ * split point that is recorded here when legal.  Should not include
+ * newitemsz, since that is handled here.
+ */
+static void
+_bt_recsplitloc(FindSplitData *state,
+				OffsetNumber firstoldonright,
+				bool newitemonleft,
+				int olddataitemstoleft,
+				Size firstoldonrightsz)
+{
+	int16		leftfree,
+				rightfree;
+	Size		firstrightitemsz;
+	bool		newitemisfirstonright;
+
+	/* Is the new item going to be the first item on the right page? */
+	newitemisfirstonright = (firstoldonright == state->newitemoff
+							 && !newitemonleft);
+
+	if (newitemisfirstonright)
+		firstrightitemsz = state->newitemsz;
+	else
+		firstrightitemsz = firstoldonrightsz;
+
+	/* Account for all the old tuples */
+	leftfree = state->leftspace - olddataitemstoleft;
+	rightfree = state->rightspace -
+		(state->olddataitemstotal - olddataitemstoleft);
+
+	/*
+	 * The first item on the right page becomes the high key of the left page;
+	 * therefore it counts against left space as well as right space (we
+	 * cannot assume that suffix truncation will make it any smaller).  When
+	 * index has included attributes, then those attributes of left page high
+	 * the index has included attributes, those attributes of the left page's
+	 * high key will be truncated, leaving that page with slightly more free
+	 * space.  However, that shouldn't affect our ability to find a valid split
+	 * space on the left half.  Besides, even when suffix truncation of
+	 * non-TID attributes occurs, the new high key often won't even be a
+	 * single MAXALIGN() quantum smaller than the firstright tuple it's based
+	 * on.
+	 *
+	 * If we are on the leaf level, assume that suffix truncation cannot avoid
+	 * adding a heap TID to the left half's new high key when splitting at the
+	 * leaf level.  In practice the new high key will often be smaller and
+	 * will rarely be larger, but conservatively assume the worst case.
+	 */
+	if (state->is_leaf)
+		leftfree -= (int16) (firstrightitemsz +
+							 MAXALIGN(sizeof(ItemPointerData)));
+	else
+		leftfree -= (int16) firstrightitemsz;
+
+	/* account for the new item */
+	if (newitemonleft)
+		leftfree -= (int16) state->newitemsz;
+	else
+		rightfree -= (int16) state->newitemsz;
+
+	/*
+	 * If we are not on the leaf level, we will be able to discard the key
+	 * data from the first item that winds up on the right page.
+	 */
+	if (!state->is_leaf)
+		rightfree += (int16) firstrightitemsz -
+			(int16) (MAXALIGN(sizeof(IndexTupleData)) + sizeof(ItemIdData));
+
+	/* Record split if legal */
+	if (leftfree >= 0 && rightfree >= 0)
+	{
+		Assert(state->nsplits < state->maxsplits);
+
+		/* Determine smallest firstright item size on page */
+		state->minfirstrightsz = Min(state->minfirstrightsz, firstrightitemsz);
+
+		state->splits[state->nsplits].curdelta = 0;
+		state->splits[state->nsplits].leftfree = leftfree;
+		state->splits[state->nsplits].rightfree = rightfree;
+		state->splits[state->nsplits].firstoldonright = firstoldonright;
+		state->splits[state->nsplits].newitemonleft = newitemonleft;
+		state->nsplits++;
+	}
+}
+
+/*
+ * Subroutine to assign space deltas to materialized array of candidate split
+ * points based on current fillfactor, and to sort array using that fillfactor
+ */
+static void
+_bt_deltasortsplits(FindSplitData *state, double fillfactormult,
+					bool usemult)
+{
+	for (int i = 0; i < state->nsplits; i++)
+	{
+		SplitPoint *split = state->splits + i;
+		int16		delta;
+
+		if (usemult)
+			delta = fillfactormult * split->leftfree -
+				(1.0 - fillfactormult) * split->rightfree;
+		else
+			delta = split->leftfree - split->rightfree;
+
+		if (delta < 0)
+			delta = -delta;
+
+		/* Save delta */
+		split->curdelta = delta;
+	}
+
+	qsort(state->splits, state->nsplits, sizeof(SplitPoint), _bt_splitcmp);
+}
+
+/*
+ * qsort-style comparator used by _bt_deltasortsplits()
+ */
+static int
+_bt_splitcmp(const void *arg1, const void *arg2)
+{
+	SplitPoint *split1 = (SplitPoint *) arg1;
+	SplitPoint *split2 = (SplitPoint *) arg2;
+
+	if (split1->curdelta > split2->curdelta)
+		return 1;
+	if (split1->curdelta < split2->curdelta)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * Subroutine to find the "best" split point among an array of acceptable
+ * candidate split points that split without there being an excessively high
+ * delta between the space left free on the left and right halves.  The "best"
+ * split point is the split point with the lowest penalty among split points
+ * that fall within current/final split interval.  Penalty is an abstract
+ * score, with a definition that varies depending on whether we're splitting a
+ * leaf page or an internal page.  See _bt_split_penalty() for details.
+ *
+ * "perfectpenalty" is assumed to be the lowest possible penalty among
+ * candidate split points.  This allows us to return early without wasting
+ * cycles on calculating the first differing attribute for all candidate
+ * splits when that clearly cannot improve our choice (or when we only want a
+ * minimally distinguishing split point, and don't want to make the split any
+ * more unbalanced than is necessary).
+ *
+ * We return the index of the first existing tuple that should go on the right
+ * page, plus a boolean indicating if new item is on left of split point.
+ */
+static OffsetNumber
+_bt_bestsplitloc(FindSplitData *state, int perfectpenalty, bool *newitemonleft)
+{
+	int			bestpenalty,
+				lowsplit;
+	int			highsplit = Min(state->interval, state->nsplits);
+
+	/* No point in calculating penalty when there's only one choice */
+	if (state->nsplits == 1)
+	{
+		*newitemonleft = state->splits[0].newitemonleft;
+		return state->splits[0].firstoldonright;
+	}
+
+	bestpenalty = INT_MAX;
+	lowsplit = 0;
+	for (int i = lowsplit; i < highsplit; i++)
+	{
+		int			penalty;
+
+		penalty = _bt_split_penalty(state, state->splits + i);
+
+		if (penalty <= perfectpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+			break;
+		}
+
+		if (penalty < bestpenalty)
+		{
+			bestpenalty = penalty;
+			lowsplit = i;
+		}
+	}
+
+	*newitemonleft = state->splits[lowsplit].newitemonleft;
+	return state->splits[lowsplit].firstoldonright;
+}
+
+/*
+ * Subroutine to decide whether split should use default strategy/initial
+ * split interval, or whether it should finish splitting the page using
+ * alternative strategies (this is only possible with leaf pages).
+ *
+ * Caller uses alternative strategy (or sticks with default strategy) based
+ * on how *strategy is set here.  Return value is "perfect penalty", which is
+ * passed to _bt_bestsplitloc() as a final constraint on how far caller is
+ * willing to go to avoid appending a heap TID when using the many duplicates
+ * strategy (it also saves _bt_bestsplitloc() useless cycles).
+ */
+static int
+_bt_strategy(FindSplitData *state, SplitPoint *leftpage,
+			 SplitPoint *rightpage, FindSplitStrat *strategy)
+{
+	IndexTuple	leftmost,
+				rightmost;
+	SplitPoint *leftinterval,
+			   *rightinterval;
+	int			perfectpenalty;
+	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
+
+	/* Assume that alternative strategy won't be used for now */
+	*strategy = SPLIT_DEFAULT;
+
+	/*
+	 * Use smallest observed first right item size for entire page as perfect
+	 * penalty on internal pages.  This can save cycles in the common case
+	 * where most or all splits (not just splits within interval) have first
+	 * right tuples that are the same size.
+	 */
+	if (!state->is_leaf)
+		return state->minfirstrightsz;
+
+	/*
+	 * Use leftmost and rightmost tuples from leftmost and rightmost splits in
+	 * current split interval
+	 */
+	_bt_interval_edges(state, &leftinterval, &rightinterval);
+	leftmost = _bt_split_lastleft(state, leftinterval);
+	rightmost = _bt_split_firstright(state, rightinterval);
+
+	/*
+	 * If initial split interval can produce a split point that will at least
+	 * avoid appending a heap TID in new high key, we're done.  Finish split
+	 * with default strategy and initial split interval.
+	 */
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	if (perfectpenalty <= indnkeyatts)
+		return perfectpenalty;
+
+	/*
+	 * Work out how caller should finish split when even their "perfect"
+	 * penalty for initial/default split interval indicates that the interval
+	 * does not contain even a single split that avoids appending a heap TID.
+	 *
+	 * Use the leftmost split's lastleft tuple and the rightmost split's
+	 * firstright tuple to assess every possible split.
+	 */
+	leftmost = _bt_split_lastleft(state, leftpage);
+	rightmost = _bt_split_firstright(state, rightpage);
+
+	/*
+	 * If page (including new item) has many duplicates but is not entirely
+	 * full of duplicates, a many duplicates strategy split will be performed.
+	 * If page is entirely full of duplicates, a single value strategy split
+	 * will be performed.
+	 */
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	if (perfectpenalty <= indnkeyatts)
+	{
+		*strategy = SPLIT_MANY_DUPLICATES;
+
+		/*
+		 * Caller should choose the lowest delta split that avoids appending a
+		 * heap TID.  Maximizing the number of attributes that can be
+		 * truncated away (returning perfectpenalty when it happens to be less
+		 * than the number of key attributes in index) can result in continual
+		 * unbalanced page splits.
+		 *
+		 * Just avoiding appending a heap TID can still make splits very
+		 * unbalanced, but this is self-limiting.  When final split has a very
+		 * high delta, one side of the split will likely consist of a single
+		 * value.  If that page is split once again, then that split will
+		 * likely use the single value strategy.
+		 */
+		return indnkeyatts;
+	}
+
+	/*
+	 * Single value strategy is only appropriate with ever-increasing heap
+	 * TIDs; otherwise, original default strategy split should proceed to
+	 * avoid pathological performance.  Use page high key to infer if this is
+	 * the rightmost page among pages that store the same duplicate value.
+	 * This should not prevent insertions of heap TIDs that are slightly out
+	 * of order from using single value strategy, since that's expected with
+	 * concurrent inserters of the same duplicate value.
+	 */
+	else if (state->is_rightmost)
+		*strategy = SPLIT_SINGLE_VALUE;
+	else
+	{
+		ItemId		itemid;
+		IndexTuple	hikey;
+
+		itemid = PageGetItemId(state->page, P_HIKEY);
+		hikey = (IndexTuple) PageGetItem(state->page, itemid);
+		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
+											 state->newitem);
+		if (perfectpenalty <= indnkeyatts)
+			*strategy = SPLIT_SINGLE_VALUE;
+		else
+		{
+			/*
+			 * Have caller finish split using default strategy, since page
+			 * does not appear to be the rightmost page for duplicates of the
+			 * value the page is filled with
+			 */
+		}
+	}
+
+	return perfectpenalty;
+}
+
+/*
+ * Subroutine to locate leftmost and rightmost splits for current/default
+ * split interval.  Note that it will be the same split iff there is only one
+ * split in the interval.
+ */
+static void
+_bt_interval_edges(FindSplitData *state, SplitPoint **leftinterval,
+				   SplitPoint **rightinterval)
+{
+	int			highsplit = Min(state->interval, state->nsplits);
+	SplitPoint *deltaoptimal;
+
+	deltaoptimal = state->splits;
+	*leftinterval = NULL;
+	*rightinterval = NULL;
+
+	/*
+	 * Delta is an absolute distance to optimal split point, so both the
+	 * leftmost and rightmost split point will usually be at the end of the
+	 * array
+	 */
+	for (int i = highsplit - 1; i >= 0; i--)
+	{
+		SplitPoint *distant = state->splits + i;
+
+		if (distant->firstoldonright < deltaoptimal->firstoldonright)
+		{
+			if (*leftinterval == NULL)
+				*leftinterval = distant;
+		}
+		else if (distant->firstoldonright > deltaoptimal->firstoldonright)
+		{
+			if (*rightinterval == NULL)
+				*rightinterval = distant;
+		}
+		else if (!distant->newitemonleft && deltaoptimal->newitemonleft)
+		{
+			/*
+			 * "incoming tuple will become first on right page" (distant) is
+			 * to the left of "incoming tuple will become last on left page"
+			 * (delta-optimal)
+			 */
+			Assert(distant->firstoldonright == state->newitemoff);
+			if (*leftinterval == NULL)
+				*leftinterval = distant;
+		}
+		else if (distant->newitemonleft && !deltaoptimal->newitemonleft)
+		{
+			/*
+			 * "incoming tuple will become last on left page" (distant) is to
+			 * the right of "incoming tuple will become first on right page"
+			 * (delta-optimal)
+			 */
+			Assert(distant->firstoldonright == state->newitemoff);
+			if (*rightinterval == NULL)
+				*rightinterval = distant;
+		}
+		else
+		{
+			/* There were only one or two splits in the initial split interval */
+			Assert(distant == deltaoptimal);
+			if (*leftinterval == NULL)
+				*leftinterval = distant;
+			if (*rightinterval == NULL)
+				*rightinterval = distant;
+		}
+
+		if (*leftinterval && *rightinterval)
+			return;
+	}
+
+	Assert(false);
+}
+
+/*
+ * Subroutine to find penalty for caller's candidate split point.
+ *
+ * On leaf pages, penalty is the attribute number that distinguishes each side
+ * of a split.  It's the last attribute that needs to be included in new high
+ * key for left page.  It can be greater than the number of key attributes in
+ * cases where a heap TID will need to be appended during truncation.
+ *
+ * On internal pages, penalty is simply the size of the first item on the
+ * right half of the split (including line pointer overhead).  This tuple will
+ * become the new high key for the left page.
+ */
+static inline int
+_bt_split_penalty(FindSplitData *state, SplitPoint *split)
+{
+	IndexTuple	lastleftuple;
+	IndexTuple	firstrighttuple;
+
+	if (!state->is_leaf)
+	{
+		ItemId		itemid;
+
+		if (!split->newitemonleft &&
+			split->firstoldonright == state->newitemoff)
+			return state->newitemsz;
+
+		itemid = PageGetItemId(state->page, split->firstoldonright);
+
+		return MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
+	}
+
+	lastleftuple = _bt_split_lastleft(state, split);
+	firstrighttuple = _bt_split_firstright(state, split);
+
+	Assert(lastleftuple != firstrighttuple);
+	return _bt_keep_natts_fast(state->rel, lastleftuple, firstrighttuple);
+}
+
+/*
+ * Subroutine to get a lastleft IndexTuple for a split point from the page
+ */
+static inline IndexTuple
+_bt_split_lastleft(FindSplitData *state, SplitPoint *split)
+{
+	ItemId		itemid;
+
+	if (split->newitemonleft && split->firstoldonright == state->newitemoff)
+		return state->newitem;
+
+	itemid = PageGetItemId(state->page,
+						   OffsetNumberPrev(split->firstoldonright));
+	return (IndexTuple) PageGetItem(state->page, itemid);
+}
+
+/*
+ * Subroutine to get a firstright IndexTuple for a split point from the page
+ */
+static inline IndexTuple
+_bt_split_firstright(FindSplitData *state, SplitPoint *split)
+{
+	ItemId		itemid;
+
+	if (!split->newitemonleft && split->firstoldonright == state->newitemoff)
+		return state->newitem;
+
+	itemid = PageGetItemId(state->page, split->firstoldonright);
+	return (IndexTuple) PageGetItem(state->page, itemid);
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 12fd0a7b0d..8fe54f6434 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -22,6 +22,7 @@
 #include "access/relscan.h"
 #include "miscadmin.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -2295,6 +2296,60 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	return keepnatts;
 }
 
+/*
+ * _bt_keep_natts_fast - fast bitwise variant of _bt_keep_natts.
+ *
+ * This is exported so that a candidate split point can have its effect on
+ * suffix truncation inexpensively evaluated ahead of time when finding a
+ * split location.  A naive bitwise approach to datum comparisons is used to
+ * save cycles.
+ *
+ * The approach taken here usually provides the same answer as _bt_keep_natts
+ * will (for the same pair of tuples from a heapkeyspace index), since the
+ * majority of btree opclasses can never indicate that two datums are equal
+ * unless they're bitwise equal (once detoasted).  Similarly, result may
+ * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
+ * though this is barely possible in practice.
+ *
+ * These issues must be acceptable to callers, typically because they're only
+ * concerned about making suffix truncation as effective as possible without
+ * leaving excessive amounts of free space on either side of page split.
+ * Callers can rely on the fact that attributes considered equal here are
+ * definitely also equal according to _bt_keep_natts.
+ */
+int
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= keysz; attnum++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+		Form_pg_attribute att;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		att = TupleDescAttr(itupdesc, attnum - 1);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
+}
+
 /*
  *  _bt_check_natts() -- Verify tuple has expected number of attributes.
  *
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 928db4c63d..acb01b3f00 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -160,11 +160,15 @@ typedef struct BTMetaPageData
  * For pages above the leaf level, we use a fixed 70% fillfactor.
  * The fillfactor is applied during index build and when splitting
  * a rightmost page; when splitting non-rightmost pages we try to
- * divide the data equally.
+ * divide the data equally.  When splitting a page that's entirely
+ * filled with a single value (duplicates), the effective leaf-page
+ * fillfactor is 96%, regardless of whether the page is a rightmost
+ * page.
  */
 #define BTREE_MIN_FILLFACTOR		10
 #define BTREE_DEFAULT_FILLFACTOR	90
 #define BTREE_NONLEAF_FILLFACTOR	70
+#define BTREE_SINGLEVAL_FILLFACTOR	96
 
 /*
  *	In general, the btree code tries to localize its knowledge about
@@ -713,6 +717,13 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 
+/*
+ * prototypes for functions in nbtsplitloc.c
+ */
+extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
+				 OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+				 bool *newitemonleft);
+
 /*
  * prototypes for functions in nbtpage.c
  */
@@ -779,6 +790,8 @@ extern bool btproperty(Oid index_oid, int attno,
 		   bool *res, bool *isnull);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 			 IndexTuple firstright, BTScanInsert itup_key);
+extern int _bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
+					IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 				OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
-- 
2.17.1

Attachment: v18-0006-Add-high-key-continuescan-optimization.patch (application/octet-stream)
From 16e0303409addd3ec1c7296848e5eaa8991fe271 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 14 Mar 2019 12:01:48 -0700
Subject: [PATCH v18 6/7] Add high key "continuescan" optimization.

Teach B-Tree forward index scans to check the high key before moving to
the next page, in the hope of discovering that the next page does not
need to be visited at all.  We already opportunistically force a key
check of the final non-pivot item, even when it's clear that it cannot
be returned to the scan due to being dead-to-all, for the same reason.
Since forcing the final item to be key checked no longer makes any
difference in the case of forward scans, the existing extra key check is
now only used for backwards scans (where the final non-pivot tuple is
the first non-pivot tuple on the page).  Like the existing check, the
new check won't always work out, but that seems like an acceptable price
to pay.

The new approach is more effective than just checking non-pivot tuples,
especially with composite indexes and non-unique indexes.  Recent
improvements to the logic for picking a split point make it likely that
relatively dissimilar high keys will appear on a page.  A distinguishing
key value that only appears on non-pivot tuples on the page to the right
will often be present in leaf high keys.

Note that even pg_upgrade'd v3 indexes make use of this optimization.

Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WzkOmUduME31QnuTFpimejuQoiZ-HOf0pOWeFZNhTMctvA@mail.gmail.com
---
 src/backend/access/nbtree/nbtsearch.c |  89 ++++++++++++++++++--
 src/backend/access/nbtree/nbtutils.c  | 113 +++++++++++---------------
 src/include/access/nbtree.h           |   5 +-
 3 files changed, 130 insertions(+), 77 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index f58774da82..8100654a55 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1406,8 +1406,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	OffsetNumber minoff;
 	OffsetNumber maxoff;
 	int			itemIndex;
-	IndexTuple	itup;
 	bool		continuescan;
+	int			indnatts;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1427,6 +1427,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
 	}
 
+	continuescan = true;		/* default assumption */
+	indnatts = IndexRelationGetNumberOfAttributes(scan->indexRelation);
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1468,23 +1470,60 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		while (offnum <= maxoff)
 		{
-			itup = _bt_checkkeys(scan, page, offnum, dir, &continuescan);
-			if (itup != NULL)
+			ItemId		iid = PageGetItemId(page, offnum);
+			IndexTuple	itup;
+
+			/*
+			 * If the scan specifies not to return killed tuples, then we
+			 * treat a killed tuple as not passing the qual.  It's a win to
+			 * not bother examining the tuple's index keys, but just skip to
+			 * the next tuple.
+			 */
+			if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+			{
+				offnum = OffsetNumberNext(offnum);
+				continue;
+			}
+
+			itup = (IndexTuple) PageGetItem(page, iid);
+
+			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
 				_bt_saveitem(so, itemIndex, offnum, itup);
 				itemIndex++;
 			}
+			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
-			{
-				/* there can't be any more matches, so stop */
-				so->currPos.moreRight = false;
 				break;
-			}
 
 			offnum = OffsetNumberNext(offnum);
 		}
 
+		/*
+		 * We don't need to visit the page to the right when the high key
+		 * indicates that no more matches will be found there.
+		 *
+		 * Checking the high key like this works out more often than you might
+		 * think.  Leaf page splits pick a split point between the two most
+		 * dissimilar tuples (this is weighed against the need to evenly share
+		 * free space).  Leaf pages with high key attribute values that can
+		 * only appear on non-pivot tuples on the right sibling page are
+		 * common.
+		 */
+		if (continuescan && !P_RIGHTMOST(opaque))
+		{
+			ItemId		iid = PageGetItemId(page, P_HIKEY);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, iid);
+			int			truncatt;
+
+			truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
+			_bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+		}
+
+		if (!continuescan)
+			so->currPos.moreRight = false;
+
 		Assert(itemIndex <= MaxIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
@@ -1499,8 +1538,40 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		while (offnum >= minoff)
 		{
-			itup = _bt_checkkeys(scan, page, offnum, dir, &continuescan);
-			if (itup != NULL)
+			ItemId		iid = PageGetItemId(page, offnum);
+			IndexTuple	itup;
+			bool		tuple_alive;
+			bool		passes_quals;
+
+			/*
+			 * If the scan specifies not to return killed tuples, then we
+			 * treat a killed tuple as not passing the qual.  Most of the
+			 * time, it's a win to not bother examining the tuple's index
+			 * keys, but just skip to the next tuple (previous, actually,
+			 * since we're scanning backwards).  However, if this is the first
+			 * tuple on the page, we do check the index keys, to prevent
+			 * uselessly advancing to the page to the left.  This is similar
+			 * to the high key optimization used by forward scans.
+			 */
+			if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+			{
+				Assert(offnum >= P_FIRSTDATAKEY(opaque));
+				if (offnum > P_FIRSTDATAKEY(opaque))
+				{
+					offnum = OffsetNumberPrev(offnum);
+					continue;
+				}
+
+				tuple_alive = false;
+			}
+			else
+				tuple_alive = true;
+
+			itup = (IndexTuple) PageGetItem(page, iid);
+
+			passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
+										 &continuescan);
+			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
 				itemIndex--;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 8fe54f6434..6eefda0451 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -48,7 +48,7 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
 static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
-					 IndexTuple tuple, TupleDesc tupdesc,
+					 IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
 static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
 			   IndexTuple firstright, BTScanInsert itup_key);
@@ -1333,73 +1333,35 @@ _bt_mark_scankey_required(ScanKey skey)
 /*
  * Test whether an indextuple satisfies all the scankey conditions.
  *
- * If so, return the address of the index tuple on the index page.
- * If not, return NULL.
+ * Return true if so, false if not.  If the tuple fails to pass the qual,
+ * we also determine whether there's any need to continue the scan beyond
+ * this tuple, and set *continuescan accordingly.  See comments for
+ * _bt_preprocess_keys(), above, about how this is done.
  *
- * If the tuple fails to pass the qual, we also determine whether there's
- * any need to continue the scan beyond this tuple, and set *continuescan
- * accordingly.  See comments for _bt_preprocess_keys(), above, about how
- * this is done.
+ * Forward scan callers can pass a high key tuple in the hopes of having
+ * us set *continuescan to false, avoiding an unnecessary visit to
+ * the page to the right.
  *
  * scan: index scan descriptor (containing a search-type scankey)
- * page: buffer page containing index tuple
- * offnum: offset number of index tuple (must be a valid item!)
+ * tuple: index tuple to test
+ * tupnatts: number of attributes in tuple (high key may be truncated)
  * dir: direction we are scanning in
  * continuescan: output parameter (will be set correctly in all cases)
- *
- * Caller must hold pin and lock on the index page.
  */
-IndexTuple
-_bt_checkkeys(IndexScanDesc scan,
-			  Page page, OffsetNumber offnum,
+bool
+_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 			  ScanDirection dir, bool *continuescan)
 {
-	ItemId		iid = PageGetItemId(page, offnum);
-	bool		tuple_alive;
-	IndexTuple	tuple;
 	TupleDesc	tupdesc;
 	BTScanOpaque so;
 	int			keysz;
 	int			ikey;
 	ScanKey		key;
 
+	Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
 	*continuescan = true;		/* default assumption */
 
-	/*
-	 * If the scan specifies not to return killed tuples, then we treat a
-	 * killed tuple as not passing the qual.  Most of the time, it's a win to
-	 * not bother examining the tuple's index keys, but just return
-	 * immediately with continuescan = true to proceed to the next tuple.
-	 * However, if this is the last tuple on the page, we should check the
-	 * index keys to prevent uselessly advancing to the next page.
-	 */
-	if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
-	{
-		/* return immediately if there are more tuples on the page */
-		if (ScanDirectionIsForward(dir))
-		{
-			if (offnum < PageGetMaxOffsetNumber(page))
-				return NULL;
-		}
-		else
-		{
-			BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-			if (offnum > P_FIRSTDATAKEY(opaque))
-				return NULL;
-		}
-
-		/*
-		 * OK, we want to check the keys so we can set continuescan correctly,
-		 * but we'll return NULL even if the tuple passes the key tests.
-		 */
-		tuple_alive = false;
-	}
-	else
-		tuple_alive = true;
-
-	tuple = (IndexTuple) PageGetItem(page, iid);
-
 	tupdesc = RelationGetDescr(scan->indexRelation);
 	so = (BTScanOpaque) scan->opaque;
 	keysz = so->numberOfKeys;
@@ -1410,13 +1372,25 @@ _bt_checkkeys(IndexScanDesc scan,
 		bool		isNull;
 		Datum		test;
 
-		Assert(key->sk_attno <= BTreeTupleGetNAtts(tuple, scan->indexRelation));
+		if (key->sk_attno > tupnatts)
+		{
+			/*
+			 * This attribute is truncated (must be high key).  The value for
+			 * this attribute in the first non-pivot tuple on the page to the
+			 * right could be any possible value.  Assume that truncated
+			 * attribute passes the qual.
+			 */
+			Assert(ScanDirectionIsForward(dir));
+			continue;
+		}
+
 		/* row-comparison keys need special processing */
 		if (key->sk_flags & SK_ROW_HEADER)
 		{
-			if (_bt_check_rowcompare(key, tuple, tupdesc, dir, continuescan))
+			if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
+									 continuescan))
 				continue;
-			return NULL;
+			return false;
 		}
 
 		datum = index_getattr(tuple,
@@ -1454,7 +1428,7 @@ _bt_checkkeys(IndexScanDesc scan,
 			/*
 			 * In any case, this indextuple doesn't match the qual.
 			 */
-			return NULL;
+			return false;
 		}
 
 		if (isNull)
@@ -1495,7 +1469,7 @@ _bt_checkkeys(IndexScanDesc scan,
 			/*
 			 * In any case, this indextuple doesn't match the qual.
 			 */
-			return NULL;
+			return false;
 		}
 
 		test = FunctionCall2Coll(&key->sk_func, key->sk_collation,
@@ -1523,16 +1497,12 @@ _bt_checkkeys(IndexScanDesc scan,
 			/*
 			 * In any case, this indextuple doesn't match the qual.
 			 */
-			return NULL;
+			return false;
 		}
 	}
 
-	/* Check for failure due to it being a killed tuple. */
-	if (!tuple_alive)
-		return NULL;
-
 	/* If we get here, the tuple passes all index quals. */
-	return tuple;
+	return true;
 }
 
 /*
@@ -1545,8 +1515,8 @@ _bt_checkkeys(IndexScanDesc scan,
  * This is a subroutine for _bt_checkkeys, which see for more info.
  */
 static bool
-_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
-					 ScanDirection dir, bool *continuescan)
+_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
+					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
 {
 	ScanKey		subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
 	int32		cmpresult = 0;
@@ -1563,6 +1533,19 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
 
 		Assert(subkey->sk_flags & SK_ROW_MEMBER);
 
+		if (subkey->sk_attno > tupnatts)
+		{
+			/*
+			 * This attribute is truncated (must be high key).  The value for
+			 * this attribute in the first non-pivot tuple on the page to the
+			 * right could be any possible value.  Assume that truncated
+			 * attribute passes the qual.
+			 */
+			Assert(ScanDirectionIsForward(dir));
+			cmpresult = 0;
+			continue;
+		}
+
 		datum = index_getattr(tuple,
 							  subkey->sk_attno,
 							  tupdesc,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index acb01b3f00..9c70fced23 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -774,9 +774,8 @@ extern bool _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir);
 extern void _bt_mark_array_keys(IndexScanDesc scan);
 extern void _bt_restore_array_keys(IndexScanDesc scan);
 extern void _bt_preprocess_keys(IndexScanDesc scan);
-extern IndexTuple _bt_checkkeys(IndexScanDesc scan,
-			  Page page, OffsetNumber offnum,
-			  ScanDirection dir, bool *continuescan);
+extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
+			  int tupnatts, ScanDirection dir, bool *continuescan);
 extern void _bt_killitems(IndexScanDesc scan);
 extern BTCycleId _bt_vacuum_cycleid(Relation rel);
 extern BTCycleId _bt_start_vacuum(Relation rel);
-- 
2.17.1
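
As a rough illustration of what the 0006 optimization is aiming for,
here is the kind of ad hoc test a reviewer might run (table and index
names are hypothetical, and the expected effect is only a tendency,
since the high key check can only help when the qual happens to be
resolvable against the leaf high key):

"""
-- Duplicate-heavy composite index; each value of "a" spans several
-- leaf pages, so a forward scan on a single "a" value ends near a page
-- boundary reasonably often.
CREATE TABLE hikey_demo (a int4, b int4);
INSERT INTO hikey_demo SELECT i / 1000, i % 1000
FROM generate_series(1, 1000000) i;
CREATE INDEX hikey_demo_a_b_idx ON hikey_demo (a, b);
VACUUM ANALYZE hikey_demo;

-- With the patch applied, the scan can sometimes set moreRight = false
-- from the leaf high key alone, which tends to show up as one fewer
-- index buffer accessed relative to an unpatched server.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM hikey_demo WHERE a = 42;
"""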

Attachment: v18-0007-DEBUG-Add-pageinspect-instrumentation.patch (application/octet-stream)
From 4c58762b522ee1e3e6fbd7d6963a259b2cbd8c84 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v18 7/7] DEBUG: Add pageinspect instrumentation.

Have pageinspect display user-visible attribute values.

This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.

The following query, which visualizes the internal pages, can be used
with this hacked pageinspect:

"""

with recursive index_details as (
  select
    'my_test_index'::text idx
),
size_in_pages_index as (
  select
    (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
    index_details
),
page_stats as (
  select
    index_details.*,
    stats.*
  from
    index_details,
    size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
    *
  from
    page_stats
  where
    type != 'l'),
meta_stats as (
  select
    *
  from
    index_details s,
    lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
    *
  from
    internal_page_stats
  order by
    btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
    1,
    blkno,
    btpo
  from
    internal_items
  where
    btpo_prev = 0
    and btpo = (select level from meta_stats)
  union
  select
    case when level = btpo then o.item + 1 else 1 end,
    blkno,
    btpo
  from
    internal_items i,
    ordered_internal_items o
  where
    i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  --idx,
  btpo as level,
  item as l_item,
  blkno,
  --btpo_prev,
  --btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.  Show heap TID for both pivot and non-pivot tuples here.
  case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
                                 from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
 contrib/pageinspect/btreefuncs.c              | 68 +++++++++++++++----
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 ++++++
 3 files changed, 79 insertions(+), 14 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..95c81c0808 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
 
 #include "pageinspect.h"
 
+#include "access/genam.h"
 #include "access/nbtree.h"
 #include "access/relation.h"
 #include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
  */
 struct user_args
 {
+	Relation	rel;
 	Page		page;
 	OffsetNumber offset;
 };
@@ -254,9 +256,9 @@ struct user_args
  * ------------------------------------------------------
  */
 static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
 {
-	char	   *values[6];
+	char	   *values[7];
 	HeapTuple	tuple;
 	ItemId		id;
 	IndexTuple	itup;
@@ -265,6 +267,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	int			dlen;
 	char	   *dump;
 	char	   *ptr;
+	ItemPointer htid;
+	BTPageOpaque opaque;
 
 	id = PageGetItemId(page, offset);
 
@@ -283,16 +287,53 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
 	values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
 
 	ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
-	dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-	dump = palloc0(dlen * 3 + 1);
-	values[j] = dump;
-	for (off = 0; off < dlen; off++)
+	if (rel)
 	{
-		if (off > 0)
-			*dump++ = ' ';
-		sprintf(dump, "%02x", *(ptr + off) & 0xff);
-		dump += 2;
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		Datum		datvalues[INDEX_MAX_KEYS];
+		bool		isnull[INDEX_MAX_KEYS];
+		int			natts;
+		int			indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(itup, rel);
+
+		itupdesc->natts = natts;
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(itup, itupdesc, datvalues, isnull);
+		indnkeyatts = rel->rd_index->indnkeyatts;
+		rel->rd_index->indnkeyatts = natts;
+		values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
 	}
+	else
+	{
+		dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+		dump = palloc0(dlen * 3 + 1);
+		values[j++] = dump;
+		for (off = 0; off < dlen; off++)
+		{
+			if (off > 0)
+				*dump++ = ' ';
+			sprintf(dump, "%02x", *(ptr + off) & 0xff);
+			dump += 2;
+		}
+	}
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (P_ISLEAF(opaque) && offset >= P_FIRSTDATAKEY(opaque))
+		htid = &itup->t_tid;
+	else if (_bt_heapkeyspace(rel))
+		htid = BTreeTupleGetHeapTID(itup);
+	else
+		htid = NULL;
+
+	if (htid)
+		values[j] = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumberNoCheck(htid),
+							 ItemPointerGetOffsetNumberNoCheck(htid));
+	else
+		values[j] = NULL;
 
 	tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
 
@@ -366,11 +407,11 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 		uargs = palloc(sizeof(struct user_args));
 
+		uargs->rel = rel;
 		uargs->page = palloc(BLCKSZ);
 		memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
 
 		UnlockReleaseBuffer(buffer);
-		relation_close(rel, AccessShareLock);
 
 		uargs->offset = FirstOffsetNumber;
 
@@ -397,12 +438,13 @@ bt_page_items(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
 	else
 	{
+		relation_close(uargs->rel, AccessShareLock);
 		pfree(uargs->page);
 		pfree(uargs);
 		SRF_RETURN_DONE(fctx);
@@ -482,7 +524,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
 
 	if (fctx->call_cntr < fctx->max_calls)
 	{
-		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+		result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
 		uargs->offset++;
 		SRF_RETURN_NEXT(fctx, result);
 	}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid       | (0,1)
 itemlen    | 16
 nulls      | f
 vars       | f
-data       | 01 00 00 00 00 00 00 01
+data       | (a)=(72057594037927937)
+htid       | (0,1)
 
 SELECT * FROM bt_page_items('test1_a_idx', 2);
 ERROR:  block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
     OUT last_cleanup_num_tuples real)
 AS 'MODULE_PATHNAME', 'bt_metap'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+    OUT itemoffset smallint,
+    OUT ctid tid,
+    OUT itemlen smallint,
+    OUT nulls bool,
+    OUT vars bool,
+    OUT data text,
+    OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
-- 
2.17.1
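
For anyone that just wants to eyeball the new heap TID output on a leaf
page, a minimal (hypothetical) invocation of the instrumented
bt_page_items() looks like this -- block 1 is just an arbitrary example
block, and the exact output will of course vary:

"""
CREATE EXTENSION IF NOT EXISTS pageinspect;

CREATE TABLE pi_demo (a int8);
INSERT INTO pi_demo SELECT i FROM generate_series(1, 1000) i;
CREATE INDEX pi_demo_a_idx ON pi_demo (a);

-- "data" now shows decoded key values; "htid" shows the heap TID, which
-- is reported for non-pivot tuples, and for pivot tuples that retain an
-- explicit heap TID after suffix truncation.
SELECT itemoffset, ctid, data, htid
FROM bt_page_items('pi_demo_a_idx', 1)
LIMIT 5;
"""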

Attachment: v18-0002-Make-heap-TID-a-tiebreaker-nbtree-index-column.patch (application/octet-stream)
From d357aa091a1c3458000cf9511f7c8f6a1d9dffcf Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH v18 2/7] Make heap TID a tiebreaker nbtree index column.

Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique.  This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.  A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes.  These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.

A later patch will enhance the logic used by nbtree to pick a split
point.  This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at.  Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d TID in a new high key
during leaf page splits.  The user-facing definition of the "1/3 of a
page" restriction is already imprecise, and so does not need to be
revised.  However, there should be a compatibility note in the v12
release notes.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
---
 contrib/amcheck/expected/check_btree.out     |   5 +-
 contrib/amcheck/sql/check_btree.sql          |   5 +-
 contrib/amcheck/verify_nbtree.c              | 341 +++++++++++++--
 contrib/pageinspect/btreefuncs.c             |   2 +-
 contrib/pageinspect/expected/btree.out       |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out |  10 +-
 doc/src/sgml/indices.sgml                    |  24 +-
 src/backend/access/common/indextuple.c       |   6 +-
 src/backend/access/nbtree/README             | 136 +++---
 src/backend/access/nbtree/nbtinsert.c        | 305 +++++++++-----
 src/backend/access/nbtree/nbtpage.c          | 204 ++++++---
 src/backend/access/nbtree/nbtree.c           |   2 +-
 src/backend/access/nbtree/nbtsearch.c        | 106 ++++-
 src/backend/access/nbtree/nbtsort.c          |  91 ++--
 src/backend/access/nbtree/nbtutils.c         | 412 +++++++++++++++++--
 src/backend/access/nbtree/nbtxlog.c          |  47 +--
 src/backend/access/rmgrdesc/nbtdesc.c        |   8 -
 src/backend/utils/sort/tuplesort.c           |  13 +-
 src/include/access/nbtree.h                  | 198 +++++++--
 src/include/access/nbtxlog.h                 |  35 +-
 src/test/regress/expected/btree_index.out    |  34 +-
 src/test/regress/expected/create_index.out   |  13 +-
 src/test/regress/expected/dependency.out     |   4 +-
 src/test/regress/expected/event_trigger.out  |   4 +-
 src/test/regress/expected/foreign_data.out   |   9 +-
 src/test/regress/expected/rowsecurity.out    |   4 +-
 src/test/regress/sql/btree_index.sql         |  37 +-
 src/test/regress/sql/create_index.sql        |  14 +-
 src/test/regress/sql/foreign_data.sql        |   2 +-
 29 files changed, 1557 insertions(+), 516 deletions(-)

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index ef5c9e1a1c..1e6079ddd2 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -130,9 +130,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true);
  bt_index_parent_check 
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index 0ad1631476..3f1e0d17ef 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -82,9 +82,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true);
 
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 5426bfd8d8..4363e6b82e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -46,6 +46,8 @@ PG_MODULE_MAGIC;
  * block per level, which is bound by the range of BlockNumber:
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
 
 /*
  * State associated with verifying a B-Tree index
@@ -67,6 +69,8 @@ typedef struct BtreeCheckState
 	/* B-Tree Index Relation and associated heap relation */
 	Relation	rel;
 	Relation	heaprel;
+	/* rel is heapkeyspace index? */
+	bool		heapkeyspace;
 	/* ShareLock held on heap/index, rather than AccessShareLock? */
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
@@ -123,7 +127,7 @@ static void bt_index_check_internal(Oid indrelid, bool parentcheck,
 						bool heapallindexed);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -138,17 +142,22 @@ static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 						   IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
+static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
 					 BTScanInsert key,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 BTScanInsert key,
-					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   BTScanInsert key,
-							   Page nontarget,
-							   OffsetNumber upperbound);
+static inline bool invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							 BTScanInsert key,
+							 Page nontarget,
+							 OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline BTScanInsert bt_mkscankey_pivotsearch(Relation rel,
+													IndexTuple itup);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+							IndexTuple itup, bool nonpivot);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -205,6 +214,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	Oid			heapid;
 	Relation	indrel;
 	Relation	heaprel;
+	bool		heapkeyspace;
 	LOCKMODE	lockmode;
 
 	if (parentcheck)
@@ -255,7 +265,9 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	btree_index_checkable(indrel);
 
 	/* Check index, possibly against table it is an index on */
-	bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
+	heapkeyspace = _bt_heapkeyspace(indrel);
+	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
+						 heapallindexed);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -325,8 +337,8 @@ btree_index_checkable(Relation rel)
  * parent/child check cannot be affected.)
  */
 static void
-bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
-					 bool heapallindexed)
+bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
+					 bool readonly, bool heapallindexed)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -347,6 +359,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
 	state = palloc0(sizeof(BtreeCheckState));
 	state->rel = rel;
 	state->heaprel = heaprel;
+	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
 
@@ -807,7 +820,8 @@ bt_target_page_check(BtreeCheckState *state)
 	 * doesn't contain a high key, so nothing to check
 	 */
 	if (!P_RIGHTMOST(topaque) &&
-		!_bt_check_natts(state->rel, state->target, P_HIKEY))
+		!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+						 P_HIKEY))
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
@@ -840,6 +854,7 @@ bt_target_page_check(BtreeCheckState *state)
 		IndexTuple	itup;
 		size_t		tupsize;
 		BTScanInsert skey;
+		bool		lowersizelimit;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -866,7 +881,8 @@ bt_target_page_check(BtreeCheckState *state)
 					 errhint("This could be a torn page problem.")));
 
 		/* Check the number of index tuple attributes */
-		if (!_bt_check_natts(state->rel, state->target, offset))
+		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+							 offset))
 		{
 			char	   *itid,
 					   *htid;
@@ -907,7 +923,56 @@ bt_target_page_check(BtreeCheckState *state)
 			continue;
 
 		/* Build insertion scankey for current page offset */
-		skey = _bt_mkscankey(state->rel, itup);
+		skey = bt_mkscankey_pivotsearch(state->rel, itup);
+
+		/*
+		 * Make sure tuple size does not exceed the relevant BTREE_VERSION
+		 * specific limit.
+		 *
+		 * BTREE_VERSION 4 (which introduced heapkeyspace rules) requisitioned
+		 * a small amount of space from BTMaxItemSize() in order to ensure
+		 * that suffix truncation always has enough space to add an explicit
+		 * heap TID back to a tuple -- we pessimistically assume that every
+		 * newly inserted tuple will eventually need to have a heap TID
+		 * appended during a future leaf page split, when the tuple becomes
+		 * the basis of the new high key (pivot tuple) for the leaf page.
+		 *
+		 * Since the reclaimed space is reserved for that purpose, we must not
+		 * enforce the slightly lower limit when the extra space has been used
+		 * as intended.  In other words, there is only a cross-version
+		 * difference in the limit on tuple size within leaf pages.
+		 *
+		 * Still, we're particular about the details within BTREE_VERSION 4
+		 * internal pages.  Pivot tuples may only use the extra space for its
+		 * designated purpose.  Enforce the lower limit for pivot tuples when
+		 * an explicit heap TID isn't actually present. (In all other cases
+		 * suffix truncation is guaranteed to generate a pivot tuple that's no
+		 * larger than the first right tuple provided to it by its caller.)
+		 */
+		lowersizelimit = skey->heapkeyspace &&
+			(P_ISLEAF(topaque) || BTreeTupleGetHeapTID(itup) == NULL);
+		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
+					   BTMaxItemSizeNoHeapTid(state->target)))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
+							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index row size %zu exceeds maximum for index \"%s\"",
+							tupsize, RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap" : "index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -941,9 +1006,35 @@ bt_target_page_check(BtreeCheckState *state)
 		 * grandparents (as well as great-grandparents, and so on).  We don't
 		 * go to those lengths because that would be prohibitively expensive,
 		 * and probably not markedly more effective in practice.
+		 *
+		 * On the leaf level, we check that the key is <= the highkey.
+		 * However, on non-leaf levels we check that the key is < the highkey,
+		 * because the high key is "just another separator" rather than a copy
+		 * of some existing key item; we expect it to be unique among all keys
+		 * on the same level.  (Suffix truncation will sometimes produce a
+		 * leaf highkey that is an untruncated copy of the lastleft item, but
+		 * never any other item, which necessitates weakening the leaf level
+		 * check to <=.)
+		 *
+		 * Full explanation for why a highkey is never truly a copy of another
+		 * item from the same level on internal levels:
+		 *
+		 * While the new left page's high key is copied from the first offset
+		 * on the right page during an internal page split, that's not the
+		 * full story.  In effect, internal pages are split in the middle of
+		 * the firstright tuple, not between the would-be lastleft and
+		 * firstright tuples: the firstright key ends up on the left side as
+		 * left's new highkey, and the firstright downlink ends up on the
+		 * right side as right's new "negative infinity" item.  The negative
+		 * infinity tuple is truncated to zero attributes, so we're only left
+		 * with the downlink.  In other words, the copying is just an
+		 * implementation detail of splitting in the middle of a (pivot)
+		 * tuple. (See also: "Notes About Data Representation" in the nbtree
+		 * README.)
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
+			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
 			char	   *itid,
 					   *htid;
@@ -969,11 +1060,10 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1036,7 +1126,7 @@ bt_target_page_check(BtreeCheckState *state)
 			rightkey = bt_right_page_check_scankey(state);
 
 			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+				!invariant_g_offset(state, rightkey, max))
 			{
 				/*
 				 * As explained at length in bt_right_page_check_scankey(),
@@ -1214,9 +1304,9 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * continued existence of target block as non-ignorable (not half-dead or
 	 * deleted) implies that target page was not merged into from the right by
 	 * deletion; the key space at or after target never moved left.  Target's
-	 * parent either has the same downlink to target as before, or a <=
+	 * parent either has the same downlink to target as before, or a <
 	 * downlink due to deletion at the left of target.  Target either has the
-	 * same highkey as before, or a highkey <= before when there is a page
+	 * same highkey as before, or a highkey < before when there is a page
 	 * split. (The rightmost concurrently-split-from-target-page page will
 	 * still have the same highkey as target was originally found to have,
 	 * which for our purposes is equivalent to target's highkey itself never
@@ -1305,7 +1395,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * memory remaining allocated.
 	 */
 	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
-	return _bt_mkscankey(state->rel, firstitup);
+	return bt_mkscankey_pivotsearch(state->rel, firstitup);
 }
 
 /*
@@ -1368,7 +1458,8 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the
+	 * page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1417,14 +1508,29 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 	{
 		/*
 		 * Skip comparison of target page key against "negative infinity"
-		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * item, if any.  Checking it would indicate that it's not a strict
+		 * lower bound, but that's only because of the hard-coding for
+		 * negative infinity items within _bt_compare().
+		 *
+		 * If nbtree didn't truncate negative infinity tuples during internal
+		 * page splits then we'd expect child's negative infinity key to be
+		 * equal to the scankey/downlink from target/parent (it would be a
+		 * "low key" in this hypothetical scenario, and so it would still need
+		 * to be treated as a special case here).
+		 *
+		 * Negative infinity items can be thought of as a strict lower bound
+		 * that works transitively, with the last non-negative-infinity pivot
+		 * followed during a descent from the root as its "true" strict lower
+		 * bound.  Only a small number of negative infinity items are truly
+		 * negative infinity; those that are the first items of leftmost
+		 * internal pages.  In more general terms, a negative infinity item is
+		 * only negative infinity with respect to the subtree that the page is
+		 * at the root of.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
+		if (!invariant_l_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1856,6 +1962,64 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	Assert(key->pivotsearch);
+
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return invariant_leq_offset(state, key, upperbound);
+
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+
+	/*
+	 * _bt_compare() is capable of determining that a scankey with a
+	 * filled-out attribute is greater than pivot tuples where the comparison
+	 * is resolved at a truncated attribute (value of attribute in pivot is
+	 * minus infinity).  However, it is not capable of determining that a
+	 * scankey is _less than_ a tuple on the basis of a comparison resolved at
+	 * _scankey_ minus infinity attribute.  Complete an extra step to simulate
+	 * having minus infinity values for omitted scankey attribute(s).
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		nonpivot = P_ISLEAF(topaque) && upperbound >= P_FIRSTDATAKEY(topaque);
+
+		/* Get number of keys + heap TID for item to the right */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup, nonpivot);
+
+		/* Heap TID is tiebreaker key attribute */
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && rheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1869,48 +2033,97 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 {
 	int32		cmp;
 
+	Assert(key->pivotsearch);
+
 	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound)
 {
 	int32		cmp;
 
+	Assert(key->pivotsearch);
+
 	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
-	return cmp >= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp >= 0;
+
+	/*
+	 * No need to consider the possibility that scankey has attributes that we
+	 * need to force to be interpreted as negative infinity.  _bt_compare() is
+	 * able to determine that scankey is greater than negative infinity.  The
+	 * distinction between "==" and "<" isn't interesting here, since
+	 * corruption is indicated either way.
+	 */
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e. the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
-							   Page nontarget, OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							 Page nontarget, OffsetNumber upperbound)
 {
 	int32		cmp;
 
+	Assert(key->pivotsearch);
+
 	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
-	return cmp <= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp <= 0;
+
+	/* See invariant_l_offset() for an explanation of this extra step */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+		BTPageOpaque copaque;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		nonpivot = P_ISLEAF(copaque) && upperbound >= P_FIRSTDATAKEY(copaque);
+
+		/* Get number of keys + heap TID for child/non-target item */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+		childheaptid = BTreeTupleGetHeapTIDCareful(state, child, nonpivot);
+
+		/* Heap TID is tiebreaker key attribute */
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && childheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -2066,3 +2279,53 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * _bt_mkscankey() wrapper that automatically prevents insertion scankey from
+ * being considered greater than the pivot tuple that its values originated
+ * from (or some other identical pivot tuple) in the common case where there
+ * are truncated/minus infinity attributes.  Without this extra step, there
+ * are forms of corruption that amcheck could theoretically fail to report.
+ *
+ * For example, invariant_g_offset() might miss a cross-page invariant failure
+ * on an internal level if the scankey built from the first item on the
+ * target's right sibling page happened to be equal to (not greater than) the
+ * last item on target page.  The !pivotsearch tiebreaker in _bt_compare()
+ * might otherwise cause amcheck to assume (rather than actually verify) that
+ * the scankey is greater.
+ */
+static inline BTScanInsert
+bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
+{
+	BTScanInsert skey;
+
+	skey = _bt_mkscankey(rel, itup);
+	skey->pivotsearch = true;
+
+	return skey;
+}
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
+ * be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	BlockNumber targetblock = state->targetblock;
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index bfa0c04c2f..8d27c9b0f6 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -561,7 +561,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	 * Get values of extended metadata if available, use default values
 	 * otherwise.
 	 */
-	if (metad->btm_version == BTREE_VERSION)
+	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b..07c2dcd771 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d4..9920dbfd40 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 9943e8ecd4..3493f482b8 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -504,8 +504,9 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
 
   <para>
    By default, B-tree indexes store their entries in ascending order
-   with nulls last.  This means that a forward scan of an index on
-   column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
+   with nulls last (table TID is treated as a tiebreaker column among
+   otherwise equal entries).  This means that a forward scan of an
+   index on column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
    (or more verbosely, <literal>ORDER BY x ASC NULLS LAST</literal>).  The
    index can also be scanned backward, producing output satisfying
    <literal>ORDER BY x DESC</literal>
@@ -1162,10 +1163,21 @@ CREATE INDEX tab_x_y ON tab(x, y);
    the extra columns are trailing columns; making them be leading columns is
    unwise for the reasons explained in <xref linkend="indexes-multicolumn"/>.
    However, this method doesn't support the case where you want the index to
-   enforce uniqueness on the key column(s).  Also, explicitly marking
-   non-searchable columns as <literal>INCLUDE</literal> columns makes the
-   index slightly smaller, because such columns need not be stored in upper
-   tree levels.
+   enforce uniqueness on the key column(s).
+  </para>
+
+  <para>
+   <firstterm>Suffix truncation</firstterm> always removes non-key
+   columns from upper B-Tree levels.  As payload columns, they are
+   never used to guide index scans.  The truncation process also
+   removes one or more trailing key column(s) when the remaining
+   prefix of key column(s) happens to be sufficient to describe tuples
+   on the lowest B-Tree level.  In practice, covering indexes without
+   an <literal>INCLUDE</literal> clause often avoid storing columns
+   that are effectively payload in the upper levels.  However,
+   explicitly defining payload columns as non-key columns
+   <emphasis>reliably</emphasis> keeps the tuples in upper levels
+   small.
   </para>
 
   <para>
diff --git a/src/backend/access/common/indextuple.c b/src/backend/access/common/indextuple.c
index 32c0ebb93a..cb23be859d 100644
--- a/src/backend/access/common/indextuple.c
+++ b/src/backend/access/common/indextuple.c
@@ -536,7 +536,11 @@ index_truncate_tuple(TupleDesc sourceDescriptor, IndexTuple source,
 	bool		isnull[INDEX_MAX_KEYS];
 	IndexTuple	truncated;
 
-	Assert(leavenatts < sourceDescriptor->natts);
+	Assert(leavenatts <= sourceDescriptor->natts);
+
+	/* Easy case: no truncation actually required */
+	if (leavenatts == sourceDescriptor->natts)
+		return CopyIndexTuple(source);
 
 	/* Create temporary descriptor to scribble on */
 	truncdesc = palloc(TupleDescSize(sourceDescriptor));
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index a295a7a286..40ff25fe06 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -28,37 +28,50 @@ right-link to find the new page containing the key range you're looking
 for.  This might need to be repeated, if the page has been split more than
 once.
 
+Lehman and Yao talk about pairs of "separator" keys and downlinks in
+internal pages rather than tuples or records.  We use the term "pivot"
+tuple to refer to tuples which don't point to heap tuples, that are used
+only for tree navigation.  All tuples on non-leaf pages and high keys on
+leaf pages are pivot tuples.  Since pivot tuples are only used to represent
+which part of the key space belongs on each page, they can have attribute
+values copied from non-pivot tuples that were deleted and killed by VACUUM
+some time ago.  A pivot tuple may contain a "separator" key and downlink,
+just a separator key (i.e. the downlink value is implicitly undefined), or
+just a downlink (i.e. all attributes are truncated away).  We aren't always
+clear on which case applies, but it should be obvious from context.
+
+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute.  Logical duplicates are sorted in heap TID
+order.  This is necessary because Lehman and Yao also require that the key
+range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are
+the adjacent keys in the parent page (Ki must be _strictly_ less than v,
+which can be assured by having reliably unique keys).
+
+A search where the key is equal to a pivot tuple in an upper tree level
+must descend to the left of that pivot to ensure it finds any equal keys.
+The equal item(s) being searched for must therefore be to the left of that
+downlink page on the next level down.  A handy property of this design is
+that a scan where all attributes/keys are used behaves just the same as a
+scan where only some prefix of attributes are used; equality never needs to
+be treated as a special case.
+
+In practice, exact equality with pivot tuples on internal pages is
+extremely rare when all attributes (including even the heap TID attribute)
+are used in a search.  This is due to suffix truncation: truncated
+attributes are treated as having the value negative infinity, and
+truncation almost always manages to at least truncate away the trailing
+heap TID attribute.  While Lehman and Yao don't have anything to say about
+suffix truncation, the design used by nbtree is perfectly complementary.
+The later section on suffix truncation will be helpful if it's unclear how
+the Lehman & Yao invariants work with a real world example involving
+suffix truncation.
+
 Differences to the Lehman & Yao algorithm
 -----------------------------------------
 
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
-
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
-
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
 among backends.  As a result, we do page-level read locking on btree
@@ -194,9 +207,7 @@ be prepared for the possibility that the item it wants is to the left of
 the recorded position (but it can't have moved left out of the recorded
 page).  Since we hold a lock on the lower page (per L&Y) until we have
 re-found the parent item that links to it, we can be assured that the
-parent item does still exist and can't have been deleted.  Also, because
-we are matching downlink page numbers and not data keys, we don't have any
-problem with possibly misidentifying the parent item.
+parent item does still exist and can't have been deleted.
 
 Page Deletion
 -------------
@@ -615,22 +626,40 @@ scankey is consulted as each index entry is sequentially scanned to decide
 whether to return the entry and whether the scan can stop (see
 _bt_checkkeys()).
 
-We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+Notes about suffix truncation
+-----------------------------
+
+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split.  The remaining attributes must distinguish
+the last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page.  Tuples logically retain truncated key attributes,
+though they implicitly have "negative infinity" as their value, and have no
+storage overhead.  Since the high key is subsequently reused as the
+downlink in the parent page for the new right page, suffix truncation makes
+pivot tuples short.  INCLUDE indexes are guaranteed to have non-key
+attributes truncated at the time of a leaf page split, but may also have
+some key attributes truncated away, based on the usual criteria for key
+attributes.  They are not a special case, since non-key attributes are
+merely payload to B-Tree searches.
+
+The goal of suffix truncation of key attributes is to improve index
+fan-out.  The technique was first described by Bayer and Unterauer (R. Bayer
+and K. Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol.
+2, No. 1, March 1977, pp. 11-26).  The Postgres implementation is loosely
+based on their paper.  Note that Postgres only implements what the paper
+refers to as simple prefix B-Trees.  Note also that the paper assumes that
+the tree has keys that consist of single strings that maintain the "prefix
+property", much like strings that are stored in a suffix tree (comparisons
+of earlier bytes must always be more significant than comparisons of later
+bytes, and, in general, the strings must compare in a way that doesn't
+break transitive consistency as they're split into pieces).  Suffix
+truncation in Postgres currently only works at the whole-attribute
+granularity, but it would be straightforward to invent opclass
+infrastructure that manufactures a smaller attribute value in the case of
+variable-length types, such as text.  An opclass support function could
+manufacture the shortest possible key value that still correctly separates
+each half of a leaf page split.
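As a rough illustration of the whole-attribute granularity described
above, here is a standalone sketch (an invented helper, not the patch's
truncation code) that keeps the smallest attribute prefix still separating
the last tuple on the left half from the first tuple on the right half,
appending a heap TID only when every key attribute is equal:

    /*
     * Illustrative sketch only -- an invented helper, not the patch's
     * truncation code.  Given the key attributes of the would-be last
     * tuple on the left half and first tuple on the right half, return
     * how many leading attributes the new high key must keep so that it
     * still separates the two halves.  A heap TID needs to be appended
     * only when all key attributes are equal.
     */
    #include <stdbool.h>

    static int
    sketch_keep_natts(const int *lastleft, const int *firstright,
                      int nkeyatts, bool *need_heap_tid)
    {
        int keepnatts = 1;

        for (int i = 0; i < nkeyatts; i++)
        {
            if (lastleft[i] != firstright[i])
                break;
            keepnatts++;
        }

        /* keepnatts exceeding nkeyatts: no key attribute separates them */
        *need_heap_tid = (keepnatts > nkeyatts);
        return *need_heap_tid ? nkeyatts : keepnatts;
    }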
 
 Notes About Data Representation
 -------------------------------
@@ -643,20 +672,26 @@ don't need to renumber any existing pages when splitting the root.)
 
 The Postgres disk block data format (an array of items) doesn't fit
 Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
-so we have to play some games.
+so we have to play some games.  (Presumably things are explained this
+way because of internal page splits, which conceptually split at the
+middle of an existing pivot tuple -- the tuple's "separator" key goes on
+the left side of the split as the left side's new high key, while the
+tuple's pointer/downlink goes on the right side as the first/minus
+infinity downlink.)
 
 On a page that is not rightmost in its tree level, the "high key" is
 kept in the page's first item, and real data items start at item 2.
 The link portion of the "high key" item goes unused.  A page that is
-rightmost has no "high key", so data items start with the first item.
-Putting the high key at the left, rather than the right, may seem odd,
-but it avoids moving the high key as we add data items.
+rightmost has no "high key" (it's implicitly positive infinity), so
+data items start with the first item.  Putting the high key at the
+left, rather than the right, may seem odd, but it avoids moving the
+high key as we add data items.
 
 On a leaf page, the data items are simply links to (TIDs of) tuples
 in the relation being indexed, with the associated key values.
 
 On a non-leaf page, the data items are down-links to child pages with
-bounding keys.  The key in each data item is the *lower* bound for
+bounding keys.  The key in each data item is a strict lower bound for
 keys on that child page, so logically the key is to the left of that
 downlink.  The high key (if present) is the upper bound for the last
 downlink.  The first data item on each such page has no lower bound
@@ -664,4 +699,5 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
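A tiny standalone sketch (invented names, not the actual nbtree macros)
of the layout rules just described -- where data items start, and which
item on a non-leaf page is the minus infinity downlink:

    /*
     * Illustrative sketch only -- invented helpers, not the nbtree macros.
     * On a non-rightmost page the high key occupies item 1 and real data
     * items start at item 2; on a non-leaf page the first data item is
     * the "minus infinity" downlink, whose stored key (if any) is ignored
     * and which compares below every possible scan key.
     */
    #include <stdbool.h>

    static int
    sketch_first_data_item(bool rightmost)
    {
        return rightmost ? 1 : 2;   /* skip the high key item when present */
    }

    static bool
    sketch_is_minus_infinity_downlink(bool leaf, bool rightmost, int itemno)
    {
        return !leaf && itemno == sketch_first_data_item(rightmost);
    }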
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index ed6b2692cb..d28d18496f 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -62,14 +62,16 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
 				  Relation heapRel);
 static bool _bt_useduplicatepage(Relation rel, Relation heapRel,
 					 BTInsertState insertstate);
-static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
+static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
+			   Buffer buf,
+			   Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
 			   bool split_only_page);
-static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
-		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
-		  IndexTuple newitem, bool newitemonleft);
+static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
+		  Buffer cbuf, OffsetNumber firstright, OffsetNumber newitemoff,
+		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
@@ -117,6 +119,9 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_key = _bt_mkscankey(rel, itup);
+	/* No scantid until uniqueness established in checkingunique case */
+	if (checkingunique && itup_key->heapkeyspace)
+		itup_key->scantid = NULL;
 
 	/*
 	 * Fill in the BTInsertState working area, to track the current page and
@@ -232,12 +237,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tiebreaker attribute.  Any other would-be inserter of
+	 * the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -275,6 +281,10 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/* Uniqueness is established -- restore heap tid as scantid */
+		if (itup_key->heapkeyspace)
+			itup_key->scantid = &itup->t_tid;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
@@ -283,12 +293,12 @@ top:
 
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
-		 * an index tuple insert conflicts with an existing lock.  Since the
-		 * actual location of the insert is hard to predict because of the
-		 * random search used to prevent O(N^2) performance when there are
-		 * many duplicate entries, we can just use the "first valid" page.
-		 * This reasoning also applies to INCLUDE indexes, whose extra
-		 * attributes are not considered part of the key space.
+		 * an index tuple insert conflicts with an existing lock.  The actual
+		 * location of the insert is unsettled in the checkingunique case
+		 * because scantid was not filled in initially, but it's okay to use
+		 * the "first valid" page instead.  This reasoning also applies to
+		 * INCLUDE indexes, whose extra attributes are not considered part of
+		 * the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, insertstate.buf);
 
@@ -299,8 +309,8 @@ top:
 		 */
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
-		_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
-					   newitemoff, false);
+		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
+					   itup, newitemoff, false);
 	}
 	else
 	{
@@ -372,6 +382,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
 	Assert(!insertstate->bounds_valid || insertstate->low == offset);
+	Assert(itup_key->scantid == NULL);
 	for (;;)
 	{
 		ItemId		curitemid;
@@ -644,20 +655,26 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
  *	_bt_findinsertloc() -- Finds an insert location for a tuple
  *
  *		On entry, insertstate buffer contains the page that the new tuple
- *		belongs on, which is exclusive-locked and pinned by caller.  This
- *		won't be exactly the right page for some callers to insert on to.
- *		They'll have to insert into a page somewhere to the right.
+ *		belongs on, which is exclusive-locked and pinned by caller.
+ *		Occasionally, this won't be exactly right for callers that just called
+ *		_bt_check_unique(), and did initial search without using a scantid.
+ *		They'll have to insert into a page somewhere to the right in rare
+ *		cases where there are many physical duplicates in a unique index, and
+ *		their scantid directs us to some page full of duplicates to the right,
+ *		where the new tuple must go.  (Actually, since !heapkeyspace
+ *		pg_upgrade'd non-unique indexes never get a scantid, they too may
+ *		require that we move right.  We treat them somewhat like unique
+ *		indexes.)
  *
  *		On exit, insertstate buffer contains the chosen insertion page, and
  *		the offset within that page is returned.  The lock and pin on the
- *		original page are released in cases where initial page is not where
- *		tuple belongs.  New buffer/page passed back to the caller is
+ *		original page are released in rare cases where initial page is not
+ *		where tuple belongs.  New page passed back to the caller is
  *		exclusively locked and pinned, just like initial page was.
  *
  *		_bt_check_unique() saves the progress of the binary search it
  *		performs, also in the insertion state.  We don't need to do any
- *		additional binary search comparisons here most of the time, provided
- *		caller is to insert on original page.
+ *		additional binary search comparisons here much of the time.
  *
  *		This is also where opportunistic microvacuuming of LP_DEAD tuples
  *		occurs.  It convenient to make it happen here, since microvacuuming
@@ -678,29 +695,27 @@ _bt_findinsertloc(Relation rel,
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itemsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
-	 */
-	if (insertstate->itemsz > BTMaxItemSize(page))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						insertstate->itemsz, BTMaxItemSize(page),
-						RelationGetRelationName(rel)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(heapRel,
-									RelationGetRelationName(rel))));
+	/* Check 1/3 of a page restriction */
+	if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
+		_bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+							 insertstate->itup);
 
+	/*
+	 * We may have to walk right through leaf pages to find the one leaf page
+	 * that we must insert on to, though only when inserting into unique
+	 * indexes.  This is necessary because a scantid is not used by the
+	 * insertion scan key initially in the case of unique indexes -- a scantid
+	 * is only set after the absence of duplicates (whose heap tuples are not
+	 * dead or recently dead) has been established by _bt_check_unique().
+	 * Non-unique index insertions will break out of the loop immediately.
+	 *
+	 * (Actually, non-unique indexes may still need to grovel through leaf
+	 * pages full of duplicates with a pg_upgrade'd !heapkeyspace index.)
+	 */
 	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
 	Assert(!insertstate->bounds_valid || checkingunique);
+	Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+	Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
 	for (;;)
 	{
 		int			cmpval;
@@ -708,6 +723,13 @@ _bt_findinsertloc(Relation rel,
 		BlockNumber rblkno;
 
 		/*
+		 * Fastpaths that avoid extra high key check.
+		 *
+		 * No need to check high key when inserting into a non-unique index;
+		 * _bt_search() already checked this when it checked if a move to the
+		 * right was required for the leaf page.  Insertion scankey's scantid
+		 * would have been filled out at the time.
+		 *
 		 * An earlier _bt_check_unique() call may well have established bounds
 		 * that we can use to skip the high key check for checkingunique
 		 * callers.  This fastpath cannot be used when there are no items on
@@ -715,23 +737,35 @@ _bt_findinsertloc(Relation rel,
 		 * new item belongs last on the page, but it might go on a later page
 		 * instead.
 		 */
-		if (insertstate->bounds_valid &&
-			insertstate->low <= insertstate->stricthigh &&
-			insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+		if (!checkingunique && itup_key->heapkeyspace)
+			break;
+		else if (insertstate->bounds_valid &&
+				 insertstate->low <= insertstate->stricthigh &&
+				 insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
 			break;
 
 		/* If this is the page that the tuple must go on, stop */
 		if (P_RIGHTMOST(lpageop))
 			break;
 		cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
-
-		/*
-		 * May have to handle case where there is a choice of which page to
-		 * place new tuple on, and we must balance space utilization as best
-		 * we can.  Note that this may invalidate cached bounds for us.
-		 */
-		if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, insertstate))
-			break;
+		if (itup_key->heapkeyspace)
+		{
+			if (cmpval <= 0)
+				break;
+		}
+		else
+		{
+			/*
+			 * pg_upgrade'd !heapkeyspace index.
+			 *
+			 * May have to handle legacy case where there is a choice of which
+			 * page to place new tuple on, and we must balance space
+			 * utilization as best we can.  Note that this may invalidate
+			 * cached bounds for us.
+			 */
+			if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, insertstate))
+				break;
+		}
 
 		/*
 		 * step right to next non-dead page
@@ -740,6 +774,8 @@ _bt_findinsertloc(Relation rel,
 		 * page; else someone else's _bt_check_unique scan could fail to see
 		 * our insertion.  write locks on intermediate dead pages won't do
 		 * because we don't know when they will get de-linked from the tree.
+		 * (This is more aggressive than it needs to be for non-unique
+		 * !heapkeyspace indexes.)
 		 */
 		rbuf = InvalidBuffer;
 
@@ -802,9 +838,16 @@ _bt_findinsertloc(Relation rel,
 /*
  *	_bt_useduplicatepage() -- Settle for this page of duplicates?
  *
- *		This function handles the question of whether or not an insertion of
- *		a duplicate into an index should insert on the page contained in buf
- *		when a choice must be made.
+ *		Prior to PostgreSQL 12/Btree version 4, heap TID was never treated
+ *		as a part of the keyspace.  If there were many tuples of the same
+ *		value spanning more than one leaf page, a new tuple of that same
+ *		value could legally be placed on any one of the pages.
+ *
+ *		This function decides whether a duplicate being inserted into a
+ *		pg_upgrade'd !heapkeyspace index should go on the page contained in
+ *		buf when a choice must be made.  It is only
+ *		used with pg_upgrade'd version 2 and version 3 indexes (!heapkeyspace
+ *		indexes).
  *
  *		If the current page doesn't have enough free space for the new tuple
  *		we "microvacuum" the page, removing LP_DEAD items, in the hope that it
@@ -822,6 +865,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, BTInsertState insertstate)
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 	Assert(P_ISLEAF(lpageop) && !P_RIGHTMOST(lpageop));
+	Assert(!insertstate->itup_key->heapkeyspace);
 
 	/* Easy case -- there is space free on this page already */
 	if (PageGetFreeSpace(page) >= insertstate->itemsz)
@@ -870,8 +914,9 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, BTInsertState insertstate)
  *
  *		This recursive procedure does the following things:
  *
- *			+  if necessary, splits the target page (making sure that the
- *			   split is equitable as far as post-insert free space goes).
+ *			+  if necessary, splits the target page, using 'itup_key' for
+ *			   suffix truncation on leaf pages (caller passes NULL for
+ *			   non-leaf pages).
  *			+  inserts the tuple.
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
@@ -897,6 +942,7 @@ _bt_useduplicatepage(Relation rel, Relation heapRel, BTInsertState insertstate)
  */
 static void
 _bt_insertonpg(Relation rel,
+			   BTScanInsert itup_key,
 			   Buffer buf,
 			   Buffer cbuf,
 			   BTStack stack,
@@ -919,7 +965,7 @@ _bt_insertonpg(Relation rel,
 		   BTreeTupleGetNAtts(itup, rel) ==
 		   IndexRelationGetNumberOfAttributes(rel));
 	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
+		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/* The caller should've finished any incomplete splits already. */
@@ -969,8 +1015,8 @@ _bt_insertonpg(Relation rel,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, buf, cbuf, firstright,
-						 newitemoff, itemsz, itup, newitemonleft);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, firstright, newitemoff,
+						 itemsz, itup, newitemonleft);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1054,7 +1100,7 @@ _bt_insertonpg(Relation rel,
 		if (BufferIsValid(metabuf))
 		{
 			/* upgrade meta-page if needed */
-			if (metad->btm_version < BTREE_VERSION)
+			if (metad->btm_version < BTREE_NOVAC_VERSION)
 				_bt_upgrademetapage(metapg);
 			metad->btm_fastroot = itup_blkno;
 			metad->btm_fastlevel = lpageop->btpo.level;
@@ -1109,6 +1155,8 @@ _bt_insertonpg(Relation rel,
 
 			if (BufferIsValid(metabuf))
 			{
+				Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+				xlmeta.version = metad->btm_version;
 				xlmeta.root = metad->btm_root;
 				xlmeta.level = metad->btm_level;
 				xlmeta.fastroot = metad->btm_fastroot;
@@ -1176,17 +1224,19 @@ _bt_insertonpg(Relation rel,
  *		new right page.  newitemoff etc. tell us about the new item that
  *		must be inserted along with the data from the old page.
  *
- *		When splitting a non-leaf page, 'cbuf' is the left-sibling of the
- *		page we're inserting the downlink for.  This function will clear the
- *		INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *		itup_key is used for suffix truncation on leaf pages (internal
+ *		page callers pass NULL).  When splitting a non-leaf page, 'cbuf'
+ *		is the left-sibling of the page we're inserting the downlink for.
+ *		This function will clear the INCOMPLETE_SPLIT flag on it, and
+ *		release the buffer.
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
-_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
-		  bool newitemonleft)
+_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
+		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
+		  IndexTuple newitem, bool newitemonleft)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1280,7 +1330,8 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1294,8 +1345,29 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 
 	/*
 	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
-	 * item at position firstright, or the incoming tuple.
+	 * to go into the new right page, or possibly a truncated version if this
+	 * is a leaf page split.  This might be either the existing data item at
+	 * position firstright, or the incoming tuple.
+	 *
+	 * The high key for the left page is formed using the first item on the
+	 * right page, which may seem to be contrary to Lehman & Yao's approach of
+	 * using the left page's last item as its new high key when splitting on
+	 * the leaf level.  It isn't, though: suffix truncation will leave the
+	 * left page's high key fully equal to the last item on the left page when
+	 * two tuples with equal key values (excluding heap TID) enclose the split
+	 * point.  It isn't actually necessary for a new leaf high key to be equal
+	 * to the last item on the left for the L&Y "subtree" invariant to hold.
+	 * It's sufficient to make sure that the new leaf high key is strictly
+	 * less than the first item on the right leaf page, and greater than or
+	 * equal to (not necessarily equal to) the last item on the left leaf
+	 * page.
+	 *
+	 * In other words, when suffix truncation isn't possible, L&Y's exact
+	 * approach to leaf splits is taken.  (Actually, even that is slightly
+	 * inaccurate.  A tuple with all the keys from firstright but the heap TID
+	 * from lastleft will be used as the new high key, since the last left
+	 * tuple could be physically larger despite being opclass-equal with
+	 * respect to all attributes prior to the heap TID attribute.)
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1313,25 +1385,48 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
+	 * Truncate unneeded key and non-key attributes of the high key item
+	 * before inserting it on the left page.  This can only happen at the leaf
 	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * level high keys.  A pivot tuple in a grandparent page must guide a
+	 * search not only to the correct parent page, but also to the correct
+	 * leaf page.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (isleaf && (itup_key->heapkeyspace || indnatts != indnkeyatts))
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		IndexTuple	lastleft;
+
+		/*
+		 * Determine which tuple will become the last on the left page.  This
+		 * is needed to decide how many attributes from the first item on the
+		 * right page must remain in the new high key for the left page.
+		 */
+		if (newitemonleft && newitemoff == firstright)
+		{
+			/* incoming tuple will become last on left page */
+			lastleft = newitem;
+		}
+		else
+		{
+			OffsetNumber lastleftoff;
+
+			/* item just before firstright will become last on left page */
+			lastleftoff = OffsetNumberPrev(firstright);
+			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		Assert(lastleft != item);
+		lefthikey = _bt_truncate(rel, lastleft, item, itup_key);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1524,7 +1619,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1553,22 +1647,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log the left page's new high key */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1584,9 +1666,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -1949,7 +2029,7 @@ _bt_insert_parent(Relation rel,
 			_bt_relbuf(rel, pbuf);
 		}
 
-		/* get high key from left page == lower bound for new right page */
+		/* get high key from left, a strict lower bound for new right page */
 		ritem = (IndexTuple) PageGetItem(page,
 										 PageGetItemId(page, P_HIKEY));
 
@@ -1979,7 +2059,7 @@ _bt_insert_parent(Relation rel,
 				 RelationGetRelationName(rel), bknum, rbknum);
 
 		/* Recursively update the parent */
-		_bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
+		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
 					   new_item, stack->bts_offset + 1,
 					   is_only);
 
@@ -2240,7 +2320,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	START_CRIT_SECTION();
 
 	/* upgrade metapage if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* set btree special data */
@@ -2275,7 +2355,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2308,6 +2389,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
 		XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = rootblknum;
 		md.level = metad->btm_level;
 		md.fastroot = rootblknum;
@@ -2372,6 +2455,7 @@ _bt_pgaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -2387,8 +2471,8 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
@@ -2401,6 +2485,7 @@ _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
 	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
 	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(itup_key->scantid == NULL);
 
 	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 56041c3d38..e046a0570b 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -34,6 +34,7 @@
 #include "utils/snapmgr.h"
 
 static void _bt_cachemetadata(Relation rel, BTMetaPageData *metad);
+static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 						 bool *rightsib_empty);
@@ -77,7 +78,9 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 }
 
 /*
- *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to the new.
+ *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to version
+ *		3, the last version that can be updated without broadly affecting
+ *		on-disk compatibility.  (A REINDEX is required to upgrade to v4.)
  *
  *		This routine does purely in-memory image upgrade.  Caller is
  *		responsible for locking, WAL-logging etc.
@@ -93,11 +96,11 @@ _bt_upgrademetapage(Page page)
 
 	/* It must be really a meta page of upgradable version */
 	Assert(metaopaque->btpo_flags & BTP_META);
-	Assert(metad->btm_version < BTREE_VERSION);
+	Assert(metad->btm_version < BTREE_NOVAC_VERSION);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 
 	/* Set version number and fill extra fields added into version 3 */
-	metad->btm_version = BTREE_VERSION;
+	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 
@@ -107,43 +110,79 @@ _bt_upgrademetapage(Page page)
 }
 
 /*
- * Cache metadata from meta page to rel->rd_amcache.
+ * Cache metadata from input meta page to rel->rd_amcache.
  */
 static void
-_bt_cachemetadata(Relation rel, BTMetaPageData *metad)
+_bt_cachemetadata(Relation rel, BTMetaPageData *input)
 {
+	BTMetaPageData *cached_metad;
+
 	/* We assume rel->rd_amcache was already freed by caller */
 	Assert(rel->rd_amcache == NULL);
 	rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
 										 sizeof(BTMetaPageData));
 
-	/*
-	 * Meta page should be of supported version (should be already checked by
-	 * caller).
-	 */
-	Assert(metad->btm_version >= BTREE_MIN_VERSION &&
-		   metad->btm_version <= BTREE_VERSION);
+	/* Meta page should be of supported version */
+	Assert(input->btm_version >= BTREE_MIN_VERSION &&
+		   input->btm_version <= BTREE_VERSION);
 
-	if (metad->btm_version == BTREE_VERSION)
+	cached_metad = (BTMetaPageData *) rel->rd_amcache;
+	if (input->btm_version >= BTREE_NOVAC_VERSION)
 	{
-		/* Last version of meta-data, no need to upgrade */
-		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		/* Version with compatible meta-data, no need to upgrade */
+		memcpy(cached_metad, input, sizeof(BTMetaPageData));
 	}
 	else
 	{
-		BTMetaPageData *cached_metad = (BTMetaPageData *) rel->rd_amcache;
-
 		/*
 		 * Upgrade meta-data: copy available information from meta-page and
 		 * fill new fields with default values.
+		 *
+		 * Note that we cannot upgrade to version 4+ without a REINDEX, since
+		 * extensive on-disk changes are required.
 		 */
-		memcpy(rel->rd_amcache, metad, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
-		cached_metad->btm_version = BTREE_VERSION;
+		memcpy(cached_metad, input, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
+		cached_metad->btm_version = BTREE_NOVAC_VERSION;
 		cached_metad->btm_oldest_btpo_xact = InvalidTransactionId;
 		cached_metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	}
 }
 
+/*
+ * Get metadata from share-locked buffer containing metapage, while performing
+ * standard sanity checks.  Sanity checks here must match _bt_getroot().
+ */
+static BTMetaPageData *
+_bt_getmeta(Relation rel, Buffer metabuf)
+{
+	Page		metapg;
+	BTPageOpaque metaopaque;
+	BTMetaPageData *metad;
+
+	metapg = BufferGetPage(metabuf);
+	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
+	metad = BTPageGetMeta(metapg);
+
+	/* sanity-check the metapage */
+	if (!P_ISMETA(metaopaque) ||
+		metad->btm_magic != BTREE_MAGIC)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("index \"%s\" is not a btree",
+						RelationGetRelationName(rel))));
+
+	if (metad->btm_version < BTREE_MIN_VERSION ||
+		metad->btm_version > BTREE_VERSION)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("version mismatch in index \"%s\": file version %d, "
+						"current version %d, minimal supported version %d",
+						RelationGetRelationName(rel),
+						metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+
+	return metad;
+}
+
 /*
  *	_bt_update_meta_cleanup_info() -- Update cleanup-related information in
  *									  the metapage.
@@ -167,7 +206,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	metad = BTPageGetMeta(metapg);
 
 	/* outdated version of metapage always needs rewrite */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		needsRewrite = true;
 	else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
 			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
@@ -186,7 +225,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* update cleanup-related information */
@@ -202,6 +241,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = metad->btm_root;
 		md.level = metad->btm_level;
 		md.fastroot = metad->btm_fastroot;
@@ -376,7 +417,7 @@ _bt_getroot(Relation rel, int access)
 		START_CRIT_SECTION();
 
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 
 		metad->btm_root = rootblkno;
@@ -400,6 +441,8 @@ _bt_getroot(Relation rel, int access)
 			XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
 			XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			md.version = metad->btm_version;
 			md.root = rootblkno;
 			md.level = 0;
 			md.fastroot = rootblkno;
@@ -595,37 +638,12 @@ _bt_getrootheight(Relation rel)
 {
 	BTMetaPageData *metad;
 
-	/*
-	 * We can get what we need from the cached metapage data.  If it's not
-	 * cached yet, load it.  Sanity checks here must match _bt_getroot().
-	 */
 	if (rel->rd_amcache == NULL)
 	{
 		Buffer		metabuf;
-		Page		metapg;
-		BTPageOpaque metaopaque;
 
 		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
-		metapg = BufferGetPage(metabuf);
-		metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-		metad = BTPageGetMeta(metapg);
-
-		/* sanity-check the metapage */
-		if (!P_ISMETA(metaopaque) ||
-			metad->btm_magic != BTREE_MAGIC)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("index \"%s\" is not a btree",
-							RelationGetRelationName(rel))));
-
-		if (metad->btm_version < BTREE_MIN_VERSION ||
-			metad->btm_version > BTREE_VERSION)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("version mismatch in index \"%s\": file version %d, "
-							"current version %d, minimal supported version %d",
-							RelationGetRelationName(rel),
-							metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+		metad = _bt_getmeta(rel, metabuf);
 
 		/*
 		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
@@ -642,19 +660,70 @@ _bt_getrootheight(Relation rel)
 		 * Cache the metapage data for next time
 		 */
 		_bt_cachemetadata(rel, metad);
-
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
 		_bt_relbuf(rel, metabuf);
 	}
 
+	/* Get cached metapage data */
 	metad = (BTMetaPageData *) rel->rd_amcache;
-	/* We shouldn't have cached it if any of these fail */
-	Assert(metad->btm_magic == BTREE_MAGIC);
-	Assert(metad->btm_version == BTREE_VERSION);
-	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
+/*
+ *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *
+ *		This is used to determine the rules that must be used to descend a
+ *		btree.  Version 4 indexes treat heap TID as a tiebreaker attribute.
+ *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
+ *		performance when inserting a new BTScanInsert-wise duplicate tuple
+ *		among many leaf pages already full of such duplicates.
+ */
+bool
+_bt_heapkeyspace(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			uint32		btm_version = metad->btm_version;
+
+			_bt_relbuf(rel, metabuf);
+			return btm_version > BTREE_NOVAC_VERSION;
+		}
+
+		/*
+		 * Cache the metapage data for next time
+		 */
+		_bt_cachemetadata(rel, metad);
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached metapage data */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+
+	return metad->btm_version > BTREE_NOVAC_VERSION;
+}
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -1123,11 +1192,12 @@ _bt_is_page_halfdead(Relation rel, BlockNumber blk)
  * right sibling.
  *
  * "child" is the leaf page we wish to delete, and "stack" is a search stack
- * leading to it (approximately).  Note that we will update the stack
- * entry(s) to reflect current downlink positions --- this is essentially the
- * same as the corresponding step of splitting, and is not expected to affect
- * caller.  The caller should initialize *target and *rightsib to the leaf
- * page and its right sibling.
+ * leading to it (in !heapkeyspace indexes it actually leads to the leftmost
+ * leaf page whose high key matches that of the page to be deleted).  Note
+ * that we will update the stack entry(s) to reflect current downlink
+ * positions --- this is essentially the same as the corresponding step of
+ * splitting, and is not expected to affect caller.  The caller should
+ * initialize *target and *rightsib to the leaf page and its right sibling.
  *
  * Note: it's OK to release page locks on any internal pages between the leaf
  * and *topparent, because a safe deletion can't become unsafe due to
@@ -1149,8 +1219,10 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
 	BlockNumber leftsib;
 
 	/*
-	 * Locate the downlink of "child" in the parent (updating the stack entry
-	 * if needed)
+	 * Locate the downlink of "child" in the parent, updating the stack entry
+	 * if needed.  This is how !heapkeyspace indexes deal with having
+	 * non-unique high keys in leaf level pages.  Even heapkeyspace indexes
+	 * can have a stale stack due to insertions into the parent.
 	 */
 	stack->bts_btentry = child;
 	pbuf = _bt_getstackbuf(rel, stack);
@@ -1362,9 +1434,10 @@ _bt_pagedel(Relation rel, Buffer buf)
 		{
 			/*
 			 * We need an approximate pointer to the page's parent page.  We
-			 * use the standard search mechanism to search for the page's high
-			 * key; this will give us a link to either the current parent or
-			 * someplace to its left (if there are multiple equal high keys).
+			 * use a variant of the standard search mechanism to search for
+			 * the page's high key; this will give us a link to either the
+			 * current parent or someplace to its left (if there are multiple
+			 * equal high keys, which is possible with !heapkeyspace indexes).
 			 *
 			 * Also check if this is the right-half of an incomplete split
 			 * (see comment above).
@@ -1422,7 +1495,8 @@ _bt_pagedel(Relation rel, Buffer buf)
 
 				/* we need an insertion scan key for the search, so build one */
 				itup_key = _bt_mkscankey(rel, targetkey);
-				/* get stack to leaf page by searching index */
+				/* find the leftmost leaf page with matching pivot/high key */
+				itup_key->pivotsearch = true;
 				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
 				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
@@ -1969,7 +2043,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	if (BufferIsValid(metabuf))
 	{
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 		metad->btm_fastroot = rightsib;
 		metad->btm_fastlevel = targetlevel;
@@ -2017,6 +2091,8 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 		{
 			XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			xlmeta.version = metad->btm_version;
 			xlmeta.root = metad->btm_root;
 			xlmeta.level = metad->btm_level;
 			xlmeta.fastroot = metad->btm_fastroot;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 60e0b90ccf..ac6f1eb342 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -794,7 +794,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 	{
 		/*
 		 * Do cleanup if metapage needs upgrade, because we don't have
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 0305469ad0..f58774da82 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -152,8 +152,12 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link during the second half of a page split -- if caller ends
+		 * up splitting the child, it usually inserts a new pivot tuple for
+		 * the child's new right sibling immediately after the bts_offset
+		 * position recorded here.  The downlink block will be needed
+		 * to check if bts_offset remains the position of this same pivot
+		 * tuple.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -251,11 +255,13 @@ _bt_moveright(Relation rel,
 	/*
 	 * When nextkey = false (normal case): if the scan key that brought us to
 	 * this page is > the high key stored on the page, then the page has split
-	 * and we need to move right.  (If the scan key is equal to the high key,
-	 * we might or might not need to move right; have to scan the page first
-	 * anyway.)
+	 * and we need to move right.  (pg_upgrade'd !heapkeyspace indexes could
+	 * have some duplicates to the right as well as the left, but that's
+	 * something that's only ever dealt with on the leaf level, after
+	 * _bt_search has found an initial leaf page.)
 	 *
 	 * When nextkey = true: move right if the scan key is >= page's high key.
+	 * (Note that key.scantid cannot be set in this case.)
 	 *
 	 * The page could even have split more than once, so scan as far as
 	 * needed.
@@ -347,6 +353,9 @@ _bt_binsrch(Relation rel,
 	int32		result,
 				cmpval;
 
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -554,10 +563,14 @@ _bt_compare(Relation rel,
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	ItemPointer heapTid;
 	ScanKey		scankey;
+	int			ncmpkey;
+	int			ntupatts;
 
-	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(key->heapkeyspace || key->scantid == NULL);
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
@@ -567,6 +580,7 @@ _bt_compare(Relation rel,
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -580,8 +594,10 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
+	ncmpkey = Min(ntupatts, key->keysz);
+	Assert(key->heapkeyspace || ncmpkey == key->keysz);
 	scankey = key->scankeys;
-	for (int i = 1; i <= key->keysz; i++)
+	for (int i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -632,8 +648,77 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * All non-truncated attributes (other than heap TID) were found to be
+	 * equal.  Treat truncated attributes as minus infinity when scankey has a
+	 * key attribute value that would otherwise be compared directly.
+	 *
+	 * Note: it doesn't matter if ntupatts includes non-key attributes;
+	 * scankey won't, so explicitly excluding non-key attributes isn't
+	 * necessary.
+	 */
+	if (key->keysz > ntupatts)
+		return 1;
+
+	/*
+	 * Use the heap TID attribute and scantid to try to break the tie.  The
+	 * rules are the same as any other key attribute -- only the
+	 * representation differs.
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (key->scantid == NULL)
+	{
+		/*
+		 * Most searches have a scankey that is considered greater than a
+		 * truncated pivot tuple if and when the scankey has equal values for
+		 * attributes up to and including the least significant untruncated
+		 * attribute in the tuple.
+		 *
+		 * For example, if an index has the minimum two attributes (single
+		 * user key attribute, plus heap TID attribute), and a page's high key
+		 * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
+		 * will not descend to the page to the left.  The search will descend
+		 * right instead.  The truncated attribute in the pivot tuple means that
+		 * all non-pivot tuples on the page to the left are strictly < 'foo',
+		 * so it isn't necessary to descend left.  In other words, search
+		 * doesn't have to descend left because it isn't interested in a match
+		 * that has a heap TID value of -inf.
+		 *
+		 * However, some searches (pivotsearch searches) actually require that
+		 * we descend left when this happens.  -inf is treated as a possible
+		 * match for omitted scankey attribute(s).  This is needed by page
+		 * deletion, which must re-find leaf pages that are targets for
+		 * deletion using their high keys.
+		 *
+		 * Note: the heap TID part of the test ensures that scankey is being
+		 * compared to a pivot tuple with one or more truncated key
+		 * attributes.
+		 *
+		 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+		 * left here, since they have no heap TID attribute (and cannot have
+		 * any -inf key values in any case, since truncation can only remove
+		 * non-key attributes).  !heapkeyspace searches must always be
+		 * prepared to deal with matches on both sides of the pivot once the
+		 * leaf level is reached.
+		 */
+		if (key->heapkeyspace && !key->pivotsearch &&
+			key->keysz == ntupatts && heapTid == NULL)
+			return 1;
+
+		/* All provided scankey arguments found to be equal */
+		return 0;
+	}
+
+	/*
+	 * Treat truncated heap TID as minus infinity, since scankey has a key
+	 * attribute value (scantid) that would otherwise be compared directly
+	 */
+	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+	if (heapTid == NULL)
+		return 1;
+
+	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+	return ItemPointerCompare(key->scantid, heapTid);
 }
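The scantid == NULL branch above is subtle, so here is the decision
restated as a standalone sketch (an invented helper, not code from the
patch); the caller is assumed to have already found every supplied
scankey attribute equal to the pivot's untruncated attributes:

    /*
     * Illustrative sketch only -- an invented helper.  An ordinary search
     * that matches every untruncated attribute of a pivot whose heap TID
     * was truncated away moves right, since it cannot be interested in a
     * -inf heap TID match.  A "pivotsearch" caller (page deletion
     * re-finding a leaf page by its high key) instead treats the
     * truncated TID as a possible match and descends to the left.
     */
    #include <stdbool.h>

    static int
    sketch_null_scantid_result(bool heapkeyspace, bool pivotsearch,
                               bool scankey_covers_untruncated,
                               bool heap_tid_truncated)
    {
        if (heapkeyspace && !pivotsearch &&
            scankey_covers_untruncated && heap_tid_truncated)
            return 1;           /* scankey > pivot: move right of it */

        return 0;               /* treat as equal; search descends left */
    }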
 
 /*
@@ -1148,7 +1233,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 
 	/* Initialize remaining insertion scan key fields */
+	inskey.heapkeyspace = _bt_heapkeyspace(rel);
 	inskey.nextkey = nextkey;
+	inskey.pivotsearch = false;
+	inskey.scantid = NULL;
 	inskey.keysz = keysCount;
 
 	/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index a0e2e70cef..2762a2d548 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -755,6 +755,7 @@ _bt_sortaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -808,8 +809,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -826,27 +825,21 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	itupsz = MAXALIGN(itupsz);
 
 	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itupsz doesn't
-	 * include the ItemId.
+	 * Check whether the item can fit on a btree page at all.
 	 *
-	 * NOTE: similar code appears in _bt_insertonpg() to defend against
-	 * oversize items being inserted into an already-existing index. But
-	 * during creation of an index, we don't go through there.
+	 * Every newly built index will treat heap TID as part of the keyspace,
+	 * which imposes the requirement that new high keys must occasionally have
+	 * a heap TID appended within _bt_truncate().  That may leave a new pivot
+	 * tuple one or two MAXALIGN() quantums larger than the original first
+	 * right tuple it's derived from.  v4 deals with the problem by decreasing
+	 * the limit on the size of tuples inserted on the leaf level by the same
+	 * small amount.  Enforce the new v4+ limit on the leaf level, and the old
+	 * limit on internal levels, since pivot tuples may need to make use of
+	 * the reserved space.  This should never fail on internal pages.
 	 */
-	if (itupsz > BTMaxItemSize(npage))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itupsz, BTMaxItemSize(npage),
-						RelationGetRelationName(wstate->index)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(wstate->heap,
-									RelationGetRelationName(wstate->index))));
+	if (unlikely(itupsz > BTMaxItemSize(npage)))
+		_bt_check_third_page(wstate->index, wstate->heap,
+							 state->btps_level == 0, npage, itup);
 
 	/*
 	 * Check to see if page is "full".  It's definitely full if the item won't
@@ -892,24 +885,35 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
+			IndexTuple	lastleft;
 			IndexTuple	truncated;
 			Size		truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks
 			 * in internal pages are either negative infinity items, or get
 			 * their contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it more
+			 * likely that _bt_truncate() can truncate away more attributes,
+			 * whereas the split point passed to _bt_split() is chosen much
+			 * more delicately.  Suffix truncation is mostly useful because it
+			 * improves space utilization for workloads with random
+			 * insertions.  It doesn't seem worthwhile to add logic for
+			 * choosing a split point here for a benefit that is bound to be
+			 * much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
 			 * original high key, and add our own truncated high key at the
-			 * same offset.
+			 * same offset.  It's okay if the truncated tuple is slightly
+			 * larger due to containing a heap TID value, since this case is
+			 * known to _bt_check_third_page(), which reserves space.
 			 *
 			 * Note that the page layout won't be changed very much.  oitup is
 			 * already located at the physical beginning of tuple space, so we
@@ -917,7 +921,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_truncate(wstate->index, lastleft, oitup,
+									 wstate->inskey);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -936,8 +944,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -982,7 +991,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1041,8 +1050,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1135,6 +1145,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1142,7 +1154,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1159,6 +1170,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all
+				 * keys in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
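Since the sort order now includes heap TID, the merge step above (and the
matching tuplesort.c change further down) can rely on no two index entries
ever comparing as fully equal.  A self-contained toy comparator showing the
resulting total order (not patch code -- DemoIndexEntry is a made-up stand-in
for an index tuple with a single integer key column):

#include <stdint.h>

typedef struct DemoIndexEntry
{
	int32_t		key;			/* stand-in for the user-visible key column */
	uint32_t	heap_block;		/* ItemPointer block number */
	uint16_t	heap_offset;	/* ItemPointer offset number */
} DemoIndexEntry;

static int
demo_compare(const DemoIndexEntry *a, const DemoIndexEntry *b)
{
	if (a->key != b->key)
		return (a->key < b->key) ? -1 : 1;
	/* Keys equal: heap TID (block, then offset) breaks the tie */
	if (a->heap_block != b->heap_block)
		return (a->heap_block < b->heap_block) ? -1 : 1;
	if (a->heap_offset != b->heap_offset)
		return (a->heap_offset < b->heap_offset) ? -1 : 1;
	return 0;					/* unreachable for distinct heap tuples */
}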
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 0250e089a6..12fd0a7b0d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,6 +49,8 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+			   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -56,9 +58,26 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		Result is intended for use with _bt_compare().  Callers that don't
- *		need to fill out the insertion scankey arguments (e.g. they use an
- *		ad-hoc comparison routine) can pass a NULL index tuple.
+ *		When itup is a non-pivot tuple, the returned insertion scan key is
+ *		suitable for finding a place for it to go on the leaf level.  Pivot
+ *		tuples can be used to relocate the leaf page with a matching high
+ *		key, but then the caller needs to set the scan key's pivotsearch
+ *		field to true.  This allows the caller to search for a leaf page
+ *		with a matching high key, which is usually to the left of the first
+ *		leaf page a non-pivot match might appear on.
+ *
+ *		Result is intended for use with _bt_compare() and _bt_truncate().
+ *		Callers that don't need to fill out the insertion scankey arguments
+ *		(e.g. they use an ad-hoc comparison routine, or only need a scankey
+ *		for _bt_truncate()) can pass a NULL index tuple.  The scankey will
+ *		be initialized as if an "all truncated" pivot tuple was passed
+ *		instead.
+ *
+ *		Note that we may occasionally have to share lock the metapage to
+ *		determine whether or not the keys in the index are expected to be
+ *		unique (i.e. if this is a "heapkeyspace" index).  We assume a
+ *		heapkeyspace index when caller passes a NULL tuple, allowing index
+ *		build callers to avoid accessing the non-existent metapage.
  */
 BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
@@ -79,13 +98,18 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
-	 * We'll execute search using scan key constructed on key columns. Non-key
-	 * (INCLUDE index) columns are always omitted from scan keys.
+	 * We'll execute search using scan key constructed on key columns.
+	 * Truncated attributes and non-key attributes are omitted from the final
+	 * scan key.
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
+	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
 	key->nextkey = false;
+	key->pivotsearch = false;
 	key->keysz = Min(indnkeyatts, tupnatts);
+	key->scantid = key->heapkeyspace && itup ?
+		BTreeTupleGetHeapTID(itup) : NULL;
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -101,9 +125,9 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
 
 		/*
-		 * Key arguments built when caller provides no tuple are
-		 * defensively represented as NULL values.  They should never be
-		 * used.
+		 * Key arguments built from truncated attributes (or when caller
+		 * provides no tuple) are defensively represented as NULL values. They
+		 * should never be used.
 		 */
 		if (i < tupnatts)
 			arg = index_getattr(itup, i + 1, itupdesc, &null);
@@ -2041,38 +2065,234 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that takes up more
+ * space than firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
  * Note that returned tuple's t_tid offset will hold the number of attributes
  * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * should only change truncated tuple's downlink.  Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID is appended) the size of the returned
+ * tuple is the size of the first right tuple plus an additional MAXALIGN()'d
+ * item pointer.  This guarantee is important, since callers need to stay
+ * under the 1/3 of a page restriction on tuple size.  If this routine is ever
+ * taught to truncate within an attribute/datum, it will need to avoid
+ * returning an enlarged tuple to caller when truncation + TOAST compression
+ * ends up enlarging the final datum.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			 BTScanInsert itup_key)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int16		natts = IndexRelationGetNumberOfAttributes(rel);
+	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+	IndexTuple	pivot;
+	ItemPointer pivotheaptid;
+	Size		newsize;
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples.  It's never okay to
+	 * truncate a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be kept in truncated tuple */
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
 
-	return truncated;
+#ifdef DEBUG_NO_TRUNCATE
+	/* Force truncation to be ineffective for testing purposes */
+	keepnatts = nkeyatts + 1;
+#endif
+
+	if (keepnatts <= natts)
+	{
+		IndexTuple	tidpivot;
+
+		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+
+		/*
+		 * If there is a distinguishing key attribute within new pivot tuple,
+		 * there is no need to add an explicit heap TID attribute
+		 */
+		if (keepnatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			return pivot;
+		}
+
+		/*
+		 * Only truncation of non-key attributes was possible, since key
+		 * attributes are all equal.  It's necessary to add a heap TID
+		 * attribute to the new pivot tuple.
+		 */
+		Assert(natts != nkeyatts);
+		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		/* cannot leak memory here */
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * We have to use heap TID as a unique-ifier in the new pivot tuple, since
+	 * no non-TID key attribute in the right item readily distinguishes the
+	 * right side of the split from the left side.  Use enlarged space that
+	 * holds a copy of first right tuple; place a heap TID value within the
+	 * extra space that remains at the end.
+	 *
+	 * nbtree conceptualizes this case as an inability to truncate away any
+	 * key attribute.  We must use an alternative representation of heap TID
+	 * within pivots because heap TID is only treated as an attribute within
+	 * nbtree (e.g., there is no pg_attribute entry).
+	 */
+	Assert(itup_key->heapkeyspace);
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/*
+	 * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+	 * consider suffix truncation.  It seems like a good idea to follow that
+	 * example in cases where no truncation takes place -- use lastleft's heap
+	 * TID.  (This is also the closest value to negative infinity that's
+	 * legally usable.)
+	 */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  sizeof(ItemPointerData));
+	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+
+	/*
+	 * Lehman and Yao require that the downlink to the right page, which is to
+	 * be inserted into the parent page in the second phase of a page split, be
+	 * a strict lower bound on items on the right page, and a non-strict upper
+	 * bound for items on the left page.  Assert that heap TIDs follow these
+	 * invariants, since a heap TID value is apparently needed as a
+	 * tiebreaker.
+	 */
+#ifndef DEBUG_NO_TRUNCATE
+	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#else
+
+	/*
+	 * Those invariants aren't guaranteed to hold for lastleft + firstright
+	 * heap TID attribute values when they're considered here only because
+	 * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+	 * needed as a tiebreaker).  DEBUG_NO_TRUNCATE must therefore use a heap
+	 * TID value that always works as a strict lower bound for items to the
+	 * right.  In particular, it must avoid using firstright's leading key
+	 * attribute values along with lastleft's heap TID value when lastleft's
+	 * TID happens to be greater than firstright's TID.
+	 */
+	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+
+	/*
+	 * Pivot heap TID should never be fully equal to firstright.  Note that
+	 * the pivot heap TID will still end up equal to lastleft's heap TID when
+	 * that's the only usable value.
+	 */
+	ItemPointerSetOffsetNumber(pivotheaptid,
+							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#endif
+
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point.  Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in the new pivot tuple.
+ */
+static int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			   BTScanInsert itup_key)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keepnatts;
+	ScanKey		scankey;
+
+	/*
+	 * Be consistent about the representation of BTREE_VERSION 2/3 tuples
+	 * across Postgres versions; don't allow new pivot tuples to have
+	 * truncated key attributes there.  _bt_compare() treats truncated key
+	 * attributes as having the value minus infinity, which would break
+	 * searches within !heapkeyspace indexes.
+	 */
+	if (!itup_key->heapkeyspace)
+	{
+		Assert(nkeyatts != IndexRelationGetNumberOfAttributes(rel));
+		return nkeyatts;
+	}
+
+	scankey = itup_key->scankeys;
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+											scankey->sk_collation,
+											datum1,
+											datum2)) != 0)
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
 }
 
 /*
@@ -2086,15 +2306,17 @@ _bt_nonkey_truncate(Relation rel, IndexTuple itup)
  * preferred to calling here.  That's usually more convenient, and is always
  * more explicit.  Call here instead when offnum's tuple may be a negative
  * infinity tuple that uses the pre-v11 on-disk representation, or when a low
- * context check is appropriate.
+ * context check is appropriate.  This routine is as strict as possible about
+ * what is expected on each version of btree.
  */
 bool
-_bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
+_bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 {
 	int16		natts = IndexRelationGetNumberOfAttributes(rel);
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2114,16 +2336,26 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Leaf tuples that are not the page high key (non-pivot tuples)
-			 * should never be truncated
+			 * Non-pivot tuples currently never use alternative heap TID
+			 * representation -- even those within heapkeyspace indexes
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+				return false;
+
+			/*
+			 * Leaf tuples that are not the page high key (non-pivot tuples)
+			 * should never be truncated.  (Note that tupnatts must have been
+			 * inferred, rather than coming from an explicit on-disk
+			 * representation.)
+			 */
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2133,8 +2365,15 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 */
 			Assert(!P_RIGHTMOST(opaque));
 
-			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			/*
+			 * !heapkeyspace high key tuple contains only key attributes. Note
+			 * that tupnatts will only have been explicitly represented in
+			 * !heapkeyspace indexes that happen to have non-key attributes.
+			 */
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2146,7 +2385,11 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * its high key) is its negative infinity tuple.  Negative
 			 * infinity tuples are always truncated to zero attributes.  They
 			 * are a particular kind of pivot tuple.
-			 *
+			 */
+			if (heapkeyspace)
+				return tupnatts == 0;
+
+			/*
 			 * The number of attributes won't be explicitly represented if the
 			 * negative infinity tuple was generated during a page split that
 			 * occurred with a version of Postgres before v11.  There must be
@@ -2157,18 +2400,109 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == 0 ||
+			return tupnatts == 0 ||
 				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
 		{
 			/*
-			 * Tuple contains only key attributes despite on is it page high
-			 * key or not
+			 * !heapkeyspace downlink tuple with separator key contains only
+			 * key attributes.  Note that tupnatts will only have been
+			 * explicitly represented in !heapkeyspace indexes that happen to
+			 * have non-key attributes.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 
 	}
+
+	/* Handle heapkeyspace pivot tuples (excluding minus infinity items) */
+	Assert(heapkeyspace);
+
+	/*
+	 * Explicit representation of the number of attributes is mandatory with
+	 * heapkeyspace index pivot tuples, regardless of whether or not there are
+	 * non-key attributes.
+	 */
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+
+	/*
+	 * Heap TID is a tiebreaker key attribute, so it cannot be untruncated
+	 * when any other key attribute is truncated
+	 */
+	if (BTreeTupleGetHeapTID(itup) != NULL && tupnatts != nkeyatts)
+		return false;
+
+	/*
+	 * Pivot tuple must have at least one untruncated key attribute (minus
+	 * infinity pivot tuples are the only exception).  Pivot tuples can never
+	 * represent that there is a value present for a key attribute that
+	 * exceeds pg_index.indnkeyatts for the index.
+	 */
+	return tupnatts > 0 && tupnatts <= nkeyatts;
+}
+
+/*
+ *	_bt_check_third_page() -- check whether tuple fits on a btree page at all.
+ *
+ * We actually need to be able to fit three items on every page, so restrict
+ * any one item to 1/3 the per-page available space.  Note that itemsz should
+ * not include the ItemId overhead.
+ *
+ * It might be useful to apply TOAST methods rather than throw an error here.
+ * Using out of line storage would break assumptions made by suffix truncation
+ * and by contrib/amcheck, though.
+ */
+void
+_bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
+					 Page page, IndexTuple newtup)
+{
+	Size		itemsz;
+	BTPageOpaque opaque;
+
+	itemsz = MAXALIGN(IndexTupleSize(newtup));
+
+	/* Double check item size against limit */
+	if (itemsz <= BTMaxItemSize(page))
+		return;
+
+	/*
+	 * The tuple is probably too large to fit on the page, but it's possible
+	 * that the index uses version 2 or version 3, or that the page is an
+	 * internal page, in which case a slightly higher limit applies.
+	 */
+	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
+		return;
+
+	/*
+	 * Internal page insertions cannot fail here, because that would mean that
+	 * an earlier leaf level insertion that should have failed didn't
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (!P_ISLEAF(opaque))
+		elog(ERROR, "cannot insert oversized tuple of size %zu on internal page of index \"%s\"",
+			 itemsz, RelationGetRelationName(rel));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+			 errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
+					itemsz,
+					needheaptidspace ? BTREE_VERSION : BTREE_NOVAC_VERSION,
+					needheaptidspace ? BTMaxItemSize(page) :
+					BTMaxItemSizeNoHeapTid(page),
+					RelationGetRelationName(rel)),
+			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+					   ItemPointerGetBlockNumber(&newtup->t_tid),
+					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   RelationGetRelationName(heap)),
+			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+					 "Consider a function index of an MD5 hash of the value, "
+					 "or use full text indexing."),
+			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
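To see what _bt_keep_natts() is doing without the datum/procinfo machinery,
here is a stripped-down sketch with plain integers standing in for the key
columns and their comparators (illustration only, not patch code):

/*
 * Keep firstright attributes up to and including the first one that differs
 * from lastleft; a result of nkeyatts + 1 means no attribute distinguishes
 * the two tuples, so the caller must append a heap TID tiebreaker.
 */
static int
demo_keep_natts(const int *lastleft, const int *firstright, int nkeyatts)
{
	int			keepnatts = 1;

	for (int attnum = 0; attnum < nkeyatts; attnum++)
	{
		if (lastleft[attnum] != firstright[attnum])
			break;
		keepnatts++;
	}

	return keepnatts;
}

/*
 * e.g. with nkeyatts = 2:
 *
 *   lastleft (1, 1) vs firstright (2, 9)  =>  1  (first attribute differs)
 *   lastleft (1, 1) vs firstright (1, 9)  =>  2  (second attribute differs)
 *   lastleft (1, 1) vs firstright (1, 1)  =>  3  (heap TID must be appended)
 */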
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index b0666b42df..7f261db901 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -103,7 +103,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 
 	md = BTPageGetMeta(metapg);
 	md->btm_magic = BTREE_MAGIC;
-	md->btm_version = BTREE_VERSION;
+	md->btm_version = xlrec->version;
 	md->btm_root = xlrec->root;
 	md->btm_level = xlrec->level;
 	md->btm_fastroot = xlrec->fastroot;
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -282,8 +266,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		Page		lpage = (Page) BufferGetPage(lbuf);
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
-		IndexTuple	newitem = NULL;
-		Size		newitemsz = 0;
+		IndexTuple	newitem,
+					left_hikey;
+		Size		newitemsz,
+					left_hikeysz;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 8d5c6ae0ab..fcac0cd8a9 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 16bda5c586..3eebd9ef51 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4057,9 +4057,10 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required for
+	 * btree indexes, since heap TID is treated as an implicit last key
+	 * attribute in order to ensure that all keys in the index are physically
+	 * unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
@@ -4076,6 +4077,9 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
@@ -4128,6 +4132,9 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index feb66f8f1c..928db4c63d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -112,18 +112,45 @@ typedef struct BTMetaPageData
 #define BTPageGetMeta(p) \
 	((BTMetaPageData *) PageGetContents(p))
 
+/*
+ * The current Btree version is 4.  That's what you'll get when you create
+ * a new index.
+ *
+ * Btree version 3 was used in PostgreSQL v11.  It is mostly the same as
+ * version 4, but heap TIDs were not part of the keyspace.  Index tuples
+ * with duplicate keys could be stored in any order.  We continue to
+ * support reading and writing Btree versions 2 and 3, so that they don't
+ * need to be immediately re-indexed at pg_upgrade.  In order to get the
+ * new heapkeyspace semantics, however, a REINDEX is needed.
+ *
+ * Btree version 2 is mostly the same as version 3.  There are two new
+ * fields in the metapage that were introduced in version 3.  A version 2
+ * metapage will be automatically upgraded to version 3 on the first
+ * insert to it.  INCLUDE indexes cannot use version 2.
+ */
 #define BTREE_METAPAGE	0		/* first page is meta */
-#define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
+#define BTREE_MAGIC		0x053162	/* magic number in metapage */
+#define BTREE_VERSION	4		/* current version number */
 #define BTREE_MIN_VERSION	2	/* minimal supported version number */
+#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_truncate() will need to enlarge
+ * a pivot tuple (relative to the leaf tuple it is derived from) to make
+ * space for a tiebreaker heap TID attribute, which we account for here.
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*sizeof(ItemPointerData)) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeNoHeapTid(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
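For what it's worth, the practical effect of the extra 3*sizeof(ItemPointerData)
term is small.  Assuming the standard 8192 byte BLCKSZ, 8 byte MAXALIGN, a 24
byte page header, 4 byte ItemIds, 6 byte ItemPointers and a 16 byte
BTPageOpaqueData (my numbers, not something the patch states):

    BTMaxItemSizeNoHeapTid = MAXALIGN_DOWN((8192 - MAXALIGN(24 + 3*4) - 16) / 3)       = 2712
    BTMaxItemSize          = MAXALIGN_DOWN((8192 - MAXALIGN(24 + 3*4 + 3*6) - 16) / 3) = 2704

That is, the leaf-level limit drops by exactly one MAXALIGN() quantum, which
matches the MAXALIGN()'d item pointer that _bt_truncate() may append to a new
high key.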
@@ -187,38 +214,84 @@ typedef struct BTMetaPageData
 #define P_FIRSTDATAKEY(opaque)	(P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)
 
 /*
+ * Notes on B-Tree tuple format, and key and non-key attributes:
+ *
  * INCLUDE B-Tree indexes have non-key attributes.  These are extra
  * attributes that may be returned by index-only scans, but do not influence
  * the order of items in the index (formally, non-key attributes are not
  * considered to be part of the key space).  Non-key attributes are only
  * present in leaf index tuples whose item pointers actually point to heap
- * tuples.  All other types of index tuples (collectively, "pivot" tuples)
- * only have key attributes, since pivot tuples only ever need to represent
- * how the key space is separated.  In general, any B-Tree index that has
- * more than one level (i.e. any index that does not just consist of a
- * metapage and a single leaf root page) must have some number of pivot
- * tuples, since pivot tuples are used for traversing the tree.
+ * tuples (non-pivot tuples).  _bt_check_natts() enforces the rules
+ * described here.
  *
- * We store the number of attributes present inside pivot tuples by abusing
- * their item pointer offset field, since pivot tuples never need to store a
- * real offset (downlinks only need to store a block number).  The offset
- * field only stores the number of attributes when the INDEX_ALT_TID_MASK
- * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * Non-pivot tuple format:
  *
- * The 12 least significant offset bits are used to represent the number of
- * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
- * for future use (BT_RESERVED_OFFSET_MASK bits). BT_N_KEYS_OFFSET_MASK should
- * be large enough to store any number <= INDEX_MAX_KEYS.
+ *  t_tid | t_info | key values | INCLUDE columns, if any
+ *
+ * t_tid points to the heap TID, which is a tiebreaker key column as of
+ * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
+ * set.
+ *
+ * All other types of index tuples ("pivot" tuples) only have key columns,
+ * since pivot tuples only exist to represent how the key space is
+ * separated.  In general, any B-Tree index that has more than one level
+ * (i.e. any index that does not just consist of a metapage and a single
+ * leaf root page) must have some number of pivot tuples, since pivot
+ * tuples are used for traversing the tree.  Suffix truncation can omit
+ * trailing key columns when a new pivot is formed, which makes minus
+ * infinity their logical value.  Since BTREE_VERSION 4 indexes treat heap
+ * TID as a trailing key column that ensures that all index tuples are
+ * physically unique, it is necessary to represent heap TID as a trailing
+ * key column in pivot tuples, though very often this can be truncated
+ * away, just like any other key column. (Actually, the heap TID is
+ * omitted rather than truncated, since its representation is different to
+ * the non-pivot representation.)
+ *
+ * Pivot tuple format:
+ *
+ *  t_tid | t_info | key values | [heap TID]
+ *
+ * We store the number of columns present inside pivot tuples by abusing
+ * their t_tid offset field, since pivot tuples never need to store a real
+ * offset (downlinks only need to store a block number in t_tid).  The
+ * offset field only stores the number of columns/attributes when the
+ * INDEX_ALT_TID_MASK bit is set, which doesn't count the trailing heap
+ * TID column sometimes stored in pivot tuples -- that's represented by
+ * the presence of BT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in t_info
+ * is always set on BTREE_VERSION 4.  BT_HEAP_TID_ATTR can only be set on
+ * BTREE_VERSION 4.
+ *
+ * In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set in
+ * pivot tuples.  In that case, the number of key columns is implicitly
+ * the same as the number of key columns in the index.  It is not usually
+ * set on version 2 indexes, which predate the introduction of INCLUDE
+ * indexes.  (Only truncated pivot tuples explicitly represent the number
+ * of key columns on versions 2 and 3, whereas all pivot tuples are formed
+ * using truncation on version 4.  A version 2 index will have the flag
+ * set for an internal page's negative infinity item iff an internal page
+ * split occurred after upgrade to Postgres 11+.)
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of columns in INDEX_ALT_TID_MASK tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
+ * number of columns/attributes <= INDEX_MAX_KEYS.
+ *
+ * Note well: The macros that deal with the number of attributes in tuples
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
+ * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
+ * tuple (or must have the same number of attributes as the index has
+ * generally in the case of !heapkeyspace indexes).  They will need to be
+ * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
+ * for something else.
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
+
+/* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +314,16 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tiebreaker heap-TID
+ * attribute, if any.  Note also that the number of key attributes must be
+ * explicitly represented in all heapkeyspace pivot tuples.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +332,34 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples, despite differences in how heap TID is represented.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   sizeof(ItemPointerData)) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
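A quick standalone demonstration of the offset-field encoding implemented by
the macros above (the two mask values are copied from the patch; everything
else is made up for illustration):

#include <assert.h>
#include <stdint.h>

#define DEMO_N_KEYS_OFFSET_MASK	0x0FFF	/* BT_N_KEYS_OFFSET_MASK */
#define DEMO_HEAP_TID_ATTR		0x1000	/* BT_HEAP_TID_ATTR */

int
main(void)
{
	uint16_t	offset;

	/* BTreeTupleSetNAtts-style: record 2 untruncated key attributes */
	offset = 2 & DEMO_N_KEYS_OFFSET_MASK;
	/* BTreeTupleSetAltHeapTID-style: flag a trailing heap TID */
	offset |= DEMO_HEAP_TID_ATTR;

	assert((offset & DEMO_N_KEYS_OFFSET_MASK) == 2);	/* BTreeTupleGetNAtts */
	assert((offset & DEMO_HEAP_TID_ATTR) != 0);			/* heap TID present */

	return 0;
}

Note that, as in _bt_truncate(), the attribute count has to be stored before
the heap TID flag is OR'd in, since BTreeTupleSetNAtts() rewrites the whole
offset field.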
@@ -325,22 +424,46 @@ typedef BTStackData *BTStack;
  * be confused with a search scankey).  It's used to descend a B-Tree using
  * _bt_search.
  *
+ * heapkeyspace indicates whether we expect all keys in the index to be
+ * physically unique because heap TID is used as a tiebreaker attribute,
+ * and whether the index may have truncated key attributes in pivot
+ * tuples.  This is
+ * always the case with indexes whose version is >= version 4 (and never the
+ * case with earlier versions).  This state will never vary for the same
+ * index (unless there is a REINDEX of a pg_upgrade'd index), but it's
+ * convenient to keep it close by.
+ *
  * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
  * locate the first item >= scankey.  When nextkey is true, they will locate
  * the first item > scan key.
  *
- * keysz is the number of insertion scankeys present.
+ * pivotsearch is set to true by callers that want to relocate a leaf page
+ * using a scankey built from a leaf page's high key.  Most callers set
+ * this to false.
  *
- * scankeys is an array of scan key entries for attributes that are compared.
- * During insertion, there must be a scan key for every attribute, but when
- * starting a regular index scan some can be omitted.  The array is used as a
- * flexible array member, though it's sized in a way that makes it possible to
- * use stack allocations.  See nbtree/README for full details.
+ * scantid is the heap TID that is used as a final tiebreaker attribute,
+ * which may be set to NULL to indicate its absence.  Must be set when
+ * inserting new tuples into heapkeyspace indexes, since every tuple in
+ * the tree unambiguously belongs in one exact position (it's never set
+ * with !heapkeyspace indexes, though).  Despite the representational
+ * difference, nbtree search code considers scantid to be just another
+ * insertion scankey attribute.
+ *
+ * keysz is the number of insertion scankeys present, not including scantid.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared
+ * before scantid (user-visible attributes).  During insertion, there must be
+ * a scan key for every attribute, but when starting a regular index scan some
+ * can be omitted.  The array is used as a flexible array member, though it's
+ * sized in a way that makes it possible to use stack allocations.  See
+ * nbtree/README for full details.
  */
 typedef struct BTScanInsertData
 {
 	/* State used to locate a position at the leaf level */
+	bool		heapkeyspace;
 	bool		nextkey;
+	bool		pivotsearch;	/* Searching for pivot tuple? */
+	ItemPointer scantid;		/* tiebreaker for scankeys */
 	int			keysz;			/* Size of scankeys */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
 } BTScanInsertData;
@@ -600,6 +723,7 @@ extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
+extern bool _bt_heapkeyspace(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -653,8 +777,12 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
-extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+			 IndexTuple firstright, BTScanInsert itup_key);
+extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
+				OffsetNumber offnum);
+extern void _bt_check_third_page(Relation rel, Relation heap,
+					 bool needheaptidspace, Page page, IndexTuple newtup);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index a605851c98..6320a0098f 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,7 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50 and 0x60 are unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -47,6 +46,7 @@
  */
 typedef struct xl_btree_metadata
 {
+	uint32		version;
 	BlockNumber root;
 	uint32		level;
 	BlockNumber fastroot;
@@ -80,27 +80,30 @@ typedef struct xl_btree_insert
  * whole page image.  The left page, however, is handled in the normal
  * incremental-update fashion.
  *
- * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
- * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * Note: XLOG_BTREE_SPLIT_L and XLOG_BTREE_SPLIT_R share this data record.
+ * There are two variants to indicate whether the inserted tuple went into the
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always log the left page high key because suffix
+ * truncation can generate a new leaf high key using user-defined code.  This
+ * is also necessary on internal pages, since the first right item that the
+ * left page's high key was based on will have been truncated to zero
+ * attributes in the right page (the original is unavailable from the right
+ * page).
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * An IndexTuple representing the high key of the left page must follow with
+ * either variant.
  *
  * Backup Blk 1: new right page
  *
- * The right page's data portion contains the right page's tuples in the
- * form used by _bt_restore_page.
+ * The right page's data portion contains the right page's tuples in the form
+ * used by _bt_restore_page.  This includes the new item, if it's the _R
+ * variant.  The right page's tuples also include the right page's high key
+ * with either variant (moved from the left/original page during the split),
+ * unless the split happened to be of the rightmost page on its level, where
+ * there is no high key for the new right page.
  *
  * Backup Blk 2: next block (orig page's rightlink), if any
  * Backup Blk 3: child's left sibling, if non-leaf split
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index b21298a2a6..ff443a476c 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -199,28 +199,22 @@ reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
 --
--- Test B-tree page deletion. In particular, deleting a non-leaf page.
+-- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
--- First create a tree that's at least four levels deep. The text inserted
--- is long and poorly compressible. That way only a few index tuples fit on
--- each page, allowing us to get a tall tree with fewer pages.
+-- First create a tree that's at least three levels deep (i.e. has one level
+-- between the root and leaf levels). The text inserted is long.  It won't be
+-- compressed because we use plain storage in the table.  Only a few index
+-- tuples fit on each internal page, allowing us to get a tall tree with few
+-- pages.  (A tall tree is required to trigger caching.)
+--
+-- The text column must be the leading column in the index, since suffix
+-- truncation would otherwise truncate tuples on internal pages, leaving us
+-- with a short tree.
 create table btree_tall_tbl(id int4, t text);
-create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
-insert into btree_tall_tbl
-  select g, g::text || '_' ||
-          (select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
-from generate_series(1, 100) g;
--- Delete most entries, and vacuum. This causes page deletions.
-delete from btree_tall_tbl where id < 950;
-vacuum btree_tall_tbl;
---
--- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
--- WAL record type). This happens when a "fast root" page is split.
---
--- The vacuum above should've turned the leaf page into a fast root. We just
--- need to insert some rows to cause the fast root page to split.
-insert into btree_tall_tbl (id, t)
-  select g, repeat('x', 100) from generate_series(1, 500) g;
+alter table btree_tall_tbl alter COLUMN t set storage plain;
+create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
+insert into btree_tall_tbl select g, repeat('x', 250)
+from generate_series(1, 130) g;
 --
 -- Test vacuum_cleanup_index_scale_factor
 --
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 5d4eb59a0c..54d3eee197 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3225,11 +3225,22 @@ explain (costs off)
 CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 --
+-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
+-- WAL record type). This happens when a "fast root" page is split.  This
+-- also creates coverage for nbtree FSM page recycling.
+--
+-- The vacuum above should've turned the leaf page into a fast root. We just
+-- need to insert some rows to cause the fast root page to split.
+INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
+--
 -- REINDEX (VERBOSE)
 --
 CREATE TABLE reindex_verbose(id integer primary key);
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8e50f8ffbb..8d31110b87 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 -- doesn't work: grant still exists
 DROP USER regress_dep_user1;
 ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
-DETAIL:  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
+DETAIL:  privileges for table deptest1
 privileges for database regression
-privileges for table deptest1
+owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
 DROP OWNED BY regress_dep_user1;
 DROP USER regress_dep_user1;
 \set VERBOSITY terse
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index d0c9f9a67f..f7891faa23 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -187,9 +187,9 @@ ERROR:  event trigger "regress_event_trigger" does not exist
 -- should fail, regress_evt_user owns some objects
 drop role regress_evt_user;
 ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
-DETAIL:  owner of event trigger regress_event_trigger3
+DETAIL:  owner of user mapping for regress_evt_user on server useless_server
 owner of default privileges on new relations belonging to role regress_evt_user
-owner of user mapping for regress_evt_user on server useless_server
+owner of event trigger regress_event_trigger3
 -- cleanup before next test
 -- these are all OK; the second one should emit a NOTICE
 drop event trigger if exists regress_event_trigger2;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 4d82d3a7e8..0b7582accb 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
 RESET ROLE;
 DROP ROLE regress_test_indirect;                            -- ERROR
 ERROR:  role "regress_test_indirect" cannot be dropped because some objects depend on it
-DETAIL:  owner of server s1
-privileges for foreign-data wrapper foo
+DETAIL:  privileges for foreign-data wrapper foo
+owner of server s1
 \des+
                                                                                  List of foreign servers
  Name |           Owner           | Foreign-data wrapper |                   Access privileges                   |  Type  | Version |             FDW options              | Description 
@@ -1995,16 +1995,13 @@ ERROR:  cannot attach a permanent relation as partition of temporary relation "t
 DROP FOREIGN TABLE foreign_part;
 DROP TABLE temp_parted;
 -- Cleanup
+\set VERBOSITY terse
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 ERROR:  role "regress_test_role" cannot be dropped because some objects depend on it
-DETAIL:  privileges for server s4
-privileges for foreign-data wrapper foo
-owner of user mapping for regress_test_role on server s6
 DROP SERVER t1 CASCADE;
 NOTICE:  drop cascades to user mapping for public on server t1
 DROP USER MAPPING FOR regress_test_role SERVER s6;
-\set VERBOSITY terse
 DROP FOREIGN DATA WRAPPER foo CASCADE;
 NOTICE:  drop cascades to 5 other objects
 DROP SERVER s8 CASCADE;
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index 2e170497c9..bad5199d9e 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3503,8 +3503,8 @@ SELECT refclassid::regclass, deptype
 SAVEPOINT q;
 DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
 ERROR:  role "regress_rls_eve" cannot be dropped because some objects depend on it
-DETAIL:  target of policy p on table tbl1
-privileges for table tbl1
+DETAIL:  privileges for table tbl1
+target of policy p on table tbl1
 ROLLBACK TO q;
 ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
 SAVEPOINT q;
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 2b087be796..19fbfa8b72 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -84,32 +84,23 @@ reset enable_indexscan;
 reset enable_bitmapscan;
 
 --
--- Test B-tree page deletion. In particular, deleting a non-leaf page.
+-- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
 
--- First create a tree that's at least four levels deep. The text inserted
--- is long and poorly compressible. That way only a few index tuples fit on
--- each page, allowing us to get a tall tree with fewer pages.
+-- First create a tree that's at least three levels deep (i.e. has one level
+-- between the root and leaf levels). The text inserted is long.  It won't be
+-- compressed because we use plain storage in the table.  Only a few index
+-- tuples fit on each internal page, allowing us to get a tall tree with few
+-- pages.  (A tall tree is required to trigger caching.)
+--
+-- The text column must be the leading column in the index, since suffix
+-- truncation would otherwise truncate tuples on internal pages, leaving us
+-- with a short tree.
 create table btree_tall_tbl(id int4, t text);
-create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
-insert into btree_tall_tbl
-  select g, g::text || '_' ||
-          (select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
-from generate_series(1, 100) g;
-
--- Delete most entries, and vacuum. This causes page deletions.
-delete from btree_tall_tbl where id < 950;
-vacuum btree_tall_tbl;
-
---
--- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
--- WAL record type). This happens when a "fast root" page is split.
---
-
--- The vacuum above should've turned the leaf page into a fast root. We just
--- need to insert some rows to cause the fast root page to split.
-insert into btree_tall_tbl (id, t)
-  select g, repeat('x', 100) from generate_series(1, 500) g;
+alter table btree_tall_tbl alter COLUMN t set storage plain;
+create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
+insert into btree_tall_tbl select g, repeat('x', 250)
+from generate_series(1, 130) g;
 
 --
 -- Test vacuum_cleanup_index_scale_factor
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 67ecad8dd5..4487421ef3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1146,11 +1146,23 @@ explain (costs off)
 CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 
+--
+-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
+-- WAL record type). This happens when a "fast root" page is split.  This
+-- also creates coverage for nbtree FSM page recycling.
+--
+-- The vacuum above should've turned the leaf page into a fast root. We just
+-- need to insert some rows to cause the fast root page to split.
+INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
+
 --
 -- REINDEX (VERBOSE)
 --
diff --git a/src/test/regress/sql/foreign_data.sql b/src/test/regress/sql/foreign_data.sql
index d6fb3fae4e..1cc1f6e012 100644
--- a/src/test/regress/sql/foreign_data.sql
+++ b/src/test/regress/sql/foreign_data.sql
@@ -805,11 +805,11 @@ DROP FOREIGN TABLE foreign_part;
 DROP TABLE temp_parted;
 
 -- Cleanup
+\set VERBOSITY terse
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 DROP SERVER t1 CASCADE;
 DROP USER MAPPING FOR regress_test_role SERVER s6;
-\set VERBOSITY terse
 DROP FOREIGN DATA WRAPPER foo CASCADE;
 DROP SERVER s8 CASCADE;
 \set VERBOSITY default
-- 
2.17.1

#115 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#114)
2 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On 18/03/2019 02:59, Peter Geoghegan wrote:

On Sat, Mar 16, 2019 at 1:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Actually, how about "rootsearch", or "rootdescend"? You're supposed to
hyphenate "re-find", and so it doesn't really work as a function
argument name.

Works for me.

Attached is v18 of the patch series, which calls the new verification
option "rootdescend" verification.

Thanks!

I'm getting a regression failure from the 'create_table' test with this:

--- /home/heikki/git-sandbox/postgresql/src/test/regress/expected/create_table.out      2019-03-11 14:41:41.382759197 +0200
+++ /home/heikki/git-sandbox/postgresql/src/test/regress/results/create_table.out       2019-03-18 13:49:49.480249055 +0200
@@ -413,18 +413,17 @@
c text,
d text
) PARTITION BY RANGE (a oid_ops, plusone(b), c collate "default", d collate "C");
+ERROR:  function plusone(integer) does not exist
+HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

Are you seeing that?

Looking at patches 1 and 2 again:

I'm still not totally happy with the program flow and all the conditions
in _bt_findinsertloc(). I have a hard time following which code paths are
taken when. I refactored that, so that there is a separate copy of the
loop for V3 and V4 indexes. So, when the code used to be something like
this:

_bt_findinsertloc(...)
{
    ...

    /* move right, if needed */
    for (;;)
    {
        /*
         * various conditions for when to stop. Different conditions
         * apply depending on whether it's a V3 or V4 index.
         */
    }

    ...
}

it is now:

_bt_findinsertloc(...)
{
    ...

    if (heapkeyspace)
    {
        /*
         * If 'checkingunique', move right to the correct page.
         */
        for (;;)
        {
            ...
        }
    }
    else
    {
        /*
         * Move right, until we find a page with enough space or "get
         * tired"
         */
        for (;;)
        {
            ...
        }
    }

    ...
}

I think this makes the logic easier to understand. Although there is
some commonality, the conditions for when to move right on a V3 vs V4
index are quite different, so it seems better to handle them separately.
There is some code duplication, but it's not too bad. I moved the common
code that steps to the next page into a new helper function, _bt_stepright(),
which actually seems like a good idea in any case.
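
To make that concrete without opening the attachment: _bt_stepright() is
essentially the old step-right loop pulled out of _bt_findinsertloc().
An abridged sketch of it follows; the attached patch has the authoritative
version, with the full locking comments and assertions:

static void
_bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
{
    BTPageOpaque lpageop;
    Buffer      rbuf = InvalidBuffer;
    BlockNumber rblkno;

    lpageop = (BTPageOpaque) PageGetSpecialPointer(BufferGetPage(insertstate->buf));
    rblkno = lpageop->btpo_next;
    for (;;)
    {
        /* write-lock the right sibling before releasing the current page */
        rbuf = _bt_relandgetbuf(rel, rbuf, rblkno, BT_WRITE);
        lpageop = (BTPageOpaque) PageGetSpecialPointer(BufferGetPage(rbuf));

        /* finish any incomplete split before stepping onto the page */
        if (P_INCOMPLETE_SPLIT(lpageop))
        {
            _bt_finish_split(rel, rbuf, stack);
            rbuf = InvalidBuffer;
            continue;
        }
        if (!P_IGNORE(lpageop))
            break;              /* found a live page to insert on */
        if (P_RIGHTMOST(lpageop))
            elog(ERROR, "fell off the end of index \"%s\"",
                 RelationGetRelationName(rel));
        rblkno = lpageop->btpo_next;
    }

    /* hand the new page to the caller; cached search bounds no longer apply */
    _bt_relbuf(rel, insertstate->buf);
    insertstate->buf = rbuf;
    insertstate->bounds_valid = false;
}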

See attached patches with those changes, plus some minor comment
kibitzing. It's still failing the 'create_table' regression test, though.

- Heikki

PS. The commit message of the first patch needs updating, now that
BTInsertState is different from BTScanInsert.

Attachments:

v18-heikki-0001-Refactor-nbtree-insertion-scankeys.patch (text/x-patch)
From ae6ed703483f5eeef755462716d0c2cf91c74eae Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 29 Dec 2018 15:34:48 -0800
Subject: [PATCH 1/2] Refactor nbtree insertion scankeys.

Use dedicated struct to represent nbtree insertion scan keys.  Having a
dedicated struct makes the difference between search type scankeys and
insertion scankeys a lot clearer, and simplifies the signature of
several related functions.

Use the new struct to store mutable state about an in-progress binary
search, rather than having _bt_check_unique() callers cache their binary
search in an ad-hoc manner.  This makes it easy to add a new
optimization: _bt_check_unique() now falls out of its loop immediately
in the common case where it's already clear that there couldn't possibly
be a duplicate.  More importantly, the new _bt_check_unique() scheme
makes it a lot easier to manage cached binary search effort afterwards,
from within _bt_findinsertloc().  This is needed for the upcoming patch
to make nbtree tuples unique by treating heap TID as a final tiebreaker
column.

Based on a suggestion by Andrey Lepikhov.

Author: Peter Geoghegan, Heikki Linnakangas
Reviewed-By: Heikki Linnakangas, Andrey Lepikhov
Discussion: https://postgr.es/m/CAH2-WzmE6AhUdk9NdWBf4K3HjWXZBX3+umC7mH7+WDrKcRtsOw@mail.gmail.com
---
 contrib/amcheck/verify_nbtree.c       |  52 ++--
 src/backend/access/nbtree/README      |  29 +-
 src/backend/access/nbtree/nbtinsert.c | 404 ++++++++++++++------------
 src/backend/access/nbtree/nbtpage.c   |  12 +-
 src/backend/access/nbtree/nbtsearch.c | 230 ++++++++++-----
 src/backend/access/nbtree/nbtsort.c   |   8 +-
 src/backend/access/nbtree/nbtutils.c  |  94 ++----
 src/backend/utils/sort/tuplesort.c    |  16 +-
 src/include/access/nbtree.h           |  80 ++++-
 9 files changed, 534 insertions(+), 391 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index bb6442de82d..5426bfd8d87 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -127,9 +127,9 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
-static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+static BTScanInsert bt_right_page_check_scankey(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
@@ -139,14 +139,14 @@ static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber upperbound);
 static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 BTScanInsert key,
 					 OffsetNumber lowerbound);
 static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
+							   BTScanInsert key,
+							   Page nontarget,
 							   OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
 
@@ -838,8 +838,8 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		BTScanInsert skey;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -1030,7 +1030,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
-			ScanKey		rightkey;
+			BTScanInsert	rightkey;
 
 			/* Get item in next/right page */
 			rightkey = bt_right_page_check_scankey(state);
@@ -1082,7 +1082,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, skey, childblock);
 		}
 	}
 
@@ -1111,11 +1111,12 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
+static BTScanInsert
 bt_right_page_check_scankey(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
+	IndexTuple	firstitup;
 	BlockNumber targetnext;
 	Page		rightpage;
 	OffsetNumber nline;
@@ -1303,8 +1304,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * Return first real item scankey.  Note that this relies on right page
 	 * memory remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
+	return _bt_mkscankey(state->rel, firstitup);
 }
 
 /*
@@ -1317,8 +1318,8 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  * verification this way around is much more practical.
  */
 static void
-bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
+				  BlockNumber childblock)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1423,8 +1424,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1864,13 +1864,12 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
+invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
@@ -1883,13 +1882,12 @@ invariant_leq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
+invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
 					 OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
 	return cmp >= 0;
 }
@@ -1905,14 +1903,12 @@ invariant_geq_offset(BtreeCheckState *state, ScanKey key,
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							   Page nontarget, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
 	return cmp <= 0;
 }
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index b0b4ab8b766..a295a7a286d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -598,19 +598,22 @@ scankey point to comparison functions that return boolean, such as int4lt.
 There might be more than one scankey entry for a given index column, or
 none at all.  (We require the keys to appear in index column order, but
 the order of multiple keys for a given column is unspecified.)  An
-insertion scankey uses the same array-of-ScanKey data structure, but the
-sk_func pointers point to btree comparison support functions (ie, 3-way
-comparators that return int4 values interpreted as <0, =0, >0).  In an
-insertion scankey there is exactly one entry per index column.  Insertion
-scankeys are built within the btree code (eg, by _bt_mkscankey()) and are
-used to locate the starting point of a scan, as well as for locating the
-place to insert a new index tuple.  (Note: in the case of an insertion
-scankey built from a search scankey, there might be fewer keys than
-index columns, indicating that we have no constraints for the remaining
-index columns.)  After we have located the starting point of a scan, the
-original search scankey is consulted as each index entry is sequentially
-scanned to decide whether to return the entry and whether the scan can
-stop (see _bt_checkkeys()).
+insertion scankey ("BTScanInsert" data structure) uses a similar
+array-of-ScanKey data structure, but the sk_func pointers point to btree
+comparison support functions (ie, 3-way comparators that return int4 values
+interpreted as <0, =0, >0).  In an insertion scankey there is at most one
+entry per index column.  There is also other data about the rules used to
+locate where to begin the scan, such as whether or not the scan is a
+"nextkey" scan.  Insertion scankeys are built within the btree code (eg, by
+_bt_mkscankey()) and are used to locate the starting point of a scan, as
+well as for locating the place to insert a new index tuple.  (Note: in the
+case of an insertion scankey built from a search scankey or built from a
+truncated pivot tuple, there might be fewer keys than index columns,
+indicating that we have no constraints for the remaining index columns.)
+After we have located the starting point of a scan, the original search
+scankey is consulted as each index entry is sequentially scanned to decide
+whether to return the entry and whether the scan can stop (see
+_bt_checkkeys()).
 
 We use term "pivot" index tuples to distinguish tuples which don't point
 to heap tuples, but rather used for tree navigation.  Pivot tuples includes
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 2997b1111a2..41aaa1bdd46 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -51,19 +51,16 @@ typedef struct
 
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 
-static TransactionId _bt_check_unique(Relation rel, IndexTuple itup,
-				 Relation heapRel, Buffer buf, OffsetNumber offset,
-				 ScanKey itup_scankey,
+static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
+				 Relation heapRel,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken);
-static void _bt_findinsertloc(Relation rel,
-				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
-				  IndexTuple newtup,
+static OffsetNumber _bt_findinsertloc(Relation rel,
+				  BTInsertState insertstate,
+				  bool checkingunique,
 				  BTStack stack,
 				  Relation heapRel);
+static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
 static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
@@ -83,8 +80,8 @@ static void _bt_checksplitloc(FindSplitData *state,
 				  int dataitemstoleft, Size firstoldonrightsz);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 			 OffsetNumber itup_off);
-static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey);
+static bool _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key,
+			Page page, OffsetNumber offnum);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
 
 /*
@@ -110,18 +107,26 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 			 IndexUniqueCheck checkUnique, Relation heapRel)
 {
 	bool		is_unique = false;
-	int			indnkeyatts;
-	ScanKey		itup_scankey;
+	BTInsertStateData insertstate;
+	BTScanInsert itup_key;
 	BTStack		stack = NULL;
 	Buffer		buf;
-	OffsetNumber offset;
 	bool		fastpath;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	Assert(indnkeyatts != 0);
+	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);
 
 	/* we need an insertion scan key to do our search, so build one */
-	itup_scankey = _bt_mkscankey(rel, itup);
+	itup_key = _bt_mkscankey(rel, itup);
+
+	/*
+	 * Fill in the BTInsertState working area, to track the current page and
+	 * position within the page to insert on
+	 */
+	insertstate.itup = itup;
+	/* PageAddItem will MAXALIGN(), but be consistent */
+	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+	insertstate.itup_key = itup_key;
+	insertstate.bounds_valid = false;
+	insertstate.buf = InvalidBuffer;
 
 	/*
 	 * It's very common to have an index on an auto-incremented or
@@ -144,10 +149,8 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	 */
 top:
 	fastpath = false;
-	offset = InvalidOffsetNumber;
 	if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
 	{
-		Size		itemsz;
 		Page		page;
 		BTPageOpaque lpageop;
 
@@ -166,9 +169,6 @@ top:
 			page = BufferGetPage(buf);
 
 			lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
-			itemsz = IndexTupleSize(itup);
-			itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this
-										 * but we need to be consistent */
 
 			/*
 			 * Check if the page is still the rightmost leaf page, has enough
@@ -177,10 +177,9 @@ top:
 			 */
 			if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
 				!P_IGNORE(lpageop) &&
-				(PageGetFreeSpace(page) > itemsz) &&
+				(PageGetFreeSpace(page) > insertstate.itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
-							P_FIRSTDATAKEY(lpageop)) > 0)
+				_bt_compare(rel, itup_key, page, P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
 				 * The right-most block should never have an incomplete split.
@@ -219,10 +218,12 @@ top:
 		 * Find the first page containing this key.  Buffer returned by
 		 * _bt_search() is locked in exclusive mode.
 		 */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, itup_key, &buf, BT_WRITE, NULL);
 	}
 
+	insertstate.buf = buf;
+	buf = InvalidBuffer;		/* insertstate.buf now owns the buffer */
+
 	/*
 	 * If we're not allowing duplicates, make sure the key isn't already in
 	 * the index.
@@ -244,19 +245,19 @@ top:
 	 * let the tuple in and return false for possibly non-unique, or true for
 	 * definitely unique.
 	 */
-	if (checkUnique != UNIQUE_CHECK_NO)
+	if (checkingunique)
 	{
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
-		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
-								 checkUnique, &is_unique, &speculativeToken);
+		xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
+								 &is_unique, &speculativeToken);
 
 		if (TransactionIdIsValid(xwait))
 		{
 			/* Have to wait for the other guy ... */
-			_bt_relbuf(rel, buf);
+			_bt_relbuf(rel, insertstate.buf);
+			insertstate.buf = InvalidBuffer;
 
 			/*
 			 * If it's a speculative insertion, wait for it to finish (ie. to
@@ -277,6 +278,8 @@ top:
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		OffsetNumber newitemoff;
+
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock.  Since the
@@ -286,22 +289,28 @@ top:
 		 * This reasoning also applies to INCLUDE indexes, whose extra
 		 * attributes are not considered part of the key space.
 		 */
-		CheckForSerializableConflictIn(rel, NULL, buf);
-		/* do the insertion */
-		_bt_findinsertloc(rel, &buf, &offset, indnkeyatts, itup_scankey, itup,
-						  stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+		CheckForSerializableConflictIn(rel, NULL, insertstate.buf);
+
+		/*
+		 * Do the insertion.  Note that insertstate contains cached binary
+		 * search bounds established within _bt_check_unique when insertion is
+		 * checkingunique.
+		 */
+		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
+									   stack, heapRel);
+		_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
+					   newitemoff, false);
 	}
 	else
 	{
 		/* just release the buffer */
-		_bt_relbuf(rel, buf);
+		_bt_relbuf(rel, insertstate.buf);
 	}
 
 	/* be tidy */
 	if (stack)
 		_bt_freestack(stack);
-	_bt_freeskey(itup_scankey);
+	pfree(itup_key);
 
 	return is_unique;
 }
@@ -309,10 +318,6 @@ top:
 /*
  *	_bt_check_unique() -- Check for violation of unique index constraint
  *
- * offset points to the first possible item that could conflict. It can
- * also point to end-of-page, which means that the first tuple to check
- * is the first tuple on the next page.
- *
  * Returns InvalidTransactionId if there is no conflict, else an xact ID
  * we must wait for to see if it commits a conflicting tuple.   If an actual
  * conflict is detected, no return --- just ereport().  If an xact ID is
@@ -324,16 +329,21 @@ top:
  * InvalidTransactionId because we don't want to wait.  In this case we
  * set *is_unique to false if there is a potential conflict, and the
  * core code must redo the uniqueness check later.
+ *
+ * As a side-effect, sets state in insertstate that can later be used by
+ * _bt_findinsertloc() to reuse most of the binary search work we do
+ * here.
  */
 static TransactionId
-_bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
-				 Buffer buf, OffsetNumber offset, ScanKey itup_scankey,
+_bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				 IndexUniqueCheck checkUnique, bool *is_unique,
 				 uint32 *speculativeToken)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	IndexTuple	itup = insertstate->itup;
+	BTScanInsert itup_key = insertstate->itup_key;
 	SnapshotData SnapshotDirty;
+	OffsetNumber offset;
 	OffsetNumber maxoff;
 	Page		page;
 	BTPageOpaque opaque;
@@ -345,13 +355,22 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 
 	InitDirtySnapshot(SnapshotDirty);
 
-	page = BufferGetPage(buf);
+	page = BufferGetPage(insertstate->buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	maxoff = PageGetMaxOffsetNumber(page);
 
+	/*
+	 * Find the first tuple with the same key.
+	 *
+	 * This also saves the binary search bounds in insertstate.  We use them
+	 * in the fastpath below, but also in the _bt_findinsertloc() call later.
+	 */
+	offset = _bt_binsrch_insert(rel, insertstate);
+
 	/*
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
+	Assert(!insertstate->bounds_valid || insertstate->low == offset);
 	for (;;)
 	{
 		ItemId		curitemid;
@@ -364,21 +383,40 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 		 */
 		if (offset <= maxoff)
 		{
+			/*
+			 * Fastpath: In most cases, we can use cached search bounds to
+			 * limit our consideration to items that are definitely
+			 * duplicates.  This fastpath doesn't apply when the original page
+			 * is empty, or when initial offset is past the end of the
+			 * original page, which may indicate that we need to examine a
+			 * second or subsequent page.
+			 *
+			 * Note that this optimization avoids calling _bt_isequal()
+			 * entirely when there are no duplicates, as long as the offset
+			 * where the key will go is not at the end of the page.
+			 */
+			if (insertstate->bounds_valid && offset == insertstate->stricthigh)
+			{
+				Assert(nbuf == InvalidBuffer);
+				Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
+				Assert(insertstate->low <= insertstate->stricthigh);
+				Assert(!_bt_isequal(itupdesc, itup_key, page, offset));
+				break;
+			}
+
 			curitemid = PageGetItemId(page, offset);
 
 			/*
 			 * We can skip items that are marked killed.
 			 *
-			 * Formerly, we applied _bt_isequal() before checking the kill
-			 * flag, so as to fall out of the item loop as soon as possible.
-			 * However, in the presence of heavy update activity an index may
-			 * contain many killed items with the same key; running
-			 * _bt_isequal() on each killed item gets expensive. Furthermore
-			 * it is likely that the non-killed version of each key appears
-			 * first, so that we didn't actually get to exit any sooner
-			 * anyway. So now we just advance over killed items as quickly as
-			 * we can. We only apply _bt_isequal() when we get to a non-killed
-			 * item or the end of the page.
+			 * In the presence of heavy update activity an index may contain
+			 * many killed items with the same key; running _bt_isequal() on
+			 * each killed item gets expensive.  Just advance over killed
+			 * items as quickly as we can.  We only apply _bt_isequal() when
+			 * we get to a non-killed item.  Even those comparisons could be
+			 * avoided (in the common case where there is only one page to
+			 * visit) by reusing bounds, but just skipping dead items is fast
+			 * enough.
 			 */
 			if (!ItemIdIsDead(curitemid))
 			{
@@ -391,7 +429,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 				 * in real comparison, but only for ordering/finding items on
 				 * pages. - vadim 03/24/97
 				 */
-				if (!_bt_isequal(itupdesc, page, offset, indnkeyatts, itup_scankey))
+				if (!_bt_isequal(itupdesc, itup_key, page, offset))
 					break;		/* we're past all the equal tuples */
 
 				/* okay, we gotta fetch the heap tuple ... */
@@ -488,7 +526,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 					 * otherwise be masked by this unique constraint
 					 * violation.
 					 */
-					CheckForSerializableConflictIn(rel, NULL, buf);
+					CheckForSerializableConflictIn(rel, NULL, insertstate->buf);
 
 					/*
 					 * This is a definite conflict.  Break the tuple down into
@@ -500,7 +538,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 					 */
 					if (nbuf != InvalidBuffer)
 						_bt_relbuf(rel, nbuf);
-					_bt_relbuf(rel, buf);
+					_bt_relbuf(rel, insertstate->buf);
+					insertstate->buf = InvalidBuffer;
 
 					{
 						Datum		values[INDEX_MAX_KEYS];
@@ -540,7 +579,7 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 					if (nbuf != InvalidBuffer)
 						MarkBufferDirtyHint(nbuf, true);
 					else
-						MarkBufferDirtyHint(buf, true);
+						MarkBufferDirtyHint(insertstate->buf, true);
 				}
 			}
 		}
@@ -552,11 +591,14 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
+			int			highkeycmp;
+
 			/* If scankey == hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
+			Assert(highkeycmp <= 0);
+			if (highkeycmp != 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -600,57 +642,42 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 /*
  *	_bt_findinsertloc() -- Finds an insert location for a tuple
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in fact,
- *		if the new key is equal to the page's "high key" we can place it on
- *		the next page.  If it is equal to the high key, and there's not room
- *		to insert the new tuple on the current page without splitting, then
- *		we can move right hoping to find more free space and avoid a split.
- *		(We should not move right indefinitely, however, since that leads to
- *		O(N^2) insertion behavior in the presence of many equal keys.)
- *		Once we have chosen the page to put the key on, we'll insert it before
- *		any existing equal keys because of the way _bt_binsrch() works.
+ *		On entry, insertstate buffer contains the first legal page the new
+ *		tuple could be inserted to.  It is exclusive-locked and pinned by the
+ *		caller.
  *
- *		If there's not enough room in the space, we try to make room by
- *		removing any LP_DEAD tuples.
+ *		If the new key is equal to one or more existing keys, we can
+ *		legitimately place it anywhere in the series of equal keys --- in
+ *		fact, if the new key is equal to the page's "high key" we can place it
+ *		on the next page.  If it is equal to the high key, and there's not
+ *		room to insert the new tuple on the current page without splitting,
+ *		then we can move right hoping to find more free space and avoid a
+ *		split.  Furthermore, if there's not enough room on a page, we try to
+ *		make room by removing any LP_DEAD tuples.
  *
- *		On entry, *bufptr and *offsetptr point to the first legal position
- *		where the new tuple could be inserted.  The caller should hold an
- *		exclusive lock on *bufptr.  *offsetptr can also be set to
- *		InvalidOffsetNumber, in which case the function will search for the
- *		right location within the page if needed.  On exit, they point to the
- *		chosen insert location.  If _bt_findinsertloc decides to move right,
- *		the lock and pin on the original page will be released and the new
- *		page returned to the caller is exclusively locked instead.
+ *		On exit, insertstate buffer contains the chosen insertion page, and
+ *		the offset within that page is returned.  If _bt_findinsertloc decides
+ *		to move right, the lock and pin on the original page are released, and
+ *		the new buffer is exclusively locked and pinned instead.
  *
- *		newtup is the new tuple we're inserting, and scankey is an insertion
- *		type scan key for it.
+ *		If insertstate contains cached binary search bounds, we will take
+ *		advantage of them.  This avoids repeating comparisons that we made in
+ *		_bt_check_unique() already.
  */
-static void
+static OffsetNumber
 _bt_findinsertloc(Relation rel,
-				  Buffer *bufptr,
-				  OffsetNumber *offsetptr,
-				  int keysz,
-				  ScanKey scankey,
-				  IndexTuple newtup,
+				  BTInsertState insertstate,
+				  bool checkingunique,
 				  BTStack stack,
 				  Relation heapRel)
 {
-	Buffer		buf = *bufptr;
-	Page		page = BufferGetPage(buf);
-	Size		itemsz;
+	BTScanInsert itup_key = insertstate->itup_key;
+	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
-	bool		movedright,
-				vacuumed;
 	OffsetNumber newitemoff;
-	OffsetNumber firstlegaloff = *offsetptr;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	itemsz = IndexTupleSize(newtup);
-	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
-								 * need to be consistent */
-
 	/*
 	 * Check whether the item can fit on a btree page at all. (Eventually, we
 	 * ought to try to apply TOAST methods if not.) We actually need to be
@@ -660,11 +687,11 @@ _bt_findinsertloc(Relation rel,
 	 *
 	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
 	 */
-	if (itemsz > BTMaxItemSize(page))
+	if (insertstate->itemsz > BTMaxItemSize(page))
 		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itemsz, BTMaxItemSize(page),
+						insertstate->itemsz, BTMaxItemSize(page),
 						RelationGetRelationName(rel)),
 				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 						 "Consider a function index of an MD5 hash of the value, "
@@ -690,100 +717,114 @@ _bt_findinsertloc(Relation rel,
 	 * excellent job of preventing O(N^2) behavior with many equal keys.
 	 *----------
 	 */
-	movedright = false;
-	vacuumed = false;
-	while (PageGetFreeSpace(page) < itemsz)
+	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
+	Assert(!insertstate->bounds_valid || checkingunique);
+	while (PageGetFreeSpace(page) < insertstate->itemsz)
 	{
-		Buffer		rbuf;
-		BlockNumber rblkno;
-
 		/*
 		 * before considering moving right, see if we can obtain enough space
 		 * by erasing LP_DEAD items
 		 */
 		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
 		{
-			_bt_vacuum_one_page(rel, buf, heapRel);
+			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+			insertstate->bounds_valid = false;
 
-			/*
-			 * remember that we vacuumed this page, because that makes the
-			 * hint supplied by the caller invalid
-			 */
-			vacuumed = true;
-
-			if (PageGetFreeSpace(page) >= itemsz)
+			if (PageGetFreeSpace(page) >= insertstate->itemsz)
 				break;			/* OK, now we have enough space */
 		}
 
 		/*
 		 * nope, so check conditions (b) and (c) enumerated above
+		 *
+		 * An earlier _bt_check_unique() call may well have established bounds
+		 * that we can use to skip the high key check for checkingunique
+		 * callers.  This fastpath cannot be used when there are no items on
+		 * the existing page (other than high key), or when it looks like the
+		 * new item belongs last on the page, but it might go on a later page
+		 * instead.
 		 */
+		if (insertstate->bounds_valid &&
+			insertstate->low <= insertstate->stricthigh &&
+			insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+			break;
+
 		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
+			_bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
 			random() <= (MAX_RANDOM_VALUE / 100))
 			break;
 
-		/*
-		 * step right to next non-dead page
-		 *
-		 * must write-lock that page before releasing write lock on current
-		 * page; else someone else's _bt_check_unique scan could fail to see
-		 * our insertion.  write locks on intermediate dead pages won't do
-		 * because we don't know when they will get de-linked from the tree.
-		 */
-		rbuf = InvalidBuffer;
+		_bt_stepright(rel, insertstate, stack);
+		page = BufferGetPage(insertstate->buf);
+		lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	}
 
-		rblkno = lpageop->btpo_next;
-		for (;;)
-		{
-			rbuf = _bt_relandgetbuf(rel, rbuf, rblkno, BT_WRITE);
-			page = BufferGetPage(rbuf);
-			lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* Loop should not break until correct page located */
+	Assert(P_RIGHTMOST(lpageop) ||
+		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-			/*
-			 * If this page was incompletely split, finish the split now. We
-			 * do this while holding a lock on the left sibling, which is not
-			 * good because finishing the split could be a fairly lengthy
-			 * operation.  But this should happen very seldom.
-			 */
-			if (P_INCOMPLETE_SPLIT(lpageop))
-			{
-				_bt_finish_split(rel, rbuf, stack);
-				rbuf = InvalidBuffer;
-				continue;
-			}
+	/* Find new item offset, possibly reusing earlier search bounds */
+	newitemoff = _bt_binsrch_insert(rel, insertstate);
 
-			if (!P_IGNORE(lpageop))
-				break;
-			if (P_RIGHTMOST(lpageop))
-				elog(ERROR, "fell off the end of index \"%s\"",
-					 RelationGetRelationName(rel));
+	return newitemoff;
+}
+
+/*
+ * Step right to next non-dead page, during insertion.
+ *
+ * This is a bit more complicated than moving right in a search.  We must
+ * write-lock the target page before releasing write lock on current page;
+ * else someone else's _bt_check_unique scan could fail to see our insertion.
+ * Write locks on intermediate dead pages won't do because we don't know when
+ * they will get de-linked from the tree.
+ *
+ * (this is more aggressive than it needs to be for non-unique
+ * !heapkeyspace indexes.)
+ */
+static void
+_bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
+{
+	Page		page;
+	BTPageOpaque lpageop;
+	Buffer		rbuf;
+	BlockNumber rblkno;
+
+	page = BufferGetPage(insertstate->buf);
+	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
-			rblkno = lpageop->btpo_next;
+	rbuf = InvalidBuffer;
+	rblkno = lpageop->btpo_next;
+	for (;;)
+	{
+		rbuf = _bt_relandgetbuf(rel, rbuf, rblkno, BT_WRITE);
+		page = BufferGetPage(rbuf);
+		lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+		/*
+		 * If this page was incompletely split, finish the split now.  We do
+		 * this while holding a lock on the left sibling, which is not good
+		 * because finishing the split could be a fairly lengthy operation.
+		 * But this should happen very seldom.
+		 */
+		if (P_INCOMPLETE_SPLIT(lpageop))
+		{
+			_bt_finish_split(rel, rbuf, stack);
+			rbuf = InvalidBuffer;
+			continue;
 		}
-		_bt_relbuf(rel, buf);
-		buf = rbuf;
-		movedright = true;
-		vacuumed = false;
-	}
 
-	/*
-	 * Now we are on the right page, so find the insert position. If we moved
-	 * right at all, we know we should insert at the start of the page. If we
-	 * didn't move right, we can use the firstlegaloff hint if the caller
-	 * supplied one, unless we vacuumed the page which might have moved tuples
-	 * around making the hint invalid. If we didn't move right or can't use
-	 * the hint, find the position by searching.
-	 */
-	if (movedright)
-		newitemoff = P_FIRSTDATAKEY(lpageop);
-	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
-		newitemoff = firstlegaloff;
-	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+		if (!P_IGNORE(lpageop))
+			break;
+		if (P_RIGHTMOST(lpageop))
+			elog(ERROR, "fell off the end of index \"%s\"",
+				 RelationGetRelationName(rel));
 
-	*bufptr = buf;
-	*offsetptr = newitemoff;
+		rblkno = lpageop->btpo_next;
+	}
+	/* rbuf locked, local state set up; unlock buf, update other state */
+	_bt_relbuf(rel, insertstate->buf);
+	insertstate->buf = rbuf;
+	insertstate->bounds_valid = false;
 }
 
 /*----------
@@ -2312,24 +2353,21 @@ _bt_pgaddtup(Page page,
  * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
-_bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
-			int keysz, ScanKey scankey)
+_bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
+			OffsetNumber offnum)
 {
 	IndexTuple	itup;
+	ScanKey		scankey;
 	int			i;
 
-	/* Better be comparing to a leaf item */
+	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
 
+	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
-	for (i = 1; i <= keysz; i++)
+	for (i = 1; i <= itup_key->keysz; i++)
 	{
 		AttrNumber	attno;
 		Datum		datum;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c785bca95e..56041c3d383 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1371,7 +1371,7 @@ _bt_pagedel(Relation rel, Buffer buf)
 			 */
 			if (!stack)
 			{
-				ScanKey		itup_scankey;
+				BTScanInsert itup_key;
 				ItemId		itemid;
 				IndexTuple	targetkey;
 				Buffer		lbuf;
@@ -1421,12 +1421,10 @@ _bt_pagedel(Relation rel, Buffer buf)
 				}
 
 				/* we need an insertion scan key for the search, so build one */
-				itup_scankey = _bt_mkscankey(rel, targetkey);
-				/* find the leftmost leaf page containing this key */
-				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
-				/* don't need a pin on the page */
+				itup_key = _bt_mkscankey(rel, targetkey);
+				/* get stack to leaf page by searching index */
+				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
+				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
 
 				/*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index af3da3aa5b6..0305469ad0a 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -24,6 +24,8 @@
 #include "utils/rel.h"
 
 
+static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 			 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -34,7 +36,6 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
 					  ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
-static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
 
@@ -66,18 +67,13 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
 	}
 }
 
-
 /*
  *	_bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
+ * The passed scankey is an insertion-type scankey (see nbtree/README),
  * but it can omit the rightmost column(s) of the index.
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * Return value is a stack of parent-page pointers.  *bufP is set to the
  * address of the leaf-page buffer, which is read-locked and pinned.
  * No locks are held on the parent pages, however!
@@ -93,8 +89,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
+		   Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
@@ -130,8 +126,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  (access == BT_WRITE), stack_in,
+		*bufP = _bt_moveright(rel, key, *bufP, (access == BT_WRITE), stack_in,
 							  page_access, snapshot);
 
 		/* if this is a leaf page, we're done */
@@ -144,7 +139,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, key, *bufP);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -198,8 +193,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
-							  true, stack_in, BT_WRITE, snapshot);
+		*bufP = _bt_moveright(rel, key, *bufP, true, stack_in, BT_WRITE,
+							  snapshot);
 	}
 
 	return stack_in;
@@ -215,16 +210,17 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  * or strictly to the right of it.
  *
  * This routine decides whether or not we need to move right in the
- * tree by examining the high key entry on the page.  If that entry
- * is strictly less than the scankey, or <= the scankey in the nextkey=true
- * case, then we followed the wrong link and we need to move right.
+ * tree by examining the high key entry on the page.  If that entry is
+ * strictly less than the scankey, or <= the scankey in the
+ * key.nextkey=true case, then we followed the wrong link and we need
+ * to move right.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ * The passed insertion-type scankey can omit the rightmost column(s) of the
+ * index. (see nbtree/README)
  *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
+ * When key.nextkey is false (the usual case), we are looking for the first
+ * item >= key.  When key.nextkey is true, we are looking for the first item
+ * strictly greater than key.
  *
  * If forupdate is true, we will attempt to finish any incomplete splits
  * that we encounter.  This is required when locking a target page for an
@@ -241,10 +237,8 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
  */
 Buffer
 _bt_moveright(Relation rel,
+			  BTScanInsert key,
 			  Buffer buf,
-			  int keysz,
-			  ScanKey scankey,
-			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
 			  int access,
@@ -269,7 +263,7 @@ _bt_moveright(Relation rel,
 	 * We also have to move right if we followed a link that brought us to a
 	 * dead page.
 	 */
-	cmpval = nextkey ? 0 : 1;
+	cmpval = key->nextkey ? 0 : 1;
 
 	for (;;)
 	{
@@ -304,7 +298,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -324,13 +318,6 @@ _bt_moveright(Relation rel,
 /*
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
- *
- * When nextkey is false (the usual case), we are looking for the first
- * item >= scankey.  When nextkey is true, we are looking for the first
- * item strictly greater than scankey.
- *
  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
  * key >= given scankey, or > scankey if nextkey is true.  (NOTE: in
  * particular, this means it is possible to return a value 1 greater than the
@@ -348,12 +335,10 @@ _bt_moveright(Relation rel,
  * the given page.  _bt_binsrch() has no lock or refcount side effects
  * on the buffer.
  */
-OffsetNumber
+static OffsetNumber
 _bt_binsrch(Relation rel,
-			Buffer buf,
-			int keysz,
-			ScanKey scankey,
-			bool nextkey)
+			BTScanInsert key,
+			Buffer buf)
 {
 	Page		page;
 	BTPageOpaque opaque;
@@ -375,7 +360,7 @@ _bt_binsrch(Relation rel,
 	 * This can never happen on an internal page, however, since they are
 	 * never empty (an internal page must have children).
 	 */
-	if (high < low)
+	if (unlikely(high < low))
 		return low;
 
 	/*
@@ -392,7 +377,7 @@ _bt_binsrch(Relation rel,
 	 */
 	high++;						/* establish the loop invariant for high */
 
-	cmpval = nextkey ? 0 : 1;	/* select comparison value */
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
 
 	while (high > low)
 	{
@@ -400,7 +385,7 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, key, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -427,14 +412,120 @@ _bt_binsrch(Relation rel,
 	return OffsetNumberPrev(low);
 }
 
-/*----------
- *	_bt_compare() -- Compare scankey to a particular tuple on the page.
+/*
  *
- * The passed scankey must be an insertion-type scankey (see nbtree/README),
- * but it can omit the rightmost column(s) of the index.
+ *	bt_binsrch_insert() -- Cacheable, incremental leaf page binary search.
+ *
+ * Like _bt_binsrch(), but with support for caching the binary search
+ * bounds.  Only used during insertion, and only on the leaf page that it
+ * looks like caller will insert tuple on.  Exclusive-locked and pinned
+ * leaf page is contained within insertstate.
+ *
+ * Caches the bounds fields in insertstate so that a subsequent call can
+ * reuse the low and strict high bounds of original binary search.  Callers
+ * that use these fields directly must be prepared for the case where low
+ * and/or stricthigh are not on the same page (one or both exceed maxoff
+ * for the page).  The case where there are no items on the page (high <
+ * low) makes bounds invalid.
+ *
+ * Caller is responsible for invalidating bounds when it modifies the page
+ * before calling here a second time.
+ */
+OffsetNumber
+_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
+{
+	BTScanInsert key = insertstate->itup_key;
+	Page		page;
+	BTPageOpaque opaque;
+	OffsetNumber low,
+				high,
+				stricthigh;
+	int32		result,
+				cmpval;
+
+	page = BufferGetPage(insertstate->buf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	Assert(P_ISLEAF(opaque));
+	Assert(!key->nextkey);
+
+	if (!insertstate->bounds_valid)
+	{
+		/* Start new binary search */
+		low = P_FIRSTDATAKEY(opaque);
+		high = PageGetMaxOffsetNumber(page);
+	}
+	else
+	{
+		/* Restore result of previous binary search against same page */
+		low = insertstate->low;
+		high = insertstate->stricthigh;
+	}
+
+	/* If there are no keys on the page, return the first available slot */
+	if (unlikely(high < low))
+	{
+		/* Caller can't reuse bounds */
+		insertstate->low = InvalidOffsetNumber;
+		insertstate->stricthigh = InvalidOffsetNumber;
+		insertstate->bounds_valid = false;
+		return low;
+	}
+
+	/*
+	 * Binary search to find the first key on the page >= scan key.
+	 * (nextkey is always false when inserting).
+	 *
+	 * The loop invariant is: all slots before 'low' are < scan key, all
+	 * slots at or after 'high' are >= scan key.  'stricthigh' is > scan
+	 * key, and is maintained to save additional search effort for caller.
+	 *
+	 * We can fall out when high == low.
+	 */
+	if (!insertstate->bounds_valid)
+		high++;					/* establish the loop invariant for high */
+	stricthigh = high;			/* high initially strictly higher */
+
+	cmpval = 1;					/* !nextkey comparison value */
+
+	while (high > low)
+	{
+		OffsetNumber mid = low + ((high - low) / 2);
+
+		/* We have low <= mid < high, so mid points at a real slot */
+
+		result = _bt_compare(rel, key, page, mid);
+
+		if (result >= cmpval)
+			low = mid + 1;
+		else
+		{
+			high = mid;
+			if (result != 0)
+				stricthigh = high;
+		}
+	}
+
+	/*
+	 * On a leaf page, a binary search always returns the first key >= scan
+	 * key (at least in !nextkey case), which could be the last slot + 1.
+	 * This is also the lower bound of cached search.
+	 *
+	 * stricthigh may also be the last slot + 1, which prevents caller from
+	 * using bounds directly, but is still useful to us if we're called a
+	 * second time with cached bounds (cached low will be < stricthigh when
+	 * that happens).
+	 */
+	insertstate->low = low;
+	insertstate->stricthigh = stricthigh;
+	insertstate->bounds_valid = true;
+
+	return low;
+}
+
+/*----------
+ *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
- *	keysz: number of key conditions to be checked (might be less than the
- *		number of index columns!)
  *	page/offnum: location of btree item to be compared to.
  *
  *		This routine returns:
@@ -447,25 +538,26 @@ _bt_binsrch(Relation rel,
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
- * scankey.  The actual key value stored (if any, which there probably isn't)
- * does not matter.  This convention allows us to implement the Lehman and
- * Yao convention that the first down-link pointer is before the first key.
- * See backend/access/nbtree/README for details.
+ * scankey.  The actual key value stored is explicitly truncated to 0
+ * attributes (explicitly minus infinity) with version 3+ indexes, but
+ * that isn't relied upon.  This allows us to implement the Lehman and
+ * Yao convention that the first down-link pointer is before the first
+ * key.  See backend/access/nbtree/README for details.
  *----------
  */
 int32
 _bt_compare(Relation rel,
-			int keysz,
-			ScanKey scankey,
+			BTScanInsert key,
 			Page page,
 			OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
-	int			i;
+	ScanKey		scankey;
 
 	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
@@ -488,7 +580,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	scankey = key->scankeys;
+	for (int i = 1; i <= key->keysz; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -574,8 +667,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	StrategyNumber strat;
 	bool		nextkey;
 	bool		goback;
+	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
-	ScanKeyData scankeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
@@ -821,8 +914,9 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
 	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built in the local
-	 * scankeys[] array, using the keys identified by startKeys[].
+	 * identified above.  The insertion scankey is built using the keys
+	 * identified by startKeys[].  (Remaining insertion scankey fields are
+	 * initialized after initial-positioning strategy is finalized.)
 	 */
 	Assert(keysCount <= INDEX_MAX_KEYS);
 	for (i = 0; i < keysCount; i++)
@@ -850,7 +944,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				_bt_parallel_done(scan);
 				return false;
 			}
-			memcpy(scankeys + i, subkey, sizeof(ScanKeyData));
+			memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
 
 			/*
 			 * If the row comparison is the last positioning key we accepted,
@@ -882,7 +976,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					if (subkey->sk_flags & SK_ISNULL)
 						break;	/* can't use null keys */
 					Assert(keysCount < INDEX_MAX_KEYS);
-					memcpy(scankeys + keysCount, subkey, sizeof(ScanKeyData));
+					memcpy(inskey.scankeys + keysCount, subkey,
+						   sizeof(ScanKeyData));
 					keysCount++;
 					if (subkey->sk_flags & SK_ROW_END)
 					{
@@ -928,7 +1023,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 				FmgrInfo   *procinfo;
 
 				procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
-				ScanKeyEntryInitializeWithInfo(scankeys + i,
+				ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
 											   cur->sk_flags,
 											   cur->sk_attno,
 											   InvalidStrategy,
@@ -949,7 +1044,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 					elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
 						 BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
 						 cur->sk_attno, RelationGetRelationName(rel));
-				ScanKeyEntryInitialize(scankeys + i,
+				ScanKeyEntryInitialize(inskey.scankeys + i,
 									   cur->sk_flags,
 									   cur->sk_attno,
 									   InvalidStrategy,
@@ -1052,12 +1147,15 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 			return false;
 	}
 
+	/* Initialize remaining insertion scan key fields */
+	inskey.nextkey = nextkey;
+	inskey.keysz = keysCount;
+
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
-					   scan->xs_snapshot);
+	stack = _bt_search(rel, &inskey, &buf, BT_READ, scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
 	_bt_freestack(stack);
@@ -1086,7 +1184,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, &inskey, buf);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 363dceb5b1c..a0e2e70cefc 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -263,6 +263,7 @@ typedef struct BTWriteState
 {
 	Relation	heap;
 	Relation	index;
+	BTScanInsert inskey;		/* generic insertion scankey */
 	bool		btws_use_wal;	/* dump pages to WAL? */
 	BlockNumber btws_pages_alloced; /* # pages allocated */
 	BlockNumber btws_pages_written; /* # pages written out */
@@ -540,6 +541,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
+	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 
 	/*
 	 * We need to log index creation in WAL iff WAL archiving/streaming is
@@ -1085,7 +1087,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
 				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
-	ScanKey		indexScanKey = NULL;
 	SortSupport sortKeys;
 
 	if (merge)
@@ -1098,7 +1099,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		/* the preparation of merge */
 		itup = tuplesort_getindextuple(btspool->sortstate, true);
 		itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
-		indexScanKey = _bt_mkscankey_nodata(wstate->index);
 
 		/* Prepare SortSupport data for each column */
 		sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
@@ -1106,7 +1106,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 		for (i = 0; i < keysz; i++)
 		{
 			SortSupport sortKey = sortKeys + i;
-			ScanKey		scanKey = indexScanKey + i;
+			ScanKey		scanKey = wstate->inskey->scankeys + i;
 			int16		strategy;
 
 			sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1125,8 +1125,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
 		}
 
-		_bt_freeskey(indexScanKey);
-
 		for (;;)
 		{
 			load1 = true;		/* load BTSpool next ? */
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 2c05fb5e451..898679b44ef 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -56,34 +56,37 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().
+ *		The result is intended for use with _bt_compare().  Callers that don't
+ *		need to fill out the insertion scankey arguments (e.g. they use an
+ *		ad-hoc comparison routine) can pass a NULL index tuple.
  */
-ScanKey
+BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
+	BTScanInsert key;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
-	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
+	int			tupnatts;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
-	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
+	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
 	 * (INCLUDE index) columns are always omitted from scan keys.
 	 */
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
+	key = palloc(offsetof(BTScanInsertData, scankeys) +
+				 sizeof(ScanKeyData) * indnkeyatts);
+	key->nextkey = false;
+	key->keysz = Min(indnkeyatts, tupnatts);
+	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
 		FmgrInfo   *procinfo;
@@ -96,56 +99,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
-		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
-		ScanKeyEntryInitializeWithInfo(&skey[i],
-									   flags,
-									   (AttrNumber) (i + 1),
-									   InvalidStrategy,
-									   InvalidOid,
-									   rel->rd_indcollation[i],
-									   procinfo,
-									   arg);
-	}
-
-	return skey;
-}
-
-/*
- * _bt_mkscankey_nodata
- *		Build an insertion scan key that contains 3-way comparator routines
- *		appropriate to the key datatypes, but no comparison data.  The
- *		comparison data ultimately used must match the key datatypes.
- *
- *		The result cannot be used with _bt_compare(), unless comparison
- *		data is first stored into the key entries.  Currently this
- *		routine is only called by nbtsort.c and tuplesort.c, which have
- *		their own comparison routines.
- */
-ScanKey
-_bt_mkscankey_nodata(Relation rel)
-{
-	ScanKey		skey;
-	int			indnkeyatts;
-	int16	   *indoption;
-	int			i;
-
-	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
-	indoption = rel->rd_indoption;
-
-	skey = (ScanKey) palloc(indnkeyatts * sizeof(ScanKeyData));
-
-	for (i = 0; i < indnkeyatts; i++)
-	{
-		FmgrInfo   *procinfo;
-		int			flags;
 
 		/*
-		 * We can use the cached (default) support procs since no cross-type
-		 * comparison can be needed.
+		 * Key arguments built when caller provides no tuple are
+		 * defensively represented as NULL values.  They should never be
+		 * used.
 		 */
-		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		flags = SK_ISNULL | (indoption[i] << SK_BT_INDOPTION_SHIFT);
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
+		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
 									   (AttrNumber) (i + 1),
@@ -153,19 +120,10 @@ _bt_mkscankey_nodata(Relation rel)
 									   InvalidOid,
 									   rel->rd_indcollation[i],
 									   procinfo,
-									   (Datum) 0);
+									   arg);
 	}
 
-	return skey;
-}
-
-/*
- * free a scan key made by either _bt_mkscankey or _bt_mkscankey_nodata.
- */
-void
-_bt_freeskey(ScanKey skey)
-{
-	pfree(skey);
+	return key;
 }
 
 /*
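
One small detail worth calling out: _bt_mkscankey now sizes the trailing
scankeys[] array for the index's key attribute count rather than the
INDEX_MAX_KEYS maximum that BTScanInsertData declares (the full-size
declaration is what makes stack allocation possible for other callers).  A
standalone sketch of that allocation pattern, with invented names and an int
array standing in for ScanKeyData:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define TOY_MAX_KEYS 32

typedef struct ToyScanInsertData
{
    int     keysz;
    int     scankeys[TOY_MAX_KEYS];  /* only keysz entries are allocated */
} ToyScanInsertData;

typedef ToyScanInsertData *ToyScanInsert;

static ToyScanInsert
toy_mkscankey(int nkeys)
{
    ToyScanInsert key;

    /* Allocate the header plus just the trailing entries we need */
    key = malloc(offsetof(ToyScanInsertData, scankeys) +
                 sizeof(int) * nkeys);
    key->keysz = nkeys;
    for (int i = 0; i < nkeys; i++)
        key->scankeys[i] = i;
    return key;
}

int
main(void)
{
    ToyScanInsert key = toy_mkscankey(3);

    printf("keysz = %d, last key = %d\n", key->keysz, key->scankeys[2]);
    free(key);
    return 0;
}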
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 2946b47b465..16bda5c586a 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -884,7 +884,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -919,7 +919,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	if (state->indexInfo->ii_Expressions != NULL)
 	{
@@ -945,7 +945,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -964,7 +964,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -981,7 +981,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 {
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   randomAccess);
-	ScanKey		indexScanKey;
+	BTScanInsert indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
 
@@ -1014,7 +1014,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->indexRel = indexRel;
 	state->enforceUnique = enforceUnique;
 
-	indexScanKey = _bt_mkscankey_nodata(indexRel);
+	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
 	/* Prepare SortSupport data for each column */
 	state->sortKeys = (SortSupport) palloc0(state->nKeys *
@@ -1023,7 +1023,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	for (i = 0; i < state->nKeys; i++)
 	{
 		SortSupport sortKey = state->sortKeys + i;
-		ScanKey		scanKey = indexScanKey + i;
+		ScanKey		scanKey = indexScanKey->scankeys + i;
 		int16		strategy;
 
 		sortKey->ssup_cxt = CurrentMemoryContext;
@@ -1042,7 +1042,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 		PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
 	}
 
-	_bt_freeskey(indexScanKey);
+	pfree(indexScanKey);
 
 	MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 60622ea7906..d7e83b0a9d2 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -319,6 +319,66 @@ typedef struct BTStackData
 
 typedef BTStackData *BTStack;
 
+/*
+ * BTScanInsert is the btree-private state needed to find an initial position
+ * for an indexscan, or to insert new tuples -- an "insertion scankey" (not to
+ * be confused with a search scankey).  It's used to descend a B-Tree using
+ * _bt_search.
+ *
+ * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
+ * locate the first item >= scankey.  When nextkey is true, they will locate
+ * the first item > scan key.
+ *
+ * keysz is the number of insertion scankeys present.
+ *
+ * scankeys is an array of scan key entries for attributes that are compared.
+ * During insertion, there must be a scan key for every attribute, but when
+ * starting a regular index scan some can be omitted.  The array is used as a
+ * flexible array member, though it's sized in a way that makes it possible to
+ * use stack allocations.  See nbtree/README for full details.
+ */
+typedef struct BTScanInsertData
+{
+	/* State used to locate a position at the leaf level */
+	bool		nextkey;
+	int			keysz;			/* Size of scankeys */
+	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
+} BTScanInsertData;
+
+typedef BTScanInsertData *BTScanInsert;
+
+/*
+ * BTInsertStateData is a working area used during insertion.
+ *
+ * This is filled in after descending the tree to the first leaf page the new
+ * tuple might belong on.  Tracks the current position while performing a
+ * uniqueness check, before we have determined exactly which page to insert
+ * the new tuple on.
+ *
+ * (This should be private to nbtinsert.c, but it's also used by
+ * _bt_binsrch_insert)
+ */
+typedef struct BTInsertStateData
+{
+	IndexTuple	itup;			/* Item we're inserting */
+	Size		itemsz;			/* Size of itup -- should be MAXALIGN()'d */
+	BTScanInsert itup_key;		/* Insertion scankey */
+
+	/* Buffer containing leaf page we're likely to insert itup on */
+	Buffer		buf;
+
+	/*
+	 * Cache of bounds within the current buffer.  Only used for insertions
+	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
+	 * _bt_findinsertloc for details.
+	 */
+	bool		bounds_valid;
+	OffsetNumber low;
+	OffsetNumber stricthigh;
+} BTInsertStateData;
+
+typedef BTInsertStateData *BTInsertState;
+
 /*
  * BTScanOpaqueData is the btree-private state needed for an indexscan.
  * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -558,16 +618,12 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
 /*
  * prototypes for functions in nbtsearch.c
  */
-extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
-extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+extern BTStack _bt_search(Relation rel, BTScanInsert key, Buffer *bufP,
+		   int access, Snapshot snapshot);
+extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
+extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -576,9 +632,7 @@ extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 /*
  * prototypes for functions in nbtutils.c
  */
-extern ScanKey _bt_mkscankey(Relation rel, IndexTuple itup);
-extern ScanKey _bt_mkscankey_nodata(Relation rel);
-extern void _bt_freeskey(ScanKey skey);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
-- 
2.20.1
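
To see how the pieces of this first patch fit together, here is a rough,
illustrative-only sketch of how an insertion could drive the refactored API.
It is not part of the patch: unique checking, the rightmost-page fastpath, and
page splits are all omitted, and toy_find_insert_offset is an invented name;
only the functions, types, and struct fields it uses come from the patch.

#include "postgres.h"
#include "access/nbtree.h"

static OffsetNumber
toy_find_insert_offset(Relation rel, IndexTuple itup)
{
    BTInsertStateData insertstate;
    BTScanInsert itup_key = _bt_mkscankey(rel, itup);
    BTStack     stack;
    OffsetNumber newitemoff;

    insertstate.itup = itup;
    insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
    insertstate.itup_key = itup_key;
    insertstate.bounds_valid = false;

    /* Descend to the leaf page that the new tuple should go on */
    stack = _bt_search(rel, itup_key, &insertstate.buf, BT_WRITE, NULL);

    /* Binary search within the leaf, caching low/stricthigh bounds */
    newitemoff = _bt_binsrch_insert(rel, &insertstate);

    _bt_freestack(stack);
    pfree(itup_key);

    /* Caller still holds a write lock and pin on insertstate.buf */
    return newitemoff;
}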

Attachment: v18-heikki-0002-Make-heap-TID-a-tiebreaker-nbtree-index-column.patch (text/x-patch)
From e1aa608a558e6f7abd3f93b8a5aab941cc188c4d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 27 Apr 2018 12:47:39 -0700
Subject: [PATCH 2/2] Make heap TID a tiebreaker nbtree index column.

Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique.  This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.  A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes.  These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.

A later patch will enhance the logic used by nbtree to pick a split
point.  This patch is likely to negatively impact performance without
smarter choices about precisely where leaf pages are split.  Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d TID in a new high key
during leaf page splits.  The user-facing definition of the "1/3 of a
page" restriction is already imprecise, and so does not need to be
revised.  However, there should be a compatibility note in the v12
release notes.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com
---
 contrib/amcheck/expected/check_btree.out     |   5 +-
 contrib/amcheck/sql/check_btree.sql          |   5 +-
 contrib/amcheck/verify_nbtree.c              | 341 +++++++++++++--
 contrib/pageinspect/btreefuncs.c             |   2 +-
 contrib/pageinspect/expected/btree.out       |   2 +-
 contrib/pgstattuple/expected/pgstattuple.out |  10 +-
 doc/src/sgml/indices.sgml                    |  24 +-
 src/backend/access/common/indextuple.c       |   6 +-
 src/backend/access/nbtree/README             | 136 +++---
 src/backend/access/nbtree/nbtinsert.c        | 404 +++++++++++-------
 src/backend/access/nbtree/nbtpage.c          | 204 ++++++---
 src/backend/access/nbtree/nbtree.c           |   2 +-
 src/backend/access/nbtree/nbtsearch.c        | 106 ++++-
 src/backend/access/nbtree/nbtsort.c          |  91 ++--
 src/backend/access/nbtree/nbtutils.c         | 410 +++++++++++++++++--
 src/backend/access/nbtree/nbtxlog.c          |  47 +--
 src/backend/access/rmgrdesc/nbtdesc.c        |   8 -
 src/backend/utils/sort/tuplesort.c           |  13 +-
 src/include/access/nbtree.h                  | 202 +++++++--
 src/include/access/nbtxlog.h                 |  35 +-
 src/test/regress/expected/btree_index.out    |  34 +-
 src/test/regress/expected/create_index.out   |  13 +-
 src/test/regress/expected/dependency.out     |   4 +-
 src/test/regress/expected/event_trigger.out  |   4 +-
 src/test/regress/expected/foreign_data.out   |   9 +-
 src/test/regress/expected/rowsecurity.out    |   4 +-
 src/test/regress/sql/btree_index.sql         |  37 +-
 src/test/regress/sql/create_index.sql        |  14 +-
 src/test/regress/sql/foreign_data.sql        |   2 +-
 29 files changed, 1620 insertions(+), 554 deletions(-)
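
The heart of this second patch is the ordering rule itself, so before the
diff, here is a standalone toy (not part of the patch; all names invented)
showing what "heap TID as a tiebreaker attribute" means in practice: tuples
compare on their user-visible key first, and equal keys fall back to heap
block and offset number, so every entry has a distinct position in the key
space and logical duplicates come out in heap TID order.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct ToyIndexTuple
{
    int         key;            /* user-visible key attribute */
    uint32_t    heapblk;        /* heap TID: block number ... */
    uint16_t    heapoff;        /* ... and offset number (tiebreaker) */
} ToyIndexTuple;

/* Three-way comparison: user key first, then heap TID as tiebreaker */
static int
toy_tuple_cmp(const void *a, const void *b)
{
    const ToyIndexTuple *ta = a;
    const ToyIndexTuple *tb = b;

    if (ta->key != tb->key)
        return (ta->key > tb->key) - (ta->key < tb->key);
    if (ta->heapblk != tb->heapblk)
        return (ta->heapblk > tb->heapblk) - (ta->heapblk < tb->heapblk);
    return (ta->heapoff > tb->heapoff) - (ta->heapoff < tb->heapoff);
}

int
main(void)
{
    ToyIndexTuple tuples[] = {
        {42, 7, 3}, {42, 2, 9}, {17, 9, 1}, {42, 2, 4},
    };
    int         ntuples = sizeof(tuples) / sizeof(tuples[0]);

    /* Logical duplicates of key 42 end up in heap TID order */
    qsort(tuples, ntuples, sizeof(ToyIndexTuple), toy_tuple_cmp);
    for (int i = 0; i < ntuples; i++)
        printf("key=%d tid=(%u,%u)\n",
               tuples[i].key, (unsigned) tuples[i].heapblk,
               (unsigned) tuples[i].heapoff);
    return 0;
}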

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index ef5c9e1a1c3..1e6079ddd29 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -130,9 +130,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true);
  bt_index_parent_check 
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
index 0ad1631476d..3f1e0d17efe 100644
--- a/contrib/amcheck/sql/check_btree.sql
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -82,9 +82,12 @@ SELECT bt_index_parent_check('bttest_multi_idx', true);
 --
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 SELECT bt_index_parent_check('delete_test_table_pkey', true);
 
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 5426bfd8d87..4363e6b82e7 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -46,6 +46,8 @@ PG_MODULE_MAGIC;
  * block per level, which is bound by the range of BlockNumber:
  */
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
 
 /*
  * State associated with verifying a B-Tree index
@@ -67,6 +69,8 @@ typedef struct BtreeCheckState
 	/* B-Tree Index Relation and associated heap relation */
 	Relation	rel;
 	Relation	heaprel;
+	/* rel is heapkeyspace index? */
+	bool		heapkeyspace;
 	/* ShareLock held on heap/index, rather than AccessShareLock? */
 	bool		readonly;
 	/* Also verifying heap has no unindexed tuples? */
@@ -123,7 +127,7 @@ static void bt_index_check_internal(Oid indrelid, bool parentcheck,
 						bool heapallindexed);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, Relation heaprel,
-					 bool readonly, bool heapallindexed);
+					 bool heapkeyspace, bool readonly, bool heapallindexed);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -138,17 +142,22 @@ static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
 						   IndexTuple itup);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
+static inline bool invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
 					 BTScanInsert key,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 BTScanInsert key,
-					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   BTScanInsert key,
-							   Page nontarget,
-							   OffsetNumber upperbound);
+static inline bool invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							 BTScanInsert key,
+							 Page nontarget,
+							 OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+static inline BTScanInsert bt_mkscankey_pivotsearch(Relation rel,
+													IndexTuple itup);
+static inline ItemPointer BTreeTupleGetHeapTIDCareful(BtreeCheckState *state,
+							IndexTuple itup, bool nonpivot);
 
 /*
  * bt_index_check(index regclass, heapallindexed boolean)
@@ -205,6 +214,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	Oid			heapid;
 	Relation	indrel;
 	Relation	heaprel;
+	bool		heapkeyspace;
 	LOCKMODE	lockmode;
 
 	if (parentcheck)
@@ -255,7 +265,9 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed)
 	btree_index_checkable(indrel);
 
 	/* Check index, possibly against table it is an index on */
-	bt_check_every_level(indrel, heaprel, parentcheck, heapallindexed);
+	heapkeyspace = _bt_heapkeyspace(indrel);
+	bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
+						 heapallindexed);
 
 	/*
 	 * Release locks early. That's ok here because nothing in the called
@@ -325,8 +337,8 @@ btree_index_checkable(Relation rel)
  * parent/child check cannot be affected.)
  */
 static void
-bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
-					 bool heapallindexed)
+bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
+					 bool readonly, bool heapallindexed)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -347,6 +359,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
 	state = palloc0(sizeof(BtreeCheckState));
 	state->rel = rel;
 	state->heaprel = heaprel;
+	state->heapkeyspace = heapkeyspace;
 	state->readonly = readonly;
 	state->heapallindexed = heapallindexed;
 
@@ -807,7 +820,8 @@ bt_target_page_check(BtreeCheckState *state)
 	 * doesn't contain a high key, so nothing to check
 	 */
 	if (!P_RIGHTMOST(topaque) &&
-		!_bt_check_natts(state->rel, state->target, P_HIKEY))
+		!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+						 P_HIKEY))
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
@@ -840,6 +854,7 @@ bt_target_page_check(BtreeCheckState *state)
 		IndexTuple	itup;
 		size_t		tupsize;
 		BTScanInsert skey;
+		bool		lowersizelimit;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -866,7 +881,8 @@ bt_target_page_check(BtreeCheckState *state)
 					 errhint("This could be a torn page problem.")));
 
 		/* Check the number of index tuple attributes */
-		if (!_bt_check_natts(state->rel, state->target, offset))
+		if (!_bt_check_natts(state->rel, state->heapkeyspace, state->target,
+							 offset))
 		{
 			char	   *itid,
 					   *htid;
@@ -907,7 +923,56 @@ bt_target_page_check(BtreeCheckState *state)
 			continue;
 
 		/* Build insertion scankey for current page offset */
-		skey = _bt_mkscankey(state->rel, itup);
+		skey = bt_mkscankey_pivotsearch(state->rel, itup);
+
+		/*
+		 * Make sure tuple size does not exceed the relevant BTREE_VERSION
+		 * specific limit.
+		 *
+		 * BTREE_VERSION 4 (which introduced heapkeyspace rules) requisitioned
+		 * a small amount of space from BTMaxItemSize() in order to ensure
+		 * that suffix truncation always has enough space to add an explicit
+		 * heap TID back to a tuple -- we pessimistically assume that every
+		 * newly inserted tuple will eventually need to have a heap TID
+		 * appended during a future leaf page split, when the tuple becomes
+		 * the basis of the new high key (pivot tuple) for the leaf page.
+		 *
+		 * Since the reclaimed space is reserved for that purpose, we must not
+		 * enforce the slightly lower limit when the extra space has been used
+		 * as intended.  In other words, there is only a cross-version
+		 * difference in the limit on tuple size within leaf pages.
+		 *
+		 * Still, we're particular about the details within BTREE_VERSION 4
+		 * internal pages.  Pivot tuples may only use the extra space for its
+		 * designated purpose.  Enforce the lower limit for pivot tuples when
+		 * an explicit heap TID isn't actually present. (In all other cases
+		 * suffix truncation is guaranteed to generate a pivot tuple that's no
+		 * larger than the first right tuple provided to it by its caller.)
+		 */
+		lowersizelimit = skey->heapkeyspace &&
+			(P_ISLEAF(topaque) || BTreeTupleGetHeapTID(itup) == NULL);
+		if (tupsize > (lowersizelimit ? BTMaxItemSize(state->target) :
+					   BTMaxItemSizeNoHeapTid(state->target)))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
+							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index row size %zu exceeds maximum for index \"%s\"",
+							tupsize, RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap" : "index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -941,9 +1006,35 @@ bt_target_page_check(BtreeCheckState *state)
 		 * grandparents (as well as great-grandparents, and so on).  We don't
 		 * go to those lengths because that would be prohibitively expensive,
 		 * and probably not markedly more effective in practice.
+		 *
+		 * On the leaf level, we check that the key is <= the highkey.
+		 * However, on non-leaf levels we check that the key is < the highkey,
+		 * because the high key is "just another separator" rather than a copy
+		 * of some existing key item; we expect it to be unique among all keys
+		 * on the same level.  (Suffix truncation will sometimes produce a
+		 * leaf highkey that is an untruncated copy of the lastleft item, but
+		 * never any other item, which necessitates weakening the leaf level
+		 * check to <=.)
+		 *
+		 * Full explanation for why a highkey is never truly a copy of another
+		 * item from the same level on internal levels:
+		 *
+		 * While the new left page's high key is copied from the first offset
+		 * on the right page during an internal page split, that's not the
+		 * full story.  In effect, internal pages are split in the middle of
+		 * the firstright tuple, not between the would-be lastleft and
+		 * firstright tuples: the firstright key ends up on the left side as
+		 * left's new highkey, and the firstright downlink ends up on the
+		 * right side as right's new "negative infinity" item.  The negative
+		 * infinity tuple is truncated to zero attributes, so we're only left
+		 * with the downlink.  In other words, the copying is just an
+		 * implementation detail of splitting in the middle of a (pivot)
+		 * tuple. (See also: "Notes About Data Representation" in the nbtree
+		 * README.)
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
+			  invariant_l_offset(state, skey, P_HIKEY)))
 		{
 			char	   *itid,
 					   *htid;
@@ -969,11 +1060,10 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1036,7 +1126,7 @@ bt_target_page_check(BtreeCheckState *state)
 			rightkey = bt_right_page_check_scankey(state);
 
 			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+				!invariant_g_offset(state, rightkey, max))
 			{
 				/*
 				 * As explained at length in bt_right_page_check_scankey(),
@@ -1214,9 +1304,9 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * continued existence of target block as non-ignorable (not half-dead or
 	 * deleted) implies that target page was not merged into from the right by
 	 * deletion; the key space at or after target never moved left.  Target's
-	 * parent either has the same downlink to target as before, or a <=
+	 * parent either has the same downlink to target as before, or a <
 	 * downlink due to deletion at the left of target.  Target either has the
-	 * same highkey as before, or a highkey <= before when there is a page
+	 * same highkey as before, or a highkey < before when there is a page
 	 * split. (The rightmost concurrently-split-from-target-page page will
 	 * still have the same highkey as target was originally found to have,
 	 * which for our purposes is equivalent to target's highkey itself never
@@ -1305,7 +1395,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * memory remaining allocated.
 	 */
 	firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
-	return _bt_mkscankey(state->rel, firstitup);
+	return bt_mkscankey_pivotsearch(state->rel, firstitup);
 }
 
 /*
@@ -1368,7 +1458,8 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the
+	 * page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1417,14 +1508,29 @@ bt_downlink_check(BtreeCheckState *state, BTScanInsert targetkey,
 	{
 		/*
 		 * Skip comparison of target page key against "negative infinity"
-		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * item, if any.  Checking it would indicate that it's not a strict
+		 * lower bound, but that's only because of the hard-coding for
+		 * negative infinity items within _bt_compare().
+		 *
+		 * If nbtree didn't truncate negative infinity tuples during internal
+		 * page splits then we'd expect child's negative infinity key to be
+		 * equal to the scankey/downlink from target/parent (it would be a
+		 * "low key" in this hypothetical scenario, and so it would still need
+		 * to be treated as a special case here).
+		 *
+		 * Negative infinity items can be thought of as a strict lower bound
+		 * that works transitively, with the last non-negative-infinity pivot
+		 * followed during a descent from the root as its "true" strict lower
+		 * bound.  Only a small number of negative infinity items are truly
+		 * negative infinity; those that are the first items of leftmost
+		 * internal pages.  In more general terms, a negative infinity item is
+		 * only negative infinity with respect to the subtree that the page is
+		 * at the root of.
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, targetkey, child, offset))
+		if (!invariant_l_nontarget_offset(state, targetkey, child, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1856,6 +1962,64 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
 }
 
+/*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	Assert(key->pivotsearch);
+
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return invariant_leq_offset(state, key, upperbound);
+
+	cmp = _bt_compare(state->rel, key, state->target, upperbound);
+
+	/*
+	 * _bt_compare() is capable of determining that a scankey with a
+	 * filled-out attribute is greater than pivot tuples where the comparison
+	 * is resolved at a truncated attribute (value of attribute in pivot is
+	 * minus infinity).  However, it is not capable of determining that a
+	 * scankey is _less than_ a tuple on the basis of a comparison resolved at
+	 * _scankey_ minus infinity attribute.  Complete an extra step to simulate
+	 * having minus infinity values for omitted scankey attribute(s).
+	 */
+	if (cmp == 0)
+	{
+		BTPageOpaque topaque;
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+		nonpivot = P_ISLEAF(topaque) && upperbound >= P_FIRSTDATAKEY(topaque);
+
+		/* Get number of keys + heap TID for item to the right */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+		rheaptid = BTreeTupleGetHeapTIDCareful(state, ritup, nonpivot);
+
+		/* Heap TID is tiebreaker key attribute */
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && rheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
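
The asymmetry described in the comment above (a truncated attribute in the
tuple reads as minus infinity, but an omitted scankey attribute needs the
extra step) can be illustrated with a standalone toy comparator, which is not
the amcheck code itself and uses invented names.  It folds both directions
into a single three-way comparison to show the total order the extra step is
simulating: when the shared attribute prefix is equal, the side with fewer
attributes sorts strictly lower.

#include <stdio.h>

/*
 * Compare two "tuples" made of int attributes, where a tuple with fewer
 * attributes is treated as if the missing ones were minus infinity.
 */
static int
toy_truncated_cmp(const int *a, int natts_a, const int *b, int natts_b)
{
    int     shared = natts_a < natts_b ? natts_a : natts_b;

    for (int i = 0; i < shared; i++)
    {
        if (a[i] != b[i])
            return (a[i] > b[i]) - (a[i] < b[i]);
    }
    /* Shared prefix is equal: the more truncated side sorts lower */
    return (natts_a > natts_b) - (natts_a < natts_b);
}

int
main(void)
{
    int     scankey[] = {5, 7};         /* two filled-out attributes */
    int     pivot[] = {5, 7, 9};        /* untruncated, three attributes */
    int     truncated_pivot[] = {5};    /* suffix-truncated pivot */

    /* -1: scankey's missing third attribute acts as minus infinity */
    printf("%d\n", toy_truncated_cmp(scankey, 2, pivot, 3));
    /* 1: the truncated pivot's missing attributes are minus infinity */
    printf("%d\n", toy_truncated_cmp(scankey, 2, truncated_pivot, 1));
    return 0;
}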
+
 /*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
@@ -1869,48 +2033,97 @@ invariant_leq_offset(BtreeCheckState *state, BTScanInsert key,
 {
 	int32		cmp;
 
+	Assert(key->pivotsearch);
+
 	cmp = _bt_compare(state->rel, key, state->target, upperbound);
 
 	return cmp <= 0;
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, BTScanInsert key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, BTScanInsert key,
+				   OffsetNumber lowerbound)
 {
 	int32		cmp;
 
+	Assert(key->pivotsearch);
+
 	cmp = _bt_compare(state->rel, key, state->target, lowerbound);
 
-	return cmp >= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp >= 0;
+
+	/*
+	 * No need to consider the possibility that scankey has attributes that we
+	 * need to force to be interpreted as negative infinity.  _bt_compare() is
+	 * able to determine that scankey is greater than negative infinity.  The
+	 * distinction between "==" and "<" isn't interesting here, since
+	 * corruption is indicated either way.
+	 */
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e. the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
-							   Page nontarget, OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, BTScanInsert key,
+							 Page nontarget, OffsetNumber upperbound)
 {
 	int32		cmp;
 
+	Assert(key->pivotsearch);
+
 	cmp = _bt_compare(state->rel, key, nontarget, upperbound);
 
-	return cmp <= 0;
+	/* pg_upgrade'd indexes may legally have equal sibling tuples */
+	if (!key->heapkeyspace)
+		return cmp <= 0;
+
+	/* See invariant_l_offset() for an explanation of this extra step */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+		BTPageOpaque copaque;
+		bool		nonpivot;
+
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		copaque = (BTPageOpaque) PageGetSpecialPointer(nontarget);
+		nonpivot = P_ISLEAF(copaque) && upperbound >= P_FIRSTDATAKEY(copaque);
+
+		/* Get number of keys + heap TID for child/non-target item */
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+		childheaptid = BTreeTupleGetHeapTIDCareful(state, child, nonpivot);
+
+		/* Heap TID is tiebreaker key attribute */
+		if (key->keysz == uppnkeyatts)
+			return key->scantid == NULL && childheaptid != NULL;
+
+		return key->keysz < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
@@ -2066,3 +2279,53 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
 
 	return page;
 }
+
+/*
+ * _bt_mkscankey() wrapper that automatically prevents insertion scankey from
+ * being considered greater than the pivot tuple that its values originated
+ * from (or some other identical pivot tuple) in the common case where there
+ * are truncated/minus infinity attributes.  Without this extra step, there
+ * are forms of corruption that amcheck could theoretically fail to report.
+ *
+ * For example, invariant_g_offset() might miss a cross-page invariant failure
+ * on an internal level if the scankey built from the first item on the
+ * target's right sibling page happened to be equal to (not greater than) the
+ * last item on target page.  The !pivotsearch tiebreaker in _bt_compare()
+ * might otherwise cause amcheck to assume (rather than actually verify) that
+ * the scankey is greater.
+ */
+static inline BTScanInsert
+bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
+{
+	BTScanInsert skey;
+
+	skey = _bt_mkscankey(rel, itup);
+	skey->pivotsearch = true;
+
+	return skey;
+}
+
+/*
+ * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
+ * be present in cases where that is mandatory.
+ *
+ * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
+ * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
+ * It may become more useful in the future, when non-pivot tuples support their
+ * own alternative INDEX_ALT_TID_MASK representation.
+ */
+static inline ItemPointer
+BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
+							bool nonpivot)
+{
+	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	BlockNumber targetblock = state->targetblock;
+
+	if (result == NULL && nonpivot)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
+						targetblock, RelationGetRelationName(state->rel))));
+
+	return result;
+}
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index bfa0c04c2f1..8d27c9b0f6f 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -561,7 +561,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	 * Get values of extended metadata if available, use default values
 	 * otherwise.
 	 */
-	if (metad->btm_version == BTREE_VERSION)
+	if (metad->btm_version >= BTREE_NOVAC_VERSION)
 	{
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 2aaa4df53b1..07c2dcd7714 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -5,7 +5,7 @@ CREATE INDEX test1_a_idx ON test1 USING btree (a);
 SELECT * FROM bt_metap('test1_a_idx');
 -[ RECORD 1 ]-----------+-------
 magic                   | 340322
-version                 | 3
+version                 | 4
 root                    | 1
 level                   | 0
 fastroot                | 1
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 9858ea69d49..9920dbfd408 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -48,7 +48,7 @@ select version, tree_level,
     from pgstatindex('test_pkey');
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -58,7 +58,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::text);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -68,7 +68,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::name);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select version, tree_level,
@@ -78,7 +78,7 @@ select version, tree_level,
     from pgstatindex('test_pkey'::regclass);
  version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
 ---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
-       3 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
+       4 |          0 |          1 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
 (1 row)
 
 select pg_relpages('test');
@@ -232,7 +232,7 @@ create index test_partition_hash_idx on test_partition using hash (a);
 select pgstatindex('test_partition_idx');
          pgstatindex          
 ------------------------------
- (3,0,8192,0,0,0,0,0,NaN,NaN)
+ (4,0,8192,0,0,0,0,0,NaN,NaN)
 (1 row)
 
 select pgstathashindex('test_partition_hash_idx');
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 9943e8ecd4c..3493f482b86 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -504,8 +504,9 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
 
   <para>
    By default, B-tree indexes store their entries in ascending order
-   with nulls last.  This means that a forward scan of an index on
-   column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
+   with nulls last (table TID is treated as a tiebreaker column among
+   otherwise equal entries).  This means that a forward scan of an
+   index on column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal>
    (or more verbosely, <literal>ORDER BY x ASC NULLS LAST</literal>).  The
    index can also be scanned backward, producing output satisfying
    <literal>ORDER BY x DESC</literal>
@@ -1162,10 +1163,21 @@ CREATE INDEX tab_x_y ON tab(x, y);
    the extra columns are trailing columns; making them be leading columns is
    unwise for the reasons explained in <xref linkend="indexes-multicolumn"/>.
    However, this method doesn't support the case where you want the index to
-   enforce uniqueness on the key column(s).  Also, explicitly marking
-   non-searchable columns as <literal>INCLUDE</literal> columns makes the
-   index slightly smaller, because such columns need not be stored in upper
-   tree levels.
+   enforce uniqueness on the key column(s).
+  </para>
+
+  <para>
+   <firstterm>Suffix truncation</firstterm> always removes non-key
+   columns from upper B-Tree levels.  As payload columns, they are
+   never used to guide index scans.  The truncation process also
+   removes one or more trailing key column(s) when the remaining
+   prefix of key column(s) happens to be sufficient to describe tuples
+   on the lowest B-Tree level.  In practice, covering indexes without
+   an <literal>INCLUDE</literal> clause often avoid storing columns
+   that are effectively payload in the upper levels.  However,
+   explicitly defining payload columns as non-key columns
+   <emphasis>reliably</emphasis> keeps the tuples in upper levels
+   small.
   </para>
 
   <para>
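
The truncation rule described in the documentation change above can be
sketched in a few lines of standalone C (this is not the patch's own
truncation code; names are invented): keep just enough of the attribute
prefix to tell the last tuple on the left half of a split apart from the
first tuple on the right half, and throw the rest away.

#include <stdio.h>

/*
 * Return how many leading attributes the new high key must keep so that
 * lastleft < highkey <= firstright still holds after a leaf page split.
 * If every attribute is equal, the real patch appends a heap TID as a
 * tiebreaker instead; here we simply cap the result at natts.
 */
static int
toy_keep_natts(const int *lastleft, const int *firstright, int natts)
{
    int     keep = 1;

    for (int i = 0; i < natts; i++)
    {
        if (lastleft[i] != firstright[i])
            break;              /* first distinguishing attribute found */
        keep++;
    }
    return keep > natts ? natts : keep;
}

int
main(void)
{
    /* (major, minor, patchlevel) keys around a hypothetical split point */
    int     lastleft[] = {5, 3, 7};
    int     firstright[] = {5, 4, 1};

    /* Only (5, 4) is needed; the trailing attribute can be truncated away */
    printf("keep %d of 3 attributes\n",
           toy_keep_natts(lastleft, firstright, 3));
    return 0;
}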
diff --git a/src/backend/access/common/indextuple.c b/src/backend/access/common/indextuple.c
index 32c0ebb93a4..cb23be859de 100644
--- a/src/backend/access/common/indextuple.c
+++ b/src/backend/access/common/indextuple.c
@@ -536,7 +536,11 @@ index_truncate_tuple(TupleDesc sourceDescriptor, IndexTuple source,
 	bool		isnull[INDEX_MAX_KEYS];
 	IndexTuple	truncated;
 
-	Assert(leavenatts < sourceDescriptor->natts);
+	Assert(leavenatts <= sourceDescriptor->natts);
+
+	/* Easy case: no truncation actually required */
+	if (leavenatts == sourceDescriptor->natts)
+		return CopyIndexTuple(source);
 
 	/* Create temporary descriptor to scribble on */
 	truncdesc = palloc(TupleDescSize(sourceDescriptor));
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index a295a7a286d..40ff25fe062 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -28,37 +28,50 @@ right-link to find the new page containing the key range you're looking
 for.  This might need to be repeated, if the page has been split more than
 once.
 
+Lehman and Yao talk about pairs of "separator" keys and downlinks in
+internal pages rather than tuples or records.  We use the term "pivot"
+tuple to refer to tuples which don't point to heap tuples, that are used
+only for tree navigation.  All tuples on non-leaf pages and high keys on
+leaf pages are pivot tuples.  Since pivot tuples are only used to represent
+which part of the key space belongs on each page, they can have attribute
+values copied from non-pivot tuples that were deleted and killed by VACUUM
+some time ago.  A pivot tuple may contain a "separator" key and downlink,
+just a separator key (i.e. the downlink value is implicitly undefined), or
+just a downlink (i.e. all attributes are truncated away).  We aren't always
+clear on which case applies, but it should be obvious from context.
+
+The requirement that all btree keys be unique is satisfied by treating heap
+TID as a tiebreaker attribute.  Logical duplicates are sorted in heap TID
+order.  This is necessary because Lehman and Yao also require that the key
+range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are
+the adjacent keys in the parent page (Ki must be _strictly_ less than v,
+which can be assured by having reliably unique keys).
+
+A search where the key is equal to a pivot tuple in an upper tree level
+must descend to the left of that pivot to ensure it finds any equal keys.
+The equal item(s) being searched for must therefore be to the left of that
+downlink page on the next level down.  A handy property of this design is
+that a scan where all attributes/keys are used behaves just the same as a
+scan where only some prefix of attributes are used; equality never needs to
+be treated as a special case.
+
+In practice, exact equality with pivot tuples on internal pages is
+extremely rare when all attributes (including even the heap TID attribute)
+are used in a search.  This is due to suffix truncation: truncated
+attributes are treated as having the value negative infinity, and
+truncation almost always manages to at least truncate away the trailing
+heap TID attribute.  While Lehman and Yao don't have anything to say about
+suffix truncation, the design used by nbtree is perfectly complementary.
+The later section on suffix truncation will be helpful if it's unclear how
+the Lehman & Yao invariants work with a real world example involving
+suffix truncation.
+
 Differences to the Lehman & Yao algorithm
 -----------------------------------------
 
 We have made the following changes in order to incorporate the L&Y algorithm
 into Postgres:
 
-The requirement that all btree keys be unique is too onerous,
-but the algorithm won't work correctly without it.  Fortunately, it is
-only necessary that keys be unique on a single tree level, because L&Y
-only use the assumption of key uniqueness when re-finding a key in a
-parent page (to determine where to insert the key for a split page).
-Therefore, we can use the link field to disambiguate multiple
-occurrences of the same user key: only one entry in the parent level
-will be pointing at the page we had split.  (Indeed we need not look at
-the real "key" at all, just at the link field.)  We can distinguish
-items at the leaf level in the same way, by examining their links to
-heap tuples; we'd never have two items for the same heap tuple.
-
-Lehman and Yao assume that the key range for a subtree S is described
-by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
-
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
 among backends.  As a result, we do page-level read locking on btree
@@ -194,9 +207,7 @@ be prepared for the possibility that the item it wants is to the left of
 the recorded position (but it can't have moved left out of the recorded
 page).  Since we hold a lock on the lower page (per L&Y) until we have
 re-found the parent item that links to it, we can be assured that the
-parent item does still exist and can't have been deleted.  Also, because
-we are matching downlink page numbers and not data keys, we don't have any
-problem with possibly misidentifying the parent item.
+parent item does still exist and can't have been deleted.
 
 Page Deletion
 -------------
@@ -615,22 +626,40 @@ scankey is consulted as each index entry is sequentially scanned to decide
 whether to return the entry and whether the scan can stop (see
 _bt_checkkeys()).
 
-We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+Notes about suffix truncation
+-----------------------------
+
+We truncate away suffix key attributes that are not needed for a page high
+key during a leaf page split.  The remaining attributes must distinguish
+the last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page.  Tuples logically retain truncated key attributes,
+though they implicitly have "negative infinity" as their value, and have no
+storage overhead.  Since the high key is subsequently reused as the
+downlink in the parent page for the new right page, suffix truncation makes
+pivot tuples short.  INCLUDE indexes are guaranteed to have non-key
+attributes truncated at the time of a leaf page split, but may also have
+some key attributes truncated away, based on the usual criteria for key
+attributes.  They are not a special case, since non-key attributes are
+merely payload to B-Tree searches.
+
+The goal of suffix truncation of key attributes is to improve index
+fan-out.  The technique was first described by Bayer and Unterauer (R.Bayer
+and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol
+2, No. 1, March 1977, pp 11-26).  The Postgres implementation is loosely
+based on their paper.  Note that Postgres only implements what the paper
+refers to as simple prefix B-Trees.  Note also that the paper assumes that
+the tree has keys that consist of single strings that maintain the "prefix
+property", much like strings that are stored in a suffix tree (comparisons
+of earlier bytes must always be more significant than comparisons of later
+bytes, and, in general, the strings must compare in a way that doesn't
+break transitive consistency as they're split into pieces).  Suffix
+truncation in Postgres currently only works at the whole-attribute
+granularity, but it would be straightforward to invent opclass
+infrastructure that manufactures a smaller attribute value in the case of
+variable-length types, such as text.  An opclass support function could
+manufacture the shortest possible key value that still correctly separates
+each half of a leaf page split.
 
 Notes About Data Representation
 -------------------------------
@@ -643,20 +672,26 @@ don't need to renumber any existing pages when splitting the root.)
 
 The Postgres disk block data format (an array of items) doesn't fit
 Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
-so we have to play some games.
+so we have to play some games.  (Presumably things are explained this
+way because of internal page splits, which conceptually split at the
+middle of an existing pivot tuple -- the tuple's "separator" key goes on
+the left side of the split as the left side's new high key, while the
+tuple's pointer/downlink goes on the right side as the first/minus
+infinity downlink.)
 
 On a page that is not rightmost in its tree level, the "high key" is
 kept in the page's first item, and real data items start at item 2.
 The link portion of the "high key" item goes unused.  A page that is
-rightmost has no "high key", so data items start with the first item.
-Putting the high key at the left, rather than the right, may seem odd,
-but it avoids moving the high key as we add data items.
+rightmost has no "high key" (it's implicitly positive infinity), so
+data items start with the first item.  Putting the high key at the
+left, rather than the right, may seem odd, but it avoids moving the
+high key as we add data items.
 
 On a leaf page, the data items are simply links to (TIDs of) tuples
 in the relation being indexed, with the associated key values.
 
 On a non-leaf page, the data items are down-links to child pages with
-bounding keys.  The key in each data item is the *lower* bound for
+bounding keys.  The key in each data item is a strict lower bound for
 keys on that child page, so logically the key is to the left of that
 downlink.  The high key (if present) is the upper bound for the last
 downlink.  The first data item on each such page has no lower bound
@@ -664,4 +699,5 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
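
To make the README's new key space rules concrete, here is a small standalone
C sketch (not part of the patch; all names are invented for illustration) of a
comparison that treats truncated attributes as minus infinity and the heap TID
as a trailing tiebreaker, for an index with a single user key attribute:

#include <stdio.h>
#include <stdint.h>

/*
 * Illustrative only.  A pivot tuple stores a prefix of the key attributes;
 * anything past natts is treated as minus infinity, including the heap TID
 * (represented here as -1 when truncated away).
 */
typedef struct ToyTuple
{
	int			natts;		/* number of untruncated key attributes */
	int			keys[2];	/* key attribute values */
	int64_t		heaptid;	/* -1 means truncated (minus infinity) */
} ToyTuple;

/*
 * Compare a fully specified search key against a possibly-truncated tuple
 * using the rules described above: truncated attributes (and a truncated
 * heap TID) sort before every real value.
 */
static int
toy_compare(const ToyTuple *key, const ToyTuple *tup)
{
	for (int i = 0; i < key->natts; i++)
	{
		if (i >= tup->natts)
			return 1;		/* tuple attribute is -inf; key is greater */
		if (key->keys[i] != tup->keys[i])
			return key->keys[i] < tup->keys[i] ? -1 : 1;
	}
	if (tup->heaptid == -1)
		return 1;			/* truncated heap TID is -inf */
	if (key->heaptid == tup->heaptid)
		return 0;
	return key->heaptid < tup->heaptid ? -1 : 1;
}

int
main(void)
{
	/* analogue of the README's ('foo', -inf) pivot vs. ('foo', tid) scankey */
	ToyTuple	pivot = {.natts = 1, .keys = {42}, .heaptid = -1};
	ToyTuple	key = {.natts = 1, .keys = {42}, .heaptid = 7};

	printf("%d\n", toy_compare(&key, &pivot));	/* prints 1: descend right */
	return 0;
}
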
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 41aaa1bdd46..11e5725e095 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -61,14 +61,16 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
 				  BTStack stack,
 				  Relation heapRel);
 static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
-static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
+static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
+			   Buffer buf,
+			   Buffer cbuf,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
 			   bool split_only_page);
-static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
-		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
-		  IndexTuple newitem, bool newitemonleft);
+static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
+		  Buffer cbuf, OffsetNumber firstright, OffsetNumber newitemoff,
+		  Size newitemsz, IndexTuple newitem, bool newitemonleft);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 				  BTStack stack, bool is_root, bool is_only);
 static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
@@ -116,6 +118,9 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_key = _bt_mkscankey(rel, itup);
+	/* No scantid until uniqueness established in checkingunique case */
+	if (checkingunique && itup_key->heapkeyspace)
+		itup_key->scantid = NULL;
 
 	/*
 	 * Fill in the BTInsertState working area, to track the current page and
@@ -231,12 +236,13 @@ top:
 	 * NOTE: obviously, _bt_check_unique can only detect keys that are already
 	 * in the index; so it cannot defend against concurrent insertions of the
 	 * same key.  We protect against that by means of holding a write lock on
-	 * the target page.  Any other would-be inserter of the same key must
-	 * acquire a write lock on the same target page, so only one would-be
-	 * inserter can be making the check at one time.  Furthermore, once we are
-	 * past the check we hold write locks continuously until we have performed
-	 * our insertion, so no later inserter can fail to see our insertion.
-	 * (This requires some care in _bt_findinsertloc.)
+	 * the first page the value could be on, regardless of the value of its
+	 * implicit heap TID tiebreaker attribute.  Any other would-be inserter of
+	 * the same key must acquire a write lock on the same page, so only one
+	 * would-be inserter can be making the check at one time.  Furthermore,
+	 * once we are past the check we hold write locks continuously until we
+	 * have performed our insertion, so no later inserter can fail to see our
+	 * insertion.  (This requires some care in _bt_findinsertloc.)
 	 *
 	 * If we must wait for another xact, we release the lock while waiting,
 	 * and then must start over completely.
@@ -274,6 +280,10 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/* Uniqueness is established -- restore heap tid as scantid */
+		if (itup_key->heapkeyspace)
+			itup_key->scantid = &itup->t_tid;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
@@ -282,12 +292,11 @@ top:
 
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
-		 * an index tuple insert conflicts with an existing lock.  Since the
-		 * actual location of the insert is hard to predict because of the
-		 * random search used to prevent O(N^2) performance when there are
-		 * many duplicate entries, we can just use the "first valid" page.
-		 * This reasoning also applies to INCLUDE indexes, whose extra
-		 * attributes are not considered part of the key space.
+		 * an index tuple insert conflicts with an existing lock.  We don't
+		 * know the actual page we're going to insert to yet, because scantid
+		 * was not filled in initially, but it's okay to use the "first valid"
+		 * page instead.  This reasoning also applies to INCLUDE indexes,
+		 * whose extra attributes are not considered part of the key space.
 		 */
 		CheckForSerializableConflictIn(rel, NULL, insertstate.buf);
 
@@ -298,8 +307,8 @@ top:
 		 */
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
-		_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
-					   newitemoff, false);
+		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
+					   itup, newitemoff, false);
 	}
 	else
 	{
@@ -371,6 +380,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 	 * Scan over all equal tuples, looking for live conflicts.
 	 */
 	Assert(!insertstate->bounds_valid || insertstate->low == offset);
+	Assert(itup_key->scantid == NULL);
 	for (;;)
 	{
 		ItemId		curitemid;
@@ -642,27 +652,33 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 /*
  *	_bt_findinsertloc() -- Finds an insert location for a tuple
  *
- *		On entry, insertstate buffer contains the first legal page the new
- *		tuple could be inserted to.  It is exclusive-locked and pinned by the
- *		caller.
+ *		On entry, insertstate buffer contains the page the new tuple
+ *		belongs on.  It is exclusive-locked and pinned by the caller.
  *
- *		If the new key is equal to one or more existing keys, we can
- *		legitimately place it anywhere in the series of equal keys --- in
- *		fact, if the new key is equal to the page's "high key" we can place it
- *		on the next page.  If it is equal to the high key, and there's not
- *		room to insert the new tuple on the current page without splitting,
- *		then we can move right hoping to find more free space and avoid a
- *		split.  Furthermore, if there's not enough room on a page, we try to
- *		make room by removing any LP_DEAD tuples.
+ *		If 'checkingunique' is true, the buffer on entry is the first page
+ *		that contains duplicates of the new key.  If there are duplicates on
+ *		multiple pages, the correct insertion position might be some page to
+ *		the right, rather than the first page.  In that case, this function
+ *		moves right to the correct target page.
+ *
+ *		(In a !heapkeyspace index, there can be multiple pages with the same
+ *		high key that the new tuple could legitimately be placed on.  In
+ *		that case, the caller passes the first page containing duplicates,
+ *		just like when checkingunique=true.  If that page doesn't have enough
+ *		room for the new tuple, this function moves right, trying to find a
+ *		legal page that does.)
  *
  *		On exit, insertstate buffer contains the chosen insertion page, and
- *		the offset within that page is returned.  If _bt_findinsertloc decides
+ *		the offset within that page is returned.  If _bt_findinsertloc needed
  *		to move right, the lock and pin on the original page are released, and
  *		the new buffer is exclusively locked and pinned instead.
  *
  *		If insertstate contains cached binary search bounds, we will take
  *		advantage of them.  This avoids repeating comparisons that we made in
  *		_bt_check_unique() already.
+ *
+ *		If there is not enough room on the page for the new tuple, we try to
+ *		make room by removing any LP_DEAD tuples.
  */
 static OffsetNumber
 _bt_findinsertloc(Relation rel,
@@ -678,92 +694,145 @@ _bt_findinsertloc(Relation rel,
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
-	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itemsz doesn't
-	 * include the ItemId.
-	 *
-	 * NOTE: if you change this, see also the similar code in _bt_buildadd().
-	 */
-	if (insertstate->itemsz > BTMaxItemSize(page))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						insertstate->itemsz, BTMaxItemSize(page),
-						RelationGetRelationName(rel)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(heapRel,
-									RelationGetRelationName(rel))));
+	/* Check 1/3 of a page restriction */
+	if (unlikely(insertstate->itemsz > BTMaxItemSize(page)))
+		_bt_check_third_page(rel, heapRel, itup_key->heapkeyspace, page,
+							 insertstate->itup);
 
-	/*----------
-	 * If we will need to split the page to put the item on this page,
-	 * check whether we can put the tuple somewhere to the right,
-	 * instead.  Keep scanning right until we
-	 *		(a) find a page with enough free space,
-	 *		(b) reach the last page where the tuple can legally go, or
-	 *		(c) get tired of searching.
-	 * (c) is not flippant; it is important because if there are many
-	 * pages' worth of equal keys, it's better to split one of the early
-	 * pages than to scan all the way to the end of the run of equal keys
-	 * on every insert.  We implement "get tired" as a random choice,
-	 * since stopping after scanning a fixed number of pages wouldn't work
-	 * well (we'd never reach the right-hand side of previously split
-	 * pages).  Currently the probability of moving right is set at 0.99,
-	 * which may seem too high to change the behavior much, but it does an
-	 * excellent job of preventing O(N^2) behavior with many equal keys.
-	 *----------
-	 */
 	Assert(P_ISLEAF(lpageop) && !P_INCOMPLETE_SPLIT(lpageop));
 	Assert(!insertstate->bounds_valid || checkingunique);
-	while (PageGetFreeSpace(page) < insertstate->itemsz)
+	Assert(!itup_key->heapkeyspace || itup_key->scantid != NULL);
+	Assert(itup_key->heapkeyspace || itup_key->scantid == NULL);
+
+	if (itup_key->heapkeyspace)
 	{
 		/*
-		 * before considering moving right, see if we can obtain enough space
-		 * by erasing LP_DEAD items
+		 * If we're inserting into a unique index, we may have to walk right
+		 * through leaf pages to find the one leaf page that we must insert
+		 * onto.  This is necessary, because when we're checking for duplicates,
+		 * a scantid is not used when we call _bt_search().  It is only set
+		 * after _bt_check_unique() has checked for duplicates, so
+		 * insertstate->buf will point to the page containing the first
+		 * duplicate key, rather than the exact page that this tuple belongs
+		 * to.
 		 */
-		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
+		if (checkingunique)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			for (;;)
+			{
+				/*
+				 * Does the new tuple belong on this page?
+				 *
+				 * The earlier _bt_check_unique() call may well have
+				 * established an upper bound on the offset where the new item
+				 * must go.  If it's not the last item on the page, i.e. if
+				 * there is at least one tuple on the page that's greater than
+				 * the tuple we're inserting, then we know that the tuple
+				 * belongs on this page, and we can skip the high key check.
+				 */
+				if (insertstate->bounds_valid &&
+					insertstate->low <= insertstate->stricthigh &&
+					insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+					break;
+
+				if (P_RIGHTMOST(lpageop) ||
+					_bt_compare(rel, itup_key, page, P_HIKEY) != 0)
+					break;
 
-			if (PageGetFreeSpace(page) >= insertstate->itemsz)
-				break;			/* OK, now we have enough space */
+				_bt_stepright(rel, insertstate, stack);
+				page = BufferGetPage(insertstate->buf);
+				lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+			}
 		}
 
 		/*
-		 * nope, so check conditions (b) and (c) enumerated above
+		 * If the target page is full, see if we can obtain enough space by
+		 * erasing LP_DEAD items.
+		 */
+		if (PageGetFreeSpace(page) < insertstate->itemsz &&
+			P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
+		{
+			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+			insertstate->bounds_valid = false;
+		}
+	}
+	else
+	{
+		/*----------
+		 * This is a !heapkeyspace (version 3) index.  The current page is
+		 * the first legitimate page that we could insert the new tuple to.
+		 *
+		 * If the new key is equal to one or more existing keys, we can
+		 * legitimately place it anywhere in the series of equal keys.  In
+		 * fact, if the new key is equal to the page's "high key" we can place
+		 * it on the next page.  If it is equal to the high key, and there's
+		 * not room to insert the new tuple on the current page without
+		 * splitting, then we move right hoping to find more free space and
+		 * avoid a split.
 		 *
-		 * An earlier _bt_check_unique() call may well have established bounds
-		 * that we can use to skip the high key check for checkingunique
-		 * callers.  This fastpath cannot be used when there are no items on
-		 * the existing page (other than high key), or when it looks like the
-		 * new item belongs last on the page, but it might go on a later page
-		 * instead.
+		 * Keep scanning right until we
+		 *		(a) find a page with enough free space,
+		 *		(b) reach the last page where the tuple can legally go, or
+		 *		(c) get tired of searching.
+		 * (c) is not flippant; it is important because if there are many
+		 * pages' worth of equal keys, it's better to split one of the early
+		 * pages than to scan all the way to the end of the run of equal keys
+		 * on every insert.  We implement "get tired" as a random choice,
+		 * since stopping after scanning a fixed number of pages wouldn't work
+		 * well (we'd never reach the right-hand side of previously split
+		 * pages).  Currently the probability of moving right is set at 0.99,
+		 * which may seem too high to change the behavior much, but it does an
+		 * excellent job of preventing O(N^2) behavior with many equal keys.
+		 *----------
 		 */
-		if (insertstate->bounds_valid &&
-			insertstate->low <= insertstate->stricthigh &&
-			insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
-			break;
+		while (PageGetFreeSpace(page) < insertstate->itemsz)
+		{
+			/*
+			 * Before considering moving right, see if we can obtain enough
+			 * space by erasing LP_DEAD items
+			 */
+			if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
 
-		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
-			break;
+				if (PageGetFreeSpace(page) >= insertstate->itemsz)
+					break;		/* OK, now we have enough space */
+			}
 
-		_bt_stepright(rel, insertstate, stack);
-		page = BufferGetPage(insertstate->buf);
-		lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+			/*
+			 * Nope, so check conditions (b) and (c) enumerated above
+			 *
+			 * An earlier _bt_check_unique() call may well have established an
+			 * upper bound on the offset where the new item must go.  If
+			 * it's not the last item on the page, i.e. if there is at least
+			 * one tuple on the page that's greater than the tuple we're
+			 * inserting, then we know that the tuple belongs on this page,
+			 * and we can skip the high key check.
+			 */
+			if (insertstate->bounds_valid &&
+				insertstate->low <= insertstate->stricthigh &&
+				insertstate->stricthigh <= PageGetMaxOffsetNumber(page))
+				break;
+
+			if (P_RIGHTMOST(lpageop) ||
+				_bt_compare(rel, itup_key, page, P_HIKEY) != 0 ||
+				random() <= (MAX_RANDOM_VALUE / 100))
+				break;
+
+			_bt_stepright(rel, insertstate, stack);
+			page = BufferGetPage(insertstate->buf);
+			lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+		}
 	}
 
-	/* Loop should not break until correct page located */
+	/*
+	 * We should now be on the correct page.  Find the offset within the page
+	 * for the new tuple. (Possibly reusing earlier search bounds.)
+	 */
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	/* Find new item offset, possibly reusing earlier search bounds */
 	newitemoff = _bt_binsrch_insert(rel, insertstate);
 
 	return newitemoff;
@@ -832,8 +901,9 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		This recursive procedure does the following things:
  *
- *			+  if necessary, splits the target page (making sure that the
- *			   split is equitable as far as post-insert free space goes).
+ *			+  if necessary, splits the target page, using 'itup_key' for
+ *			   suffix truncation on leaf pages (caller passes NULL for
+ *			   non-leaf pages).
  *			+  inserts the tuple.
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
@@ -859,6 +929,7 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  */
 static void
 _bt_insertonpg(Relation rel,
+			   BTScanInsert itup_key,
 			   Buffer buf,
 			   Buffer cbuf,
 			   BTStack stack,
@@ -881,7 +952,7 @@ _bt_insertonpg(Relation rel,
 		   BTreeTupleGetNAtts(itup, rel) ==
 		   IndexRelationGetNumberOfAttributes(rel));
 	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
+		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/* The caller should've finished any incomplete splits already. */
@@ -931,8 +1002,8 @@ _bt_insertonpg(Relation rel,
 									  &newitemonleft);
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, buf, cbuf, firstright,
-						 newitemoff, itemsz, itup, newitemonleft);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, firstright, newitemoff,
+						 itemsz, itup, newitemonleft);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1016,7 +1087,7 @@ _bt_insertonpg(Relation rel,
 		if (BufferIsValid(metabuf))
 		{
 			/* upgrade meta-page if needed */
-			if (metad->btm_version < BTREE_VERSION)
+			if (metad->btm_version < BTREE_NOVAC_VERSION)
 				_bt_upgrademetapage(metapg);
 			metad->btm_fastroot = itup_blkno;
 			metad->btm_fastlevel = lpageop->btpo.level;
@@ -1071,6 +1142,8 @@ _bt_insertonpg(Relation rel,
 
 			if (BufferIsValid(metabuf))
 			{
+				Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+				xlmeta.version = metad->btm_version;
 				xlmeta.root = metad->btm_root;
 				xlmeta.level = metad->btm_level;
 				xlmeta.fastroot = metad->btm_fastroot;
@@ -1138,17 +1211,19 @@ _bt_insertonpg(Relation rel,
  *		new right page.  newitemoff etc. tell us about the new item that
  *		must be inserted along with the data from the old page.
  *
- *		When splitting a non-leaf page, 'cbuf' is the left-sibling of the
- *		page we're inserting the downlink for.  This function will clear the
- *		INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *		itup_key is used for suffix truncation on leaf pages (internal
+ *		page callers pass NULL).  When splitting a non-leaf page, 'cbuf'
+ *		is the left-sibling of the page we're inserting the downlink for.
+ *		This function will clear the INCOMPLETE_SPLIT flag on it, and
+ *		release the buffer.
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
  */
 static Buffer
-_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
-		  bool newitemonleft)
+_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
+		  OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
+		  IndexTuple newitem, bool newitemonleft)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1242,7 +1317,8 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <= indnkeyatts);
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1256,8 +1332,29 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 
 	/*
 	 * The "high key" for the new left page will be the first key that's going
-	 * to go into the new right page.  This might be either the existing data
-	 * item at position firstright, or the incoming tuple.
+	 * to go into the new right page, or possibly a truncated version if this
+	 * is a leaf page split.  This might be either the existing data item at
+	 * position firstright, or the incoming tuple.
+	 *
+	 * The high key for the left page is formed using the first item on the
+	 * right page, which may seem to be contrary to Lehman & Yao's approach of
+	 * using the left page's last item as its new high key when splitting on
+	 * the leaf level.  It isn't, though: suffix truncation will leave the
+	 * left page's high key fully equal to the last item on the left page when
+	 * two tuples with equal key values (excluding heap TID) enclose the split
+	 * point.  It isn't actually necessary for a new leaf high key to be equal
+	 * to the last item on the left for the L&Y "subtree" invariant to hold.
+	 * It's sufficient to make sure that the new leaf high key is strictly
+	 * less than the first item on the right leaf page, and greater than or
+	 * equal to (not necessarily equal to) the last item on the left leaf
+	 * page.
+	 *
+	 * In other words, when suffix truncation isn't possible, L&Y's exact
+	 * approach to leaf splits is taken.  (Actually, even that is slightly
+	 * inaccurate.  A tuple with all the keys from firstright but the heap TID
+	 * from lastleft will be used as the new high key, since the last left
+	 * tuple could be physically larger despite being opclass-equal in respect
+	 * of all attributes prior to the heap TID attribute.)
 	 */
 	leftoff = P_HIKEY;
 	if (!newitemonleft && newitemoff == firstright)
@@ -1275,25 +1372,48 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
+	 * Truncate unneeded key and non-key attributes of the high key item
+	 * before inserting it on the left page.  This can only happen at the leaf
 	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * level high keys.  A pivot tuple in a grandparent page must guide a
+	 * search not only to the correct parent page, but also to the correct
+	 * leaf page.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (isleaf && (itup_key->heapkeyspace || indnatts != indnkeyatts))
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		IndexTuple	lastleft;
+
+		/*
+		 * Determine which tuple will become the last on the left page.  This
+		 * is needed to decide how many attributes from the first item on the
+		 * right page must remain in new high key for left page.
+		 */
+		if (newitemonleft && newitemoff == firstright)
+		{
+			/* incoming tuple will become last on left page */
+			lastleft = newitem;
+		}
+		else
+		{
+			OffsetNumber lastleftoff;
+
+			/* item just before firstright will become last on left page */
+			lastleftoff = OffsetNumberPrev(firstright);
+			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
+			itemid = PageGetItemId(origpage, lastleftoff);
+			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+		}
+
+		Assert(lastleft != item);
+		lefthikey = _bt_truncate(rel, lastleft, item, itup_key);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <= indnkeyatts);
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1486,7 +1606,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		xl_btree_split xlrec;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		bool		loglhikey = false;
 
 		xlrec.level = ropaque->btpo.level;
 		xlrec.firstright = firstright;
@@ -1515,22 +1634,10 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log the left page's new high key */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -1546,9 +1653,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ?
-			(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
-			(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);
+		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
 		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
 		PageSetLSN(origpage, recptr);
@@ -1911,7 +2016,7 @@ _bt_insert_parent(Relation rel,
 			_bt_relbuf(rel, pbuf);
 		}
 
-		/* get high key from left page == lower bound for new right page */
+		/* get high key from left, a strict lower bound for new right page */
 		ritem = (IndexTuple) PageGetItem(page,
 										 PageGetItemId(page, P_HIKEY));
 
@@ -1941,7 +2046,7 @@ _bt_insert_parent(Relation rel,
 				 RelationGetRelationName(rel), bknum, rbknum);
 
 		/* Recursively update the parent */
-		_bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
+		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
 					   new_item, stack->bts_offset + 1,
 					   is_only);
 
@@ -2202,7 +2307,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	START_CRIT_SECTION();
 
 	/* upgrade metapage if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* set btree special data */
@@ -2237,7 +2342,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2270,6 +2376,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 		XLogRegisterBuffer(1, lbuf, REGBUF_STANDARD);
 		XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = rootblknum;
 		md.level = metad->btm_level;
 		md.fastroot = rootblknum;
@@ -2334,6 +2442,7 @@ _bt_pgaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -2349,8 +2458,8 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
@@ -2363,6 +2472,7 @@ _bt_isequal(TupleDesc itupdesc, BTScanInsert itup_key, Page page,
 	/* Better be comparing to a non-pivot item */
 	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
 	Assert(offnum >= P_FIRSTDATAKEY((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(itup_key->scantid == NULL);
 
 	scankey = itup_key->scankeys;
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
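
As an aside on the split changes above: _bt_truncate() is handed the last
tuple that will end up on the left page and the first tuple that will end up
on the right page, and only needs to keep enough leading attributes to tell
them apart (falling back to retaining the heap TID when every key attribute is
equal).  A rough standalone sketch of that attribute-counting decision, with
invented names and plain integers standing in for real index attributes:

/*
 * How many leading key attributes from firstright must the new high key
 * keep so that it still separates lastleft from firstright?  Returning
 * nkeyatts + 1 stands for the case where every key attribute is equal and
 * the heap TID tiebreaker has to be kept as well.  (Sketch only; the real
 * code works on IndexTuples and per-attribute comparison functions.)
 */
static int
keep_natts_sketch(const int *lastleft, const int *firstright, int nkeyatts)
{
	int			keepnatts = 1;

	for (int attnum = 1; attnum <= nkeyatts; attnum++)
	{
		if (lastleft[attnum - 1] != firstright[attnum - 1])
			break;				/* this attribute already distinguishes them */
		keepnatts++;
	}

	return keepnatts;
}

For example, with lastleft = {1, 7, 3} and firstright = {1, 7, 9} the sketch
keeps three attributes; with two identical three-attribute tuples it returns
4, i.e. the heap TID must be retained in the new high key.
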
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 56041c3d383..e046a0570bf 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -34,6 +34,7 @@
 #include "utils/snapmgr.h"
 
 static void _bt_cachemetadata(Relation rel, BTMetaPageData *metad);
+static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
 static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 						 bool *rightsib_empty);
@@ -77,7 +78,9 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
 }
 
 /*
- *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to the new.
+ *	_bt_upgrademetapage() -- Upgrade a meta-page from an old format to version
+ *		3, the last version that can be updated without broadly affecting
+ *		on-disk compatibility.  (A REINDEX is required to upgrade to v4.)
  *
  *		This routine does purely in-memory image upgrade.  Caller is
  *		responsible for locking, WAL-logging etc.
@@ -93,11 +96,11 @@ _bt_upgrademetapage(Page page)
 
 	/* It must be really a meta page of upgradable version */
 	Assert(metaopaque->btpo_flags & BTP_META);
-	Assert(metad->btm_version < BTREE_VERSION);
+	Assert(metad->btm_version < BTREE_NOVAC_VERSION);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
 
 	/* Set version number and fill extra fields added into version 3 */
-	metad->btm_version = BTREE_VERSION;
+	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 
@@ -107,43 +110,79 @@ _bt_upgrademetapage(Page page)
 }
 
 /*
- * Cache metadata from meta page to rel->rd_amcache.
+ * Cache metadata from input meta page to rel->rd_amcache.
  */
 static void
-_bt_cachemetadata(Relation rel, BTMetaPageData *metad)
+_bt_cachemetadata(Relation rel, BTMetaPageData *input)
 {
+	BTMetaPageData *cached_metad;
+
 	/* We assume rel->rd_amcache was already freed by caller */
 	Assert(rel->rd_amcache == NULL);
 	rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
 										 sizeof(BTMetaPageData));
 
-	/*
-	 * Meta page should be of supported version (should be already checked by
-	 * caller).
-	 */
-	Assert(metad->btm_version >= BTREE_MIN_VERSION &&
-		   metad->btm_version <= BTREE_VERSION);
+	/* Meta page should be of supported version */
+	Assert(input->btm_version >= BTREE_MIN_VERSION &&
+		   input->btm_version <= BTREE_VERSION);
 
-	if (metad->btm_version == BTREE_VERSION)
+	cached_metad = (BTMetaPageData *) rel->rd_amcache;
+	if (input->btm_version >= BTREE_NOVAC_VERSION)
 	{
-		/* Last version of meta-data, no need to upgrade */
-		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		/* Version with compatible meta-data, no need to upgrade */
+		memcpy(cached_metad, input, sizeof(BTMetaPageData));
 	}
 	else
 	{
-		BTMetaPageData *cached_metad = (BTMetaPageData *) rel->rd_amcache;
-
 		/*
 		 * Upgrade meta-data: copy available information from meta-page and
 		 * fill new fields with default values.
+		 *
+		 * Note that we cannot upgrade to version 4+ without a REINDEX, since
+		 * extensive on-disk changes are required.
 		 */
-		memcpy(rel->rd_amcache, metad, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
-		cached_metad->btm_version = BTREE_VERSION;
+		memcpy(cached_metad, input, offsetof(BTMetaPageData, btm_oldest_btpo_xact));
+		cached_metad->btm_version = BTREE_NOVAC_VERSION;
 		cached_metad->btm_oldest_btpo_xact = InvalidTransactionId;
 		cached_metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	}
 }
 
+/*
+ * Get metadata from share-locked buffer containing metapage, while performing
+ * standard sanity checks.  Sanity checks here must match _bt_getroot().
+ */
+static BTMetaPageData *
+_bt_getmeta(Relation rel, Buffer metabuf)
+{
+	Page		metapg;
+	BTPageOpaque metaopaque;
+	BTMetaPageData *metad;
+
+	metapg = BufferGetPage(metabuf);
+	metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
+	metad = BTPageGetMeta(metapg);
+
+	/* sanity-check the metapage */
+	if (!P_ISMETA(metaopaque) ||
+		metad->btm_magic != BTREE_MAGIC)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("index \"%s\" is not a btree",
+						RelationGetRelationName(rel))));
+
+	if (metad->btm_version < BTREE_MIN_VERSION ||
+		metad->btm_version > BTREE_VERSION)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("version mismatch in index \"%s\": file version %d, "
+						"current version %d, minimal supported version %d",
+						RelationGetRelationName(rel),
+						metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+
+	return metad;
+}
+
 /*
  *	_bt_update_meta_cleanup_info() -- Update cleanup-related information in
  *									  the metapage.
@@ -167,7 +206,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	metad = BTPageGetMeta(metapg);
 
 	/* outdated version of metapage always needs rewrite */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		needsRewrite = true;
 	else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
 			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
@@ -186,7 +225,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	START_CRIT_SECTION();
 
 	/* upgrade meta-page if needed */
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		_bt_upgrademetapage(metapg);
 
 	/* update cleanup-related information */
@@ -202,6 +241,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		md.version = metad->btm_version;
 		md.root = metad->btm_root;
 		md.level = metad->btm_level;
 		md.fastroot = metad->btm_fastroot;
@@ -376,7 +417,7 @@ _bt_getroot(Relation rel, int access)
 		START_CRIT_SECTION();
 
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 
 		metad->btm_root = rootblkno;
@@ -400,6 +441,8 @@ _bt_getroot(Relation rel, int access)
 			XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
 			XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			md.version = metad->btm_version;
 			md.root = rootblkno;
 			md.level = 0;
 			md.fastroot = rootblkno;
@@ -595,37 +638,12 @@ _bt_getrootheight(Relation rel)
 {
 	BTMetaPageData *metad;
 
-	/*
-	 * We can get what we need from the cached metapage data.  If it's not
-	 * cached yet, load it.  Sanity checks here must match _bt_getroot().
-	 */
 	if (rel->rd_amcache == NULL)
 	{
 		Buffer		metabuf;
-		Page		metapg;
-		BTPageOpaque metaopaque;
 
 		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
-		metapg = BufferGetPage(metabuf);
-		metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
-		metad = BTPageGetMeta(metapg);
-
-		/* sanity-check the metapage */
-		if (!P_ISMETA(metaopaque) ||
-			metad->btm_magic != BTREE_MAGIC)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("index \"%s\" is not a btree",
-							RelationGetRelationName(rel))));
-
-		if (metad->btm_version < BTREE_MIN_VERSION ||
-			metad->btm_version > BTREE_VERSION)
-			ereport(ERROR,
-					(errcode(ERRCODE_INDEX_CORRUPTED),
-					 errmsg("version mismatch in index \"%s\": file version %d, "
-							"current version %d, minimal supported version %d",
-							RelationGetRelationName(rel),
-							metad->btm_version, BTREE_VERSION, BTREE_MIN_VERSION)));
+		metad = _bt_getmeta(rel, metabuf);
 
 		/*
 		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
@@ -642,19 +660,70 @@ _bt_getrootheight(Relation rel)
 		 * Cache the metapage data for next time
 		 */
 		_bt_cachemetadata(rel, metad);
-
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
 		_bt_relbuf(rel, metabuf);
 	}
 
+	/* Get cached page */
 	metad = (BTMetaPageData *) rel->rd_amcache;
-	/* We shouldn't have cached it if any of these fail */
-	Assert(metad->btm_magic == BTREE_MAGIC);
-	Assert(metad->btm_version == BTREE_VERSION);
-	Assert(metad->btm_fastroot != P_NONE);
 
 	return metad->btm_fastlevel;
 }
 
+/*
+ *	_bt_heapkeyspace() -- is heap TID being treated as a key?
+ *
+ *		This is used to determine the rules that must be used to descend a
+ *		btree.  Version 4 indexes treat heap TID as a tiebreaker attribute.
+ *		pg_upgrade'd version 3 indexes need extra steps to preserve reasonable
+ *		performance when inserting a new BTScanInsert-wise duplicate tuple
+ *		among many leaf pages already full of such duplicates.
+ */
+bool
+_bt_heapkeyspace(Relation rel)
+{
+	BTMetaPageData *metad;
+
+	if (rel->rd_amcache == NULL)
+	{
+		Buffer		metabuf;
+
+		metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+		metad = _bt_getmeta(rel, metabuf);
+
+		/*
+		 * If there's no root page yet, _bt_getroot() doesn't expect a cache
+		 * to be made, so just stop here.  (XXX perhaps _bt_getroot() should
+		 * be changed to allow this case.)
+		 */
+		if (metad->btm_root == P_NONE)
+		{
+			uint32		btm_version = metad->btm_version;
+
+			_bt_relbuf(rel, metabuf);
+			return btm_version > BTREE_NOVAC_VERSION;
+		}
+
+		/*
+		 * Cache the metapage data for next time
+		 */
+		_bt_cachemetadata(rel, metad);
+		/* We shouldn't have cached it if any of these fail */
+		Assert(metad->btm_magic == BTREE_MAGIC);
+		Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+		Assert(metad->btm_fastroot != P_NONE);
+		_bt_relbuf(rel, metabuf);
+	}
+
+	/* Get cached page */
+	metad = (BTMetaPageData *) rel->rd_amcache;
+
+	return metad->btm_version > BTREE_NOVAC_VERSION;
+}
+
 /*
  *	_bt_checkpage() -- Verify that a freshly-read page looks sane.
  */
@@ -1123,11 +1192,12 @@ _bt_is_page_halfdead(Relation rel, BlockNumber blk)
  * right sibling.
  *
  * "child" is the leaf page we wish to delete, and "stack" is a search stack
- * leading to it (approximately).  Note that we will update the stack
- * entry(s) to reflect current downlink positions --- this is essentially the
- * same as the corresponding step of splitting, and is not expected to affect
- * caller.  The caller should initialize *target and *rightsib to the leaf
- * page and its right sibling.
+ * leading to it (it actually leads to the leftmost leaf page with a high key
+ * matching that of the page to be deleted in !heapkeyspace indexes).  Note
+ * that we will update the stack entry(s) to reflect current downlink
+ * positions --- this is essentially the same as the corresponding step of
+ * splitting, and is not expected to affect caller.  The caller should
+ * initialize *target and *rightsib to the leaf page and its right sibling.
  *
  * Note: it's OK to release page locks on any internal pages between the leaf
  * and *topparent, because a safe deletion can't become unsafe due to
@@ -1149,8 +1219,10 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
 	BlockNumber leftsib;
 
 	/*
-	 * Locate the downlink of "child" in the parent (updating the stack entry
-	 * if needed)
+	 * Locate the downlink of "child" in the parent, updating the stack entry
+	 * if needed.  This is how !heapkeyspace indexes deal with having
+	 * non-unique high keys in leaf level pages.  Even heapkeyspace indexes
+	 * can have a stale stack due to insertions into the parent.
 	 */
 	stack->bts_btentry = child;
 	pbuf = _bt_getstackbuf(rel, stack);
@@ -1362,9 +1434,10 @@ _bt_pagedel(Relation rel, Buffer buf)
 		{
 			/*
 			 * We need an approximate pointer to the page's parent page.  We
-			 * use the standard search mechanism to search for the page's high
-			 * key; this will give us a link to either the current parent or
-			 * someplace to its left (if there are multiple equal high keys).
+			 * use a variant of the standard search mechanism to search for
+			 * the page's high key; this will give us a link to either the
+			 * current parent or someplace to its left (if there are multiple
+			 * equal high keys, which is possible with !heapkeyspace indexes).
 			 *
 			 * Also check if this is the right-half of an incomplete split
 			 * (see comment above).
@@ -1422,7 +1495,8 @@ _bt_pagedel(Relation rel, Buffer buf)
 
 				/* we need an insertion scan key for the search, so build one */
 				itup_key = _bt_mkscankey(rel, targetkey);
-				/* get stack to leaf page by searching index */
+				/* find the leftmost leaf page with matching pivot/high key */
+				itup_key->pivotsearch = true;
 				stack = _bt_search(rel, itup_key, &lbuf, BT_READ, NULL);
 				/* don't need a lock or second pin on the page */
 				_bt_relbuf(rel, lbuf);
@@ -1969,7 +2043,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 	if (BufferIsValid(metabuf))
 	{
 		/* upgrade metapage if needed */
-		if (metad->btm_version < BTREE_VERSION)
+		if (metad->btm_version < BTREE_NOVAC_VERSION)
 			_bt_upgrademetapage(metapg);
 		metad->btm_fastroot = rightsib;
 		metad->btm_fastlevel = targetlevel;
@@ -2017,6 +2091,8 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
 		{
 			XLogRegisterBuffer(4, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
 
+			Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+			xlmeta.version = metad->btm_version;
 			xlmeta.root = metad->btm_root;
 			xlmeta.level = metad->btm_level;
 			xlmeta.fastroot = metad->btm_fastroot;
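
The metapage changes throughout this file come down to one distinction:
versions at or below the "no vacuum fields" version can still be upgraded in
place, while the new key space rules only apply above it.  A condensed sketch
of the check that _bt_heapkeyspace() performs once the metapage is cached (the
constant values shown are my reading of the patch, not authoritative):

#include <stdbool.h>

#define BTREE_MIN_VERSION	2	/* oldest on-disk version still readable */
#define BTREE_NOVAC_VERSION	3	/* last version upgradable without REINDEX */
#define BTREE_VERSION		4	/* heap TID participates in the key space */

/* Sketch of the decision made from the cached metapage version */
static bool
heapkeyspace_from_version(unsigned int btm_version)
{
	return btm_version > BTREE_NOVAC_VERSION;
}
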
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 60e0b90ccf2..ac6f1eb3423 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -794,7 +794,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
 
-	if (metad->btm_version < BTREE_VERSION)
+	if (metad->btm_version < BTREE_NOVAC_VERSION)
 	{
 		/*
 		 * Do cleanup if metapage needs upgrade, because we don't have
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 0305469ad0a..f58774da826 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -152,8 +152,12 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
 		 * downlink (block) to uniquely identify the index entry, in case it
 		 * moves right while we're working lower in the tree.  See the paper
 		 * by Lehman and Yao for how this is detected and handled. (We use the
-		 * child link to disambiguate duplicate keys in the index -- Lehman
-		 * and Yao disallow duplicate keys.)
+		 * child link during the second half of a page split -- if caller ends
+		 * up splitting the child it usually ends up inserting a new pivot
+		 * tuple for child's new right sibling immediately after the original
+		 * bts_offset offset recorded here.  The downlink block will be needed
+		 * to check if bts_offset remains the position of this same pivot
+		 * tuple.)
 		 */
 		new_stack = (BTStack) palloc(sizeof(BTStackData));
 		new_stack->bts_blkno = par_blkno;
@@ -251,11 +255,13 @@ _bt_moveright(Relation rel,
 	/*
 	 * When nextkey = false (normal case): if the scan key that brought us to
 	 * this page is > the high key stored on the page, then the page has split
-	 * and we need to move right.  (If the scan key is equal to the high key,
-	 * we might or might not need to move right; have to scan the page first
-	 * anyway.)
+	 * and we need to move right.  (pg_upgrade'd !heapkeyspace indexes could
+	 * have some duplicates to the right as well as the left, but that's
+	 * something that's only ever dealt with on the leaf level, after
+	 * _bt_search has found an initial leaf page.)
 	 *
 	 * When nextkey = true: move right if the scan key is >= page's high key.
+	 * (Note that key.scantid cannot be set in this case.)
 	 *
 	 * The page could even have split more than once, so scan as far as
 	 * needed.
@@ -347,6 +353,9 @@ _bt_binsrch(Relation rel,
 	int32		result,
 				cmpval;
 
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -554,10 +563,14 @@ _bt_compare(Relation rel,
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	ItemPointer heapTid;
 	ScanKey		scankey;
+	int			ncmpkey;
+	int			ntupatts;
 
-	Assert(_bt_check_natts(rel, page, offnum));
+	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+	Assert(key->heapkeyspace || key->scantid == NULL);
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
@@ -567,6 +580,7 @@ _bt_compare(Relation rel,
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -580,8 +594,10 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
+	ncmpkey = Min(ntupatts, key->keysz);
+	Assert(key->heapkeyspace || ncmpkey == key->keysz);
 	scankey = key->scankeys;
-	for (int i = 1; i <= key->keysz; i++)
+	for (int i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -632,8 +648,77 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * All non-truncated attributes (other than heap TID) were found to be
+	 * equal.  Treat truncated attributes as minus infinity when scankey has a
+	 * key attribute value that would otherwise be compared directly.
+	 *
+	 * Note: it doesn't matter if ntupatts includes non-key attributes;
+	 * scankey won't, so explicitly excluding non-key attributes isn't
+	 * necessary.
+	 */
+	if (key->keysz > ntupatts)
+		return 1;
+
+	/*
+	 * Use the heap TID attribute and scantid to try to break the tie.  The
+	 * rules are the same as for any other key attribute -- only the
+	 * representation differs.
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (key->scantid == NULL)
+	{
+		/*
+		 * Most searches have a scankey that is considered greater than a
+		 * truncated pivot tuple if and when the scankey has equal values for
+		 * attributes up to and including the least significant untruncated
+		 * attribute in the tuple.
+		 *
+		 * For example, if an index has the minimum two attributes (single
+		 * user key attribute, plus heap TID attribute), and a page's high key
+		 * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
+		 * will not descend to the page to the left.  The search will descend
+		 * right instead.  The truncated attribute in the pivot tuple means that
+		 * all non-pivot tuples on the page to the left are strictly < 'foo',
+		 * so it isn't necessary to descend left.  In other words, search
+		 * doesn't have to descend left because it isn't interested in a match
+		 * that has a heap TID value of -inf.
+		 *
+		 * However, some searches (pivotsearch searches) actually require that
+		 * we descend left when this happens.  -inf is treated as a possible
+		 * match for omitted scankey attribute(s).  This is needed by page
+		 * deletion, which must re-find leaf pages that are targets for
+		 * deletion using their high keys.
+		 *
+		 * Note: the heap TID part of the test ensures that scankey is being
+		 * compared to a pivot tuple with one or more truncated key
+		 * attributes.
+		 *
+		 * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+		 * left here, since they have no heap TID attribute (and cannot have
+		 * any -inf key values in any case, since truncation can only remove
+		 * non-key attributes).  !heapkeyspace searches must always be
+		 * prepared to deal with matches on both sides of the pivot once the
+		 * leaf level is reached.
+		 */
+		if (key->heapkeyspace && !key->pivotsearch &&
+			key->keysz == ntupatts && heapTid == NULL)
+			return 1;
+
+		/* All provided scankey arguments found to be equal */
+		return 0;
+	}
+
+	/*
+	 * Treat truncated heap TID as minus infinity, since scankey has a key
+	 * attribute value (scantid) that would otherwise be compared directly
+	 */
+	Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+	if (heapTid == NULL)
+		return 1;
+
+	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+	return ItemPointerCompare(key->scantid, heapTid);
 }
 
 /*
@@ -1148,7 +1233,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	}
 
 	/* Initialize remaining insertion scan key fields */
+	inskey.heapkeyspace = _bt_heapkeyspace(rel);
 	inskey.nextkey = nextkey;
+	inskey.pivotsearch = false;
+	inskey.scantid = NULL;
 	inskey.keysz = keysCount;
 
 	/*
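The new comparison rules may be easier to follow as a tiny standalone model
(not from the patch -- the structs below are made-up stand-ins for
IndexTuple/ItemPointer, and the pivotsearch/NULL-scantid nuance described in
the comment above is deliberately left out):

#include <stdio.h>

typedef struct
{
	unsigned	block;
	unsigned	offset;
}			tid_t;

typedef struct
{
	int			natts;			/* untruncated key attributes */
	const int  *keys;			/* their values */
	const tid_t *heaptid;		/* NULL when truncated away/omitted */
}			tuple_key_t;

static int
tid_cmp(const tid_t *a, const tid_t *b)
{
	if (a->block != b->block)
		return a->block < b->block ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* scankey vs. possibly-truncated tuple, as in the new _bt_compare() */
static int
model_compare(const tuple_key_t *scankey, const tuple_key_t *tuple)
{
	int			ncmpkey;

	ncmpkey = (scankey->natts < tuple->natts) ? scankey->natts : tuple->natts;
	for (int i = 0; i < ncmpkey; i++)
	{
		if (scankey->keys[i] != tuple->keys[i])
			return scankey->keys[i] < tuple->keys[i] ? -1 : 1;
	}
	/* truncated attributes compare as "minus infinity" */
	if (scankey->natts > tuple->natts)
		return 1;
	if (scankey->heaptid == NULL)
		return 0;				/* no scantid: all provided keys equal */
	if (tuple->heaptid == NULL)
		return 1;				/* truncated heap TID is -inf too */
	return tid_cmp(scankey->heaptid, tuple->heaptid);
}

int
main(void)
{
	int			keys[] = {42};
	tid_t		scantid = {10, 3};
	tuple_key_t scankey = {1, keys, &scantid};
	tuple_key_t pivot = {1, keys, NULL};	/* a (42, -inf) style high key */

	printf("%d\n", model_compare(&scankey, &pivot));	/* prints 1 */
	return 0;
}

In other words, an insertion scankey (scantid set) always lands to the right
of a pivot whose heap TID was truncated away, which is what makes the
truncated value behave like minus infinity.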
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index a0e2e70cefc..2762a2d5485 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -755,6 +755,7 @@ _bt_sortaddtup(Page page,
 	{
 		trunctuple = *itup;
 		trunctuple.t_info = sizeof(IndexTupleData);
+		/* Deliberately zero INDEX_ALT_TID_MASK bits */
 		BTreeTupleSetNAtts(&trunctuple, 0);
 		itup = &trunctuple;
 		itemsize = sizeof(IndexTupleData);
@@ -808,8 +809,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -826,27 +825,21 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	itupsz = MAXALIGN(itupsz);
 
 	/*
-	 * Check whether the item can fit on a btree page at all. (Eventually, we
-	 * ought to try to apply TOAST methods if not.) We actually need to be
-	 * able to fit three items on every page, so restrict any one item to 1/3
-	 * the per-page available space. Note that at this point, itupsz doesn't
-	 * include the ItemId.
+	 * Check whether the item can fit on a btree page at all.
 	 *
-	 * NOTE: similar code appears in _bt_insertonpg() to defend against
-	 * oversize items being inserted into an already-existing index. But
-	 * during creation of an index, we don't go through there.
+	 * Every newly built index will treat heap TID as part of the keyspace,
+	 * which imposes the requirement that new high keys must occasionally have
+	 * a heap TID appended within _bt_truncate().  That may leave a new pivot
+	 * tuple one or two MAXALIGN() quantums larger than the original first
+	 * right tuple it's derived from.  v4 deals with the problem by decreasing
+	 * the limit on the size of tuples inserted on the leaf level by the same
+	 * small amount.  Enforce the new v4+ limit on the leaf level, and the old
+	 * limit on internal levels, since pivot tuples may need to make use of
+	 * the reserved space.  This should never fail on internal pages.
 	 */
-	if (itupsz > BTMaxItemSize(npage))
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
-						itupsz, BTMaxItemSize(npage),
-						RelationGetRelationName(wstate->index)),
-				 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
-						 "Consider a function index of an MD5 hash of the value, "
-						 "or use full text indexing."),
-				 errtableconstraint(wstate->heap,
-									RelationGetRelationName(wstate->index))));
+	if (unlikely(itupsz > BTMaxItemSize(npage)))
+		_bt_check_third_page(wstate->index, wstate->heap,
+							 state->btps_level == 0, npage, itup);
 
 	/*
 	 * Check to see if page is "full".  It's definitely full if the item won't
@@ -892,24 +885,35 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
+			IndexTuple	lastleft;
 			IndexTuple	truncated;
 			Size		truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks
 			 * in internal pages are either negative infinity items, or get
 			 * their contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
+			 * We don't try to bias our choice of split point to make it more
+			 * likely that _bt_truncate() can truncate away more attributes,
+			 * whereas the split point passed to _bt_split() is chosen much
+			 * more delicately.  Suffix truncation is mostly useful because it
+			 * improves space utilization for workloads with random
+			 * insertions.  It doesn't seem worthwhile to add logic for
+			 * choosing a split point here for a benefit that is bound to be
+			 * much smaller.
+			 *
 			 * Since the truncated tuple is probably smaller than the
 			 * original, it cannot just be copied in place (besides, we want
 			 * to actually save space on the leaf page).  We delete the
 			 * original high key, and add our own truncated high key at the
-			 * same offset.
+			 * same offset.  It's okay if the truncated tuple is slightly
+			 * larger due to containing a heap TID value, since this case is
+			 * known to _bt_check_third_page(), which reserves space.
 			 *
 			 * Note that the page layout won't be changed very much.  oitup is
 			 * already located at the physical beginning of tuple space, so we
@@ -917,7 +921,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
+			lastleft = (IndexTuple) PageGetItem(opage, ii);
+
+			truncated = _bt_truncate(wstate->index, lastleft, oitup,
+									 wstate->inskey);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -936,8 +944,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -982,7 +991,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1041,8 +1050,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1135,6 +1145,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1142,7 +1154,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1159,6 +1170,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all
+				 * keys in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
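For the index build changes, the essential point is that sorting now breaks
key ties using heap TID.  Here is a throwaway comparator (simplified stand-in
types, not the patch's own code) that orders entries the way CREATE INDEX
must for a single-column integer index:

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	int			key;			/* user-visible key value */
	unsigned	block;			/* heap TID block number */
	unsigned	offset;			/* heap TID offset number */
}			entry_t;

static int
entry_cmp(const void *a, const void *b)
{
	const entry_t *e1 = a;
	const entry_t *e2 = b;

	if (e1->key != e2->key)
		return e1->key < e2->key ? -1 : 1;
	/* equal keys: heap TID acts as the implicit last attribute */
	if (e1->block != e2->block)
		return e1->block < e2->block ? -1 : 1;
	if (e1->offset != e2->offset)
		return e1->offset < e2->offset ? -1 : 1;
	return 0;					/* unreachable for distinct heap tuples */
}

int
main(void)
{
	entry_t		entries[] = {{7, 12, 4}, {7, 3, 9}, {5, 99, 1}, {7, 3, 2}};

	qsort(entries, 4, sizeof(entry_t), entry_cmp);
	for (int i = 0; i < 4; i++)
		printf("(%d, (%u,%u))\n", entries[i].key,
			   entries[i].block, entries[i].offset);
	return 0;
}

The same rule is applied in both tuplesort and the two-spool merge above, so
the merged stream is totally ordered and the Assert(compare != 0) holds.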
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 898679b44ef..2f9f6e76015 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,6 +49,8 @@ static void _bt_mark_scankey_required(ScanKey skey);
 static bool _bt_check_rowcompare(ScanKey skey,
 					 IndexTuple tuple, TupleDesc tupdesc,
 					 ScanDirection dir, bool *continuescan);
+static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
+			   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -56,9 +58,26 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().  Callers that don't
- *		need to fill out the insertion scankey arguments (e.g. they use an
- *		ad-hoc comparison routine) can pass a NULL index tuple.
+ *		When itup is a non-pivot tuple, the returned insertion scan key is
+ *		suitable for finding a place for it to go on the leaf level.  Pivot
+ *		tuples can be used to re-find a leaf page with a matching high key, but
+ *		then caller needs to set scan key's pivotsearch field to true.  This
+ *		allows caller to search for a leaf page with a matching high key,
+ *		which is usually to the left of the first leaf page a non-pivot match
+ *		might appear on.
+ *
+ *		The result is intended for use with _bt_compare() and _bt_truncate().
+ *		Callers that don't need to fill out the insertion scankey arguments
+ *		(e.g. they use an ad-hoc comparison routine, or only need a scankey
+ *		for _bt_truncate()) can pass a NULL index tuple.  The scankey will
+ *		be initialized as if an "all truncated" pivot tuple was passed
+ *		instead.
+ *
+ *		Note that we may occasionally have to share lock the metapage to
+ *		determine whether or not the keys in the index are expected to be
+ *		unique (i.e. if this is a "heapkeyspace" index).  We assume a
+ *		heapkeyspace index when caller passes a NULL tuple, allowing index
+ *		build callers to avoid accessing the non-existent metapage.
  */
 BTScanInsert
 _bt_mkscankey(Relation rel, IndexTuple itup)
@@ -79,13 +98,18 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
 
 	/*
-	 * We'll execute search using scan key constructed on key columns. Non-key
-	 * (INCLUDE index) columns are always omitted from scan keys.
+	 * We'll execute search using scan key constructed on key columns.
+	 * Truncated attributes and non-key attributes are omitted from the final
+	 * scan key.
 	 */
 	key = palloc(offsetof(BTScanInsertData, scankeys) +
 				 sizeof(ScanKeyData) * indnkeyatts);
+	key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
 	key->nextkey = false;
+	key->pivotsearch = false;
 	key->keysz = Min(indnkeyatts, tupnatts);
+	key->scantid = key->heapkeyspace && itup ?
+		BTreeTupleGetHeapTID(itup) : NULL;
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -101,9 +125,9 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
 
 		/*
-		 * Key arguments built when caller provides no tuple are
-		 * defensively represented as NULL values.  They should never be
-		 * used.
+		 * Key arguments built from truncated attributes (or when caller
+		 * provides no tuple) are defensively represented as NULL values. They
+		 * should never be used.
 		 */
 		if (i < tupnatts)
 			arg = index_getattr(itup, i + 1, itupdesc, &null);
@@ -2041,38 +2065,234 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_truncate() -- create tuple without unneeded suffix attributes.
  *
- * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * Returns truncated pivot index tuple allocated in caller's memory context,
+ * with key attributes copied from caller's firstright argument.  If rel is
+ * an INCLUDE index, non-key attributes will definitely be truncated away,
+ * since they're not part of the key space.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  We only need to keep firstright
+ * attributes up to and including the first non-lastleft-equal attribute.
+ * Caller's insertion scankey is used to compare the tuples; the scankey's
+ * argument values are not considered here.
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Sometimes this routine will return a new pivot tuple that takes up more
+ * space than firstright, because a new heap TID attribute had to be added to
+ * distinguish lastleft from firstright.  This should only happen when the
+ * caller is in the process of splitting a leaf page that has many logical
+ * duplicates, where it's unavoidable.
  *
  * Note that returned tuple's t_tid offset will hold the number of attributes
  * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * should only change truncated tuple's downlink.  Note also that truncated
+ * key attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
+ *
+ * In the worst case (when a heap TID is appended) the size of the returned
+ * tuple is the size of the first right tuple plus an additional MAXALIGN()'d
+ * item pointer.  This guarantee is important, since callers need to stay
+ * under the 1/3 of a page restriction on tuple size.  If this routine is ever
+ * taught to truncate within an attribute/datum, it will need to avoid
+ * returning an enlarged tuple to caller when truncation + TOAST compression
+ * ends up enlarging the final datum.
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			 BTScanInsert itup_key)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int16		natts = IndexRelationGetNumberOfAttributes(rel);
+	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			keepnatts;
+	IndexTuple	pivot;
+	ItemPointer pivotheaptid;
+	Size		newsize;
+
+	/*
+	 * We should only ever truncate leaf index tuples.  It's never okay to
+	 * truncate a second time.
+	 */
+	Assert(BTreeTupleGetNAtts(lastleft, rel) == natts);
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
+
+	/* Determine how many attributes must be kept in truncated tuple */
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+
+#ifdef DEBUG_NO_TRUNCATE
+	/* Force truncation to be ineffective for testing purposes */
+	keepnatts = nkeyatts + 1;
+#endif
+
+	if (keepnatts <= natts)
+	{
+		IndexTuple	tidpivot;
+
+		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+
+		/*
+		 * If there is a distinguishing key attribute within new pivot tuple,
+		 * there is no need to add an explicit heap TID attribute
+		 */
+		if (keepnatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			return pivot;
+		}
+
+		/*
+		 * Only truncation of non-key attributes was possible, since key
+		 * attributes are all equal.  It's necessary to add a heap TID
+		 * attribute to the new pivot tuple.
+		 */
+		Assert(natts != nkeyatts);
+		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		/* cannot leak memory here */
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		Assert(natts == nkeyatts);
+		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We have to use heap TID as a unique-ifier in the new pivot tuple, since
+	 * no non-TID key attribute in the right item readily distinguishes the
+	 * right side of the split from the left side.  Use enlarged space that
+	 * holds a copy of first right tuple; place a heap TID value within the
+	 * extra space that remains at the end.
+	 *
+	 * nbtree conceptualizes this case as an inability to truncate away any
+	 * key attribute.  We must use an alternative representation of heap TID
+	 * within pivots because heap TID is only treated as an attribute within
+	 * nbtree (e.g., there is no pg_attribute entry).
+	 */
+	Assert(itup_key->heapkeyspace);
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+
+	/*
+	 * Lehman & Yao use lastleft as the leaf high key in all cases, but don't
+	 * consider suffix truncation.  It seems like a good idea to follow that
+	 * example in cases where no truncation takes place -- use lastleft's heap
+	 * TID.  (This is also the closest value to negative infinity that's
+	 * legally usable.)
+	 */
+	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
+								  sizeof(ItemPointerData));
+	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+
+	/*
+	 * Lehman and Yao require that the downlink to the right page, which is to
+	 * be inserted into the parent page in the second phase of a page split, be
+	 * a strict lower bound on items on the right page, and a non-strict upper
+	 * bound for items on the left page.  Assert that heap TIDs follow these
+	 * invariants, since a heap TID value is apparently needed as a
+	 * tiebreaker.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+#ifndef DEBUG_NO_TRUNCATE
+	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#else
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/*
+	 * Those invariants aren't guaranteed to hold for lastleft + firstright
+	 * heap TID attribute values when they're considered here only because
+	 * DEBUG_NO_TRUNCATE is defined (a heap TID is probably not actually
+	 * needed as a tiebreaker).  DEBUG_NO_TRUNCATE must therefore use a heap
+	 * TID value that always works as a strict lower bound for items to the
+	 * right.  In particular, it must avoid using firstright's leading key
+	 * attribute values along with lastleft's heap TID value when lastleft's
+	 * TID happens to be greater than firstright's TID.
+	 */
+	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
 
-	return truncated;
+	/*
+	 * Pivot heap TID should never be fully equal to firstright.  Note that
+	 * the pivot heap TID will still end up equal to lastleft's heap TID when
+	 * that's the only usable value.
+	 */
+	ItemPointerSetOffsetNumber(pivotheaptid,
+							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
+	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+#endif
+
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetAltHeapTID(pivot);
+
+	return pivot;
+}
+
+/*
+ * _bt_keep_natts - how many key attributes to keep when truncating.
+ *
+ * Caller provides two tuples that enclose a split point.  Caller's insertion
+ * scankey is used to compare the tuples; the scankey's argument values are
+ * not considered here.
+ *
+ * This can return a number of attributes that is one greater than the
+ * number of key attributes for the index relation.  This indicates that the
+ * caller must use a heap TID as a unique-ifier in new pivot tuple.
+ */
+static int
+_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+			   BTScanInsert itup_key)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	int			keepnatts;
+	ScanKey		scankey;
+
+	/*
+	 * Be consistent about the representation of BTREE_VERSION 2/3 tuples
+	 * across Postgres versions; don't allow new pivot tuples to have
+	 * truncated key attributes there.  _bt_compare() treats truncated key
+	 * attributes as having the value minus infinity, which would break
+	 * searches within !heapkeyspace indexes.
+	 */
+	if (!itup_key->heapkeyspace)
+	{
+		Assert(nkeyatts != IndexRelationGetNumberOfAttributes(rel));
+		return nkeyatts;
+	}
+
+	scankey = itup_key->scankeys;
+	keepnatts = 1;
+	for (int attnum = 1; attnum <= nkeyatts; attnum++, scankey++)
+	{
+		Datum		datum1,
+					datum2;
+		bool		isNull1,
+					isNull2;
+
+		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
+		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+
+		if (isNull1 != isNull2)
+			break;
+
+		if (!isNull1 &&
+			DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+											scankey->sk_collation,
+											datum1,
+											datum2)) != 0)
+			break;
+
+		keepnatts++;
+	}
+
+	return keepnatts;
 }
 
 /*
@@ -2086,15 +2306,17 @@ _bt_nonkey_truncate(Relation rel, IndexTuple itup)
  * preferred to calling here.  That's usually more convenient, and is always
  * more explicit.  Call here instead when offnum's tuple may be a negative
  * infinity tuple that uses the pre-v11 on-disk representation, or when a low
- * context check is appropriate.
+ * context check is appropriate.  This routine is as strict as possible about
+ * what is expected on each version of btree.
  */
 bool
-_bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
+_bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 {
 	int16		natts = IndexRelationGetNumberOfAttributes(rel);
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2114,16 +2336,26 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
+			/*
+			 * Non-pivot tuples currently never use alternative heap TID
+			 * representation -- even those within heapkeyspace indexes
+			 */
+			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+				return false;
+
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
-			 * should never be truncated
+			 * should never be truncated.  (Note that tupnatts must have been
+			 * inferred, rather than coming from an explicit on-disk
+			 * representation.)
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2133,8 +2365,15 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 */
 			Assert(!P_RIGHTMOST(opaque));
 
-			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			/*
+			 * !heapkeyspace high key tuple contains only key attributes. Note
+			 * that tupnatts will only have been explicitly represented in
+			 * !heapkeyspace indexes that happen to have non-key attributes.
+			 */
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2146,7 +2385,11 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * its high key) is its negative infinity tuple.  Negative
 			 * infinity tuples are always truncated to zero attributes.  They
 			 * are a particular kind of pivot tuple.
-			 *
+			 */
+			if (heapkeyspace)
+				return tupnatts == 0;
+
+			/*
 			 * The number of attributes won't be explicitly represented if the
 			 * negative infinity tuple was generated during a page split that
 			 * occurred with a version of Postgres before v11.  There must be
@@ -2157,18 +2400,109 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
 			 * that to decide if the tuple is a pre-v11 tuple.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == 0 ||
+			return tupnatts == 0 ||
 				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
 				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
 		}
 		else
 		{
 			/*
-			 * Tuple contains only key attributes despite on is it page high
-			 * key or not
+			 * !heapkeyspace downlink tuple with separator key contains only
+			 * key attributes.  Note that tupnatts will only have been
+			 * explicitly represented in !heapkeyspace indexes that happen to
+			 * have non-key attributes.
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			if (!heapkeyspace)
+				return tupnatts == nkeyatts;
+
+			/* Use generic heapkeyspace pivot tuple handling */
 		}
 
 	}
+
+	/* Handle heapkeyspace pivot tuples (excluding minus infinity items) */
+	Assert(heapkeyspace);
+
+	/*
+	 * Explicit representation of the number of attributes is mandatory with
+	 * heapkeyspace index pivot tuples, regardless of whether or not there are
+	 * non-key attributes.
+	 */
+	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+		return false;
+
+	/*
+	 * Heap TID is a tiebreaker key attribute, so it cannot be untruncated
+	 * when any other key attribute is truncated
+	 */
+	if (BTreeTupleGetHeapTID(itup) != NULL && tupnatts != nkeyatts)
+		return false;
+
+	/*
+	 * Pivot tuple must have at least one untruncated key attribute (minus
+	 * infinity pivot tuples are the only exception).  Pivot tuples can never
+	 * represent that there is a value present for a key attribute that
+	 * exceeds pg_index.indnkeyatts for the index.
+	 */
+	return tupnatts > 0 && tupnatts <= nkeyatts;
+}
+
+/*
+ *	_bt_check_third_page() -- check whether tuple fits on a btree page at all.
+ *
+ * We actually need to be able to fit three items on every page, so restrict
+ * any one item to 1/3 the per-page available space.  Note that itemsz should
+ * not include the ItemId overhead.
+ *
+ * It might be useful to apply TOAST methods rather than throw an error here.
+ * Using out of line storage would break assumptions made by suffix truncation
+ * and by contrib/amcheck, though.
+ */
+void
+_bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
+					 Page page, IndexTuple newtup)
+{
+	Size		itemsz;
+	BTPageOpaque opaque;
+
+	itemsz = MAXALIGN(IndexTupleSize(newtup));
+
+	/* Double check item size against limit */
+	if (itemsz <= BTMaxItemSize(page))
+		return;
+
+	/*
+	 * Tuple is probably too large to fit on page, but it's possible that the
+	 * index uses version 2 or version 3, or that page is an internal page, in
+	 * which case a slightly higher limit applies.
+	 */
+	if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
+		return;
+
+	/*
+	 * Internal page insertions cannot fail here, because that would mean that
+	 * an earlier leaf level insertion that should have failed didn't
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	if (!P_ISLEAF(opaque))
+		elog(ERROR, "cannot insert oversized tuple of size %zu on internal page of index \"%s\"",
+			 itemsz, RelationGetRelationName(rel));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+			 errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"",
+					itemsz,
+					needheaptidspace ? BTREE_VERSION : BTREE_NOVAC_VERSION,
+					needheaptidspace ? BTMaxItemSize(page) :
+					BTMaxItemSizeNoHeapTid(page),
+					RelationGetRelationName(rel)),
+			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
+					   ItemPointerGetBlockNumber(&newtup->t_tid),
+					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   RelationGetRelationName(heap)),
+			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
+					 "Consider a function index of an MD5 hash of the value, "
+					 "or use full text indexing."),
+			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
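The contract of _bt_keep_natts() might be easier to see with a toy standalone
version (plain int arrays in place of index tuples and opclass comparators;
NULL handling omitted):

#include <stdio.h>

static int
model_keep_natts(const int *lastleft, const int *firstright, int nkeyatts)
{
	int			keepnatts = 1;

	for (int attnum = 1; attnum <= nkeyatts; attnum++)
	{
		/* keep attributes up to and including the first distinguishing one */
		if (lastleft[attnum - 1] != firstright[attnum - 1])
			break;
		keepnatts++;
	}
	return keepnatts;
}

int
main(void)
{
	int			lastleft[] = {1, 7, 3};
	int			firstright[] = {1, 9, 5};
	int			dups[] = {1, 7, 3};

	/* the second attribute already separates the halves */
	printf("%d\n", model_keep_natts(lastleft, firstright, 3));	/* 2 */

	/*
	 * All key attributes equal: the answer exceeds nkeyatts, telling the
	 * caller to append a heap TID to the new pivot.
	 */
	printf("%d\n", model_keep_natts(lastleft, dups, 3));	/* 4 */
	return 0;
}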
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index b0666b42df3..7f261db9017 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -103,7 +103,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 
 	md = BTPageGetMeta(metapg);
 	md->btm_magic = BTREE_MAGIC;
-	md->btm_version = BTREE_VERSION;
+	md->btm_version = xlrec->version;
 	md->btm_root = xlrec->root;
 	md->btm_level = xlrec->level;
 	md->btm_fastroot = xlrec->fastroot;
@@ -202,7 +202,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 }
 
 static void
-btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
+btree_xlog_split(bool onleft, XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
@@ -213,8 +213,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	BTPageOpaque ropaque;
 	char	   *datapos;
 	Size		datalen;
-	IndexTuple	left_hikey = NULL;
-	Size		left_hikeysz = 0;
 	BlockNumber leftsib;
 	BlockNumber rightsib;
 	BlockNumber rnext;
@@ -248,20 +246,6 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 
 	_bt_restore_page(rpage, datapos, datalen);
 
-	/*
-	 * When the high key isn't present is the wal record, then we assume it to
-	 * be equal to the first key on the right page.  It must be from the leaf
-	 * level.
-	 */
-	if (!lhighkey)
-	{
-		ItemId		hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
-
-		Assert(isleaf);
-		left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
-		left_hikeysz = ItemIdGetLength(hiItemId);
-	}
-
 	PageSetLSN(rpage, lsn);
 	MarkBufferDirty(rbuf);
 
@@ -282,8 +266,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		Page		lpage = (Page) BufferGetPage(lbuf);
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
-		IndexTuple	newitem = NULL;
-		Size		newitemsz = 0;
+		IndexTuple	newitem,
+					left_hikey;
+		Size		newitemsz,
+					left_hikeysz;
 		Page		newlpage;
 		OffsetNumber leftoff;
 
@@ -298,13 +284,10 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
-		if (lhighkey)
-		{
-			left_hikey = (IndexTuple) datapos;
-			left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-			datapos += left_hikeysz;
-			datalen -= left_hikeysz;
-		}
+		left_hikey = (IndexTuple) datapos;
+		left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+		datapos += left_hikeysz;
+		datalen -= left_hikeysz;
 
 		Assert(datalen == 0);
 
@@ -1003,16 +986,10 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_insert(false, true, record);
 			break;
 		case XLOG_BTREE_SPLIT_L:
-			btree_xlog_split(true, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			btree_xlog_split(true, true, record);
+			btree_xlog_split(true, record);
 			break;
 		case XLOG_BTREE_SPLIT_R:
-			btree_xlog_split(false, false, record);
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			btree_xlog_split(false, true, record);
+			btree_xlog_split(false, record);
 			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 8d5c6ae0ab0..fcac0cd8a93 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -35,8 +35,6 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			}
 		case XLOG_BTREE_SPLIT_L:
 		case XLOG_BTREE_SPLIT_R:
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
@@ -130,12 +128,6 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
-		case XLOG_BTREE_SPLIT_L_HIGHKEY:
-			id = "SPLIT_L_HIGHKEY";
-			break;
-		case XLOG_BTREE_SPLIT_R_HIGHKEY:
-			id = "SPLIT_R_HIGHKEY";
-			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 16bda5c586a..3eebd9ef51c 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4057,9 +4057,10 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required for
+	 * btree indexes, since heap TID is treated as an implicit last key
+	 * attribute in order to ensure that all keys in the index are physically
+	 * unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
@@ -4076,6 +4077,9 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
@@ -4128,6 +4132,9 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 			return (pos1 < pos2) ? -1 : 1;
 	}
 
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
 	return 0;
 }
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d7e83b0a9d2..624efff3832 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -112,18 +112,45 @@ typedef struct BTMetaPageData
 #define BTPageGetMeta(p) \
 	((BTMetaPageData *) PageGetContents(p))
 
+/*
+ * The current Btree version is 4.  That's what you'll get when you create
+ * a new index.
+ *
+ * Btree version 3 was used in PostgreSQL v11.  It is mostly the same as
+ * version 4, but heap TIDs were not part of the keyspace.  Index tuples
+ * with duplicate keys could be stored in any order.  We continue to
+ * support reading and writing Btree versions 2 and 3, so that they don't
+ * need to be immediately re-indexed at pg_upgrade.  In order to get the
+ * new heapkeyspace semantics, however, a REINDEX is needed.
+ *
+ * Btree version 2 is mostly the same as version 3.  There are two new
+ * fields in the metapage that were introduced in version 3.  A version 2
+ * metapage will be automatically upgraded to version 3 on the first
+ * insert to it.  INCLUDE indexes cannot use version 2.
+ */
 #define BTREE_METAPAGE	0		/* first page is meta */
-#define BTREE_MAGIC		0x053162	/* magic number of btree pages */
-#define BTREE_VERSION	3		/* current version number */
+#define BTREE_MAGIC		0x053162	/* magic number in metapage */
+#define BTREE_VERSION	4		/* current version number */
 #define BTREE_MIN_VERSION	2	/* minimal supported version number */
+#define BTREE_NOVAC_VERSION	3	/* minimal version with all meta fields */
 
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_truncate() will need to enlarge
+ * a heap index tuple to make space for a tiebreaker heap TID
+ * attribute, which we account for here.
  */
 #define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*sizeof(ItemPointerData)) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSizeNoHeapTid(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
@@ -187,38 +214,84 @@ typedef struct BTMetaPageData
 #define P_FIRSTDATAKEY(opaque)	(P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)
 
 /*
+ * Notes on B-Tree tuple format, and key and non-key attributes:
+ *
  * INCLUDE B-Tree indexes have non-key attributes.  These are extra
  * attributes that may be returned by index-only scans, but do not influence
  * the order of items in the index (formally, non-key attributes are not
  * considered to be part of the key space).  Non-key attributes are only
  * present in leaf index tuples whose item pointers actually point to heap
- * tuples.  All other types of index tuples (collectively, "pivot" tuples)
- * only have key attributes, since pivot tuples only ever need to represent
- * how the key space is separated.  In general, any B-Tree index that has
- * more than one level (i.e. any index that does not just consist of a
- * metapage and a single leaf root page) must have some number of pivot
- * tuples, since pivot tuples are used for traversing the tree.
- *
- * We store the number of attributes present inside pivot tuples by abusing
- * their item pointer offset field, since pivot tuples never need to store a
- * real offset (downlinks only need to store a block number).  The offset
- * field only stores the number of attributes when the INDEX_ALT_TID_MASK
- * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
- *
- * The 12 least significant offset bits are used to represent the number of
- * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
- * for future use (BT_RESERVED_OFFSET_MASK bits). BT_N_KEYS_OFFSET_MASK should
- * be large enough to store any number <= INDEX_MAX_KEYS.
+ * tuples (non-pivot tuples).  _bt_check_natts() enforces the rules
+ * described here.
+ *
+ * Non-pivot tuple format:
+ *
+ *  t_tid | t_info | key values | INCLUDE columns, if any
+ *
+ * t_tid points to the heap TID, which is a tiebreaker key column as of
+ * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
+ * set for non-pivot tuples.
+ *
+ * All other types of index tuples ("pivot" tuples) only have key columns,
+ * since pivot tuples only exist to represent how the key space is
+ * separated.  In general, any B-Tree index that has more than one level
+ * (i.e. any index that does not just consist of a metapage and a single
+ * leaf root page) must have some number of pivot tuples, since pivot
+ * tuples are used for traversing the tree.  Suffix truncation can omit
+ * trailing key columns when a new pivot is formed, which makes minus
+ * infinity their logical value.  Since BTREE_VERSION 4 indexes treat heap
+ * TID as a trailing key column that ensures that all index tuples are
+ * physically unique, it is necessary to represent heap TID as a trailing
+ * key column in pivot tuples, though very often this can be truncated
+ * away, just like any other key column. (Actually, the heap TID is
+ * omitted rather than truncated, since its representation is different to
+ * the non-pivot representation.)
+ *
+ * Pivot tuple format:
+ *
+ *  t_tid | t_info | key values | [heap TID]
+ *
+ * We store the number of columns present inside pivot tuples by abusing
+ * their t_tid offset field, since pivot tuples never need to store a real
+ * offset (downlinks only need to store a block number in t_tid).  The
+ * offset field only stores the number of columns/attributes when the
+ * INDEX_ALT_TID_MASK bit is set, which doesn't count the trailing heap
+ * TID column sometimes stored in pivot tuples -- that's represented by
+ * the presence of BT_HEAP_TID_ATTR.  The INDEX_ALT_TID_MASK bit in t_info
+ * is always set on BTREE_VERSION 4.  BT_HEAP_TID_ATTR can only be set on
+ * BTREE_VERSION 4.
+ *
+ * In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set in
+ * pivot tuples.  In that case, the number of key columns is implicitly
+ * the same as the number of key columns in the index.  It is not usually
+ * set on version 2 indexes, which predate the introduction of INCLUDE
+ * indexes.  (Only explicitly truncated pivot tuples explicitly represent
+ * the number of key columns on versions 2 and 3, whereas all pivot tuples
+ * are formed using truncation on version 4.  A version 2 index will have
+ * it set for an internal page negative infinity item iff internal page
+ * split occurred after upgrade to Postgres 11+.)
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of columns in INDEX_ALT_TID_MASK tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
+ * number of columns/attributes <= INDEX_MAX_KEYS.
+ *
+ * Note well: The macros that deal with the number of attributes in tuples
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
+ * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
+ * tuple (or must have the same number of attributes as the index has
+ * generally in the case of !heapkeyspace indexes).  They will need to be
+ * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
+ * for something else.
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
+
+/* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +314,16 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tiebreaker heap-TID
+ * attribute, if any.  Note also that the number of key attributes must be
+ * explicitly represented in all heapkeyspace pivot tuples.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,10 +332,34 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
+/*
+ * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
+ * and non-pivot tuples, despite differences in how heap TID is represented.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   sizeof(ItemPointerData)) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+	)
+/*
+ * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
+ * representation (currently limited to pivot tuples)
+ */
+#define BTreeTupleSetAltHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
 /*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
@@ -325,22 +424,46 @@ typedef BTStackData *BTStack;
  * be confused with a search scankey).  It's used to descend a B-Tree using
  * _bt_search.
  *
+ * heapkeyspace indicates if we expect all keys in the index to be
+ * physically unique because heap TID is used as a tiebreaker attribute, and
+ * if index may have truncated key attributes in pivot tuples.  This is
+ * always the case with indexes whose version is >= version 4 (and never the
+ * case with earlier versions).  This state will never vary for the same
+ * index (unless there is a REINDEX of a pg_upgrade'd index), but it's
+ * convenient to keep it close by.
+ *
  * When nextkey is false (the usual case), _bt_search and _bt_binsrch will
  * locate the first item >= scankey.  When nextkey is true, they will locate
  * the first item > scan key.
  *
- * keysz is the number of insertion scankeys present.
+ * pivotsearch is set to true by callers that want to re-find a leaf page
+ * using a scankey built from the leaf page's high key.  Most callers set this to
+ * false.
+ *
+ * scantid is the heap TID that is used as a final tiebreaker attribute,
+ * which may be set to NULL to indicate its absence.  Must be set when
+ * inserting new tuples into heapkeyspace indexes, since every tuple in
+ * the tree unambiguously belongs in one exact position (it's never set
+ * with !heapkeyspace indexes, though).  Despite the representational
+ * difference, nbtree search code considers scantid to be just another
+ * insertion scankey attribute.
+ *
+ * keysz is the number of insertion scankeys present, not including scantid.
  *
- * scankeys is an array of scan key entries for attributes that are compared.
- * During insertion, there must be a scan key for every attribute, but when
- * starting a regular index scan some can be omitted.  The array is used as a
- * flexible array member, though it's sized in a way that makes it possible to
- * use stack allocations.  See nbtree/README for full details.
+ * scankeys is an array of scan key entries for attributes that are compared
+ * before scantid (user-visible attributes).  During insertion, there must be
+ * a scan key for every attribute, but when starting a regular index scan some
+ * can be omitted.  The array is used as a flexible array member, though it's
+ * sized in a way that makes it possible to use stack allocations.  See
+ * nbtree/README for full details.
  */
 typedef struct BTScanInsertData
 {
 	/* State used to locate a position at the leaf level */
+	bool		heapkeyspace;
 	bool		nextkey;
+	bool		pivotsearch;	/* Searching for pivot tuple? */
+	ItemPointer scantid;		/* tiebreaker for scankeys */
 	int			keysz;			/* Size of scankeys */
 	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
 } BTScanInsertData;
@@ -601,6 +724,7 @@ extern void _bt_upgrademetapage(Page page);
 extern Buffer _bt_getroot(Relation rel, int access);
 extern Buffer _bt_gettrueroot(Relation rel);
 extern int	_bt_getrootheight(Relation rel);
+extern bool _bt_heapkeyspace(Relation rel);
 extern void _bt_checkpage(Relation rel, Buffer buf);
 extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
 extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -654,8 +778,12 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
-extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
+extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
+			 IndexTuple firstright, BTScanInsert itup_key);
+extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
+				OffsetNumber offnum);
+extern void _bt_check_third_page(Relation rel, Relation heap,
+					 bool needheaptidspace, Page page, IndexTuple newtup);
 
 /*
  * prototypes for functions in nbtvalidate.c
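For what it's worth, the practical effect of the BTMaxItemSize() change above
can be worked out by hand.  Assuming 8kB pages, MAXALIGN of 8, and the usual
struct sizes (24 byte page header, 4 byte line pointers, 6 byte item
pointers, 16 byte btree special space) -- adjust if your platform differs --
the leaf limit drops from 2712 to 2704 bytes:

#include <stdio.h>

#define PAGESZ			8192
#define MAXALIGN(x)		(((x) + 7) & ~((size_t) 7))
#define MAXALIGN_DOWN(x) ((x) & ~((size_t) 7))

int
main(void)
{
	size_t		hdr = 24;		/* SizeOfPageHeaderData (assumed) */
	size_t		lp = 4;			/* sizeof(ItemIdData) (assumed) */
	size_t		tid = 6;		/* sizeof(ItemPointerData) (assumed) */
	size_t		opaque = 16;	/* sizeof(BTPageOpaqueData) (assumed) */
	size_t		old_limit;
	size_t		new_limit;

	/* BTMaxItemSizeNoHeapTid(): the old limit, still used pre-v4 */
	old_limit = MAXALIGN_DOWN((PAGESZ - MAXALIGN(hdr + 3 * lp) -
							   MAXALIGN(opaque)) / 3);
	/* BTMaxItemSize(): additionally reserves room for 3 heap TIDs */
	new_limit = MAXALIGN_DOWN((PAGESZ - MAXALIGN(hdr + 3 * lp + 3 * tid) -
							   MAXALIGN(opaque)) / 3);

	printf("old limit: %zu, v4 leaf limit: %zu\n", old_limit, new_limit);
	return 0;
}

That 8 byte difference is also why _bt_check_third_page() reports a
version-dependent maximum in its error message.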
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index a605851c981..6320a0098ff 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,8 +28,7 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
-#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
+/* 0x50 and 0x60 are unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -47,6 +46,7 @@
  */
 typedef struct xl_btree_metadata
 {
+	uint32		version;
 	BlockNumber root;
 	uint32		level;
 	BlockNumber fastroot;
@@ -80,27 +80,30 @@ typedef struct xl_btree_insert
  * whole page image.  The left page, however, is handled in the normal
  * incremental-update fashion.
  *
- * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
- * The _L and _R variants indicate whether the inserted tuple went into the
- * left or right split page (and thus, whether newitemoff and the new item
- * are stored or not).  The _HIGHKEY variants indicate that we've logged
- * explicitly left page high key value, otherwise redo should use right page
- * leftmost key as a left page high key.  _HIGHKEY is specified for internal
- * pages where right page leftmost key is suppressed, and for leaf pages
- * of covering indexes where high key have non-key attributes truncated.
+ * Note: XLOG_BTREE_SPLIT_L and XLOG_BTREE_SPLIT_R share this data record.
+ * There are two variants to indicate whether the inserted tuple went into the
+ * left or right split page (and thus, whether newitemoff and the new item are
+ * stored or not).  We always log the left page high key because suffix
+ * truncation can generate a new leaf high key using user-defined code.  This
+ * is also necessary on internal pages, since the first right item that the
+ * left page's high key was based on will have been truncated to zero
+ * attributes in the right page (the original is unavailable from the right
+ * page).
  *
  * Backup Blk 0: original page / new left page
  *
  * The left page's data portion contains the new item, if it's the _L variant.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * If level > 0, an IndexTuple representing the HIKEY of the left page
- * follows.  We don't need this on leaf pages, because it's the same as the
- * leftmost key in the new right page.
+ * An IndexTuple representing the high key of the left page must follow with
+ * either variant.
  *
  * Backup Blk 1: new right page
  *
- * The right page's data portion contains the right page's tuples in the
- * form used by _bt_restore_page.
+ * The right page's data portion contains the right page's tuples in the form
+ * used by _bt_restore_page.  This includes the new item, if it's the _R
+ * variant.  The right page's tuples also include the right page's high key
+ * with either variant (moved from the left/original page during the split),
+ * unless the split happened to be of the rightmost page on its level, where
+ * there is no high key for the new right page.
  *
  * Backup Blk 2: next block (orig page's rightlink), if any
  * Backup Blk 3: child's left sibling, if non-leaf split
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index b21298a2a6b..ff443a476c5 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -199,28 +199,22 @@ reset enable_seqscan;
 reset enable_indexscan;
 reset enable_bitmapscan;
 --
--- Test B-tree page deletion. In particular, deleting a non-leaf page.
+-- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
--- First create a tree that's at least four levels deep. The text inserted
--- is long and poorly compressible. That way only a few index tuples fit on
--- each page, allowing us to get a tall tree with fewer pages.
-create table btree_tall_tbl(id int4, t text);
-create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
-insert into btree_tall_tbl
-  select g, g::text || '_' ||
-          (select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
-from generate_series(1, 100) g;
--- Delete most entries, and vacuum. This causes page deletions.
-delete from btree_tall_tbl where id < 950;
-vacuum btree_tall_tbl;
---
--- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
--- WAL record type). This happens when a "fast root" page is split.
+-- First create a tree that's at least three levels deep (i.e. has one level
+-- between the root and leaf levels). The text inserted is long.  It won't be
+-- compressed because we use plain storage in the table.  Only a few index
+-- tuples fit on each internal page, allowing us to get a tall tree with few
+-- pages.  (A tall tree is required to trigger caching.)
 --
--- The vacuum above should've turned the leaf page into a fast root. We just
--- need to insert some rows to cause the fast root page to split.
-insert into btree_tall_tbl (id, t)
-  select g, repeat('x', 100) from generate_series(1, 500) g;
+-- The text column must be the leading column in the index, since suffix
+-- truncation would otherwise truncate tuples on internal pages, leaving us
+-- with a short tree.
+create table btree_tall_tbl(id int4, t text);
+alter table btree_tall_tbl alter COLUMN t set storage plain;
+create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
+insert into btree_tall_tbl select g, repeat('x', 250)
+from generate_series(1, 130) g;
 --
 -- Test vacuum_cleanup_index_scale_factor
 --
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 5d4eb59a0ce..54d3eee1979 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3225,11 +3225,22 @@ explain (costs off)
 CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 --
+-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
+-- WAL record type). This happens when a "fast root" page is split.  This
+-- also creates coverage for nbtree FSM page recycling.
+--
+-- The vacuum above should've turned the leaf page into a fast root. We just
+-- need to insert some rows to cause the fast root page to split.
+INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
+--
 -- REINDEX (VERBOSE)
 --
 CREATE TABLE reindex_verbose(id integer primary key);
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8e50f8ffbb0..8d31110b874 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 -- doesn't work: grant still exists
 DROP USER regress_dep_user1;
 ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
-DETAIL:  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
+DETAIL:  privileges for table deptest1
 privileges for database regression
-privileges for table deptest1
+owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
 DROP OWNED BY regress_dep_user1;
 DROP USER regress_dep_user1;
 \set VERBOSITY terse
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index d0c9f9a67f8..f7891faa23c 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -187,9 +187,9 @@ ERROR:  event trigger "regress_event_trigger" does not exist
 -- should fail, regress_evt_user owns some objects
 drop role regress_evt_user;
 ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
-DETAIL:  owner of event trigger regress_event_trigger3
+DETAIL:  owner of user mapping for regress_evt_user on server useless_server
 owner of default privileges on new relations belonging to role regress_evt_user
-owner of user mapping for regress_evt_user on server useless_server
+owner of event trigger regress_event_trigger3
 -- cleanup before next test
 -- these are all OK; the second one should emit a NOTICE
 drop event trigger if exists regress_event_trigger2;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 4d82d3a7e84..0b7582accbd 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
 RESET ROLE;
 DROP ROLE regress_test_indirect;                            -- ERROR
 ERROR:  role "regress_test_indirect" cannot be dropped because some objects depend on it
-DETAIL:  owner of server s1
-privileges for foreign-data wrapper foo
+DETAIL:  privileges for foreign-data wrapper foo
+owner of server s1
 \des+
                                                                                  List of foreign servers
  Name |           Owner           | Foreign-data wrapper |                   Access privileges                   |  Type  | Version |             FDW options              | Description 
@@ -1995,16 +1995,13 @@ ERROR:  cannot attach a permanent relation as partition of temporary relation "t
 DROP FOREIGN TABLE foreign_part;
 DROP TABLE temp_parted;
 -- Cleanup
+\set VERBOSITY terse
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 ERROR:  role "regress_test_role" cannot be dropped because some objects depend on it
-DETAIL:  privileges for server s4
-privileges for foreign-data wrapper foo
-owner of user mapping for regress_test_role on server s6
 DROP SERVER t1 CASCADE;
 NOTICE:  drop cascades to user mapping for public on server t1
 DROP USER MAPPING FOR regress_test_role SERVER s6;
-\set VERBOSITY terse
 DROP FOREIGN DATA WRAPPER foo CASCADE;
 NOTICE:  drop cascades to 5 other objects
 DROP SERVER s8 CASCADE;
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index 2e170497c9d..bad5199d9ee 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3503,8 +3503,8 @@ SELECT refclassid::regclass, deptype
 SAVEPOINT q;
 DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
 ERROR:  role "regress_rls_eve" cannot be dropped because some objects depend on it
-DETAIL:  target of policy p on table tbl1
-privileges for table tbl1
+DETAIL:  privileges for table tbl1
+target of policy p on table tbl1
 ROLLBACK TO q;
 ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
 SAVEPOINT q;
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 2b087be796c..19fbfa8b720 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -84,32 +84,23 @@ reset enable_indexscan;
 reset enable_bitmapscan;
 
 --
--- Test B-tree page deletion. In particular, deleting a non-leaf page.
+-- Test B-tree fast path (cache rightmost leaf page) optimization.
 --
 
--- First create a tree that's at least four levels deep. The text inserted
--- is long and poorly compressible. That way only a few index tuples fit on
--- each page, allowing us to get a tall tree with fewer pages.
-create table btree_tall_tbl(id int4, t text);
-create index btree_tall_idx on btree_tall_tbl (id, t) with (fillfactor = 10);
-insert into btree_tall_tbl
-  select g, g::text || '_' ||
-          (select string_agg(md5(i::text), '_') from generate_series(1, 50) i)
-from generate_series(1, 100) g;
-
--- Delete most entries, and vacuum. This causes page deletions.
-delete from btree_tall_tbl where id < 950;
-vacuum btree_tall_tbl;
-
+-- First create a tree that's at least three levels deep (i.e. has one level
+-- between the root and leaf levels). The text inserted is long.  It won't be
+-- compressed because we use plain storage in the table.  Only a few index
+-- tuples fit on each internal page, allowing us to get a tall tree with few
+-- pages.  (A tall tree is required to trigger caching.)
 --
--- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
--- WAL record type). This happens when a "fast root" page is split.
---
-
--- The vacuum above should've turned the leaf page into a fast root. We just
--- need to insert some rows to cause the fast root page to split.
-insert into btree_tall_tbl (id, t)
-  select g, repeat('x', 100) from generate_series(1, 500) g;
+-- The text column must be the leading column in the index, since suffix
+-- truncation would otherwise truncate tuples on internal pages, leaving us
+-- with a short tree.
+create table btree_tall_tbl(id int4, t text);
+alter table btree_tall_tbl alter COLUMN t set storage plain;
+create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
+insert into btree_tall_tbl select g, repeat('x', 250)
+from generate_series(1, 130) g;
 
 --
 -- Test vacuum_cleanup_index_scale_factor
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 67ecad8dd5e..4487421ef30 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1146,11 +1146,23 @@ explain (costs off)
 CREATE TABLE delete_test_table (a bigint, b bigint, c bigint, d bigint);
 INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,80000) i;
 ALTER TABLE delete_test_table ADD PRIMARY KEY (a,b,c,d);
+-- Delete many entries, and vacuum. This causes page deletions.
 DELETE FROM delete_test_table WHERE a > 40000;
 VACUUM delete_test_table;
-DELETE FROM delete_test_table WHERE a > 10;
+-- Delete most entries, and vacuum, deleting internal pages and creating "fast
+-- root"
+DELETE FROM delete_test_table WHERE a < 79990;
 VACUUM delete_test_table;
 
+--
+-- Test B-tree insertion with a metapage update (XLOG_BTREE_INSERT_META
+-- WAL record type). This happens when a "fast root" page is split.  This
+-- also creates coverage for nbtree FSM page recycling.
+--
+-- The vacuum above should've turned the leaf page into a fast root. We just
+-- need to insert some rows to cause the fast root page to split.
+INSERT INTO delete_test_table SELECT i, 1, 2, 3 FROM generate_series(1,1000) i;
+
 --
 -- REINDEX (VERBOSE)
 --
diff --git a/src/test/regress/sql/foreign_data.sql b/src/test/regress/sql/foreign_data.sql
index d6fb3fae4e1..1cc1f6e0129 100644
--- a/src/test/regress/sql/foreign_data.sql
+++ b/src/test/regress/sql/foreign_data.sql
@@ -805,11 +805,11 @@ DROP FOREIGN TABLE foreign_part;
 DROP TABLE temp_parted;
 
 -- Cleanup
+\set VERBOSITY terse
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 DROP SERVER t1 CASCADE;
 DROP USER MAPPING FOR regress_test_role SERVER s6;
-\set VERBOSITY terse
 DROP FOREIGN DATA WRAPPER foo CASCADE;
 DROP SERVER s8 CASCADE;
 \set VERBOSITY default
-- 
2.20.1

In reply to: Heikki Linnakangas (#115)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Mon, Mar 18, 2019 at 4:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm getting a regression failure from the 'create_table' test with this:

Are you seeing that?

Yes -- though the bug is in your revised v18, not the original v18,
which passed CFTester. Your revision fails on Travis/Linux, which is
pretty close to what I see locally, and much less subtle than the test
failures you mentioned:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/507816665

"make check" did pass locally for me with your patch, but "make
check-world" (parallel recipe) did not.

The original v18 passed both CFTester tests about 15 hours ago:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/507643402

I see the bug. You're not supposed to test this way with a heapkeyspace index:

+               if (P_RIGHTMOST(lpageop) ||
+                   _bt_compare(rel, itup_key, page, P_HIKEY) != 0)
+                   break;

This is because the presence of scantid makes it almost certain that
you'll break when you shouldn't. You're doing it the old way, which is
inappropriate for a heapkeyspace index. Note that it would probably
take much longer to notice this bug if the "consider secondary
factors" patch was also applied, because then you would rarely have
cause to step right here (duplicates would never occupy more than a
single page in the regression tests). The test failures are probably
also timing sensitive, though they happen very reliably on my machine.

Looking at the patches 1 and 2 again:

I'm still not totally happy with the program flow and all the conditions
in _bt_findsplitloc(). I have a hard time following which codepaths are
taken when. I refactored that, so that there is a separate copy of the
loop for V3 and V4 indexes.

The big difference is that you make the possible call to
_bt_stepright() conditional on this being a checkingunique index --
the duplicate code is indented in that branch of _bt_findsplitloc().
Whereas I break early in the loop when "checkingunique &&
heapkeyspace".

The flow of the original loop not only had less code; it also
highlighted the important differences between heapkeyspace and
!heapkeyspace cases:

/* If this is the page that the tuple must go on, stop */
if (P_RIGHTMOST(lpageop))
    break;
cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
if (itup_key->heapkeyspace)
{
    if (cmpval <= 0)
        break;
}
else
{
    /*
     * pg_upgrade'd !heapkeyspace index.
     *
     * May have to handle legacy case where there is a choice of which
     * page to place new tuple on, and we must balance space
     * utilization as best we can.  Note that this may invalidate
     * cached bounds for us.
     */
    if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, insertstate))
        break;
}

I thought it was obvious that the "cmpval <= 0" code was different for
a reason. It now seems that this at least needs a comment.

I still believe that the best way to handle the !heapkeyspace case is
to make it similar to the heapkeyspace checkingunique case, regardless
of whether or not we're checkingunique. The fact that this bug slipped
in supports that view. Besides, the alternative that you suggest
treats !heapkeyspace indexes as if they were just as important to the
reader, which seems inappropriate (better to make the legacy case
follow the new case, not the other way around). I'm fine with the
comment tweaks that you made that are not related to
_bt_findsplitloc(), though.

I won't push the patches today, to give you the opportunity to
respond. I am not at all convinced right now, though.

--
Peter Geoghegan

In reply to: Robert Haas (#92)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Mar 12, 2019 at 11:40 AM Robert Haas <robertmhaas@gmail.com> wrote:

I think it's pretty clear that we have to view that as acceptable. I
mean, we could reduce contention even further by finding a way to make
indexes 40% larger, but I think it's clear that nobody wants that.
Now, maybe in the future we'll want to work on other techniques for
reducing contention, but I don't think we should make that the problem
of this patch, especially because the regressions are small and go
away after a few hours of heavy use. We should optimize for the case
where the user intends to keep the database around for years, not
hours.

I came back to the question of contention recently. I don't think it's
okay to make contention worse in workloads where indexes are mostly
the same size as before, such as almost any workload that pgbench can
simulate. I have made a lot of the fact that the TPC-C indexes are
~40% smaller, in part because lots of people outside the community
find TPC-C interesting, and in part because this patch series is
focused on cases where we currently do unusually badly (cases where
good intuitions about how B-Trees are supposed to perform break down
[1]). Those cases matter to many users much of the time, but certainly
not all users all of the time.

The patch series is actually supposed to *improve* the situation with
index buffer lock contention in general, and it looks like it manages
to do that with pgbench, which doesn't do inserts into indexes, except
for those required for non-HOT updates. pgbench requires relatively
few page splits, but is in every other sense a high contention
workload.

With pgbench scale factor 20, here are results for patch and master
with a Gaussian distribution on my 8 thread/4 core home server, with
each run reported lasting 10 minutes, repeating twice for client
counts 1, 2, 8, 16, and 64, patch and master branch:

\set aid random_gaussian(1, 100000 * :scale, 20)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
(:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

1st pass
========

(init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7203.983289 (including connections establishing)
tps = 7204.020457 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.026 ms
1 client patch:
tps = 7012.575167 (including connections establishing)
tps = 7012.590007 (excluding connections establishing)
latency average = 0.143 ms
latency stddev = 0.020 ms

2 clients master:
tps = 13434.043832 (including connections establishing)
tps = 13434.076194 (excluding connections establishing)
latency average = 0.149 ms
latency stddev = 0.032 ms
2 clients patch:
tps = 13105.620223 (including connections establishing)
tps = 13105.654109 (excluding connections establishing)
latency average = 0.153 ms
latency stddev = 0.033 ms

8 clients master:
tps = 27126.852038 (including connections establishing)
tps = 27126.986978 (excluding connections establishing)
latency average = 0.295 ms
latency stddev = 0.095 ms
8 clients patch:
tps = 27945.457965 (including connections establishing)
tps = 27945.565242 (excluding connections establishing)
latency average = 0.286 ms
latency stddev = 0.089 ms

16 clients master:
tps = 32297.612323 (including connections establishing)
tps = 32297.743929 (excluding connections establishing)
latency average = 0.495 ms
latency stddev = 0.185 ms
16 clients patch:
tps = 33434.889405 (including connections establishing)
tps = 33435.021738 (excluding connections establishing)
latency average = 0.478 ms
latency stddev = 0.167 ms

64 clients master:
tps = 25699.029787 (including connections establishing)
tps = 25699.217022 (excluding connections establishing)
latency average = 2.482 ms
latency stddev = 1.715 ms
64 clients patch:
tps = 26513.816673 (including connections establishing)
tps = 26514.013638 (excluding connections establishing)
latency average = 2.405 ms
latency stddev = 1.690 ms

2nd pass
========

(init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7172.995796 (including connections establishing)
tps = 7173.013472 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.022 ms
1 client patch:
tps = 7024.724365 (including connections establishing)
tps = 7024.739237 (excluding connections establishing)
latency average = 0.142 ms
latency stddev = 0.021 ms

2 clients master:
tps = 13489.016303 (including connections establishing)
tps = 13489.047968 (excluding connections establishing)
latency average = 0.148 ms
latency stddev = 0.032 ms
2 clients patch:
tps = 13210.292833 (including connections establishing)
tps = 13210.321528 (excluding connections establishing)
latency average = 0.151 ms
latency stddev = 0.029 ms

8 clients master:
tps = 27470.112858 (including connections establishing)
tps = 27470.229891 (excluding connections establishing)
latency average = 0.291 ms
latency stddev = 0.093 ms
8 clients patch:
tps = 28132.981815 (including connections establishing)
tps = 28133.096414 (excluding connections establishing)
latency average = 0.284 ms
latency stddev = 0.081 ms

16 clients master:
tps = 32409.399669 (including connections establishing)
tps = 32409.533400 (excluding connections establishing)
latency average = 0.493 ms
latency stddev = 0.182 ms
16 clients patch:
tps = 33678.304986 (including connections establishing)
tps = 33678.427420 (excluding connections establishing)
latency average = 0.475 ms
latency stddev = 0.168 ms

64 clients master:
tps = 25864.453485 (including connections establishing)
tps = 25864.639098 (excluding connections establishing)
latency average = 2.466 ms
latency stddev = 1.698 ms
64 clients patch:
tps = 26382.926218 (including connections establishing)
tps = 26383.166692 (excluding connections establishing)
latency average = 2.417 ms
latency stddev = 1.678 ms

There was a third run which has been omitted, because it's practically
the same as the first two. The order that results appear in is the
order things actually ran in (I like to interlace master and patch
runs closely).

Analysis
========

There seems to be a ~2% regression with one or two clients, but we
more than make up for that as the client count goes up -- the 8 and 64
client cases improve throughput by ~2.5%, and the 16 client case
improves throughput by ~4%. This seems like a totally reasonable
trade-off to me. As I said already, the patch isn't really about
workloads that we already do acceptably well on, such as this one, so
you're not expected to be impressed with these numbers. My goal is to
show that boring workloads that fit everything in shared_buffers
appear to be fine. I think that that's a reasonable conclusion, based
on these numbers. Lower client count cases are generally considered
less interesting, and also lose less in throughput than we go on to
gain later as more clients are added. I'd be surprised if anybody
complained.

I think that the explanation for the regression with one or two
clients boils down to this: We're making better decisions about where
to split pages, and even about how pages are accessed by index scans
(more on that in the next paragraph). However, this isn't completely
free (particularly the page split stuff), and it doesn't pay for
itself until the number of clients ramps up. Even so, not being more
careful about that stuff would be penny wise, pound foolish. I even suspect
that there are priority inversion issues when there is high contention
during unique index enforcement, which might be a big problem on
multi-socket machines with hundreds of clients. I am not in a position
to confirm that right now, but we have heard reports that are
consistent with this explanation at least once before now [2]. Zipfian
was also somewhat better when I last measured it, using the same
fairly modest machine -- I didn't repeat that here because I wanted
something simple and widely studied.

The patch establishes the principle that there is only one good reason
to visit more than one leaf page within index scans like those used by
pgbench: a concurrent page split, where the scan simply must go right
to find matches that were just missed in the first leaf page. That
should be very rare. We should never visit two leaf pages because
we're confused about where there might be matches. There is simply no
good reason for there to be any ambiguity or confusion.
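
To make that concrete, here is one way to observe how many pages a
pgbench-style lookup touches (an illustrative session against a pgbench
database, not output from the benchmark runs above):

-- "Buffers: shared hit" gives a rough upper bound on the number of index
-- and heap pages visited; barring a concurrent page split, only one leaf
-- page should be involved in the index scan.
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
SELECT abalance FROM pgbench_accounts WHERE aid = 12345;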

The patch could still make index scans like these visit more than a
single leaf page for a bad reason, at least in theory: when there is
at least ~400 duplicates in a unique index, and we therefore can't
possibly store them all on one leaf page, index scans will of course
have to visit more than one leaf page. Again, that should be very
rare. All index scans can now check the high key on the leaf level,
and avoid going right when they happen to be very close to the right
edge of the leaf page's key space. And, we never have to take the
scenic route when descending the tree on an equal internal page key,
since that condition has practically been eliminated by suffix
truncation. No new tuple can be equal to negative infinity, and
negative infinity appears in every pivot tuple. There is a place for
everything, and everything is in its place.

[1]: https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/
[2]: /messages/by-id/BF3B6F54-68C3-417A-BFAB-FB4D66F2B410@postgrespro.ru
--
Peter Geoghegan

#118 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#117)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Mon, Mar 18, 2019 at 7:34 PM Peter Geoghegan <pg@bowt.ie> wrote:

With pgbench scale factor 20, here are results for patch and master
with a Gaussian distribution on my 8 thread/4 core home server, with
each run reported lasting 10 minutes, repeating twice for client
counts 1, 2, 8, 16, and 64, patch and master branch:

1 client master:
tps = 7203.983289 (including connections establishing)
1 client patch:
tps = 7012.575167 (including connections establishing)

2 clients master:
tps = 13434.043832 (including connections establishing)
2 clients patch:
tps = 13105.620223 (including connections establishing)

Blech. I think the patch has enough other advantages that it's worth
accepting that, but it's not great. We seem to keep finding reasons
to reduce single client performance in the name of scalability, which
is often reasonable but not wonderful.

However, this isn't completely
free (particularly the page split stuff), and it doesn't pay for
itself until the number of clients ramps up.

I don't really understand that explanation. It makes sense that more
intelligent page split decisions could require more CPU cycles, but it
is not evident to me why more clients would help better page split
decisions pay off.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In reply to: Robert Haas (#118)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Mon, Mar 18, 2019 at 5:00 PM Robert Haas <robertmhaas@gmail.com> wrote:

Blech. I think the patch has enough other advantages that it's worth
accepting that, but it's not great. We seem to keep finding reasons
to reduce single client performance in the name of scalability, which
is often reasonable but not wonderful.

The good news is that the quicksort that we now perform in
nbtsplitloc.c is not optimized at all. Heikki thought it premature to
optimize that, for example by inlining/specializing the quicksort. I
can make that 3x faster fairly easily, which could easily change the
picture here. The code will be uglier that way, but not much more
complicated. I even prototyped this, and managed to make serial
microbenchmarks I've used noticeably faster. This could very well fix
the problem here. The quicksort overhead clearly showed up in perf
profiles with serial bulk loads.

However, this isn't completely
free (particularly the page split stuff), and it doesn't pay for
itself until the number of clients ramps up.

I don't really understand that explanation. It makes sense that more
intelligent page split decisions could require more CPU cycles, but it
is not evident to me why more clients would help better page split
decisions pay off.

Smarter choices on page splits pay off with higher client counts
because they reduce contention at likely hot points. It's kind of
crazy that the code in _bt_check_unique() sometimes has to move right,
while holding an exclusive buffer lock on the original page and a
shared buffer lock on its sibling page at the same time. It then has
to hold a third buffer lock concurrently, this time on any heap pages
it is interested in, each in turn, to check whether they might
conflict. gcov shows that that never happens with the regression
tests once the patch is applied (you can at least get away with only
having one buffer lock on a leaf page at all times in practically all
cases).

--
Peter Geoghegan

In reply to: Peter Geoghegan (#119)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Mon, Mar 18, 2019 at 5:12 PM Peter Geoghegan <pg@bowt.ie> wrote:

Smarter choices on page splits pay off with higher client counts
because they reduce contention at likely hot points. It's kind of
crazy that the code in _bt_check_unique() sometimes has to move right,
while holding an exclusive buffer lock on the original page and a
shared buffer lock on its sibling page at the same time. It then has
to hold a third buffer lock concurrently, this time on any heap pages
it is interested in.

Actually, by the time we get to 16 clients, this workload does make
the indexes and tables smaller. Here is pg_buffercache output
collected after the first 16 client case:

Master
======

        relname        │ relforknumber │ size_main_rel_fork_blocks │ buffer_count │   avg_buffer_usg
───────────────────────┼───────────────┼───────────────────────────┼──────────────┼────────────────────
 pgbench_history       │             0 │                   123,484 │      123,484 │ 4.9989715266755207
 pgbench_accounts      │             0 │                    34,665 │       10,682 │ 4.4948511514697622
 pgbench_accounts_pkey │             0 │                     5,708 │        1,561 │ 4.8731582319026265
 pgbench_tellers       │             0 │                       489 │          489 │ 5.0000000000000000
 pgbench_branches      │             0 │                       284 │          284 │ 5.0000000000000000
 pgbench_tellers_pkey  │             0 │                        56 │           56 │ 5.0000000000000000
....

Patch
=====

        relname        │ relforknumber │ size_main_rel_fork_blocks │ buffer_count │   avg_buffer_usg
───────────────────────┼───────────────┼───────────────────────────┼──────────────┼────────────────────
 pgbench_history       │             0 │                   127,864 │      127,864 │ 4.9980447975974473
 pgbench_accounts      │             0 │                    33,933 │        9,614 │ 4.3517786561264822
 pgbench_accounts_pkey │             0 │                     5,487 │        1,322 │ 4.8857791225416036
 pgbench_tellers       │             0 │                       204 │          204 │ 4.9803921568627451
 pgbench_branches      │             0 │                       198 │          198 │ 4.3535353535353535
 pgbench_tellers_pkey  │             0 │                        14 │           14 │ 5.0000000000000000
....
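
(The exact query that produced this output isn't shown here; something
along these lines, assuming the pg_buffercache extension is installed,
reports equivalent numbers:)

SELECT c.relname,
       b.relforknumber,
       pg_relation_size(c.oid, 'main') /
           current_setting('block_size')::int AS size_main_rel_fork_blocks,
       count(*) AS buffer_count,
       avg(b.usagecount) AS avg_buffer_usg
FROM pg_buffercache b
     JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
  AND b.relforknumber = 0
GROUP BY c.relname, b.relforknumber
ORDER BY buffer_count DESC;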

The main fork for pgbench_history is larger with the patch, which is
expected and good: higher throughput means more history rows.
pgbench_accounts_pkey is about 4% smaller, which is
probably the most interesting observation that can be made here, but
the tables are also smaller. pgbench_accounts itself is ~2% smaller.
pgbench_branches is ~30% smaller, and pgbench_tellers is 60% smaller.
Of course, the smaller tables were already very small, so maybe that
isn't important. I think that this is due to more effective pruning,
possibly because we get better lock arbitration as a consequence of
better splits, and also because duplicates are in heap TID order. I
haven't observed this effect with larger databases, which have been my
focus.

It isn't weird that shared_buffers doesn't have all the
pgbench_accounts blocks, since, of course, this is highly skewed by
design -- most blocks were never accessed from the table.

This effect seems to be robust, at least with this workload. The
second round of benchmarks (which have their own pgbench -i
initialization) show very similar amounts of bloat at the same point.
It may not be that significant, but it's also not a fluke.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#116)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Mon, Mar 18, 2019 at 10:17 AM Peter Geoghegan <pg@bowt.ie> wrote:

The big difference is that you make the possible call to
_bt_stepright() conditional on this being a checkingunique index --
the duplicate code is indented in that branch of _bt_findsplitloc().
Whereas I break early in the loop when "checkingunique &&
heapkeyspace".

Heikki and I discussed this issue privately, over IM, and reached
final agreement on remaining loose ends. I'm going to use his code for
_bt_findsplitloc(). Plan to push a final version of the first four
patches tomorrow morning PST.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#121)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Tue, Mar 19, 2019 at 4:15 PM Peter Geoghegan <pg@bowt.ie> wrote:

Heikki and I discussed this issue privately, over IM, and reached
final agreement on remaining loose ends. I'm going to use his code for
_bt_findsplitloc(). Plan to push a final version of the first four
patches tomorrow morning PST.

I've committed the first 4 patches. Many thanks to Heikki for his very
valuable help! Thanks also to the other reviewers.

I'll likely push the remaining two patches on Sunday or Monday.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#122)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Thu, Mar 21, 2019 at 10:28 AM Peter Geoghegan <pg@bowt.ie> wrote:

I've committed the first 4 patches. Many thanks to Heikki for his very
valuable help! Thanks also to the other reviewers.

I'll likely push the remaining two patches on Sunday or Monday.

I noticed that if I initdb and run "make installcheck" with and
without the "split after new tuple" optimization patch, the largest
system catalog indexes shrink quite noticeably:

Master
======
pg_depend_depender_index 1456 kB
pg_depend_reference_index 1416 kB
pg_class_tblspc_relfilenode_index 224 kB

Patch
=====
pg_depend_depender_index 1088 kB -- ~25% smaller
pg_depend_reference_index 1136 kB -- ~20% smaller
pg_class_tblspc_relfilenode_index 160 kB -- 28% smaller
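
(These sizes were taken from the regression database; a query roughly
like the following one, which is only a sketch of the measurement,
reports the largest catalog indexes:)

SELECT c.relname, pg_size_pretty(pg_relation_size(c.oid)) AS size
FROM pg_class c
WHERE c.relkind = 'i'
  AND c.relnamespace = 'pg_catalog'::regnamespace
ORDER BY pg_relation_size(c.oid) DESC
LIMIT 10;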

This is interesting to me because it is further evidence that the
problem that the patch targets is reasonably common. It's also
interesting to me because we benefit despite the fact that there are a
lot of duplicates in parts of these indexes; we vary our strategy in
different parts of the key space, which works well. We pack pages
tightly where they're full of duplicates, using the "single value"
strategy that I've already committed, while we apply the "split
after new tuple" optimization in parts of the index with localized
monotonically increasing insertions. If there were no duplicates in
the indexes, then they'd be about 40% smaller, which is exactly what
we see with the TPC-C indexes (they're all unique indexes, with very
few physical duplicates). Looks like the duplicates are mostly
bootstrap mode entries. Lots of the pg_depend_depender_index
duplicates look like "(classid, objid, objsubid)=(0, 0, 0)", for
example.
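
(A quick way to see those duplicates directly, purely for illustration:)

-- Count duplicate (classid, objid, objsubid) keys in pg_depend; the pinned
-- bootstrap entries show up as (0, 0, 0) with a very large count.
SELECT classid, objid, objsubid, count(*) AS ndups
FROM pg_depend
GROUP BY classid, objid, objsubid
ORDER BY ndups DESC
LIMIT 5;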

I also noticed one further difference: the pg_shdepend_depender_index
index grew from 40 kB to 48 kB. I guess that might count as a
regression, though I'm not sure that it should. I think that we would
do better if the volume of data in the underlying table was greater.
contrib/pageinspect shows that a small number of the leaf pages in the
improved cases are not very full at all, which is more than made up
for by the fact that many more pages are packed as if they were
created by a rightmost split (262 items of 24-byte tuples per page is
exactly consistent with that). IOW, I suspect that the extra page in
pg_shdepend_depender_index is due to a "local minimum".
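
(Something like the following pageinspect query, which is only a sketch
of that kind of check and assumes the pageinspect extension is
installed, shows per-leaf-page fill for the index in question:)

-- Block 0 is the metapage, so start at block 1; 'l' means leaf page.
SELECT s.blkno, s.live_items, s.avg_item_size, s.free_size
FROM generate_series(1,
         pg_relation_size('pg_shdepend_depender_index') /
         current_setting('block_size')::int - 1) AS ser(blkno),
     LATERAL bt_page_stats('pg_shdepend_depender_index', ser.blkno::int) AS s
WHERE s.type = 'l'
ORDER BY s.free_size DESC;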

--
Peter Geoghegan

In reply to: Peter Geoghegan (#123)
4 attachment(s)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

On Fri, Mar 22, 2019 at 2:15 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Mar 21, 2019 at 10:28 AM Peter Geoghegan <pg@bowt.ie> wrote:

I'll likely push the remaining two patches on Sunday or Monday.

I noticed that if I initdb and run "make installcheck" with and
without the "split after new tuple" optimization patch, the largest
system catalog indexes shrink quite noticeably:

I pushed this final patch a week ago, as commit f21668f3, concluding
work on integrating the patch series.

I have some closing thoughts that I would like to share to wrap up the
project. I was casually discussing this project over IM with Robert
the other day. I was asked a question I'd often asked myself about the
"split after new item" heuristics: What if you're wrong? What if some
"black swan" type workload fools your heuristics into bloating an
index uncontrollably?

I gave an answer to his question that may have seemed kind of
inscrutable. My intuition about the worst case for the heuristics is
based on its similarity to the worst case for quicksort. Any
real-world instance of quicksort going quadratic is essentially a case
where we *consistently* do the wrong thing when selecting a pivot. A
random pivot selection will still perform reasonably well, because
we'll choose a pivot close to the median on average. A malicious actor will
always be able to fool any quicksort implementation into going
quadratic [1] in certain circumstances. We're defending against
Murphy, not Machiavelli, though, so that's okay.

I think that I can produce a more tangible argument than this, though.
The attached patch removes every heuristic that limits the application
of the "split after new item" optimization (it doesn't force the
optimization in the case of rightmost splits, or in the case where the
new item happens to be first on the page, since the caller isn't
prepared for that). This is an attempt to come up with a wildly exaggerated
worst case. Nevertheless, the consequences are not actually all that
bad. Summary:

* The "UK land registry" test case that I leaned on a lot for the
patch has a final index that's about 1% larger. However, it was about
16% smaller compared to Postgres without the patch, so this is not a
problem.

* Most of the TPC-C indexes are actually slightly smaller, because we
didn't quite go as far as we could have (TPC-C strongly rewards this
optimization). 8 out of the 10 indexes are either smaller or
unchanged. The customer name index is about 28% larger, though. The
oorder table index is also about 28% larger.

* TPC-E never benefits from the "split after new item" optimization,
and yet the picture isn't so bad here either. The holding history PK
is about 40% bigger, which is quite bad, and the biggest regression
overall. However, in other affected cases indexes are about 15%
larger, which is not that bad.

Also attached are the regressions from my test suite in the form of
diff files -- these are the full details of the regression, just in
case that's interesting to somebody.

This isn't the final word. I'm not asking anybody to accept with total
certainty that there can never be a "black swan" workload that the
heuristics consistently mishandle, leading to pathological
performance. However, I think it's fair to say that the risk of that
happening has been managed well. The attached test patch literally
removes any restraint on applying the optimization, and yet we
arguably do no worse than Postgres 11 would overall.

Once again, I would like to thank my collaborators for all their help,
especially Heikki.

[1]: https://www.cs.dartmouth.edu/~doug/mdmspe.pdf
--
Peter Geoghegan

Attachments:

always-split-after-new-item.patch (application/octet-stream)
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 50d2faba29..c792c5948c 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -601,87 +601,17 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 
 	nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 
-	/* Single key indexes not considered here */
-	if (nkeyatts == 1)
-		return false;
-
 	/* Ascending insertion pattern never inferred when new item is first */
 	if (state->newitemoff == P_FIRSTKEY)
 		return false;
-
-	/*
-	 * Only apply optimization on pages with equisized tuples, since ordinal
-	 * keys are likely to be fixed-width.  Testing if the new tuple is
-	 * variable width directly might also work, but that fails to apply the
-	 * optimization to indexes with a numeric_ops attribute.
-	 *
-	 * Conclude that page has equisized tuples when the new item is the same
-	 * width as the smallest item observed during pass over page, and other
-	 * non-pivot tuples must be the same width as well.  (Note that the
-	 * possibly-truncated existing high key isn't counted in
-	 * olddataitemstotal, and must be subtracted from maxoff.)
-	 */
-	if (state->newitemsz != state->minfirstrightsz)
-		return false;
-	if (state->newitemsz * (maxoff - 1) != state->olddataitemstotal)
-		return false;
-
-	/*
-	 * Avoid applying optimization when tuples are wider than a tuple
-	 * consisting of two non-NULL int8/int64 attributes (or four non-NULL
-	 * int4/int32 attributes)
-	 */
-	if (state->newitemsz >
-		MAXALIGN(sizeof(IndexTupleData) + sizeof(int64) * 2) +
-		sizeof(ItemIdData))
-		return false;
-
-	/*
-	 * At least the first attribute's value must be equal to the corresponding
-	 * value in previous tuple to apply optimization.  New item cannot be a
-	 * duplicate, either.
-	 *
-	 * Handle case where new item is to the right of all items on the existing
-	 * page.  This is suggestive of monotonically increasing insertions in
-	 * itself, so the "heap TID adjacency" test is not applied here.
-	 */
 	if (state->newitemoff > maxoff)
 	{
-		itemid = PageGetItemId(state->page, maxoff);
-		tup = (IndexTuple) PageGetItem(state->page, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
-
-		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
 			*usemult = true;
 			return true;
 		}
-
-		return false;
 	}
 
-	/*
-	 * "Low cardinality leading column, high cardinality suffix column"
-	 * indexes with a random insertion pattern (e.g., an index with a boolean
-	 * column, such as an index on '(book_is_in_print, book_isbn)') present us
-	 * with a risk of consistently misapplying the optimization.  We're
-	 * willing to accept very occasional misapplication of the optimization,
-	 * provided the cases where we get it wrong are rare and self-limiting.
-	 *
-	 * Heap TID adjacency strongly suggests that the item just to the left was
-	 * inserted very recently, which limits overapplication of the
-	 * optimization.  Besides, all inappropriate cases triggered here will
-	 * still split in the middle of the page on average.
-	 */
-	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
-	tup = (IndexTuple) PageGetItem(state->page, itemid);
-	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
-		return false;
-	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
-
-	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
 		double		interp = (double) state->newitemoff / ((double) maxoff + 1);
 		double		leaffillfactormult = (double) leaffillfactor / 100.0;
land_balanced.diff (application/octet-stream)
--- ./patch/land_balanced_expected.out	2019-03-09 11:16:49.387280644 -0800
+++ ./output/land_balanced_results.out	2019-04-01 14:45:02.861656881 -0700
@@ -1,1491 +1,1500 @@
  land2      | composite  | R         |      1 |   12.000       |   12.000         |   12.000       |   12.000       |   32.000
- land2      | composite  | I         |   1485 |   96.907       |   61.000         |   50.000       |  194.000       |   51.115
- land2      | composite  | L         | 140951 |  153.223       |   56.000         |    2.000       |  284.000       |   44.455
+ land2      | composite  | I         |   1494 |   97.108       |   64.000         |   50.000       |  195.000       |   51.055
+ land2      | composite  | L         | 142106 |  151.986       |   47.000         |    2.000       |  284.000       |   44.449
tpcc_balanced.diff (application/octet-stream)
--- ./patch/tpcc_balanced_expected.out	2019-03-09 11:16:49.387280644 -0800
+++ ./output/tpcc_balanced_results.out	2019-04-01 14:42:52.301507209 -0700
@@ -1,9 +1,9 @@
  customer2   | customer_pkey2                        | R         |      1 |   28.000       |   28.000         |   28.000       |   28.000       |   22.000
- customer2   | customer_pkey2                        | I         |     28 |  211.929       |  147.000         |  147.000       |  287.000       |   23.000
- customer2   | customer_pkey2                        | L         |   5907 |  254.936       |  146.000         |   18.000       |  291.000       |   23.992
- customer2   | idx_customer_name2                    | R         |      1 |   98.000       |   98.000         |   98.000       |   98.000       |   30.000
- customer2   | idx_customer_name2                    | I         |     98 |  137.255       |  110.000         |  109.000       |  218.000       |   32.173
- customer2   | idx_customer_name2                    | L         |  13354 |  113.326       |   86.000         |   69.000       |  172.000       |   43.925
+ customer2   | customer_pkey2                        | I         |     28 |  207.893       |  147.000         |  142.000       |  278.000       |   23.000
+ customer2   | customer_pkey2                        | L         |   5794 |  259.888       |  262.000         |   29.000       |  291.000       |   23.992
+ customer2   | idx_customer_name2                    | R         |      1 |  142.000       |  142.000         |  142.000       |  142.000       |   31.000
+ customer2   | idx_customer_name2                    | I         |    142 |  121.599       |   97.000         |   96.000       |  200.000       |   36.958
+ customer2   | idx_customer_name2                    | L         |  17126 |   88.586       |   20.000         |    3.000       |  172.000       |   43.933
  district2   | district_pkey2                        | R         |      1 |    2.000       |    2.000         |    2.000       |    2.000       |   12.000
  district2   | district_pkey2                        | L         |      2 |  250.500       |  130.000         |  130.000       |  371.000       |   16.000
  item2       | item_pkey2                            | R         |      1 |  274.000       |  274.000         |  274.000       |  274.000       |   15.000
@@ -11,17 +11,17 @@
  new_order2  | new_order_pkey2                       | R         |      1 |    8.000       |    8.000         |    8.000       |    8.000       |   17.000
  new_order2  | new_order_pkey2                       | I         |      8 |  208.125       |  141.000         |  141.000       |  237.000       |   23.000
  new_order2  | new_order_pkey2                       | L         |   1658 |  256.281       |  246.000         |   31.000       |  289.000       |   23.963
- oorder2     | oorder_o_w_id_o_d_id_o_c_id_o_id_key2 | R         |      1 |   41.000       |   41.000         |   41.000       |   41.000       |   23.000
- oorder2     | oorder_o_w_id_o_d_id_o_c_id_o_id_key2 | I         |     41 |  189.317       |  147.000         |  146.000       |  275.000       |   23.000
- oorder2     | oorder_o_w_id_o_d_id_o_c_id_o_id_key2 | L         |   7722 |  195.250       |  147.000         |    8.000       |  291.000       |   23.995
- oorder2     | oorder_pkey2                          | R         |      1 |   33.000       |   33.000         |   33.000       |   33.000       |   22.000
- oorder2     | oorder_pkey2                          | I         |     33 |  177.091       |  147.000         |  125.000       |  279.000       |   23.000
- oorder2     | oorder_pkey2                          | L         |   5812 |  259.087       |  262.000         |   38.000       |  289.000       |   23.987
+ oorder2     | oorder_o_w_id_o_d_id_o_c_id_o_id_key2 | R         |      1 |   50.000       |   50.000         |   50.000       |   50.000       |   23.000
+ oorder2     | oorder_o_w_id_o_d_id_o_c_id_o_id_key2 | I         |     50 |  198.960       |  155.000         |  146.000       |  243.000       |   23.000
+ oorder2     | oorder_o_w_id_o_d_id_o_c_id_o_id_key2 | L         |   9899 |  152.530       |   32.000         |    3.000       |  291.000       |   23.997
+ oorder2     | oorder_pkey2                          | R         |      1 |   32.000       |   32.000         |   32.000       |   32.000       |   21.000
+ oorder2     | oorder_pkey2                          | I         |     32 |  182.500       |  147.000         |  100.000       |  279.000       |   23.000
+ oorder2     | oorder_pkey2                          | L         |   5809 |  259.220       |  262.000         |   38.000       |  289.000       |   23.987
  order_line2 | order_line_pkey2                      | R         |      1 |    2.000       |    2.000         |    2.000       |    2.000       |   16.000
- order_line2 | order_line_pkey2                      | I         |    371 |  161.943       |  147.000         |  146.000       |  292.000       |   23.000
- order_line2 | order_line_pkey2                      | L         |  59343 |  253.754       |  191.000         |   27.000       |  291.000       |   23.999
+ order_line2 | order_line_pkey2                      | I         |    371 |  161.790       |  147.000         |  146.000       |  292.000       |   23.000
+ order_line2 | order_line_pkey2                      | L         |  59286 |  253.997       |  197.000         |   27.000       |  291.000       |   23.999
  stock2      | stock_pkey2                           | R         |      1 |   50.000       |   50.000         |   50.000       |   50.000       |   15.000
- stock2      | stock_pkey2                           | I         |     50 |  275.700       |  205.000         |  205.000       |  360.000       |   15.000
- stock2      | stock_pkey2                           | L         |  13736 |  365.007       |  367.000         |   35.000       |  407.000       |   16.000
+ stock2      | stock_pkey2                           | I         |     50 |  275.000       |  205.000         |  173.000       |  360.000       |   15.000
+ stock2      | stock_pkey2                           | L         |  13701 |  365.937       |  367.000         |   35.000       |  407.000       |   16.000
  warehouse2  | warehouse_pkey2                       | L         |      1 |   50.000       |   50.000         |   50.000       |   50.000       |   16.000
 
tpce_balanced.diff (application/octet-stream)
--- ./patch/tpce_balanced_expected.out	2019-03-08 19:08:42.001831881 -0800
+++ ./output/tpce_balanced_results.out	2019-04-01 14:52:40.870463831 -0700
@@ -1,51 +1,51 @@
  account_permission2 | pk_account_permission2 | R         |      1 |   36.000       |   36.000         |   36.000       |   36.000       |   15.000
  account_permission2 | pk_account_permission2 | L         |     36 |  198.972       |  203.000         |   34.000       |  204.000       |   31.028
  broker2             | pk_broker2             | L         |      1 |   10.000       |   10.000         |   10.000       |   10.000       |   16.000
- cash_transaction2   | pk_cash_transaction2   | R         |      1 |  234.000       |  234.000         |  234.000       |  234.000       |   15.000
- cash_transaction2   | pk_cash_transaction2   | I         |    234 |  285.808       |  286.000         |  241.000       |  286.000       |   15.000
- cash_transaction2   | pk_cash_transaction2   | L         |  66646 |  239.527       |  219.000         |  106.000       |  407.000       |   16.000
+ cash_transaction2   | pk_cash_transaction2   | R         |      1 |  265.000       |  265.000         |  265.000       |  265.000       |   15.000
+ cash_transaction2   | pk_cash_transaction2   | I         |    265 |  286.453       |  286.000         |  286.000       |  406.000       |   15.000
+ cash_transaction2   | pk_cash_transaction2   | L         |  75646 |  211.148       |   43.000         |    3.000       |  407.000       |   16.000
  charge2             | pk_charge2             | L         |      1 |   15.000       |   15.000         |   15.000       |   15.000       |   16.000
  commission_rate2    | pk_commission_rate2    | L         |      1 |  240.000       |  240.000         |  240.000       |  240.000       |   26.000
- company2            | i_co_name2             | R         |      1 |    5.000       |    5.000         |    5.000       |    5.000       |   32.000
- company2            | i_co_name2             | L         |      5 |  100.800       |   23.000         |   23.000       |  141.000       |   35.000
+ company2            | i_co_name2             | R         |      1 |    5.000       |    5.000         |    5.000       |    5.000       |   27.000
+ company2            | i_co_name2             | L         |      5 |  100.800       |   23.000         |   23.000       |  201.000       |   34.800
  company2            | pk_company2            | R         |      1 |    2.000       |    2.000         |    2.000       |    2.000       |   12.000
  company2            | pk_company2            | L         |      2 |  250.500       |  134.000         |  134.000       |  367.000       |   16.000
  company_competitor2 | pk_company_competitor2 | R         |      1 |    8.000       |    8.000         |    8.000       |    8.000       |   15.000
  company_competitor2 | pk_company_competitor2 | L         |      8 |  188.375       |   72.000         |   72.000       |  205.000       |   31.125
- customer2           | i_c_tax_id2            | R         |      1 |    9.000       |    9.000         |    9.000       |    9.000       |   22.000
- customer2           | i_c_tax_id2            | L         |      9 |  223.111       |  213.000         |  213.000       |  254.000       |   24.000
+ customer2           | i_c_tax_id2            | R         |      1 |   13.000       |   13.000         |   13.000       |   13.000       |   22.000
+ customer2           | i_c_tax_id2            | L         |     13 |  154.769       |   35.000         |   35.000       |  273.000       |   24.000
  customer_account2   | i_ca_c_id2             | R         |      1 |   14.000       |   14.000         |   14.000       |   14.000       |   15.000
  customer_account2   | i_ca_c_id2             | L         |     14 |  358.071       |  239.000         |  239.000       |  370.000       |   16.000
  customer_account2   | pk_customer_account2   | R         |      1 |   14.000       |   14.000         |   14.000       |   14.000       |   15.000
  customer_account2   | pk_customer_account2   | L         |     14 |  358.071       |  242.000         |  242.000       |  367.000       |   16.000
- customer_taxrate2   | pk_customer_taxrate2   | R         |      1 |   14.000       |   14.000         |   14.000       |   14.000       |   17.000
- customer_taxrate2   | pk_customer_taxrate2   | L         |     14 |  143.786       |   32.000         |   32.000       |  267.000       |   23.286
+ customer_taxrate2   | pk_customer_taxrate2   | R         |      1 |   12.000       |   12.000         |   12.000       |   12.000       |   16.000
+ customer_taxrate2   | pk_customer_taxrate2   | L         |     12 |  167.583       |   68.000         |   68.000       |  264.000       |   23.250
  daily_market2       | i_dm_s_symb2           | R         |      1 |   14.000       |   14.000         |   14.000       |   14.000       |   15.000
  daily_market2       | i_dm_s_symb2           | I         |     14 |  196.643       |  104.000         |  104.000       |  293.000       |   21.000
  daily_market2       | i_dm_s_symb2           | L         |   2740 |  327.250       |  136.000         |  135.000       |  391.000       |   16.000
- daily_market2       | pk_daily_market2       | R         |      1 |   20.000       |   20.000         |   20.000       |   20.000       |   15.000
- daily_market2       | pk_daily_market2       | I         |     20 |  205.600       |  135.000         |  135.000       |  305.000       |   21.500
- daily_market2       | pk_daily_market2       | L         |   4093 |  219.403       |  158.000         |  153.000       |  319.000       |   21.493
+ daily_market2       | pk_daily_market2       | R         |      1 |   23.000       |   23.000         |   23.000       |   23.000       |   15.000
+ daily_market2       | pk_daily_market2       | I         |     23 |  240.609       |  181.000         |  162.000       |  306.000       |   20.957
+ daily_market2       | pk_daily_market2       | L         |   5512 |  163.178       |   35.000         |    3.000       |  320.000       |   21.481
  exchange2           | pk_exchange2           | L         |      1 |    4.000       |    4.000         |    4.000       |    4.000       |   16.000
  financial2          | pk_financial2          | R         |      1 |   39.000       |   39.000         |   39.000       |   39.000       |   15.000
  financial2          | pk_financial2          | L         |     39 |  257.385       |  261.000         |  120.000       |  261.000       |   23.026
- holding2            | i_holding2             | R         |      1 |   21.000       |   21.000         |   21.000       |   21.000       |   16.000
- holding2            | i_holding2             | I         |     21 |  209.333       |  205.000         |  174.000       |  218.000       |   22.714
- holding2            | i_holding2             | L         |   4376 |  203.648       |  137.000         |   17.000       |  291.000       |   23.895
- holding2            | pk_holding2            | R         |      1 |   11.000       |   11.000         |   11.000       |   11.000       |   15.000
- holding2            | pk_holding2            | I         |     11 |  277.909       |  235.000         |  235.000       |  328.000       |   15.000
- holding2            | pk_holding2            | L         |   3047 |  292.036       |  211.000         |  122.000       |  407.000       |   16.000
- holding_history2    | i_hh_t_id2             | R         |      1 |  360.000       |  360.000         |  360.000       |  360.000       |   15.000
- holding_history2    | i_hh_t_id2             | I         |    360 |  285.553       |  286.000         |  125.000       |  286.000       |   15.000
- holding_history2    | i_hh_t_id2             | L         | 102440 |  227.059       |  214.000         |  203.000       |  405.000       |   16.000
- holding_history2    | pk_holding_history2    | R         |      1 |  392.000       |  392.000         |  392.000       |  392.000       |   15.000
- holding_history2    | pk_holding_history2    | I         |    392 |  286.189       |  256.000         |  212.000       |  394.000       |   15.000
- holding_history2    | pk_holding_history2    | L         | 111795 |  208.143       |  164.000         |   16.000       |  291.000       |   23.000
+ holding2            | i_holding2             | R         |      1 |   23.000       |   23.000         |   23.000       |   23.000       |   17.000
+ holding2            | i_holding2             | I         |     23 |  206.696       |  203.000         |  119.000       |  219.000       |   22.957
+ holding2            | i_holding2             | L         |   4732 |  188.402       |   35.000         |    6.000       |  291.000       |   23.913
+ holding2            | pk_holding2            | R         |      1 |   13.000       |   13.000         |   13.000       |   13.000       |   15.000
+ holding2            | pk_holding2            | I         |     13 |  323.615       |  212.000         |  212.000       |  405.000       |   15.000
+ holding2            | pk_holding2            | L         |   4195 |  212.391       |   43.000         |    3.000       |  407.000       |   16.000
+ holding_history2    | i_hh_t_id2             | R         |      1 |  388.000       |  388.000         |  388.000       |  388.000       |   15.000
+ holding_history2    | i_hh_t_id2             | I         |    388 |  285.848       |  286.000         |  227.000       |  286.000       |   15.000
+ holding_history2    | i_hh_t_id2             | L         | 110522 |  210.529       |   43.000         |    3.000       |  407.000       |   16.000
+ holding_history2    | pk_holding_history2    | R         |      1 |    2.000       |    2.000         |    2.000       |    2.000       |   12.000
+ holding_history2    | pk_holding_history2    | I         |    535 |  292.596       |  231.000         |  191.000       |  408.000       |   15.656
+ holding_history2    | pk_holding_history2    | L         | 155473 |  149.949       |   31.000         |    3.000       |  291.000       |   22.995
  holding_summary2    | pk_holding_summary2    | R         |      1 |  193.000       |  193.000         |  193.000       |  193.000       |   17.000
  holding_summary2    | pk_holding_summary2    | L         |    193 |  261.482       |  258.000         |  170.000       |  266.000       |   23.233
  industry2           | pk_industry2           | L         |      1 |  102.000       |  102.000         |  102.000       |  102.000       |   16.000
- last_trade2         | pk_last_trade2         | R         |      1 |    3.000       |    3.000         |    3.000       |    3.000       |   13.000
- last_trade2         | pk_last_trade2         | L         |      3 |  229.000       |   44.000         |   44.000       |  394.000       |   16.000
+ last_trade2         | pk_last_trade2         | R         |      1 |    5.000       |    5.000         |    5.000       |    5.000       |   14.000
+ last_trade2         | pk_last_trade2         | L         |      5 |  137.800       |   29.000         |   29.000       |  341.000       |   16.000
  news_item2          | pk_news_item2          | R         |      1 |    3.000       |    3.000         |    3.000       |    3.000       |   13.000
  news_item2          | pk_news_item2          | L         |      3 |  334.000       |  268.000         |  268.000       |  367.000       |   16.000
  news_xref2          | pk_news_xref2          | R         |      1 |    4.000       |    4.000         |    4.000       |    4.000       |   14.000
@@ -53,36 +53,36 @@
  sector2             | pk_sector2             | L         |      1 |   12.000       |   12.000         |   12.000       |   12.000       |   16.000
  security2           | i_security2            | R         |      1 |    3.000       |    3.000         |    3.000       |    3.000       |   13.000
  security2           | i_security2            | L         |      3 |  229.000       |  164.000         |  164.000       |  262.000       |   23.333
- security2           | pk_security2           | R         |      1 |    3.000       |    3.000         |    3.000       |    3.000       |   13.000
- security2           | pk_security2           | L         |      3 |  229.000       |   44.000         |   44.000       |  394.000       |   16.000
- settlement2         | pk_settlement2         | R         |      1 |  269.000       |  269.000         |  269.000       |  269.000       |   15.000
- settlement2         | pk_settlement2         | I         |    269 |  286.171       |  286.000         |  286.000       |  332.000       |   15.000
- settlement2         | pk_settlement2         | L         |  76712 |  226.258       |  216.000         |   47.000       |  385.000       |   16.000
+ security2           | pk_security2           | R         |      1 |    5.000       |    5.000         |    5.000       |    5.000       |   14.000
+ security2           | pk_security2           | L         |      5 |  137.800       |   29.000         |   29.000       |  341.000       |   16.000
+ settlement2         | pk_settlement2         | R         |      1 |  289.000       |  289.000         |  289.000       |  289.000       |   15.000
+ settlement2         | pk_settlement2         | I         |    289 |  285.685       |  286.000         |  195.000       |  286.000       |   15.000
+ settlement2         | pk_settlement2         | L         |  82275 |  211.027       |   43.000         |    3.000       |  407.000       |   16.000
  status_type2        | pk_status_type2        | L         |      1 |    5.000       |    5.000         |    5.000       |    5.000       |   16.000
  taxrate2            | pk_taxrate2            | L         |      1 |  320.000       |  320.000         |  320.000       |  320.000       |   16.000
- trade2              | i_t_ca_id2             | R         |      1 |  207.000       |  207.000         |  207.000       |  207.000       |   16.000
- trade2              | i_t_ca_id2             | I         |    207 |  225.976       |  154.000         |  114.000       |  299.000       |   22.758
+ trade2              | i_t_ca_id2             | R         |      1 |  204.000       |  204.000         |  204.000       |  204.000       |   15.000
+ trade2              | i_t_ca_id2             | I         |    204 |  229.284       |  152.000         |  114.000       |  296.000       |   22.789
  trade2              | i_t_ca_id2             | L         |  46571 |  372.046       |  201.000         |   18.000       |  407.000       |   16.000
  trade2              | i_t_s_symb2            | R         |      1 |  203.000       |  203.000         |  203.000       |  203.000       |   19.000
- trade2              | i_t_s_symb2            | I         |    203 |  220.808       |  150.000         |  119.000       |  290.000       |   23.000
+ trade2              | i_t_s_symb2            | I         |    203 |  220.808       |  149.000         |  117.000       |  293.000       |   23.000
  trade2              | i_t_s_symb2            | L         |  44622 |  388.253       |  391.000         |   18.000       |  406.000       |   16.000
  trade2              | i_t_st_id2             | R         |      1 |  217.000       |  217.000         |  217.000       |  217.000       |   23.000
  trade2              | i_t_st_id2             | I         |    217 |  205.180       |  205.000         |  205.000       |  244.000       |   23.000
  trade2              | i_t_st_id2             | L         |  44308 |  390.997       |  391.000         |  270.000       |  391.000       |   16.000
- trade2              | pk_trade2              | R         |      1 |  269.000       |  269.000         |  269.000       |  269.000       |   15.000
- trade2              | pk_trade2              | I         |    269 |  286.171       |  286.000         |  286.000       |  332.000       |   15.000
- trade2              | pk_trade2              | L         |  76712 |  226.258       |  216.000         |   47.000       |  385.000       |   16.000
- trade_history2      | pk_trade_history2      | R         |      1 |    3.000       |    3.000         |    3.000       |    3.000       |   13.000
- trade_history2      | pk_trade_history2      | I         |    867 |  286.030       |  286.000         |  286.000       |  304.000       |   15.000
- trade_history2      | pk_trade_history2      | L         | 246259 |  169.391       |   85.000         |    6.000       |  291.000       |   22.999
+ trade2              | pk_trade2              | R         |      1 |  289.000       |  289.000         |  289.000       |  289.000       |   15.000
+ trade2              | pk_trade2              | I         |    289 |  285.685       |  286.000         |  195.000       |  286.000       |   15.000
+ trade2              | pk_trade2              | L         |  82275 |  211.027       |   43.000         |    3.000       |  407.000       |   16.000
+ trade_history2      | pk_trade_history2      | R         |      1 |    4.000       |    4.000         |    4.000       |    4.000       |   14.000
+ trade_history2      | pk_trade_history2      | I         |   1057 |  265.702       |  262.000         |  198.000       |  311.000       |   16.994
+ trade_history2      | pk_trade_history2      | L         | 278739 |  149.770       |   32.000         |    4.000       |  291.000       |   23.191
  trade_type2         | pk_trade_type2         | L         |      1 |    5.000       |    5.000         |    5.000       |    5.000       |   16.000
- watch_item2         | pk_watch_item2         | R         |      1 |    3.000       |    3.000         |    3.000       |    3.000       |   16.000
- watch_item2         | pk_watch_item2         | I         |      3 |  211.333       |  196.000         |  196.000       |  222.000       |   23.000
- watch_item2         | pk_watch_item2         | L         |    632 |  160.179       |   33.000         |    8.000       |  291.000       |   23.957
+ watch_item2         | pk_watch_item2         | R         |      1 |    3.000       |    3.000         |    3.000       |    3.000       |   18.000
+ watch_item2         | pk_watch_item2         | I         |      3 |  225.333       |  200.000         |  200.000       |  242.000       |   23.000
+ watch_item2         | pk_watch_item2         | L         |    674 |  150.260       |   26.000         |    3.000       |  291.000       |   23.969
  watch_list2         | i_wl_c_id2             | R         |      1 |    3.000       |    3.000         |    3.000       |    3.000       |   13.000
  watch_list2         | i_wl_c_id2             | L         |      3 |  334.000       |  268.000         |  268.000       |  367.000       |   16.000
  watch_list2         | pk_watch_list2         | R         |      1 |    5.000       |    5.000         |    5.000       |    5.000       |   14.000
- watch_list2         | pk_watch_list2         | L         |      5 |  200.800       |  101.000         |  101.000       |  227.000       |   16.000
- zip_code2           | pk_zip_code2           | R         |      1 |   64.000       |   64.000         |   64.000       |   64.000       |   15.000
- zip_code2           | pk_zip_code2           | L         |     64 |  231.313       |  204.000         |   88.000       |  399.000       |   16.000
+ watch_list2         | pk_watch_list2         | L         |      5 |  200.800       |   93.000         |   93.000       |  348.000       |   16.000
+ zip_code2           | pk_zip_code2           | R         |      1 |   50.000       |   50.000         |   50.000       |   50.000       |   15.000
+ zip_code2           | pk_zip_code2           | L         |     50 |  295.800       |   75.000         |   63.000       |  367.000       |   16.000
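
(For reference, the per-level figures above appear to be page statistics
broken out by B-Tree page type: R = root, I = internal, L = leaf.  A query
along the following lines, built on the pageinspect extension's
bt_page_stats(), can produce a similar per-level summary.  This is only a
sketch against one example index, "pk_trade2", and is not necessarily the
exact query used to generate the numbers shown.)

    -- Per-level summary of live items and average item size for one index,
    -- skipping block 0 (the metapage).  Requires the pageinspect extension.
    SELECT s.type,
           count(*)             AS pages,
           avg(s.live_items)    AS avg_live_items,
           min(s.live_items)    AS min_live_items,
           max(s.live_items)    AS max_live_items,
           avg(s.avg_item_size) AS avg_item_size
    FROM generate_series(1,
           pg_relation_size('pk_trade2') /
           current_setting('block_size')::bigint - 1) AS blkno,
         LATERAL bt_page_stats('pk_trade2', blkno::int) AS s
    GROUP BY s.type
    ORDER BY s.type;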