64-bit XIDs in deleted nbtree pages
There is a long-standing problem with the way that nbtree page
deletion places deleted pages in the FSM for recycling: the use of a
32-bit XID within the deleted page (in the BTPageOpaqueData struct's
btpo.xact field, stored in the page's special area) is not robust
against XID wraparound, which can lead to permanently leaking pages in
a variety of scenarios. The problem became worse with the addition of
the INDEX_CLEANUP option in Postgres 12 [1]. And, using a 32-bit XID
in this context creates risk for any further improvements in VACUUM
that similarly involve skipping whole indexes. For example, Masahiko
has been working on a patch that teaches VACUUM to skip indexes that
are known to have very little garbage [2].
The attached patch series fixes the issue once and for all. This is
something that I'm targeting for Postgres 14, since it's more or less
a bug fix.
The first patch teaches nbtree to use 64-bit transaction IDs here, and
so makes it impossible to leak deleted nbtree pages. This patch is the
nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
due to exactly the same set of problems. The first patch also makes
the level field stored in the nbtree page special area (the
BTPageOpaqueData struct) reliably contain the page's level, even in a
deleted page. This allows amcheck to consistently rely on the level
field, even within deleted pages.
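
To give a quick sense of the new representation, the core of what the
first patch stores in a deleted page looks like this (condensed from
the nbtree.h changes in the attached patch):

typedef struct BTDeletedPageContents
{
    /* last xid which could see the page in a scan */
    FullTransactionId safexid;
} BTDeletedPageContents;

static inline FullTransactionId
BTPageGetDeleteXid(Page page)
{
    /* only valid for a BTP_DELETED page that also has BTP_HAS_FULLXID set */
    BTDeletedPageContents *contents =
        (BTDeletedPageContents *) PageGetContents(page);

    return contents->safexid;
}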
Of course it will still be possible for the FSM to leak deleted nbtree
index pages with the patch applied -- in general the FSM isn't
crash-safe. That isn't so bad with the patch, though, because a
subsequent VACUUM will eventually notice the really old deleted pages
and add them back to the FSM once again. This will always happen
because VACUUM/_bt_getbuf()/_bt_page_recyclable() can no longer become
confused about the age of deleted pages, no matter how old they are.
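
The key test is condensed below from the first patch's
_bt_page_recyclable() (see the patch itself for the full version,
including the surrounding comments):

    if (P_ISDELETED(opaque))
    {
        /*
         * pg_upgrade'd deleted page that lacks a full 64-bit safexid: the
         * system must have restarted since the page was deleted, so just
         * treat it as recyclable
         */
        if (!P_HAS_FULLXID(opaque))
            return true;

        /*
         * Otherwise recycle only once the deletion XID is visible to
         * everyone -- a 64-bit comparison that cannot be confused by
         * wraparound
         */
        return GlobalVisCheckRemovableFullXid(NULL, BTPageGetDeleteXid(page));
    }

    return false;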
The second patch in the series adds new information to VACUUM VERBOSE.
This makes it easier to understand what's going on here: the output
related to index page deletion becomes more useful. It might also help
with debugging the first patch.
Currently, VACUUM VERBOSE output for an index that has some page
deletions looks like this:
"38 index pages have been deleted, 38 are currently reusable."
With the second patch applied, we might instead see this at the same
point in the VERBOSE output:
"38 index pages have been deleted, 0 are newly deleted, 38 are
currently reusable."
This means that out of the 38 pages that were found to be marked
deleted in the index, 0 were deleted by the VACUUM operation
whose output we see here. That is, there were 0 nbtree pages that were
newly marked BTP_DELETED within _bt_unlink_halfdead_page() during
*this particular* VACUUM -- the VACUUM operation that we see
instrumentation about here. It follows that the 38 deleted pages that
we encountered must have been marked BTP_DELETED by some previous
VACUUM operation.
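
"Newly marked BTP_DELETED" corresponds to this step in
_bt_unlink_halfdead_page() with the first patch applied (condensed from
the attached patch):

    /*
     * Store upper bound XID that's used to determine when the deleted page
     * is no longer needed as a tombstone
     */
    safexid = ReadNextFullTransactionId();
    BTPageSetDeleted(page, safexid);
    opaque->btpo_cycleid = 0;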
In practice the "%u are currently reusable" output should never
include newly deleted pages, since there is no way that a page marked
BTP_DELETED can be put in the FSM during the same VACUUM operation --
that's unsafe (we need all of this recycling/XID indirection precisely
because we need to delay recycling until it is truly safe, of course).
Note that the "%u index pages have been deleted" output includes both
pages deleted by some previous VACUUM operation, and newly deleted
pages (no change there).
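
To put the same thing in terms of the counters themselves (a sketch of
the intended invariants, not literal code from the patches -- the
pages_newly_deleted name comes from the second patch, alongside the
existing pages_deleted and pages_free fields):

    /* Pages deleted by this VACUUM are a subset of all deleted pages */
    Assert(stats->pages_newly_deleted <= stats->pages_deleted);

    /*
     * Newly deleted pages are never put in the FSM by the same VACUUM, so
     * they are never counted as "currently reusable"
     */
    Assert(stats->pages_free <= stats->pages_deleted - stats->pages_newly_deleted);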
Note that the new "newly deleted" output is instrumentation about this
particular *VACUUM operation*. In contrast, the other two existing
output numbers ("deleted" and "currently reusable") are actually
instrumentation about the state of the *index as a whole* at a point
in time (barring concurrent recycling of pages counted in VACUUM by
some random _bt_getbuf() call in another backend). This fundamental
distinction is important here. All three numbers/stats that we output
can have different values, which can be used to debug the first patch.
You can directly observe uncommon cases just from the VERBOSE output,
like when a long-running transaction holds up recycling of a deleted
page that was actually marked BTP_DELETED in an *earlier* VACUUM
operation. And so if the first patch had any bugs, there'd be a pretty
good chance of observing them using multiple VACUUM VERBOSE operations
-- you might notice something inconsistent or contradictory just by
examining how the output changes over time.
[1]: /messages/by-id/CA+TgmoYD7Xpr1DWEWWXxiw4-WC1NBJf3Rb9D2QGpVYH9ejz9fA@mail.gmail.com
[2]: /messages/by-id/CAH2-WzmkebqPd4MVGuPTOS9bMFvp9MDs5cRTCOsv1rQJ3jCbXw@mail.gmail.com
--
Peter Geoghegan
Attachments:
v1-0001-Use-full-64-bit-XID-for-nbtree-page-deletion.patch
From 39ef90d96d0c061b2e537c4cdc9899e4770c3023 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 27 Aug 2019 11:44:17 -0700
Subject: [PATCH v1 1/2] Use full 64-bit XID for nbtree page deletion.
Otherwise, after a deleted page gets even older, it becomes unrecyclable
again. This is the nbtree equivalent of commit 6655a729, which did the
same thing within GiST.
---
src/include/access/nbtree.h | 79 ++++++++++--
src/include/access/nbtxlog.h | 26 ++--
src/include/storage/standby.h | 2 +
src/backend/access/gist/gistxlog.c | 24 +---
src/backend/access/nbtree/nbtinsert.c | 20 +--
src/backend/access/nbtree/nbtpage.c | 167 +++++++++++++++-----------
src/backend/access/nbtree/nbtree.c | 47 +++++---
src/backend/access/nbtree/nbtsearch.c | 6 +-
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/nbtree/nbtxlog.c | 37 +++---
src/backend/access/rmgrdesc/nbtdesc.c | 13 +-
src/backend/storage/ipc/standby.c | 28 +++++
contrib/amcheck/verify_nbtree.c | 71 ++++++-----
contrib/pageinspect/btreefuncs.c | 62 +++++++---
contrib/pgstattuple/pgstatindex.c | 8 +-
15 files changed, 384 insertions(+), 208 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index cad4f2bdeb..17083e9d76 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -37,8 +37,9 @@ typedef uint16 BTCycleId;
*
* In addition, we store the page's btree level (counting upwards from
* zero at a leaf page) as well as some flag bits indicating the page type
- * and status. If the page is deleted, we replace the level with the
- * next-transaction-ID value indicating when it is safe to reclaim the page.
+ * and status. If the page is deleted, a BTDeletedPageContents struct is
+ * stored in the page's tuple area, while a standard BTPageOpaqueData struct
+ * is stored in the page special area.
*
* We also store a "vacuum cycle ID". When a page is split while VACUUM is
* processing the index, a nonzero value associated with the VACUUM run is
@@ -52,17 +53,24 @@ typedef uint16 BTCycleId;
*
* NOTE: the BTP_LEAF flag bit is redundant since level==0 could be tested
* instead.
+ *
+ * NOTE: the btpo_level field used to be part of a union type that allowed
+ * deleted pages to store a 32-bit safexid in space now used only for the
+ * page level. PostgreSQL 14+ consistently maintains the BTP_LEAF flag, as
+ * well as the btpo_level field, which can be useful during testing and analysis.
+ *
+ * (Actually, that's not quite true. It's still possible for a pg_upgraded'd
+ * database to have a BTP_DELETED page that's not marked BTP_HAS_FULLXID, in
+ * which case btpo_level will not in fact store the page level. This limited
+ * exception is inconsequential -- we simply assume that such a page is safe
+ * to recycle anyway.)
*/
typedef struct BTPageOpaqueData
{
BlockNumber btpo_prev; /* left sibling, or P_NONE if leftmost */
BlockNumber btpo_next; /* right sibling, or P_NONE if rightmost */
- union
- {
- uint32 level; /* tree level --- zero for leaf pages */
- TransactionId xact; /* next transaction ID, if deleted */
- } btpo;
+ uint32 btpo_level; /* tree level --- zero for leaf pages */
uint16 btpo_flags; /* flag bits, see below */
BTCycleId btpo_cycleid; /* vacuum cycle ID of latest split */
} BTPageOpaqueData;
@@ -78,6 +86,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples (deprecated) */
#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
+#define BTP_HAS_FULLXID (1 << 8) /* page has a BTDeletedPageContents */
/*
* The max allowed value of a cycle ID is a bit less than 64K. This is
@@ -105,8 +114,7 @@ typedef struct BTMetaPageData
BlockNumber btm_fastroot; /* current "fast" root location */
uint32 btm_fastlevel; /* tree level of the "fast" root page */
/* remaining fields only valid when btm_version >= BTREE_NOVAC_VERSION */
- TransactionId btm_oldest_btpo_xact; /* oldest btpo_xact among all deleted
- * pages */
+ TransactionId btm_oldest_btpo_xact; /* oldest xid among deleted pages */
float8 btm_last_cleanup_num_heap_tuples; /* number of heap tuples
* during last cleanup */
bool btm_allequalimage; /* are all columns "equalimage"? */
@@ -220,6 +228,55 @@ typedef struct BTMetaPageData
#define P_IGNORE(opaque) (((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD)) != 0)
#define P_HAS_GARBAGE(opaque) (((opaque)->btpo_flags & BTP_HAS_GARBAGE) != 0)
#define P_INCOMPLETE_SPLIT(opaque) (((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0)
+#define P_HAS_FULLXID(opaque) (((opaque)->btpo_flags & BTP_HAS_FULLXID) != 0)
+
+/*
+ * On a deleted page, we store this struct. A deleted page doesn't contain
+ * any tuples, so we don't use the normal page layout with line pointers.
+ * Instead, this struct is stored right after the standard page header.
+ */
+typedef struct BTDeletedPageContents
+{
+ /* last xid which could see the page in a scan */
+ FullTransactionId safexid;
+} BTDeletedPageContents;
+
+static inline void
+BTPageSetDeleted(Page page, FullTransactionId safexid)
+{
+ BTPageOpaque opaque;
+ PageHeader header;
+ BTDeletedPageContents *contents;
+
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ header = ((PageHeader) page);
+
+ opaque->btpo_flags &= ~BTP_HALF_DEAD;
+ opaque->btpo_flags |= BTP_DELETED | BTP_HAS_FULLXID;
+ header->pd_lower =
+ MAXALIGN(SizeOfPageHeaderData) + sizeof(BTDeletedPageContents);
+ header->pd_upper = header->pd_special;
+
+ /* Set safexid */
+ contents = ((BTDeletedPageContents *) PageGetContents(page));
+ contents->safexid = safexid;
+}
+
+static inline FullTransactionId
+BTPageGetDeleteXid(Page page)
+{
+ BTPageOpaque opaque PG_USED_FOR_ASSERTS_ONLY;
+ BTDeletedPageContents *contents;
+
+ /* pg_upgrade'd indexes with old BTP_DELETED pages should not call here */
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ Assert(P_ISDELETED(opaque) && !P_ISHALFDEAD(opaque) &&
+ P_HAS_FULLXID(opaque));
+
+ /* Get safexid */
+ contents = ((BTDeletedPageContents *) PageGetContents(page));
+ return contents->safexid;
+}
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -1067,7 +1124,7 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page origpage,
extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
bool allequalimage);
extern void _bt_update_meta_cleanup_info(Relation rel,
- TransactionId oldestBtpoXact, float8 numHeapTuples);
+ FullTransactionId oldestSafeXid, float8 numHeapTuples);
extern void _bt_upgrademetapage(Page page);
extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
@@ -1092,7 +1149,7 @@ extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
Relation heapRel,
TM_IndexDeleteOp *delstate);
extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf,
- TransactionId *oldestBtpoXact);
+ FullTransactionId *oldestSafeXid);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 7ae5c98c2b..1ae13dd2dd 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -13,6 +13,7 @@
#ifndef NBTXLOG_H
#define NBTXLOG_H
+#include "access/transam.h"
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
#include "storage/off.h"
@@ -187,7 +188,7 @@ typedef struct xl_btree_reuse_page
{
RelFileNode node;
BlockNumber block;
- TransactionId latestRemovedXid;
+ FullTransactionId latestRemovedFullXid;
} xl_btree_reuse_page;
#define SizeOfBtreeReusePage (sizeof(xl_btree_reuse_page))
@@ -282,9 +283,12 @@ typedef struct xl_btree_mark_page_halfdead
#define SizeOfBtreeMarkPageHalfDead (offsetof(xl_btree_mark_page_halfdead, topparent) + sizeof(BlockNumber))
/*
- * This is what we need to know about deletion of a btree page. Note we do
- * not store any content for the deleted page --- it is just rewritten as empty
- * during recovery, apart from resetting the btpo.xact.
+ * This is what we need to know about deletion of a btree page. Note that we
+ * only leave behind a small amount of bookkeeping information in deleted
+ * pages (deleted pages must be kept around as tombstones for a while). It is
+ * convenient for the REDO routine to regenerate its target page from scratch.
+ * This is why the WAL record describes certain details that are directly
+ * available from the target page.
*
* Backup Blk 0: target block being deleted
* Backup Blk 1: target block's left sibling, if any
@@ -296,20 +300,24 @@ typedef struct xl_btree_unlink_page
{
BlockNumber leftsib; /* target block's left sibling, if any */
BlockNumber rightsib; /* target block's right sibling */
+ uint32 level; /* target block's level */
/*
- * Information needed to recreate the leaf page, when target is an
- * internal page.
+ * Information needed to recreate a half-dead leaf page with correct
+ * topparent link. The fields are only used when deletion operation's
+ * target page is an internal page. REDO routine creates half-dead page
+ * from scratch to keep things simple (this is the same convenient
+ * approach used for the target page itself).
*/
BlockNumber leafleftsib;
BlockNumber leafrightsib;
- BlockNumber topparent; /* next child down in the subtree */
+ BlockNumber topparent;
- TransactionId btpo_xact; /* value of btpo.xact for use in recovery */
+ FullTransactionId safexid; /* BTPageSetDeleted() value */
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_UNLINK_PAGE_META */
} xl_btree_unlink_page;
-#define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, btpo_xact) + sizeof(TransactionId))
+#define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, safexid) + sizeof(FullTransactionId))
/*
* New root log record. There are zero tuples if this is to establish an
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 94d33851d0..38fd85a431 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -31,6 +31,8 @@ extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
RelFileNode node);
+extern void ResolveRecoveryConflictWithSnapshotFullXid(FullTransactionId latestRemovedFullXid,
+ RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index f2eda79bc1..1c80eae044 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -394,28 +394,8 @@ gistRedoPageReuse(XLogReaderState *record)
* same exclusion effect on primary and standby.
*/
if (InHotStandby)
- {
- FullTransactionId latestRemovedFullXid = xlrec->latestRemovedFullXid;
- FullTransactionId nextXid = ReadNextFullTransactionId();
- uint64 diff;
-
- /*
- * ResolveRecoveryConflictWithSnapshot operates on 32-bit
- * TransactionIds, so truncate the logged FullTransactionId. If the
- * logged value is very old, so that XID wrap-around already happened
- * on it, there can't be any snapshots that still see it.
- */
- diff = U64FromFullTransactionId(nextXid) -
- U64FromFullTransactionId(latestRemovedFullXid);
- if (diff < MaxTransactionId / 2)
- {
- TransactionId latestRemovedXid;
-
- latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
- xlrec->node);
- }
- }
+ ResolveRecoveryConflictWithSnapshotFullXid(xlrec->latestRemovedFullXid,
+ xlrec->node);
}
void
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e333603912..af96c09f46 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1241,7 +1241,7 @@ _bt_insertonpg(Relation rel,
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
- if (metad->btm_fastlevel >= opaque->btpo.level)
+ if (metad->btm_fastlevel >= opaque->btpo_level)
{
/* no update wanted */
_bt_relbuf(rel, metabuf);
@@ -1268,7 +1268,7 @@ _bt_insertonpg(Relation rel,
if (metad->btm_version < BTREE_NOVAC_VERSION)
_bt_upgrademetapage(metapg);
metad->btm_fastroot = BufferGetBlockNumber(buf);
- metad->btm_fastlevel = opaque->btpo.level;
+ metad->btm_fastlevel = opaque->btpo_level;
MarkBufferDirty(metabuf);
}
@@ -1537,7 +1537,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
lopaque->btpo_prev = oopaque->btpo_prev;
/* handle btpo_next after rightpage buffer acquired */
- lopaque->btpo.level = oopaque->btpo.level;
+ lopaque->btpo_level = oopaque->btpo_level;
/* handle btpo_cycleid after rightpage buffer acquired */
/*
@@ -1722,7 +1722,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
ropaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
ropaque->btpo_prev = origpagenumber;
ropaque->btpo_next = oopaque->btpo_next;
- ropaque->btpo.level = oopaque->btpo.level;
+ ropaque->btpo_level = oopaque->btpo_level;
ropaque->btpo_cycleid = lopaque->btpo_cycleid;
/*
@@ -1950,7 +1950,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
uint8 xlinfo;
XLogRecPtr recptr;
- xlrec.level = ropaque->btpo.level;
+ xlrec.level = ropaque->btpo_level;
/* See comments below on newitem, orignewitem, and posting lists */
xlrec.firstrightoff = firstrightoff;
xlrec.newitemoff = newitemoff;
@@ -2142,7 +2142,7 @@ _bt_insert_parent(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* Find the leftmost page at the next level up */
- pbuf = _bt_get_endpoint(rel, opaque->btpo.level + 1, false, NULL);
+ pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL);
/* Set up a phony stack entry pointing there */
stack = &fakestack;
stack->bts_blkno = BufferGetBlockNumber(pbuf);
@@ -2480,15 +2480,15 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
rootopaque->btpo_flags = BTP_ROOT;
- rootopaque->btpo.level =
- ((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo.level + 1;
+ rootopaque->btpo_level =
+ ((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo_level + 1;
rootopaque->btpo_cycleid = 0;
/* update metapage data */
metad->btm_root = rootblknum;
- metad->btm_level = rootopaque->btpo.level;
+ metad->btm_level = rootopaque->btpo_level;
metad->btm_fastroot = rootblknum;
- metad->btm_fastlevel = rootopaque->btpo.level;
+ metad->btm_fastlevel = rootopaque->btpo_level;
/*
* Insert the left page pointer into the new root page. The root page is
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index ac264a5952..86652fff29 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -37,7 +37,7 @@
static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
- TransactionId latestRemovedXid);
+ FullTransactionId latestRemovedFullXid);
static void _bt_delitems_delete(Relation rel, Buffer buf,
TransactionId latestRemovedXid,
OffsetNumber *deletable, int ndeletable,
@@ -50,7 +50,7 @@ static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
BlockNumber scanblkno,
bool *rightsib_empty,
- TransactionId *oldestBtpoXact,
+ FullTransactionId *oldestSafeXid,
uint32 *ndeleted);
static bool _bt_lock_subtree_parent(Relation rel, BlockNumber child,
BTStack stack,
@@ -176,7 +176,7 @@ _bt_getmeta(Relation rel, Buffer metabuf)
* to those written in the metapage. On mismatch, metapage is overwritten.
*/
void
-_bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
+_bt_update_meta_cleanup_info(Relation rel, FullTransactionId oldestSafeXid,
float8 numHeapTuples)
{
Buffer metabuf;
@@ -184,6 +184,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
BTMetaPageData *metad;
bool needsRewrite = false;
XLogRecPtr recptr;
+ TransactionId oldestXid = XidFromFullTransactionId(oldestSafeXid);
/* read the metapage and check if it needs rewrite */
metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
@@ -193,7 +194,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
/* outdated version of metapage always needs rewrite */
if (metad->btm_version < BTREE_NOVAC_VERSION)
needsRewrite = true;
- else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
+ else if (metad->btm_oldest_btpo_xact != oldestXid ||
metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
needsRewrite = true;
@@ -214,7 +215,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
_bt_upgrademetapage(metapg);
/* update cleanup-related information */
- metad->btm_oldest_btpo_xact = oldestBtpoXact;
+ metad->btm_oldest_btpo_xact = oldestXid;
metad->btm_last_cleanup_num_heap_tuples = numHeapTuples;
MarkBufferDirty(metabuf);
@@ -232,7 +233,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
md.level = metad->btm_level;
md.fastroot = metad->btm_fastroot;
md.fastlevel = metad->btm_fastlevel;
- md.oldest_btpo_xact = oldestBtpoXact;
+ md.oldest_btpo_xact = oldestXid;
md.last_cleanup_num_heap_tuples = numHeapTuples;
md.allequalimage = metad->btm_allequalimage;
@@ -316,7 +317,7 @@ _bt_getroot(Relation rel, int access)
* because that's not set in a "fast root".
*/
if (!P_IGNORE(rootopaque) &&
- rootopaque->btpo.level == rootlevel &&
+ rootopaque->btpo_level == rootlevel &&
P_LEFTMOST(rootopaque) &&
P_RIGHTMOST(rootopaque))
{
@@ -377,7 +378,7 @@ _bt_getroot(Relation rel, int access)
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
rootopaque->btpo_flags = (BTP_LEAF | BTP_ROOT);
- rootopaque->btpo.level = 0;
+ rootopaque->btpo_level = 0;
rootopaque->btpo_cycleid = 0;
/* Get raw page pointer for metapage */
metapg = BufferGetPage(metabuf);
@@ -481,11 +482,11 @@ _bt_getroot(Relation rel, int access)
rootblkno = rootopaque->btpo_next;
}
- /* Note: can't check btpo.level on deleted pages */
- if (rootopaque->btpo.level != rootlevel)
+ /* Note: can't check btpo_level from !P_HAS_FULLXID() deleted page */
+ if (rootopaque->btpo_level != rootlevel)
elog(ERROR, "root page %u of index \"%s\" has level %u, expected %u",
rootblkno, RelationGetRelationName(rel),
- rootopaque->btpo.level, rootlevel);
+ rootopaque->btpo_level, rootlevel);
}
/*
@@ -585,11 +586,11 @@ _bt_gettrueroot(Relation rel)
rootblkno = rootopaque->btpo_next;
}
- /* Note: can't check btpo.level on deleted pages */
- if (rootopaque->btpo.level != rootlevel)
+ /* Note: can't check btpo_level from !P_HAS_FULLXID() deleted page */
+ if (rootopaque->btpo_level != rootlevel)
elog(ERROR, "root page %u of index \"%s\" has level %u, expected %u",
rootblkno, RelationGetRelationName(rel),
- rootopaque->btpo.level, rootlevel);
+ rootopaque->btpo_level, rootlevel);
return rootbuf;
}
@@ -762,7 +763,8 @@ _bt_checkpage(Relation rel, Buffer buf)
* Log the reuse of a page from the FSM.
*/
static void
-_bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedXid)
+_bt_log_reuse_page(Relation rel, BlockNumber blkno,
+ FullTransactionId latestRemovedFullXid)
{
xl_btree_reuse_page xlrec_reuse;
@@ -775,7 +777,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
/* XLOG stuff */
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
- xlrec_reuse.latestRemovedXid = latestRemovedXid;
+ xlrec_reuse.latestRemovedFullXid = latestRemovedFullXid;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec_reuse, SizeOfBtreeReusePage);
@@ -862,17 +864,18 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* If we are generating WAL for Hot Standby then create a
* WAL record that will allow us to conflict with queries
* running on standby, in case they have snapshots older
- * than btpo.xact. This can only apply if the page does
- * have a valid btpo.xact value, ie not if it's new. (We
- * must check that because an all-zero page has no special
- * space.)
+ * than safexid value returned by BTPageGetDeleteXid().
+ * This can only apply if the page does have a valid
+ * safexid value, ie not if it's new. (We must check that
+ * because an all-zero page has no special space.)
*/
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel) &&
!PageIsNew(page))
{
- BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ FullTransactionId latestRemovedFullXid;
- _bt_log_reuse_page(rel, blkno, opaque->btpo.xact);
+ latestRemovedFullXid = BTPageGetDeleteXid(page);
+ _bt_log_reuse_page(rel, blkno, latestRemovedFullXid);
}
/* Okay to use page. Re-initialize and return it */
@@ -1101,9 +1104,31 @@ _bt_page_recyclable(Page page)
* interested in it.
*/
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- if (P_ISDELETED(opaque) &&
- GlobalVisCheckRemovableXid(NULL, opaque->btpo.xact))
- return true;
+ if (P_ISDELETED(opaque))
+ {
+ /*
+ * If this is a pg_upgrade'd index, then this could be a deleted page
+ * whose XID (which is stored in the special area's level field via type
+ * punning) is a non-full 32-bit value. It's safe to just assume that
+ * we can recycle because the system must have been restarted since
+ * the time of deletion.
+ */
+ if (!P_HAS_FULLXID(opaque))
+ return true;
+
+ /*
+ * The page was deleted, but when? If it was just deleted, a scan
+ * might have seen the downlink to it, and will read the page later.
+ * As long as that can happen, we must keep the deleted page around as
+ * a tombstone.
+ *
+ * For that check if the deletion XID could still be visible to
+ * anyone. If not, then no scan that's still in progress could have
+ * seen its downlink, and we can recycle it.
+ */
+ return GlobalVisCheckRemovableFullXid(NULL, BTPageGetDeleteXid(page));
+ }
+
return false;
}
@@ -1768,16 +1793,17 @@ _bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
* that the btvacuumscan scan has yet to reach; they'll get counted later
* instead.
*
- * Maintains *oldestBtpoXact for any pages that get deleted. Caller is
- * responsible for maintaining *oldestBtpoXact in the case of pages that were
- * deleted by a previous VACUUM.
+ * Maintains *oldestSafeXid for any pages that get deleted. Caller is
+ * responsible for maintaining *oldestSafeXid in the case of pages that were
+ * deleted by a previous VACUUM but are nevertheless not yet safe to put in
+ * the FSM for recycling.
*
* NOTE: this leaks memory. Rather than trying to clean up everything
* carefully, it's better to run it in a temp context that can be reset
* frequently.
*/
uint32
-_bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
+_bt_pagedel(Relation rel, Buffer leafbuf, FullTransactionId *oldestSafeXid)
{
uint32 ndeleted = 0;
BlockNumber rightsib;
@@ -1985,7 +2011,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
{
/* Check for interrupts in _bt_unlink_halfdead_page */
if (!_bt_unlink_halfdead_page(rel, leafbuf, scanblkno,
- &rightsib_empty, oldestBtpoXact,
+ &rightsib_empty, oldestSafeXid,
&ndeleted))
{
/*
@@ -2001,9 +2027,10 @@ _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
}
}
- Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque));
- Assert(TransactionIdFollowsOrEquals(opaque->btpo.xact,
- *oldestBtpoXact));
+ Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque) &&
+ P_HAS_FULLXID(opaque));
+ Assert(FullTransactionIdFollowsOrEquals(BTPageGetDeleteXid(page),
+ *oldestSafeXid));
rightsib = opaque->btpo_next;
@@ -2264,11 +2291,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
* containing leafbuf. (We always set *rightsib_empty for caller, just to be
* consistent.)
*
- * We maintain *oldestBtpoXact for pages that are deleted by the current
- * VACUUM operation here. This must be handled here because we conservatively
- * assume that there needs to be a new call to ReadNewTransactionId() each
- * time a page gets deleted. See comments about the underlying assumption
- * below.
+ * We maintain *oldestSafeXid for pages that are deleted by the current VACUUM
+ * operation here. This must be handled here because we conservatively assume
+ * that there needs to be a new call to ReadNextFullTransactionId() each time
+ * a page gets deleted. See comments about the underlying assumption below.
*
* Must hold pin and lock on leafbuf at entry (read or write doesn't matter).
* On success exit, we'll be holding pin and write lock. On failure exit,
@@ -2277,8 +2303,8 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
*/
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
- bool *rightsib_empty, TransactionId *oldestBtpoXact,
- uint32 *ndeleted)
+ bool *rightsib_empty,
+ FullTransactionId *oldestSafeXid, uint32 *ndeleted)
{
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
BlockNumber leafleftsib;
@@ -2294,12 +2320,12 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
BTMetaPageData *metad = NULL;
ItemId itemid;
Page page;
- PageHeader header;
BTPageOpaque opaque;
+ FullTransactionId safexid;
bool rightsib_is_rightmost;
- int targetlevel;
+ uint32 targetlevel;
IndexTuple leafhikey;
- BlockNumber nextchild;
+ BlockNumber topparent_in_target;
page = BufferGetPage(leafbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2343,7 +2369,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
leftsib = opaque->btpo_prev;
- targetlevel = opaque->btpo.level;
+ targetlevel = opaque->btpo_level;
Assert(targetlevel > 0);
/*
@@ -2450,7 +2476,9 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
!P_ISLEAF(opaque) || !P_ISHALFDEAD(opaque))
elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
target, RelationGetRelationName(rel));
- nextchild = InvalidBlockNumber;
+
+ /* Leaf page is also target page: don't set topparent */
+ topparent_in_target = InvalidBlockNumber;
}
else
{
@@ -2459,11 +2487,13 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
target, RelationGetRelationName(rel));
- /* Remember the next non-leaf child down in the subtree */
+ /* Internal page is target: we'll set topparent in leaf page... */
itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
- nextchild = BTreeTupleGetDownLink((IndexTuple) PageGetItem(page, itemid));
- if (nextchild == leafblkno)
- nextchild = InvalidBlockNumber;
+ topparent_in_target =
+ BTreeTupleGetDownLink((IndexTuple) PageGetItem(page, itemid));
+ /* ...except when it would be a redundant pointer-to-self */
+ if (topparent_in_target == leafblkno)
+ topparent_in_target = InvalidBlockNumber;
}
/*
@@ -2553,13 +2583,13 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
* no lock was held.
*/
if (target != leafblkno)
- BTreeTupleSetTopParent(leafhikey, nextchild);
+ BTreeTupleSetTopParent(leafhikey, topparent_in_target);
/*
* Mark the page itself deleted. It can be recycled when all current
* transactions are gone. Storing GetTopTransactionId() would work, but
* we're in VACUUM and would not otherwise have an XID. Having already
- * updated links to the target, ReadNewTransactionId() suffices as an
+ * updated links to the target, ReadNextFullTransactionId() suffices as an
* upper bound. Any scan having retained a now-stale link is advertising
* in its PGPROC an xmin less than or equal to the value we read here. It
* will continue to do so, holding back the xmin horizon, for the duration
@@ -2568,17 +2598,14 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
Assert(P_ISHALFDEAD(opaque) || !P_ISLEAF(opaque));
- opaque->btpo_flags &= ~BTP_HALF_DEAD;
- opaque->btpo_flags |= BTP_DELETED;
- opaque->btpo.xact = ReadNewTransactionId();
/*
- * Remove the remaining tuples on the page. This keeps things simple for
- * WAL consistency checking.
+ * Store upper bound XID that's used to determine when deleted page is no
+ * longer needed as a tombstone
*/
- header = (PageHeader) page;
- header->pd_lower = SizeOfPageHeaderData;
- header->pd_upper = header->pd_special;
+ safexid = ReadNextFullTransactionId();
+ BTPageSetDeleted(page, safexid);
+ opaque->btpo_cycleid = 0;
/* And update the metapage, if needed */
if (BufferIsValid(metabuf))
@@ -2616,15 +2643,16 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (target != leafblkno)
XLogRegisterBuffer(3, leafbuf, REGBUF_WILL_INIT);
- /* information on the unlinked block */
+ /* information stored on the target/to-be-unlinked block */
xlrec.leftsib = leftsib;
xlrec.rightsib = rightsib;
- xlrec.btpo_xact = opaque->btpo.xact;
+ xlrec.level = targetlevel;
+ xlrec.safexid = safexid;
/* information needed to recreate the leaf block (if not the target) */
xlrec.leafleftsib = leafleftsib;
xlrec.leafrightsib = leafrightsib;
- xlrec.topparent = nextchild;
+ xlrec.topparent = topparent_in_target;
XLogRegisterData((char *) &xlrec, SizeOfBtreeUnlinkPage);
@@ -2681,9 +2709,14 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_relbuf(rel, lbuf);
_bt_relbuf(rel, rbuf);
- if (!TransactionIdIsValid(*oldestBtpoXact) ||
- TransactionIdPrecedes(opaque->btpo.xact, *oldestBtpoXact))
- *oldestBtpoXact = opaque->btpo.xact;
+ /* If the target is not leafbuf, we're done with it now -- release it */
+ if (target != leafblkno)
+ _bt_relbuf(rel, buf);
+
+ /* Maintain oldestSafeXid for whole VACUUM */
+ if (!FullTransactionIdIsValid(*oldestSafeXid) ||
+ FullTransactionIdPrecedes(safexid, *oldestSafeXid))
+ *oldestSafeXid = safexid;
/*
* If btvacuumscan won't revisit this page in a future btvacuumpage call
@@ -2693,10 +2726,6 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (target <= scanblkno)
(*ndeleted)++;
- /* If the target is not leafbuf, we're done with it now -- release it */
- if (target != leafblkno)
- _bt_relbuf(rel, buf);
-
return true;
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 289bd3c15d..27b41a4979 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -47,7 +47,7 @@ typedef struct
void *callback_state;
BTCycleId cycleid;
BlockNumber totFreePages; /* true total # of free pages */
- TransactionId oldestBtpoXact;
+ FullTransactionId oldestSafeXid;
MemoryContext pagedelcontext;
} BTVacState;
@@ -826,9 +826,9 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
GlobalVisCheckRemovableXid(NULL, metad->btm_oldest_btpo_xact))
{
/*
- * If any oldest btpo.xact from a previously deleted page in the index
- * is visible to everyone, then at least one deleted page can be
- * recycled -- don't skip cleanup.
+ * If the oldest safexid/btpo_xact from a previously deleted page in
+ * the index is visible to everyone, then at least one deleted page
+ * can be recycled -- don't skip cleanup.
*/
result = true;
}
@@ -989,7 +989,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
vstate.callback_state = callback_state;
vstate.cycleid = cycleid;
vstate.totFreePages = 0;
- vstate.oldestBtpoXact = InvalidTransactionId;
+ vstate.oldestSafeXid = InvalidFullTransactionId;
/* Create a temporary memory context to run _bt_pagedel in */
vstate.pagedelcontext = AllocSetContextCreate(CurrentMemoryContext,
@@ -1066,18 +1066,19 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexFreeSpaceMapVacuum(rel);
/*
- * Maintain the oldest btpo.xact and a count of the current number of heap
+ * Maintain the oldest safexid and a count of the current number of heap
* tuples in the metapage (for the benefit of _bt_vacuum_needs_cleanup).
*
- * The page with the oldest btpo.xact is typically a page deleted by this
+ * The page with the oldest safexid is typically a page deleted by this
* VACUUM operation, since pages deleted by a previous VACUUM operation
* tend to be placed in the FSM (by the current VACUUM operation) -- such
- * pages are not candidates to be the oldest btpo.xact. (Note that pages
- * placed in the FSM are reported as deleted pages in the bulk delete
- * statistics, despite not counting as deleted pages for the purposes of
- * determining the oldest btpo.xact.)
+ * pages are not candidates to be the oldest safexid.
+ *
+ * Note that pages placed in the FSM are reported as deleted pages in the
+ * bulk delete statistics, despite not counting as deleted pages for the
+ * purposes of determining the oldest safexid.
*/
- _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
+ _bt_update_meta_cleanup_info(rel, vstate.oldestSafeXid,
info->num_heap_tuples);
/* update statistics */
@@ -1198,22 +1199,32 @@ backtrack:
}
else if (P_ISDELETED(opaque))
{
+ FullTransactionId safexid;
+
/*
* Already deleted page (which could be leaf or internal). Can't
* recycle yet.
*/
stats->pages_deleted++;
- /* Maintain the oldest btpo.xact */
- if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
- TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
- vstate->oldestBtpoXact = opaque->btpo.xact;
+ /*
+ * Maintain oldestSafeXid. We should only end up here with deleted
+ * pages that have the full transaction ID representation, since
+ * _bt_page_recyclable() always considers pg_upgrade'd deleted pages
+ * safe to recycle (the 32-bit XID must have been from before the
+ * upgrade).
+ */
+ Assert(P_HAS_FULLXID(opaque));
+ safexid = BTPageGetDeleteXid(page);
+ if (!FullTransactionIdIsValid(vstate->oldestSafeXid) ||
+ FullTransactionIdPrecedes(safexid, vstate->oldestSafeXid))
+ vstate->oldestSafeXid = safexid;
}
else if (P_ISHALFDEAD(opaque))
{
/*
* Half-dead leaf page. Try to delete now. Might update
- * oldestBtpoXact and pages_deleted below.
+ * oldestSafeXid and pages_deleted below.
*/
attempt_pagedel = true;
}
@@ -1430,7 +1441,7 @@ backtrack:
* count. There will be no double-counting.
*/
Assert(blkno == scanblkno);
- stats->pages_deleted += _bt_pagedel(rel, buf, &vstate->oldestBtpoXact);
+ stats->pages_deleted += _bt_pagedel(rel, buf, &vstate->oldestSafeXid);
MemoryContextSwitchTo(oldcontext);
/* pagedel released buffer, so we shouldn't */
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 2e3bda8171..d1177d8772 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -169,7 +169,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* we're on the level 1 and asked to lock leaf page in write mode,
* then lock next page in write mode, because it must be a leaf.
*/
- if (opaque->btpo.level == 1 && access == BT_WRITE)
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
page_access = BT_WRITE;
/* drop the read lock on the page, then acquire one on its child */
@@ -2341,9 +2341,9 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
}
/* Done? */
- if (opaque->btpo.level == level)
+ if (opaque->btpo_level == level)
break;
- if (opaque->btpo.level < level)
+ if (opaque->btpo_level < level)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("btree level %u not found in index \"%s\"",
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5683daa34d..2c4d7f6e25 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -620,7 +620,7 @@ _bt_blnewpage(uint32 level)
/* Initialize BT opaque state */
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_prev = opaque->btpo_next = P_NONE;
- opaque->btpo.level = level;
+ opaque->btpo_level = level;
opaque->btpo_flags = (level > 0) ? 0 : BTP_LEAF;
opaque->btpo_cycleid = 0;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c1d578cc01..b252d2e628 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -297,7 +297,7 @@ btree_xlog_split(bool newitemonleft, XLogReaderState *record)
ropaque->btpo_prev = origpagenumber;
ropaque->btpo_next = spagenumber;
- ropaque->btpo.level = xlrec->level;
+ ropaque->btpo_level = xlrec->level;
ropaque->btpo_flags = isleaf ? BTP_LEAF : 0;
ropaque->btpo_cycleid = 0;
@@ -773,7 +773,7 @@ btree_xlog_mark_page_halfdead(uint8 info, XLogReaderState *record)
pageop->btpo_prev = xlrec->leftblk;
pageop->btpo_next = xlrec->rightblk;
- pageop->btpo.level = 0;
+ pageop->btpo_level = 0;
pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
pageop->btpo_cycleid = 0;
@@ -802,6 +802,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
BlockNumber leftsib;
BlockNumber rightsib;
+ uint32 level;
+ bool isleaf;
+ FullTransactionId safexid;
Buffer leftbuf;
Buffer target;
Buffer rightbuf;
@@ -810,6 +813,12 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
leftsib = xlrec->leftsib;
rightsib = xlrec->rightsib;
+ level = xlrec->level;
+ isleaf = (level == 0);
+ safexid = xlrec->safexid;
+
+ /* No topparent link for leaf page (level 0) or level 1 */
+ Assert(xlrec->topparent == InvalidBlockNumber || level > 1);
/*
* In normal operation, we would lock all the pages this WAL record
@@ -844,9 +853,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
pageop->btpo_prev = leftsib;
pageop->btpo_next = rightsib;
- pageop->btpo.xact = xlrec->btpo_xact;
- pageop->btpo_flags = BTP_DELETED;
- if (!BlockNumberIsValid(xlrec->topparent))
+ pageop->btpo_level = level;
+ BTPageSetDeleted(page, safexid);
+ if (isleaf)
pageop->btpo_flags |= BTP_LEAF;
pageop->btpo_cycleid = 0;
@@ -892,6 +901,8 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
Buffer leafbuf;
IndexTupleData trunctuple;
+ Assert(!isleaf);
+
leafbuf = XLogInitBufferForRedo(record, 3);
page = (Page) BufferGetPage(leafbuf);
@@ -901,7 +912,7 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
pageop->btpo_prev = xlrec->leafleftsib;
pageop->btpo_next = xlrec->leafrightsib;
- pageop->btpo.level = 0;
+ pageop->btpo_level = 0;
pageop->btpo_cycleid = 0;
/* Add a dummy hikey item */
@@ -942,7 +953,7 @@ btree_xlog_newroot(XLogReaderState *record)
pageop->btpo_flags = BTP_ROOT;
pageop->btpo_prev = pageop->btpo_next = P_NONE;
- pageop->btpo.level = xlrec->level;
+ pageop->btpo_level = xlrec->level;
if (xlrec->level == 0)
pageop->btpo_flags |= BTP_LEAF;
pageop->btpo_cycleid = 0;
@@ -972,17 +983,15 @@ btree_xlog_reuse_page(XLogReaderState *record)
* Btree reuse_page records exist to provide a conflict point when we
* reuse pages in the index via the FSM. That's all they do though.
*
- * latestRemovedXid was the page's btpo.xact. The
- * GlobalVisCheckRemovableXid test in _bt_page_recyclable() conceptually
- * mirrors the pgxact->xmin > limitXmin test in
+ * latestRemovedXid was the page's deleteXid. The
+ * GlobalVisCheckRemovableFullXid(deleteXid) test in _bt_page_recyclable()
+ * conceptually mirrors the PGPROC->xmin > limitXmin test in
* GetConflictingVirtualXIDs(). Consequently, one XID value achieves the
* same exclusion effect on primary and standby.
*/
if (InHotStandby)
- {
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
- }
+ ResolveRecoveryConflictWithSnapshotFullXid(xlrec->latestRemovedFullXid,
+ xlrec->node);
}
void
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 6e0d6a2b72..1a9bd36bc5 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -80,9 +80,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) rec;
- appendStringInfo(buf, "left %u; right %u; btpo_xact %u; ",
- xlrec->leftsib, xlrec->rightsib,
- xlrec->btpo_xact);
+ appendStringInfo(buf, "left %u; right %u; level %u; safexid %u:%u; ",
+ xlrec->leftsib, xlrec->rightsib, xlrec->level,
+ EpochFromFullTransactionId(xlrec->safexid),
+ XidFromFullTransactionId(xlrec->safexid));
appendStringInfo(buf, "leafleft %u; leafright %u; topparent %u",
xlrec->leafleftsib, xlrec->leafrightsib,
xlrec->topparent);
@@ -99,9 +100,11 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) rec;
- appendStringInfo(buf, "rel %u/%u/%u; latestRemovedXid %u",
+ appendStringInfo(buf, "rel %u/%u/%u; latestRemovedXid %u:%u",
xlrec->node.spcNode, xlrec->node.dbNode,
- xlrec->node.relNode, xlrec->latestRemovedXid);
+ xlrec->node.relNode,
+ EpochFromFullTransactionId(xlrec->latestRemovedFullXid),
+ XidFromFullTransactionId(xlrec->latestRemovedFullXid));
break;
}
case XLOG_BTREE_META_CLEANUP:
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 39a30c00f7..0eeb766943 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -452,6 +452,34 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
true);
}
+/*
+ * Variant of ResolveRecoveryConflictWithSnapshot that works with
+ * FullTransactionId values
+ */
+void
+ResolveRecoveryConflictWithSnapshotFullXid(FullTransactionId latestRemovedFullXid,
+ RelFileNode node)
+{
+ /*
+ * ResolveRecoveryConflictWithSnapshot operates on 32-bit TransactionIds,
+ * so truncate the logged FullTransactionId. If the logged value is very
+ * old, so that XID wrap-around already happened on it, there can't be any
+ * snapshots that still see it.
+ */
+ FullTransactionId nextXid = ReadNextFullTransactionId();
+ uint64 diff;
+
+ diff = U64FromFullTransactionId(nextXid) -
+ U64FromFullTransactionId(latestRemovedFullXid);
+ if (diff < MaxTransactionId / 2)
+ {
+ TransactionId latestRemovedXid;
+
+ latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid, node);
+ }
+}
+
void
ResolveRecoveryConflictWithTablespace(Oid tsid)
{
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index b8c7793d9e..0032a2df67 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -769,7 +769,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
P_FIRSTDATAKEY(opaque));
itup = (IndexTuple) PageGetItem(state->target, itemid);
nextleveldown.leftmost = BTreeTupleGetDownLink(itup);
- nextleveldown.level = opaque->btpo.level - 1;
+ nextleveldown.level = opaque->btpo_level - 1;
}
else
{
@@ -795,13 +795,13 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
bt_recheck_sibling_links(state, opaque->btpo_prev, leftcurrent);
/* Check level, which must be valid for non-ignorable page */
- if (level.level != opaque->btpo.level)
+ if (level.level != opaque->btpo_level)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("leftmost down link for level points to block in index \"%s\" whose level is not one level down",
RelationGetRelationName(state->rel)),
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
- current, level.level, opaque->btpo.level)));
+ current, level.level, opaque->btpo_level)));
/* Verify invariants for page */
bt_target_page_check(state);
@@ -1167,7 +1167,7 @@ bt_target_page_check(BtreeCheckState *state)
bt_child_highkey_check(state,
offset,
NULL,
- topaque->btpo.level);
+ topaque->btpo_level);
}
continue;
}
@@ -1529,7 +1529,7 @@ bt_target_page_check(BtreeCheckState *state)
if (!P_ISLEAF(topaque) && P_RIGHTMOST(topaque) && state->readonly)
{
bt_child_highkey_check(state, InvalidOffsetNumber,
- NULL, topaque->btpo.level);
+ NULL, topaque->btpo_level);
}
}
@@ -1606,7 +1606,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
ereport(DEBUG1,
(errcode(ERRCODE_NO_DATA),
errmsg("level %u leftmost page of index \"%s\" was found deleted or half dead",
- opaque->btpo.level, RelationGetRelationName(state->rel)),
+ opaque->btpo_level, RelationGetRelationName(state->rel)),
errdetail_internal("Deleted page found when building scankey from right sibling.")));
/* Be slightly more pro-active in freeing this memory, just in case */
@@ -1911,13 +1911,13 @@ bt_child_highkey_check(BtreeCheckState *state,
(uint32) state->targetlsn)));
/* Check level for non-ignorable page */
- if (!P_IGNORE(opaque) && opaque->btpo.level != target_level - 1)
+ if (!P_IGNORE(opaque) && opaque->btpo_level != target_level - 1)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block found while following rightlinks from child of index \"%s\" has invalid level",
RelationGetRelationName(state->rel)),
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
- blkno, target_level - 1, opaque->btpo.level)));
+ blkno, target_level - 1, opaque->btpo_level)));
/* Try to detect circular links */
if ((!first && blkno == state->prevrightlink) || blkno == opaque->btpo_prev)
@@ -2145,7 +2145,7 @@ bt_child_check(BtreeCheckState *state, BTScanInsert targetkey,
* check for downlink connectivity.
*/
bt_child_highkey_check(state, downlinkoffnum,
- child, topaque->btpo.level);
+ child, topaque->btpo_level);
/*
* Since there cannot be a concurrent VACUUM operation in readonly mode,
@@ -2290,7 +2290,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
errmsg("harmless interrupted page split detected in index %s",
RelationGetRelationName(state->rel)),
errdetail_internal("Block=%u level=%u left sibling=%u page lsn=%X/%X.",
- blkno, opaque->btpo.level,
+ blkno, opaque->btpo_level,
opaque->btpo_prev,
(uint32) (pagelsn >> 32),
(uint32) pagelsn)));
@@ -2321,7 +2321,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
elog(DEBUG1, "checking for interrupted multi-level deletion due to missing downlink in index \"%s\"",
RelationGetRelationName(state->rel));
- level = opaque->btpo.level;
+ level = opaque->btpo_level;
itemid = PageGetItemIdCareful(state, blkno, page, P_FIRSTDATAKEY(opaque));
itup = (IndexTuple) PageGetItem(page, itemid);
childblk = BTreeTupleGetDownLink(itup);
@@ -2336,16 +2336,16 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
break;
/* Do an extra sanity check in passing on internal pages */
- if (copaque->btpo.level != level - 1)
+ if (copaque->btpo_level != level - 1)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("downlink points to block in index \"%s\" whose level is not one level down",
RelationGetRelationName(state->rel)),
errdetail_internal("Top parent/under check block=%u block pointed to=%u expected level=%u level in pointed to block=%u.",
blkno, childblk,
- level - 1, copaque->btpo.level)));
+ level - 1, copaque->btpo_level)));
- level = copaque->btpo.level;
+ level = copaque->btpo_level;
itemid = PageGetItemIdCareful(state, childblk, child,
P_FIRSTDATAKEY(copaque));
itup = (IndexTuple) PageGetItem(child, itemid);
@@ -2407,7 +2407,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
errmsg("internal index block lacks downlink in index \"%s\"",
RelationGetRelationName(state->rel)),
errdetail_internal("Block=%u level=%u page lsn=%X/%X.",
- blkno, opaque->btpo.level,
+ blkno, opaque->btpo_level,
(uint32) (pagelsn >> 32),
(uint32) pagelsn)));
}
@@ -3002,21 +3002,26 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
}
/*
- * Deleted pages have no sane "level" field, so can only check non-deleted
- * page level
+ * Deleted pages that still use the old 32-bit XID representation have no
+ * sane "level" field because they type pun the field, but all other pages
+ * (including pages deleted on Postgres 14+) have a valid value.
*/
- if (P_ISLEAF(opaque) && !P_ISDELETED(opaque) && opaque->btpo.level != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("invalid leaf page level %u for block %u in index \"%s\"",
- opaque->btpo.level, blocknum, RelationGetRelationName(state->rel))));
+ if (!P_ISDELETED(opaque) || P_HAS_FULLXID(opaque))
+ {
+ /* Okay, we can trust btpo_level field from page */
- if (!P_ISLEAF(opaque) && !P_ISDELETED(opaque) &&
- opaque->btpo.level == 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("invalid internal page level 0 for block %u in index \"%s\"",
- blocknum, RelationGetRelationName(state->rel))));
+ if (P_ISLEAF(opaque) && opaque->btpo_level != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("invalid leaf page level %u for block %u in index \"%s\"",
+ opaque->btpo_level, blocknum, RelationGetRelationName(state->rel))));
+
+ if (!P_ISLEAF(opaque) && opaque->btpo_level == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("invalid internal page level 0 for block %u in index \"%s\"",
+ blocknum, RelationGetRelationName(state->rel))));
+ }
/*
* Sanity checks for number of items on page.
@@ -3064,7 +3069,9 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
* from version 9.4 on, so do the same here. See _bt_pagedel() for full
* details.
*
- * Internal pages should never have garbage items, either.
+ * Also check that internal pages have no garbage items, and that no page
+ * has an invalid combination of page level flags relating to deleted
+ * pages.
*/
if (!P_ISLEAF(opaque) && P_ISHALFDEAD(opaque))
ereport(ERROR,
@@ -3079,6 +3086,12 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
errmsg("internal page block %u in index \"%s\" has garbage items",
blocknum, RelationGetRelationName(state->rel))));
+ if (P_HAS_FULLXID(opaque) && !P_ISDELETED(opaque))
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("invalid page flag combination for block %u in index \"%s\"",
+ blocknum, RelationGetRelationName(state->rel))));
+
return page;
}
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8bb180bbbe..bb81c699cd 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -75,11 +75,7 @@ typedef struct BTPageStat
/* opaque data */
BlockNumber btpo_prev;
BlockNumber btpo_next;
- union
- {
- uint32 level;
- TransactionId xact;
- } btpo;
+ uint32 btpo_level;
uint16 btpo_flags;
BTCycleId btpo_cycleid;
} BTPageStat;
@@ -112,9 +108,33 @@ GetBTPageStatistics(BlockNumber blkno, Buffer buffer, BTPageStat *stat)
/* page type (flags) */
if (P_ISDELETED(opaque))
{
- stat->type = 'd';
- stat->btpo.xact = opaque->btpo.xact;
- return;
+ /* We divide deleted pages into leaf ('d') or internal ('D') */
+ if (P_ISLEAF(opaque) || !P_HAS_FULLXID(opaque))
+ stat->type = 'd';
+ else
+ stat->type = 'D';
+
+ /*
+ * Report safexid in a deleted page.
+ *
+ * Handle pg_upgrade'd deleted pages that used the previous safexid
+ * representation in the btpo_level field (this used to be a union type
+ * called "btpo").
+ */
+ if (P_HAS_FULLXID(opaque))
+ {
+ FullTransactionId safexid = BTPageGetDeleteXid(page);
+
+ elog(NOTICE, "deleted page from block %u has safexid %u:%u",
+ blkno, EpochFromFullTransactionId(safexid),
+ XidFromFullTransactionId(safexid));
+ }
+ else
+ elog(NOTICE, "deleted page from block %u has safexid %u",
+ blkno, opaque->btpo_level);
+
+ /* Don't interpret BTDeletedPageContents as index tuples */
+ maxoff = InvalidOffsetNumber;
}
else if (P_IGNORE(opaque))
stat->type = 'e';
@@ -128,7 +148,7 @@ GetBTPageStatistics(BlockNumber blkno, Buffer buffer, BTPageStat *stat)
/* btpage opaque data */
stat->btpo_prev = opaque->btpo_prev;
stat->btpo_next = opaque->btpo_next;
- stat->btpo.level = opaque->btpo.level;
+ stat->btpo_level = opaque->btpo_level;
stat->btpo_flags = opaque->btpo_flags;
stat->btpo_cycleid = opaque->btpo_cycleid;
@@ -237,7 +257,8 @@ bt_page_stats_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
values[j++] = psprintf("%u", stat.free_size);
values[j++] = psprintf("%u", stat.btpo_prev);
values[j++] = psprintf("%u", stat.btpo_next);
- values[j++] = psprintf("%u", (stat.type == 'd') ? stat.btpo.xact : stat.btpo.level);
+ /* The "btpo" field now only stores btpo_level, never an xact */
+ values[j++] = psprintf("%u", stat.btpo_level);
values[j++] = psprintf("%d", stat.btpo_flags);
tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
@@ -503,10 +524,14 @@ bt_page_items_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
opaque = (BTPageOpaque) PageGetSpecialPointer(uargs->page);
- if (P_ISDELETED(opaque))
- elog(NOTICE, "page is deleted");
-
- fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ if (!P_ISDELETED(opaque))
+ fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ else
+ {
+ /* Don't interpret BTDeletedPageContents as index tuples */
+ elog(NOTICE, "page from block " INT64_FORMAT " is deleted", blkno);
+ fctx->max_calls = 0;
+ }
uargs->leafpage = P_ISLEAF(opaque);
uargs->rightmost = P_RIGHTMOST(opaque);
@@ -603,7 +628,14 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (P_ISDELETED(opaque))
elog(NOTICE, "page is deleted");
- fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ if (!P_ISDELETED(opaque))
+ fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ else
+ {
+ /* Don't interpret BTDeletedPageContents as index tuples */
+ elog(NOTICE, "page from block is deleted");
+ fctx->max_calls = 0;
+ }
uargs->leafpage = P_ISLEAF(opaque);
uargs->rightmost = P_RIGHTMOST(opaque);
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index b1ce0d77d7..5368bb30f0 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -283,8 +283,12 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
page = BufferGetPage(buffer);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- /* Determine page type, and update totals */
-
+ /*
+ * Determine page type, and update totals.
+ *
+ * Note that we arbitrarily bucket deleted pages together without
+ * considering if they're leaf pages or internal pages.
+ */
if (P_ISDELETED(opaque))
indexStat.deleted_pages++;
else if (P_IGNORE(opaque))
--
2.27.0
Attachment: v1-0002-Add-pages_newly_deleted-to-VACUUM-VERBOSE.patch
From 93292c7104856f069b8f94d6ec31832e6f4f977c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 7 Feb 2021 19:24:03 -0800
Subject: [PATCH v1 2/2] Add pages_newly_deleted to VACUUM VERBOSE.
pages_newly_deleted reports on the number of pages deleted by the
current VACUUM operation. The pages_deleted field continues to report
on the total number of deleted pages in the index (as well as pages that
are recyclable due to being zeroed in rare cases), without regard to
whether or not this VACUUM operation deleted them.
---
src/include/access/genam.h | 12 ++++---
src/include/access/nbtree.h | 3 +-
src/backend/access/gin/ginvacuum.c | 1 +
src/backend/access/gist/gistvacuum.c | 2 ++
src/backend/access/heap/vacuumlazy.c | 6 ++--
src/backend/access/nbtree/nbtpage.c | 46 +++++++++++++++++++--------
src/backend/access/nbtree/nbtree.c | 27 ++++++++++++----
src/backend/access/spgist/spgvacuum.c | 1 +
8 files changed, 70 insertions(+), 28 deletions(-)
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 0eab1508d3..09a9aa3c29 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -64,10 +64,11 @@ typedef struct IndexVacuumInfo
* to communicate additional private data to amvacuumcleanup.
*
* Note: pages_removed is the amount by which the index physically shrank,
- * if any (ie the change in its total size on disk). pages_deleted and
- * pages_free refer to free space within the index file. Some index AMs
- * may compute num_index_tuples by reference to num_heap_tuples, in which
- * case they should copy the estimated_count field from IndexVacuumInfo.
+ * if any (ie the change in its total size on disk). pages_deleted,
+ * pages_newly_deleted, and pages_free refer to free space within the index
+ * file. Some index AMs may compute num_index_tuples by reference to
+ * num_heap_tuples, in which case they should copy the estimated_count field
+ * from IndexVacuumInfo.
*/
typedef struct IndexBulkDeleteResult
{
@@ -76,7 +77,8 @@ typedef struct IndexBulkDeleteResult
bool estimated_count; /* num_index_tuples is an estimate */
double num_index_tuples; /* tuples remaining */
double tuples_removed; /* # removed during vacuum operation */
- BlockNumber pages_deleted; /* # unused pages in index */
+ BlockNumber pages_deleted; /* # pages marked deleted (could be by us) */
+ BlockNumber pages_newly_deleted; /* # pages marked deleted by us */
BlockNumber pages_free; /* # pages available for reuse */
} IndexBulkDeleteResult;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 17083e9d76..897368b4d3 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1149,7 +1149,8 @@ extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
Relation heapRel,
TM_IndexDeleteOp *delstate);
extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf,
- FullTransactionId *oldestSafeXid);
+ FullTransactionId *oldestSafeXid,
+ BlockNumber *ndeletedcount);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 35b85a9bff..7504f57a03 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -232,6 +232,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
END_CRIT_SECTION();
gvs->result->pages_deleted++;
+ gvs->result->pages_newly_deleted++;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 94a7e12763..2935491ec9 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -139,6 +139,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
stats->estimated_count = false;
stats->num_index_tuples = 0;
stats->pages_deleted = 0;
+ stats->pages_newly_deleted = 0;
stats->pages_free = 0;
/*
@@ -640,6 +641,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
MarkBufferDirty(leafBuffer);
GistPageSetDeleted(leafPage, txid);
stats->pages_deleted++;
+ stats->pages_newly_deleted++;
/* remove the downlink from the parent */
MarkBufferDirty(parentBuffer);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad..addf243e40 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2521,10 +2521,12 @@ lazy_cleanup_index(Relation indrel,
(*stats)->num_index_tuples,
(*stats)->num_pages),
errdetail("%.0f index row versions were removed.\n"
- "%u index pages have been deleted, %u are currently reusable.\n"
+ "%u index pages have been deleted, %u are newly deleted, %u are currently reusable.\n"
"%s.",
(*stats)->tuples_removed,
- (*stats)->pages_deleted, (*stats)->pages_free,
+ (*stats)->pages_deleted,
+ (*stats)->pages_newly_deleted,
+ (*stats)->pages_free,
pg_rusage_show(&ru0))));
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 86652fff29..746033612e 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -51,7 +51,7 @@ static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
BlockNumber scanblkno,
bool *rightsib_empty,
FullTransactionId *oldestSafeXid,
- uint32 *ndeleted);
+ BlockNumber *ndeletedcount);
static bool _bt_lock_subtree_parent(Relation rel, BlockNumber child,
BTStack stack,
Buffer *subtreeparent,
@@ -1787,11 +1787,24 @@ _bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
* should never pass a buffer containing an existing deleted page here. The
* lock and pin on caller's buffer will be dropped before we return.
*
- * Returns the number of pages successfully deleted (zero if page cannot
- * be deleted now; could be more than one if parent or right sibling pages
- * were deleted too). Note that this does not include pages that we delete
- * that the btvacuumscan scan has yet to reach; they'll get counted later
- * instead.
+ * Returns the number of pages successfully physically deleted (zero if page
+ * cannot be deleted now; could be more than one if parent or right sibling
+ * pages were deleted too). Caller uses return value to maintain bulk stats'
+ * pages_newly_deleted value.
+ *
+ * Maintains *ndeletedcount for caller, which is ultimately used as the
+ * pages_deleted value in bulk delete stats for entire VACUUM. When we're
+ * called *ndeletedcount is the current total count of pages deleted in the
+ * index prior to current scanblkno block/position in btvacuumscan. This
+ * includes existing deleted pages (pages deleted by a previous VACUUM), and
+ * pages that we delete during current VACUUM. We need to cooperate closely
+ * with caller here so that whole VACUUM operation reliably avoids any double
+ * counting of subsidiary-to-leafbuf pages that we delete in passing. If such
+ * pages happen to be from a block number that is ahead of the current
+ * scanblkno position, then caller is expected to count them directly later
+ * on. It's simpler for us to understand caller's requirements than it would
+ * be for caller to understand when or how a deleted page became deleted after
+ * the fact.
*
* Maintains *oldestSafeXid for any pages that get deleted. Caller is
* responsible for maintaining *oldestSafeXid in the case of pages that were
@@ -1803,7 +1816,8 @@ _bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
* frequently.
*/
uint32
-_bt_pagedel(Relation rel, Buffer leafbuf, FullTransactionId *oldestSafeXid)
+_bt_pagedel(Relation rel, Buffer leafbuf, FullTransactionId *oldestSafeXid,
+ BlockNumber *ndeletedcount)
{
uint32 ndeleted = 0;
BlockNumber rightsib;
@@ -1813,7 +1827,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, FullTransactionId *oldestSafeXid)
/*
* Save original leafbuf block number from caller. Only deleted blocks
- * that are <= scanblkno get counted in ndeleted return value.
+ * that are <= scanblkno are accounted for by *ndeletedcount.
*/
BlockNumber scanblkno = BufferGetBlockNumber(leafbuf);
@@ -2012,7 +2026,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, FullTransactionId *oldestSafeXid)
/* Check for interrupts in _bt_unlink_halfdead_page */
if (!_bt_unlink_halfdead_page(rel, leafbuf, scanblkno,
&rightsib_empty, oldestSafeXid,
- &ndeleted))
+ ndeletedcount))
{
/*
* _bt_unlink_halfdead_page should never fail, since we
@@ -2025,6 +2039,8 @@ _bt_pagedel(Relation rel, Buffer leafbuf, FullTransactionId *oldestSafeXid)
Assert(false);
return ndeleted;
}
+ ndeleted++;
+ /* _bt_unlink_halfdead_page probably incremented ndeletedcount */
}
Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque) &&
@@ -2304,7 +2320,8 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
bool *rightsib_empty,
- FullTransactionId *oldestSafeXid, uint32 *ndeleted)
+ FullTransactionId *oldestSafeXid,
+ BlockNumber *ndeletedcount)
{
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
BlockNumber leafleftsib;
@@ -2719,12 +2736,15 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
*oldestSafeXid = safexid;
/*
+ * Maintain ndeletedcount for entire call to _bt_pagedel. Used to
+ * maintain pages_deleted bulk delete stats for entire VACUUM operation.
+ *
* If btvacuumscan won't revisit this page in a future btvacuumpage call
- * and count it as deleted then, we count it as deleted by current
- * btvacuumpage call
+ * and count it as deleted then, we count it as deleted in pages_deleted
+ * by current btvacuumpage call.
*/
if (target <= scanblkno)
- (*ndeleted)++;
+ (*ndeletedcount)++;
return true;
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 27b41a4979..e5d6c5768d 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -981,6 +981,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
stats->estimated_count = false;
stats->num_index_tuples = 0;
stats->pages_deleted = 0;
+ stats->pages_newly_deleted = 0;
/* Set up info to pass down to btvacuumpage */
vstate.info = info;
@@ -1076,7 +1077,11 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*
* Note that pages placed in the FSM are reported as deleted pages in the
* bulk delete statistics, despite not counting as deleted pages for the
- * purposes of determining the oldest safexid.
+ * purposes of determining the oldest safexid. We generally expect that
+ * the oldest safexid will come from one of the deleted pages that gets
+ * counted in pages_newly_deleted. In rare cases it will come from a page
+ * that was marked deleted by a previous VACUUM operation which was not
+ * safe to place in the FSM, even during this VACUUM operation.
*/
_bt_update_meta_cleanup_info(rel, vstate.oldestSafeXid,
info->num_heap_tuples);
@@ -1178,8 +1183,8 @@ backtrack:
* since _bt_pagedel() sometimes deletes the right sibling page of
* scanblkno in passing (it does so after we decided where to
* backtrack to). We don't need to process this page as a deleted
- * page a second time now (in fact, it would be wrong to count it as a
- * deleted page in the bulk delete statistics a second time).
+ * page a second time now (in fact, it would be wrong to double count
+ * it in the pages_deleted field from bulk delete statistics).
*/
if (opaque->btpo_cycleid != vstate->cycleid || P_ISDELETED(opaque))
{
@@ -1436,12 +1441,20 @@ backtrack:
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
/*
- * We trust the _bt_pagedel return value because it does not include
- * any page that a future call here from btvacuumscan is expected to
- * count. There will be no double-counting.
+ * _bt_pagedel return value is simply the number of pages directly
+ * deleted on each call. This is just added to pages_newly_deleted,
+ * which counts the number of pages marked deleted in current VACUUM.
+ *
+ * We need to maintain pages_deleted more carefully here, though. We
+ * cannot just add the same _bt_pagedel return value to pages_deleted
+ * because that double-counts pages that are deleted within
+ * _bt_pagedel that will become scanblkno in a later call here from
+ * btvacuumscan.
*/
Assert(blkno == scanblkno);
- stats->pages_deleted += _bt_pagedel(rel, buf, &vstate->oldestSafeXid);
+ stats->pages_newly_deleted += _bt_pagedel(rel, buf,
+ &vstate->oldestSafeXid,
+ &stats->pages_deleted);
MemoryContextSwitchTo(oldcontext);
/* pagedel released buffer, so we shouldn't */
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0d02a02222..a9ffca5183 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -891,6 +891,7 @@ spgvacuumscan(spgBulkDeleteState *bds)
/* Report final stats */
bds->stats->num_pages = num_pages;
+ bds->stats->pages_newly_deleted = bds->stats->pages_deleted;
bds->stats->pages_free = bds->stats->pages_deleted;
}
--
2.27.0
On Tue, Feb 9, 2021 at 2:14 PM Peter Geoghegan <pg@bowt.ie> wrote:
The first patch teaches nbtree to use 64-bit transaction IDs here, and
so makes it impossible to leak deleted nbtree pages. This patch is the
nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
due to exactly the same set of problems.
There is an unresolved question for my deleted page XID patch: what
should it do about the vacuum_cleanup_index_scale_factor feature,
which added an XID to the metapage (its btm_oldest_btpo_xact field)? I
refer to the work done by commit 857f9c36cda for Postgres 11 by
Masahiko. It would be good to get your opinion on this as the original
author of that feature, Masahiko.
To recap, btm_oldest_btpo_xact is supposed to be the oldest XID among
all deleted pages in the index, so clearly it needs to be carefully
considered in my patch to make the XIDs 64-bit. Even still, v1 of my
patch from today more or less ignores the issue -- it just gets a
32-bit version of the oldest value as determined by the oldestBtpoXact
XID tracking stuff (which is largely unchanged, except that it works
with 64-bit Full Transaction Ids now).
Obviously it is still possible for the 32-bit btm_oldest_btpo_xact
field to wrap around in v1 of my patch. The obvious thing to do here
is to add a new epoch metapage field, effectively making
btm_oldest_btpo_xact 64-bit. However, I don't think that that's a good
idea. The only reason that we have the btm_oldest_btpo_xact field in
the first place is to ameliorate the problem that the patch
comprehensively solves! We should stop storing *any* XIDs in the
metapage. (Besides, adding a new "epoch" field to the metapage would
be relatively messy.)
Here is a plan that allows us to stop storing any kind of XID in the
metapage in all cases:
1. Stop maintaining the oldest XID among all deleted pages in the
entire nbtree index during VACUUM. So we can remove all of the
BTVacState.oldestBtpoXact XID tracking stuff, which is currently
something that even _bt_pagedel() needs special handling for.
2. Stop considering the btm_oldest_btpo_xact metapage field in
_bt_vacuum_needs_cleanup() -- now the "Cleanup needed?" logic only
cares about maintaining reasonably accurate statistics for the index.
Which is really how the vacuum_cleanup_index_scale_factor feature was
intended to work all along, anyway -- ISTM that the oldestBtpoXact
stuff was always just an afterthought to paper over this annoying
32-bit XID issue.
3. We cannot actually remove the btm_oldest_btpo_xact XID field from
the metapage, because of course that would change the BTMetaPageData
struct layout, which breaks on-disk compatibility. But why not use it
for something useful instead? _bt_update_meta_cleanup_info() can use
the same field to store the number of "newly deleted" pages from the
last btbulkdelete() instead. (See my email from earlier for the
definition of "newly deleted".)
4. Now _bt_vacuum_needs_cleanup() can once again consider the
btm_oldest_btpo_xact metapage field -- except in a totally different
way, because now it means something totally different: "newly deleted
pages during last btbulkdelete() call" (per item 3). If this # pages
is very high then we probably should do a full call to btvacuumscan()
-- _bt_vacuum_needs_cleanup() will return true to make that happen.
It's unlikely but still possible that a high number of "newly deleted
pages during the last btbulkdelete() call" is in itself a good enough
reason for _bt_vacuum_needs_cleanup() to decide in favor of a full
btvacuumscan() call. Item 4
here conservatively covers that. Maybe the 32-bit-XID-in-metapage
triggering condition had some non-obvious value due to a natural
tendency for it to limit the number of deleted pages that go
unrecycled for a long time. (Or maybe there never really was any such
natural tendency -- still seems like a good idea to make the change
described by item 4.)
Even though we are conservative (at least in the sense I just
described), we nevertheless don't actually care about very old deleted
pages that we have not yet recycled -- provided there are not very
many of them. I'm thinking of "~2% of index" as the new "newly deleted
during last btbulkdelete() call" threshold applied within
_bt_vacuum_needs_cleanup(). There is no good reason why older
deleted-but-not-yet-recycled pages should be considered more valuable
than any other page that can be used when there is a page split.
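To make items 3 and 4 concrete, here is a minimal standalone C sketch
of the proposed trigger condition. It is not taken from either patch:
the function and parameter names are invented for illustration, and
backend calls such as RelationGetNumberOfBlocks() are reduced to plain
integer parameters so that the sketch compiles on its own. The "~2% of
index" figure above corresponds to a divisor of 50; the v2 patch posted
later in this thread happens to use a divisor of 20 instead.

#include <stdbool.h>
#include <stdint.h>

static bool
needs_cleanup_for_newly_deleted(uint32_t last_cleanup_num_delpages,
                                uint32_t num_index_blocks)
{
    /*
     * Item 4: trigger a full btvacuumscan() once the number of "newly
     * deleted pages during the last btbulkdelete() call" (supplied here
     * by the repurposed metapage field from item 3) exceeds roughly 2%
     * of the index.
     */
    return last_cleanup_num_delpages > num_index_blocks / 50;
}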
Observations about on-disk compatibility with my patch + this 4 point scheme:
A. It doesn't matter that pg_upgrade'd indexes will have an XID value
in btm_oldest_btpo_xact that now gets incorrectly interpreted as
"newly deleted pages during last btbulkdelete() call" under the 4
point scheme I just outlined.
The spurious value will get cleaned up on the next VACUUM anyway
(whether VACUUM goes through btbulkdelete() or through
btvacuumcleanup()). Besides, most indexes always have a
btm_oldest_btpo_xact value of 0.
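As an aside on why the reuse in item 3 (and the reinterpretation
described in observation A) cannot change the on-disk layout:
TransactionId and BlockNumber are both 4-byte unsigned integers, so the
old XID slot can hold a page count without moving any other metapage
field. The standalone sketch below (stand-in typedefs, not the real
backend headers) states the same invariant that the v2 patch later in
this thread enforces with a StaticAssertStmt.

#include <assert.h>
#include <stdint.h>

/* Stand-ins for the backend typedefs; both are uint32 in PostgreSQL */
typedef uint32_t TransactionIdSketch;
typedef uint32_t BlockNumberSketch;

/*
 * The four bytes that used to hold btm_oldest_btpo_xact can hold a
 * deleted page count instead, leaving BTMetaPageData's layout intact.
 */
static_assert(sizeof(TransactionIdSketch) == sizeof(BlockNumberSketch),
              "repurposed metapage field must keep the same width");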
B. The patch I posted earlier doesn't actually care about the
BTREE_VERSION of the index at all. And neither does any of the stuff I
just described for a future v2 of my patch.
All indexes can use the new format for deleted pages. On-disk
compatibility is easy here because the contents of deleted pages only
need to work as a tombstone. We can safely assume that old-format
deleted pages (pre-Postgres 14 format deleted pages) must be safe to
recycle, because the pg_upgrade itself restarts Postgres. There can be
no backends that have dangling references to the old-format deleted
page.
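Here is a minimal standalone sketch of the recycle-safety rule just
described. The struct and function are invented for illustration; in
the v2 patch below this logic lives in _bt_page_recyclable(), which
uses P_HAS_FULLXID() and BTPageGetDeleteXid(), and the horizon check is
a call to GlobalVisCheckRemovableFullXid() rather than a boolean
parameter.

#include <stdbool.h>
#include <stdint.h>

typedef struct DeletedPageSketch
{
    bool        deleted;            /* BTP_DELETED set? */
    bool        has_full_safexid;   /* BTP_HAS_FULLXID set? (new format) */
    uint64_t    safexid;            /* 64-bit safexid, new format only */
} DeletedPageSketch;

static bool
recyclable_sketch(const DeletedPageSketch *page, bool safexid_visible_to_all)
{
    if (!page->deleted)
        return false;

    /*
     * Old-format (pre-Postgres 14, pg_upgrade'd) deleted page: no full
     * safexid was stored.  pg_upgrade implies a server restart, so no
     * scan can still hold a stale link to the page -- always safe to
     * recycle.
     */
    if (!page->has_full_safexid)
        return true;

    /* New format: recycle only once the safexid is visible to everyone */
    return safexid_visible_to_all;
}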
C. All supported nbtree versions (all nbtree versions
BTREE_MIN_VERSION+) get the same benefits under this scheme.
Even BTREE_MIN_VERSION/version 2 indexes are dynamically upgradable to
BTREE_NOVAC_VERSION/version 3 indexes via a call to
_bt_upgrademetapage() -- that has been the case since BTREE_VERSION
was bumped to BTREE_NOVAC_VERSION/version 3 for Postgres 11's
vacuum_cleanup_index_scale_factor feature. So all nbtree indexes will
have the btm_oldest_btpo_xact metapage field that I now propose to
reuse to track "newly deleted pages during last btbulkdelete() call",
per point 4.
In summary: There are no special cases here. No BTREE_VERSION related
difficulties. That seems like a huge advantage to me.
--
Peter Geoghegan
On Tue, Feb 9, 2021 at 5:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
Here is a plan that allows us to stop storing any kind of XID in the
metapage in all cases:
Attached is v2, which deals with the metapage 32-bit
XID/btm_oldest_btpo_xact issue using the approach I described earlier.
We don't store an XID in the metapage anymore in v2. This seems to
work well, as I expected it would.
--
Peter Geoghegan
Attachments:
v2-0001-Use-full-64-bit-XID-for-nbtree-page-deletion.patch
From d34737432d11b14acad35dfec1e29b87bb0f0ab4 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 27 Aug 2019 11:44:17 -0700
Subject: [PATCH v2 1/2] Use full 64-bit XID for nbtree page deletion.
Otherwise, after a deleted page gets even older, it becomes unrecyclable
again. This is the nbtree equivalent of commit 6655a729, which did the
same thing within GiST.
Stop storing an XID that tracks the oldest safexid across all deleted
pages in an index altogether. There is no longer any point in doing
this. It only ever made sense when btpo.xact fields could wrap around.
The old btm_oldest_btpo_xact metapage field has been repurposed in a way
that preserves on-disk compatibility for pg_upgrade. Rename this uint32
field, and use it to store the number of deleted pages that we expect to
be able to recycle during the next btvacuumcleanup() that actually scans
the index. This approach is a little unorthodox, but we were already
using btm_oldest_btpo_xact (now called btm_last_cleanup_num_delpages) in
approximately the same way. And in exactly the same place: inside the
_bt_vacuum_needs_cleanup() function.
The general assumption is that we ought to be able to recycle however
many pages btm_last_cleanup_num_delpages indicates by deciding to scan
the index during a btvacuumcleanup() call (_bt_vacuum_needs_cleanup()'s
decision). Note that manually issued VACUUMs won't be able to recycle
btm_last_cleanup_num_delpages pages (and _bt_vacuum_needs_cleanup()
won't instruct btvacuumcleanup() to skip scanning the index) unless at
least one XID is consumed between VACUUMs.
---
src/include/access/nbtree.h | 88 ++++++++++---
src/include/access/nbtxlog.h | 28 +++--
src/include/storage/standby.h | 2 +
src/backend/access/gist/gistxlog.c | 24 +---
src/backend/access/nbtree/nbtinsert.c | 24 ++--
src/backend/access/nbtree/nbtpage.c | 170 ++++++++++++++------------
src/backend/access/nbtree/nbtree.c | 133 ++++++++++----------
src/backend/access/nbtree/nbtsearch.c | 6 +-
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/nbtree/nbtxlog.c | 39 +++---
src/backend/access/rmgrdesc/nbtdesc.c | 17 +--
src/backend/storage/ipc/standby.c | 28 +++++
contrib/amcheck/verify_nbtree.c | 76 +++++++-----
contrib/pageinspect/btreefuncs.c | 65 +++++++---
contrib/pgstattuple/pgstatindex.c | 8 +-
15 files changed, 432 insertions(+), 278 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index cad4f2bdeb..7b6a897e4a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -37,8 +37,9 @@ typedef uint16 BTCycleId;
*
* In addition, we store the page's btree level (counting upwards from
* zero at a leaf page) as well as some flag bits indicating the page type
- * and status. If the page is deleted, we replace the level with the
- * next-transaction-ID value indicating when it is safe to reclaim the page.
+ * and status. If the page is deleted, a BTDeletedPageContents struct is
+ * stored in the page's tuple area, while a standard BTPageOpaqueData struct
+ * is stored in the page special area.
*
* We also store a "vacuum cycle ID". When a page is split while VACUUM is
* processing the index, a nonzero value associated with the VACUUM run is
@@ -52,17 +53,24 @@ typedef uint16 BTCycleId;
*
* NOTE: the BTP_LEAF flag bit is redundant since level==0 could be tested
* instead.
+ *
+ * NOTE: the btpo_level field used to be a union type in order to allow
+ * deleted pages to store a 32-bit safexid in space that is now used only
+ * for the page level. PostgreSQL 14+ consistently maintains the BTP_LEAF
+ * flag, as well as the btpo_level field, which can be useful during
+ * testing and analysis.
+ *
+ * (Actually, that's not quite true. It's still possible for a pg_upgrade'd
+ * database to have a BTP_DELETED page that's not marked BTP_HAS_FULLXID, in
+ * which case btpo_level will not in fact store the page level. This limited
+ * exception is inconsequential -- we simply assume that such a page is safe
+ * to recycle anyway.)
*/
typedef struct BTPageOpaqueData
{
BlockNumber btpo_prev; /* left sibling, or P_NONE if leftmost */
BlockNumber btpo_next; /* right sibling, or P_NONE if rightmost */
- union
- {
- uint32 level; /* tree level --- zero for leaf pages */
- TransactionId xact; /* next transaction ID, if deleted */
- } btpo;
+ uint32 btpo_level; /* tree level --- zero for leaf pages */
uint16 btpo_flags; /* flag bits, see below */
BTCycleId btpo_cycleid; /* vacuum cycle ID of latest split */
} BTPageOpaqueData;
@@ -78,6 +86,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples (deprecated) */
#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
+#define BTP_HAS_FULLXID (1 << 8) /* page has a BTDeletedPageContents */
/*
* The max allowed value of a cycle ID is a bit less than 64K. This is
@@ -105,10 +114,12 @@ typedef struct BTMetaPageData
BlockNumber btm_fastroot; /* current "fast" root location */
uint32 btm_fastlevel; /* tree level of the "fast" root page */
/* remaining fields only valid when btm_version >= BTREE_NOVAC_VERSION */
- TransactionId btm_oldest_btpo_xact; /* oldest btpo_xact among all deleted
- * pages */
- float8 btm_last_cleanup_num_heap_tuples; /* number of heap tuples
- * during last cleanup */
+
+ /* number of deleted, non-recyclable pages during last cleanup */
+ uint32 btm_last_cleanup_num_delpages;
+ /* number of heap tuples during last cleanup */
+ float8 btm_last_cleanup_num_heap_tuples;
+
bool btm_allequalimage; /* are all columns "equalimage"? */
} BTMetaPageData;
@@ -220,6 +231,55 @@ typedef struct BTMetaPageData
#define P_IGNORE(opaque) (((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD)) != 0)
#define P_HAS_GARBAGE(opaque) (((opaque)->btpo_flags & BTP_HAS_GARBAGE) != 0)
#define P_INCOMPLETE_SPLIT(opaque) (((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0)
+#define P_HAS_FULLXID(opaque) (((opaque)->btpo_flags & BTP_HAS_FULLXID) != 0)
+
+/*
+ * On a deleted page, we store this struct. A deleted page doesn't contain
+ * any tuples, so we don't use the normal page layout with line pointers.
+ * Instead, this struct is stored right after the standard page header.
+ */
+typedef struct BTDeletedPageContents
+{
+ /* last xid which could see the page in a scan */
+ FullTransactionId safexid;
+} BTDeletedPageContents;
+
+static inline void
+BTPageSetDeleted(Page page, FullTransactionId safexid)
+{
+ BTPageOpaque opaque;
+ PageHeader header;
+ BTDeletedPageContents *contents;
+
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ header = ((PageHeader) page);
+
+ opaque->btpo_flags &= ~BTP_HALF_DEAD;
+ opaque->btpo_flags |= BTP_DELETED | BTP_HAS_FULLXID;
+ header->pd_lower =
+ MAXALIGN(SizeOfPageHeaderData) + sizeof(BTDeletedPageContents);
+ header->pd_upper = header->pd_special;
+
+ /* Set safexid */
+ contents = ((BTDeletedPageContents *) PageGetContents(page));
+ contents->safexid = safexid;
+}
+
+static inline FullTransactionId
+BTPageGetDeleteXid(Page page)
+{
+ BTPageOpaque opaque PG_USED_FOR_ASSERTS_ONLY;
+ BTDeletedPageContents *contents;
+
+ /* pg_upgrade'd indexes with old BTP_DELETED pages should not call here */
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ Assert(P_ISDELETED(opaque) && !P_ISHALFDEAD(opaque) &&
+ P_HAS_FULLXID(opaque));
+
+ /* Get safexid */
+ contents = ((BTDeletedPageContents *) PageGetContents(page));
+ return contents->safexid;
+}
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -1067,7 +1127,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page origpage,
extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
bool allequalimage);
extern void _bt_update_meta_cleanup_info(Relation rel,
- TransactionId oldestBtpoXact, float8 numHeapTuples);
+ BlockNumber pages_deleted_not_recycled,
+ float8 numHeapTuples);
extern void _bt_upgrademetapage(Page page);
extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
@@ -1091,8 +1152,7 @@ extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
Relation heapRel,
TM_IndexDeleteOp *delstate);
-extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf,
- TransactionId *oldestBtpoXact);
+extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 7ae5c98c2b..5f2bfd3b27 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -13,6 +13,7 @@
#ifndef NBTXLOG_H
#define NBTXLOG_H
+#include "access/transam.h"
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
#include "storage/off.h"
@@ -52,7 +53,7 @@ typedef struct xl_btree_metadata
uint32 level;
BlockNumber fastroot;
uint32 fastlevel;
- TransactionId oldest_btpo_xact;
+ uint32 last_cleanup_num_delpages;
float8 last_cleanup_num_heap_tuples;
bool allequalimage;
} xl_btree_metadata;
@@ -187,7 +188,7 @@ typedef struct xl_btree_reuse_page
{
RelFileNode node;
BlockNumber block;
- TransactionId latestRemovedXid;
+ FullTransactionId latestRemovedFullXid;
} xl_btree_reuse_page;
#define SizeOfBtreeReusePage (sizeof(xl_btree_reuse_page))
@@ -282,9 +283,12 @@ typedef struct xl_btree_mark_page_halfdead
#define SizeOfBtreeMarkPageHalfDead (offsetof(xl_btree_mark_page_halfdead, topparent) + sizeof(BlockNumber))
/*
- * This is what we need to know about deletion of a btree page. Note we do
- * not store any content for the deleted page --- it is just rewritten as empty
- * during recovery, apart from resetting the btpo.xact.
+ * This is what we need to know about deletion of a btree page. Note that we
+ * only leave behind a small amount of bookkeeping information in deleted
+ * pages (deleted pages must be kept around as tombstones for a while). It is
+ * convenient for the REDO routine to regenerate its target page from scratch.
+ * This is why WAL record describes certain details that are actually directly
+ * available from the target page.
*
* Backup Blk 0: target block being deleted
* Backup Blk 1: target block's left sibling, if any
@@ -296,20 +300,24 @@ typedef struct xl_btree_unlink_page
{
BlockNumber leftsib; /* target block's left sibling, if any */
BlockNumber rightsib; /* target block's right sibling */
+ uint32 level; /* target block's level */
/*
- * Information needed to recreate the leaf page, when target is an
- * internal page.
+ * Information needed to recreate a half-dead leaf page with correct
+ * topparent link. The fields are only used when deletion operation's
+ * target page is an internal page. REDO routine creates half-dead page
+ * from scratch to keep things simple (this is the same convenient
+ * approach used for the target page itself).
*/
BlockNumber leafleftsib;
BlockNumber leafrightsib;
- BlockNumber topparent; /* next child down in the subtree */
+ BlockNumber topparent;
- TransactionId btpo_xact; /* value of btpo.xact for use in recovery */
+ FullTransactionId safexid; /* BTPageSetDeleted() value */
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_UNLINK_PAGE_META */
} xl_btree_unlink_page;
-#define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, btpo_xact) + sizeof(TransactionId))
+#define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, safexid) + sizeof(FullTransactionId))
/*
* New root log record. There are zero tuples if this is to establish an
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 94d33851d0..38fd85a431 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -31,6 +31,8 @@ extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
RelFileNode node);
+extern void ResolveRecoveryConflictWithSnapshotFullXid(FullTransactionId latestRemovedFullXid,
+ RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index f2eda79bc1..1c80eae044 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -394,28 +394,8 @@ gistRedoPageReuse(XLogReaderState *record)
* same exclusion effect on primary and standby.
*/
if (InHotStandby)
- {
- FullTransactionId latestRemovedFullXid = xlrec->latestRemovedFullXid;
- FullTransactionId nextXid = ReadNextFullTransactionId();
- uint64 diff;
-
- /*
- * ResolveRecoveryConflictWithSnapshot operates on 32-bit
- * TransactionIds, so truncate the logged FullTransactionId. If the
- * logged value is very old, so that XID wrap-around already happened
- * on it, there can't be any snapshots that still see it.
- */
- diff = U64FromFullTransactionId(nextXid) -
- U64FromFullTransactionId(latestRemovedFullXid);
- if (diff < MaxTransactionId / 2)
- {
- TransactionId latestRemovedXid;
-
- latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
- xlrec->node);
- }
- }
+ ResolveRecoveryConflictWithSnapshotFullXid(xlrec->latestRemovedFullXid,
+ xlrec->node);
}
void
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e333603912..1edb9f9579 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1241,7 +1241,7 @@ _bt_insertonpg(Relation rel,
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
- if (metad->btm_fastlevel >= opaque->btpo.level)
+ if (metad->btm_fastlevel >= opaque->btpo_level)
{
/* no update wanted */
_bt_relbuf(rel, metabuf);
@@ -1268,7 +1268,7 @@ _bt_insertonpg(Relation rel,
if (metad->btm_version < BTREE_NOVAC_VERSION)
_bt_upgrademetapage(metapg);
metad->btm_fastroot = BufferGetBlockNumber(buf);
- metad->btm_fastlevel = opaque->btpo.level;
+ metad->btm_fastlevel = opaque->btpo_level;
MarkBufferDirty(metabuf);
}
@@ -1331,7 +1331,7 @@ _bt_insertonpg(Relation rel,
xlmeta.level = metad->btm_level;
xlmeta.fastroot = metad->btm_fastroot;
xlmeta.fastlevel = metad->btm_fastlevel;
- xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+ xlmeta.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
xlmeta.last_cleanup_num_heap_tuples =
metad->btm_last_cleanup_num_heap_tuples;
xlmeta.allequalimage = metad->btm_allequalimage;
@@ -1537,7 +1537,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
lopaque->btpo_prev = oopaque->btpo_prev;
/* handle btpo_next after rightpage buffer acquired */
- lopaque->btpo.level = oopaque->btpo.level;
+ lopaque->btpo_level = oopaque->btpo_level;
/* handle btpo_cycleid after rightpage buffer acquired */
/*
@@ -1722,7 +1722,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
ropaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
ropaque->btpo_prev = origpagenumber;
ropaque->btpo_next = oopaque->btpo_next;
- ropaque->btpo.level = oopaque->btpo.level;
+ ropaque->btpo_level = oopaque->btpo_level;
ropaque->btpo_cycleid = lopaque->btpo_cycleid;
/*
@@ -1950,7 +1950,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
uint8 xlinfo;
XLogRecPtr recptr;
- xlrec.level = ropaque->btpo.level;
+ xlrec.level = ropaque->btpo_level;
/* See comments below on newitem, orignewitem, and posting lists */
xlrec.firstrightoff = firstrightoff;
xlrec.newitemoff = newitemoff;
@@ -2142,7 +2142,7 @@ _bt_insert_parent(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* Find the leftmost page at the next level up */
- pbuf = _bt_get_endpoint(rel, opaque->btpo.level + 1, false, NULL);
+ pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL);
/* Set up a phony stack entry pointing there */
stack = &fakestack;
stack->bts_blkno = BufferGetBlockNumber(pbuf);
@@ -2480,15 +2480,15 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
rootopaque->btpo_flags = BTP_ROOT;
- rootopaque->btpo.level =
- ((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo.level + 1;
+ rootopaque->btpo_level =
+ ((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo_level + 1;
rootopaque->btpo_cycleid = 0;
/* update metapage data */
metad->btm_root = rootblknum;
- metad->btm_level = rootopaque->btpo.level;
+ metad->btm_level = rootopaque->btpo_level;
metad->btm_fastroot = rootblknum;
- metad->btm_fastlevel = rootopaque->btpo.level;
+ metad->btm_fastlevel = rootopaque->btpo_level;
/*
* Insert the left page pointer into the new root page. The root page is
@@ -2548,7 +2548,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
md.level = metad->btm_level;
md.fastroot = rootblknum;
md.fastlevel = metad->btm_level;
- md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+ md.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
md.allequalimage = metad->btm_allequalimage;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index ac264a5952..00aea725cb 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -37,7 +37,7 @@
static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
- TransactionId latestRemovedXid);
+ FullTransactionId latestRemovedFullXid);
static void _bt_delitems_delete(Relation rel, Buffer buf,
TransactionId latestRemovedXid,
OffsetNumber *deletable, int ndeletable,
@@ -50,7 +50,6 @@ static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
BlockNumber scanblkno,
bool *rightsib_empty,
- TransactionId *oldestBtpoXact,
uint32 *ndeleted);
static bool _bt_lock_subtree_parent(Relation rel, BlockNumber child,
BTStack stack,
@@ -78,7 +77,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
metad->btm_level = level;
metad->btm_fastroot = rootbknum;
metad->btm_fastlevel = level;
- metad->btm_oldest_btpo_xact = InvalidTransactionId;
+ metad->btm_last_cleanup_num_delpages = 0;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
metad->btm_allequalimage = allequalimage;
@@ -118,7 +117,7 @@ _bt_upgrademetapage(Page page)
/* Set version number and fill extra fields added into version 3 */
metad->btm_version = BTREE_NOVAC_VERSION;
- metad->btm_oldest_btpo_xact = InvalidTransactionId;
+ metad->btm_last_cleanup_num_delpages = 0;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
/* Only a REINDEX can set this field */
Assert(!metad->btm_allequalimage);
@@ -176,7 +175,8 @@ _bt_getmeta(Relation rel, Buffer metabuf)
* to those written in the metapage. On mismatch, metapage is overwritten.
*/
void
-_bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
+_bt_update_meta_cleanup_info(Relation rel,
+ BlockNumber pages_deleted_not_recycled,
float8 numHeapTuples)
{
Buffer metabuf;
@@ -185,6 +185,9 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
bool needsRewrite = false;
XLogRecPtr recptr;
+ StaticAssertStmt(sizeof(BlockNumber) == sizeof(TransactionId),
+ "on-disk compatibility assumption violated");
+
/* read the metapage and check if it needs rewrite */
metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
@@ -193,8 +196,9 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
/* outdated version of metapage always needs rewrite */
if (metad->btm_version < BTREE_NOVAC_VERSION)
needsRewrite = true;
- else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
- metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
+ else if (metad->btm_last_cleanup_num_delpages != pages_deleted_not_recycled)
+ needsRewrite = true;
+ else if (metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
needsRewrite = true;
if (!needsRewrite)
@@ -214,7 +218,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
_bt_upgrademetapage(metapg);
/* update cleanup-related information */
- metad->btm_oldest_btpo_xact = oldestBtpoXact;
+ metad->btm_last_cleanup_num_delpages = pages_deleted_not_recycled;
metad->btm_last_cleanup_num_heap_tuples = numHeapTuples;
MarkBufferDirty(metabuf);
@@ -232,7 +236,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
md.level = metad->btm_level;
md.fastroot = metad->btm_fastroot;
md.fastlevel = metad->btm_fastlevel;
- md.oldest_btpo_xact = oldestBtpoXact;
+ /* XXX last_cleanup_num_delpages is actually pages_deleted_not_recycled */
+ md.last_cleanup_num_delpages = pages_deleted_not_recycled;
md.last_cleanup_num_heap_tuples = numHeapTuples;
md.allequalimage = metad->btm_allequalimage;
@@ -316,7 +321,7 @@ _bt_getroot(Relation rel, int access)
* because that's not set in a "fast root".
*/
if (!P_IGNORE(rootopaque) &&
- rootopaque->btpo.level == rootlevel &&
+ rootopaque->btpo_level == rootlevel &&
P_LEFTMOST(rootopaque) &&
P_RIGHTMOST(rootopaque))
{
@@ -377,7 +382,7 @@ _bt_getroot(Relation rel, int access)
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
rootopaque->btpo_flags = (BTP_LEAF | BTP_ROOT);
- rootopaque->btpo.level = 0;
+ rootopaque->btpo_level = 0;
rootopaque->btpo_cycleid = 0;
/* Get raw page pointer for metapage */
metapg = BufferGetPage(metabuf);
@@ -393,7 +398,7 @@ _bt_getroot(Relation rel, int access)
metad->btm_level = 0;
metad->btm_fastroot = rootblkno;
metad->btm_fastlevel = 0;
- metad->btm_oldest_btpo_xact = InvalidTransactionId;
+ metad->btm_last_cleanup_num_delpages = 0;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
MarkBufferDirty(rootbuf);
@@ -416,7 +421,7 @@ _bt_getroot(Relation rel, int access)
md.level = 0;
md.fastroot = rootblkno;
md.fastlevel = 0;
- md.oldest_btpo_xact = InvalidTransactionId;
+ md.last_cleanup_num_delpages = 0;
md.last_cleanup_num_heap_tuples = -1.0;
md.allequalimage = metad->btm_allequalimage;
@@ -481,11 +486,11 @@ _bt_getroot(Relation rel, int access)
rootblkno = rootopaque->btpo_next;
}
- /* Note: can't check btpo.level on deleted pages */
- if (rootopaque->btpo.level != rootlevel)
+ /* Note: can't check btpo_level from !P_HAS_FULLXID() deleted page */
+ if (rootopaque->btpo_level != rootlevel)
elog(ERROR, "root page %u of index \"%s\" has level %u, expected %u",
rootblkno, RelationGetRelationName(rel),
- rootopaque->btpo.level, rootlevel);
+ rootopaque->btpo_level, rootlevel);
}
/*
@@ -585,11 +590,11 @@ _bt_gettrueroot(Relation rel)
rootblkno = rootopaque->btpo_next;
}
- /* Note: can't check btpo.level on deleted pages */
- if (rootopaque->btpo.level != rootlevel)
+ /* Note: can't check btpo_level from !P_HAS_FULLXID() deleted page */
+ if (rootopaque->btpo_level != rootlevel)
elog(ERROR, "root page %u of index \"%s\" has level %u, expected %u",
rootblkno, RelationGetRelationName(rel),
- rootopaque->btpo.level, rootlevel);
+ rootopaque->btpo_level, rootlevel);
return rootbuf;
}
@@ -762,7 +767,8 @@ _bt_checkpage(Relation rel, Buffer buf)
* Log the reuse of a page from the FSM.
*/
static void
-_bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedXid)
+_bt_log_reuse_page(Relation rel, BlockNumber blkno,
+ FullTransactionId latestRemovedFullXid)
{
xl_btree_reuse_page xlrec_reuse;
@@ -775,7 +781,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
/* XLOG stuff */
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
- xlrec_reuse.latestRemovedXid = latestRemovedXid;
+ xlrec_reuse.latestRemovedFullXid = latestRemovedFullXid;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec_reuse, SizeOfBtreeReusePage);
@@ -862,17 +868,18 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* If we are generating WAL for Hot Standby then create a
* WAL record that will allow us to conflict with queries
* running on standby, in case they have snapshots older
- * than btpo.xact. This can only apply if the page does
- * have a valid btpo.xact value, ie not if it's new. (We
- * must check that because an all-zero page has no special
- * space.)
+ * than safexid value returned by BTPageGetDeleteXid().
+ * This can only apply if the page does have a valid
+ * safexid value, ie not if it's new. (We must check that
+ * because an all-zero page has no special space.)
*/
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel) &&
!PageIsNew(page))
{
- BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ FullTransactionId latestRemovedFullXid;
- _bt_log_reuse_page(rel, blkno, opaque->btpo.xact);
+ latestRemovedFullXid = BTPageGetDeleteXid(page);
+ _bt_log_reuse_page(rel, blkno, latestRemovedFullXid);
}
/* Okay to use page. Re-initialize and return it */
@@ -1101,9 +1108,31 @@ _bt_page_recyclable(Page page)
* interested in it.
*/
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- if (P_ISDELETED(opaque) &&
- GlobalVisCheckRemovableXid(NULL, opaque->btpo.xact))
- return true;
+ if (P_ISDELETED(opaque))
+ {
+ /*
+ * If this is a pg_upgrade'd index, then this could be a deleted page
+ * whose XID (which is stored in the special area's level field via type
+ * punning) is a non-full 32-bit value. It's safe to just assume that
+ * we can recycle because the system must have been restarted since
+ * the time of deletion.
+ */
+ if (!P_HAS_FULLXID(opaque))
+ return true;
+
+ /*
+ * The page was deleted, but when? If it was just deleted, a scan
+ * might have seen the downlink to it, and will read the page later.
+ * As long as that can happen, we must keep the deleted page around as
+ * a tombstone.
+ *
+ * For that, check if the deletion XID could still be visible to
+ * anyone. If not, then no scan that's still in progress could have
+ * seen its downlink, and we can recycle it.
+ */
+ return GlobalVisCheckRemovableFullXid(NULL, BTPageGetDeleteXid(page));
+ }
+
return false;
}
@@ -1768,16 +1797,12 @@ _bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
* that the btvacuumscan scan has yet to reach; they'll get counted later
* instead.
*
- * Maintains *oldestBtpoXact for any pages that get deleted. Caller is
- * responsible for maintaining *oldestBtpoXact in the case of pages that were
- * deleted by a previous VACUUM.
- *
* NOTE: this leaks memory. Rather than trying to clean up everything
* carefully, it's better to run it in a temp context that can be reset
* frequently.
*/
uint32
-_bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
+_bt_pagedel(Relation rel, Buffer leafbuf)
{
uint32 ndeleted = 0;
BlockNumber rightsib;
@@ -1985,8 +2010,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
{
/* Check for interrupts in _bt_unlink_halfdead_page */
if (!_bt_unlink_halfdead_page(rel, leafbuf, scanblkno,
- &rightsib_empty, oldestBtpoXact,
- &ndeleted))
+ &rightsib_empty, &ndeleted))
{
/*
* _bt_unlink_halfdead_page should never fail, since we
@@ -2001,9 +2025,8 @@ _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
}
}
- Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque));
- Assert(TransactionIdFollowsOrEquals(opaque->btpo.xact,
- *oldestBtpoXact));
+ Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque) &&
+ P_HAS_FULLXID(opaque));
rightsib = opaque->btpo_next;
@@ -2264,12 +2287,6 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
* containing leafbuf. (We always set *rightsib_empty for caller, just to be
* consistent.)
*
- * We maintain *oldestBtpoXact for pages that are deleted by the current
- * VACUUM operation here. This must be handled here because we conservatively
- * assume that there needs to be a new call to ReadNewTransactionId() each
- * time a page gets deleted. See comments about the underlying assumption
- * below.
- *
* Must hold pin and lock on leafbuf at entry (read or write doesn't matter).
* On success exit, we'll be holding pin and write lock. On failure exit,
* we'll release both pin and lock before returning (we define it that way
@@ -2277,8 +2294,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
*/
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
- bool *rightsib_empty, TransactionId *oldestBtpoXact,
- uint32 *ndeleted)
+ bool *rightsib_empty, uint32 *ndeleted)
{
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
BlockNumber leafleftsib;
@@ -2294,12 +2310,12 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
BTMetaPageData *metad = NULL;
ItemId itemid;
Page page;
- PageHeader header;
BTPageOpaque opaque;
+ FullTransactionId safexid;
bool rightsib_is_rightmost;
- int targetlevel;
+ uint32 targetlevel;
IndexTuple leafhikey;
- BlockNumber nextchild;
+ BlockNumber topparent_in_target;
page = BufferGetPage(leafbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2343,7 +2359,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
leftsib = opaque->btpo_prev;
- targetlevel = opaque->btpo.level;
+ targetlevel = opaque->btpo_level;
Assert(targetlevel > 0);
/*
@@ -2450,7 +2466,9 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
!P_ISLEAF(opaque) || !P_ISHALFDEAD(opaque))
elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
target, RelationGetRelationName(rel));
- nextchild = InvalidBlockNumber;
+
+ /* Leaf page is also target page: don't set topparent */
+ topparent_in_target = InvalidBlockNumber;
}
else
{
@@ -2459,11 +2477,13 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
target, RelationGetRelationName(rel));
- /* Remember the next non-leaf child down in the subtree */
+ /* Internal page is target: we'll set topparent in leaf page... */
itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
- nextchild = BTreeTupleGetDownLink((IndexTuple) PageGetItem(page, itemid));
- if (nextchild == leafblkno)
- nextchild = InvalidBlockNumber;
+ topparent_in_target =
+ BTreeTupleGetTopParent((IndexTuple) PageGetItem(page, itemid));
+ /* ...except when it would be a redundant pointer-to-self */
+ if (topparent_in_target == leafblkno)
+ topparent_in_target = InvalidBlockNumber;
}
/*
@@ -2553,13 +2573,13 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
* no lock was held.
*/
if (target != leafblkno)
- BTreeTupleSetTopParent(leafhikey, nextchild);
+ BTreeTupleSetTopParent(leafhikey, topparent_in_target);
/*
* Mark the page itself deleted. It can be recycled when all current
* transactions are gone. Storing GetTopTransactionId() would work, but
* we're in VACUUM and would not otherwise have an XID. Having already
- * updated links to the target, ReadNewTransactionId() suffices as an
+ * updated links to the target, ReadNextFullTransactionId() suffices as an
* upper bound. Any scan having retained a now-stale link is advertising
* in its PGPROC an xmin less than or equal to the value we read here. It
* will continue to do so, holding back the xmin horizon, for the duration
@@ -2568,17 +2588,14 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
Assert(P_ISHALFDEAD(opaque) || !P_ISLEAF(opaque));
- opaque->btpo_flags &= ~BTP_HALF_DEAD;
- opaque->btpo_flags |= BTP_DELETED;
- opaque->btpo.xact = ReadNewTransactionId();
/*
- * Remove the remaining tuples on the page. This keeps things simple for
- * WAL consistency checking.
+ * Store upper bound XID that's used to determine when deleted page is no
+ * longer needed as a tombstone
*/
- header = (PageHeader) page;
- header->pd_lower = SizeOfPageHeaderData;
- header->pd_upper = header->pd_special;
+ safexid = ReadNextFullTransactionId();
+ BTPageSetDeleted(page, safexid);
+ opaque->btpo_cycleid = 0;
/* And update the metapage, if needed */
if (BufferIsValid(metabuf))
@@ -2616,15 +2633,16 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (target != leafblkno)
XLogRegisterBuffer(3, leafbuf, REGBUF_WILL_INIT);
- /* information on the unlinked block */
+ /* information stored on the target/to-be-unlinked block */
xlrec.leftsib = leftsib;
xlrec.rightsib = rightsib;
- xlrec.btpo_xact = opaque->btpo.xact;
+ xlrec.level = targetlevel;
+ xlrec.safexid = safexid;
/* information needed to recreate the leaf block (if not the target) */
xlrec.leafleftsib = leafleftsib;
xlrec.leafrightsib = leafrightsib;
- xlrec.topparent = nextchild;
+ xlrec.topparent = topparent_in_target;
XLogRegisterData((char *) &xlrec, SizeOfBtreeUnlinkPage);
@@ -2638,7 +2656,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
xlmeta.level = metad->btm_level;
xlmeta.fastroot = metad->btm_fastroot;
xlmeta.fastlevel = metad->btm_fastlevel;
- xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+ xlmeta.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
xlmeta.allequalimage = metad->btm_allequalimage;
@@ -2681,9 +2699,9 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_relbuf(rel, lbuf);
_bt_relbuf(rel, rbuf);
- if (!TransactionIdIsValid(*oldestBtpoXact) ||
- TransactionIdPrecedes(opaque->btpo.xact, *oldestBtpoXact))
- *oldestBtpoXact = opaque->btpo.xact;
+ /* If the target is not leafbuf, we're done with it now -- release it */
+ if (target != leafblkno)
+ _bt_relbuf(rel, buf);
/*
* If btvacuumscan won't revisit this page in a future btvacuumpage call
@@ -2693,10 +2711,6 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (target <= scanblkno)
(*ndeleted)++;
- /* If the target is not leafbuf, we're done with it now -- release it */
- if (target != leafblkno)
- _bt_relbuf(rel, buf);
-
return true;
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 289bd3c15d..a9dc9c48dc 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -47,7 +47,6 @@ typedef struct
void *callback_state;
BTCycleId cycleid;
BlockNumber totFreePages; /* true total # of free pages */
- TransactionId oldestBtpoXact;
MemoryContext pagedelcontext;
} BTVacState;
@@ -802,66 +801,69 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
Buffer metabuf;
Page metapg;
BTMetaPageData *metad;
- bool result = false;
+ BTOptions *relopts;
+ float8 cleanup_scale_factor;
+ uint32 btm_version;
+ BlockNumber prev_pages_deleted_not_recycled;
+ float8 prev_num_heap_tuples;
+ /*
+ * Copy details from metapage to local variables quickly.
+ *
+ * Note that we deliberately avoid using cached version of metapage here.
+ */
metabuf = _bt_getbuf(info->index, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
+ btm_version = metad->btm_version;
+
+ if (btm_version < BTREE_NOVAC_VERSION)
+ {
+ /*
+ * Metapage needs to be dynamically upgraded to store fields that are
+ * only present when btm_version >= BTREE_NOVAC_VERSION
+ */
+ _bt_relbuf(info->index, metabuf);
+ return true;
+ }
+
+ prev_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ prev_pages_deleted_not_recycled = metad->btm_last_cleanup_num_delpages;
+ _bt_relbuf(info->index, metabuf);
/*
- * XXX: If IndexVacuumInfo contained the heap relation, we could be more
- * aggressive about vacuuming non catalog relations by passing the table
- * to GlobalVisCheckRemovableXid().
+ * If table receives enough insertions and no cleanup was performed,
+ * then index would appear have stale statistics. If scale factor is
+ * set, we avoid that by performing cleanup if the number of inserted
+ * tuples exceeds vacuum_cleanup_index_scale_factor fraction of
+ * original tuples count.
*/
+ relopts = (BTOptions *) info->index->rd_options;
+ cleanup_scale_factor = (relopts &&
+ relopts->vacuum_cleanup_index_scale_factor >= 0)
+ ? relopts->vacuum_cleanup_index_scale_factor
+ : vacuum_cleanup_index_scale_factor;
- if (metad->btm_version < BTREE_NOVAC_VERSION)
- {
- /*
- * Do cleanup if metapage needs upgrade, because we don't have
- * cleanup-related meta-information yet.
- */
- result = true;
- }
- else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
- GlobalVisCheckRemovableXid(NULL, metad->btm_oldest_btpo_xact))
- {
- /*
- * If any oldest btpo.xact from a previously deleted page in the index
- * is visible to everyone, then at least one deleted page can be
- * recycled -- don't skip cleanup.
- */
- result = true;
- }
- else
- {
- BTOptions *relopts;
- float8 cleanup_scale_factor;
- float8 prev_num_heap_tuples;
+ if (cleanup_scale_factor <= 0 ||
+ info->num_heap_tuples < 0 ||
+ prev_num_heap_tuples <= 0 ||
+ (info->num_heap_tuples - prev_num_heap_tuples) /
+ prev_num_heap_tuples >= cleanup_scale_factor)
+ return true;
- /*
- * If table receives enough insertions and no cleanup was performed,
- * then index would appear have stale statistics. If scale factor is
- * set, we avoid that by performing cleanup if the number of inserted
- * tuples exceeds vacuum_cleanup_index_scale_factor fraction of
- * original tuples count.
- */
- relopts = (BTOptions *) info->index->rd_options;
- cleanup_scale_factor = (relopts &&
- relopts->vacuum_cleanup_index_scale_factor >= 0)
- ? relopts->vacuum_cleanup_index_scale_factor
- : vacuum_cleanup_index_scale_factor;
- prev_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ /*
+ * Trigger cleanup in rare cases where prev_pages_deleted_not_recycled
+ * exceeds a significant fraction of the total size of the index. We can
+ * reasonably expect (though are not guaranteed) to be able to recycle
+ * this many pages during cleanup-only btvacuumscan call. This alone
+ * might be reason enough to proceed with btvacuumscan call.
+ */
+ Assert(!info->analyze_only);
+ if (prev_pages_deleted_not_recycled >
+ RelationGetNumberOfBlocks(info->index) / 20)
+ return true;
- if (cleanup_scale_factor <= 0 ||
- info->num_heap_tuples < 0 ||
- prev_num_heap_tuples <= 0 ||
- (info->num_heap_tuples - prev_num_heap_tuples) /
- prev_num_heap_tuples >= cleanup_scale_factor)
- result = true;
- }
-
- _bt_relbuf(info->index, metabuf);
- return result;
+ return false;
}
/*
@@ -973,6 +975,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BlockNumber num_pages;
BlockNumber scanblkno;
bool needLock;
+ BlockNumber pages_deleted_not_recycled;
/*
* Reset counts that will be incremented during the scan; needed in case
@@ -989,7 +992,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
vstate.callback_state = callback_state;
vstate.cycleid = cycleid;
vstate.totFreePages = 0;
- vstate.oldestBtpoXact = InvalidTransactionId;
/* Create a temporary memory context to run _bt_pagedel in */
vstate.pagedelcontext = AllocSetContextCreate(CurrentMemoryContext,
@@ -1066,18 +1068,16 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexFreeSpaceMapVacuum(rel);
/*
- * Maintain the oldest btpo.xact and a count of the current number of heap
- * tuples in the metapage (for the benefit of _bt_vacuum_needs_cleanup).
+ * Maintain the count of the current number of heap tuples in the
+ * metapage. Also maintain the last pages_deleted_not_recycled. Both
+ * values are used within _bt_vacuum_needs_cleanup.
*
- * The page with the oldest btpo.xact is typically a page deleted by this
- * VACUUM operation, since pages deleted by a previous VACUUM operation
- * tend to be placed in the FSM (by the current VACUUM operation) -- such
- * pages are not candidates to be the oldest btpo.xact. (Note that pages
- * placed in the FSM are reported as deleted pages in the bulk delete
- * statistics, despite not counting as deleted pages for the purposes of
- * determining the oldest btpo.xact.)
+ * pages_deleted_not_recycled is the number of deleted pages now in the
+ * index that were not safe to place in the FSM to be recycled just yet.
*/
- _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
+ pages_deleted_not_recycled = stats->pages_deleted - vstate.totFreePages;
+ Assert(stats->pages_deleted >= vstate.totFreePages);
+ _bt_update_meta_cleanup_info(rel, pages_deleted_not_recycled,
info->num_heap_tuples);
/* update statistics */
@@ -1203,17 +1203,12 @@ backtrack:
* recycle yet.
*/
stats->pages_deleted++;
-
- /* Maintain the oldest btpo.xact */
- if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
- TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
- vstate->oldestBtpoXact = opaque->btpo.xact;
}
else if (P_ISHALFDEAD(opaque))
{
/*
* Half-dead leaf page. Try to delete now. Might update
- * oldestBtpoXact and pages_deleted below.
+ * pages_deleted below.
*/
attempt_pagedel = true;
}
@@ -1430,7 +1425,7 @@ backtrack:
* count. There will be no double-counting.
*/
Assert(blkno == scanblkno);
- stats->pages_deleted += _bt_pagedel(rel, buf, &vstate->oldestBtpoXact);
+ stats->pages_deleted += _bt_pagedel(rel, buf);
MemoryContextSwitchTo(oldcontext);
/* pagedel released buffer, so we shouldn't */
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 2e3bda8171..d1177d8772 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -169,7 +169,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* we're on the level 1 and asked to lock leaf page in write mode,
* then lock next page in write mode, because it must be a leaf.
*/
- if (opaque->btpo.level == 1 && access == BT_WRITE)
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
page_access = BT_WRITE;
/* drop the read lock on the page, then acquire one on its child */
@@ -2341,9 +2341,9 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
}
/* Done? */
- if (opaque->btpo.level == level)
+ if (opaque->btpo_level == level)
break;
- if (opaque->btpo.level < level)
+ if (opaque->btpo_level < level)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("btree level %u not found in index \"%s\"",
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5683daa34d..2c4d7f6e25 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -620,7 +620,7 @@ _bt_blnewpage(uint32 level)
/* Initialize BT opaque state */
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_prev = opaque->btpo_next = P_NONE;
- opaque->btpo.level = level;
+ opaque->btpo_level = level;
opaque->btpo_flags = (level > 0) ? 0 : BTP_LEAF;
opaque->btpo_cycleid = 0;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c1d578cc01..b6afe9526e 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -112,7 +112,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
md->btm_fastlevel = xlrec->fastlevel;
/* Cannot log BTREE_MIN_VERSION index metapage without upgrade */
Assert(md->btm_version >= BTREE_NOVAC_VERSION);
- md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
+ md->btm_last_cleanup_num_delpages = xlrec->last_cleanup_num_delpages;
md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
md->btm_allequalimage = xlrec->allequalimage;
@@ -297,7 +297,7 @@ btree_xlog_split(bool newitemonleft, XLogReaderState *record)
ropaque->btpo_prev = origpagenumber;
ropaque->btpo_next = spagenumber;
- ropaque->btpo.level = xlrec->level;
+ ropaque->btpo_level = xlrec->level;
ropaque->btpo_flags = isleaf ? BTP_LEAF : 0;
ropaque->btpo_cycleid = 0;
@@ -773,7 +773,7 @@ btree_xlog_mark_page_halfdead(uint8 info, XLogReaderState *record)
pageop->btpo_prev = xlrec->leftblk;
pageop->btpo_next = xlrec->rightblk;
- pageop->btpo.level = 0;
+ pageop->btpo_level = 0;
pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
pageop->btpo_cycleid = 0;
@@ -802,6 +802,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
BlockNumber leftsib;
BlockNumber rightsib;
+ uint32 level;
+ bool isleaf;
+ FullTransactionId safexid;
Buffer leftbuf;
Buffer target;
Buffer rightbuf;
@@ -810,6 +813,12 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
leftsib = xlrec->leftsib;
rightsib = xlrec->rightsib;
+ level = xlrec->level;
+ isleaf = (level == 0);
+ safexid = xlrec->safexid;
+
+ /* No topparent link for leaf page (level 0) or level 1 */
+ Assert(xlrec->topparent == InvalidBlockNumber || level > 1);
/*
* In normal operation, we would lock all the pages this WAL record
@@ -844,9 +853,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
pageop->btpo_prev = leftsib;
pageop->btpo_next = rightsib;
- pageop->btpo.xact = xlrec->btpo_xact;
- pageop->btpo_flags = BTP_DELETED;
- if (!BlockNumberIsValid(xlrec->topparent))
+ pageop->btpo_level = level;
+ BTPageSetDeleted(page, safexid);
+ if (isleaf)
pageop->btpo_flags |= BTP_LEAF;
pageop->btpo_cycleid = 0;
@@ -892,6 +901,8 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
Buffer leafbuf;
IndexTupleData trunctuple;
+ Assert(!isleaf);
+
leafbuf = XLogInitBufferForRedo(record, 3);
page = (Page) BufferGetPage(leafbuf);
@@ -901,7 +912,7 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
pageop->btpo_prev = xlrec->leafleftsib;
pageop->btpo_next = xlrec->leafrightsib;
- pageop->btpo.level = 0;
+ pageop->btpo_level = 0;
pageop->btpo_cycleid = 0;
/* Add a dummy hikey item */
@@ -942,7 +953,7 @@ btree_xlog_newroot(XLogReaderState *record)
pageop->btpo_flags = BTP_ROOT;
pageop->btpo_prev = pageop->btpo_next = P_NONE;
- pageop->btpo.level = xlrec->level;
+ pageop->btpo_level = xlrec->level;
if (xlrec->level == 0)
pageop->btpo_flags |= BTP_LEAF;
pageop->btpo_cycleid = 0;
@@ -972,17 +983,15 @@ btree_xlog_reuse_page(XLogReaderState *record)
* Btree reuse_page records exist to provide a conflict point when we
* reuse pages in the index via the FSM. That's all they do though.
*
- * latestRemovedXid was the page's btpo.xact. The
- * GlobalVisCheckRemovableXid test in _bt_page_recyclable() conceptually
- * mirrors the pgxact->xmin > limitXmin test in
+ * latestRemovedXid was the page's deleteXid. The
+ * GlobalVisCheckRemovableFullXid(deleteXid) test in _bt_page_recyclable()
+ * conceptually mirrors the PGPROC->xmin > limitXmin test in
* GetConflictingVirtualXIDs(). Consequently, one XID value achieves the
* same exclusion effect on primary and standby.
*/
if (InHotStandby)
- {
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
- }
+ ResolveRecoveryConflictWithSnapshotFullXid(xlrec->latestRemovedFullXid,
+ xlrec->node);
}
void
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 6e0d6a2b72..5cce10a5b6 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -80,9 +80,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) rec;
- appendStringInfo(buf, "left %u; right %u; btpo_xact %u; ",
- xlrec->leftsib, xlrec->rightsib,
- xlrec->btpo_xact);
+ appendStringInfo(buf, "left %u; right %u; level %u; safexid %u:%u; ",
+ xlrec->leftsib, xlrec->rightsib, xlrec->level,
+ EpochFromFullTransactionId(xlrec->safexid),
+ XidFromFullTransactionId(xlrec->safexid));
appendStringInfo(buf, "leafleft %u; leafright %u; topparent %u",
xlrec->leafleftsib, xlrec->leafrightsib,
xlrec->topparent);
@@ -99,9 +100,11 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) rec;
- appendStringInfo(buf, "rel %u/%u/%u; latestRemovedXid %u",
+ appendStringInfo(buf, "rel %u/%u/%u; latestRemovedXid %u:%u",
xlrec->node.spcNode, xlrec->node.dbNode,
- xlrec->node.relNode, xlrec->latestRemovedXid);
+ xlrec->node.relNode,
+ EpochFromFullTransactionId(xlrec->latestRemovedFullXid),
+ XidFromFullTransactionId(xlrec->latestRemovedFullXid));
break;
}
case XLOG_BTREE_META_CLEANUP:
@@ -110,8 +113,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
xlrec = (xl_btree_metadata *) XLogRecGetBlockData(record, 0,
NULL);
- appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f",
- xlrec->oldest_btpo_xact,
+ appendStringInfo(buf, "last_cleanup_num_delpages %u; last_cleanup_num_heap_tuples: %f",
+ xlrec->last_cleanup_num_delpages,
xlrec->last_cleanup_num_heap_tuples);
break;
}
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 39a30c00f7..0eeb766943 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -452,6 +452,34 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
true);
}
+/*
+ * Variant of ResolveRecoveryConflictWithSnapshot that works with
+ * FullTransactionId values
+ */
+void
+ResolveRecoveryConflictWithSnapshotFullXid(FullTransactionId latestRemovedFullXid,
+ RelFileNode node)
+{
+ /*
+ * ResolveRecoveryConflictWithSnapshot operates on 32-bit TransactionIds,
+ * so truncate the logged FullTransactionId. If the logged value is very
+ * old, so that XID wrap-around already happened on it, there can't be any
+ * snapshots that still see it.
+ */
+ FullTransactionId nextXid = ReadNextFullTransactionId();
+ uint64 diff;
+
+ diff = U64FromFullTransactionId(nextXid) -
+ U64FromFullTransactionId(latestRemovedFullXid);
+ if (diff < MaxTransactionId / 2)
+ {
+ TransactionId latestRemovedXid;
+
+ latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid, node);
+ }
+}
+
void
ResolveRecoveryConflictWithTablespace(Oid tsid)
{
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index b8c7793d9e..c184ccb323 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -769,7 +769,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
P_FIRSTDATAKEY(opaque));
itup = (IndexTuple) PageGetItem(state->target, itemid);
nextleveldown.leftmost = BTreeTupleGetDownLink(itup);
- nextleveldown.level = opaque->btpo.level - 1;
+ nextleveldown.level = opaque->btpo_level - 1;
}
else
{
@@ -795,13 +795,13 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
bt_recheck_sibling_links(state, opaque->btpo_prev, leftcurrent);
/* Check level, which must be valid for non-ignorable page */
- if (level.level != opaque->btpo.level)
+ if (level.level != opaque->btpo_level)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("leftmost down link for level points to block in index \"%s\" whose level is not one level down",
RelationGetRelationName(state->rel)),
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
- current, level.level, opaque->btpo.level)));
+ current, level.level, opaque->btpo_level)));
/* Verify invariants for page */
bt_target_page_check(state);
@@ -1167,7 +1167,7 @@ bt_target_page_check(BtreeCheckState *state)
bt_child_highkey_check(state,
offset,
NULL,
- topaque->btpo.level);
+ topaque->btpo_level);
}
continue;
}
@@ -1529,7 +1529,7 @@ bt_target_page_check(BtreeCheckState *state)
if (!P_ISLEAF(topaque) && P_RIGHTMOST(topaque) && state->readonly)
{
bt_child_highkey_check(state, InvalidOffsetNumber,
- NULL, topaque->btpo.level);
+ NULL, topaque->btpo_level);
}
}
@@ -1606,7 +1606,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
ereport(DEBUG1,
(errcode(ERRCODE_NO_DATA),
errmsg("level %u leftmost page of index \"%s\" was found deleted or half dead",
- opaque->btpo.level, RelationGetRelationName(state->rel)),
+ opaque->btpo_level, RelationGetRelationName(state->rel)),
errdetail_internal("Deleted page found when building scankey from right sibling.")));
/* Be slightly more pro-active in freeing this memory, just in case */
@@ -1911,13 +1911,13 @@ bt_child_highkey_check(BtreeCheckState *state,
(uint32) state->targetlsn)));
/* Check level for non-ignorable page */
- if (!P_IGNORE(opaque) && opaque->btpo.level != target_level - 1)
+ if (!P_IGNORE(opaque) && opaque->btpo_level != target_level - 1)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block found while following rightlinks from child of index \"%s\" has invalid level",
RelationGetRelationName(state->rel)),
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
- blkno, target_level - 1, opaque->btpo.level)));
+ blkno, target_level - 1, opaque->btpo_level)));
/* Try to detect circular links */
if ((!first && blkno == state->prevrightlink) || blkno == opaque->btpo_prev)
@@ -2145,7 +2145,7 @@ bt_child_check(BtreeCheckState *state, BTScanInsert targetkey,
* check for downlink connectivity.
*/
bt_child_highkey_check(state, downlinkoffnum,
- child, topaque->btpo.level);
+ child, topaque->btpo_level);
/*
* Since there cannot be a concurrent VACUUM operation in readonly mode,
@@ -2290,7 +2290,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
errmsg("harmless interrupted page split detected in index %s",
RelationGetRelationName(state->rel)),
errdetail_internal("Block=%u level=%u left sibling=%u page lsn=%X/%X.",
- blkno, opaque->btpo.level,
+ blkno, opaque->btpo_level,
opaque->btpo_prev,
(uint32) (pagelsn >> 32),
(uint32) pagelsn)));
@@ -2321,7 +2321,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
elog(DEBUG1, "checking for interrupted multi-level deletion due to missing downlink in index \"%s\"",
RelationGetRelationName(state->rel));
- level = opaque->btpo.level;
+ level = opaque->btpo_level;
itemid = PageGetItemIdCareful(state, blkno, page, P_FIRSTDATAKEY(opaque));
itup = (IndexTuple) PageGetItem(page, itemid);
childblk = BTreeTupleGetDownLink(itup);
@@ -2336,16 +2336,16 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
break;
/* Do an extra sanity check in passing on internal pages */
- if (copaque->btpo.level != level - 1)
+ if (copaque->btpo_level != level - 1)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("downlink points to block in index \"%s\" whose level is not one level down",
RelationGetRelationName(state->rel)),
errdetail_internal("Top parent/under check block=%u block pointed to=%u expected level=%u level in pointed to block=%u.",
blkno, childblk,
- level - 1, copaque->btpo.level)));
+ level - 1, copaque->btpo_level)));
- level = copaque->btpo.level;
+ level = copaque->btpo_level;
itemid = PageGetItemIdCareful(state, childblk, child,
P_FIRSTDATAKEY(copaque));
itup = (IndexTuple) PageGetItem(child, itemid);
@@ -2407,7 +2407,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
errmsg("internal index block lacks downlink in index \"%s\"",
RelationGetRelationName(state->rel)),
errdetail_internal("Block=%u level=%u page lsn=%X/%X.",
- blkno, opaque->btpo.level,
+ blkno, opaque->btpo_level,
(uint32) (pagelsn >> 32),
(uint32) pagelsn)));
}
@@ -3002,21 +3002,26 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
}
/*
- * Deleted pages have no sane "level" field, so can only check non-deleted
- * page level
+ * Deleted pages that still use the old 32-bit XID representation have no
+ * sane "level" field because they type pun the field, but all other pages
+ * (including pages deleted on Postgres 14+) have a valid value.
*/
- if (P_ISLEAF(opaque) && !P_ISDELETED(opaque) && opaque->btpo.level != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("invalid leaf page level %u for block %u in index \"%s\"",
- opaque->btpo.level, blocknum, RelationGetRelationName(state->rel))));
+ if (!P_ISDELETED(opaque) || P_HAS_FULLXID(opaque))
+ {
+ /* Okay, no reason not to trust btpo_level field from page */
- if (!P_ISLEAF(opaque) && !P_ISDELETED(opaque) &&
- opaque->btpo.level == 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("invalid internal page level 0 for block %u in index \"%s\"",
- blocknum, RelationGetRelationName(state->rel))));
+ if (P_ISLEAF(opaque) && opaque->btpo_level != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("invalid leaf page level %u for block %u in index \"%s\"",
+ opaque->btpo_level, blocknum, RelationGetRelationName(state->rel))));
+
+ if (!P_ISLEAF(opaque) && opaque->btpo_level == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("invalid internal page level 0 for block %u in index \"%s\"",
+ blocknum, RelationGetRelationName(state->rel))));
+ }
/*
* Sanity checks for number of items on page.
@@ -3064,7 +3069,8 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
* from version 9.4 on, so do the same here. See _bt_pagedel() for full
* details.
*
- * Internal pages should never have garbage items, either.
+ * Also check that internal pages have no garbage items, and that no page
+ * has an invalid combination of page deletion related page level flags.
*/
if (!P_ISLEAF(opaque) && P_ISHALFDEAD(opaque))
ereport(ERROR,
@@ -3079,6 +3085,18 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
errmsg("internal page block %u in index \"%s\" has garbage items",
blocknum, RelationGetRelationName(state->rel))));
+ if (P_HAS_FULLXID(opaque) && !P_ISDELETED(opaque))
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("full transaction id page flag appears in non-deleted block %u in index \"%s\"",
+ blocknum, RelationGetRelationName(state->rel))));
+
+ if (P_ISDELETED(opaque) && P_ISHALFDEAD(opaque))
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("deleted page block %u in index \"%s\" is half-dead",
+ blocknum, RelationGetRelationName(state->rel))));
+
return page;
}
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8bb180bbbe..dfac1a9716 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -75,11 +75,7 @@ typedef struct BTPageStat
/* opaque data */
BlockNumber btpo_prev;
BlockNumber btpo_next;
- union
- {
- uint32 level;
- TransactionId xact;
- } btpo;
+ uint32 btpo_level;
uint16 btpo_flags;
BTCycleId btpo_cycleid;
} BTPageStat;
@@ -112,9 +108,33 @@ GetBTPageStatistics(BlockNumber blkno, Buffer buffer, BTPageStat *stat)
/* page type (flags) */
if (P_ISDELETED(opaque))
{
- stat->type = 'd';
- stat->btpo.xact = opaque->btpo.xact;
- return;
+ /* We divide deleted pages into leaf ('d') or internal ('D') */
+ if (P_ISLEAF(opaque) || !P_HAS_FULLXID(opaque))
+ stat->type = 'd';
+ else
+ stat->type = 'D';
+
+ /*
+ * Report safexid in a deleted page.
+ *
+ * Handle pg_upgrade'd deleted pages that used the previous safexid
+ * representation in btpo_level field (this used to be a union type
+ * called "bpto").
+ */
+ if (P_HAS_FULLXID(opaque))
+ {
+ FullTransactionId safexid = BTPageGetDeleteXid(page);
+
+ elog(NOTICE, "deleted page from block %u has safexid %u:%u",
+ blkno, EpochFromFullTransactionId(safexid),
+ XidFromFullTransactionId(safexid));
+ }
+ else
+ elog(NOTICE, "deleted page from block %u has safexid %u",
+ blkno, opaque->btpo_level);
+
+ /* Don't interpret BTDeletedPageContents as index tuples */
+ maxoff = InvalidOffsetNumber;
}
else if (P_IGNORE(opaque))
stat->type = 'e';
@@ -128,7 +148,7 @@ GetBTPageStatistics(BlockNumber blkno, Buffer buffer, BTPageStat *stat)
/* btpage opaque data */
stat->btpo_prev = opaque->btpo_prev;
stat->btpo_next = opaque->btpo_next;
- stat->btpo.level = opaque->btpo.level;
+ stat->btpo_level = opaque->btpo_level;
stat->btpo_flags = opaque->btpo_flags;
stat->btpo_cycleid = opaque->btpo_cycleid;
@@ -237,7 +257,8 @@ bt_page_stats_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
values[j++] = psprintf("%u", stat.free_size);
values[j++] = psprintf("%u", stat.btpo_prev);
values[j++] = psprintf("%u", stat.btpo_next);
- values[j++] = psprintf("%u", (stat.type == 'd') ? stat.btpo.xact : stat.btpo.level);
+ /* The "btpo" field now only stores btpo_level, never an xact */
+ values[j++] = psprintf("%u", stat.btpo_level);
values[j++] = psprintf("%d", stat.btpo_flags);
tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
@@ -503,10 +524,14 @@ bt_page_items_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
opaque = (BTPageOpaque) PageGetSpecialPointer(uargs->page);
- if (P_ISDELETED(opaque))
- elog(NOTICE, "page is deleted");
-
- fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ if (!P_ISDELETED(opaque))
+ fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ else
+ {
+ /* Don't interpret BTDeletedPageContents as index tuples */
+ elog(NOTICE, "page from block " INT64_FORMAT " is deleted", blkno);
+ fctx->max_calls = 0;
+ }
uargs->leafpage = P_ISLEAF(opaque);
uargs->rightmost = P_RIGHTMOST(opaque);
@@ -603,7 +628,14 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (P_ISDELETED(opaque))
elog(NOTICE, "page is deleted");
- fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ if (!P_ISDELETED(opaque))
+ fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ else
+ {
+ /* Don't interpret BTDeletedPageContents as index tuples */
+ elog(NOTICE, "page from block is deleted");
+ fctx->max_calls = 0;
+ }
uargs->leafpage = P_ISLEAF(opaque);
uargs->rightmost = P_RIGHTMOST(opaque);
@@ -723,7 +755,8 @@ bt_metap(PG_FUNCTION_ARGS)
*/
if (metad->btm_version >= BTREE_NOVAC_VERSION)
{
- values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
+ /* XXX: btm_last_cleanup_num_delpages used to be btm_oldest_btpo_xact */
+ values[j++] = psprintf("%u", metad->btm_last_cleanup_num_delpages);
values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
values[j++] = metad->btm_allequalimage ? "t" : "f";
}
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index b1ce0d77d7..5368bb30f0 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -283,8 +283,12 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
page = BufferGetPage(buffer);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- /* Determine page type, and update totals */
-
+ /*
+ * Determine page type, and update totals.
+ *
+ * Note that we arbitrarily bucket deleted pages together without
+ * considering if they're leaf pages or internal pages.
+ */
if (P_ISDELETED(opaque))
indexStat.deleted_pages++;
else if (P_IGNORE(opaque))
--
2.27.0
Attachment: v2-0002-Add-pages_newly_deleted-to-VACUUM-VERBOSE.patch (application/octet-stream)
From 5f3561794421c6622f36871e848f55fd574fb501 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 7 Feb 2021 19:24:03 -0800
Subject: [PATCH v2 2/2] Add pages_newly_deleted to VACUUM VERBOSE.
pages_newly_deleted reports on the number of pages deleted by the
current VACUUM operation. The pages_deleted field continues to report
on the total number of deleted pages in the index (as well as pages that
are recyclable due to being zeroed in rare cases), without regard to
whether or not this VACUUM operation deleted them.
---
src/include/access/genam.h | 12 +++++---
src/include/access/nbtree.h | 3 +-
src/backend/access/gin/ginvacuum.c | 1 +
src/backend/access/gist/gistvacuum.c | 2 ++
src/backend/access/heap/vacuumlazy.c | 6 ++--
src/backend/access/nbtree/nbtpage.c | 44 +++++++++++++++++++--------
src/backend/access/nbtree/nbtree.c | 26 +++++++++++-----
src/backend/access/spgist/spgvacuum.c | 1 +
8 files changed, 67 insertions(+), 28 deletions(-)
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 0eab1508d3..09a9aa3c29 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -64,10 +64,11 @@ typedef struct IndexVacuumInfo
* to communicate additional private data to amvacuumcleanup.
*
* Note: pages_removed is the amount by which the index physically shrank,
- * if any (ie the change in its total size on disk). pages_deleted and
- * pages_free refer to free space within the index file. Some index AMs
- * may compute num_index_tuples by reference to num_heap_tuples, in which
- * case they should copy the estimated_count field from IndexVacuumInfo.
+ * if any (ie the change in its total size on disk). pages_deleted,
+ * pages_newly_deleted, and pages_free refer to free space within the index
+ * file. Some index AMs may compute num_index_tuples by reference to
+ * num_heap_tuples, in which case they should copy the estimated_count field
+ * from IndexVacuumInfo.
*/
typedef struct IndexBulkDeleteResult
{
@@ -76,7 +77,8 @@ typedef struct IndexBulkDeleteResult
bool estimated_count; /* num_index_tuples is an estimate */
double num_index_tuples; /* tuples remaining */
double tuples_removed; /* # removed during vacuum operation */
- BlockNumber pages_deleted; /* # unused pages in index */
+ BlockNumber pages_deleted; /* # pages marked deleted (could be by us) */
+ BlockNumber pages_newly_deleted; /* # pages marked deleted by us */
BlockNumber pages_free; /* # pages available for reuse */
} IndexBulkDeleteResult;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7b6a897e4a..7871387c0d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1152,7 +1152,8 @@ extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
Relation heapRel,
TM_IndexDeleteOp *delstate);
-extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf);
+extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf,
+ BlockNumber *ndeletedcount);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 35b85a9bff..7504f57a03 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -232,6 +232,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
END_CRIT_SECTION();
gvs->result->pages_deleted++;
+ gvs->result->pages_newly_deleted++;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 94a7e12763..2935491ec9 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -139,6 +139,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
stats->estimated_count = false;
stats->num_index_tuples = 0;
stats->pages_deleted = 0;
+ stats->pages_newly_deleted = 0;
stats->pages_free = 0;
/*
@@ -640,6 +641,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
MarkBufferDirty(leafBuffer);
GistPageSetDeleted(leafPage, txid);
stats->pages_deleted++;
+ stats->pages_newly_deleted++;
/* remove the downlink from the parent */
MarkBufferDirty(parentBuffer);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad..addf243e40 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2521,10 +2521,12 @@ lazy_cleanup_index(Relation indrel,
(*stats)->num_index_tuples,
(*stats)->num_pages),
errdetail("%.0f index row versions were removed.\n"
- "%u index pages have been deleted, %u are currently reusable.\n"
+ "%u index pages have been deleted, %u are newly deleted, %u are currently reusable.\n"
"%s.",
(*stats)->tuples_removed,
- (*stats)->pages_deleted, (*stats)->pages_free,
+ (*stats)->pages_deleted,
+ (*stats)->pages_newly_deleted,
+ (*stats)->pages_free,
pg_rusage_show(&ru0))));
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 00aea725cb..8726ee3e35 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -50,7 +50,7 @@ static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
BlockNumber scanblkno,
bool *rightsib_empty,
- uint32 *ndeleted);
+ BlockNumber *ndeletedcount);
static bool _bt_lock_subtree_parent(Relation rel, BlockNumber child,
BTStack stack,
Buffer *subtreeparent,
@@ -1791,18 +1791,31 @@ _bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
* should never pass a buffer containing an existing deleted page here. The
* lock and pin on caller's buffer will be dropped before we return.
*
- * Returns the number of pages successfully deleted (zero if page cannot
- * be deleted now; could be more than one if parent or right sibling pages
- * were deleted too). Note that this does not include pages that we delete
- * that the btvacuumscan scan has yet to reach; they'll get counted later
- * instead.
+ * Returns the number of pages successfully physically deleted (zero if page
+ * cannot be deleted now; could be more than one if parent or right sibling
+ * pages were deleted too). Caller uses return value to maintain bulk stats'
+ * pages_newly_deleted value.
+ *
+ * Maintains *ndeletedcount for caller, which is ultimately used as the
+ * pages_deleted value in bulk delete stats for entire VACUUM. When we're
+ * called *ndeletedcount is the current total count of pages deleted in the
+ * index prior to current scanblkno block/position in btvacuumscan. This
+ * includes existing deleted pages (pages deleted by a previous VACUUM), and
+ * pages that we delete during current VACUUM. We need to cooperate closely
+ * with caller here so that whole VACUUM operation reliably avoids any double
+ * counting of subsidiary-to-leafbuf pages that we delete in passing. If such
+ * pages happen to be from a block number that is ahead of the current
+ * scanblkno position, then caller is expected to count them directly later
+ * on. It's simpler for us to understand caller's requirements than it would
+ * be for caller to understand when or how a deleted page became deleted after
+ * the fact.
*
* NOTE: this leaks memory. Rather than trying to clean up everything
* carefully, it's better to run it in a temp context that can be reset
* frequently.
*/
uint32
-_bt_pagedel(Relation rel, Buffer leafbuf)
+_bt_pagedel(Relation rel, Buffer leafbuf, BlockNumber *ndeletedcount)
{
uint32 ndeleted = 0;
BlockNumber rightsib;
@@ -1812,7 +1825,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
/*
* Save original leafbuf block number from caller. Only deleted blocks
- * that are <= scanblkno get counted in ndeleted return value.
+ * that are <= scanblkno are accounted for by *ndeletedcount.
*/
BlockNumber scanblkno = BufferGetBlockNumber(leafbuf);
@@ -2010,7 +2023,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
{
/* Check for interrupts in _bt_unlink_halfdead_page */
if (!_bt_unlink_halfdead_page(rel, leafbuf, scanblkno,
- &rightsib_empty, &ndeleted))
+ &rightsib_empty, ndeletedcount))
{
/*
* _bt_unlink_halfdead_page should never fail, since we
@@ -2023,6 +2036,8 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
Assert(false);
return ndeleted;
}
+ ndeleted++;
+ /* _bt_unlink_halfdead_page probably incremented ndeletedcount */
}
Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque) &&
@@ -2294,7 +2309,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
*/
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
- bool *rightsib_empty, uint32 *ndeleted)
+ bool *rightsib_empty, BlockNumber *ndeletedcount)
{
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
BlockNumber leafleftsib;
@@ -2704,12 +2719,15 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_relbuf(rel, buf);
/*
+ * Maintain ndeletedcount for entire call to _bt_pagedel. Used to
+ * maintain pages_deleted bulk delete stats for entire VACUUM operation.
+ *
* If btvacuumscan won't revisit this page in a future btvacuumpage call
- * and count it as deleted then, we count it as deleted by current
- * btvacuumpage call
+ * and count it as deleted then, we count it as deleted in pages_deleted
+ * by current btvacuumpage call.
*/
if (target <= scanblkno)
- (*ndeleted)++;
+ (*ndeletedcount)++;
return true;
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index a9dc9c48dc..9dc9cf3761 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -984,6 +984,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
stats->estimated_count = false;
stats->num_index_tuples = 0;
stats->pages_deleted = 0;
+ stats->pages_newly_deleted = 0;
/* Set up info to pass down to btvacuumpage */
vstate.info = info;
@@ -1074,9 +1075,13 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*
* pages_deleted_not_recycled is the number of deleted pages now in the
* index that were not safe to place in the FSM to be recycled just yet.
+ * This is almost the same thing as pages_newly_deleted from the bulk
+ * delete stats, except that it also includes pages that were deleted by a
+ * previous VACUUM that are nevertheless not possible to recycle even now.
*/
pages_deleted_not_recycled = stats->pages_deleted - vstate.totFreePages;
Assert(stats->pages_deleted >= vstate.totFreePages);
+ Assert(pages_deleted_not_recycled >= stats->pages_newly_deleted);
_bt_update_meta_cleanup_info(rel, pages_deleted_not_recycled,
info->num_heap_tuples);
@@ -1177,8 +1182,8 @@ backtrack:
* since _bt_pagedel() sometimes deletes the right sibling page of
* scanblkno in passing (it does so after we decided where to
* backtrack to). We don't need to process this page as a deleted
- * page a second time now (in fact, it would be wrong to count it as a
- * deleted page in the bulk delete statistics a second time).
+ * page a second time now (in fact, it would be wrong to double count
+ * it in the pages_deleted field from bulk delete statistics).
*/
if (opaque->btpo_cycleid != vstate->cycleid || P_ISDELETED(opaque))
{
@@ -1208,7 +1213,7 @@ backtrack:
{
/*
* Half-dead leaf page. Try to delete now. Might update
- * pages_deleted below.
+ * pages_deleted within _bt_pagedel.
*/
attempt_pagedel = true;
}
@@ -1420,12 +1425,19 @@ backtrack:
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
/*
- * We trust the _bt_pagedel return value because it does not include
- * any page that a future call here from btvacuumscan is expected to
- * count. There will be no double-counting.
+ * _bt_pagedel return value is simply the number of pages directly
+ * deleted on each call. This is just added to pages_newly_deleted,
+ * which counts the number of pages marked deleted in current VACUUM.
+ *
+ * We need to maintain pages_deleted more carefully here, though. We
+ * cannot just add the same _bt_pagedel return value to pages_deleted
+ * because that double-counts pages that are deleted within
+ * _bt_pagedel that will become scanblkno in a later call here from
+ * btvacuumscan.
*/
Assert(blkno == scanblkno);
- stats->pages_deleted += _bt_pagedel(rel, buf);
+ stats->pages_newly_deleted +=
+ _bt_pagedel(rel, buf, &stats->pages_deleted);
MemoryContextSwitchTo(oldcontext);
/* pagedel released buffer, so we shouldn't */
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0d02a02222..a9ffca5183 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -891,6 +891,7 @@ spgvacuumscan(spgBulkDeleteState *bds)
/* Report final stats */
bds->stats->num_pages = num_pages;
+ bds->stats->pages_newly_deleted = bds->stats->pages_deleted;
bds->stats->pages_free = bds->stats->pages_deleted;
}
--
2.27.0
On 10/02/2021 00:14, Peter Geoghegan wrote:
There is a long standing problem with the way that nbtree page
deletion places deleted pages in the FSM for recycling: The use of a
32-bit XID within the deleted page (in the special
area's/BTPageOpaqueData struct's btpo.xact field) is not robust
against XID wraparound, which can lead to permanently leaking pages in
a variety of scenarios. The problems became worse with the addition of
the INDEX_CLEANUP option in Postgres 12 [1]. And, using a 32-bit XID
in this context creates risk for any further improvements in VACUUM
that similarly involve skipping whole indexes. For example, Masahiko
has been working on a patch that teaches VACUUM to skip indexes that
are known to have very little garbage [2].Attached patch series fixes the issue once and for all. This is
something that I'm targeting for Postgres 14, since it's more or less
a bug fix.
Thanks for picking this up!
The first patch teaches nbtree to use 64-bit transaction IDs here, and
so makes it impossible to leak deleted nbtree pages. This patch is the
nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
due to exactly the same set of problems. The first patch also makes
the level field stored in nbtree page's special area/BTPageOpaqueData
reliably store the level, even in a deleted page. This allows me to
consistently use the level field within amcheck, including even within
deleted pages.
Is it really worth the trouble to maintain 'level' on deleted pages? All
you currently do with it is check that the BTP_LEAF flag is set iff
"level == 0", which seems pointless. I guess there could be some
forensic value in keeping 'level', but meh.
The second patch in the series adds new information to VACUUM VERBOSE.
This makes it easy to understand what's going on here. Index page
deletion related output becomes more useful. It might also help with
debugging the first patch.Currently, VACUUM VERBOSE output for an index that has some page
deletions looks like this:"38 index pages have been deleted, 38 are currently reusable."
With the second patch applied, we might see this output at the same
point in VACUUM VERBOSE output instead:"38 index pages have been deleted, 0 are newly deleted, 38 are
currently reusable."This means that out of the 38 of the pages that were found to be
marked deleted in the index, 0 were deleted by the VACUUM operation
whose output we see here. That is, there were 0 nbtree pages that were
newly marked BTP_DELETED within _bt_unlink_halfdead_page() during
*this particular* VACUUM -- the VACUUM operation that we see
instrumentation about here. It follows that the 38 deleted pages that
we encountered must have been marked BTP_DELETED by some previous
VACUUM operation.In practice the "%u are currently reusable" output should never
include newly deleted pages, since there is no way that a page marked
BTP_DELETED can be put in the FSM during the same VACUUM operation --
that's unsafe (we need all of this recycling/XID indirection precisely
because we need to delay recycling until it is truly safe, of course).
Note that the "%u index pages have been deleted" output includes both
pages deleted by some previous VACUUM operation, and newly deleted
pages (no change there).
Note that the new "newly deleted" output is instrumentation about this
particular *VACUUM operation*. In contrast, the other two existing
output numbers ("deleted" and "currently reusable") are actually
instrumentation about the state of the *index as a whole* at a point
in time (barring concurrent recycling of pages counted in VACUUM by
some random _bt_getbuf() call in another backend). This fundamental
distinction is important here. All 3 numbers/stats that we output can
have different values, which can be used to debug the first patch. You
can directly observe uncommon cases just from the VERBOSE output, like
when a long running transaction holds up recycling of a deleted page
that was actually marked BTP_DELETED in an *earlier* VACUUM operation.
And so if the first patch had any bugs, there'd be a pretty good
chance that you could observe them using multiple VACUUM VERBOSE
operations -- you might notice something inconsistent or contradictory
just by examining the output over time, how things change, etc.
The full message on master is:
INFO: index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL: 250000 index row versions were removed.
2056 index pages have been deleted, 1370 are currently reusable.
How about:
INFO: index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL: 250000 index row versions and 686 pages were removed.
2056 index pages are now unused, 1370 are currently reusable.
The idea is that the first DETAIL line now says what the VACUUM did this
round, and the last line says what the state of the index is now. One
concern with that phrasing is that it might not be clear what "686 pages
were removed" means. We don't actually shrink the file. Then again, I'm
not sure if the "have been deleted" was any better in that regard.
It's still a bit weird that the "what VACUUM did this round" information
is sandwiched between the two other lines that talk about the state of
the index after the operation. But I think the language now makes it
more clear which is which. Or perhaps flip the INFO and first DETAIL
lines around like this:
INFO: 250000 index row versions and 686 pages were removed from index
"foo_pkey"
DETAIL: index now contains 250001 row versions in 2745 pages.
2056 index pages are now unused, of which 1370 are currently reusable.
For context, the fuller output you get on master is:
postgres=# vacuum verbose foo;
INFO: vacuuming "public.foo"
INFO: scanned index "foo_pkey" to remove 250000 row versions
DETAIL: CPU: user: 0.16 s, system: 0.00 s, elapsed: 0.16 s
INFO: "foo": removed 250000 row versions in 1107 pages
DETAIL: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
INFO: index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL: 250000 index row versions were removed.
2056 index pages have been deleted, 1370 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "foo": found 250000 removable, 271 nonremovable row versions in
1108 out of 4425 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 1164
There were 87 unused item identifiers.
Skipped 0 pages due to buffer pins, 2212 frozen pages.
0 pages are entirely empty.
CPU: user: 0.27 s, system: 0.00 s, elapsed: 0.28 s.
VACUUM
That's pretty confusing: it's a mix of progress indicators
(vacuuming "public.foo"), CPU measurements, information about what was
removed, and what the state is afterwards. Would be nice to make that
more clear overall. But for now, for this particular INFO message,
perhaps make it more consistent with the lines printed by heapam, like this:
INFO: "foo_pkey": removed 250000 index row versions and 686 pages
DETAIL: index now contains 250001 row versions in 2745 pages.
2056 index pages are now unused, of which 1370 are currently reusable.
- Heikki
On Wed, Feb 10, 2021 at 10:53 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Feb 9, 2021 at 2:14 PM Peter Geoghegan <pg@bowt.ie> wrote:
The first patch teaches nbtree to use 64-bit transaction IDs here, and
so makes it impossible to leak deleted nbtree pages. This patch is the
nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
due to exactly the same set of problems.
Thank you for working on this!
There is an unresolved question for my deleted page XID patch: what
should it do about the vacuum_cleanup_index_scale_factor feature,
which added an XID to the metapage (its btm_oldest_btpo_xact field). I
refer to the work done by commit 857f9c36cda for Postgres 11 by
Masahiko. It would be good to get your opinion on this as the original
author of that feature, Masahiko.
To recap, btm_oldest_btpo_xact is supposed to be the oldest XID among
all deleted pages in the index, so clearly it needs to be carefully
considered in my patch to make the XIDs 64-bit. Even still, v1 of my
patch from today more or less ignores the issue -- it just gets a
32-bit version of the oldest value as determined by the oldestBtpoXact
XID tracking stuff (which is largely unchanged, except that it works
with 64-bit Full Transaction Ids now).
Obviously it is still possible for the 32-bit btm_oldest_btpo_xact
field to wrap around in v1 of my patch. The obvious thing to do here
is to add a new epoch metapage field, effectively making
btm_oldest_btpo_xact 64-bit. However, I don't think that that's a good
idea. The only reason that we have the btm_oldest_btpo_xact field in
the first place is to ameliorate the problem that the patch
comprehensively solves! We should stop storing *any* XIDs in the
metapage. (Besides, adding a new "epoch" field to the metapage would
be relatively messy.)
I agree that btm_oldest_btpo_xact will no longer be necessary in terms
of recycling deleted pages.
The purpose of btm_oldest_btpo_xact is to prevent deleted pages from
being leaked. As you mentioned, it holds the oldest btpo.xact in
BTPageOpaqueData among all deleted pages in the index. Looking back to
the time when we developed the INDEX_CLEANUP option: if we skip index
cleanup (meaning both ambulkdelete and amvacuumcleanup), there was a
problem in btree indexes where deleted pages could never be recycled
once the XID wrapped around. So the idea behind btm_oldest_btpo_xact
is that we remember the oldest btpo.xact among all deleted pages and
do btvacuumscan() if that value is older than the global xmin (meaning
there is at least one recyclable page). That way, we can recycle the
deleted pages without leaking them (of course, unless INDEX_CLEANUP is
disabled).
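For reference, the pre-patch trigger that implements this (the code
removed from _bt_vacuum_needs_cleanup() by the 0001 patch shown
earlier in the thread) looks roughly like this:

    /*
     * Pre-patch logic: force index cleanup as soon as the oldest
     * btpo.xact recorded in the metapage becomes visible to everyone,
     * i.e. as soon as at least one deleted page could be recycled.
     */
    if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
        GlobalVisCheckRemovableXid(NULL, metad->btm_oldest_btpo_xact))
        result = true;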
Given that we can guarantee that deleted pages are never leaked by
using 64-bit XIDs, I also think we don't need this value. We can do
amvacuumcleanup only if the table receives enough insertions to make
the statistics stale (i.e., the vacuum_cleanup_index_scale_factor
check). I think this is more desirable behavior. Never skipping
amvacuumcleanup when there is even one recyclable deleted page is very
conservative.
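For reference, the vacuum_cleanup_index_scale_factor check as it
survives in the 0001 patch shown earlier in the thread (variable names
as in that patch) is roughly:

    /*
     * Do a cleanup scan when enough heap tuples have been inserted
     * since the last cleanup, per vacuum_cleanup_index_scale_factor.
     */
    if (cleanup_scale_factor <= 0 ||
        info->num_heap_tuples < 0 ||
        prev_num_heap_tuples <= 0 ||
        (info->num_heap_tuples - prev_num_heap_tuples) /
        prev_num_heap_tuples >= cleanup_scale_factor)
        return true;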
Considering your idea of keeping the count of newly deleted pages in
the metapage, I can see a little value in keeping btm_oldest_btpo_xact
and making it a 64-bit XID. I describe that below.
Here is a plan that allows us to stop storing any kind of XID in the
metapage in all cases:
1. Stop maintaining the oldest XID among all deleted pages in the
entire nbtree index during VACUUM. So we can remove all of the
BTVacState.oldestBtpoXact XID tracking stuff, which is currently
something that even _bt_pagedel() needs special handling for.
2. Stop considering the btm_oldest_btpo_xact metapage field in
_bt_vacuum_needs_cleanup() -- now the "Cleanup needed?" logic only
cares about maintaining reasonably accurate statistics for the index.
Which is really how the vacuum_cleanup_index_scale_factor feature was
intended to work all along, anyway -- ISTM that the oldestBtpoXact
stuff was always just an afterthought to paper-over this annoying
32-bit XID issue.
3. We cannot actually remove the btm_oldest_btpo_xact XID field from
the metapage, because of course that would change the BTMetaPageData
struct layout, which breaks on-disk compatibility. But why not use it
for something useful instead? _bt_update_meta_cleanup_info() can use
the same field to store the number of "newly deleted" pages from the
last btbulkdelete() instead. (See my email from earlier for the
definition of "newly deleted".)4. Now _bt_vacuum_needs_cleanup() can once again consider the
btm_oldest_btpo_xact metapage field -- except in a totally different
way, because now it means something totally different: "newly deleted
pages during last btbulkdelete() call" (per item 3). If this # pages
is very high then we probably should do a full call to btvacuumscan()
-- _bt_vacuum_needs_cleanup() will return true to make that happen.
It's unlikely but still possible that a high number of "newly deleted
pages during the last btbulkdelete() call" is in itself a good enough
reason to do a full btvacuumscan() call when the question of calling
btvacuumscan() is considered within _bt_vacuum_needs_cleanup(). Item 4
here conservatively covers that. Maybe the 32-bit-XID-in-metapage
triggering condition had some non-obvious value due to a natural
tendency for it to limit the number of deleted pages that go
unrecycled for a long time. (Or maybe there never really was any such
natural tendency -- still seems like a good idea to make the change
described by item 4.)
Even though we are conservative (at least in this sense I just
described), we nevertheless don't actually care about very old deleted
pages that we have not yet recycled -- provided there are not very
many of them. I'm thinking of "~2% of index" as the new "newly deleted
during last btbulkdelete() call" threshold applied within
_bt_vacuum_needs_cleanup(). There is no good reason why older
deleted-but-not-yet-recycled pages should be considered more valuable
than any other page that can be used when there is a page split.
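For comparison, the 0001 patch shown earlier in the thread already
applies a test of this kind in _bt_vacuum_needs_cleanup(), although
against the total count of deleted-but-not-yet-recycled pages recorded
by the previous VACUUM, and with a 5% threshold rather than the ~2%
floated here:

    /*
     * From the 0001 patch: trigger a cleanup-only btvacuumscan() when
     * deleted pages that could not yet be recycled exceed 1/20th (5%)
     * of the index.
     */
    if (prev_pages_deleted_not_recycled >
        RelationGetNumberOfBlocks(info->index) / 20)
        return true;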
Interesting.
I like this idea of triggering btvacuumscan() when there are many
newly deleted pages. I think this would be especially helpful in the
case of bulk deletions on the table. But why use the number of *newly*
deleted pages rather than the total number of deleted pages in the
index? IIUC, if several btbulkdelete() executions each deleted fewer
pages than 2% of the index, and those deleted pages could not be
recycled yet, then the deleted-but-not-yet-recycled pages would exceed
2% of the index in total, but amvacuumcleanup() would not trigger
btvacuumscan() because the pages newly deleted by the last
btbulkdelete() are below the 2% threshold. I might be missing
something, though.
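To illustrate the concern with hypothetical numbers: in a 10,000-page
index, three consecutive btbulkdelete() calls might each leave behind
150 newly deleted pages (1.5%) that cannot be recycled yet. The
accumulated total would reach 450 such pages (4.5% of the index), yet
each individual "newly deleted" count stays below a 2% trigger, so a
cleanup-only scan would never fire.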
Also, we need to note that having newly deleted pages doesn't
necessarily mean they are recyclable at that time. If the global xmin
is still older than the deleted pages' btpo.xact values, we still
cannot recycle them. I think btm_oldest_btpo_xact will probably help
in this case. That is, we store the oldest btpo.xact among those newly
deleted pages in btm_oldest_btpo_xact, and we trigger btvacuumscan()
if there are many newly deleted pages (more than 2% of the index) and
btm_oldest_btpo_xact is older than the global xmin (I suppose each
newly deleted page could have a different btpo.xact).
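A rough sketch of that combined trigger, assuming a hypothetical
64-bit metapage field (call it btm_oldest_safexid) next to the patch's
btm_last_cleanup_num_delpages, and reusing the
GlobalVisCheckRemovableFullXid() test mentioned in the 0001 patch's
comments:

    /*
     * Hypothetical sketch only: trigger cleanup when the previous
     * VACUUM left many deleted pages behind AND at least the oldest
     * of them has since become safe to recycle.
     */
    if (metad->btm_last_cleanup_num_delpages >
        RelationGetNumberOfBlocks(info->index) / 50 &&  /* ~2% */
        GlobalVisCheckRemovableFullXid(NULL, metad->btm_oldest_safexid))
        return true;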
Observations about on-disk compatibility with my patch + this 4 point scheme:
A. It doesn't matter that pg_upgrade'd indexes will have an XID value
in btm_oldest_btpo_xact that now gets incorrectly interpreted as
"newly deleted pages during last btbulkdelete() call" under the 4
point scheme I just outlined.
The spurious value will get cleaned up on the next VACUUM anyway
(whether VACUUM goes through btbulkdelete() or through
btvacuumcleanup()). Besides, most indexes always have a
btm_oldest_btpo_xact value of 0.
B. The patch I posted earlier doesn't actually care about the
BTREE_VERSION of the index at all. And neither does any of the stuff I
just described for a future v2 of my patch.
All indexes can use the new format for deleted pages. On-disk
compatibility is easy here because the contents of deleted pages only
need to work as a tombstone. We can safely assume that old-format
deleted pages (pre-Postgres 14 format deleted pages) must be safe to
recycle, because the pg_upgrade itself restarts Postgres. There can be
no backends that have dangling references to the old-format deleted
page.
C. All supported nbtree versions (all nbtree versions
BTREE_MIN_VERSION+) get the same benefits under this scheme.
Even BTREE_MIN_VERSION/version 2 indexes are dynamically upgradable to
BTREE_NOVAC_VERSION/version 3 indexes via a call to
_bt_upgrademetapage() -- that has been the case since BTREE_VERSION
was bumped to BTREE_NOVAC_VERSION/version 3 for Postgres 11's
vacuum_cleanup_index_scale_factor feature. So all nbtree indexes will
have the btm_oldest_btpo_xact metapage field that I now propose to
reuse to track "newly deleted pages during last btbulkdelete() call",
per point 4.
In summary: There are no special cases here. No BTREE_VERSION related
difficulties. That seems like a huge advantage to me.
Great! I'll look at the v2 patch.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Feb 9, 2021 at 11:58 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Thanks for picking this up!
I actually had a patch for this in 2019, albeit one that remained in
rough shape until recently. Must have forgotten about it.
Is it really worth the trouble to maintain 'level' on deleted pages? All
you currently do with it is check that the BTP_LEAF flag is set iff
"level == 0", which seems pointless. I guess there could be some
forensic value in keeping 'level', but meh.
What trouble is that? The only way in which it's inconvenient is that
we have to include the level field in xl_btree_unlink_page WAL records
for the first time. The structure of the relevant REDO routine (which
is called btree_xlog_unlink_page()) ought to explicitly recreate the
original page from scratch, without any special cases. This makes it
possible to pretend that there never was such a thing as an nbtree
page whose level field could not be relied on. I personally think that
it's simpler when seen in the wider context of how the code works and
is verified.
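To sketch what I mean (this is only an outline of the approach, not the exact
code from the patch), the REDO routine can rebuild the deleted page entirely
from the fields now carried in the WAL record:
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
Buffer      buf = XLogInitBufferForRedo(record, 0);
Page        page = (Page) BufferGetPage(buf);
BTPageOpaque opaque;

/* Recreate the target page from scratch -- no special cases */
_bt_pageinit(page, BufferGetPageSize(buf));
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_prev = xlrec->leftsib;
opaque->btpo_next = xlrec->rightsib;
opaque->btpo_level = xlrec->level;   /* level is always reliable now */
opaque->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
opaque->btpo_cycleid = 0;

/* Mark it deleted, storing the full 64-bit safexid from the WAL record */
BTPageSetDeleted(page, xlrec->safexid);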
Besides, there is also amcheck to consider. I am a big believer in
amcheck, and see it as something that has enabled my work on the
B-Tree code over the past few years. Preserving the level field in
deleted pages increases our coverage just a little, and practically
eliminates cases where we cannot rely on the level field.
Of course it's still true that this detail (the deleted pages level
field question) will probably never seem important to anybody else. To
me it's one small detail of a broader strategy. No one detail of that
broader strategy, taken in isolation, will ever be crucially
important.
Of course it's also true that we should not assume that a very high
cost in performance/code/whatever can justify a much smaller benefit
in amcheck. But you haven't really explained why the cost seems
unacceptable to you. (Perhaps I missed something.)
How about:
INFO: index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL: 250000 index row versions and 686 pages were removed.
2056 index pages are now unused, 1370 are currently reusable.
The idea is that the first DETAIL line now says what the VACUUM did this
round, and the last line says what the state of the index is now. One
concern with that phrasing is that it might not be clear what "686 pages
were removed" means.
It's still a bit weird that the "what VACUUM did this round" information
is sandwiched between the two other lines that talk about the state of
the index after the operation. But I think the language now makes it
more clear which is which.
IMV our immediate goal for the new VACUUM VERBOSE output should be to
make the output as accurate and descriptive as possible (while still
using terminology that works for all index AMs, not just nbtree). I
don't think that we should give too much weight to making the
information easy to understand in isolation. Because that's basically
impossible -- it just doesn't work that way IME.
Confusion over the accounting of "deleted pages in indexes" vs "pages
deleted by this VACUUM" is not new. See my bugfix commit 73a076b0 to
see one vintage example. The relevant output of VACUUM VERBOSE
produced inconsistent results for perhaps as long as 15 years before I
noticed it and fixed it. I somehow didn't notice this despite using it
for various tests for my own B-Tree projects a year or two before the
fix. Tests that produced inconsistent results that I noticed pretty
early on, and yet assumed were all down to some subtlety that I didn't
yet understand.
My point is this: I am quite prepared to admit that these details
really are complicated. But that's not incidental to what's really
going on, or anything (though I agree with your later remarks on the
general tidiness of VACUUM VERBOSE -- it is a real dog's dinner).
I'm not saying that we should assume that no DBA will find the
relevant VACUUM VERBOSE output useful -- I don't think that at all. It
will be kind of rare for a user to really comb through it. But that's
mostly because big problems in this area are themselves kind of rare
(most individual indexes never have any deleted pages IME).
Any DBA consuming this output sensibly will consume it in a way that
makes sense in the *context of the problem that they're experiencing*,
whatever that might mean for them. They'll consider how it changes
over time for the same index. They'll try to correlate it with other
symptoms, or other problems, and make sense of it in a top-down
fashion. We should try to make it as descriptive as possible so that
DBAs will have the breadcrumbs they need to tie it back to whatever
the core issue happens to be -- maybe they'll have to read the source
code to get to the bottom of it. It's likely to be some rare issue in
those cases where the DBA really cares about the details -- it's
likely to be workload dependent.
Good DBAs spend much of their time on exceptional problems -- all the
easy problems will have been automated away already. Things like wait
events are popular with DBAs for this reason.
Or perhaps flip the INFO and first DETAIL
lines around like this:
INFO: 250000 index row versions and 686 pages were removed from index
"foo_pkey"
DETAIL: index now contains 250001 row versions in 2745 pages.
2056 index pages are now unused, of which 1370 are currently reusable.
For context, the fuller message you get on master is:
That's pretty confusing, it's a mix of basically progress indicators
(vacuuming "public.foo"), CPU measurements, information about what was
removed, and what the state is afterwards.
I agree that the output of VACUUM VERBOSE is messy. It's probably a
bunch of accretions that made sense in isolation, but added up to a
big mess over time. So I agree: now would be a good time to do
something about that.
It would also be nice to find a way to get this information in the
logs when log_autovacuum is enabled (perhaps only when the verbosity
is increased). I've discussed this with Masahiko in the context of his
recent work, actually. Even before we started talking about the XID
page deletion problem that I'm fixing here.
INFO: "foo_pkey": removed 250000 index row versions and 686 pages
DETAIL: index now contains 250001 row versions in 2745 pages.
2056 index pages are now unused, of which 1370 are currently reusable.
I can see what you mean here, and maybe we should do roughly what
you've outlined. Still, we should use terminology that isn't too far
removed from what actually happens in nbtree. What's a "removed" page?
The distinction between all of the different kinds of index pages that
might be involved here is just subtle. Again, better to use a precise,
descriptive term that nobody fully understands -- because hardly
anybody will fully understand it anyway (even including advanced users
that go on to find the VACUUM VERBOSE output very useful for whatever
reason).
--
Peter Geoghegan
On Wed, Feb 10, 2021 at 2:20 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thank you for working on this!
I'm glad that I finally found time for it! It seems like it'll make
things easier elsewhere.
Attached is v3 of the patch series. I'll describe the changes I made in more
detail in my response to your points below.
I agree that btm_oldest_btpo_xact will no longer be necessary in terms
of recycling deleted pages.
Cool.
Given that we can guarantee that deleted pages are never leaked by
using 64-bit XIDs, I also think we don't need this value. We can do
amvacuumcleanup only if the table receives enough insertions to update
the statistics (i.e., the vacuum_cleanup_index_scale_factor check). I
think this is a more desirable behavior. Not skipping amvacuumcleanup
if there is even one deleted page that we can recycle is very
conservative.
Considering your idea of keeping newly deleted pages in the meta page,
I can see a little value in keeping btm_oldest_btpo_xact and making it
a 64-bit XID. I described this below.
Interesting.
I like this idea of triggering btvacuumscan() if there are many newly
deleted pages. I think this would be especially helpful in the case of
bulk deletion on the table. But why do we use the number of *newly*
deleted pages rather than the total number of deleted pages in the index?
I was unclear here -- I should not have said "newly deleted" pages at
all. What I actually do when calling _bt_vacuum_needs_cleanup() is
this (from v3, at the end of btvacuumscan()):
- _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
+ Assert(stats->pages_deleted >= stats->pages_free);
+ pages_deleted_not_free = stats->pages_deleted - stats->pages_free;
+ _bt_update_meta_cleanup_info(rel, pages_deleted_not_free,
info->num_heap_tuples);
We're actually passing something I have called
"pages_deleted_not_free" here, which is derived from the bulk delete
stats in the obvious way that you see here (subtraction). I'm not
using pages_newly_deleted at all now. Note also that the behavior
inside _bt_update_meta_cleanup_info() no longer varies based on
whether it is called during btvacuumcleanup() or during btbulkdelete()
-- the same rules apply either way. We want to store
pages_deleted_not_free in the metapage at the end of btvacuumscan(),
no matter what.
This same pages_deleted_not_free information is now used by
_bt_vacuum_needs_cleanup() in an obvious and simple way: if it's too
high (over 2.5%), then that will trigger a call to btvacuumscan() (we
won't skip scanning the index). Though in practice it probably won't
come up that often -- there just aren't ever that many deleted pages
in most indexes.
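Putting those two pieces together, the flow in v3 is roughly as follows
(simplified from the attached patch):
/* At the end of btvacuumscan(): remember how many deleted pages could not
 * be placed in the FSM just yet */
Assert(stats->pages_deleted >= stats->pages_free);
pages_deleted_not_free = stats->pages_deleted - stats->pages_free;
_bt_update_meta_cleanup_info(rel, pages_deleted_not_free,
                             info->num_heap_tuples);

/* Later, inside _bt_vacuum_needs_cleanup(): force a scan of the index when
 * that count exceeds ~2.5% of the index */
if (prev_pages_deleted_not_free >
    RelationGetNumberOfBlocks(info->index) / 40)
    return true;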
IIUC, if several btbulkdelete() executions each deleted fewer than 2%
of the index's pages, and those deleted pages could not be recycled
yet, then the number of recyclable pages would exceed 2% of the index
in total, but amvacuumcleanup() would not trigger btvacuumscan(),
because the number of pages newly deleted by the last btbulkdelete()
is below the 2% threshold. I might be missing something, though.
I think you're right -- my idea of varying the behavior of
_bt_update_meta_cleanup_info() based on whether it's being called
during btvacuumcleanup() or during btbulkdelete() was a bad idea (FWIW
half the problem was that I explained the idea badly to begin with).
But, as I said, it's fixed in v3: we simply pass
"pages_deleted_not_free" as an argument to _bt_vacuum_needs_cleanup()
now.
Does that make sense? Does it address this concern?
Also, we need to note that having newly deleted pages doesn't
necessarily mean they are all recyclable at that time. If the global
xmin is still older than a deleted page's btpo.xact value, we still
cannot recycle that page. I think btm_oldest_btpo_xact probably helps
with this case. That is, we store the oldest btpo.xact among those
newly deleted pages in btm_oldest_btpo_xact, and we trigger
btvacuumscan() if there are many newly deleted pages (more than 2% of
the index) and btm_oldest_btpo_xact is older than the global xmin (I
suppose each newly deleted page could have a different btpo.xact).
I agree that having no XID in the metapage creates a new small
problem. Specifically, there are certain narrow cases that can cause
confusion in _bt_vacuum_needs_cleanup(). These cases didn't really
exist before my patch (kind of).
The simplest example is easy to run into when debugging the patch on
your laptop. Because you're using your personal laptop, and not a real
production server, there will be no concurrent sessions that might
consume XIDs. You can run VACUUM VERBOSE manually several times, but
that alone will never be enough to enable VACUUM to recycle any of the
pages that the first VACUUM manages to delete (manages to mark deleted,
reporting the pages as "newly deleted" via the new instrumentation
from the second patch). Note that the master branch is *also* unable
to recycle these deleted pages, simply because the "safe xid" never
gets old because there are no newly allocated XIDs to make it look old
(there are no allocated XIDs just because nothing else happens). That
in itself is not the new problem.
The new problem is that _bt_vacuum_needs_cleanup() will no longer
notice that the oldest XID among deleted-but-not-yet-recycled pages is
so old that it will not be able to recycle the pages anyway -- at
least not the oldest page, though in this specific case that will
apply to all deleted pages equally. We might as well not bother trying
yet, which the old code "gets right" -- but it doesn't get it right
for any good reason. That is, the old code won't have VACUUM scan the
index at all, so it "wins" in this specific scenario.
I think that's okay, though -- it's not a real problem, and actually
makes sense and has other advantages. This is why I believe it's okay:
* We really should never VACUUM the same table before even one or two
XIDs are allocated -- that's what happens in the simple laptop test
scenario that I described. Surely we should not be too concerned about
"doing the right thing" under this totally artificial set of
conditions.
(BTW, I've been using txid_current() for my own "laptop testing", as a
way to work around this issue.)
* More generally, if you really can't do recycling of pages that you
deleted during the last VACUUM during this VACUUM (perhaps because of
the presence of a long-running xact that holds open a snapshot), then
you have lots of *huge* problems already, and this is the least of
your concerns. Besides, at that point an affected VACUUM will be doing
work for an affected index through a btbulkdelete() call, so the
behavior of _bt_vacuum_needs_cleanup() becomes irrelevant.
* As you pointed out already, the oldest XID/deleted page from the
index may be significantly older than the newest. Why should we bucket
them together?
We could easily have a case where most of the deleted pages can be
recycled -- even when all of the deleted pages were originally marked
deleted by the same VACUUM operation. If there are lots of pages that actually
can be recycled, it is probably a bad thing to assume that the oldest
XID is representative of all of them. After all, with the patch we
only go out of our way to recycle deleted pages when we are almost
sure that the total number of recyclable pages (pages marked deleted
during a previous VACUUM) exceeds 2.5% of the total size of the index.
That broad constraint is important here -- if we do nothing unless
there are lots of deleted pages anyway, we are highly unlikely to ever
err on the side of being too eager (not eager enough seems more likely
to me).
I think that we're justified in making a general assumption inside
_bt_vacuum_needs_cleanup() (which is documented at the point that we
call it, inside btvacuumscan()): The assumption that however many
index pages the metapage says we'll be able to recycle (whatever the
field says) will in fact turn out to be recyclable if we decide that
we need to. There are specific cases where that will be kind of wrong,
as I've gone into, but the assumption/design has many more advantages
than disadvantages.
I have tried to capture this in v3 of the patch. Can you take a look?
See the new comments inside _bt_vacuum_needs_cleanup(). Plus the
comments when we call it inside btvacuumscan().
Do you think that those new comments are helpful? Does this address
your concern?
Thanks
--
Peter Geoghegan
Attachments:
v3-0001-Use-full-64-bit-XID-for-nbtree-page-deletion.patch
From bfc1d3b51cd3a3cf21d454e4d73de10ccc8eefbf Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 27 Aug 2019 11:44:17 -0700
Subject: [PATCH v3 1/2] Use full 64-bit XID for nbtree page deletion.
Otherwise, after a deleted page gets even older, it becomes unrecyclable
again. This is the nbtree equivalent of commit 6655a729, which did the
same thing within GiST.
Stop storing an XID that tracks the oldest safexid across all deleted
pages in an index altogether. There is no longer any point in doing
this. It only ever made sense when btpo.xact fields could wrap around.
The old btm_oldest_btpo_xact metapage field has been repurposed in a way
that preserves on-disk compatibility for pg_upgrade. Rename this uint32
field, and use it to store the number of deleted pages that we expect to
be able to recycle during the next btvacuumcleanup() that actually scans
the index. This approach is a little unorthodox, but we were already
using btm_oldest_btpo_xact (now called btm_last_cleanup_num_delpages) in
approximately the same way. And in exactly the same place: inside the
_bt_vacuum_needs_cleanup() function.
The general assumption is that we ought to be able to recycle however
many pages btm_last_cleanup_num_delpages indicates by deciding to scan
the index during a btvacuumcleanup() call (_bt_vacuum_needs_cleanup()'s
decision). Note that manually issued VACUUMs won't be able to recycle
btm_last_cleanup_num_delpages pages (and _bt_vacuum_needs_cleanup()
won't instruct btvacuumcleanup() to skip scanning the index) unless at
least one XID is consumed between VACUUMs.
Bump XLOG_PAGE_MAGIC due to WAL record switch over to full XIDs.
---
src/include/access/nbtree.h | 79 ++++++++--
src/include/access/nbtxlog.h | 28 ++--
src/include/storage/standby.h | 2 +
src/backend/access/gist/gistxlog.c | 24 +---
src/backend/access/nbtree/nbtinsert.c | 25 ++--
src/backend/access/nbtree/nbtpage.c | 198 +++++++++++++++-----------
src/backend/access/nbtree/nbtree.c | 177 ++++++++++++-----------
src/backend/access/nbtree/nbtsearch.c | 6 +-
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/nbtree/nbtxlog.c | 39 +++--
src/backend/access/rmgrdesc/nbtdesc.c | 17 ++-
src/backend/storage/ipc/standby.c | 28 ++++
contrib/amcheck/verify_nbtree.c | 81 +++++++----
contrib/pageinspect/btreefuncs.c | 65 ++++++---
contrib/pgstattuple/pgstatindex.c | 8 +-
15 files changed, 485 insertions(+), 294 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index cad4f2bdeb..37d895371a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -37,8 +37,9 @@ typedef uint16 BTCycleId;
*
* In addition, we store the page's btree level (counting upwards from
* zero at a leaf page) as well as some flag bits indicating the page type
- * and status. If the page is deleted, we replace the level with the
- * next-transaction-ID value indicating when it is safe to reclaim the page.
+ * and status. If the page is deleted, a BTDeletedPageContents struct is
+ * stored in the page's tuple area, while a standard BTPageOpaqueData struct
+ * is stored in the page special area.
*
* We also store a "vacuum cycle ID". When a page is split while VACUUM is
* processing the index, a nonzero value associated with the VACUUM run is
@@ -52,17 +53,17 @@ typedef uint16 BTCycleId;
*
* NOTE: the BTP_LEAF flag bit is redundant since level==0 could be tested
* instead.
+ *
+ * NOTE: the btpo_level field used to be a union type in order to allow
+ * deleted pages to store a 32-bit safexid in the same field. We now store
+ * 64-bit/full safexid values using BTDeletedPageContents instead.
*/
typedef struct BTPageOpaqueData
{
BlockNumber btpo_prev; /* left sibling, or P_NONE if leftmost */
BlockNumber btpo_next; /* right sibling, or P_NONE if rightmost */
- union
- {
- uint32 level; /* tree level --- zero for leaf pages */
- TransactionId xact; /* next transaction ID, if deleted */
- } btpo;
+ uint32 btpo_level; /* tree level --- zero for leaf pages */
uint16 btpo_flags; /* flag bits, see below */
BTCycleId btpo_cycleid; /* vacuum cycle ID of latest split */
} BTPageOpaqueData;
@@ -78,6 +79,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples (deprecated) */
#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
+#define BTP_HAS_FULLXID (1 << 8) /* contains BTDeletedPageContents */
/*
* The max allowed value of a cycle ID is a bit less than 64K. This is
@@ -105,10 +107,12 @@ typedef struct BTMetaPageData
BlockNumber btm_fastroot; /* current "fast" root location */
uint32 btm_fastlevel; /* tree level of the "fast" root page */
/* remaining fields only valid when btm_version >= BTREE_NOVAC_VERSION */
- TransactionId btm_oldest_btpo_xact; /* oldest btpo_xact among all deleted
- * pages */
- float8 btm_last_cleanup_num_heap_tuples; /* number of heap tuples
- * during last cleanup */
+
+ /* number of deleted, non-recyclable pages during last cleanup */
+ uint32 btm_last_cleanup_num_delpages;
+ /* number of heap tuples during last cleanup */
+ float8 btm_last_cleanup_num_heap_tuples;
+
bool btm_allequalimage; /* are all columns "equalimage"? */
} BTMetaPageData;
@@ -220,6 +224,53 @@ typedef struct BTMetaPageData
#define P_IGNORE(opaque) (((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD)) != 0)
#define P_HAS_GARBAGE(opaque) (((opaque)->btpo_flags & BTP_HAS_GARBAGE) != 0)
#define P_INCOMPLETE_SPLIT(opaque) (((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0)
+#define P_HAS_FULLXID(opaque) (((opaque)->btpo_flags & BTP_HAS_FULLXID) != 0)
+
+/*
+ * BTDeletedPageContents is the page contents of a deleted page
+ */
+typedef struct BTDeletedPageContents
+{
+ /* last xid which might land on the page and get confused */
+ FullTransactionId safexid;
+} BTDeletedPageContents;
+
+static inline void
+BTPageSetDeleted(Page page, FullTransactionId safexid)
+{
+ BTPageOpaque opaque;
+ PageHeader header;
+ BTDeletedPageContents *contents;
+
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ header = ((PageHeader) page);
+
+ opaque->btpo_flags &= ~BTP_HALF_DEAD;
+ opaque->btpo_flags |= BTP_DELETED | BTP_HAS_FULLXID;
+ header->pd_lower =
+ MAXALIGN(SizeOfPageHeaderData) + sizeof(BTDeletedPageContents);
+ header->pd_upper = header->pd_special;
+
+ /* Set safexid in deleted page */
+ contents = ((BTDeletedPageContents *) PageGetContents(page));
+ contents->safexid = safexid;
+}
+
+static inline FullTransactionId
+BTPageGetDeleteXid(Page page)
+{
+ BTPageOpaque opaque PG_USED_FOR_ASSERTS_ONLY;
+ BTDeletedPageContents *contents;
+
+ /* pg_upgrade'd indexes with old BTP_DELETED pages should not call here */
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ Assert(P_ISDELETED(opaque) && !P_ISHALFDEAD(opaque) &&
+ P_HAS_FULLXID(opaque));
+
+ /* Get safexid from deleted page */
+ contents = ((BTDeletedPageContents *) PageGetContents(page));
+ return contents->safexid;
+}
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -1067,7 +1118,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page origpage,
extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
bool allequalimage);
extern void _bt_update_meta_cleanup_info(Relation rel,
- TransactionId oldestBtpoXact, float8 numHeapTuples);
+ BlockNumber pages_deleted_not_recycled,
+ float8 num_heap_tuples);
extern void _bt_upgrademetapage(Page page);
extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
@@ -1091,8 +1143,7 @@ extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
Relation heapRel,
TM_IndexDeleteOp *delstate);
-extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf,
- TransactionId *oldestBtpoXact);
+extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 7ae5c98c2b..0cec80e5fa 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -13,6 +13,7 @@
#ifndef NBTXLOG_H
#define NBTXLOG_H
+#include "access/transam.h"
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
#include "storage/off.h"
@@ -52,7 +53,7 @@ typedef struct xl_btree_metadata
uint32 level;
BlockNumber fastroot;
uint32 fastlevel;
- TransactionId oldest_btpo_xact;
+ uint32 last_cleanup_num_delpages;
float8 last_cleanup_num_heap_tuples;
bool allequalimage;
} xl_btree_metadata;
@@ -187,7 +188,7 @@ typedef struct xl_btree_reuse_page
{
RelFileNode node;
BlockNumber block;
- TransactionId latestRemovedXid;
+ FullTransactionId latestRemovedFullXid;
} xl_btree_reuse_page;
#define SizeOfBtreeReusePage (sizeof(xl_btree_reuse_page))
@@ -282,9 +283,12 @@ typedef struct xl_btree_mark_page_halfdead
#define SizeOfBtreeMarkPageHalfDead (offsetof(xl_btree_mark_page_halfdead, topparent) + sizeof(BlockNumber))
/*
- * This is what we need to know about deletion of a btree page. Note we do
- * not store any content for the deleted page --- it is just rewritten as empty
- * during recovery, apart from resetting the btpo.xact.
+ * This is what we need to know about deletion of a btree page. Note that we
+ * only leave behind a small amount of bookkeeping information in deleted
+ * pages (deleted pages must be kept around as tombstones for a while). It is
+ * convenient for the REDO routine to regenerate its target page from scratch.
+ * This is why WAL record describes certain details that are actually directly
+ * available from the target page.
*
* Backup Blk 0: target block being deleted
* Backup Blk 1: target block's left sibling, if any
@@ -296,20 +300,24 @@ typedef struct xl_btree_unlink_page
{
BlockNumber leftsib; /* target block's left sibling, if any */
BlockNumber rightsib; /* target block's right sibling */
+ uint32 level; /* target block's level */
+ FullTransactionId safexid; /* target block's BTPageSetDeleted() value */
/*
- * Information needed to recreate the leaf page, when target is an
- * internal page.
+ * Information needed to recreate a half-dead leaf page with correct
+ * topparent link. The fields are only used when deletion operation's
+ * target page is an internal page. REDO routine creates half-dead page
+ * from scratch to keep things simple (this is the same convenient
+ * approach used for the target page itself).
*/
BlockNumber leafleftsib;
BlockNumber leafrightsib;
- BlockNumber topparent; /* next child down in the subtree */
+ BlockNumber topparent;
- TransactionId btpo_xact; /* value of btpo.xact for use in recovery */
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_UNLINK_PAGE_META */
} xl_btree_unlink_page;
-#define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, btpo_xact) + sizeof(TransactionId))
+#define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, topparent) + sizeof(BlockNumber))
/*
* New root log record. There are zero tuples if this is to establish an
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 94d33851d0..38fd85a431 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -31,6 +31,8 @@ extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
RelFileNode node);
+extern void ResolveRecoveryConflictWithSnapshotFullXid(FullTransactionId latestRemovedFullXid,
+ RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index f2eda79bc1..1c80eae044 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -394,28 +394,8 @@ gistRedoPageReuse(XLogReaderState *record)
* same exclusion effect on primary and standby.
*/
if (InHotStandby)
- {
- FullTransactionId latestRemovedFullXid = xlrec->latestRemovedFullXid;
- FullTransactionId nextXid = ReadNextFullTransactionId();
- uint64 diff;
-
- /*
- * ResolveRecoveryConflictWithSnapshot operates on 32-bit
- * TransactionIds, so truncate the logged FullTransactionId. If the
- * logged value is very old, so that XID wrap-around already happened
- * on it, there can't be any snapshots that still see it.
- */
- diff = U64FromFullTransactionId(nextXid) -
- U64FromFullTransactionId(latestRemovedFullXid);
- if (diff < MaxTransactionId / 2)
- {
- TransactionId latestRemovedXid;
-
- latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
- xlrec->node);
- }
- }
+ ResolveRecoveryConflictWithSnapshotFullXid(xlrec->latestRemovedFullXid,
+ xlrec->node);
}
void
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e333603912..1f74181f8d 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -326,6 +326,7 @@ _bt_search_insert(Relation rel, BTInsertState insertstate)
Page page;
BTPageOpaque opaque;
+ /* Deliberately omit "!_bt_page_recyclable(page)" Assert() */
_bt_checkpage(rel, insertstate->buf);
page = BufferGetPage(insertstate->buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1241,7 +1242,7 @@ _bt_insertonpg(Relation rel,
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
- if (metad->btm_fastlevel >= opaque->btpo.level)
+ if (metad->btm_fastlevel >= opaque->btpo_level)
{
/* no update wanted */
_bt_relbuf(rel, metabuf);
@@ -1268,7 +1269,7 @@ _bt_insertonpg(Relation rel,
if (metad->btm_version < BTREE_NOVAC_VERSION)
_bt_upgrademetapage(metapg);
metad->btm_fastroot = BufferGetBlockNumber(buf);
- metad->btm_fastlevel = opaque->btpo.level;
+ metad->btm_fastlevel = opaque->btpo_level;
MarkBufferDirty(metabuf);
}
@@ -1331,7 +1332,7 @@ _bt_insertonpg(Relation rel,
xlmeta.level = metad->btm_level;
xlmeta.fastroot = metad->btm_fastroot;
xlmeta.fastlevel = metad->btm_fastlevel;
- xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+ xlmeta.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
xlmeta.last_cleanup_num_heap_tuples =
metad->btm_last_cleanup_num_heap_tuples;
xlmeta.allequalimage = metad->btm_allequalimage;
@@ -1537,7 +1538,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
lopaque->btpo_prev = oopaque->btpo_prev;
/* handle btpo_next after rightpage buffer acquired */
- lopaque->btpo.level = oopaque->btpo.level;
+ lopaque->btpo_level = oopaque->btpo_level;
/* handle btpo_cycleid after rightpage buffer acquired */
/*
@@ -1722,7 +1723,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
ropaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
ropaque->btpo_prev = origpagenumber;
ropaque->btpo_next = oopaque->btpo_next;
- ropaque->btpo.level = oopaque->btpo.level;
+ ropaque->btpo_level = oopaque->btpo_level;
ropaque->btpo_cycleid = lopaque->btpo_cycleid;
/*
@@ -1950,7 +1951,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
uint8 xlinfo;
XLogRecPtr recptr;
- xlrec.level = ropaque->btpo.level;
+ xlrec.level = ropaque->btpo_level;
/* See comments below on newitem, orignewitem, and posting lists */
xlrec.firstrightoff = firstrightoff;
xlrec.newitemoff = newitemoff;
@@ -2142,7 +2143,7 @@ _bt_insert_parent(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* Find the leftmost page at the next level up */
- pbuf = _bt_get_endpoint(rel, opaque->btpo.level + 1, false, NULL);
+ pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL);
/* Set up a phony stack entry pointing there */
stack = &fakestack;
stack->bts_blkno = BufferGetBlockNumber(pbuf);
@@ -2480,15 +2481,15 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
rootopaque->btpo_flags = BTP_ROOT;
- rootopaque->btpo.level =
- ((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo.level + 1;
+ rootopaque->btpo_level =
+ ((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo_level + 1;
rootopaque->btpo_cycleid = 0;
/* update metapage data */
metad->btm_root = rootblknum;
- metad->btm_level = rootopaque->btpo.level;
+ metad->btm_level = rootopaque->btpo_level;
metad->btm_fastroot = rootblknum;
- metad->btm_fastlevel = rootopaque->btpo.level;
+ metad->btm_fastlevel = rootopaque->btpo_level;
/*
* Insert the left page pointer into the new root page. The root page is
@@ -2548,7 +2549,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
md.level = metad->btm_level;
md.fastroot = rootblknum;
md.fastlevel = metad->btm_level;
- md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+ md.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
md.allequalimage = metad->btm_allequalimage;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index ac264a5952..f748e72539 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -37,7 +37,7 @@
static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
- TransactionId latestRemovedXid);
+ FullTransactionId latestRemovedFullXid);
static void _bt_delitems_delete(Relation rel, Buffer buf,
TransactionId latestRemovedXid,
OffsetNumber *deletable, int ndeletable,
@@ -50,7 +50,6 @@ static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
BlockNumber scanblkno,
bool *rightsib_empty,
- TransactionId *oldestBtpoXact,
uint32 *ndeleted);
static bool _bt_lock_subtree_parent(Relation rel, BlockNumber child,
BTStack stack,
@@ -78,7 +77,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
metad->btm_level = level;
metad->btm_fastroot = rootbknum;
metad->btm_fastlevel = level;
- metad->btm_oldest_btpo_xact = InvalidTransactionId;
+ metad->btm_last_cleanup_num_delpages = 0;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
metad->btm_allequalimage = allequalimage;
@@ -118,7 +117,7 @@ _bt_upgrademetapage(Page page)
/* Set version number and fill extra fields added into version 3 */
metad->btm_version = BTREE_NOVAC_VERSION;
- metad->btm_oldest_btpo_xact = InvalidTransactionId;
+ metad->btm_last_cleanup_num_delpages = 0;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
/* Only a REINDEX can set this field */
Assert(!metad->btm_allequalimage);
@@ -176,8 +175,9 @@ _bt_getmeta(Relation rel, Buffer metabuf)
* to those written in the metapage. On mismatch, metapage is overwritten.
*/
void
-_bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
- float8 numHeapTuples)
+_bt_update_meta_cleanup_info(Relation rel,
+ BlockNumber pages_deleted_not_free,
+ float8 num_heap_tuples)
{
Buffer metabuf;
Page metapg;
@@ -190,11 +190,38 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
+ /*
+ * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
+ * field started out as a TransactionId field called btm_oldest_btpo_xact.
+ * Both "versions" are just uint32 fields. _bt_vacuum_needs_cleanup() has
+ * used both versions to decide when cleanup-only VACUUMs needs to call
+ * btvacuumscan. It was convenient to repurpose the field when we began
+ * to use 64-bit XIDs in deleted pages.
+ *
+ * It's possible that a pg_upgrade'd database will contain an XID value in
+ * what is now recognized as the metapage's btm_last_cleanup_num_delpages
+ * field. _bt_vacuum_needs_cleanup() may even believe that this value
+ * indicates that there are lots of pages that it needs to recycle, when
+ * in reality there are only one or two. The worst that can happen is
+ * that there will be a call to btvacuumscan a little earlier, which will
+ * end up here -- at which point btm_last_cleanup_num_delpages will
+ * contain a sane value.
+ *
+ * (Besides, this should only happen when there really are some pages that
+ * we will be able to recycle. If there are none at all then the metapage
+ * XID value will be InvalidTransactionId, which is 0 --- we'll manage to
+ * completely avoid even the minor annoyance of an early btvacuumscan.)
+ */
+ StaticAssertStmt(sizeof(metad->btm_last_cleanup_num_delpages) ==
+ sizeof(TransactionId),
+ "on-disk compatibility assumption violated");
+
/* outdated version of metapage always needs rewrite */
if (metad->btm_version < BTREE_NOVAC_VERSION)
needsRewrite = true;
- else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
- metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
+ else if (metad->btm_last_cleanup_num_delpages != pages_deleted_not_free)
+ needsRewrite = true;
+ else if (metad->btm_last_cleanup_num_heap_tuples != num_heap_tuples)
needsRewrite = true;
if (!needsRewrite)
@@ -214,8 +241,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
_bt_upgrademetapage(metapg);
/* update cleanup-related information */
- metad->btm_oldest_btpo_xact = oldestBtpoXact;
- metad->btm_last_cleanup_num_heap_tuples = numHeapTuples;
+ metad->btm_last_cleanup_num_delpages = pages_deleted_not_free;
+ metad->btm_last_cleanup_num_heap_tuples = num_heap_tuples;
MarkBufferDirty(metabuf);
/* write wal record if needed */
@@ -232,8 +259,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
md.level = metad->btm_level;
md.fastroot = metad->btm_fastroot;
md.fastlevel = metad->btm_fastlevel;
- md.oldest_btpo_xact = oldestBtpoXact;
- md.last_cleanup_num_heap_tuples = numHeapTuples;
+ md.last_cleanup_num_delpages = pages_deleted_not_free;
+ md.last_cleanup_num_heap_tuples = num_heap_tuples;
md.allequalimage = metad->btm_allequalimage;
XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
@@ -316,7 +343,7 @@ _bt_getroot(Relation rel, int access)
* because that's not set in a "fast root".
*/
if (!P_IGNORE(rootopaque) &&
- rootopaque->btpo.level == rootlevel &&
+ rootopaque->btpo_level == rootlevel &&
P_LEFTMOST(rootopaque) &&
P_RIGHTMOST(rootopaque))
{
@@ -377,7 +404,7 @@ _bt_getroot(Relation rel, int access)
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
rootopaque->btpo_flags = (BTP_LEAF | BTP_ROOT);
- rootopaque->btpo.level = 0;
+ rootopaque->btpo_level = 0;
rootopaque->btpo_cycleid = 0;
/* Get raw page pointer for metapage */
metapg = BufferGetPage(metabuf);
@@ -393,7 +420,7 @@ _bt_getroot(Relation rel, int access)
metad->btm_level = 0;
metad->btm_fastroot = rootblkno;
metad->btm_fastlevel = 0;
- metad->btm_oldest_btpo_xact = InvalidTransactionId;
+ metad->btm_last_cleanup_num_delpages = 0;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
MarkBufferDirty(rootbuf);
@@ -416,7 +443,7 @@ _bt_getroot(Relation rel, int access)
md.level = 0;
md.fastroot = rootblkno;
md.fastlevel = 0;
- md.oldest_btpo_xact = InvalidTransactionId;
+ md.last_cleanup_num_delpages = 0;
md.last_cleanup_num_heap_tuples = -1.0;
md.allequalimage = metad->btm_allequalimage;
@@ -481,11 +508,10 @@ _bt_getroot(Relation rel, int access)
rootblkno = rootopaque->btpo_next;
}
- /* Note: can't check btpo.level on deleted pages */
- if (rootopaque->btpo.level != rootlevel)
+ if (rootopaque->btpo_level != rootlevel)
elog(ERROR, "root page %u of index \"%s\" has level %u, expected %u",
rootblkno, RelationGetRelationName(rel),
- rootopaque->btpo.level, rootlevel);
+ rootopaque->btpo_level, rootlevel);
}
/*
@@ -585,11 +611,10 @@ _bt_gettrueroot(Relation rel)
rootblkno = rootopaque->btpo_next;
}
- /* Note: can't check btpo.level on deleted pages */
- if (rootopaque->btpo.level != rootlevel)
+ if (rootopaque->btpo_level != rootlevel)
elog(ERROR, "root page %u of index \"%s\" has level %u, expected %u",
rootblkno, RelationGetRelationName(rel),
- rootopaque->btpo.level, rootlevel);
+ rootopaque->btpo_level, rootlevel);
return rootbuf;
}
@@ -762,7 +787,8 @@ _bt_checkpage(Relation rel, Buffer buf)
* Log the reuse of a page from the FSM.
*/
static void
-_bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedXid)
+_bt_log_reuse_page(Relation rel, BlockNumber blkno,
+ FullTransactionId latestRemovedFullXid)
{
xl_btree_reuse_page xlrec_reuse;
@@ -775,7 +801,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
/* XLOG stuff */
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
- xlrec_reuse.latestRemovedXid = latestRemovedXid;
+ xlrec_reuse.latestRemovedFullXid = latestRemovedFullXid;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec_reuse, SizeOfBtreeReusePage);
@@ -815,6 +841,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
buf = ReadBuffer(rel, blkno);
_bt_lockbuf(rel, buf, access);
_bt_checkpage(rel, buf);
+ Assert(!_bt_page_recyclable(BufferGetPage(buf)));
}
else
{
@@ -862,17 +889,18 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
* If we are generating WAL for Hot Standby then create a
* WAL record that will allow us to conflict with queries
* running on standby, in case they have snapshots older
- * than btpo.xact. This can only apply if the page does
- * have a valid btpo.xact value, ie not if it's new. (We
- * must check that because an all-zero page has no special
- * space.)
+ * than safexid value returned by BTPageGetDeleteXid().
+ * This can only apply if the page does have a valid
+ * safexid value, ie not if it's new. (We must check that
+ * because an all-zero page has no special space.)
*/
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel) &&
!PageIsNew(page))
{
- BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ FullTransactionId latestRemovedFullXid;
- _bt_log_reuse_page(rel, blkno, opaque->btpo.xact);
+ latestRemovedFullXid = BTPageGetDeleteXid(page);
+ _bt_log_reuse_page(rel, blkno, latestRemovedFullXid);
}
/* Okay to use page. Re-initialize and return it */
@@ -952,6 +980,7 @@ _bt_relandgetbuf(Relation rel, Buffer obuf, BlockNumber blkno, int access)
_bt_lockbuf(rel, buf, access);
_bt_checkpage(rel, buf);
+ Assert(!_bt_page_recyclable(BufferGetPage(buf)));
return buf;
}
@@ -1101,9 +1130,31 @@ _bt_page_recyclable(Page page)
* interested in it.
*/
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- if (P_ISDELETED(opaque) &&
- GlobalVisCheckRemovableXid(NULL, opaque->btpo.xact))
- return true;
+ if (P_ISDELETED(opaque))
+ {
+ /*
+ * If this is a pg_upgrade'd index, then this could be a deleted page
+ * whose XID (which is stored in special area's level field via type
+ * punning) is non-full 32-bit value. It's safe to just assume that
+ * we can recycle because the system must have been restarted since
+ * the time of deletion.
+ */
+ if (!P_HAS_FULLXID(opaque))
+ return true;
+
+ /*
+ * The page was deleted, but when? If it was just deleted, a scan
+ * might have seen the downlink to it, and will read the page later.
+ * As long as that can happen, we must keep the deleted page around as
+ * a tombstone.
+ *
+ * For that check if the deletion XID could still be visible to
+ * anyone. If not, then no scan that's still in progress could have
+ * seen its downlink, and we can recycle it.
+ */
+ return GlobalVisCheckRemovableFullXid(NULL, BTPageGetDeleteXid(page));
+ }
+
return false;
}
@@ -1768,16 +1819,12 @@ _bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
* that the btvacuumscan scan has yet to reach; they'll get counted later
* instead.
*
- * Maintains *oldestBtpoXact for any pages that get deleted. Caller is
- * responsible for maintaining *oldestBtpoXact in the case of pages that were
- * deleted by a previous VACUUM.
- *
* NOTE: this leaks memory. Rather than trying to clean up everything
* carefully, it's better to run it in a temp context that can be reset
* frequently.
*/
uint32
-_bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
+_bt_pagedel(Relation rel, Buffer leafbuf)
{
uint32 ndeleted = 0;
BlockNumber rightsib;
@@ -1985,8 +2032,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
{
/* Check for interrupts in _bt_unlink_halfdead_page */
if (!_bt_unlink_halfdead_page(rel, leafbuf, scanblkno,
- &rightsib_empty, oldestBtpoXact,
- &ndeleted))
+ &rightsib_empty, &ndeleted))
{
/*
* _bt_unlink_halfdead_page should never fail, since we
@@ -2001,9 +2047,8 @@ _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
}
}
- Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque));
- Assert(TransactionIdFollowsOrEquals(opaque->btpo.xact,
- *oldestBtpoXact));
+ Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque) &&
+ P_HAS_FULLXID(opaque));
rightsib = opaque->btpo_next;
@@ -2264,12 +2309,6 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
* containing leafbuf. (We always set *rightsib_empty for caller, just to be
* consistent.)
*
- * We maintain *oldestBtpoXact for pages that are deleted by the current
- * VACUUM operation here. This must be handled here because we conservatively
- * assume that there needs to be a new call to ReadNewTransactionId() each
- * time a page gets deleted. See comments about the underlying assumption
- * below.
- *
* Must hold pin and lock on leafbuf at entry (read or write doesn't matter).
* On success exit, we'll be holding pin and write lock. On failure exit,
* we'll release both pin and lock before returning (we define it that way
@@ -2277,8 +2316,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
*/
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
- bool *rightsib_empty, TransactionId *oldestBtpoXact,
- uint32 *ndeleted)
+ bool *rightsib_empty, uint32 *ndeleted)
{
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
BlockNumber leafleftsib;
@@ -2294,12 +2332,12 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
BTMetaPageData *metad = NULL;
ItemId itemid;
Page page;
- PageHeader header;
BTPageOpaque opaque;
+ FullTransactionId safexid;
bool rightsib_is_rightmost;
- int targetlevel;
+ uint32 targetlevel;
IndexTuple leafhikey;
- BlockNumber nextchild;
+ BlockNumber topparent_in_target;
page = BufferGetPage(leafbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2343,7 +2381,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
leftsib = opaque->btpo_prev;
- targetlevel = opaque->btpo.level;
+ targetlevel = opaque->btpo_level;
Assert(targetlevel > 0);
/*
@@ -2450,7 +2488,9 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
!P_ISLEAF(opaque) || !P_ISHALFDEAD(opaque))
elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
target, RelationGetRelationName(rel));
- nextchild = InvalidBlockNumber;
+
+ /* Leaf page is also target page: don't set topparent */
+ topparent_in_target = InvalidBlockNumber;
}
else
{
@@ -2459,11 +2499,13 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
target, RelationGetRelationName(rel));
- /* Remember the next non-leaf child down in the subtree */
+ /* Internal page is target: we'll set topparent in leaf page... */
itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
- nextchild = BTreeTupleGetDownLink((IndexTuple) PageGetItem(page, itemid));
- if (nextchild == leafblkno)
- nextchild = InvalidBlockNumber;
+ topparent_in_target =
+ BTreeTupleGetTopParent((IndexTuple) PageGetItem(page, itemid));
+ /* ...except when it would be a redundant pointer-to-self */
+ if (topparent_in_target == leafblkno)
+ topparent_in_target = InvalidBlockNumber;
}
/*
@@ -2553,13 +2595,13 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
* no lock was held.
*/
if (target != leafblkno)
- BTreeTupleSetTopParent(leafhikey, nextchild);
+ BTreeTupleSetTopParent(leafhikey, topparent_in_target);
/*
* Mark the page itself deleted. It can be recycled when all current
* transactions are gone. Storing GetTopTransactionId() would work, but
* we're in VACUUM and would not otherwise have an XID. Having already
- * updated links to the target, ReadNewTransactionId() suffices as an
+ * updated links to the target, ReadNextFullTransactionId() suffices as an
* upper bound. Any scan having retained a now-stale link is advertising
* in its PGPROC an xmin less than or equal to the value we read here. It
* will continue to do so, holding back the xmin horizon, for the duration
@@ -2568,17 +2610,14 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
Assert(P_ISHALFDEAD(opaque) || !P_ISLEAF(opaque));
- opaque->btpo_flags &= ~BTP_HALF_DEAD;
- opaque->btpo_flags |= BTP_DELETED;
- opaque->btpo.xact = ReadNewTransactionId();
/*
- * Remove the remaining tuples on the page. This keeps things simple for
- * WAL consistency checking.
+ * Store upper bound XID that's used to determine when deleted page is no
+ * longer needed as a tombstone
*/
- header = (PageHeader) page;
- header->pd_lower = SizeOfPageHeaderData;
- header->pd_upper = header->pd_special;
+ safexid = ReadNextFullTransactionId();
+ BTPageSetDeleted(page, safexid);
+ opaque->btpo_cycleid = 0;
/* And update the metapage, if needed */
if (BufferIsValid(metabuf))
@@ -2616,15 +2655,16 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (target != leafblkno)
XLogRegisterBuffer(3, leafbuf, REGBUF_WILL_INIT);
- /* information on the unlinked block */
+ /* information stored on the target/to-be-unlinked block */
xlrec.leftsib = leftsib;
xlrec.rightsib = rightsib;
- xlrec.btpo_xact = opaque->btpo.xact;
+ xlrec.level = targetlevel;
+ xlrec.safexid = safexid;
/* information needed to recreate the leaf block (if not the target) */
xlrec.leafleftsib = leafleftsib;
xlrec.leafrightsib = leafrightsib;
- xlrec.topparent = nextchild;
+ xlrec.topparent = topparent_in_target;
XLogRegisterData((char *) &xlrec, SizeOfBtreeUnlinkPage);
@@ -2638,7 +2678,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
xlmeta.level = metad->btm_level;
xlmeta.fastroot = metad->btm_fastroot;
xlmeta.fastlevel = metad->btm_fastlevel;
- xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+ xlmeta.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
xlmeta.allequalimage = metad->btm_allequalimage;
@@ -2681,9 +2721,9 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_relbuf(rel, lbuf);
_bt_relbuf(rel, rbuf);
- if (!TransactionIdIsValid(*oldestBtpoXact) ||
- TransactionIdPrecedes(opaque->btpo.xact, *oldestBtpoXact))
- *oldestBtpoXact = opaque->btpo.xact;
+ /* If the target is not leafbuf, we're done with it now -- release it */
+ if (target != leafblkno)
+ _bt_relbuf(rel, buf);
/*
* If btvacuumscan won't revisit this page in a future btvacuumpage call
@@ -2693,10 +2733,6 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (target <= scanblkno)
(*ndeleted)++;
- /* If the target is not leafbuf, we're done with it now -- release it */
- if (target != leafblkno)
- _bt_relbuf(rel, buf);
-
return true;
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 289bd3c15d..12ad877b70 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -46,8 +46,6 @@ typedef struct
IndexBulkDeleteCallback callback;
void *callback_state;
BTCycleId cycleid;
- BlockNumber totFreePages; /* true total # of free pages */
- TransactionId oldestBtpoXact;
MemoryContext pagedelcontext;
} BTVacState;
@@ -802,66 +800,87 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
Buffer metabuf;
Page metapg;
BTMetaPageData *metad;
- bool result = false;
+ BTOptions *relopts;
+ float8 cleanup_scale_factor;
+ uint32 btm_version;
+ BlockNumber prev_pages_deleted_not_free;
+ float8 prev_num_heap_tuples;
+ /*
+ * Copy details from metapage to local variables quickly.
+ *
+ * Note that we deliberately avoid using cached version of metapage here.
+ */
metabuf = _bt_getbuf(info->index, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
+ btm_version = metad->btm_version;
+
+ if (btm_version < BTREE_NOVAC_VERSION)
+ {
+ /*
+ * Metapage needs to be dynamically upgraded to store fields that are
+ * only present when btm_version >= BTREE_NOVAC_VERSION
+ */
+ _bt_relbuf(info->index, metabuf);
+ return true;
+ }
+
+ prev_pages_deleted_not_free = metad->btm_last_cleanup_num_delpages;
+ prev_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ _bt_relbuf(info->index, metabuf);
/*
- * XXX: If IndexVacuumInfo contained the heap relation, we could be more
- * aggressive about vacuuming non catalog relations by passing the table
- * to GlobalVisCheckRemovableXid().
+ * If table receives enough insertions and no cleanup was performed, then
+ * index would appear have stale statistics. If scale factor is set, we
+ * avoid that by performing cleanup if the number of inserted tuples
+ * exceeds vacuum_cleanup_index_scale_factor fraction of original tuples
+ * count.
*/
+ relopts = (BTOptions *) info->index->rd_options;
+ cleanup_scale_factor = (relopts &&
+ relopts->vacuum_cleanup_index_scale_factor >= 0)
+ ? relopts->vacuum_cleanup_index_scale_factor
+ : vacuum_cleanup_index_scale_factor;
- if (metad->btm_version < BTREE_NOVAC_VERSION)
- {
- /*
- * Do cleanup if metapage needs upgrade, because we don't have
- * cleanup-related meta-information yet.
- */
- result = true;
- }
- else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
- GlobalVisCheckRemovableXid(NULL, metad->btm_oldest_btpo_xact))
- {
- /*
- * If any oldest btpo.xact from a previously deleted page in the index
- * is visible to everyone, then at least one deleted page can be
- * recycled -- don't skip cleanup.
- */
- result = true;
- }
- else
- {
- BTOptions *relopts;
- float8 cleanup_scale_factor;
- float8 prev_num_heap_tuples;
+ if (cleanup_scale_factor <= 0 ||
+ info->num_heap_tuples < 0 ||
+ prev_num_heap_tuples <= 0 ||
+ (info->num_heap_tuples - prev_num_heap_tuples) /
+ prev_num_heap_tuples >= cleanup_scale_factor)
+ return true;
- /*
- * If table receives enough insertions and no cleanup was performed,
- * then index would appear have stale statistics. If scale factor is
- * set, we avoid that by performing cleanup if the number of inserted
- * tuples exceeds vacuum_cleanup_index_scale_factor fraction of
- * original tuples count.
- */
- relopts = (BTOptions *) info->index->rd_options;
- cleanup_scale_factor = (relopts &&
- relopts->vacuum_cleanup_index_scale_factor >= 0)
- ? relopts->vacuum_cleanup_index_scale_factor
- : vacuum_cleanup_index_scale_factor;
- prev_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ /*
+ * Trigger cleanup in rare cases where prev_pages_deleted_not_free exceeds
+ * 2.5% of the total size of the index. We can reasonably expect (though
+ * are not guaranteed) to be able to recycle this many pages if we decide
+ * to do a btvacuumscan call as part of this btvacuumcleanup call.
+ *
+ * Our approach won't reliably avoid "wasted" cleanup-only btvacuumscan
+ * calls. That is, we can end up scanning the entire index without ever
+ * placing even 1 of the prev_pages_deleted_not_free pages in the free
+ * space map, at least in certain narrow cases.
+ *
+ * For example, a "wasted" scan will happen when (for whatever reason) no
+ * XIDs were assigned/allocated since the "# deleted pages" field was last
+ * set in metapage by VACUUM. You can observe this yourself by running
+ * two VACUUM VERBOSE commands one after the other on an otherwise idle
+ * system. When the first VACUUM command manages to delete pages that
+ * were emptied + deleted during btbulkdelete, the second VACUUM command
+ * won't be able to place those same deleted pages (which won't be newly
+ * deleted from the perspective of the second VACUUM command) into the FSM
+ * during btvacuumcleanup.
+ *
+ * Another "wasted FSM-driven cleanup scan" scenario can occur when
+ * VACUUM's ability to do work is hindered by a long held MVCC snapshot.
+ * The snapshot prevents page recycling/freeing within btvacuumscan,
+ * though that will be the least of the DBA's problems.
+ */
+ if (prev_pages_deleted_not_free >
+ RelationGetNumberOfBlocks(info->index) / 40)
+ return true;
- if (cleanup_scale_factor <= 0 ||
- info->num_heap_tuples < 0 ||
- prev_num_heap_tuples <= 0 ||
- (info->num_heap_tuples - prev_num_heap_tuples) /
- prev_num_heap_tuples >= cleanup_scale_factor)
- result = true;
- }
-
- _bt_relbuf(info->index, metabuf);
- return result;
+ return false;
}
/*
@@ -973,6 +992,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BlockNumber num_pages;
BlockNumber scanblkno;
bool needLock;
+ BlockNumber pages_deleted_not_free;
/*
* Reset counts that will be incremented during the scan; needed in case
@@ -981,6 +1001,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
stats->estimated_count = false;
stats->num_index_tuples = 0;
stats->pages_deleted = 0;
+ stats->pages_free = 0;
/* Set up info to pass down to btvacuumpage */
vstate.info = info;
@@ -988,8 +1009,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
vstate.callback = callback;
vstate.callback_state = callback_state;
vstate.cycleid = cycleid;
- vstate.totFreePages = 0;
- vstate.oldestBtpoXact = InvalidTransactionId;
/* Create a temporary memory context to run _bt_pagedel in */
vstate.pagedelcontext = AllocSetContextCreate(CurrentMemoryContext,
@@ -1048,6 +1067,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
}
}
+ /* Set statistics num_pages field to final size of index */
+ stats->num_pages = num_pages;
+
MemoryContextDelete(vstate.pagedelcontext);
/*
@@ -1062,27 +1084,26 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* Note that if no recyclable pages exist, we don't bother vacuuming the
* FSM at all.
*/
- if (vstate.totFreePages > 0)
+ if (stats->pages_free > 0)
IndexFreeSpaceMapVacuum(rel);
/*
- * Maintain the oldest btpo.xact and a count of the current number of heap
- * tuples in the metapage (for the benefit of _bt_vacuum_needs_cleanup).
+ * Maintain the count of the current number of heap tuples in the
+ * metapage. Also maintain the last pages_deleted_not_free. Both values
+ * are used within _bt_vacuum_needs_cleanup.
*
- * The page with the oldest btpo.xact is typically a page deleted by this
- * VACUUM operation, since pages deleted by a previous VACUUM operation
- * tend to be placed in the FSM (by the current VACUUM operation) -- such
- * pages are not candidates to be the oldest btpo.xact. (Note that pages
- * placed in the FSM are reported as deleted pages in the bulk delete
- * statistics, despite not counting as deleted pages for the purposes of
- * determining the oldest btpo.xact.)
+ * pages_deleted_not_free is the number of deleted pages now in the index
+ * that were not safe to place in the FSM to be recycled just yet. We
+ * expect that it will almost certainly be possible to place all of these
+ * pages in the FSM during the next VACUUM operation. If the next VACUUM
+ * operation happens to be cleanup-only, _bt_vacuum_needs_cleanup will be
+ * called. We may decide to proceed with a call to btvacuumscan purely
+ * because there are lots of deleted pages not yet placed in the FSM.
*/
- _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
+ Assert(stats->pages_deleted >= stats->pages_free);
+ pages_deleted_not_free = stats->pages_deleted - stats->pages_free;
+ _bt_update_meta_cleanup_info(rel, pages_deleted_not_free,
info->num_heap_tuples);
-
- /* update statistics */
- stats->num_pages = num_pages;
- stats->pages_free = vstate.totFreePages;
}
/*
@@ -1122,9 +1143,10 @@ backtrack:
/*
* We can't use _bt_getbuf() here because it always applies
- * _bt_checkpage(), which will barf on an all-zero page. We want to
- * recycle all-zero pages, not fail. Also, we want to use a nondefault
- * buffer access strategy.
+ * _bt_checkpage(), which will barf on an all-zero page. We want to
+ * recycle all-zero pages, not fail. Similarly, must avoid _bt_getbuf()
+ * because it Assert()s that any existing page must be non-recyclable.
+ * And, we want to use a buffer access strategy here.
*/
buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
info->strategy);
@@ -1193,8 +1215,8 @@ backtrack:
{
/* Okay to recycle this page (which could be leaf or internal) */
RecordFreeIndexPage(rel, blkno);
- vstate->totFreePages++;
stats->pages_deleted++;
+ stats->pages_free++;
}
else if (P_ISDELETED(opaque))
{
@@ -1203,17 +1225,12 @@ backtrack:
* recycle yet.
*/
stats->pages_deleted++;
-
- /* Maintain the oldest btpo.xact */
- if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
- TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
- vstate->oldestBtpoXact = opaque->btpo.xact;
}
else if (P_ISHALFDEAD(opaque))
{
/*
* Half-dead leaf page. Try to delete now. Might update
- * oldestBtpoXact and pages_deleted below.
+ * pages_deleted below.
*/
attempt_pagedel = true;
}
@@ -1430,7 +1447,7 @@ backtrack:
* count. There will be no double-counting.
*/
Assert(blkno == scanblkno);
- stats->pages_deleted += _bt_pagedel(rel, buf, &vstate->oldestBtpoXact);
+ stats->pages_deleted += _bt_pagedel(rel, buf);
MemoryContextSwitchTo(oldcontext);
/* pagedel released buffer, so we shouldn't */
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 2e3bda8171..d1177d8772 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -169,7 +169,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* we're on the level 1 and asked to lock leaf page in write mode,
* then lock next page in write mode, because it must be a leaf.
*/
- if (opaque->btpo.level == 1 && access == BT_WRITE)
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
page_access = BT_WRITE;
/* drop the read lock on the page, then acquire one on its child */
@@ -2341,9 +2341,9 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
}
/* Done? */
- if (opaque->btpo.level == level)
+ if (opaque->btpo_level == level)
break;
- if (opaque->btpo.level < level)
+ if (opaque->btpo_level < level)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("btree level %u not found in index \"%s\"",
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5683daa34d..2c4d7f6e25 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -620,7 +620,7 @@ _bt_blnewpage(uint32 level)
/* Initialize BT opaque state */
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_prev = opaque->btpo_next = P_NONE;
- opaque->btpo.level = level;
+ opaque->btpo_level = level;
opaque->btpo_flags = (level > 0) ? 0 : BTP_LEAF;
opaque->btpo_cycleid = 0;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c1d578cc01..b6afe9526e 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -112,7 +112,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
md->btm_fastlevel = xlrec->fastlevel;
/* Cannot log BTREE_MIN_VERSION index metapage without upgrade */
Assert(md->btm_version >= BTREE_NOVAC_VERSION);
- md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
+ md->btm_last_cleanup_num_delpages = xlrec->last_cleanup_num_delpages;
md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
md->btm_allequalimage = xlrec->allequalimage;
@@ -297,7 +297,7 @@ btree_xlog_split(bool newitemonleft, XLogReaderState *record)
ropaque->btpo_prev = origpagenumber;
ropaque->btpo_next = spagenumber;
- ropaque->btpo.level = xlrec->level;
+ ropaque->btpo_level = xlrec->level;
ropaque->btpo_flags = isleaf ? BTP_LEAF : 0;
ropaque->btpo_cycleid = 0;
@@ -773,7 +773,7 @@ btree_xlog_mark_page_halfdead(uint8 info, XLogReaderState *record)
pageop->btpo_prev = xlrec->leftblk;
pageop->btpo_next = xlrec->rightblk;
- pageop->btpo.level = 0;
+ pageop->btpo_level = 0;
pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
pageop->btpo_cycleid = 0;
@@ -802,6 +802,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
BlockNumber leftsib;
BlockNumber rightsib;
+ uint32 level;
+ bool isleaf;
+ FullTransactionId safexid;
Buffer leftbuf;
Buffer target;
Buffer rightbuf;
@@ -810,6 +813,12 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
leftsib = xlrec->leftsib;
rightsib = xlrec->rightsib;
+ level = xlrec->level;
+ isleaf = (level == 0);
+ safexid = xlrec->safexid;
+
+ /* No topparent link for leaf page (level 0) or level 1 */
+ Assert(xlrec->topparent == InvalidBlockNumber || level > 1);
/*
* In normal operation, we would lock all the pages this WAL record
@@ -844,9 +853,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
pageop->btpo_prev = leftsib;
pageop->btpo_next = rightsib;
- pageop->btpo.xact = xlrec->btpo_xact;
- pageop->btpo_flags = BTP_DELETED;
- if (!BlockNumberIsValid(xlrec->topparent))
+ pageop->btpo_level = level;
+ BTPageSetDeleted(page, safexid);
+ if (isleaf)
pageop->btpo_flags |= BTP_LEAF;
pageop->btpo_cycleid = 0;
@@ -892,6 +901,8 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
Buffer leafbuf;
IndexTupleData trunctuple;
+ Assert(!isleaf);
+
leafbuf = XLogInitBufferForRedo(record, 3);
page = (Page) BufferGetPage(leafbuf);
@@ -901,7 +912,7 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
pageop->btpo_prev = xlrec->leafleftsib;
pageop->btpo_next = xlrec->leafrightsib;
- pageop->btpo.level = 0;
+ pageop->btpo_level = 0;
pageop->btpo_cycleid = 0;
/* Add a dummy hikey item */
@@ -942,7 +953,7 @@ btree_xlog_newroot(XLogReaderState *record)
pageop->btpo_flags = BTP_ROOT;
pageop->btpo_prev = pageop->btpo_next = P_NONE;
- pageop->btpo.level = xlrec->level;
+ pageop->btpo_level = xlrec->level;
if (xlrec->level == 0)
pageop->btpo_flags |= BTP_LEAF;
pageop->btpo_cycleid = 0;
@@ -972,17 +983,15 @@ btree_xlog_reuse_page(XLogReaderState *record)
* Btree reuse_page records exist to provide a conflict point when we
* reuse pages in the index via the FSM. That's all they do though.
*
- * latestRemovedXid was the page's btpo.xact. The
- * GlobalVisCheckRemovableXid test in _bt_page_recyclable() conceptually
- * mirrors the pgxact->xmin > limitXmin test in
+ * latestRemovedXid was the page's deleteXid. The
+ * GlobalVisCheckRemovableFullXid(deleteXid) test in _bt_page_recyclable()
+ * conceptually mirrors the PGPROC->xmin > limitXmin test in
* GetConflictingVirtualXIDs(). Consequently, one XID value achieves the
* same exclusion effect on primary and standby.
*/
if (InHotStandby)
- {
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
- }
+ ResolveRecoveryConflictWithSnapshotFullXid(xlrec->latestRemovedFullXid,
+ xlrec->node);
}
void
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 6e0d6a2b72..5cce10a5b6 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -80,9 +80,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) rec;
- appendStringInfo(buf, "left %u; right %u; btpo_xact %u; ",
- xlrec->leftsib, xlrec->rightsib,
- xlrec->btpo_xact);
+ appendStringInfo(buf, "left %u; right %u; level %u; safexid %u:%u; ",
+ xlrec->leftsib, xlrec->rightsib, xlrec->level,
+ EpochFromFullTransactionId(xlrec->safexid),
+ XidFromFullTransactionId(xlrec->safexid));
appendStringInfo(buf, "leafleft %u; leafright %u; topparent %u",
xlrec->leafleftsib, xlrec->leafrightsib,
xlrec->topparent);
@@ -99,9 +100,11 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) rec;
- appendStringInfo(buf, "rel %u/%u/%u; latestRemovedXid %u",
+ appendStringInfo(buf, "rel %u/%u/%u; latestRemovedXid %u:%u",
xlrec->node.spcNode, xlrec->node.dbNode,
- xlrec->node.relNode, xlrec->latestRemovedXid);
+ xlrec->node.relNode,
+ EpochFromFullTransactionId(xlrec->latestRemovedFullXid),
+ XidFromFullTransactionId(xlrec->latestRemovedFullXid));
break;
}
case XLOG_BTREE_META_CLEANUP:
@@ -110,8 +113,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
xlrec = (xl_btree_metadata *) XLogRecGetBlockData(record, 0,
NULL);
- appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f",
- xlrec->oldest_btpo_xact,
+ appendStringInfo(buf, "last_cleanup_num_delpages %u; last_cleanup_num_heap_tuples: %f",
+ xlrec->last_cleanup_num_delpages,
xlrec->last_cleanup_num_heap_tuples);
break;
}
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 39a30c00f7..0eeb766943 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -452,6 +452,34 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
true);
}
+/*
+ * Variant of ResolveRecoveryConflictWithSnapshot that works with
+ * FullTransactionId values
+ */
+void
+ResolveRecoveryConflictWithSnapshotFullXid(FullTransactionId latestRemovedFullXid,
+ RelFileNode node)
+{
+ /*
+ * ResolveRecoveryConflictWithSnapshot operates on 32-bit TransactionIds,
+ * so truncate the logged FullTransactionId. If the logged value is very
+ * old, so that XID wrap-around already happened on it, there can't be any
+ * snapshots that still see it.
+ */
+ FullTransactionId nextXid = ReadNextFullTransactionId();
+ uint64 diff;
+
+ diff = U64FromFullTransactionId(nextXid) -
+ U64FromFullTransactionId(latestRemovedFullXid);
+ if (diff < MaxTransactionId / 2)
+ {
+ TransactionId latestRemovedXid;
+
+ latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid, node);
+ }
+}
+
void
ResolveRecoveryConflictWithTablespace(Oid tsid)
{
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index b8c7793d9e..0272c6554f 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -769,7 +769,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
P_FIRSTDATAKEY(opaque));
itup = (IndexTuple) PageGetItem(state->target, itemid);
nextleveldown.leftmost = BTreeTupleGetDownLink(itup);
- nextleveldown.level = opaque->btpo.level - 1;
+ nextleveldown.level = opaque->btpo_level - 1;
}
else
{
@@ -794,14 +794,14 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
if (opaque->btpo_prev != leftcurrent)
bt_recheck_sibling_links(state, opaque->btpo_prev, leftcurrent);
- /* Check level, which must be valid for non-ignorable page */
- if (level.level != opaque->btpo.level)
+ /* Check level */
+ if (level.level != opaque->btpo_level)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("leftmost down link for level points to block in index \"%s\" whose level is not one level down",
RelationGetRelationName(state->rel)),
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
- current, level.level, opaque->btpo.level)));
+ current, level.level, opaque->btpo_level)));
/* Verify invariants for page */
bt_target_page_check(state);
@@ -1167,7 +1167,7 @@ bt_target_page_check(BtreeCheckState *state)
bt_child_highkey_check(state,
offset,
NULL,
- topaque->btpo.level);
+ topaque->btpo_level);
}
continue;
}
@@ -1529,7 +1529,7 @@ bt_target_page_check(BtreeCheckState *state)
if (!P_ISLEAF(topaque) && P_RIGHTMOST(topaque) && state->readonly)
{
bt_child_highkey_check(state, InvalidOffsetNumber,
- NULL, topaque->btpo.level);
+ NULL, topaque->btpo_level);
}
}
@@ -1606,7 +1606,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
ereport(DEBUG1,
(errcode(ERRCODE_NO_DATA),
errmsg("level %u leftmost page of index \"%s\" was found deleted or half dead",
- opaque->btpo.level, RelationGetRelationName(state->rel)),
+ opaque->btpo_level, RelationGetRelationName(state->rel)),
errdetail_internal("Deleted page found when building scankey from right sibling.")));
/* Be slightly more pro-active in freeing this memory, just in case */
@@ -1910,14 +1910,15 @@ bt_child_highkey_check(BtreeCheckState *state,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
- /* Check level for non-ignorable page */
- if (!P_IGNORE(opaque) && opaque->btpo.level != target_level - 1)
+ /* Do level sanity check */
+ if ((!P_ISDELETED(opaque) || P_HAS_FULLXID(opaque)) &&
+ opaque->btpo_level != target_level - 1)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block found while following rightlinks from child of index \"%s\" has invalid level",
RelationGetRelationName(state->rel)),
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
- blkno, target_level - 1, opaque->btpo.level)));
+ blkno, target_level - 1, opaque->btpo_level)));
/* Try to detect circular links */
if ((!first && blkno == state->prevrightlink) || blkno == opaque->btpo_prev)
@@ -2145,7 +2146,7 @@ bt_child_check(BtreeCheckState *state, BTScanInsert targetkey,
* check for downlink connectivity.
*/
bt_child_highkey_check(state, downlinkoffnum,
- child, topaque->btpo.level);
+ child, topaque->btpo_level);
/*
* Since there cannot be a concurrent VACUUM operation in readonly mode,
@@ -2290,7 +2291,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
errmsg("harmless interrupted page split detected in index %s",
RelationGetRelationName(state->rel)),
errdetail_internal("Block=%u level=%u left sibling=%u page lsn=%X/%X.",
- blkno, opaque->btpo.level,
+ blkno, opaque->btpo_level,
opaque->btpo_prev,
(uint32) (pagelsn >> 32),
(uint32) pagelsn)));
@@ -2321,7 +2322,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
elog(DEBUG1, "checking for interrupted multi-level deletion due to missing downlink in index \"%s\"",
RelationGetRelationName(state->rel));
- level = opaque->btpo.level;
+ level = opaque->btpo_level;
itemid = PageGetItemIdCareful(state, blkno, page, P_FIRSTDATAKEY(opaque));
itup = (IndexTuple) PageGetItem(page, itemid);
childblk = BTreeTupleGetDownLink(itup);
@@ -2336,16 +2337,16 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
break;
/* Do an extra sanity check in passing on internal pages */
- if (copaque->btpo.level != level - 1)
+ if (copaque->btpo_level != level - 1)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("downlink points to block in index \"%s\" whose level is not one level down",
RelationGetRelationName(state->rel)),
errdetail_internal("Top parent/under check block=%u block pointed to=%u expected level=%u level in pointed to block=%u.",
blkno, childblk,
- level - 1, copaque->btpo.level)));
+ level - 1, copaque->btpo_level)));
- level = copaque->btpo.level;
+ level = copaque->btpo_level;
itemid = PageGetItemIdCareful(state, childblk, child,
P_FIRSTDATAKEY(copaque));
itup = (IndexTuple) PageGetItem(child, itemid);
@@ -2407,7 +2408,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
errmsg("internal index block lacks downlink in index \"%s\"",
RelationGetRelationName(state->rel)),
errdetail_internal("Block=%u level=%u page lsn=%X/%X.",
- blkno, opaque->btpo.level,
+ blkno, opaque->btpo_level,
(uint32) (pagelsn >> 32),
(uint32) pagelsn)));
}
@@ -3002,21 +3003,26 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
}
/*
- * Deleted pages have no sane "level" field, so can only check non-deleted
- * page level
+ * Deleted pages that still use the old 32-bit XID representation have no
+ * sane "level" field because they type pun the field, but all other pages
+ * (including pages deleted on Postgres 14+) have a valid value.
*/
- if (P_ISLEAF(opaque) && !P_ISDELETED(opaque) && opaque->btpo.level != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("invalid leaf page level %u for block %u in index \"%s\"",
- opaque->btpo.level, blocknum, RelationGetRelationName(state->rel))));
+ if (!P_ISDELETED(opaque) || P_HAS_FULLXID(opaque))
+ {
+ /* Okay, no reason not to trust btpo_level field from page */
- if (!P_ISLEAF(opaque) && !P_ISDELETED(opaque) &&
- opaque->btpo.level == 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("invalid internal page level 0 for block %u in index \"%s\"",
- blocknum, RelationGetRelationName(state->rel))));
+ if (P_ISLEAF(opaque) && opaque->btpo_level != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("invalid leaf page level %u for block %u in index \"%s\"",
+ opaque->btpo_level, blocknum, RelationGetRelationName(state->rel))));
+
+ if (!P_ISLEAF(opaque) && opaque->btpo_level == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("invalid internal page level 0 for block %u in index \"%s\"",
+ blocknum, RelationGetRelationName(state->rel))));
+ }
/*
* Sanity checks for number of items on page.
@@ -3064,7 +3070,8 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
* from version 9.4 on, so do the same here. See _bt_pagedel() for full
* details.
*
- * Internal pages should never have garbage items, either.
+ * Also check that internal pages have no garbage items, and that no page
+ * has an invalid combination of page deletion related page level flags.
*/
if (!P_ISLEAF(opaque) && P_ISHALFDEAD(opaque))
ereport(ERROR,
@@ -3079,6 +3086,18 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
errmsg("internal page block %u in index \"%s\" has garbage items",
blocknum, RelationGetRelationName(state->rel))));
+ if (P_HAS_FULLXID(opaque) && !P_ISDELETED(opaque))
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("full transaction id page flag appears in non-deleted block %u in index \"%s\"",
+ blocknum, RelationGetRelationName(state->rel))));
+
+ if (P_ISDELETED(opaque) && P_ISHALFDEAD(opaque))
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("deleted page block %u in index \"%s\" is half-dead",
+ blocknum, RelationGetRelationName(state->rel))));
+
return page;
}
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8bb180bbbe..dfac1a9716 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -75,11 +75,7 @@ typedef struct BTPageStat
/* opaque data */
BlockNumber btpo_prev;
BlockNumber btpo_next;
- union
- {
- uint32 level;
- TransactionId xact;
- } btpo;
+ uint32 btpo_level;
uint16 btpo_flags;
BTCycleId btpo_cycleid;
} BTPageStat;
@@ -112,9 +108,33 @@ GetBTPageStatistics(BlockNumber blkno, Buffer buffer, BTPageStat *stat)
/* page type (flags) */
if (P_ISDELETED(opaque))
{
- stat->type = 'd';
- stat->btpo.xact = opaque->btpo.xact;
- return;
+ /* We divide deleted pages into leaf ('d') or internal ('D') */
+ if (P_ISLEAF(opaque) || !P_HAS_FULLXID(opaque))
+ stat->type = 'd';
+ else
+ stat->type = 'D';
+
+ /*
+ * Report safexid in a deleted page.
+ *
+ * Handle pg_upgrade'd deleted pages that used the previous safexid
+ * representation in btpo_level field (this used to be a union type
+ * called "bpto").
+ */
+ if (P_HAS_FULLXID(opaque))
+ {
+ FullTransactionId safexid = BTPageGetDeleteXid(page);
+
+ elog(NOTICE, "deleted page from block %u has safexid %u:%u",
+ blkno, EpochFromFullTransactionId(safexid),
+ XidFromFullTransactionId(safexid));
+ }
+ else
+ elog(NOTICE, "deleted page from block %u has safexid %u",
+ blkno, opaque->btpo_level);
+
+ /* Don't interpret BTDeletedPageContents as index tuples */
+ maxoff = InvalidOffsetNumber;
}
else if (P_IGNORE(opaque))
stat->type = 'e';
@@ -128,7 +148,7 @@ GetBTPageStatistics(BlockNumber blkno, Buffer buffer, BTPageStat *stat)
/* btpage opaque data */
stat->btpo_prev = opaque->btpo_prev;
stat->btpo_next = opaque->btpo_next;
- stat->btpo.level = opaque->btpo.level;
+ stat->btpo_level = opaque->btpo_level;
stat->btpo_flags = opaque->btpo_flags;
stat->btpo_cycleid = opaque->btpo_cycleid;
@@ -237,7 +257,8 @@ bt_page_stats_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
values[j++] = psprintf("%u", stat.free_size);
values[j++] = psprintf("%u", stat.btpo_prev);
values[j++] = psprintf("%u", stat.btpo_next);
- values[j++] = psprintf("%u", (stat.type == 'd') ? stat.btpo.xact : stat.btpo.level);
+ /* The "btpo" field now only stores btpo_level, never an xact */
+ values[j++] = psprintf("%u", stat.btpo_level);
values[j++] = psprintf("%d", stat.btpo_flags);
tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
@@ -503,10 +524,14 @@ bt_page_items_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
opaque = (BTPageOpaque) PageGetSpecialPointer(uargs->page);
- if (P_ISDELETED(opaque))
- elog(NOTICE, "page is deleted");
-
- fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ if (!P_ISDELETED(opaque))
+ fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ else
+ {
+ /* Don't interpret BTDeletedPageContents as index tuples */
+ elog(NOTICE, "page from block " INT64_FORMAT " is deleted", blkno);
+ fctx->max_calls = 0;
+ }
uargs->leafpage = P_ISLEAF(opaque);
uargs->rightmost = P_RIGHTMOST(opaque);
@@ -603,7 +628,14 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (P_ISDELETED(opaque))
elog(NOTICE, "page is deleted");
- fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ if (!P_ISDELETED(opaque))
+ fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ else
+ {
+ /* Don't interpret BTDeletedPageContents as index tuples */
+ elog(NOTICE, "page from block is deleted");
+ fctx->max_calls = 0;
+ }
uargs->leafpage = P_ISLEAF(opaque);
uargs->rightmost = P_RIGHTMOST(opaque);
@@ -723,7 +755,8 @@ bt_metap(PG_FUNCTION_ARGS)
*/
if (metad->btm_version >= BTREE_NOVAC_VERSION)
{
- values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
+ /* XXX: btm_last_cleanup_num_delpages used to be btm_oldest_btpo_xact */
+ values[j++] = psprintf("%u", metad->btm_last_cleanup_num_delpages);
values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
values[j++] = metad->btm_allequalimage ? "t" : "f";
}
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index b1ce0d77d7..5368bb30f0 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -283,8 +283,12 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
page = BufferGetPage(buffer);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- /* Determine page type, and update totals */
-
+ /*
+ * Determine page type, and update totals.
+ *
+ * Note that we arbitrarily bucket deleted pages together without
+ * considering if they're leaf pages or internal pages.
+ */
if (P_ISDELETED(opaque))
indexStat.deleted_pages++;
else if (P_IGNORE(opaque))
--
2.27.0
Attachment: v3-0002-Add-pages_newly_deleted-to-VACUUM-VERBOSE.patch
From 138cbd7810fb4967e480bb49c06524c4e2f18fb5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 7 Feb 2021 19:24:03 -0800
Subject: [PATCH v3 2/2] Add pages_newly_deleted to VACUUM VERBOSE.
pages_newly_deleted reports on the number of pages deleted by the
current VACUUM operation. The pages_deleted field continues to report
on the total number of deleted pages in the index (as well as pages that
are recyclable due to being zeroed in rare cases), without regard to
whether or not this VACUUM operation deleted them.
---
src/include/access/genam.h | 12 ++++----
src/include/access/nbtree.h | 3 +-
src/backend/access/gin/ginvacuum.c | 1 +
src/backend/access/gist/gistvacuum.c | 4 ++-
src/backend/access/heap/vacuumlazy.c | 6 ++--
src/backend/access/nbtree/nbtpage.c | 43 ++++++++++++++++++---------
src/backend/access/nbtree/nbtree.c | 20 +++++++++----
src/backend/access/spgist/spgvacuum.c | 1 +
8 files changed, 61 insertions(+), 29 deletions(-)
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 0eab1508d3..d29c0a8cbb 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -64,10 +64,11 @@ typedef struct IndexVacuumInfo
* to communicate additional private data to amvacuumcleanup.
*
* Note: pages_removed is the amount by which the index physically shrank,
- * if any (ie the change in its total size on disk). pages_deleted and
- * pages_free refer to free space within the index file. Some index AMs
- * may compute num_index_tuples by reference to num_heap_tuples, in which
- * case they should copy the estimated_count field from IndexVacuumInfo.
+ * if any (ie the change in its total size on disk). pages_deleted,
+ * pages_newly_deleted, and pages_free refer to free space within the index
+ * file. Some index AMs may compute num_index_tuples by reference to
+ * num_heap_tuples, in which case they should copy the estimated_count field
+ * from IndexVacuumInfo.
*/
typedef struct IndexBulkDeleteResult
{
@@ -76,7 +77,8 @@ typedef struct IndexBulkDeleteResult
bool estimated_count; /* num_index_tuples is an estimate */
double num_index_tuples; /* tuples remaining */
double tuples_removed; /* # removed during vacuum operation */
- BlockNumber pages_deleted; /* # unused pages in index */
+ BlockNumber pages_deleted; /* # pages marked deleted (could be by us) */
+ BlockNumber pages_newly_deleted; /* # pages marked deleted by us */
BlockNumber pages_free; /* # pages available for reuse */
} IndexBulkDeleteResult;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 37d895371a..344fa59092 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1143,7 +1143,8 @@ extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
Relation heapRel,
TM_IndexDeleteOp *delstate);
-extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf);
+extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf,
+ BlockNumber *ndeletedcount);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 35b85a9bff..7504f57a03 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -232,6 +232,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
END_CRIT_SECTION();
gvs->result->pages_deleted++;
+ gvs->result->pages_newly_deleted++;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 94a7e12763..14023e08a6 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -139,6 +139,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
stats->estimated_count = false;
stats->num_index_tuples = 0;
stats->pages_deleted = 0;
+ stats->pages_newly_deleted = 0;
stats->pages_free = 0;
/*
@@ -281,8 +282,8 @@ restart:
{
/* Okay to recycle this page */
RecordFreeIndexPage(rel, blkno);
- vstate->stats->pages_free++;
vstate->stats->pages_deleted++;
+ vstate->stats->pages_free++;
}
else if (GistPageIsDeleted(page))
{
@@ -640,6 +641,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
MarkBufferDirty(leafBuffer);
GistPageSetDeleted(leafPage, txid);
stats->pages_deleted++;
+ stats->pages_newly_deleted++;
/* remove the downlink from the parent */
MarkBufferDirty(parentBuffer);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad..addf243e40 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2521,10 +2521,12 @@ lazy_cleanup_index(Relation indrel,
(*stats)->num_index_tuples,
(*stats)->num_pages),
errdetail("%.0f index row versions were removed.\n"
- "%u index pages have been deleted, %u are currently reusable.\n"
+ "%u index pages have been deleted, %u are newly deleted, %u are currently reusable.\n"
"%s.",
(*stats)->tuples_removed,
- (*stats)->pages_deleted, (*stats)->pages_free,
+ (*stats)->pages_deleted,
+ (*stats)->pages_newly_deleted,
+ (*stats)->pages_free,
pg_rusage_show(&ru0))));
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f748e72539..2e86d64432 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -50,7 +50,7 @@ static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
BlockNumber scanblkno,
bool *rightsib_empty,
- uint32 *ndeleted);
+ BlockNumber *ndeletedcount);
static bool _bt_lock_subtree_parent(Relation rel, BlockNumber child,
BTStack stack,
Buffer *subtreeparent,
@@ -1813,18 +1813,31 @@ _bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
* should never pass a buffer containing an existing deleted page here. The
* lock and pin on caller's buffer will be dropped before we return.
*
- * Returns the number of pages successfully deleted (zero if page cannot
- * be deleted now; could be more than one if parent or right sibling pages
- * were deleted too). Note that this does not include pages that we delete
- * that the btvacuumscan scan has yet to reach; they'll get counted later
- * instead.
+ * Returns the number of pages successfully physically deleted (zero if page
+ * cannot be deleted now; could be more than one if parent or right sibling
+ * pages were deleted too). Caller uses return value to maintain bulk stats'
+ * pages_newly_deleted value.
+ *
+ * Maintains *ndeletedcount for caller, which is ultimately used as the
+ * pages_deleted value in bulk delete stats for entire VACUUM. When we're
+ * called *ndeletedcount is the current total count of pages deleted in the
+ * index prior to current scanblkno block/position in btvacuumscan. This
+ * includes existing deleted pages (pages deleted by a previous VACUUM), and
+ * pages that we delete during current VACUUM. We need to cooperate closely
+ * with caller here so that whole VACUUM operation reliably avoids any double
+ * counting of subsidiary-to-leafbuf pages that we delete in passing. If such
+ * pages happen to be from a block number that is ahead of the current
+ * scanblkno position, then caller is expected to count them directly later
+ * on. It's simpler for us to understand caller's requirements than it would
+ * be for caller to understand when or how a deleted page became deleted after
+ * the fact.
*
* NOTE: this leaks memory. Rather than trying to clean up everything
* carefully, it's better to run it in a temp context that can be reset
* frequently.
*/
uint32
-_bt_pagedel(Relation rel, Buffer leafbuf)
+_bt_pagedel(Relation rel, Buffer leafbuf, BlockNumber *ndeletedcount)
{
uint32 ndeleted = 0;
BlockNumber rightsib;
@@ -1834,7 +1847,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
/*
* Save original leafbuf block number from caller. Only deleted blocks
- * that are <= scanblkno get counted in ndeleted return value.
+ * that are <= scanblkno are accounted for by *ndeletedcount.
*/
BlockNumber scanblkno = BufferGetBlockNumber(leafbuf);
@@ -2032,7 +2045,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
{
/* Check for interrupts in _bt_unlink_halfdead_page */
if (!_bt_unlink_halfdead_page(rel, leafbuf, scanblkno,
- &rightsib_empty, &ndeleted))
+ &rightsib_empty, ndeletedcount))
{
/*
* _bt_unlink_halfdead_page should never fail, since we
@@ -2045,6 +2058,8 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
Assert(false);
return ndeleted;
}
+ ndeleted++;
+ /* _bt_unlink_halfdead_page probably incremented ndeletedcount */
}
Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque) &&
@@ -2316,7 +2331,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
*/
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
- bool *rightsib_empty, uint32 *ndeleted)
+ bool *rightsib_empty, BlockNumber *ndeletedcount)
{
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
BlockNumber leafleftsib;
@@ -2726,12 +2741,12 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_relbuf(rel, buf);
/*
- * If btvacuumscan won't revisit this page in a future btvacuumpage call
- * and count it as deleted then, we count it as deleted by current
- * btvacuumpage call
+ * Maintain *ndeletedcount, per _bt_pagedel() header comments. This is
+ * how _bt_pagedel() helps the entire VACUUM operation with maintaining
+ * pages_deleted field from the bulk delete stats.
*/
if (target <= scanblkno)
- (*ndeleted)++;
+ (*ndeletedcount)++;
return true;
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 12ad877b70..604538297d 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1001,6 +1001,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
stats->estimated_count = false;
stats->num_index_tuples = 0;
stats->pages_deleted = 0;
+ stats->pages_newly_deleted = 0;
stats->pages_free = 0;
/* Set up info to pass down to btvacuumpage */
@@ -1229,8 +1230,8 @@ backtrack:
else if (P_ISHALFDEAD(opaque))
{
/*
- * Half-dead leaf page. Try to delete now. Might update
- * pages_deleted below.
+ * Half-dead leaf page. Try to delete now. Might end up incrementing
+ * pages_newly_deleted/pages_deleted below.
*/
attempt_pagedel = true;
}
@@ -1442,12 +1443,19 @@ backtrack:
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
/*
- * We trust the _bt_pagedel return value because it does not include
- * any page that a future call here from btvacuumscan is expected to
- * count. There will be no double-counting.
+ * _bt_pagedel return value is simply the number of pages directly
+ * deleted on each call. This is just added to pages_newly_deleted,
+ * which counts the number of pages marked deleted in current VACUUM.
+ *
+ * We need to maintain pages_deleted more carefully here, though. We
+ * cannot just add the same _bt_pagedel return value to pages_deleted
+ * because that double-counts pages that are deleted within
+ * _bt_pagedel that will become scanblkno in a later call here from
+ * btvacuumscan.
*/
Assert(blkno == scanblkno);
- stats->pages_deleted += _bt_pagedel(rel, buf);
+ stats->pages_newly_deleted +=
+ _bt_pagedel(rel, buf, &stats->pages_deleted);
MemoryContextSwitchTo(oldcontext);
/* pagedel released buffer, so we shouldn't */
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0d02a02222..a9ffca5183 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -891,6 +891,7 @@ spgvacuumscan(spgBulkDeleteState *bds)
/* Report final stats */
bds->stats->num_pages = num_pages;
+ bds->stats->pages_newly_deleted = bds->stats->pages_deleted;
bds->stats->pages_free = bds->stats->pages_deleted;
}
--
2.27.0
On Wed, Feb 10, 2021 at 7:10 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v3 of the patch. I'll describe the changes I made in more
detail in my response to your points below.
I forgot to mention that v3 adds several assertions like this one:
Assert(!_bt_page_recyclable(BufferGetPage(buf)));
These appear at a few key points inside generic routines like
_bt_getbuf(). The overall effect is that every nbtree buffer access
(with the exception of buffer accesses by VACUUM) will make sure that
the page that they're about to access is not recyclable (a page that
an index scan lands on might be half-dead or deleted, but it had
better not be recyclable).
This can probably catch problems with recycling pages too early, such
as the problem fixed by commit d3abbbeb back in 2012. Any similar bugs
in this area that may appear in the future can be expected to be very
subtle, for a few reasons. For one, a page can be recyclable but not
yet entered into the FSM by VACUUM for a long time. (I could go on.)
The assertions dramatically improve our chances of catching problems
like that early.
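(For illustration, roughly where such an assertion sits in _bt_getbuf()'s
existing-block path -- a simplified sketch, not the verbatim v3 hunk:)

    /* Read an existing block of the relation */
    buf = ReadBuffer(rel, blkno);
    LockBuffer(buf, access);
    _bt_checkpage(rel, buf);
    /* new in v3: an index scan must never land on a recyclable page */
    Assert(!_bt_page_recyclable(BufferGetPage(buf)));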
--
Peter Geoghegan
On Thu, Feb 11, 2021 at 12:10 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Feb 10, 2021 at 2:20 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thank you for working on this!
I'm glad that I finally found time for it! It seems like it'll make
things easier elsewhere.

Attached is v3 of the patch. I'll describe the changes I made in more
detail in my response to your points below.

I agree that btm_oldest_btpo_xact will no longer be necessary in terms
of recycling deleted pages.

Cool.
Given that we can guarantee that deleted pages are never leaked by
using 64-bit XIDs, I also think we don't need this value. We can do
amvacuumcleanup only if the table receives enough insertions to update
the statistics (i.e., the vacuum_cleanup_index_scale_factor check). I
if there is even one deleted page that we can recycle is very
conservative.Considering your idea of keeping newly deleted pages in the meta page,
I can see a little value that keeping btm_oldest_btpo_xact and making
it 64-bit XID. I described below.Interesting.
I like this idea that triggers btvacuumscan() if there are many newly
deleted pages. I think this would be helpful especially for the case
of bulk-deletion on the table. But why do we use the number of *newly*
deleted pages but not the total number of deleted pages in the index?

I was unclear here -- I should not have said "newly deleted" pages at
all. What I actually do when calling _bt_vacuum_needs_cleanup() is
this (from v3, at the end of btvacuumscan()):

-       _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
+       Assert(stats->pages_deleted >= stats->pages_free);
+       pages_deleted_not_free = stats->pages_deleted - stats->pages_free;
+       _bt_update_meta_cleanup_info(rel, pages_deleted_not_free,
                                     info->num_heap_tuples);

We're actually passing something I have called
"pages_deleted_not_free" here, which is derived from the bulk delete
stats in the obvious way that you see here (subtraction). I'm not
using pages_newly_deleted at all now. Note also that the behavior
inside _bt_update_meta_cleanup_info() no longer varies based on
whether it is called during btvacuumcleanup() or during btbulkdelete()
-- the same rules apply either way. We want to store
pages_deleted_not_free in the metapage at the end of btvacuumscan(),
no matter what.

This same pages_deleted_not_free information is now used by
_bt_vacuum_needs_cleanup() in an obvious and simple way: if it's too
high (over 2.5%), then that will trigger a call to btvacuumscan() (we
won't skip scanning the index). Though in practice it probably won't
come up that often -- there just aren't ever that many deleted pages
in most indexes.
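(For readers skimming the thread, the check being described boils down to
something like the following -- a simplified sketch condensed from the v3
patch, not the verbatim function:)

    /* Sketch of the deleted-pages trigger in _bt_vacuum_needs_cleanup() */
    if (prev_pages_deleted_not_free >
        RelationGetNumberOfBlocks(info->index) / 40)    /* > 2.5% of index */
        return true;    /* do a cleanup-only btvacuumscan() call */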
Thanks for your explanation. That makes sense to me.
IIUC if several btbulkdelete executions deleted index pages less than
2% of the index and those deleted pages could not be recycled yet,
then the number of recyclable pages would exceed 2% of the index in
total but amvacuumcleanup() would not trigger btvacuumscan() because
the last newly deleted pages are less than the 2% threshold. I might
be missing something though.

I think you're right -- my idea of varying the behavior of
_bt_update_meta_cleanup_info() based on whether it's being called
during btvacuumcleanup() or during btbulkdelete() was a bad idea (FWIW
half the problem was that I explained the idea badly to begin with).
But, as I said, it's fixed in v3: we simply pass
"pages_deleted_not_free" as an argument to _bt_vacuum_needs_cleanup()
now.

Does that make sense? Does it address this concern?
Yes!
Also, we need to note that having newly deleted pages doesn't
necessarily mean they are all recyclable at that time. If the global
xmin is still older than a deleted page's btpo.xact value, we still
cannot recycle it. I think btm_oldest_btpo_xact will probably help in
this case. That is, we store the oldest btpo.xact among those newly
deleted pages in btm_oldest_btpo_xact, and we trigger btvacuumscan()
if there are many newly deleted pages (more than 2% of the index) and
btm_oldest_btpo_xact is older than the global xmin (I suppose each
newly deleted page could have a different btpo.xact).

I agree that having no XID in the metapage creates a new small
problem. Specifically, there are certain narrow cases that can cause
confusion in _bt_vacuum_needs_cleanup(). These cases didn't really
exist before my patch (kind of).

The simplest example is easy to run into when debugging the patch on
your laptop. Because you're using your personal laptop, and not a real
production server, there will be no concurrent sessions that might
consume XIDs. You can run VACUUM VERBOSE manually several times, but
that alone will never be enough to enable VACUUM to recycle any of the
pages that the first VACUUM manages to delete (pages that it marks
deleted and reports as "newly deleted" via the new instrumentation
from the second patch). Note that the master branch is *also* unable
to recycle these deleted pages, simply because the "safe xid" never
gets old because there are no newly allocated XIDs to make it look old
(there are no allocated XIDs just because nothing else happens). That
in itself is not the new problem.

The new problem is that _bt_vacuum_needs_cleanup() will no longer
notice that the oldest XID among deleted-but-not-yet-recycled pages is
so old that it will not be able to recycle the pages anyway -- at
least not the oldest page, though in this specific case that will
apply to all deleted pages equally. We might as well not bother trying
yet, which the old code "gets right" -- but it doesn't get it right
for any good reason. That is, the old code won't have VACUUM scan the
index at all, so it "wins" in this specific scenario.
I'm on the same page.
I think that's okay, though -- it's not a real problem, and actually
makes sense and has other advantages. This is why I believe it's okay:

* We really should never VACUUM the same table before even one or two
XIDs are allocated -- that's what happens in the simple laptop test
scenario that I described. Surely we should not be too concerned about
"doing the right thing" under this totally artificial set of
conditions.
Right.
(BTW, I've been using txid_current() for my own "laptop testing", as a
way to work around this issue.)

* More generally, if you really can't do recycling of pages that you
deleted during the last VACUUM during this VACUUM (perhaps because of
the presence of a long-running xact that holds open a snapshot), then
you have lots of *huge* problems already, and this is the least of
your concerns. Besides, at that point an affected VACUUM will be doing
work for an affected index through a btbulkdelete() call, so the
behavior of _bt_vacuum_needs_cleanup() becomes irrelevant.
I agree that there already are huge problems in that case. But I think
we need to consider an append-only case as well; after bulk deletion
on an append-only table, vacuum deletes heap tuples and index tuples,
marking some index pages as dead and setting an XID into btpo.xact.
Since we trigger autovacuums even by insertions based on
autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on
the table again. But if there is a long-running query a "wasted"
cleanup scan could happen many times depending on the values of
autovacuum_vacuum_insert_scale_factor/threshold and
vacuum_cleanup_index_scale_factor. This should not happen in the old
code. I agree this is DBA problem but it also means this could bring
another new problem in a long-running query case.
* As you pointed out already, the oldest XID/deleted page from the
index may be significantly older than the newest. Why should we bucket
them together?
I agree with this point.
We could easily have a case where most of the deleted pages can be
recycled -- even when all of the pages were originally marked deleted by
the same VACUUM operation. If there are lots of pages that actually
can be recycled, it is probably a bad thing to assume that the oldest
XID is representative of all of them. After all, with the patch we
only go out of our way to recycle deleted pages when we are almost
sure that the total number of recyclable pages (pages marked deleted
during a previous VACUUM) exceeds 2.5% of the total size of the index.
That broad constraint is important here -- if we do nothing unless
there are lots of deleted pages anyway, we are highly unlikely to ever
err on the side of being too eager (not eager enough seems more likely
to me).

I think that we're justified in making a general assumption inside
_bt_vacuum_needs_cleanup() (which is documented at the point that we
call it, inside btvacuumscan()): The assumption that however many
index pages the metapage says we'll be able to recycle (whatever the
field says) will in fact turn out to be recyclable if we decide that
we need to. There are specific cases where that will be kind of wrong,
as I've gone into, but the assumption/design has many more advantages
than disadvantages.

I have tried to capture this in v3 of the patch. Can you take a look?
See the new comments inside _bt_vacuum_needs_cleanup(). Plus the
comments when we call it inside btvacuumscan().
I basically agree with the change made in the v3 patch. But I think it's
probably worth having a discussion on append-only table cases with
autovacuums triggered by
autovacuum_vacuum_insert_scale_factor/threshold.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Fri, Feb 12, 2021 at 8:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I agree that there already are huge problems in that case. But I think
we need to consider an append-only case as well; after bulk deletion
on an append-only table, vacuum deletes heap tuples and index tuples,
marking some index pages as dead and setting an XID into btpo.xact.
Since we trigger autovacuums even by insertions based on
autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on
the table again. But if there is a long-running query a "wasted"
cleanup scan could happen many times depending on the values of
autovacuum_vacuum_insert_scale_factor/threshold and
vacuum_cleanup_index_scale_factor. This should not happen in the old
code. I agree this is DBA problem but it also means this could bring
another new problem in a long-running query case.
I see your point.
This will only not be a problem with the old code because the oldest
XID in the metapage happens to restrict VACUUM in a way that turns out
to be exactly perfect. But why assume that? It's actually rather unlikely
that we won't be able to free even one block, even in this scenario.
The oldest XID isn't truly special -- at least not without the
restrictions that go with 32-bit XIDs.
The other thing is that vacuum_cleanup_index_scale_factor is mostly
about limiting how long we'll go before having stale statistics, and
so presumably the user gets the benefit of not having stale statistics
(maybe that theory is a bit questionable in some cases, but that
doesn't have all that much to do with page deletion -- in fact the
problem exists without page deletion ever occurring).
BTW, I am thinking about making recycling take place for pages that
were deleted during the same VACUUM. We can just use a
work_mem-limited array to remember a list of blocks that are deleted
but not yet recyclable (plus the XID found in the block). At the end
of the VACUUM (just before calling IndexFreeSpaceMapVacuum() from
within btvacuumscan()), we can then determine which blocks are now
safe to recycle, and recycle them after all using some "late" calls to
RecordFreeIndexPage() (and without revisiting the pages a second
time). No need to wait for the next VACUUM to recycle pages this way,
at least in many common cases. The reality is that it usually doesn't
take very long for a deleted page to become recyclable -- why wait?
This idea is enabled by commit c79f6df75dd from 2018. I think it's the
next logical step.
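(A hypothetical sketch of what the end-of-scan recycling pass over such an
array might look like; the type and function names here are invented for
illustration and do not appear in the posted v3 patches. Assumes the usual
nbtree/procarray/indexfsm includes.)

    typedef struct BTPendingRecycle
    {
        BlockNumber       blkno;    /* page we deleted but could not recycle */
        FullTransactionId safexid;  /* must be visible to all before reuse */
    } BTPendingRecycle;

    /* Hypothetical: run at the end of btvacuumscan(), just before
     * IndexFreeSpaceMapVacuum() */
    static void
    _bt_recycle_deferred(Relation rel, BTPendingRecycle *pending, int npending)
    {
        for (int i = 0; i < npending; i++)
        {
            /* Same safety test that a later VACUUM would have applied */
            if (GlobalVisCheckRemovableFullXid(NULL, pending[i].safexid))
                RecordFreeIndexPage(rel, pending[i].blkno);
        }
    }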
--
Peter Geoghegan
Sat, 13 Feb 2021 at 05:39, Masahiko Sawada <sawada.mshk@gmail.com>:
(BTW, I've been using txid_current() for my own "laptop testing", as a
way to work around this issue.)

* More generally, if you really can't do recycling of pages that you
deleted during the last VACUUM during this VACUUM (perhaps because of
the presence of a long-running xact that holds open a snapshot), then
you have lots of *huge* problems already, and this is the least of
your concerns. Besides, at that point an affected VACUUM will be doing
work for an affected index through a btbulkdelete() call, so the
behavior of _bt_vacuum_needs_cleanup() becomes irrelevant.

I agree that there already are huge problems in that case. But I think
we need to consider an append-only case as well; after bulk deletion
on an append-only table, vacuum deletes heap tuples and index tuples,
marking some index pages as dead and setting an XID into btpo.xact.
Since we trigger autovacuums even by insertions based on
autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on
the table again. But if there is a long-running query a "wasted"
cleanup scan could happen many times depending on the values of
autovacuum_vacuum_insert_scale_factor/threshold and
vacuum_cleanup_index_scale_factor. This should not happen in the old
code. I agree this is DBA problem but it also means this could bring
another new problem in a long-running query case.
I'd like to outline one relevant case.
Quite often bulk deletes are done on time series data (the oldest) and
effectively remove a continuous chunk of data at the (physical)
beginning of the table; this is especially true for append-only tables.
After the delete, planning queries takes a long time, because the
MergeJoin estimates use IndexScans (see
/messages/by-id/17467.1426090533@sss.pgh.pa.us ).
Right now we have to disable MergeJoins via ALTER SYSTEM to mitigate
this.
So I would, actually, like it very much for VACUUM to kick in sooner in
such cases.
--
Victor Yegorov
On Fri, Feb 12, 2021 at 10:27 PM Victor Yegorov <vyegorov@gmail.com> wrote:
I'd like to outline one relevant case.
Quite often bulk deletes are done on time series data (the oldest) and effectively
remove a continuous chunk of data at the (physical) beginning of the table;
this is especially true for append-only tables.
After the delete, planning queries takes a long time, because the MergeJoin estimates
use IndexScans (see /messages/by-id/17467.1426090533@sss.pgh.pa.us ).
Right now we have to disable MergeJoins via ALTER SYSTEM to mitigate this.

So I would, actually, like it very much for VACUUM to kick in sooner in such cases.
Masahiko was specifically concerned about workloads with
bursty/uneven/mixed VACUUM triggering conditions -- he mentioned
autovacuum_vacuum_insert_scale_factor/threshold as being applied to
trigger a second VACUUM (which follows from an initial VACUUM that
performs deletions following a bulk DELETE).
A VACUUM that needs to delete index tuples will do its btvacuumscan()
through the btbulkdelete() path, not through the btvacuumcleanup()
"cleanup only" path. The btbulkdelete() path won't ever call
_bt_vacuum_needs_cleanup() in the first place, and so there can be no
risk that the relevant changes (changes that the patch makes to that
function) will have some new bad effect. The problem that you have
described seems very real, but it doesn't seem relevant to the
specific scenario that Masahiko expressed concern about. Nor does it
seem relevant to this patch more generally.
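(For context, the control flow being referred to looks roughly like this --
a simplified sketch of btvacuumcleanup(), not a verbatim excerpt:)

    IndexBulkDeleteResult *
    btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
    {
        /* No-op in ANALYZE ONLY mode */
        if (info->analyze_only)
            return stats;

        if (stats == NULL)
        {
            /* btbulkdelete() was never called: decide whether to scan anyway */
            if (!_bt_vacuum_needs_cleanup(info))
                return NULL;

            stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
            btvacuumscan(info, stats, NULL, NULL, 0);
        }

        /* ... statistics handling elided ... */
        return stats;
    }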
--
Peter Geoghegan
On Fri, Feb 12, 2021 at 9:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Feb 12, 2021 at 8:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I agree that there already are huge problems in that case. But I think
we need to consider an append-only case as well; after bulk deletion
on an append-only table, vacuum deletes heap tuples and index tuples,
marking some index pages as dead and setting an XID into btpo.xact.
Since we trigger autovacuums even by insertions based on
autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on
the table again. But if there is a long-running query a "wasted"
cleanup scan could happen many times depending on the values of
autovacuum_vacuum_insert_scale_factor/threshold and
vacuum_cleanup_index_scale_factor. This should not happen in the old
code. I agree this is DBA problem but it also means this could bring
another new problem in a long-running query case.

I see your point.
My guess is that this concern of yours is somehow related to how we do
deletion and recycling *in general*. Currently (and even in v3 of the
patch), we assume that recycling the pages that a VACUUM operation
deletes will happen "eventually". This kind of makes sense when you
have "typical vacuuming" -- deletes/updates, and no big bursts, rare
bulk deletes, etc.
But when you do have a mixture of different triggering conditions,
which is quite possible, it is difficult to understand what
"eventually" actually means...
BTW, I am thinking about making recycling take place for pages that
were deleted during the same VACUUM. We can just use a
work_mem-limited array to remember a list of blocks that are deleted
but not yet recyclable (plus the XID found in the block).
...which brings me back to this idea.
I've prototyped this. It works really well. In most cases the
prototype makes VACUUM operations with nbtree index page deletions
also recycle the pages that were deleted, at the end of the
btvacuumscan(). We do very little or no "indefinite deferring" work
here. This has obvious advantages, of course, but it also has a
non-obvious advantage: the awkward question of concerning "what
eventually actually means" with mixed triggering conditions over time
mostly goes away. So perhaps this actually addresses your concern,
Masahiko.
I've been testing this with BenchmarkSQL [1], which has several
indexes that regularly need page deletions. There is also a realistic
"life cycle" to the data in these indexes. I added custom
instrumentation to display information about what's going on with page
deletion when the benchmark is run. I wrote a quick-and-dirty patch
that makes log_autovacuum show the same information that you see about
index page deletion when VACUUM VERBOSE is run (including the new
pages_newly_deleted field from my patch). With this particular
TPC-C/BenchmarkSQL workload, VACUUM seems to consistently manage to go
on to place every page that it deletes in the FSM without leaving
anything to the next VACUUM. There are a very small number of
exceptions where we "only" manage to recycle maybe 95% of the pages
that were deleted.
The race condition that nbtree avoids by deferring recycling was
always a narrow one, outside of the extremes -- the way we defer has
always been overkill. It's almost always unnecessary to delay placing
deleted pages in the FSM until the *next* VACUUM. We only have to
delay it until the end of the *same* VACUUM -- why wait until the next
VACUUM if we don't have to? In general this deferring recycling
business has nothing to do with MVCC/GC/whatever, and yet the code
seems to suggest that it does. While it is convenient to use an XID
for page deletion and recycling as a way of implementing what Lanin &
Shasha call "the drain technique" [2], all we have to do is prevent
certain race conditions. This is all about the index itself, the data
structure, how it is maintained -- nothing more. It almost seems
obvious to me.
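Spelled out as a toy (invented names; in the patch the stamp comes from
ReadNextFullTransactionId() and the later check is
GlobalVisCheckRemovableFullXid()): the page is stamped at unlink time
with an upper bound on any XID that could still hold a stale link to
it, and it may be handed back to the FSM only once the oldest running
XID has moved past that stamp.

#include <stdbool.h>
#include <stdint.h>

/* Stand-in for ReadNextFullTransactionId() at the moment of unlinking */
static uint64_t next_full_xid = 1000;

/* Called at unlink time: record the page's safexid */
static uint64_t
stamp_deleted_page(void)
{
    /* any scan still holding a stale link has xmin <= this value */
    return next_full_xid;
}

/* Called later: may the stamped page be put in the FSM and reused? */
static bool
can_recycle(uint64_t safexid, uint64_t oldest_running_full_xid)
{
    /*
     * If every running transaction began after safexid, no scan that saw
     * the old link can still be in flight, so reusing the page cannot
     * confuse anyone.
     */
    return safexid < oldest_running_full_xid;
}

int
main(void)
{
    uint64_t safexid = stamp_deleted_page();

    (void) can_recycle(safexid, 1000);  /* false: such a scan may still run */
    (void) can_recycle(safexid, 1021);  /* true: the horizon moved past it */
    return 0;
}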
It's still possible to imagine extremes. Extremes that even the "try
to recycle pages we ourselves deleted when we reach the end of
btvacuumscan()" version of my patch cannot deal with. Maybe it really
is true that it's inherently impossible to recycle a deleted page even
at the end of a VACUUM -- maybe a long-running transaction (that could
in principle have a stale link to our deleted page) starts before we
VACUUM, and lasts after VACUUM finishes. So it's just not safe. When
that happens, we're back to having the original problem: we're relying
on some *future* VACUUM operation to do that for us at some indefinite
point in the future. It's fair to wonder: What are the implications of
that? Are we not back to square one? Don't we have the same "what does
'eventually' really mean" problem once again?
I think that that's okay, because this remaining case is a *truly*
extreme case (especially with a large index, where index vacuuming
will naturally take a long time).
It will be rare. But more importantly, the fact that this scenario is now
an extreme case justifies treating it as an extreme case. We can teach
_bt_vacuum_needs_cleanup() to recognize it as an extreme case, too. In
particular, I think that it will now be okay to increase the threshold
applied when considering deleted pages inside
_bt_vacuum_needs_cleanup(). It was 2.5% of the index size in v3 of the
patch. But in v4, which has the new recycling enhancement, I think
that it would be sensible to make it 5%, or maybe even 10%. This
naturally makes Masahiko's problem scenario unlikely to actually
result in a truly wasted call to btvacuumscan(). The number of pages
that the metapage indicates are "deleted but not yet placed in the
FSM" will be close to the theoretical minimum, because we're no longer
naively throwing away information about which specific pages will be
recyclable soon, which is what the current approach does, really.
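For a sense of scale (illustrative numbers; deleted_pages_trigger is an
invented name, but the comparison mirrors the patch's
num_index_pages / 20 check for the 5% figure): a 10,000 page index
would only get a cleanup-only scan on this basis once more than 500
deleted-but-not-yet-recycled pages are recorded in the metapage.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative form of the deleted-pages trigger: during a cleanup-only
 * VACUUM, scan the index once deleted-but-not-recycled pages exceed 5% of
 * the index, i.e. num_index_pages / 20.
 */
static bool
deleted_pages_trigger(uint32_t num_delpages, uint32_t num_index_pages)
{
    return num_delpages > num_index_pages / 20;
}

int
main(void)
{
    /* Hypothetical 10,000 page index */
    printf("%d\n", deleted_pages_trigger(400, 10000));  /* 0: below 5% */
    printf("%d\n", deleted_pages_trigger(501, 10000));  /* 1: above 5% */
    return 0;
}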
[1]: https://github.com/wieck/benchmarksql
[2]: https://archive.org/stream/symmetricconcurr00lani#page/8/mode/2up -- see "2.5 Freeing Empty Nodes"
--
Peter Geoghegan
On Sat, Feb 13, 2021 at 10:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
It will be rare. But more importantly, the fact that this scenario is now
an extreme case justifies treating it as an extreme case. We can teach
_bt_vacuum_needs_cleanup() to recognize it as an extreme case, too. In
particular, I think that it will now be okay to increase the threshold
applied when considering deleted pages inside
_bt_vacuum_needs_cleanup(). It was 2.5% of the index size in v3 of the
patch. But in v4, which has the new recycling enhancement, I think
that it would be sensible to make it 5%, or maybe even 10%. This
naturally makes Masahiko's problem scenario unlikely to actually
result in a truly wasted call to btvacuumscan().
Attached is v4, which has the "recycle pages that we ourselves deleted
during this same VACUUM operation" enhancement. It also doubles the
_bt_vacuum_needs_cleanup() threshold applied to deleted pages -- it
goes from 2.5% to 5%. The new patch in the patch series (v4-0002-*)
certainly needs more polishing. I'm posting what I have now because v3
has bitrot.
Benchmarking has shown that the enhancement in v4-0002-* can
significantly reduce the amount of index bloat in two of the
BenchmarkSQL/TPC-C indexes.
--
Peter Geoghegan
Attachments:
v4-0002-Recycle-pages-deleted-during-same-VACUUM.patch (application/octet-stream)
From 4f8076e5dd749d98eb74d0dfa6cfb127ded1dba0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 7 Feb 2021 19:24:03 -0800
Subject: [PATCH v4 2/3] Recycle pages deleted during same VACUUM.
TODO: Respect work_mem in temporary space that remembers the details of
pages deleted during current VACUUM.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wzk76_P=67iUscb1UN44-gyZL-KgpsXbSxq_bdcMa7Q+wQ@mail.gmail.com
---
src/include/access/nbtree.h | 25 ++++++++++-
src/backend/access/nbtree/nbtpage.c | 61 +++++++++++++++----------
src/backend/access/nbtree/nbtree.c | 69 ++++++++++++++++++++++-------
3 files changed, 115 insertions(+), 40 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 44d176255a..0e81ff8356 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -272,6 +272,29 @@ BTPageGetDeleteXid(Page page)
return contents->safexid;
}
+/*
+ * BTVacState is nbtree.c state used during VACUUM. It is exported for use by
+ * page deletion related code in nbtpage.c.
+ */
+typedef struct BTPendingRecycle
+{
+ BlockNumber blkno;
+ FullTransactionId safexid;
+} BTPendingRecycle;
+
+typedef struct BTVacState
+{
+ IndexVacuumInfo *info;
+ IndexBulkDeleteResult *stats;
+ IndexBulkDeleteCallback callback;
+ void *callback_state;
+ BTCycleId cycleid;
+ MemoryContext pagedelcontext;
+ BTPendingRecycle *deleted;
+ uint32 sizedeleted;
+ uint32 ndeleted;
+} BTVacState;
+
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
* page. The high key is not a tuple that is used to visit the heap. It is
@@ -1142,7 +1165,7 @@ extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
Relation heapRel,
TM_IndexDeleteOp *delstate);
-extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf);
+extern void _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index e0d1e585f6..408230cb67 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -50,7 +50,7 @@ static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
BlockNumber scanblkno,
bool *rightsib_empty,
- uint32 *ndeleted);
+ BTVacState *vstate);
static bool _bt_lock_subtree_parent(Relation rel, BlockNumber child,
BTStack stack,
Buffer *subtreeparent,
@@ -1763,20 +1763,22 @@ _bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
* should never pass a buffer containing an existing deleted page here. The
* lock and pin on caller's buffer will be dropped before we return.
*
- * Returns the number of pages successfully deleted (zero if page cannot
- * be deleted now; could be more than one if parent or right sibling pages
- * were deleted too). Note that this does not include pages that we delete
- * that the btvacuumscan scan has yet to reach; they'll get counted later
- * instead.
+ * Maintains bulk delete stats for caller, which are taken from vstate. We
+ * need to cooperate closely with caller here so that whole VACUUM operation
+ * reliably avoids any double counting of subsidiary-to-leafbuf pages that we
+ * delete in passing. If such pages happen to be from a block number that is
+ * ahead of the current scanblkno position, then caller is expected to count
+ * them directly later on. It's simpler for us to understand caller's
+ * requirements than it would be for caller to understand when or how a
+ * deleted page became deleted after the fact.
*
* NOTE: this leaks memory. Rather than trying to clean up everything
* carefully, it's better to run it in a temp context that can be reset
* frequently.
*/
-uint32
-_bt_pagedel(Relation rel, Buffer leafbuf)
+void
+_bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
{
- uint32 ndeleted = 0;
BlockNumber rightsib;
bool rightsib_empty;
Page page;
@@ -1784,7 +1786,8 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
/*
* Save original leafbuf block number from caller. Only deleted blocks
- * that are <= scanblkno get counted in ndeleted return value.
+ * that are <= scanblkno are added to bulk delete stat's pages_deleted
+ * count.
*/
BlockNumber scanblkno = BufferGetBlockNumber(leafbuf);
@@ -1846,7 +1849,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
RelationGetRelationName(rel))));
_bt_relbuf(rel, leafbuf);
- return ndeleted;
+ return;
}
/*
@@ -1876,7 +1879,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
Assert(!P_ISHALFDEAD(opaque));
_bt_relbuf(rel, leafbuf);
- return ndeleted;
+ return;
}
/*
@@ -1925,8 +1928,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
if (_bt_leftsib_splitflag(rel, leftsib, leafblkno))
{
ReleaseBuffer(leafbuf);
- Assert(ndeleted == 0);
- return ndeleted;
+ return;
}
/* we need an insertion scan key for the search, so build one */
@@ -1967,7 +1969,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
if (!_bt_mark_page_halfdead(rel, leafbuf, stack))
{
_bt_relbuf(rel, leafbuf);
- return ndeleted;
+ return;
}
}
@@ -1982,7 +1984,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
{
/* Check for interrupts in _bt_unlink_halfdead_page */
if (!_bt_unlink_halfdead_page(rel, leafbuf, scanblkno,
- &rightsib_empty, &ndeleted))
+ &rightsib_empty, vstate))
{
/*
* _bt_unlink_halfdead_page should never fail, since we
@@ -1993,7 +1995,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
* lock and pin on leafbuf for us.
*/
Assert(false);
- return ndeleted;
+ return;
}
}
@@ -2030,8 +2032,6 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
leafbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
}
-
- return ndeleted;
}
/*
@@ -2266,9 +2266,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
*/
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
- bool *rightsib_empty, uint32 *ndeleted)
+ bool *rightsib_empty, BTVacState *vstate)
{
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
+ IndexBulkDeleteResult *stats = vstate->stats;
BlockNumber leafleftsib;
BlockNumber leafrightsib;
BlockNumber target;
@@ -2676,12 +2677,24 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_relbuf(rel, buf);
/*
- * If btvacuumscan won't revisit this page in a future btvacuumpage call
- * and count it as deleted then, we count it as deleted by current
- * btvacuumpage call
+ * Maintain pages_deleted in a way that takes into account how
+ * btvacuumpage() will count deleted pages that have yet to become
+ * scanblkno -- only count page when it's not going to get that treatment
+ * later on.
*/
if (target <= scanblkno)
- (*ndeleted)++;
+ stats->pages_deleted++;
+
+ if (vstate->ndeleted >= vstate->sizedeleted)
+ {
+ vstate->sizedeleted *= 2;
+ vstate->deleted =
+ repalloc(vstate->deleted,
+ sizeof(BTPendingRecycle) * vstate->sizedeleted);
+ }
+ vstate->deleted[vstate->ndeleted].blkno = target;
+ vstate->deleted[vstate->ndeleted].safexid = safexid;
+ vstate->ndeleted++;
return true;
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1e6da9439f..c3b32bb71c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -32,23 +32,13 @@
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/index_selfuncs.h"
#include "utils/memutils.h"
-/* Working state needed by btvacuumpage */
-typedef struct
-{
- IndexVacuumInfo *info;
- IndexBulkDeleteResult *stats;
- IndexBulkDeleteCallback callback;
- void *callback_state;
- BTCycleId cycleid;
- MemoryContext pagedelcontext;
-} BTVacState;
-
/*
* BTPARALLEL_NOT_INITIALIZED indicates that the scan has not started.
*
@@ -920,6 +910,34 @@ _bt_page_recyclable(BTPageOpaque opaque, Page page)
return false;
}
+/*
+ * _bt_newly_deleted_pages_recycle() -- Are _bt_pagedel pages recyclable now?
+ *
+ * Note that we assume that the array is ordered by safexid. No further
+ * entries can be safe to recycle once we encounter the first non-recyclable
+ * entry in the deleted array.
+ */
+static inline void
+_bt_newly_deleted_pages_recycle(Relation rel, BTVacState *vstate)
+{
+ IndexBulkDeleteResult *stats = vstate->stats;
+
+ /* Recompute VACUUM XID boundaries */
+ (void) GetOldestNonRemovableTransactionId(NULL);
+
+ for (int i = 0; i < vstate->ndeleted; i++)
+ {
+ BlockNumber blkno = vstate->deleted[i].blkno;
+ FullTransactionId safexid = vstate->deleted[i].safexid;
+
+ if (!GlobalVisCheckRemovableFullXid(NULL, safexid))
+ break;
+
+ RecordFreeIndexPage(rel, blkno);
+ stats->pages_free++;
+ }
+}
+
/*
* Bulk deletion of all index entries pointing to a set of heap tuples.
* The set of target tuples is specified via a callback routine that tells
@@ -1054,6 +1072,11 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
"_bt_pagedel",
ALLOCSET_DEFAULT_SIZES);
+ /* Allocate _bt_newly_deleted_pages_recycle related information */
+ vstate.sizedeleted = 16384;
+ vstate.deleted = palloc(sizeof(BTPendingRecycle) * vstate.sizedeleted);
+ vstate.ndeleted = 0;
+
/*
* The outer loop iterates over all index pages except the metapage, in
* physical order (we hope the kernel will cooperate in providing
@@ -1122,7 +1145,18 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*
* Note that if no recyclable pages exist, we don't bother vacuuming the
* FSM at all.
+ *
+ * Before vacuuming the FSM, try to make the most of the pages we
+ * ourselves deleted: see if they can be recycled already (try to avoid
+ * waiting until the next VACUUM operation to recycle). Our approach is
+ * to check the local array of pages that were newly deleted during this
+ * VACUUM.
*/
+ if (vstate.ndeleted > 0)
+ _bt_newly_deleted_pages_recycle(rel, &vstate);
+
+ pfree(vstate.deleted);
+
if (stats->pages_free > 0)
IndexFreeSpaceMapVacuum(rel);
@@ -1260,6 +1294,13 @@ backtrack:
/*
* Already deleted page (which could be leaf or internal). Can't
* recycle yet.
+ *
+ * This is a deleted page that must have been deleted in a previous
+ * VACUUM operation, that nevertheless cannot be recycled now. There
+ * is no good reason to expect that to change any time soon, which is
+ * why it isn't among the pages that _bt_newly_deleted_pages_recycle
+ * will consider as candidates to recycle at the end of btvacuumscan
+ * call.
*/
stats->pages_deleted++;
}
@@ -1479,12 +1520,10 @@ backtrack:
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
/*
- * We trust the _bt_pagedel return value because it does not include
- * any page that a future call here from btvacuumscan is expected to
- * count. There will be no double-counting.
+ * _bt_pagedel maintains the bulk delete stats on our behalf
*/
Assert(blkno == scanblkno);
- stats->pages_deleted += _bt_pagedel(rel, buf);
+ _bt_pagedel(rel, buf, vstate);
MemoryContextSwitchTo(oldcontext);
/* pagedel released buffer, so we shouldn't */
--
2.27.0
v4-0001-Use-full-64-bit-XID-for-nbtree-page-deletion.patch (application/octet-stream)
From 4c3293892b28a4bd0447363840fd18626378d081 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 27 Aug 2019 11:44:17 -0700
Subject: [PATCH v4 1/3] Use full 64-bit XID for nbtree page deletion.
Otherwise, after a deleted page gets even older, it becomes unrecyclable
again. This is the nbtree equivalent of commit 6655a729, which did the
same thing within GiST.
Stop storing an XID that tracks the oldest safexid across all deleted
pages in an index altogether. There is no longer any point in doing
this. It only ever made sense when btpo.xact fields could wrap around.
The old btm_oldest_btpo_xact metapage field has been repurposed in a way
that preserves on-disk compatibility for pg_upgrade. Rename this uint32
field, and use it to store the number of deleted pages that we expect to
be able to recycle during the next btvacuumcleanup() that actually scans
the index. This approach is a little unorthodox, but we were already
using btm_oldest_btpo_xact (now called btm_last_cleanup_num_delpages) in
approximately the same way. And in exactly the same place: inside the
_bt_vacuum_needs_cleanup() function.
The general assumption is that we ought to be able to recycle however
many pages btm_last_cleanup_num_delpages indicates by deciding to scan
the index during a btvacuumcleanup() call (_bt_vacuum_needs_cleanup()'s
decision). Note that manually issued VACUUMs won't be able to recycle
btm_last_cleanup_num_delpages pages (and _bt_vacuum_needs_cleanup()
won't instruct btvacuumcleanup() to skip scanning the index) unless at
least one XID is consumed between VACUUMs.
Bump XLOG_PAGE_MAGIC due to WAL record switch over to full XIDs.
---
src/include/access/nbtree.h | 80 ++++--
src/include/access/nbtxlog.h | 28 ++-
src/include/storage/standby.h | 2 +
src/backend/access/gist/gistxlog.c | 24 +-
src/backend/access/nbtree/nbtinsert.c | 24 +-
src/backend/access/nbtree/nbtpage.c | 228 ++++++++----------
src/backend/access/nbtree/nbtree.c | 200 +++++++++------
src/backend/access/nbtree/nbtsearch.c | 6 +-
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/nbtree/nbtxlog.c | 39 +--
src/backend/access/rmgrdesc/nbtdesc.c | 17 +-
src/backend/storage/ipc/standby.c | 28 +++
contrib/amcheck/verify_nbtree.c | 81 ++++---
contrib/pageinspect/btreefuncs.c | 69 ++++--
contrib/pageinspect/expected/btree.out | 20 +-
contrib/pageinspect/pageinspect--1.8--1.9.sql | 17 ++
contrib/pgstattuple/pgstatindex.c | 8 +-
doc/src/sgml/pageinspect.sgml | 20 +-
18 files changed, 541 insertions(+), 352 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index cad4f2bdeb..44d176255a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -37,8 +37,9 @@ typedef uint16 BTCycleId;
*
* In addition, we store the page's btree level (counting upwards from
* zero at a leaf page) as well as some flag bits indicating the page type
- * and status. If the page is deleted, we replace the level with the
- * next-transaction-ID value indicating when it is safe to reclaim the page.
+ * and status. If the page is deleted, a BTDeletedPageData struct is stored
+ * in the page's tuple area, while a standard BTPageOpaqueData struct is
+ * stored in the page special area.
*
* We also store a "vacuum cycle ID". When a page is split while VACUUM is
* processing the index, a nonzero value associated with the VACUUM run is
@@ -52,17 +53,17 @@ typedef uint16 BTCycleId;
*
* NOTE: the BTP_LEAF flag bit is redundant since level==0 could be tested
* instead.
+ *
+ * NOTE: the btpo_level field used to be a union type in order to allow
+ * deleted pages to store a 32-bit safexid in the same field. We now store
+ * 64-bit/full safexid values using BTDeletedPageData instead.
*/
typedef struct BTPageOpaqueData
{
BlockNumber btpo_prev; /* left sibling, or P_NONE if leftmost */
BlockNumber btpo_next; /* right sibling, or P_NONE if rightmost */
- union
- {
- uint32 level; /* tree level --- zero for leaf pages */
- TransactionId xact; /* next transaction ID, if deleted */
- } btpo;
+ uint32 btpo_level; /* tree level --- zero for leaf pages */
uint16 btpo_flags; /* flag bits, see below */
BTCycleId btpo_cycleid; /* vacuum cycle ID of latest split */
} BTPageOpaqueData;
@@ -78,6 +79,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples (deprecated) */
#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
+#define BTP_HAS_FULLXID (1 << 8) /* contains BTDeletedPageData */
/*
* The max allowed value of a cycle ID is a bit less than 64K. This is
@@ -105,10 +107,12 @@ typedef struct BTMetaPageData
BlockNumber btm_fastroot; /* current "fast" root location */
uint32 btm_fastlevel; /* tree level of the "fast" root page */
/* remaining fields only valid when btm_version >= BTREE_NOVAC_VERSION */
- TransactionId btm_oldest_btpo_xact; /* oldest btpo_xact among all deleted
- * pages */
- float8 btm_last_cleanup_num_heap_tuples; /* number of heap tuples
- * during last cleanup */
+
+ /* number of deleted, non-recyclable pages during last cleanup */
+ uint32 btm_last_cleanup_num_delpages;
+ /* number of heap tuples during last cleanup */
+ float8 btm_last_cleanup_num_heap_tuples;
+
bool btm_allequalimage; /* are all columns "equalimage"? */
} BTMetaPageData;
@@ -220,6 +224,53 @@ typedef struct BTMetaPageData
#define P_IGNORE(opaque) (((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD)) != 0)
#define P_HAS_GARBAGE(opaque) (((opaque)->btpo_flags & BTP_HAS_GARBAGE) != 0)
#define P_INCOMPLETE_SPLIT(opaque) (((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0)
+#define P_HAS_FULLXID(opaque) (((opaque)->btpo_flags & BTP_HAS_FULLXID) != 0)
+
+/*
+ * BTDeletedPageData is the page contents of a deleted page
+ */
+typedef struct BTDeletedPageData
+{
+ /* last xid which might land on the page and get confused */
+ FullTransactionId safexid;
+} BTDeletedPageData;
+
+static inline void
+BTPageSetDeleted(Page page, FullTransactionId safexid)
+{
+ BTPageOpaque opaque;
+ PageHeader header;
+ BTDeletedPageData *contents;
+
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ header = ((PageHeader) page);
+
+ opaque->btpo_flags &= ~BTP_HALF_DEAD;
+ opaque->btpo_flags |= BTP_DELETED | BTP_HAS_FULLXID;
+ header->pd_lower =
+ MAXALIGN(SizeOfPageHeaderData) + sizeof(BTDeletedPageData);
+ header->pd_upper = header->pd_special;
+
+ /* Set safexid in deleted page */
+ contents = ((BTDeletedPageData *) PageGetContents(page));
+ contents->safexid = safexid;
+}
+
+static inline FullTransactionId
+BTPageGetDeleteXid(Page page)
+{
+ BTPageOpaque opaque PG_USED_FOR_ASSERTS_ONLY;
+ BTDeletedPageData *contents;
+
+ /* pg_upgrade'd indexes with old BTP_DELETED pages should not call here */
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ Assert(P_ISDELETED(opaque) && !P_ISHALFDEAD(opaque) &&
+ P_HAS_FULLXID(opaque));
+
+ /* Get safexid from deleted page */
+ contents = ((BTDeletedPageData *) PageGetContents(page));
+ return contents->safexid;
+}
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -1067,7 +1118,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page origpage,
extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
bool allequalimage);
extern void _bt_update_meta_cleanup_info(Relation rel,
- TransactionId oldestBtpoXact, float8 numHeapTuples);
+ BlockNumber pages_deleted_not_recycled,
+ float8 num_heap_tuples);
extern void _bt_upgrademetapage(Page page);
extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
@@ -1084,15 +1136,13 @@ extern void _bt_unlockbuf(Relation rel, Buffer buf);
extern bool _bt_conditionallockbuf(Relation rel, Buffer buf);
extern void _bt_upgradelockbufcleanup(Relation rel, Buffer buf);
extern void _bt_pageinit(Page page, Size size);
-extern bool _bt_page_recyclable(Page page);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *deletable, int ndeletable,
BTVacuumPosting *updatable, int nupdatable);
extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
Relation heapRel,
TM_IndexDeleteOp *delstate);
-extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf,
- TransactionId *oldestBtpoXact);
+extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 7ae5c98c2b..0cec80e5fa 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -13,6 +13,7 @@
#ifndef NBTXLOG_H
#define NBTXLOG_H
+#include "access/transam.h"
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
#include "storage/off.h"
@@ -52,7 +53,7 @@ typedef struct xl_btree_metadata
uint32 level;
BlockNumber fastroot;
uint32 fastlevel;
- TransactionId oldest_btpo_xact;
+ uint32 last_cleanup_num_delpages;
float8 last_cleanup_num_heap_tuples;
bool allequalimage;
} xl_btree_metadata;
@@ -187,7 +188,7 @@ typedef struct xl_btree_reuse_page
{
RelFileNode node;
BlockNumber block;
- TransactionId latestRemovedXid;
+ FullTransactionId latestRemovedFullXid;
} xl_btree_reuse_page;
#define SizeOfBtreeReusePage (sizeof(xl_btree_reuse_page))
@@ -282,9 +283,12 @@ typedef struct xl_btree_mark_page_halfdead
#define SizeOfBtreeMarkPageHalfDead (offsetof(xl_btree_mark_page_halfdead, topparent) + sizeof(BlockNumber))
/*
- * This is what we need to know about deletion of a btree page. Note we do
- * not store any content for the deleted page --- it is just rewritten as empty
- * during recovery, apart from resetting the btpo.xact.
+ * This is what we need to know about deletion of a btree page. Note that we
+ * only leave behind a small amount of bookkeeping information in deleted
+ * pages (deleted pages must be kept around as tombstones for a while). It is
+ * convenient for the REDO routine to regenerate its target page from scratch.
+ * This is why WAL record describes certain details that are actually directly
+ * available from the target page.
*
* Backup Blk 0: target block being deleted
* Backup Blk 1: target block's left sibling, if any
@@ -296,20 +300,24 @@ typedef struct xl_btree_unlink_page
{
BlockNumber leftsib; /* target block's left sibling, if any */
BlockNumber rightsib; /* target block's right sibling */
+ uint32 level; /* target block's level */
+ FullTransactionId safexid; /* target block's BTPageSetDeleted() value */
/*
- * Information needed to recreate the leaf page, when target is an
- * internal page.
+ * Information needed to recreate a half-dead leaf page with correct
+ * topparent link. The fields are only used when deletion operation's
+ * target page is an internal page. REDO routine creates half-dead page
+ * from scratch to keep things simple (this is the same convenient
+ * approach used for the target page itself).
*/
BlockNumber leafleftsib;
BlockNumber leafrightsib;
- BlockNumber topparent; /* next child down in the subtree */
+ BlockNumber topparent;
- TransactionId btpo_xact; /* value of btpo.xact for use in recovery */
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_UNLINK_PAGE_META */
} xl_btree_unlink_page;
-#define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, btpo_xact) + sizeof(TransactionId))
+#define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, topparent) + sizeof(BlockNumber))
/*
* New root log record. There are zero tuples if this is to establish an
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 94d33851d0..38fd85a431 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -31,6 +31,8 @@ extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
RelFileNode node);
+extern void ResolveRecoveryConflictWithSnapshotFullXid(FullTransactionId latestRemovedFullXid,
+ RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index f2eda79bc1..1c80eae044 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -394,28 +394,8 @@ gistRedoPageReuse(XLogReaderState *record)
* same exclusion effect on primary and standby.
*/
if (InHotStandby)
- {
- FullTransactionId latestRemovedFullXid = xlrec->latestRemovedFullXid;
- FullTransactionId nextXid = ReadNextFullTransactionId();
- uint64 diff;
-
- /*
- * ResolveRecoveryConflictWithSnapshot operates on 32-bit
- * TransactionIds, so truncate the logged FullTransactionId. If the
- * logged value is very old, so that XID wrap-around already happened
- * on it, there can't be any snapshots that still see it.
- */
- diff = U64FromFullTransactionId(nextXid) -
- U64FromFullTransactionId(latestRemovedFullXid);
- if (diff < MaxTransactionId / 2)
- {
- TransactionId latestRemovedXid;
-
- latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
- xlrec->node);
- }
- }
+ ResolveRecoveryConflictWithSnapshotFullXid(xlrec->latestRemovedFullXid,
+ xlrec->node);
}
void
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e333603912..1edb9f9579 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1241,7 +1241,7 @@ _bt_insertonpg(Relation rel,
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
- if (metad->btm_fastlevel >= opaque->btpo.level)
+ if (metad->btm_fastlevel >= opaque->btpo_level)
{
/* no update wanted */
_bt_relbuf(rel, metabuf);
@@ -1268,7 +1268,7 @@ _bt_insertonpg(Relation rel,
if (metad->btm_version < BTREE_NOVAC_VERSION)
_bt_upgrademetapage(metapg);
metad->btm_fastroot = BufferGetBlockNumber(buf);
- metad->btm_fastlevel = opaque->btpo.level;
+ metad->btm_fastlevel = opaque->btpo_level;
MarkBufferDirty(metabuf);
}
@@ -1331,7 +1331,7 @@ _bt_insertonpg(Relation rel,
xlmeta.level = metad->btm_level;
xlmeta.fastroot = metad->btm_fastroot;
xlmeta.fastlevel = metad->btm_fastlevel;
- xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+ xlmeta.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
xlmeta.last_cleanup_num_heap_tuples =
metad->btm_last_cleanup_num_heap_tuples;
xlmeta.allequalimage = metad->btm_allequalimage;
@@ -1537,7 +1537,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
lopaque->btpo_prev = oopaque->btpo_prev;
/* handle btpo_next after rightpage buffer acquired */
- lopaque->btpo.level = oopaque->btpo.level;
+ lopaque->btpo_level = oopaque->btpo_level;
/* handle btpo_cycleid after rightpage buffer acquired */
/*
@@ -1722,7 +1722,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
ropaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
ropaque->btpo_prev = origpagenumber;
ropaque->btpo_next = oopaque->btpo_next;
- ropaque->btpo.level = oopaque->btpo.level;
+ ropaque->btpo_level = oopaque->btpo_level;
ropaque->btpo_cycleid = lopaque->btpo_cycleid;
/*
@@ -1950,7 +1950,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
uint8 xlinfo;
XLogRecPtr recptr;
- xlrec.level = ropaque->btpo.level;
+ xlrec.level = ropaque->btpo_level;
/* See comments below on newitem, orignewitem, and posting lists */
xlrec.firstrightoff = firstrightoff;
xlrec.newitemoff = newitemoff;
@@ -2142,7 +2142,7 @@ _bt_insert_parent(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* Find the leftmost page at the next level up */
- pbuf = _bt_get_endpoint(rel, opaque->btpo.level + 1, false, NULL);
+ pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL);
/* Set up a phony stack entry pointing there */
stack = &fakestack;
stack->bts_blkno = BufferGetBlockNumber(pbuf);
@@ -2480,15 +2480,15 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
rootopaque->btpo_flags = BTP_ROOT;
- rootopaque->btpo.level =
- ((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo.level + 1;
+ rootopaque->btpo_level =
+ ((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo_level + 1;
rootopaque->btpo_cycleid = 0;
/* update metapage data */
metad->btm_root = rootblknum;
- metad->btm_level = rootopaque->btpo.level;
+ metad->btm_level = rootopaque->btpo_level;
metad->btm_fastroot = rootblknum;
- metad->btm_fastlevel = rootopaque->btpo.level;
+ metad->btm_fastlevel = rootopaque->btpo_level;
/*
* Insert the left page pointer into the new root page. The root page is
@@ -2548,7 +2548,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
md.level = metad->btm_level;
md.fastroot = rootblknum;
md.fastlevel = metad->btm_level;
- md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+ md.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
md.allequalimage = metad->btm_allequalimage;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 8c326a4774..e0d1e585f6 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -37,7 +37,7 @@
static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
- TransactionId latestRemovedXid);
+ FullTransactionId latestRemovedFullXid);
static void _bt_delitems_delete(Relation rel, Buffer buf,
TransactionId latestRemovedXid,
OffsetNumber *deletable, int ndeletable,
@@ -50,7 +50,6 @@ static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
BlockNumber scanblkno,
bool *rightsib_empty,
- TransactionId *oldestBtpoXact,
uint32 *ndeleted);
static bool _bt_lock_subtree_parent(Relation rel, BlockNumber child,
BTStack stack,
@@ -78,7 +77,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
metad->btm_level = level;
metad->btm_fastroot = rootbknum;
metad->btm_fastlevel = level;
- metad->btm_oldest_btpo_xact = InvalidTransactionId;
+ metad->btm_last_cleanup_num_delpages = 0;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
metad->btm_allequalimage = allequalimage;
@@ -118,7 +117,7 @@ _bt_upgrademetapage(Page page)
/* Set version number and fill extra fields added into version 3 */
metad->btm_version = BTREE_NOVAC_VERSION;
- metad->btm_oldest_btpo_xact = InvalidTransactionId;
+ metad->btm_last_cleanup_num_delpages = 0;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
/* Only a REINDEX can set this field */
Assert(!metad->btm_allequalimage);
@@ -176,13 +175,14 @@ _bt_getmeta(Relation rel, Buffer metabuf)
* to those written in the metapage. On mismatch, metapage is overwritten.
*/
void
-_bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
- float8 numHeapTuples)
+_bt_update_meta_cleanup_info(Relation rel,
+ BlockNumber pages_deleted_not_free,
+ float8 num_heap_tuples)
{
Buffer metabuf;
Page metapg;
BTMetaPageData *metad;
- bool needsRewrite = false;
+ bool rewrite = false;
XLogRecPtr recptr;
/* read the metapage and check if it needs rewrite */
@@ -190,14 +190,41 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
+ /*
+ * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
+ * field started out as a TransactionId field called btm_oldest_btpo_xact.
+ * Both "versions" are just uint32 fields. _bt_vacuum_needs_cleanup() has
+ * used both versions to decide when cleanup-only VACUUMs needs to call
+ * btvacuumscan. It was convenient to repurpose the field when we began
+ * to use 64-bit XIDs in deleted pages.
+ *
+ * It's possible that a pg_upgrade'd database will contain an XID value in
+ * what is now recognized as the metapage's btm_last_cleanup_num_delpages
+ * field. _bt_vacuum_needs_cleanup() may even believe that this value
+ * indicates that there are lots of pages that it needs to recycle, when
+ * in reality there are only one or two. The worst that can happen is
+ * that there will be a call to btvacuumscan a little earlier, which will
+ * end up here -- at which point btm_last_cleanup_num_delpages will
+ * contain a sane value.
+ *
+ * (Besides, this should only happen when there really are some pages that
+ * we will be able to recycle. If there are none at all then the metapage
+ * XID value will be InvalidTransactionId, which is 0 --- we'll manage to
+ * completely avoid even the minor annoyance of an early btvacuumscan.)
+ */
+ StaticAssertStmt(sizeof(metad->btm_last_cleanup_num_delpages) ==
+ sizeof(TransactionId),
+ "on-disk compatibility assumption violated");
+
/* outdated version of metapage always needs rewrite */
if (metad->btm_version < BTREE_NOVAC_VERSION)
- needsRewrite = true;
- else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
- metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
- needsRewrite = true;
+ rewrite = true;
+ else if (metad->btm_last_cleanup_num_delpages != pages_deleted_not_free)
+ rewrite = true;
+ else if (metad->btm_last_cleanup_num_heap_tuples != num_heap_tuples)
+ rewrite = true;
- if (!needsRewrite)
+ if (!rewrite)
{
_bt_relbuf(rel, metabuf);
return;
@@ -214,8 +241,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
_bt_upgrademetapage(metapg);
/* update cleanup-related information */
- metad->btm_oldest_btpo_xact = oldestBtpoXact;
- metad->btm_last_cleanup_num_heap_tuples = numHeapTuples;
+ metad->btm_last_cleanup_num_delpages = pages_deleted_not_free;
+ metad->btm_last_cleanup_num_heap_tuples = num_heap_tuples;
MarkBufferDirty(metabuf);
/* write wal record if needed */
@@ -232,8 +259,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
md.level = metad->btm_level;
md.fastroot = metad->btm_fastroot;
md.fastlevel = metad->btm_fastlevel;
- md.oldest_btpo_xact = oldestBtpoXact;
- md.last_cleanup_num_heap_tuples = numHeapTuples;
+ md.last_cleanup_num_delpages = pages_deleted_not_free;
+ md.last_cleanup_num_heap_tuples = num_heap_tuples;
md.allequalimage = metad->btm_allequalimage;
XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
@@ -316,7 +343,7 @@ _bt_getroot(Relation rel, int access)
* because that's not set in a "fast root".
*/
if (!P_IGNORE(rootopaque) &&
- rootopaque->btpo.level == rootlevel &&
+ rootopaque->btpo_level == rootlevel &&
P_LEFTMOST(rootopaque) &&
P_RIGHTMOST(rootopaque))
{
@@ -377,7 +404,7 @@ _bt_getroot(Relation rel, int access)
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
rootopaque->btpo_flags = (BTP_LEAF | BTP_ROOT);
- rootopaque->btpo.level = 0;
+ rootopaque->btpo_level = 0;
rootopaque->btpo_cycleid = 0;
/* Get raw page pointer for metapage */
metapg = BufferGetPage(metabuf);
@@ -393,7 +420,7 @@ _bt_getroot(Relation rel, int access)
metad->btm_level = 0;
metad->btm_fastroot = rootblkno;
metad->btm_fastlevel = 0;
- metad->btm_oldest_btpo_xact = InvalidTransactionId;
+ metad->btm_last_cleanup_num_delpages = 0;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
MarkBufferDirty(rootbuf);
@@ -416,7 +443,7 @@ _bt_getroot(Relation rel, int access)
md.level = 0;
md.fastroot = rootblkno;
md.fastlevel = 0;
- md.oldest_btpo_xact = InvalidTransactionId;
+ md.last_cleanup_num_delpages = 0;
md.last_cleanup_num_heap_tuples = -1.0;
md.allequalimage = metad->btm_allequalimage;
@@ -481,11 +508,10 @@ _bt_getroot(Relation rel, int access)
rootblkno = rootopaque->btpo_next;
}
- /* Note: can't check btpo.level on deleted pages */
- if (rootopaque->btpo.level != rootlevel)
+ if (rootopaque->btpo_level != rootlevel)
elog(ERROR, "root page %u of index \"%s\" has level %u, expected %u",
rootblkno, RelationGetRelationName(rel),
- rootopaque->btpo.level, rootlevel);
+ rootopaque->btpo_level, rootlevel);
}
/*
@@ -585,11 +611,10 @@ _bt_gettrueroot(Relation rel)
rootblkno = rootopaque->btpo_next;
}
- /* Note: can't check btpo.level on deleted pages */
- if (rootopaque->btpo.level != rootlevel)
+ if (rootopaque->btpo_level != rootlevel)
elog(ERROR, "root page %u of index \"%s\" has level %u, expected %u",
rootblkno, RelationGetRelationName(rel),
- rootopaque->btpo.level, rootlevel);
+ rootopaque->btpo_level, rootlevel);
return rootbuf;
}
@@ -762,7 +787,8 @@ _bt_checkpage(Relation rel, Buffer buf)
* Log the reuse of a page from the FSM.
*/
static void
-_bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedXid)
+_bt_log_reuse_page(Relation rel, BlockNumber blkno,
+ FullTransactionId latestRemovedFullXid)
{
xl_btree_reuse_page xlrec_reuse;
@@ -775,7 +801,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
/* XLOG stuff */
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
- xlrec_reuse.latestRemovedXid = latestRemovedXid;
+ xlrec_reuse.latestRemovedFullXid = latestRemovedFullXid;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec_reuse, SizeOfBtreeReusePage);
@@ -855,36 +881,45 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
buf = ReadBuffer(rel, blkno);
if (_bt_conditionallockbuf(rel, buf))
{
+ BTPageOpaque opaque;
+
+ /* Check for all-zero page first */
page = BufferGetPage(buf);
- if (_bt_page_recyclable(page))
+ if (PageIsNew(page))
+ {
+ /* Okay to use page. Initialize and return it. */
+ _bt_pageinit(page, BufferGetPageSize(buf));
+ return buf;
+ }
+
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ if (P_ISDELETED(opaque))
{
/*
* If we are generating WAL for Hot Standby then create a
* WAL record that will allow us to conflict with queries
* running on standby, in case they have snapshots older
- * than btpo.xact. This can only apply if the page does
- * have a valid btpo.xact value, ie not if it's new. (We
- * must check that because an all-zero page has no special
- * space.)
+ * than safexid value
*/
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel) &&
- !PageIsNew(page))
+ P_HAS_FULLXID(opaque))
{
- BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ FullTransactionId latestRemovedFullXid;
- _bt_log_reuse_page(rel, blkno, opaque->btpo.xact);
+ latestRemovedFullXid = BTPageGetDeleteXid(page);
+ _bt_log_reuse_page(rel, blkno, latestRemovedFullXid);
}
- /* Okay to use page. Re-initialize and return it */
+ /* Okay to use page. Re-initialize and return it. */
_bt_pageinit(page, BufferGetPageSize(buf));
return buf;
}
- elog(DEBUG2, "FSM returned nonrecyclable page");
+ elog(DEBUG1, "FSM returned nonrecyclable page");
_bt_relbuf(rel, buf);
}
else
{
- elog(DEBUG2, "FSM returned nonlockable page");
+ elog(DEBUG1, "FSM returned nonlockable page");
/* couldn't get lock, so just drop pin */
ReleaseBuffer(buf);
}
@@ -1073,40 +1108,6 @@ _bt_pageinit(Page page, Size size)
PageInit(page, size, sizeof(BTPageOpaqueData));
}
-/*
- * _bt_page_recyclable() -- Is an existing page recyclable?
- *
- * This exists to make sure _bt_getbuf and btvacuumscan have the same
- * policy about whether a page is safe to re-use. But note that _bt_getbuf
- * knows enough to distinguish the PageIsNew condition from the other one.
- * At some point it might be appropriate to redesign this to have a three-way
- * result value.
- */
-bool
-_bt_page_recyclable(Page page)
-{
- BTPageOpaque opaque;
-
- /*
- * It's possible to find an all-zeroes page in an index --- for example, a
- * backend might successfully extend the relation one page and then crash
- * before it is able to make a WAL entry for adding the page. If we find a
- * zeroed page then reclaim it.
- */
- if (PageIsNew(page))
- return true;
-
- /*
- * Otherwise, recycle if deleted and too old to have any processes
- * interested in it.
- */
- opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- if (P_ISDELETED(opaque) &&
- GlobalVisCheckRemovableXid(NULL, opaque->btpo.xact))
- return true;
- return false;
-}
-
/*
* Delete item(s) from a btree leaf page during VACUUM.
*
@@ -1768,16 +1769,12 @@ _bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
* that the btvacuumscan scan has yet to reach; they'll get counted later
* instead.
*
- * Maintains *oldestBtpoXact for any pages that get deleted. Caller is
- * responsible for maintaining *oldestBtpoXact in the case of pages that were
- * deleted by a previous VACUUM.
- *
* NOTE: this leaks memory. Rather than trying to clean up everything
* carefully, it's better to run it in a temp context that can be reset
* frequently.
*/
uint32
-_bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
+_bt_pagedel(Relation rel, Buffer leafbuf)
{
uint32 ndeleted = 0;
BlockNumber rightsib;
@@ -1985,8 +1982,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
{
/* Check for interrupts in _bt_unlink_halfdead_page */
if (!_bt_unlink_halfdead_page(rel, leafbuf, scanblkno,
- &rightsib_empty, oldestBtpoXact,
- &ndeleted))
+ &rightsib_empty, &ndeleted))
{
/*
* _bt_unlink_halfdead_page should never fail, since we
@@ -2001,9 +1997,8 @@ _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
}
}
- Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque));
- Assert(TransactionIdFollowsOrEquals(opaque->btpo.xact,
- *oldestBtpoXact));
+ Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque) &&
+ P_HAS_FULLXID(opaque));
rightsib = opaque->btpo_next;
@@ -2264,12 +2259,6 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
* containing leafbuf. (We always set *rightsib_empty for caller, just to be
* consistent.)
*
- * We maintain *oldestBtpoXact for pages that are deleted by the current
- * VACUUM operation here. This must be handled here because we conservatively
- * assume that there needs to be a new call to ReadNextTransactionId() each
- * time a page gets deleted. See comments about the underlying assumption
- * below.
- *
* Must hold pin and lock on leafbuf at entry (read or write doesn't matter).
* On success exit, we'll be holding pin and write lock. On failure exit,
* we'll release both pin and lock before returning (we define it that way
@@ -2277,8 +2266,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
*/
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
- bool *rightsib_empty, TransactionId *oldestBtpoXact,
- uint32 *ndeleted)
+ bool *rightsib_empty, uint32 *ndeleted)
{
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
BlockNumber leafleftsib;
@@ -2294,12 +2282,12 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
BTMetaPageData *metad = NULL;
ItemId itemid;
Page page;
- PageHeader header;
BTPageOpaque opaque;
+ FullTransactionId safexid;
bool rightsib_is_rightmost;
- int targetlevel;
+ uint32 targetlevel;
IndexTuple leafhikey;
- BlockNumber nextchild;
+ BlockNumber topparent_in_target;
page = BufferGetPage(leafbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2343,7 +2331,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
leftsib = opaque->btpo_prev;
- targetlevel = opaque->btpo.level;
+ targetlevel = opaque->btpo_level;
Assert(targetlevel > 0);
/*
@@ -2450,7 +2438,9 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
!P_ISLEAF(opaque) || !P_ISHALFDEAD(opaque))
elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
target, RelationGetRelationName(rel));
- nextchild = InvalidBlockNumber;
+
+ /* Leaf page is also target page: don't set topparent */
+ topparent_in_target = InvalidBlockNumber;
}
else
{
@@ -2459,11 +2449,13 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
target, RelationGetRelationName(rel));
- /* Remember the next non-leaf child down in the subtree */
+ /* Internal page is target: we'll set topparent in leaf page... */
itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
- nextchild = BTreeTupleGetDownLink((IndexTuple) PageGetItem(page, itemid));
- if (nextchild == leafblkno)
- nextchild = InvalidBlockNumber;
+ topparent_in_target =
+ BTreeTupleGetTopParent((IndexTuple) PageGetItem(page, itemid));
+ /* ...except when it would be a redundant pointer-to-self */
+ if (topparent_in_target == leafblkno)
+ topparent_in_target = InvalidBlockNumber;
}
/*
@@ -2553,13 +2545,13 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
* no lock was held.
*/
if (target != leafblkno)
- BTreeTupleSetTopParent(leafhikey, nextchild);
+ BTreeTupleSetTopParent(leafhikey, topparent_in_target);
/*
* Mark the page itself deleted. It can be recycled when all current
* transactions are gone. Storing GetTopTransactionId() would work, but
* we're in VACUUM and would not otherwise have an XID. Having already
- * updated links to the target, ReadNextTransactionId() suffices as an
+ * updated links to the target, ReadNextFullTransactionId() suffices as an
* upper bound. Any scan having retained a now-stale link is advertising
* in its PGPROC an xmin less than or equal to the value we read here. It
* will continue to do so, holding back the xmin horizon, for the duration
@@ -2568,17 +2560,14 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
Assert(P_ISHALFDEAD(opaque) || !P_ISLEAF(opaque));
- opaque->btpo_flags &= ~BTP_HALF_DEAD;
- opaque->btpo_flags |= BTP_DELETED;
- opaque->btpo.xact = ReadNextTransactionId();
/*
- * Remove the remaining tuples on the page. This keeps things simple for
- * WAL consistency checking.
+ * Store upper bound XID that's used to determine when deleted page is no
+ * longer needed as a tombstone
*/
- header = (PageHeader) page;
- header->pd_lower = SizeOfPageHeaderData;
- header->pd_upper = header->pd_special;
+ safexid = ReadNextFullTransactionId();
+ BTPageSetDeleted(page, safexid);
+ opaque->btpo_cycleid = 0;
/* And update the metapage, if needed */
if (BufferIsValid(metabuf))
@@ -2616,15 +2605,16 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (target != leafblkno)
XLogRegisterBuffer(3, leafbuf, REGBUF_WILL_INIT);
- /* information on the unlinked block */
+ /* information stored on the target/to-be-unlinked block */
xlrec.leftsib = leftsib;
xlrec.rightsib = rightsib;
- xlrec.btpo_xact = opaque->btpo.xact;
+ xlrec.level = targetlevel;
+ xlrec.safexid = safexid;
/* information needed to recreate the leaf block (if not the target) */
xlrec.leafleftsib = leafleftsib;
xlrec.leafrightsib = leafrightsib;
- xlrec.topparent = nextchild;
+ xlrec.topparent = topparent_in_target;
XLogRegisterData((char *) &xlrec, SizeOfBtreeUnlinkPage);
@@ -2638,7 +2628,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
xlmeta.level = metad->btm_level;
xlmeta.fastroot = metad->btm_fastroot;
xlmeta.fastlevel = metad->btm_fastlevel;
- xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+ xlmeta.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
xlmeta.allequalimage = metad->btm_allequalimage;
@@ -2681,9 +2671,9 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_relbuf(rel, lbuf);
_bt_relbuf(rel, rbuf);
- if (!TransactionIdIsValid(*oldestBtpoXact) ||
- TransactionIdPrecedes(opaque->btpo.xact, *oldestBtpoXact))
- *oldestBtpoXact = opaque->btpo.xact;
+ /* If the target is not leafbuf, we're done with it now -- release it */
+ if (target != leafblkno)
+ _bt_relbuf(rel, buf);
/*
* If btvacuumscan won't revisit this page in a future btvacuumpage call
@@ -2693,10 +2683,6 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (target <= scanblkno)
(*ndeleted)++;
- /* If the target is not leafbuf, we're done with it now -- release it */
- if (target != leafblkno)
- _bt_relbuf(rel, buf);
-
return true;
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 289bd3c15d..1e6da9439f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -46,8 +46,6 @@ typedef struct
IndexBulkDeleteCallback callback;
void *callback_state;
BTCycleId cycleid;
- BlockNumber totFreePages; /* true total # of free pages */
- TransactionId oldestBtpoXact;
MemoryContext pagedelcontext;
} BTVacState;
@@ -802,66 +800,124 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
Buffer metabuf;
Page metapg;
BTMetaPageData *metad;
- bool result = false;
+ BTOptions *relopts;
+ float8 cleanup_scale_factor;
+ uint32 btm_version;
+ BlockNumber prev_pages_deleted_not_free;
+ float8 prev_num_heap_tuples;
+ /*
+ * Copy details from metapage to local variables quickly.
+ *
+ * Note that we deliberately avoid using cached version of metapage here.
+ */
metabuf = _bt_getbuf(info->index, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
+ btm_version = metad->btm_version;
+
+ if (btm_version < BTREE_NOVAC_VERSION)
+ {
+ /*
+ * Metapage needs to be dynamically upgraded to store fields that are
+ * only present when btm_version >= BTREE_NOVAC_VERSION
+ */
+ _bt_relbuf(info->index, metabuf);
+ return true;
+ }
+
+ prev_pages_deleted_not_free = metad->btm_last_cleanup_num_delpages;
+ prev_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ _bt_relbuf(info->index, metabuf);
/*
- * XXX: If IndexVacuumInfo contained the heap relation, we could be more
- * aggressive about vacuuming non catalog relations by passing the table
- * to GlobalVisCheckRemovableXid().
+ * If table receives enough insertions and no cleanup was performed, then
+ * index would appear have stale statistics. If scale factor is set, we
+ * avoid that by performing cleanup if the number of inserted tuples
+ * exceeds vacuum_cleanup_index_scale_factor fraction of original tuples
+ * count.
*/
+ relopts = (BTOptions *) info->index->rd_options;
+ cleanup_scale_factor = (relopts &&
+ relopts->vacuum_cleanup_index_scale_factor >= 0)
+ ? relopts->vacuum_cleanup_index_scale_factor
+ : vacuum_cleanup_index_scale_factor;
- if (metad->btm_version < BTREE_NOVAC_VERSION)
+ if (cleanup_scale_factor <= 0 ||
+ info->num_heap_tuples < 0 ||
+ prev_num_heap_tuples <= 0 ||
+ (info->num_heap_tuples - prev_num_heap_tuples) /
+ prev_num_heap_tuples >= cleanup_scale_factor)
+ return true;
+
+ /*
+ * Trigger cleanup in rare cases where prev_pages_deleted_not_free exceeds
+ * 5% of the total size of the index. We can reasonably expect (though
+ * are not guaranteed) to be able to recycle this many pages if we decide
+ * to do a btvacuumscan call as part of this btvacuumcleanup call.
+ *
+ * Our approach won't reliably avoid "wasted" cleanup-only btvacuumscan
+ * calls. That is, we can end up scanning the entire index without ever
+ * placing even 1 of the prev_pages_deleted_not_free pages in the free
+ * space map, at least in certain narrow cases.
+ *
+ * For example, a "wasted" scan will happen when (for whatever reason) no
+ * XIDs were assigned/allocated since the "# deleted pages" field was last
+ * set in metapage by VACUUM. You can observe this yourself by running
+ * two VACUUM VERBOSE commands one after the other on an otherwise idle
+ * system. When the first VACUUM command manages to delete pages that
+ * were emptied + deleted during btbulkdelete, the second VACUUM command
+ * won't be able to place those same deleted pages (which won't be newly
+ * deleted from the perspective of the second VACUUM command) into the FSM
+ * during btvacuumcleanup.
+ *
+ * Another "wasted FSM-driven cleanup scan" scenario can occur when
+ * VACUUM's ability to do work is hindered by a long held MVCC snapshot.
+ * The snapshot prevents page recycling/freeing within btvacuumscan,
+ * though that will be the least of the DBA's problems.
+ */
+ if (prev_pages_deleted_not_free >
+ RelationGetNumberOfBlocks(info->index) / 20)
+ return true;
+
+ return false;
+}
+
+/*
+ * _bt_page_recyclable() -- Is page recyclable?
+ *
+ * A page is safe to be recycled if it is marked deleted and has a safexid
+ * value that is sufficiently old.
+ */
+static inline bool
+_bt_page_recyclable(BTPageOpaque opaque, Page page)
+{
+ if (P_ISDELETED(opaque))
{
/*
- * Do cleanup if metapage needs upgrade, because we don't have
- * cleanup-related meta-information yet.
+ * If this is a pg_upgrade'd index, then this could be a deleted page
+ * whose XID (which is stored in special area's level field via type
+ * punning) is non-full 32-bit value. It's safe to just assume that
+ * we can recycle because the system must have been restarted since
+ * the time of deletion.
*/
- result = true;
- }
- else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
- GlobalVisCheckRemovableXid(NULL, metad->btm_oldest_btpo_xact))
- {
- /*
- * If any oldest btpo.xact from a previously deleted page in the index
- * is visible to everyone, then at least one deleted page can be
- * recycled -- don't skip cleanup.
- */
- result = true;
- }
- else
- {
- BTOptions *relopts;
- float8 cleanup_scale_factor;
- float8 prev_num_heap_tuples;
+ if (!P_HAS_FULLXID(opaque))
+ return true;
/*
- * If table receives enough insertions and no cleanup was performed,
- * then index would appear have stale statistics. If scale factor is
- * set, we avoid that by performing cleanup if the number of inserted
- * tuples exceeds vacuum_cleanup_index_scale_factor fraction of
- * original tuples count.
+ * The page was deleted, but when? If it was just deleted, a scan
+ * might have seen the downlink to it, and will read the page later.
+ * As long as that can happen, we must keep the deleted page around as
+ * a tombstone.
+ *
+ * For that check if the deletion XID could still be visible to
+ * anyone. If not, then no scan that's still in progress could have
+ * seen its downlink, and we can recycle it.
*/
- relopts = (BTOptions *) info->index->rd_options;
- cleanup_scale_factor = (relopts &&
- relopts->vacuum_cleanup_index_scale_factor >= 0)
- ? relopts->vacuum_cleanup_index_scale_factor
- : vacuum_cleanup_index_scale_factor;
- prev_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
-
- if (cleanup_scale_factor <= 0 ||
- info->num_heap_tuples < 0 ||
- prev_num_heap_tuples <= 0 ||
- (info->num_heap_tuples - prev_num_heap_tuples) /
- prev_num_heap_tuples >= cleanup_scale_factor)
- result = true;
+ return GlobalVisCheckRemovableFullXid(NULL, BTPageGetDeleteXid(page));
}
- _bt_relbuf(info->index, metabuf);
- return result;
+ return false;
}
/*
@@ -973,14 +1029,18 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BlockNumber num_pages;
BlockNumber scanblkno;
bool needLock;
+ BlockNumber pages_deleted_not_free;
/*
* Reset counts that will be incremented during the scan; needed in case
* of multiple scans during a single VACUUM command
*/
+ stats->num_pages = 0;
stats->estimated_count = false;
stats->num_index_tuples = 0;
+ stats->tuples_removed = 0;
stats->pages_deleted = 0;
+ stats->pages_free = 0;
/* Set up info to pass down to btvacuumpage */
vstate.info = info;
@@ -988,8 +1048,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
vstate.callback = callback;
vstate.callback_state = callback_state;
vstate.cycleid = cycleid;
- vstate.totFreePages = 0;
- vstate.oldestBtpoXact = InvalidTransactionId;
/* Create a temporary memory context to run _bt_pagedel in */
vstate.pagedelcontext = AllocSetContextCreate(CurrentMemoryContext,
@@ -1048,6 +1106,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
}
}
+ /* Set statistics num_pages field to final size of index */
+ stats->num_pages = num_pages;
+
MemoryContextDelete(vstate.pagedelcontext);
/*
@@ -1062,27 +1123,26 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* Note that if no recyclable pages exist, we don't bother vacuuming the
* FSM at all.
*/
- if (vstate.totFreePages > 0)
+ if (stats->pages_free > 0)
IndexFreeSpaceMapVacuum(rel);
/*
- * Maintain the oldest btpo.xact and a count of the current number of heap
- * tuples in the metapage (for the benefit of _bt_vacuum_needs_cleanup).
+ * Maintain the count of the current number of heap tuples in the
+ * metapage. Also maintain the current pages_deleted_not_free value.
+ * Both values are used within _bt_vacuum_needs_cleanup.
*
- * The page with the oldest btpo.xact is typically a page deleted by this
- * VACUUM operation, since pages deleted by a previous VACUUM operation
- * tend to be placed in the FSM (by the current VACUUM operation) -- such
- * pages are not candidates to be the oldest btpo.xact. (Note that pages
- * placed in the FSM are reported as deleted pages in the bulk delete
- * statistics, despite not counting as deleted pages for the purposes of
- * determining the oldest btpo.xact.)
+ * pages_deleted_not_free is the number of deleted pages now in the index
+ * that were not safe to place in the FSM to be recycled just yet. We
+ * expect that it will almost certainly be possible to place all of these
+ * pages in the FSM during the next VACUUM operation. If the next VACUUM
+ * operation happens to be cleanup-only, _bt_vacuum_needs_cleanup will be
+ * called. We may decide to proceed with a call to btvacuumscan purely
+ * because there are lots of deleted pages not yet placed in the FSM.
*/
- _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
+ Assert(stats->pages_deleted >= stats->pages_free);
+ pages_deleted_not_free = stats->pages_deleted - stats->pages_free;
+ _bt_update_meta_cleanup_info(rel, pages_deleted_not_free,
info->num_heap_tuples);
-
- /* update statistics */
- stats->num_pages = num_pages;
- stats->pages_free = vstate.totFreePages;
}
/*
@@ -1188,13 +1248,12 @@ backtrack:
}
}
- /* Page is valid, see what to do with it */
- if (_bt_page_recyclable(page))
+ if (!opaque || _bt_page_recyclable(opaque, page))
{
/* Okay to recycle this page (which could be leaf or internal) */
RecordFreeIndexPage(rel, blkno);
- vstate->totFreePages++;
stats->pages_deleted++;
+ stats->pages_free++;
}
else if (P_ISDELETED(opaque))
{
@@ -1203,17 +1262,12 @@ backtrack:
* recycle yet.
*/
stats->pages_deleted++;
-
- /* Maintain the oldest btpo.xact */
- if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
- TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
- vstate->oldestBtpoXact = opaque->btpo.xact;
}
else if (P_ISHALFDEAD(opaque))
{
/*
* Half-dead leaf page. Try to delete now. Might update
- * oldestBtpoXact and pages_deleted below.
+ * pages_deleted below.
*/
attempt_pagedel = true;
}
@@ -1430,7 +1484,7 @@ backtrack:
* count. There will be no double-counting.
*/
Assert(blkno == scanblkno);
- stats->pages_deleted += _bt_pagedel(rel, buf, &vstate->oldestBtpoXact);
+ stats->pages_deleted += _bt_pagedel(rel, buf);
MemoryContextSwitchTo(oldcontext);
/* pagedel released buffer, so we shouldn't */
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 2e3bda8171..d1177d8772 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -169,7 +169,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* we're on the level 1 and asked to lock leaf page in write mode,
* then lock next page in write mode, because it must be a leaf.
*/
- if (opaque->btpo.level == 1 && access == BT_WRITE)
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
page_access = BT_WRITE;
/* drop the read lock on the page, then acquire one on its child */
@@ -2341,9 +2341,9 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
}
/* Done? */
- if (opaque->btpo.level == level)
+ if (opaque->btpo_level == level)
break;
- if (opaque->btpo.level < level)
+ if (opaque->btpo_level < level)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("btree level %u not found in index \"%s\"",
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5683daa34d..2c4d7f6e25 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -620,7 +620,7 @@ _bt_blnewpage(uint32 level)
/* Initialize BT opaque state */
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_prev = opaque->btpo_next = P_NONE;
- opaque->btpo.level = level;
+ opaque->btpo_level = level;
opaque->btpo_flags = (level > 0) ? 0 : BTP_LEAF;
opaque->btpo_cycleid = 0;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c1d578cc01..b6afe9526e 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -112,7 +112,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
md->btm_fastlevel = xlrec->fastlevel;
/* Cannot log BTREE_MIN_VERSION index metapage without upgrade */
Assert(md->btm_version >= BTREE_NOVAC_VERSION);
- md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
+ md->btm_last_cleanup_num_delpages = xlrec->last_cleanup_num_delpages;
md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
md->btm_allequalimage = xlrec->allequalimage;
@@ -297,7 +297,7 @@ btree_xlog_split(bool newitemonleft, XLogReaderState *record)
ropaque->btpo_prev = origpagenumber;
ropaque->btpo_next = spagenumber;
- ropaque->btpo.level = xlrec->level;
+ ropaque->btpo_level = xlrec->level;
ropaque->btpo_flags = isleaf ? BTP_LEAF : 0;
ropaque->btpo_cycleid = 0;
@@ -773,7 +773,7 @@ btree_xlog_mark_page_halfdead(uint8 info, XLogReaderState *record)
pageop->btpo_prev = xlrec->leftblk;
pageop->btpo_next = xlrec->rightblk;
- pageop->btpo.level = 0;
+ pageop->btpo_level = 0;
pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
pageop->btpo_cycleid = 0;
@@ -802,6 +802,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
BlockNumber leftsib;
BlockNumber rightsib;
+ uint32 level;
+ bool isleaf;
+ FullTransactionId safexid;
Buffer leftbuf;
Buffer target;
Buffer rightbuf;
@@ -810,6 +813,12 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
leftsib = xlrec->leftsib;
rightsib = xlrec->rightsib;
+ level = xlrec->level;
+ isleaf = (level == 0);
+ safexid = xlrec->safexid;
+
+ /* No topparent link for leaf page (level 0) or level 1 */
+ Assert(xlrec->topparent == InvalidBlockNumber || level > 1);
/*
* In normal operation, we would lock all the pages this WAL record
@@ -844,9 +853,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
pageop->btpo_prev = leftsib;
pageop->btpo_next = rightsib;
- pageop->btpo.xact = xlrec->btpo_xact;
- pageop->btpo_flags = BTP_DELETED;
- if (!BlockNumberIsValid(xlrec->topparent))
+ pageop->btpo_level = level;
+ BTPageSetDeleted(page, safexid);
+ if (isleaf)
pageop->btpo_flags |= BTP_LEAF;
pageop->btpo_cycleid = 0;
@@ -892,6 +901,8 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
Buffer leafbuf;
IndexTupleData trunctuple;
+ Assert(!isleaf);
+
leafbuf = XLogInitBufferForRedo(record, 3);
page = (Page) BufferGetPage(leafbuf);
@@ -901,7 +912,7 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
pageop->btpo_prev = xlrec->leafleftsib;
pageop->btpo_next = xlrec->leafrightsib;
- pageop->btpo.level = 0;
+ pageop->btpo_level = 0;
pageop->btpo_cycleid = 0;
/* Add a dummy hikey item */
@@ -942,7 +953,7 @@ btree_xlog_newroot(XLogReaderState *record)
pageop->btpo_flags = BTP_ROOT;
pageop->btpo_prev = pageop->btpo_next = P_NONE;
- pageop->btpo.level = xlrec->level;
+ pageop->btpo_level = xlrec->level;
if (xlrec->level == 0)
pageop->btpo_flags |= BTP_LEAF;
pageop->btpo_cycleid = 0;
@@ -972,17 +983,15 @@ btree_xlog_reuse_page(XLogReaderState *record)
* Btree reuse_page records exist to provide a conflict point when we
* reuse pages in the index via the FSM. That's all they do though.
*
- * latestRemovedXid was the page's btpo.xact. The
- * GlobalVisCheckRemovableXid test in _bt_page_recyclable() conceptually
- * mirrors the pgxact->xmin > limitXmin test in
+ * latestRemovedXid was the page's deleteXid. The
+ * GlobalVisCheckRemovableFullXid(deleteXid) test in _bt_page_recyclable()
+ * conceptually mirrors the PGPROC->xmin > limitXmin test in
* GetConflictingVirtualXIDs(). Consequently, one XID value achieves the
* same exclusion effect on primary and standby.
*/
if (InHotStandby)
- {
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
- }
+ ResolveRecoveryConflictWithSnapshotFullXid(xlrec->latestRemovedFullXid,
+ xlrec->node);
}
void
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 6e0d6a2b72..5cce10a5b6 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -80,9 +80,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) rec;
- appendStringInfo(buf, "left %u; right %u; btpo_xact %u; ",
- xlrec->leftsib, xlrec->rightsib,
- xlrec->btpo_xact);
+ appendStringInfo(buf, "left %u; right %u; level %u; safexid %u:%u; ",
+ xlrec->leftsib, xlrec->rightsib, xlrec->level,
+ EpochFromFullTransactionId(xlrec->safexid),
+ XidFromFullTransactionId(xlrec->safexid));
appendStringInfo(buf, "leafleft %u; leafright %u; topparent %u",
xlrec->leafleftsib, xlrec->leafrightsib,
xlrec->topparent);
@@ -99,9 +100,11 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) rec;
- appendStringInfo(buf, "rel %u/%u/%u; latestRemovedXid %u",
+ appendStringInfo(buf, "rel %u/%u/%u; latestRemovedXid %u:%u",
xlrec->node.spcNode, xlrec->node.dbNode,
- xlrec->node.relNode, xlrec->latestRemovedXid);
+ xlrec->node.relNode,
+ EpochFromFullTransactionId(xlrec->latestRemovedFullXid),
+ XidFromFullTransactionId(xlrec->latestRemovedFullXid));
break;
}
case XLOG_BTREE_META_CLEANUP:
@@ -110,8 +113,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
xlrec = (xl_btree_metadata *) XLogRecGetBlockData(record, 0,
NULL);
- appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f",
- xlrec->oldest_btpo_xact,
+ appendStringInfo(buf, "last_cleanup_num_delpages %u; last_cleanup_num_heap_tuples: %f",
+ xlrec->last_cleanup_num_delpages,
xlrec->last_cleanup_num_heap_tuples);
break;
}
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 39a30c00f7..0eeb766943 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -452,6 +452,34 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
true);
}
+/*
+ * Variant of ResolveRecoveryConflictWithSnapshot that works with
+ * FullTransactionId values
+ */
+void
+ResolveRecoveryConflictWithSnapshotFullXid(FullTransactionId latestRemovedFullXid,
+ RelFileNode node)
+{
+ /*
+ * ResolveRecoveryConflictWithSnapshot operates on 32-bit TransactionIds,
+ * so truncate the logged FullTransactionId. If the logged value is very
+ * old, so that XID wrap-around already happened on it, there can't be any
+ * snapshots that still see it.
+ */
+ FullTransactionId nextXid = ReadNextFullTransactionId();
+ uint64 diff;
+
+ diff = U64FromFullTransactionId(nextXid) -
+ U64FromFullTransactionId(latestRemovedFullXid);
+ if (diff < MaxTransactionId / 2)
+ {
+ TransactionId latestRemovedXid;
+
+ latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid, node);
+ }
+}
+
void
ResolveRecoveryConflictWithTablespace(Oid tsid)
{
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index b8c7793d9e..0272c6554f 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -769,7 +769,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
P_FIRSTDATAKEY(opaque));
itup = (IndexTuple) PageGetItem(state->target, itemid);
nextleveldown.leftmost = BTreeTupleGetDownLink(itup);
- nextleveldown.level = opaque->btpo.level - 1;
+ nextleveldown.level = opaque->btpo_level - 1;
}
else
{
@@ -794,14 +794,14 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
if (opaque->btpo_prev != leftcurrent)
bt_recheck_sibling_links(state, opaque->btpo_prev, leftcurrent);
- /* Check level, which must be valid for non-ignorable page */
- if (level.level != opaque->btpo.level)
+ /* Check level */
+ if (level.level != opaque->btpo_level)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("leftmost down link for level points to block in index \"%s\" whose level is not one level down",
RelationGetRelationName(state->rel)),
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
- current, level.level, opaque->btpo.level)));
+ current, level.level, opaque->btpo_level)));
/* Verify invariants for page */
bt_target_page_check(state);
@@ -1167,7 +1167,7 @@ bt_target_page_check(BtreeCheckState *state)
bt_child_highkey_check(state,
offset,
NULL,
- topaque->btpo.level);
+ topaque->btpo_level);
}
continue;
}
@@ -1529,7 +1529,7 @@ bt_target_page_check(BtreeCheckState *state)
if (!P_ISLEAF(topaque) && P_RIGHTMOST(topaque) && state->readonly)
{
bt_child_highkey_check(state, InvalidOffsetNumber,
- NULL, topaque->btpo.level);
+ NULL, topaque->btpo_level);
}
}
@@ -1606,7 +1606,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
ereport(DEBUG1,
(errcode(ERRCODE_NO_DATA),
errmsg("level %u leftmost page of index \"%s\" was found deleted or half dead",
- opaque->btpo.level, RelationGetRelationName(state->rel)),
+ opaque->btpo_level, RelationGetRelationName(state->rel)),
errdetail_internal("Deleted page found when building scankey from right sibling.")));
/* Be slightly more pro-active in freeing this memory, just in case */
@@ -1910,14 +1910,15 @@ bt_child_highkey_check(BtreeCheckState *state,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
- /* Check level for non-ignorable page */
- if (!P_IGNORE(opaque) && opaque->btpo.level != target_level - 1)
+ /* Do level sanity check */
+ if ((!P_ISDELETED(opaque) || P_HAS_FULLXID(opaque)) &&
+ opaque->btpo_level != target_level - 1)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block found while following rightlinks from child of index \"%s\" has invalid level",
RelationGetRelationName(state->rel)),
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
- blkno, target_level - 1, opaque->btpo.level)));
+ blkno, target_level - 1, opaque->btpo_level)));
/* Try to detect circular links */
if ((!first && blkno == state->prevrightlink) || blkno == opaque->btpo_prev)
@@ -2145,7 +2146,7 @@ bt_child_check(BtreeCheckState *state, BTScanInsert targetkey,
* check for downlink connectivity.
*/
bt_child_highkey_check(state, downlinkoffnum,
- child, topaque->btpo.level);
+ child, topaque->btpo_level);
/*
* Since there cannot be a concurrent VACUUM operation in readonly mode,
@@ -2290,7 +2291,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
errmsg("harmless interrupted page split detected in index %s",
RelationGetRelationName(state->rel)),
errdetail_internal("Block=%u level=%u left sibling=%u page lsn=%X/%X.",
- blkno, opaque->btpo.level,
+ blkno, opaque->btpo_level,
opaque->btpo_prev,
(uint32) (pagelsn >> 32),
(uint32) pagelsn)));
@@ -2321,7 +2322,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
elog(DEBUG1, "checking for interrupted multi-level deletion due to missing downlink in index \"%s\"",
RelationGetRelationName(state->rel));
- level = opaque->btpo.level;
+ level = opaque->btpo_level;
itemid = PageGetItemIdCareful(state, blkno, page, P_FIRSTDATAKEY(opaque));
itup = (IndexTuple) PageGetItem(page, itemid);
childblk = BTreeTupleGetDownLink(itup);
@@ -2336,16 +2337,16 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
break;
/* Do an extra sanity check in passing on internal pages */
- if (copaque->btpo.level != level - 1)
+ if (copaque->btpo_level != level - 1)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("downlink points to block in index \"%s\" whose level is not one level down",
RelationGetRelationName(state->rel)),
errdetail_internal("Top parent/under check block=%u block pointed to=%u expected level=%u level in pointed to block=%u.",
blkno, childblk,
- level - 1, copaque->btpo.level)));
+ level - 1, copaque->btpo_level)));
- level = copaque->btpo.level;
+ level = copaque->btpo_level;
itemid = PageGetItemIdCareful(state, childblk, child,
P_FIRSTDATAKEY(copaque));
itup = (IndexTuple) PageGetItem(child, itemid);
@@ -2407,7 +2408,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
errmsg("internal index block lacks downlink in index \"%s\"",
RelationGetRelationName(state->rel)),
errdetail_internal("Block=%u level=%u page lsn=%X/%X.",
- blkno, opaque->btpo.level,
+ blkno, opaque->btpo_level,
(uint32) (pagelsn >> 32),
(uint32) pagelsn)));
}
@@ -3002,21 +3003,26 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
}
/*
- * Deleted pages have no sane "level" field, so can only check non-deleted
- * page level
+ * Deleted pages that still use the old 32-bit XID representation have no
+ * sane "level" field because they type pun the field, but all other pages
+ * (including pages deleted on Postgres 14+) have a valid value.
*/
- if (P_ISLEAF(opaque) && !P_ISDELETED(opaque) && opaque->btpo.level != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("invalid leaf page level %u for block %u in index \"%s\"",
- opaque->btpo.level, blocknum, RelationGetRelationName(state->rel))));
+ if (!P_ISDELETED(opaque) || P_HAS_FULLXID(opaque))
+ {
+ /* Okay, no reason not to trust btpo_level field from page */
- if (!P_ISLEAF(opaque) && !P_ISDELETED(opaque) &&
- opaque->btpo.level == 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("invalid internal page level 0 for block %u in index \"%s\"",
- blocknum, RelationGetRelationName(state->rel))));
+ if (P_ISLEAF(opaque) && opaque->btpo_level != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("invalid leaf page level %u for block %u in index \"%s\"",
+ opaque->btpo_level, blocknum, RelationGetRelationName(state->rel))));
+
+ if (!P_ISLEAF(opaque) && opaque->btpo_level == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("invalid internal page level 0 for block %u in index \"%s\"",
+ blocknum, RelationGetRelationName(state->rel))));
+ }
/*
* Sanity checks for number of items on page.
@@ -3064,7 +3070,8 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
* from version 9.4 on, so do the same here. See _bt_pagedel() for full
* details.
*
- * Internal pages should never have garbage items, either.
+ * Also check that internal pages have no garbage items, and that no page
+ * has an invalid combination of page deletion related page level flags.
*/
if (!P_ISLEAF(opaque) && P_ISHALFDEAD(opaque))
ereport(ERROR,
@@ -3079,6 +3086,18 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
errmsg("internal page block %u in index \"%s\" has garbage items",
blocknum, RelationGetRelationName(state->rel))));
+ if (P_HAS_FULLXID(opaque) && !P_ISDELETED(opaque))
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("full transaction id page flag appears in non-deleted block %u in index \"%s\"",
+ blocknum, RelationGetRelationName(state->rel))));
+
+ if (P_ISDELETED(opaque) && P_ISHALFDEAD(opaque))
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("deleted page block %u in index \"%s\" is half-dead",
+ blocknum, RelationGetRelationName(state->rel))));
+
return page;
}
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8bb180bbbe..b7725b572f 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -75,11 +75,7 @@ typedef struct BTPageStat
/* opaque data */
BlockNumber btpo_prev;
BlockNumber btpo_next;
- union
- {
- uint32 level;
- TransactionId xact;
- } btpo;
+ uint32 btpo_level;
uint16 btpo_flags;
BTCycleId btpo_cycleid;
} BTPageStat;
@@ -112,9 +108,33 @@ GetBTPageStatistics(BlockNumber blkno, Buffer buffer, BTPageStat *stat)
/* page type (flags) */
if (P_ISDELETED(opaque))
{
- stat->type = 'd';
- stat->btpo.xact = opaque->btpo.xact;
- return;
+ /* We divide deleted pages into leaf ('d') or internal ('D') */
+ if (P_ISLEAF(opaque) || !P_HAS_FULLXID(opaque))
+ stat->type = 'd';
+ else
+ stat->type = 'D';
+
+ /*
+ * Report safexid in a deleted page.
+ *
+ * Handle pg_upgrade'd deleted pages that used the previous safexid
+ * representation in btpo_level field (this used to be a union type
+ * called "bpto").
+ */
+ if (P_HAS_FULLXID(opaque))
+ {
+ FullTransactionId safexid = BTPageGetDeleteXid(page);
+
+ elog(NOTICE, "deleted page from block %u has safexid %u:%u",
+ blkno, EpochFromFullTransactionId(safexid),
+ XidFromFullTransactionId(safexid));
+ }
+ else
+ elog(NOTICE, "deleted page from block %u has safexid %u",
+ blkno, opaque->btpo_level);
+
+ /* Don't interpret BTDeletedPageData as index tuples */
+ maxoff = InvalidOffsetNumber;
}
else if (P_IGNORE(opaque))
stat->type = 'e';
@@ -128,7 +148,7 @@ GetBTPageStatistics(BlockNumber blkno, Buffer buffer, BTPageStat *stat)
/* btpage opaque data */
stat->btpo_prev = opaque->btpo_prev;
stat->btpo_next = opaque->btpo_next;
- stat->btpo.level = opaque->btpo.level;
+ stat->btpo_level = opaque->btpo_level;
stat->btpo_flags = opaque->btpo_flags;
stat->btpo_cycleid = opaque->btpo_cycleid;
@@ -237,7 +257,7 @@ bt_page_stats_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
values[j++] = psprintf("%u", stat.free_size);
values[j++] = psprintf("%u", stat.btpo_prev);
values[j++] = psprintf("%u", stat.btpo_next);
- values[j++] = psprintf("%u", (stat.type == 'd') ? stat.btpo.xact : stat.btpo.level);
+ values[j++] = psprintf("%u", stat.btpo_level);
values[j++] = psprintf("%d", stat.btpo_flags);
tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
@@ -503,10 +523,14 @@ bt_page_items_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
opaque = (BTPageOpaque) PageGetSpecialPointer(uargs->page);
- if (P_ISDELETED(opaque))
- elog(NOTICE, "page is deleted");
-
- fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ if (!P_ISDELETED(opaque))
+ fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ else
+ {
+ /* Don't interpret BTDeletedPageData as index tuples */
+ elog(NOTICE, "page from block " INT64_FORMAT " is deleted", blkno);
+ fctx->max_calls = 0;
+ }
uargs->leafpage = P_ISLEAF(opaque);
uargs->rightmost = P_RIGHTMOST(opaque);
@@ -603,7 +627,14 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (P_ISDELETED(opaque))
elog(NOTICE, "page is deleted");
- fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ if (!P_ISDELETED(opaque))
+ fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ else
+ {
+ /* Don't interpret BTDeletedPageData as index tuples */
+ elog(NOTICE, "page from block is deleted");
+ fctx->max_calls = 0;
+ }
uargs->leafpage = P_ISLEAF(opaque);
uargs->rightmost = P_RIGHTMOST(opaque);
@@ -692,10 +723,7 @@ bt_metap(PG_FUNCTION_ARGS)
/*
* We need a kluge here to detect API versions prior to 1.8. Earlier
- * versions incorrectly used int4 for certain columns. This caused
- * various problems. For example, an int4 version of the "oldest_xact"
- * column would not work with TransactionId values that happened to exceed
- * PG_INT32_MAX.
+ * versions incorrectly used int4 for certain columns.
*
* There is no way to reliably avoid the problems created by the old
* function definition at this point, so insist that the user update the
@@ -723,7 +751,8 @@ bt_metap(PG_FUNCTION_ARGS)
*/
if (metad->btm_version >= BTREE_NOVAC_VERSION)
{
- values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
+ values[j++] = psprintf(INT64_FORMAT,
+ (int64) metad->btm_last_cleanup_num_delpages);
values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
values[j++] = metad->btm_allequalimage ? "t" : "f";
}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index a7632be36a..ede3e55935 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -3,16 +3,16 @@ INSERT INTO test1 VALUES (72057594037927937, 'text');
CREATE INDEX test1_a_idx ON test1 USING btree (a);
\x
SELECT * FROM bt_metap('test1_a_idx');
--[ RECORD 1 ]-----------+-------
-magic | 340322
-version | 4
-root | 1
-level | 0
-fastroot | 1
-fastlevel | 0
-oldest_xact | 0
-last_cleanup_num_tuples | -1
-allequalimage | t
+-[ RECORD 1 ]-------------+-------
+magic | 340322
+version | 4
+root | 1
+level | 0
+fastroot | 1
+fastlevel | 0
+last_cleanup_num_delpages | 0
+last_cleanup_num_tuples | -1
+allequalimage | t
SELECT * FROM bt_page_stats('test1_a_idx', -1);
ERROR: invalid block number
diff --git a/contrib/pageinspect/pageinspect--1.8--1.9.sql b/contrib/pageinspect/pageinspect--1.8--1.9.sql
index 79a42a7b11..4b03b84478 100644
--- a/contrib/pageinspect/pageinspect--1.8--1.9.sql
+++ b/contrib/pageinspect/pageinspect--1.8--1.9.sql
@@ -66,6 +66,23 @@ RETURNS smallint
AS 'MODULE_PATHNAME', 'page_checksum_1_9'
LANGUAGE C STRICT PARALLEL SAFE;
+--
+-- bt_metap()
+--
+DROP FUNCTION bt_metap(text);
+CREATE FUNCTION bt_metap(IN relname text,
+ OUT magic int4,
+ OUT version int4,
+ OUT root int8,
+ OUT level int8,
+ OUT fastroot int8,
+ OUT fastlevel int8,
+ OUT last_cleanup_num_delpages int8,
+ OUT last_cleanup_num_tuples float8,
+ OUT allequalimage boolean)
+AS 'MODULE_PATHNAME', 'bt_metap'
+LANGUAGE C STRICT PARALLEL SAFE;
+
--
-- bt_page_stats()
--
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index b1ce0d77d7..5368bb30f0 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -283,8 +283,12 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
page = BufferGetPage(buffer);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- /* Determine page type, and update totals */
-
+ /*
+ * Determine page type, and update totals.
+ *
+ * Note that we arbitrarily bucket deleted pages together without
+ * considering if they're leaf pages or internal pages.
+ */
if (P_ISDELETED(opaque))
indexStat.deleted_pages++;
else if (P_IGNORE(opaque))
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index c733341984..67403ecc55 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -298,16 +298,16 @@ test=# SELECT t_ctid, raw_flags, combined_flags
index's metapage. For example:
<screen>
test=# SELECT * FROM bt_metap('pg_cast_oid_index');
--[ RECORD 1 ]-----------+-------
-magic | 340322
-version | 4
-root | 1
-level | 0
-fastroot | 1
-fastlevel | 0
-oldest_xact | 582
-last_cleanup_num_tuples | 1000
-allequalimage | f
+-[ RECORD 1 ]-------------+-------
+magic | 340322
+version | 4
+root | 1
+level | 0
+fastroot | 1
+fastlevel | 0
+last_cleanup_num_delpages | 0
+last_cleanup_num_tuples | 230
+allequalimage | f
</screen>
</para>
</listitem>
--
2.27.0
Attachment: v4-0003-Add-pages_newly_deleted-to-VACUUM-VERBOSE.patch (application/octet-stream)
From 7806a193b32a20a11096815a86c405eee639baf0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 14 Feb 2021 17:38:34 -0800
Subject: [PATCH v4 3/3] Add pages_newly_deleted to VACUUM VERBOSE.
pages_newly_deleted reports on the number of pages deleted by the
current VACUUM operation. The pages_deleted field continues to report
on the total number of deleted pages in the index (as well as pages that
are recyclable due to being zeroed in rare cases), without regard to
whether or not this VACUUM operation deleted them.
---
src/include/access/genam.h | 11 ++++++++---
src/backend/access/gin/ginvacuum.c | 1 +
src/backend/access/gist/gistvacuum.c | 4 +++-
src/backend/access/heap/vacuumlazy.c | 4 +++-
src/backend/access/nbtree/nbtpage.c | 4 ++++
src/backend/access/nbtree/nbtree.c | 12 +++++++++---
src/backend/access/spgist/spgvacuum.c | 1 +
7 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index ffa1a4c80d..13971c8b2a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -63,8 +63,12 @@ typedef struct IndexVacuumInfo
* of which this is just the first field; this provides a way for ambulkdelete
* to communicate additional private data to amvacuumcleanup.
*
- * Note: pages_deleted and pages_free refer to free space within the index
- * file. Some index AMs may compute num_index_tuples by reference to
+ * Note: pages_newly_deleted is the number of pages in the index that were
+ * deleted by the current vacuum operation. pages_deleted and pages_free
+ * refer to free space within the index file (and so pages_deleted must be >=
+ * pages_newly_deleted).
+ *
+ * Note: Some index AMs may compute num_index_tuples by reference to
* num_heap_tuples, in which case they should copy the estimated_count field
* from IndexVacuumInfo.
*/
@@ -74,7 +78,8 @@ typedef struct IndexBulkDeleteResult
bool estimated_count; /* num_index_tuples is an estimate */
double num_index_tuples; /* tuples remaining */
double tuples_removed; /* # removed during vacuum operation */
- BlockNumber pages_deleted; /* # unused pages in index */
+ BlockNumber pages_newly_deleted; /* # pages marked deleted by us */
+ BlockNumber pages_deleted; /* # pages marked deleted (could be by us) */
BlockNumber pages_free; /* # pages available for reuse */
} IndexBulkDeleteResult;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a0453b36cd..a276eb020b 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -231,6 +231,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
END_CRIT_SECTION();
+ gvs->result->pages_newly_deleted++;
gvs->result->pages_deleted++;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index ddecb8ab18..196678d91f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -138,6 +138,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*/
stats->estimated_count = false;
stats->num_index_tuples = 0;
+ stats->pages_newly_deleted = 0;
stats->pages_deleted = 0;
stats->pages_free = 0;
@@ -281,8 +282,8 @@ restart:
{
/* Okay to recycle this page */
RecordFreeIndexPage(rel, blkno);
- vstate->stats->pages_free++;
vstate->stats->pages_deleted++;
+ vstate->stats->pages_free++;
}
else if (GistPageIsDeleted(page))
{
@@ -636,6 +637,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
/* mark the page as deleted */
MarkBufferDirty(leafBuffer);
GistPageSetDeleted(leafPage, txid);
+ stats->pages_newly_deleted++;
stats->pages_deleted++;
/* remove the downlink from the parent */
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index e9bd6dba80..7348f91a2e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2521,9 +2521,11 @@ lazy_cleanup_index(Relation indrel,
(*stats)->num_index_tuples,
(*stats)->num_pages),
errdetail("%.0f index row versions were removed.\n"
- "%u index pages have been deleted, %u are currently reusable.\n"
+ "%u index pages were deleted.\n"
+ "%u index pages are currently deleted, of which %u are currently reusable.\n"
"%s.",
(*stats)->tuples_removed,
+ (*stats)->pages_newly_deleted,
(*stats)->pages_deleted, (*stats)->pages_free,
pg_rusage_show(&ru0))));
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 408230cb67..f5aa061254 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2677,11 +2677,15 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_relbuf(rel, buf);
/*
+ * Maintain pages_newly_deleted, which is simply the number of pages
+ * deleted by the ongoing VACUUM operation.
+ *
* Maintain pages_deleted in a way that takes into account how
* btvacuumpage() will count deleted pages that have yet to become
* scanblkno -- only count page when it's not going to get that treatment
* later on.
*/
+ stats->pages_newly_deleted++;
if (target <= scanblkno)
stats->pages_deleted++;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c3b32bb71c..b9ba521acd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -922,6 +922,9 @@ _bt_newly_deleted_pages_recycle(Relation rel, BTVacState *vstate)
{
IndexBulkDeleteResult *stats = vstate->stats;
+ Assert(vstate->ndeleted > 0);
+ Assert(stats->pages_newly_deleted >= vstate->ndeleted);
+
/* Recompute VACUUM XID boundaries */
(void) GetOldestNonRemovableTransactionId(NULL);
@@ -1057,6 +1060,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
stats->estimated_count = false;
stats->num_index_tuples = 0;
stats->tuples_removed = 0;
+ stats->pages_newly_deleted = 0;
stats->pages_deleted = 0;
stats->pages_free = 0;
@@ -1307,8 +1311,8 @@ backtrack:
else if (P_ISHALFDEAD(opaque))
{
/*
- * Half-dead leaf page. Try to delete now. Might update
- * pages_deleted below.
+ * Half-dead leaf page. Try to delete now. Might end up incrementing
+ * pages_newly_deleted/pages_deleted inside _bt_pagedel.
*/
attempt_pagedel = true;
}
@@ -1520,7 +1524,9 @@ backtrack:
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
/*
- * _bt_pagedel maintains the bulk delete stats on our behalf
+ * _bt_pagedel maintains the bulk delete stats on our behalf;
+ * pages_newly_deleted and pages_deleted are likely to be incremented
+ * during call
*/
Assert(blkno == scanblkno);
_bt_pagedel(rel, buf, vstate);
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0d02a02222..a9ffca5183 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -891,6 +891,7 @@ spgvacuumscan(spgBulkDeleteState *bds)
/* Report final stats */
bds->stats->num_pages = num_pages;
+ bds->stats->pages_newly_deleted = bds->stats->pages_deleted;
bds->stats->pages_free = bds->stats->pages_deleted;
}
--
2.27.0
On Sun, Feb 14, 2021 at 3:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Feb 12, 2021 at 9:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Feb 12, 2021 at 8:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I agree that there already are huge problems in that case. But I think
we need to consider an append-only case as well; after bulk deletion
on an append-only table, vacuum deletes heap tuples and index tuples,
marking some index pages as dead and setting an XID into btpo.xact.
Since we trigger autovacuums even by insertions based on
autovacuum_vacuum_insert_scale_factor/threshold, autovacuum will run on
the table again. But if there is a long-running query, a "wasted"
cleanup scan could happen many times depending on the values of
autovacuum_vacuum_insert_scale_factor/threshold and
vacuum_cleanup_index_scale_factor. This should not happen in the old
code. I agree this is a DBA problem but it also means this could bring
another new problem in a long-running query case.

I see your point.
My guess is that this concern of yours is somehow related to how we do
deletion and recycling *in general*. Currently (and even in v3 of the
patch), we assume that recycling the pages that a VACUUM operation
deletes will happen "eventually". This kind of makes sense when you
have "typical vacuuming" -- deletes/updates, and no big bursts, rare
bulk deletes, etc.

But when you do have a mixture of different triggering conditions,
which is quite possible, it is difficult to understand what
"eventually" actually means...BTW, I am thinking about making recycling take place for pages that
were deleted during the same VACUUM. We can just use a
work_mem-limited array to remember a list of blocks that are deleted
but not yet recyclable (plus the XID found in the block).

...which brings me back to this idea.
I've prototyped this. It works really well. In most cases the
prototype makes VACUUM operations with nbtree index page deletions
also recycle the pages that were deleted, at the end of the
btvacuumscan(). We do very little or no "indefinite deferring" work
here. This has obvious advantages, of course, but it also has a
non-obvious advantage: the awkward question of concerning "what
eventually actually means" with mixed triggering conditions over time
mostly goes away. So perhaps this actually addresses your concern,
Masahiko.
Yes. I think this would simplify the problem by resolving almost all
problems related to indefinitely deferring page recycling.
We will be able to recycle almost all just-deleted pages in practice
especially when btvacuumscan() took a long time. And there would not
be a noticeable downside, I think.
BTW if btree index starts to use maintenance_work_mem for this purpose,
we also need to set amusemaintenanceworkmem to true, which is
considered when doing a parallel vacuum.
I've been testing this with BenchmarkSQL [1], which has several
indexes that regularly need page deletions. There is also a realistic
"life cycle" to the data in these indexes. I added custom
instrumentation to display information about what's going on with page
deletion when the benchmark is run. I wrote a quick-and-dirty patch
that makes log_autovacuum show the same information that you see about
index page deletion when VACUUM VERBOSE is run (including the new
pages_newly_deleted field from my patch). With this particular
TPC-C/BenchmarkSQL workload, VACUUM seems to consistently manage to go
on to place every page that it deletes in the FSM without leaving
anything to the next VACUUM. There are a very small number of
exceptions where we "only" manage to recycle maybe 95% of the pages
that were deleted.
Great!
The race condition that nbtree avoids by deferring recycling was
always a narrow one, outside of the extremes -- the way we defer has
always been overkill. It's almost always unnecessary to delay placing
deleted pages in the FSM until the *next* VACUUM. We only have to
delay it until the end of the *same* VACUUM -- why wait until the next
VACUUM if we don't have to? In general this deferring recycling
business has nothing to do with MVCC/GC/whatever, and yet the code
seems to suggest that it does. While it is convenient to use an XID
for page deletion and recycling as a way of implementing what Lanin &
Shasha call "the drain technique" [2], all we have to do is prevent
certain race conditions. This is all about the index itself, the data
structure, how it is maintained -- nothing more. It almost seems
obvious to me.
Agreed.
It's still possible to imagine extremes. Extremes that even the "try
to recycle pages we ourselves deleted when we reach the end of
btvacuumscan()" version of my patch cannot deal with. Maybe it really
is true that it's inherently impossible to recycle a deleted page even
at the end of a VACUUM -- maybe a long-running transaction (that could
in principle have a stale link to our deleted page) starts before we
VACUUM, and lasts after VACUUM finishes. So it's just not safe. When
that happens, we're back to having the original problem: we're relying
on some *future* VACUUM operation to do that for us at some indefinite
point in the future. It's fair to wonder: What are the implications of
that? Are we not back to square one? Don't we have the same "what does
'eventually' really mean" problem once again?I think that that's okay, because this remaining case is a *truly*
extreme case (especially with a large index, where index vacuuming
will naturally take a long time).
Right.
It will be rare. But more importantly, the fact that this scenario is now
an extreme case justifies treating it as an extreme case. We can teach
_bt_vacuum_needs_cleanup() to recognize it as an extreme case, too. In
particular, I think that it will now be okay to increase the threshold
applied when considering deleted pages inside
_bt_vacuum_needs_cleanup(). It was 2.5% of the index size in v3 of the
patch. But in v4, which has the new recycling enhancement, I think
that it would be sensible to make it 5%, or maybe even 10%. This
naturally makes Masahiko's problem scenario unlikely to actually
result in a truly wasted call to btvacuumscan(). The number of pages
that the metapage indicates are "deleted but not yet placed in the
FSM" will be close to the theoretical minimum, because we're no longer
naively throwing away information about which specific pages will be
recyclable soon. Which is what the current approach does, really.
Yeah, increasing the threshold would solve the problem in most cases.
Given that nbtree index page deletion is unlikely to happen in
practice, having the threshold 5% or 10% seems to avoid the problem in
nearly 100% of cases, I think.
Another idea I came up with (maybe on top of your idea above) is to
change btm_oldest_btpo_xact to a 64-bit XID and store the *newest*
btpo.xact XID among all deleted pages when the total amount of deleted
pages exceeds 2% of the index. That way, we surely can recycle more than
2% of the index when the XID becomes older than the global xmin.
Also, maybe we can record deleted pages in the FSM even without deferring
and check them when re-using. That is, when we get a free page from the
FSM we check if the page is really recyclable (maybe _bt_getbuf() already
does this?). IOW, a deleted page can be recycled only when it's
requested to be reused. If btpo.xact is a 64-bit XID we never need to
worry about the case where a deleted page is never requested to be
reused.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, Feb 15, 2021 at 3:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Yes. I think this would simplify the problem by resolving almost all
problems related to indefinitely deferring page recycling.

We will be able to recycle almost all just-deleted pages in practice
especially when btvacuumscan() took a long time. And there would not
be a noticeable downside, I think.
Great!
BTW if btree index starts to use maintenance_work_mem for this purpose,
we also need to set amusemaintenanceworkmem to true, which is
considered when doing a parallel vacuum.
I was just going to use work_mem. This should be okay. Note that
CREATE INDEX uses an additional work_mem allocation when building a
unique index, for the second spool/tuplesort. That seems like a
precedent that I can follow here.
Right now the BTPendingRecycle entries that the patch uses to store
information about a page that the current VACUUM deleted (and may yet
be able to place in the FSM) are each 16 bytes (including alignment
overhead). I could probably make them smaller with a little work, but
even now that's quite small. Even with the default 4MiB work_mem
setting we can fit information about 262144 pages all at once. That's
2GiB worth of deleted index pages, which is generally much more than
we'll need.
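To make that arithmetic concrete, each entry only needs to pair a block
number with the deleted page's safexid. A minimal sketch, using my own
field names rather than necessarily the exact layout in the patch:

typedef struct BTPendingRecycle
{
    BlockNumber       blkno;    /* page deleted by this VACUUM */
    FullTransactionId safexid;  /* recyclable once this XID is no longer
                                 * visible to any possible snapshot */
} BTPendingRecycle;

A 4 byte BlockNumber plus an 8 byte FullTransactionId rounds up to 16
bytes per array element once alignment padding is included, which is
where the 262144-entries-per-4MiB figure comes from.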
Yeah, increasing the threshold would solve the problem in most cases.
Given that nbtree index page deletion is unlikely to happen in
practice, having the threshold 5% or 10% seems to avoid the problem in
nearly 100% of cases, I think.
Of course it all depends on workload/index characteristics, in the
end. It is very rare to delete a percentage of the index that exceeds
autovacuum_vacuum_scale_factor -- that's the important thing here IMV.
Another idea I come up with (maybe on top of above your idea) is to
change btm_oldest_btpo_xact to 64-bit XID and store the *newest*
btpo.xact XID among all deleted pages when the total amount of deleted
pages exceeds 2% of index. That way, we surely can recycle more than
2% of index when the XID becomes older than the global xmin.
You could make my basic approach to recycling deleted pages earlier
(ideally at the end of the same btvacuumscan() that deleted the pages
in the first place) more sophisticated in a variety of ways. These are
all subject to diminishing returns, though.
I've already managed to recycle close to 100% of all B-Tree pages
during the same VACUUM with a very simple approach -- at least if we
assume BenchmarkSQL is representative. It is hard to know how much
more effort can be justified. To be clear, I'm not saying that an
improved design cannot be justified now or in the future (BenchmarkSQL
is not like many workloads that people use Postgres for). I'm just
saying that I *don't know* where to draw the line. Any particular
place that we draw the line feels a little arbitrary to me. This
includes my own choice of the work_mem-limited BTPendingRecycle array.
My patch currently works that way because it's simple -- no other
reason.
Any scheme to further improve the "work_mem-limited BTPendingRecycle
array" design from my patch boils down to this: A new approach that
makes recycling of any remaining deleted pages take place "before too
long": After the end of the btvacuumscan() BTPendingRecycle array
stuff (presumably that didn't work out in cases where an improved
approach matters), but before the next VACUUM takes place (since that
will do the required recycling anyway, unless it's unable to do any
work at all, in which case it hardly matters). Here are two ideas of
my own in this same class as your idea:
1. Remember to do some of the BTPendingRecycle array FSM processing
stuff in btvacuumcleanup() -- defer some of the recycling of pages
recorded in BTPendingRecycle entries (paged deleted during
btbulkdelete() for the same VACUUM) until btvacuumcleanup() is called.
Right now btvacuumcleanup() will always do nothing when btbulkdelete()
was called earlier. But that's just a current nbtree convention, and
is no reason to not do this (we don't have to scan the index again at
all). The advantage of teaching btvacuumcleanup() to do this is that
it delays the "BTPendingRecycle array FSM processing" stuff until the
last moment that it is still easy to use the in-memory array (because
we haven't freed it yet). In general, doing it later makes it more
likely that we'll successfully recycle the pages. Though in general it
might not make any difference -- so we're still hoping that the
workload allows us to recycle everything we deleted, without making
the design much more complicated than what I posted already.
(BTW I see that you reviewed commit 4e514c61, so you must have thought
about the trade-off between doing deferred recycling in
amvacuumcleanup() vs ambulkdelete(), when to call
IndexFreeSpaceMapVacuum(), etc. But there is no reason why we cannot
implement this idea while calling IndexFreeSpaceMapVacuum() during
both btvacuumcleanup() and btbulkdelete(), so that we get the best of
both worlds -- fast recycling *and* more delayed processing that is
more likely to ultimately succeed.)
2. Remember/serialize the BTPendingRecycle array when we realize that
we cannot put all recyclable pages in the FSM at the end of the
current btvacuumscan(), and then use an autovacuum work item to
process them before too long -- a call to AutoVacuumRequestWork()
could even serialize the data on disk.
Idea 2 has the advantage of allowing retries -- eventually it will be
safe to recycle the pages, if we just wait long enough.
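In other words, something along these lines, with a hypothetical new
work item type (today the only AutoVacuumWorkItemType is the BRIN
summarization item):

    /* after serializing the still-pending BTPendingRecycle entries */
    AutoVacuumRequestWork(AVW_BTRecycleDeletedPages,
                          RelationGetRelid(rel), InvalidBlockNumber);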
Anyway, I'm probably not going to pursue either of the 2 ideas for
Postgres 14. I'm mentioning these ideas now because the trade-offs
show that there is no perfect design for this deferring recycling
stuff. Whatever we do, we should accept that there is no perfect
design.
Actually, there is one more reason why I bring up idea 1 now: I want
to hear your thoughts on the index AM API questions now, which idea 1
touches on. Ideally all of the details around the index AM VACUUM APIs
(i.e. when and where the extra work happens -- btvacuumcleanup() vs
btbulkdelete()) won't need to change much in the future. I worry about
getting this index AM API stuff right, at least a little.
Also, maybe we can record deleted pages in the FSM even without deferring
and check them when re-using. That is, when we get a free page from the
FSM we check if the page is really recyclable (maybe _bt_getbuf() already
does this?). IOW, a deleted page can be recycled only when it's
requested to be reused. If btpo.xact is a 64-bit XID we never need to
worry about the case where a deleted page is never requested to be
reused.
I've thought about that too (both now and in the past). You're right
about _bt_getbuf() -- it checks the XID, at least on the master
branch. I took that XID check out in v4 of the patch, but I am now
starting to have my doubts about that particular choice. (I'm probably
going to restore the XID check in _bt_getbuf in v5 of the patch.)
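For reference, the consumer-side check being discussed looks roughly
like this; a simplified sketch of the _bt_getbuf() FSM loop, recast in
terms of the patch's two-argument _bt_page_recyclable(), so treat the
details as illustrative rather than as the v5 code:

    BlockNumber blkno;
    Buffer      buf;

    for (;;)
    {
        blkno = GetFreeIndexPage(rel);
        if (blkno == InvalidBlockNumber)
            break;              /* FSM exhausted -- extend the relation */
        buf = ReadBuffer(rel, blkno);
        if (ConditionalLockBuffer(buf))
        {
            Page         page = BufferGetPage(buf);
            BTPageOpaque opaque = NULL;

            if (!PageIsNew(page))
                opaque = (BTPageOpaque) PageGetSpecialPointer(page);

            /* Only reuse a page that is new or provably recyclable */
            if (!opaque || _bt_page_recyclable(opaque, page))
            {
                /*
                 * Okay to reuse.  The real code logs an
                 * XLOG_BTREE_REUSE_PAGE record for hot standby, then
                 * re-initializes the page before returning it.
                 */
                return buf;
            }

            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        }
        ReleaseBuffer(buf);
    }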
I took the XID-is-recyclable check out in v4 of the patch because it
might leak pages in rare cases -- which is not a new problem.
_bt_getbuf() currently has a remarkably relaxed attitude about leaking
pages from the FSM (it is more relaxed about it than I am, certainly)
-- but why should we just accept leaking pages like that? My new
doubts about it are non-specific, though. We know that the FSM isn't
crash safe -- but I think that that reduces to "practically speaking,
we can never 100% trust the FSM". Which makes me nervous. I worry that
the FSM can do something completely evil and crazy in rare cases.
It's not just crash safety. The FSM's fsm_search_avail() function
currently changes the fp_next_slot field with only a shared buffer
lock held. It's an int, which is supposed to "be atomic on most
platforms". But we should be using real atomic ops. So the FSM is
generally...kind of wonky.
In an ideal world, nbtree page deletion + recycling would have crash
safety built in. I don't think that it makes sense to not have free
space management without crash safety in the case of index AMs,
because it's just not worth it with whole-page units of free space
(heapam is another story). A 100% crash-safe design would naturally
shift the problem of nbtree page recycle safety from the
producer/VACUUM side, to the consumer/_bt_getbuf() side, which I agree
would be a real improvement. But these long standing FSM issues are
not going to change for Postgres 14. And so changing _bt_getbuf() to
do clever things with XIDs won't be possible for Postgres 14 IMV.
--
Peter Geoghegan
On Mon, Feb 15, 2021 at 7:26 PM Peter Geoghegan <pg@bowt.ie> wrote:
Actually, there is one more reason why I bring up idea 1 now: I want
to hear your thoughts on the index AM API questions now, which idea 1
touches on. Ideally all of the details around the index AM VACUUM APIs
(i.e. when and where the extra work happens -- btvacuumcleanup() vs
btbulkdelete()) won't need to change much in the future. I worry about
getting this index AM API stuff right, at least a little.
Speaking of problems like this, I think I spotted an old one: we call
_bt_update_meta_cleanup_info() in either btbulkdelete() or
btvacuumcleanup(). I think that we should always call it in
btvacuumcleanup(), though -- even in cases where there is no call to
btvacuumscan() inside btvacuumcleanup() (because btvacuumscan()
happened earlier instead, during the btbulkdelete() call).
This makes the value of IndexVacuumInfo.num_heap_tuples (which is what
we store in the metapage) much more accurate -- right now it's always
pg_class.reltuples from *before* the VACUUM started. And so the
btm_last_cleanup_num_heap_tuples value in a nbtree metapage is often
kind of inaccurate.
This "estimate during ambulkdelete" issue is documented here (kind of):
/*
* Struct for input arguments passed to ambulkdelete and amvacuumcleanup
*
* num_heap_tuples is accurate only when estimated_count is false;
* otherwise it's just an estimate (currently, the estimate is the
* prior value of the relation's pg_class.reltuples field, so it could
* even be -1). It will always just be an estimate during ambulkdelete.
*/
typedef struct IndexVacuumInfo
{
...
}
The name of the metapage field is already
btm_last_cleanup_num_heap_tuples, which already suggests the approach
that I propose now. So why don't we do it like that already?
(Thinks some more...)
I wonder: did this detail change at the last minute during the
development of the feature (just before commit 857f9c36) back in early
2018? That change would have made it easier to deal with
oldestBtpoXact/btm_oldest_btpo_xact, which IIRC was a late addition to
the patch -- so maybe it's truly an accident that the code doesn't
work the way that I suggest it should already. (It's annoying to make
state from btbulkdelete() appear in btvacuumcleanup(), unless it's
from IndexVacuumInfo or something -- I can imagine this changing at
the last minute, just for that reason.)
Do you think that this needs to be treated as a bug in the
backbranches, Masahiko? I'm not sure...
In any case we should probably make this change as part of Postgres
14. Don't you think? It's certainly easy to do it this way now, since
there will be no need to keep around a oldestBtpoXact value until
btvacuumcleanup() (in the common case where btbulkdelete() is where we
call btvacuumscan()). The new btm_last_cleanup_num_delpages field
(which replaces btm_oldest_btpo_xact) has a value that just comes from
the bulk stats, which is easy anyway.
--
Peter Geoghegan
On Tue, Feb 16, 2021 at 3:52 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Feb 15, 2021 at 7:26 PM Peter Geoghegan <pg@bowt.ie> wrote:
Actually, there is one more reason why I bring up idea 1 now: I want
to hear your thoughts on the index AM API questions now, which idea 1
touches on. Ideally all of the details around the index AM VACUUM APIs
(i.e. when and where the extra work happens -- btvacuumcleanup() vs
btbulkdelete()) won't need to change much in the future. I worry about
getting this index AM API stuff right, at least a little.
Speaking of problems like this, I think I spotted an old one: we call
_bt_update_meta_cleanup_info() in either btbulkdelete() or
btvacuumcleanup(). I think that we should always call it in
btvacuumcleanup(), though -- even in cases where there is no call to
btvacuumscan() inside btvacuumcleanup() (because btvacuumscan()
happened earlier instead, during the btbulkdelete() call).
This makes the value of IndexVacuumInfo.num_heap_tuples (which is what
we store in the metapage) much more accurate -- right now it's always
pg_class.reltuples from *before* the VACUUM started. And so the
btm_last_cleanup_num_heap_tuples value in a nbtree metapage is often
kind of inaccurate.
This "estimate during ambulkdelete" issue is documented here (kind of):
/*
* Struct for input arguments passed to ambulkdelete and amvacuumcleanup
*
* num_heap_tuples is accurate only when estimated_count is false;
* otherwise it's just an estimate (currently, the estimate is the
* prior value of the relation's pg_class.reltuples field, so it could
* even be -1). It will always just be an estimate during ambulkdelete.
*/
typedef struct IndexVacuumInfo
{
...
}
The name of the metapage field is already
btm_last_cleanup_num_heap_tuples, which already suggests the approach
that I propose now. So why don't we do it like that already?
(Thinks some more...)
I wonder: did this detail change at the last minute during the
development of the feature (just before commit 857f9c36) back in early
2018? That change would have made it easier to deal with
oldestBtpoXact/btm_oldest_btpo_xact, which IIRC was a late addition to
the patch -- so maybe it's truly an accident that the code doesn't
work the way that I suggest it should already. (It's annoying to make
state from btbulkdelete() appear in btvacuumcleanup(), unless it's
from IndexVacuumInfo or something -- I can imagine this changing at
the last minute, just for that reason.)
Do you think that this needs to be treated as a bug in the
backbranches, Masahiko? I'm not sure...
Ugh, yes, I think it's a bug.
When developing this feature, in an old version patch, we used to set
invalid values to both btm_oldest_btpo_xact and
btm_last_cleanup_num_heap_tuples in btbulkdelete() to reset these
values. But we decided to set valid values to both even in
btbulkdelete(). I believe that decision was correct in terms of
btm_oldest_btpo_xact because with the old version patch we will do an
unnecessary index scan during btvacuumcleanup(). But it’s wrong in
terms of btm_last_cleanup_num_heap_tuples, as you pointed out.
This bug would make the vacuum_cleanup_index_scale_factor check
untrustworthy. So I think it's better to backpatch, but note that to fix
this issue properly in a case where a vacuum called btbulkdelete()
earlier, we should probably update only btm_oldest_btpo_xact in
btbulkdelete() and then update btm_last_cleanup_num_heap_tuples in
btvacuumcleanup(). In this case, we don't know the oldest btpo.xact
among the deleted pages by the time we reach btvacuumcleanup(). This
means that we would need to update the meta page twice, leading to
WAL-logging it twice. Since we already could update the meta page more
than once when a vacuum calls btbulkdelete() multiple times, I think it
would not be a problem, though.
In any case we should probably make this change as part of Postgres
14. Don't you think? It's certainly easy to do it this way now, since
there will be no need to keep around an oldestBtpoXact value until
btvacuumcleanup() (in the common case where btbulkdelete() is where we
call btvacuumscan()). The new btm_last_cleanup_num_delpages field
(which replaces btm_oldest_btpo_xact) has a value that just comes from
the bulk stats, which is easy anyway.
Agreed.
As I mentioned above, we might need to consider how btbulkdelete() can
tell btvacuumcleanup() btm_last_cleanup_num_delpages in a case where a
vacuum called btbulkdelete earlier. During parallel vacuum, two
different processes could do btbulkdelete() and btvacuumcleanup()
respectively. Updating those values separately in those callbacks
would be straightforward.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Feb 16, 2021 at 4:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Ugh, yes, I think it's a bug.
I was actually thinking of a similar bug in nbtree deduplication when
I spotted this one -- see commit 48e12913. The index AM API stuff is
tricky.
When developing this feature, in an old version patch, we used to set
invalid values to both btm_oldest_btpo_xact and
btm_last_cleanup_num_heap_tuples in btbulkdelete() to reset these
values. But we decided to set valid values to both even in
btbulkdelete(). I believe that decision was correct in terms of
btm_oldest_btpo_xact because with the old version patch we will do an
unnecessary index scan during btvacuumcleanup(). But it’s wrong in
terms of btm_last_cleanup_num_heap_tuples, as you pointed out.
Right.
This bug would make the vacuum_cleanup_index_scale_factor check
untrustworthy. So I think it's better to backpatch, but note that to fix
this issue properly in a case where a vacuum called btbulkdelete()
earlier, we should probably update only btm_oldest_btpo_xact in
btbulkdelete() and then update btm_last_cleanup_num_heap_tuples in
btvacuumcleanup(). In this case, we don't know the oldest btpo.xact
among the deleted pages by the time we reach btvacuumcleanup(). This
means that we would need to update the meta page twice, leading to
WAL-logging it twice. Since we already could update the meta page more
than once when a vacuum calls btbulkdelete() multiple times, I think it
would not be a problem, though.
I agree that that approach is fine. Realistically, we won't even have
to update the metapage twice in most cases. Because most indexes never
have even one page deletion anyway.
As I mentioned above, we might need to consider how btbulkdelete() can
tell btvacuumcleanup() btm_last_cleanup_num_delpages in a case where a
vacuum called btbulkdelete earlier. During parallel vacuum, two
different processes could do btbulkdelete() and btvacuumcleanup()
respectively. Updating those values separately in those callbacks
would be straightforward.
I don't see why it should be a problem for my patch/Postgres 14,
because we don't have the same btpo.xact/oldestBtpoXact issue that the
original Postgres 11 commit dealt with. The patch determines a value
for btm_last_cleanup_num_delpages (which I call
pages_deleted_not_free) by subtracting fields from the bulk delete
stats: we just use "stats->pages_deleted - stats->pages_free".
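In code form the whole thing is not much more than this (a sketch of the
general shape, at the point where btvacuumcleanup() updates the metapage):

BlockNumber num_delpages;

num_delpages = stats->pages_deleted - stats->pages_free;
_bt_set_cleanup_info(info->index, num_delpages, info->num_heap_tuples);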
Isn't btvacuumcleanup() (or any other amvacuumcleanup() routine)
entitled to rely on the bulk delete stats being set in the way I've
described? I assumed that that was okay in general, but I haven't
tested parallel VACUUM specifically. Will parallel VACUUM really fail
to ensure that values in bulk stats fields (like pages_deleted and
pages_free) get set correctly for amvacuumcleanup() callbacks?
--
Peter Geoghegan
On Tue, Feb 16, 2021 at 11:35 AM Peter Geoghegan <pg@bowt.ie> wrote:
Isn't btvacuumcleanup() (or any other amvacuumcleanup() routine)
entitled to rely on the bulk delete stats being set in the way I've
described? I assumed that that was okay in general, but I haven't
tested parallel VACUUM specifically. Will parallel VACUUM really fail
to ensure that values in bulk stats fields (like pages_deleted and
pages_free) get set correctly for amvacuumcleanup() callbacks?
I tested the pages_deleted_not_free stuff with a version of my patch
that consistently calls _bt_update_meta_cleanup_info() during
btvacuumcleanup(), and never during btbulkdelete(). And it works just
fine -- including with parallel VACUUM.
Evidently my understanding of what btvacuumcleanup() (or any other
amvacuumcleanup() routine) can expect from bulk delete stats was
correct. It doesn't matter whether or not parallel VACUUM happens to
be involved -- it works just as well.
This is good news, since of course it means that it's okay to stick to
the simple approach of calculating pages_deleted_not_free. Passing
pages_deleted_not_free (a.k.a. btm_last_cleanup_num_delpages) to
_bt_update_meta_cleanup_info() during btvacuumcleanup() works just as
well when combined with my fix for the
"IndexVacuumInfo.num_heap_tuples is inaccurate during btbulkdelete()"
bug. That approach to fixing the IndexVacuumInfo.num_heap_tuples bug
creates no new problems for my patch. There is still no need to think
about when or how the relevant bulk delete fields (pages_deleted and
pages_free) were set. And it doesn't matter whether or not parallel
VACUUM is involved.
(Of course it's also true that we can't do that on the backbranches.
Purely because we must worry about btpo.xact/oldestBtpoXact on the
backbranches. We'll probably have to teach the code in released
versions to set btm_oldest_btpo_xact and
btm_last_cleanup_num_heap_tuples in separate calls -- since there is
no easy way to "send" the oldestBtpoXact value determined during a
btbulkdelete() to a later corresponding btvacuumcleanup(). That's a
bit of a kludge, but I'm not worried about it.)
--
Peter Geoghegan
On Wed, Feb 17, 2021 at 5:41 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Feb 16, 2021 at 11:35 AM Peter Geoghegan <pg@bowt.ie> wrote:
Isn't btvacuumcleanup() (or any other amvacuumcleanup() routine)
entitled to rely on the bulk delete stats being set in the way I've
described? I assumed that that was okay in general, but I haven't
tested parallel VACUUM specifically. Will parallel VACUUM really fail
to ensure that values in bulk stats fields (like pages_deleted and
pages_free) get set correctly for amvacuumcleanup() callbacks?
I tested the pages_deleted_not_free stuff with a version of my patch
that consistently calls _bt_update_meta_cleanup_info() during
btvacuumcleanup(), and never during btbulkdelete(). And it works just
fine -- including with parallel VACUUM.
Evidently my understanding of what btvacuumcleanup() (or any other
amvacuumcleanup() routine) can expect from bulk delete stats was
correct. It doesn't matter whether or not parallel VACUUM happens to
be involved -- it works just as well.
Yes, you're right. I missed that pages_deleted_not_free is calculated
by (stats->pages_deleted - stats->pages_free) where both are in
IndexBulkDeleteResult.
This is good news, since of course it means that it's okay to stick to
the simple approach of calculating pages_deleted_not_free. Passing
pages_deleted_not_free (a.k.a. btm_last_cleanup_num_delpages) to
_bt_update_meta_cleanup_info() during btvacuumcleanup() works just as
well when combined with my fix for the
"IndexVacuumInfo.num_heap_tuples is inaccurate during btbulkdelete()"
bug. That approach to fixing the IndexVacuumInfo.num_heap_tuples bug
creates no new problems for my patch. There is still no need to think
about when or how the relevant bulk delete fields (pages_deleted and
pages_free) were set. And it doesn't matter whether or not parallel
VACUUM is involved.
Agreed.
(Of course it's also true that we can't do that on the backbranches.
Purely because we must worry about btpo.xact/oldestBtpoXact on the
backbranches. We'll probably have to teach the code in released
versions to set btm_oldest_btpo_xact and
btm_last_cleanup_num_heap_tuples in separate calls -- since there is
no easy way to "send" the oldestBtpoXact value determined during a
btbulkdelete() to a later corresponding btvacuumcleanup(). That's a
bit of a kludge, but I'm not worried about it.)
Agreed.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Feb 16, 2021 at 12:26 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Feb 15, 2021 at 3:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Yes. I think this would simplify the problem by resolving almost all
problems related to indefinitely deferring page recycling.
We will be able to recycle almost all just-deleted pages in practice,
especially when btvacuumscan() took a long time. And there would not
be a noticeable downside, I think.
Great!
BTW if the btree index starts to use maintenance_work_mem for this
purpose, we also need to set amusemaintenanceworkmem to true, which is
considered during parallel vacuum.
I was just going to use work_mem. This should be okay. Note that
CREATE INDEX uses an additional work_mem allocation when building a
unique index, for the second spool/tuplesort. That seems like a
precedent that I can follow here.
Right now the BTPendingRecycle structs that the patch uses to store
information about a page that the current VACUUM deleted (and may yet
be able to place in the FSM) are each 16 bytes (including alignment
overhead). I could probably make them smaller with a little work, but
even now that's quite small. Even with the default 4MiB work_mem
setting we can fit information about 262144 pages all at once. That's
2GiB worth of deleted index pages, which is generally much more than
we'll need.
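To spell out the arithmetic (the struct itself is from the patch; the
figures assume the default 8KiB block size):

typedef struct BTPendingRecycle
{
    BlockNumber       blkno;    /* 4 bytes */
    FullTransactionId safexid;  /* 8 bytes */
} BTPendingRecycle;             /* 16 bytes once alignment padding is added */

/* default work_mem of 4MiB: (4 * 1024 * 1024) / 16 = 262144 entries */
/* 262144 deleted pages * 8KiB per page = 2GiB of deleted index pages */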
Cool.
Yeah, increasing the threshold would solve the problem in most cases.
Given that nbtree index page deletion is unlikely to happen in
practice, having the threshold at 5% or 10% seems to avoid the problem
in nearly 100% of cases, I think.
Of course it all depends on workload/index characteristics, in the
end. It is very rare to delete a percentage of the index that exceeds
autovacuum_vacuum_scale_factor -- that's the important thing here IMV.
Another idea I came up with (maybe on top of your idea above) is to
change btm_oldest_btpo_xact to a 64-bit XID and store the *newest*
btpo.xact XID among all deleted pages when the total amount of deleted
pages exceeds 2% of the index. That way, we surely can recycle more than
2% of the index when the XID becomes older than the global xmin.
You could make my basic approach to recycling deleted pages earlier
(ideally at the end of the same btvacuumscan() that deleted the pages
in the first place) more sophisticated in a variety of ways. These are
all subject to diminishing returns, though.
I've already managed to recycle close to 100% of all B-Tree pages
during the same VACUUM with a very simple approach -- at least if we
assume BenchmarkSQL is representative. It is hard to know how much
more effort can be justified. To be clear, I'm not saying that an
improved design cannot be justified now or in the future (BenchmarkSQL
is not like many workloads that people use Postgres for). I'm just
saying that I *don't know* where to draw the line. Any particular
place that we draw the line feels a little arbitrary to me. This
includes my own choice of the work_mem-limited BTPendingRecycle array.
My patch currently works that way because it's simple -- no other
reason.
Any scheme to further improve the "work_mem-limited BTPendingRecycle
array" design from my patch boils down to this: A new approach that
makes recycling of any remaining deleted pages take place "before too
long": After the end of the btvacuumscan() BTPendingRecycle array
stuff (presumably that didn't work out in cases where an improved
approach matters), but before the next VACUUM takes place (since that
will do the required recycling anyway, unless it's unable to do any
work at all, in which case it hardly matters).
I agreed with this direction.
Here are two ideas of
my own in this same class as your idea:
1. Remember to do some of the BTPendingRecycle array FSM processing
stuff in btvacuumcleanup() -- defer some of the recycling of pages
recorded in BTPendingRecycle entries (pages deleted during
btbulkdelete() for the same VACUUM) until btvacuumcleanup() is called.
Right now btvacuumcleanup() will always do nothing when btbulkdelete()
was called earlier. But that's just a current nbtree convention, and
is no reason to not do this (we don't have to scan the index again at
all). The advantage of teaching btvacuumcleanup() to do this is that
it delays the "BTPendingRecycle array FSM processing" stuff until the
last moment that it is still easy to use the in-memory array (because
we haven't freed it yet). In general, doing it later makes it more
likely that we'll successfully recycle the pages. Though in general it
might not make any difference -- so we're still hoping that the
workload allows us to recycle everything we deleted, without making
the design much more complicated than what I posted already.
(BTW I see that you reviewed commit 4e514c61, so you must have thought
about the trade-off between doing deferred recycling in
amvacuumcleanup() vs ambulkdelete(), when to call
IndexFreeSpaceMapVacuum(), etc. But there is no reason why we cannot
implement this idea while calling IndexFreeSpaceMapVacuum() during
both btvacuumcleanup() and btbulkdelete(), so that we get the best of
both worlds -- fast recycling *and* more delayed processing that is
more likely to ultimately succeed.)
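In outline, idea 1 would look something like this (a sketch that simply
assumes the BTPendingRecycle array survives until btvacuumcleanup(); it
mirrors what the second patch already does at the end of btvacuumscan()):

/* Late in btvacuumcleanup(), when btbulkdelete() already ran: */
for (int i = 0; i < vstate->ndeleted; i++)
{
    /* array is in safexid order -- stop at the first unsafe entry */
    if (!GlobalVisCheckRemovableFullXid(heapRel,
                                        vstate->deleted[i].safexid))
        break;
    RecordFreeIndexPage(rel, vstate->deleted[i].blkno);
    stats->pages_free++;
}
if (stats->pages_free > 0)
    IndexFreeSpaceMapVacuum(rel);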
I think this idea 1 also needs to serialize the BTPendingRecycle array
somewhere to pass it to a parallel vacuum worker in the parallel vacuum
case.
Delaying the "BTPendingRecycle array FSM processing" stuff until
btvacuumcleanup() is a good idea. But I think it's relatively rare in
practice for index vacuuming to run more than once (i.e., due to using
up maintenance_work_mem). So considering the development cost of
serializing the BTPendingRecycle array and the index AM API changes,
attempting to recycle the deleted pages at the end of btvacuumscan()
would be a balanced strategy.
2. Remember/serialize the BTPendingRecycle array when we realize that
we cannot put all recyclable pages in the FSM at the end of the
current btvacuumscan(), and then use an autovacuum work item to
process them before too long -- a call to AutoVacuumRequestWork()
could even serialize the data on disk.
Idea 2 has the advantage of allowing retries -- eventually it will be
safe to recycle the pages, if we just wait long enough.
This is a good idea too. Perhaps autovacuum needs to end with an
error so that it retries later, in the case where it could not recycle
all the deleted pages.
I have also thought about the idea of storing pending-recycle pages
somewhere, to avoid an index scan when we do the XID-is-recyclable check.
My idea was to store them in btree pages dedicated to this purpose,
linked from the meta page, but I prefer your idea.
Anyway, I'm probably not going to pursue either of the 2 ideas for
Postgres 14. I'm mentioning these ideas now because the trade-offs
show that there is no perfect design for this deferring recycling
stuff. Whatever we do, we should accept that there is no perfect
design.
Actually, there is one more reason why I bring up idea 1 now: I want
to hear your thoughts on the index AM API questions now, which idea 1
touches on. Ideally all of the details around the index AM VACUUM APIs
(i.e. when and where the extra work happens -- btvacuumcleanup() vs
btbulkdelete()) won't need to change much in the future. I worry about
getting this index AM API stuff right, at least a little.
After introducing parallel vacuum, index AMs are not able to pass
arbitrary information taken in ambulkdelete() to amvacuumcleanup(), like
the old gist index code did. If there is a good use case that needs to
pass arbitrary information to amvacuumcleanup(), I think it'd be a good
idea to add an index AM API so that parallel vacuum can serialize it and
pass it to another parallel vacuum worker. But, as I mentioned above,
given that vacuum calls ambulkdelete() only once in most cases, and that
I think we'd like to improve how TIDs are stored in maintenance_work_mem
space (discussed a little on thread[1]/messages/by-id/CA+fd4k76j8jKzJzcx8UqEugvayaMSnQz0iLUt_XgBp-_-bd22A@mail.gmail.com), delaying "the
BTPendingRecycle array FSM processing stuff" until btvacuumcleanup()
would not be a good use case.
Also, maybe we can record deleted pages in the FSM even without
deferring, and check them when re-using. That is, when we get a free
page from the FSM we check if the page is really recyclable (maybe
_bt_getbuf() already does this?). IOW, a deleted page can be recycled
only when it's requested to be reused. If btpo.xact is a 64-bit XID we
never need to worry about the case where a deleted page is never
requested to be reused.
I've thought about that too (both now and in the past). You're right
about _bt_getbuf() -- it checks the XID, at least on the master
branch. I took that XID check out in v4 of the patch, but I am now
starting to have my doubts about that particular choice. (I'm probably
going to restore the XID check in _bt_getbuf in v5 of the patch.)
I took the XID-is-recyclable check out in v4 of the patch because it
might leak pages in rare cases -- which is not a new problem.
_bt_getbuf() currently has a remarkably relaxed attitude about leaking
pages from the FSM (it is more relaxed about it than I am, certainly)
-- but why should we just accept leaking pages like that? My new
doubts about it are non-specific, though. We know that the FSM isn't
crash safe -- but I think that that reduces to "practically speaking,
we can never 100% trust the FSM". Which makes me nervous. I worry that
the FSM can do something completely evil and crazy in rare cases.
It's not just crash safety. The FSM's fsm_search_avail() function
currently changes the fp_next_slot field with only a shared buffer
lock held. It's an int, which is supposed to "be atomic on most
platforms". But we should be using real atomic ops. So the FSM is
generally...kind of wonky.
In an ideal world, nbtree page deletion + recycling would have crash
safety built in. I don't think that it makes sense to not have free
space management without crash safety in the case of index AMs,
because it's just not worth it with whole-page units of free space
(heapam is another story). A 100% crash-safe design would naturally
shift the problem of nbtree page recycle safety from the
producer/VACUUM side, to the consumer/_bt_getbuf() side, which I agree
would be a real improvement. But these long standing FSM issues are
not going to change for Postgres 14. And so changing _bt_getbuf() to
do clever things with XIDs won't be possible for Postgres 14 IMV.
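For reference, the consumer-side check in question is the loop in
_bt_getbuf() that consumes pages from the FSM. On master it looks
roughly like this (paraphrased from memory, not exact code):

for (;;)
{
    blkno = GetFreeIndexPage(rel);
    if (blkno == InvalidBlockNumber)
        break;
    buf = ReadBuffer(rel, blkno);
    if (_bt_conditionallockbuf(rel, buf))
    {
        page = BufferGetPage(buf);
        if (_bt_page_recyclable(page))
        {
            /* Okay to use page.  Re-initialize and return it. */
            _bt_pageinit(page, BufferGetPageSize(buf));
            return buf;
        }

        /*
         * Not recyclable yet: the page has already fallen out of the
         * FSM, so a later VACUUM has to re-record it
         */
        _bt_relbuf(rel, buf);
    }
    else
        ReleaseBuffer(buf);
}
/* fall through: extend the relation instead */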
Agreed. Thanks for your explanation.
Regards,
[1]: /messages/by-id/CA+fd4k76j8jKzJzcx8UqEugvayaMSnQz0iLUt_XgBp-_-bd22A@mail.gmail.com
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Feb 18, 2021 at 3:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Agreed. Thanks for your explanation.
Attached is v5, which has some of the changes I talked about. Changes
from v4 include:
* Now only updates metapage during btvacuumcleanup() in the first
patch, which is enough to fix the existing
IndexVacuumInfo.num_heap_tuples issue.
* Restored _bt_getbuf() page-from-FSM XID check. Out of sheer paranoia.
* The second patch in the series now respects work_mem when sizing the
BTPendingRecycle array.
* New enhancement to the XID GlobalVisCheckRemovableFullXid() test
used in the second patch, to allow it to recycle even more pages.
(Still unsure of some of the details here.)
I would like to commit the first patch in a few days -- I refer to the
big patch that makes deleted page XIDs 64-bit/full. Can you take a
look at that one, Masahiko? That would be helpful. I can produce a bug
fix for the IndexVacuumInfo.num_heap_tuples issue fairly easily, but I
think that that should be written after the first patch is finalized
and committed.
The second patch (the new recycling optimization) will require more
work and testing.
Thanks!
--
Peter Geoghegan
Attachments:
v5-0002-Recycle-pages-deleted-during-same-VACUUM.patch
From 435f9b4d82e70665603a7c0eb19c9065010eb5d0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 7 Feb 2021 19:24:03 -0800
Subject: [PATCH v5 2/3] Recycle pages deleted during same VACUUM.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wzk76_P=67iUscb1UN44-gyZL-KgpsXbSxq_bdcMa7Q+wQ@mail.gmail.com
---
src/include/access/nbtree.h | 38 +++++++++-
src/backend/access/nbtree/README | 31 ++++++++
src/backend/access/nbtree/nbtpage.c | 90 ++++++++++++++++------
src/backend/access/nbtree/nbtree.c | 111 ++++++++++++++++++++++++----
4 files changed, 229 insertions(+), 41 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 901b6f4dc8..5c197fc5c1 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -280,7 +280,8 @@ BTPageGetDeleteXid(Page page)
* Is an existing page recyclable?
*
* This exists to centralize the policy on which deleted pages are now safe to
- * re-use.
+ * re-use. The _bt_newly_deleted_pages_recycle() optimization behaves more
+ * aggressively, though that has certain known limitations.
*
* Note: PageIsNew() pages are always safe to recycle, but we can't deal with
* them here (caller is responsible for that case themselves). Caller might
@@ -313,6 +314,39 @@ BTPageIsRecyclable(Page page)
return false;
}
+/*
+ * BTVacState is nbtree.c state used during VACUUM. It is exported for use by
+ * page deletion related code in nbtpage.c.
+ */
+typedef struct BTPendingRecycle
+{
+ BlockNumber blkno;
+ FullTransactionId safexid;
+} BTPendingRecycle;
+
+typedef struct BTVacState
+{
+ /*
+ * VACUUM operation state
+ */
+ IndexVacuumInfo *info;
+ IndexBulkDeleteResult *stats;
+ IndexBulkDeleteCallback callback;
+ void *callback_state;
+ BTCycleId cycleid;
+
+ /*
+ * Page deletion state for VACUUM
+ */
+ MemoryContext pagedelcontext;
+ BTPendingRecycle *deleted;
+ bool grow;
+ bool full;
+ uint32 ndeletedspace;
+ uint64 maxndeletedspace;
+ uint32 ndeleted;
+} BTVacState;
+
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
* page. The high key is not a tuple that is used to visit the heap. It is
@@ -1182,7 +1216,7 @@ extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
Relation heapRel,
TM_IndexDeleteOp *delstate);
-extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf);
+extern void _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 46d49bf025..265814ea46 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -430,6 +430,37 @@ whenever it is subsequently taken from the FSM for reuse. The deleted
page's contents will be overwritten by the split operation (it will become
the new right sibling page).
+Prior to PostgreSQL 14, VACUUM was only able to recycle pages that were
+deleted by a previous VACUUM operation (VACUUM typically placed all pages
+deleted by the last VACUUM into the FSM, though there were and are no
+certainties here). This had the obvious disadvantage of creating
+uncertainty about when and how pages get recycled, especially with bursty
+workloads. It was naive, even within the constraints of the design, since
+there is no reason to think that it will take long for a deleted page to
+become recyclable. It's convenient to use XIDs to implement the drain
+technique, but that is totally unrelated to any of the other things that
+VACUUM needs to do with XIDs.
+
+VACUUM operations now consider if it's possible to recycle any pages that
+the same operation deleted after the physical scan of the index, the last
+point it's convenient to do one last check. This changes nothing about
+the basic design, and so it might still not be possible to recycle any
+pages at that time (e.g., there might not even be one single new
+transactions after an index page deletion, but before VACUUM ends). But
+we have little to lose and plenty to gain by trying. We only need to keep
+around a little information about recently deleted pages in local memory.
+We don't even have to access the deleted pages a second time.
+
+Currently VACUUM delays considering the possibility of recycling its own
+recently deleted page until the end of its btbulkdelete scan (or until the
+end of btvacuumcleanup in cases where there were no tuples to delete in
+the index). It would be slightly more effective if btbulkdelete page
+deletions were deferred until btvacuumcleanup, simply because more time
+will have passed. Our current approach works well enough in practice,
+especially in cases where it really matters: cases where we're vacuuming a
+large index, where recycling pages sooner rather than later is
+particularly likely to matter.
+
Fastpath For Index Insertion
----------------------------
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 8ae16428d7..55395c87c1 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -50,7 +50,7 @@ static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
BlockNumber scanblkno,
bool *rightsib_empty,
- uint32 *ndeleted);
+ BTVacState *vstate);
static bool _bt_lock_subtree_parent(Relation rel, BlockNumber child,
BTStack stack,
Buffer *subtreeparent,
@@ -1761,20 +1761,22 @@ _bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
* should never pass a buffer containing an existing deleted page here. The
* lock and pin on caller's buffer will be dropped before we return.
*
- * Returns the number of pages successfully deleted (zero if page cannot
- * be deleted now; could be more than one if parent or right sibling pages
- * were deleted too). Note that this does not include pages that we delete
- * that the btvacuumscan scan has yet to reach; they'll get counted later
- * instead.
+ * Maintains bulk delete stats for caller, which are taken from vstate. We
+ * need to cooperate closely with caller here so that whole VACUUM operation
+ * reliably avoids any double counting of subsidiary-to-leafbuf pages that we
+ * delete in passing. If such pages happen to be from a block number that is
+ * ahead of the current scanblkno position, then caller is expected to count
+ * them directly later on. It's simpler for us to understand caller's
+ * requirements than it would be for caller to understand when or how a
+ * deleted page became deleted after the fact.
*
* NOTE: this leaks memory. Rather than trying to clean up everything
* carefully, it's better to run it in a temp context that can be reset
* frequently.
*/
-uint32
-_bt_pagedel(Relation rel, Buffer leafbuf)
+void
+_bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
{
- uint32 ndeleted = 0;
BlockNumber rightsib;
bool rightsib_empty;
Page page;
@@ -1782,7 +1784,8 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
/*
* Save original leafbuf block number from caller. Only deleted blocks
- * that are <= scanblkno get counted in ndeleted return value.
+ * that are <= scanblkno are added to bulk delete stat's pages_deleted
+ * count.
*/
BlockNumber scanblkno = BufferGetBlockNumber(leafbuf);
@@ -1844,7 +1847,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
RelationGetRelationName(rel))));
_bt_relbuf(rel, leafbuf);
- return ndeleted;
+ return;
}
/*
@@ -1874,7 +1877,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
Assert(!P_ISHALFDEAD(opaque));
_bt_relbuf(rel, leafbuf);
- return ndeleted;
+ return;
}
/*
@@ -1923,8 +1926,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
if (_bt_leftsib_splitflag(rel, leftsib, leafblkno))
{
ReleaseBuffer(leafbuf);
- Assert(ndeleted == 0);
- return ndeleted;
+ return;
}
/* we need an insertion scan key for the search, so build one */
@@ -1965,7 +1967,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
if (!_bt_mark_page_halfdead(rel, leafbuf, stack))
{
_bt_relbuf(rel, leafbuf);
- return ndeleted;
+ return;
}
}
@@ -1980,7 +1982,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
{
/* Check for interrupts in _bt_unlink_halfdead_page */
if (!_bt_unlink_halfdead_page(rel, leafbuf, scanblkno,
- &rightsib_empty, &ndeleted))
+ &rightsib_empty, vstate))
{
/*
* _bt_unlink_halfdead_page should never fail, since we
@@ -1991,7 +1993,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
* lock and pin on leafbuf for us.
*/
Assert(false);
- return ndeleted;
+ return;
}
}
@@ -2027,8 +2029,6 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
leafbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
}
-
- return ndeleted;
}
/*
@@ -2263,9 +2263,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
*/
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
- bool *rightsib_empty, uint32 *ndeleted)
+ bool *rightsib_empty, BTVacState *vstate)
{
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
+ IndexBulkDeleteResult *stats = vstate->stats;
BlockNumber leafleftsib;
BlockNumber leafrightsib;
BlockNumber target;
@@ -2673,12 +2674,53 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_relbuf(rel, buf);
/*
- * If btvacuumscan won't revisit this page in a future btvacuumpage call
- * and count it as deleted then, we count it as deleted by current
- * btvacuumpage call
+ * Maintain pages_deleted in a way that takes into account how
+ * btvacuumpage() will count deleted pages that have yet to become
+ * scanblkno -- only count page when it's not going to get that treatment
+ * later on.
*/
if (target <= scanblkno)
- (*ndeleted)++;
+ stats->pages_deleted++;
+
+ /*
+ * Maintain array of pages that were deleted during current btvacuumscan()
+ * call. We may well be able to recycle them in a separate pass at the
+ * end of the current btvacuumscan().
+ *
+ * Need to respect work_mem/maxndeletedspace limitation on size of deleted
+ * array. Our strategy when the array can no longer grow within the
+ * bounds of work_mem is simple: keep earlier entries (which are likelier
+ * to be recyclable in the end), but stop saving new entries.
+ */
+ if (vstate->full)
+ return true;
+
+ if (vstate->ndeleted >= vstate->ndeletedspace)
+ {
+ uint64 newndeletedspace;
+
+ if (!vstate->grow)
+ {
+ vstate->full = true;
+ return true;
+ }
+
+ newndeletedspace = vstate->ndeletedspace * 2;
+ if (newndeletedspace > vstate->maxndeletedspace)
+ {
+ newndeletedspace = vstate->maxndeletedspace;
+ vstate->grow = false;
+ }
+ vstate->ndeletedspace = newndeletedspace;
+
+ vstate->deleted =
+ repalloc(vstate->deleted,
+ sizeof(BTPendingRecycle) * vstate->ndeletedspace);
+ }
+
+ vstate->deleted[vstate->ndeleted].blkno = target;
+ vstate->deleted[vstate->ndeleted].safexid = safexid;
+ vstate->ndeleted++;
return true;
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b5a674d9e0..b18022936b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -21,7 +21,9 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
#include "access/relscan.h"
+#include "access/table.h"
#include "access/xlog.h"
+#include "catalog/index.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
@@ -32,23 +34,13 @@
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/index_selfuncs.h"
#include "utils/memutils.h"
-/* Working state needed by btvacuumpage */
-typedef struct
-{
- IndexVacuumInfo *info;
- IndexBulkDeleteResult *stats;
- IndexBulkDeleteCallback callback;
- void *callback_state;
- BTCycleId cycleid;
- MemoryContext pagedelcontext;
-} BTVacState;
-
/*
* BTPARALLEL_NOT_INITIALIZED indicates that the scan has not started.
*
@@ -868,6 +860,68 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
return false;
}
+/*
+ * _bt_newly_deleted_pages_recycle() -- Are _bt_pagedel pages recyclable now?
+ *
+ * Note that we assume that the array is ordered by safexid. No further
+ * entries can be safe to recycle once we encounter the first non-recyclable
+ * entry in the deleted array.
+ */
+static inline void
+_bt_newly_deleted_pages_recycle(Relation rel, BTVacState *vstate)
+{
+ IndexBulkDeleteResult *stats = vstate->stats;
+ Relation heapRel;
+
+ /*
+ * Recompute VACUUM XID boundaries.
+ *
+ * We don't actually care about the oldest non-removable XID. Computing
+ * the oldest such XID has a useful side-effect: It updates the procarray
+ * state that tracks XID horizon. This is not just an optimization; it's
+ * essential. It allows the GlobalVisCheckRemovableFullXid() calls we
+ * make here to notice if and when safexid values from pages this same
+ * VACUUM operation deleted are sufficiently old to allow recycling to
+ * take place safely.
+ */
+ GetOldestNonRemovableTransactionId(NULL);
+
+ /*
+ * Use the heap relation for GlobalVisCheckRemovableFullXid() calls (don't
+ * pass NULL rel argument).
+ *
+ * This is an optimization; it allows us to be much more aggressive in
+ * cases involving logical decoding (unless this happens to be a system
+ * catalog). We don't simply use BTPageIsRecyclable().
+ *
+ * XXX: The BTPageIsRecyclable() criteria creates problems for this
+ * optimization. Its safexid test is applied in a redundant manner within
+ * _bt_getbuf() (via its BTPageIsRecyclable() call). Consequently,
+ * _bt_getbuf() may believe that it is still unsafe to recycle a page that
+ * we know to be recycle safe -- in which case it is unnecessarily
+ * discarded.
+ *
+ * We should get around to fixing this _bt_getbuf() issue some day. For
+ * now we can still proceed in the hopes that BTPageIsRecyclable() will
+ * catch up with us before _bt_getbuf() ever reaches the page.
+ */
+ heapRel = table_open(IndexGetRelation(RelationGetRelid(rel), false),
+ AccessShareLock);
+ for (int i = 0; i < vstate->ndeleted; i++)
+ {
+ BlockNumber blkno = vstate->deleted[i].blkno;
+ FullTransactionId safexid = vstate->deleted[i].safexid;
+
+ if (!GlobalVisCheckRemovableFullXid(heapRel, safexid))
+ break;
+
+ RecordFreeIndexPage(rel, blkno);
+ stats->pages_free++;
+ }
+
+ table_close(heapRel, AccessShareLock);
+}
+
/*
* Bulk deletion of all index entries pointing to a set of heap tuples.
* The set of target tuples is specified via a callback routine that tells
@@ -953,6 +1007,14 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
* _bt_vacuum_needs_cleanup() to force the next VACUUM to proceed with a
* btvacuumscan() call.
*
+ * Note: Prior to PostgreSQL 14, we were completely reliant on the next
+ * VACUUM operation taking care of recycling whatever pages the current
+ * VACUUM operation found to be empty and then deleted. It is now usually
+ * possible for _bt_newly_deleted_pages_recycle() to recycle all of the
+ * pages that any given VACUUM operation deletes, as part of the same
+ * VACUUM operation. As a result, it is rare for num_delpages to actually
+ * exceed 0, including with indexes where page deletions are frequent.
+ *
* Note: We must delay the _bt_set_cleanup_info() call until this late
* stage of VACUUM (the btvacuumcleanup() phase), to keep num_heap_tuples
* accurate. The btbulkdelete()-time num_heap_tuples value is generally
@@ -1041,6 +1103,16 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
"_bt_pagedel",
ALLOCSET_DEFAULT_SIZES);
+ /* Allocate _bt_newly_deleted_pages_recycle related information */
+ vstate.ndeletedspace = 512;
+ vstate.grow = true;
+ vstate.full = false;
+ vstate.maxndeletedspace = ((work_mem * 1024L) / sizeof(BTPendingRecycle));
+ vstate.maxndeletedspace = Min(vstate.maxndeletedspace, MaxBlockNumber);
+ vstate.maxndeletedspace = Max(vstate.maxndeletedspace, vstate.ndeletedspace);
+ vstate.ndeleted = 0;
+ vstate.deleted = palloc(sizeof(BTPendingRecycle) * vstate.ndeletedspace);
+
/*
* The outer loop iterates over all index pages except the metapage, in
* physical order (we hope the kernel will cooperate in providing
@@ -1109,7 +1181,18 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*
* Note that if no recyclable pages exist, we don't bother vacuuming the
* FSM at all.
+ *
+ * Before vacuuming the FSM, try to make the most of the pages we
+ * ourselves deleted: see if they can be recycled already (try to avoid
+ * waiting until the next VACUUM operation to recycle). Our approach is
+ * to check the local array of pages that were newly deleted during this
+ * VACUUM.
*/
+ if (vstate.ndeleted > 0)
+ _bt_newly_deleted_pages_recycle(rel, &vstate);
+
+ pfree(vstate.deleted);
+
if (stats->pages_free > 0)
IndexFreeSpaceMapVacuum(rel);
}
@@ -1448,12 +1531,10 @@ backtrack:
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
/*
- * We trust the _bt_pagedel return value because it does not include
- * any page that a future call here from btvacuumscan is expected to
- * count. There will be no double-counting.
+ * _bt_pagedel maintains the bulk delete stats on our behalf
*/
Assert(blkno == scanblkno);
- stats->pages_deleted += _bt_pagedel(rel, buf);
+ _bt_pagedel(rel, buf, vstate);
MemoryContextSwitchTo(oldcontext);
/* pagedel released buffer, so we shouldn't */
--
2.27.0
v5-0003-Show-pages-newly-deleted-in-VACUUM-VERBOSE-output.patch
From ea89b5f9892a7c42c0b23f78545c4201024813f4 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 14 Feb 2021 17:38:34 -0800
Subject: [PATCH v5 3/3] Show "pages newly deleted" in VACUUM VERBOSE output.
Teach VACUUM VERBOSE to distinguish between pages that were deleted by
the current VACUUM operation and all deleted pages in the index (without
regard to when or how they became deleted). The latter metric has been
output by VACUUM verbose for many years. Showing both together seems
far more informative.
The new VACUUM VERBOSE field will be helpful to both PostgreSQL users
and PostgreSQL developers that want to understand when and how page
deletions are executed, and when and how free pages can actually be
recycled.
---
src/include/access/genam.h | 11 ++++++++---
src/backend/access/gin/ginvacuum.c | 1 +
src/backend/access/gist/gistvacuum.c | 19 ++++++++++++++++---
src/backend/access/heap/vacuumlazy.c | 4 +++-
src/backend/access/nbtree/nbtpage.c | 4 ++++
src/backend/access/nbtree/nbtree.c | 17 +++++++++++------
src/backend/access/spgist/spgvacuum.c | 1 +
7 files changed, 44 insertions(+), 13 deletions(-)
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index ffa1a4c80d..13971c8b2a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -63,8 +63,12 @@ typedef struct IndexVacuumInfo
* of which this is just the first field; this provides a way for ambulkdelete
* to communicate additional private data to amvacuumcleanup.
*
- * Note: pages_deleted and pages_free refer to free space within the index
- * file. Some index AMs may compute num_index_tuples by reference to
+ * Note: pages_newly_deleted is the number of pages in the index that were
+ * deleted by the current vacuum operation. pages_deleted and pages_free
+ * refer to free space within the index file (and so pages_deleted must be >=
+ * pages_newly_deleted).
+ *
+ * Note: Some index AMs may compute num_index_tuples by reference to
* num_heap_tuples, in which case they should copy the estimated_count field
* from IndexVacuumInfo.
*/
@@ -74,7 +78,8 @@ typedef struct IndexBulkDeleteResult
bool estimated_count; /* num_index_tuples is an estimate */
double num_index_tuples; /* tuples remaining */
double tuples_removed; /* # removed during vacuum operation */
- BlockNumber pages_deleted; /* # unused pages in index */
+ BlockNumber pages_newly_deleted; /* # pages marked deleted by us */
+ BlockNumber pages_deleted; /* # pages marked deleted (could be by us) */
BlockNumber pages_free; /* # pages available for reuse */
} IndexBulkDeleteResult;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a0453b36cd..a276eb020b 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -231,6 +231,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
END_CRIT_SECTION();
+ gvs->result->pages_newly_deleted++;
gvs->result->pages_deleted++;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index ddecb8ab18..0663193531 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -133,9 +133,21 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
MemoryContext oldctx;
/*
- * Reset counts that will be incremented during the scan; needed in case
- * of multiple scans during a single VACUUM command.
+ * Reset fields that track information about the entire index now. This
+ * avoids double-counting in the case where a single VACUUM command
+ * requires multiple scans of the index.
+ *
+ * Avoid resetting the tuples_removed and pages_newly_deleted fields here,
+ * since they track information about the VACUUM command, and so must last
+ * across each call to gistvacuumscan().
+ *
+ * (Note that pages_free is treated as state about the whole index, not
+ * the current VACUUM. This is appropriate because RecordFreeIndexPage()
+ * calls are idempotent, and get repeated for the same deleted pages in
+ * some scenarios. The point for us is to track the number of recyclable
+ * pages in the index at the end of the VACUUM command.)
*/
+ stats->num_pages = 0;
stats->estimated_count = false;
stats->num_index_tuples = 0;
stats->pages_deleted = 0;
@@ -281,8 +293,8 @@ restart:
{
/* Okay to recycle this page */
RecordFreeIndexPage(rel, blkno);
- vstate->stats->pages_free++;
vstate->stats->pages_deleted++;
+ vstate->stats->pages_free++;
}
else if (GistPageIsDeleted(page))
{
@@ -636,6 +648,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
/* mark the page as deleted */
MarkBufferDirty(leafBuffer);
GistPageSetDeleted(leafPage, txid);
+ stats->pages_newly_deleted++;
stats->pages_deleted++;
/* remove the downlink from the parent */
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 0bb78162f5..d8f847b0e6 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2521,9 +2521,11 @@ lazy_cleanup_index(Relation indrel,
(*stats)->num_index_tuples,
(*stats)->num_pages),
errdetail("%.0f index row versions were removed.\n"
- "%u index pages have been deleted, %u are currently reusable.\n"
+ "%u index pages were newly deleted.\n"
+ "%u index pages are currently deleted, of which %u are currently reusable.\n"
"%s.",
(*stats)->tuples_removed,
+ (*stats)->pages_newly_deleted,
(*stats)->pages_deleted, (*stats)->pages_free,
pg_rusage_show(&ru0))));
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 55395c87c1..9de24b7a54 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2674,11 +2674,15 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_relbuf(rel, buf);
/*
+ * Maintain pages_newly_deleted, which is simply the number of pages
+ * deleted by the ongoing VACUUM operation.
+ *
* Maintain pages_deleted in a way that takes into account how
* btvacuumpage() will count deleted pages that have yet to become
* scanblkno -- only count page when it's not going to get that treatment
* later on.
*/
+ stats->pages_newly_deleted++;
if (target <= scanblkno)
stats->pages_deleted++;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b18022936b..8bf5fc439b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -873,6 +873,9 @@ _bt_newly_deleted_pages_recycle(Relation rel, BTVacState *vstate)
IndexBulkDeleteResult *stats = vstate->stats;
Relation heapRel;
+ Assert(vstate->ndeleted > 0);
+ Assert(stats->pages_newly_deleted >= vstate->ndeleted);
+
/*
* Recompute VACUUM XID boundaries.
*
@@ -1075,9 +1078,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* avoids double-counting in the case where a single VACUUM command
* requires multiple scans of the index.
*
- * Avoid resetting the tuples_removed field here, since it tracks
- * information about the VACUUM command, and so must last across each call
- * to btvacuumscan().
+ * Avoid resetting the tuples_removed and pages_newly_deleted fields here,
+ * since they track information about the VACUUM command, and so must last
+ * across each call to btvacuumscan().
*
* (Note that pages_free is treated as state about the whole index, not
* the current VACUUM. This is appropriate because RecordFreeIndexPage()
@@ -1318,8 +1321,8 @@ backtrack:
else if (P_ISHALFDEAD(opaque))
{
/*
- * Half-dead leaf page. Try to delete now. Might update
- * pages_deleted below.
+ * Half-dead leaf page. Try to delete now. Might end up incrementing
+ * pages_newly_deleted/pages_deleted inside _bt_pagedel.
*/
attempt_pagedel = true;
}
@@ -1531,7 +1534,9 @@ backtrack:
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
/*
- * _bt_pagedel maintains the bulk delete stats on our behalf
+ * _bt_pagedel maintains the bulk delete stats on our behalf;
+ * pages_newly_deleted and pages_deleted are likely to be incremented
+ * during call
*/
Assert(blkno == scanblkno);
_bt_pagedel(rel, buf, vstate);
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0d02a02222..a9ffca5183 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -891,6 +891,7 @@ spgvacuumscan(spgBulkDeleteState *bds)
/* Report final stats */
bds->stats->num_pages = num_pages;
+ bds->stats->pages_newly_deleted = bds->stats->pages_deleted;
bds->stats->pages_free = bds->stats->pages_deleted;
}
--
2.27.0
v5-0001-Use-full-64-bit-XID-for-nbtree-page-deletion.patch
From fcfe32c6944436c8c4c9690fe05438708c24f5cf Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 27 Aug 2019 11:44:17 -0700
Subject: [PATCH v5 1/3] Use full 64-bit XID for nbtree page deletion.
Otherwise, after a deleted page gets even older, it becomes unrecyclable
again. This is the nbtree equivalent of commit 6655a729, which did the
same thing within GiST.
Stop storing an XID that tracks the oldest safexid across all
deleted pages in an index altogether. There is no longer any point in
doing this. It only ever made sense when btpo.xact fields could
wrap around.
The old btm_oldest_btpo_xact metapage field has been repurposed in a way
that preserves on-disk compatibility for pg_upgrade. Rename this uint32
field, and use it to store the number of deleted pages that we expect to
be able to recycle during the next btvacuumcleanup() that actually scans
the index. This approach is a little unorthodox, but we were already
using btm_oldest_btpo_xact (now called btm_last_cleanup_num_delpages) in
approximately the same way. And in exactly the same place: inside the
_bt_vacuum_needs_cleanup() function.
The general assumption is that we ought to be able to recycle however
many pages btm_last_cleanup_num_delpages indicates by deciding to scan
the index during a btvacuumcleanup() call (_bt_vacuum_needs_cleanup()'s
decision). Note that manually issued VACUUMs won't be able to recycle
btm_last_cleanup_num_delpages pages (and _bt_vacuum_needs_cleanup()
won't instruct btvacuumcleanup() to skip scanning the index) unless at
least one XID is consumed between VACUUMs (this limitation is already
documented in the nbtree README).
Store btm_last_cleanup_num_delpages and btm_last_cleanup_num_heap_tuples
at the point that we reach btvacuumcleanup() in all cases, including
cases where VACUUM ends up calling btbulkdelete(). We thereby ensure
that btm_last_cleanup_num_heap_tuples matches pg_class.reltuples for the
heap relation at the point that the VACUUM operation finishes. The old
approach often led to the value matching pg_class.reltuples _before_ the
VACUUM operation began, which was not the intended behavior (this
happened only in the common case where btbulkdelete() is called, so at
the very least the old approach derived the value in an inconsistent and
unprincipled way). This fixes a bug in commit 857f9c36, which added
skipping index scans during VACUUM in the first place.
Bump XLOG_PAGE_MAGIC due to WAL record switch over to full XIDs.
---
src/include/access/nbtree.h | 124 +++++++--
src/include/access/nbtxlog.h | 28 +-
src/include/storage/standby.h | 2 +
src/backend/access/gist/gistxlog.c | 24 +-
src/backend/access/nbtree/nbtinsert.c | 24 +-
src/backend/access/nbtree/nbtpage.c | 239 ++++++++----------
src/backend/access/nbtree/nbtree.c | 217 +++++++++-------
src/backend/access/nbtree/nbtsearch.c | 6 +-
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/nbtree/nbtxlog.c | 47 ++--
src/backend/access/rmgrdesc/nbtdesc.c | 17 +-
src/backend/storage/ipc/standby.c | 28 ++
contrib/amcheck/verify_nbtree.c | 90 ++++---
contrib/pageinspect/btreefuncs.c | 69 +++--
contrib/pageinspect/expected/btree.out | 22 +-
contrib/pageinspect/pageinspect--1.8--1.9.sql | 19 +-
contrib/pgstattuple/pgstatindex.c | 8 +-
doc/src/sgml/pageinspect.sgml | 22 +-
18 files changed, 595 insertions(+), 393 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index cad4f2bdeb..901b6f4dc8 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -37,8 +37,9 @@ typedef uint16 BTCycleId;
*
* In addition, we store the page's btree level (counting upwards from
* zero at a leaf page) as well as some flag bits indicating the page type
- * and status. If the page is deleted, we replace the level with the
- * next-transaction-ID value indicating when it is safe to reclaim the page.
+ * and status. If the page is deleted, a BTDeletedPageData struct is stored
+ * in the page's tuple area, while a standard BTPageOpaqueData struct is
+ * stored in the page special area.
*
* We also store a "vacuum cycle ID". When a page is split while VACUUM is
* processing the index, a nonzero value associated with the VACUUM run is
@@ -52,17 +53,17 @@ typedef uint16 BTCycleId;
*
* NOTE: the BTP_LEAF flag bit is redundant since level==0 could be tested
* instead.
+ *
+ * NOTE: the btpo_level field used to be a union type in order to allow
+ * deleted pages to store a 32-bit safexid in the same field. We now store
+ * 64-bit/full safexid values using BTDeletedPageData instead.
*/
typedef struct BTPageOpaqueData
{
BlockNumber btpo_prev; /* left sibling, or P_NONE if leftmost */
BlockNumber btpo_next; /* right sibling, or P_NONE if rightmost */
- union
- {
- uint32 level; /* tree level --- zero for leaf pages */
- TransactionId xact; /* next transaction ID, if deleted */
- } btpo;
+ uint32 btpo_level; /* tree level --- zero for leaf pages */
uint16 btpo_flags; /* flag bits, see below */
BTCycleId btpo_cycleid; /* vacuum cycle ID of latest split */
} BTPageOpaqueData;
@@ -78,6 +79,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples (deprecated) */
#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
+#define BTP_HAS_FULLXID (1 << 8) /* contains BTDeletedPageData */
/*
* The max allowed value of a cycle ID is a bit less than 64K. This is
@@ -105,10 +107,12 @@ typedef struct BTMetaPageData
BlockNumber btm_fastroot; /* current "fast" root location */
uint32 btm_fastlevel; /* tree level of the "fast" root page */
/* remaining fields only valid when btm_version >= BTREE_NOVAC_VERSION */
- TransactionId btm_oldest_btpo_xact; /* oldest btpo_xact among all deleted
- * pages */
- float8 btm_last_cleanup_num_heap_tuples; /* number of heap tuples
- * during last cleanup */
+
+ /* number of deleted, non-recyclable pages during last cleanup */
+ uint32 btm_last_cleanup_num_delpages;
+ /* number of heap tuples during last cleanup */
+ float8 btm_last_cleanup_num_heap_tuples;
+
bool btm_allequalimage; /* are all columns "equalimage"? */
} BTMetaPageData;
@@ -220,6 +224,94 @@ typedef struct BTMetaPageData
#define P_IGNORE(opaque) (((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD)) != 0)
#define P_HAS_GARBAGE(opaque) (((opaque)->btpo_flags & BTP_HAS_GARBAGE) != 0)
#define P_INCOMPLETE_SPLIT(opaque) (((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0)
+#define P_HAS_FULLXID(opaque) (((opaque)->btpo_flags & BTP_HAS_FULLXID) != 0)
+
+/*
+ * BTDeletedPageData is the page contents of a deleted page
+ */
+typedef struct BTDeletedPageData
+{
+ /* last xid which might land on the page and get confused */
+ FullTransactionId safexid;
+} BTDeletedPageData;
+
+static inline void
+BTPageSetDeleted(Page page, FullTransactionId safexid)
+{
+ BTPageOpaque opaque;
+ PageHeader header;
+ BTDeletedPageData *contents;
+
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ header = ((PageHeader) page);
+
+ opaque->btpo_flags &= ~BTP_HALF_DEAD;
+ opaque->btpo_flags |= BTP_DELETED | BTP_HAS_FULLXID;
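+ /*
+ * Discard any remaining items: pd_lower now covers just the page header
+ * plus the BTDeletedPageData contents set below, and resetting pd_upper
+ * to pd_special leaves no tuples behind
+ */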
+ header->pd_lower =
+ MAXALIGN(SizeOfPageHeaderData) + sizeof(BTDeletedPageData);
+ header->pd_upper = header->pd_special;
+
+ /* Set safexid in deleted page */
+ contents = ((BTDeletedPageData *) PageGetContents(page));
+ contents->safexid = safexid;
+}
+
+static inline FullTransactionId
+BTPageGetDeleteXid(Page page)
+{
+ BTPageOpaque opaque;
+ BTDeletedPageData *contents;
+
+ /* We only expect to be called with a deleted page */
+ Assert(!PageIsNew(page));
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ Assert(P_ISDELETED(opaque));
+
+ /*
+ * pg_upgrade'd deleted page -- must be safe to recycle now.  Returning
+ * FirstNormalFullTransactionId makes the page always pass the
+ * BTPageIsRecyclable() safexid test.
+ */
+ if (!P_HAS_FULLXID(opaque))
+ return FirstNormalFullTransactionId;
+
+ /* Get safexid from deleted page */
+ contents = ((BTDeletedPageData *) PageGetContents(page));
+ return contents->safexid;
+}
+
+/*
+ * Is an existing page recyclable?
+ *
+ * This exists to centralize the policy on which deleted pages are now safe to
+ * re-use.
+ *
+ * Note: PageIsNew() pages are always safe to recycle, but we can't deal with
+ * them here (caller is responsible for that case themselves). Caller might
+ * well need special handling for new pages anyway.
+ */
+static inline bool
+BTPageIsRecyclable(Page page)
+{
+ BTPageOpaque opaque;
+
+ Assert(!PageIsNew(page));
+
+ /* Recycling okay iff page is deleted and safexid is old enough */
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ if (P_ISDELETED(opaque))
+ {
+ /*
+ * The page was deleted, but when? If it was just deleted, a scan
+ * might have seen the downlink to it, and will read the page later.
+ * As long as that can happen, we must keep the deleted page around as
+ * a tombstone.
+ *
+ * For that, check whether the deletion XID could still be visible to
+ * anyone. If not, then no scan that's still in progress could have
+ * seen its downlink, and we can recycle it.
+ */
+ return GlobalVisCheckRemovableFullXid(NULL, BTPageGetDeleteXid(page));
+ }
+
+ return false;
+}
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -962,7 +1054,7 @@ typedef struct BTOptions
{
int32 varlena_header_; /* varlena header (do not touch directly!) */
int fillfactor; /* page fill factor in percent (0..100) */
- /* fraction of newly inserted tuples prior to trigger index cleanup */
+ /* fraction of newly inserted tuples needed to trigger index cleanup */
float8 vacuum_cleanup_index_scale_factor;
bool deduplicate_items; /* Try to deduplicate items? */
} BTOptions;
@@ -1066,8 +1158,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page origpage,
*/
extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
bool allequalimage);
-extern void _bt_update_meta_cleanup_info(Relation rel,
- TransactionId oldestBtpoXact, float8 numHeapTuples);
+extern void _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
+ float8 num_heap_tuples);
extern void _bt_upgrademetapage(Page page);
extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
@@ -1084,15 +1176,13 @@ extern void _bt_unlockbuf(Relation rel, Buffer buf);
extern bool _bt_conditionallockbuf(Relation rel, Buffer buf);
extern void _bt_upgradelockbufcleanup(Relation rel, Buffer buf);
extern void _bt_pageinit(Page page, Size size);
-extern bool _bt_page_recyclable(Page page);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *deletable, int ndeletable,
BTVacuumPosting *updatable, int nupdatable);
extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
Relation heapRel,
TM_IndexDeleteOp *delstate);
-extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf,
- TransactionId *oldestBtpoXact);
+extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 7ae5c98c2b..0cec80e5fa 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -13,6 +13,7 @@
#ifndef NBTXLOG_H
#define NBTXLOG_H
+#include "access/transam.h"
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
#include "storage/off.h"
@@ -52,7 +53,7 @@ typedef struct xl_btree_metadata
uint32 level;
BlockNumber fastroot;
uint32 fastlevel;
- TransactionId oldest_btpo_xact;
+ uint32 last_cleanup_num_delpages;
float8 last_cleanup_num_heap_tuples;
bool allequalimage;
} xl_btree_metadata;
@@ -187,7 +188,7 @@ typedef struct xl_btree_reuse_page
{
RelFileNode node;
BlockNumber block;
- TransactionId latestRemovedXid;
+ FullTransactionId latestRemovedFullXid;
} xl_btree_reuse_page;
#define SizeOfBtreeReusePage (sizeof(xl_btree_reuse_page))
@@ -282,9 +283,12 @@ typedef struct xl_btree_mark_page_halfdead
#define SizeOfBtreeMarkPageHalfDead (offsetof(xl_btree_mark_page_halfdead, topparent) + sizeof(BlockNumber))
/*
- * This is what we need to know about deletion of a btree page. Note we do
- * not store any content for the deleted page --- it is just rewritten as empty
- * during recovery, apart from resetting the btpo.xact.
+ * This is what we need to know about deletion of a btree page. Note that we
+ * only leave behind a small amount of bookkeeping information in deleted
+ * pages (deleted pages must be kept around as tombstones for a while). It is
+ * convenient for the REDO routine to regenerate its target page from scratch.
+ * This is why the WAL record describes certain details that are actually
+ * directly available from the target page.
*
* Backup Blk 0: target block being deleted
* Backup Blk 1: target block's left sibling, if any
@@ -296,20 +300,24 @@ typedef struct xl_btree_unlink_page
{
BlockNumber leftsib; /* target block's left sibling, if any */
BlockNumber rightsib; /* target block's right sibling */
+ uint32 level; /* target block's level */
+ FullTransactionId safexid; /* target block's BTPageSetDeleted() value */
/*
- * Information needed to recreate the leaf page, when target is an
- * internal page.
+ * Information needed to recreate a half-dead leaf page with correct
+ * topparent link. The fields are only used when deletion operation's
+ * target page is an internal page. REDO routine creates half-dead page
+ * from scratch to keep things simple (this is the same convenient
+ * approach used for the target page itself).
*/
BlockNumber leafleftsib;
BlockNumber leafrightsib;
- BlockNumber topparent; /* next child down in the subtree */
+ BlockNumber topparent;
- TransactionId btpo_xact; /* value of btpo.xact for use in recovery */
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_UNLINK_PAGE_META */
} xl_btree_unlink_page;
-#define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, btpo_xact) + sizeof(TransactionId))
+#define SizeOfBtreeUnlinkPage (offsetof(xl_btree_unlink_page, topparent) + sizeof(BlockNumber))
/*
* New root log record. There are zero tuples if this is to establish an
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 94d33851d0..38fd85a431 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -31,6 +31,8 @@ extern void ShutdownRecoveryTransactionEnvironment(void);
extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid,
RelFileNode node);
+extern void ResolveRecoveryConflictWithSnapshotFullXid(FullTransactionId latestRemovedFullXid,
+ RelFileNode node);
extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index f2eda79bc1..1c80eae044 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -394,28 +394,8 @@ gistRedoPageReuse(XLogReaderState *record)
* same exclusion effect on primary and standby.
*/
if (InHotStandby)
- {
- FullTransactionId latestRemovedFullXid = xlrec->latestRemovedFullXid;
- FullTransactionId nextXid = ReadNextFullTransactionId();
- uint64 diff;
-
- /*
- * ResolveRecoveryConflictWithSnapshot operates on 32-bit
- * TransactionIds, so truncate the logged FullTransactionId. If the
- * logged value is very old, so that XID wrap-around already happened
- * on it, there can't be any snapshots that still see it.
- */
- diff = U64FromFullTransactionId(nextXid) -
- U64FromFullTransactionId(latestRemovedFullXid);
- if (diff < MaxTransactionId / 2)
- {
- TransactionId latestRemovedXid;
-
- latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
- xlrec->node);
- }
- }
+ ResolveRecoveryConflictWithSnapshotFullXid(xlrec->latestRemovedFullXid,
+ xlrec->node);
}
void
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e333603912..1edb9f9579 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1241,7 +1241,7 @@ _bt_insertonpg(Relation rel,
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
- if (metad->btm_fastlevel >= opaque->btpo.level)
+ if (metad->btm_fastlevel >= opaque->btpo_level)
{
/* no update wanted */
_bt_relbuf(rel, metabuf);
@@ -1268,7 +1268,7 @@ _bt_insertonpg(Relation rel,
if (metad->btm_version < BTREE_NOVAC_VERSION)
_bt_upgrademetapage(metapg);
metad->btm_fastroot = BufferGetBlockNumber(buf);
- metad->btm_fastlevel = opaque->btpo.level;
+ metad->btm_fastlevel = opaque->btpo_level;
MarkBufferDirty(metabuf);
}
@@ -1331,7 +1331,7 @@ _bt_insertonpg(Relation rel,
xlmeta.level = metad->btm_level;
xlmeta.fastroot = metad->btm_fastroot;
xlmeta.fastlevel = metad->btm_fastlevel;
- xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+ xlmeta.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
xlmeta.last_cleanup_num_heap_tuples =
metad->btm_last_cleanup_num_heap_tuples;
xlmeta.allequalimage = metad->btm_allequalimage;
@@ -1537,7 +1537,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
lopaque->btpo_prev = oopaque->btpo_prev;
/* handle btpo_next after rightpage buffer acquired */
- lopaque->btpo.level = oopaque->btpo.level;
+ lopaque->btpo_level = oopaque->btpo_level;
/* handle btpo_cycleid after rightpage buffer acquired */
/*
@@ -1722,7 +1722,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
ropaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
ropaque->btpo_prev = origpagenumber;
ropaque->btpo_next = oopaque->btpo_next;
- ropaque->btpo.level = oopaque->btpo.level;
+ ropaque->btpo_level = oopaque->btpo_level;
ropaque->btpo_cycleid = lopaque->btpo_cycleid;
/*
@@ -1950,7 +1950,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
uint8 xlinfo;
XLogRecPtr recptr;
- xlrec.level = ropaque->btpo.level;
+ xlrec.level = ropaque->btpo_level;
/* See comments below on newitem, orignewitem, and posting lists */
xlrec.firstrightoff = firstrightoff;
xlrec.newitemoff = newitemoff;
@@ -2142,7 +2142,7 @@ _bt_insert_parent(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* Find the leftmost page at the next level up */
- pbuf = _bt_get_endpoint(rel, opaque->btpo.level + 1, false, NULL);
+ pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL);
/* Set up a phony stack entry pointing there */
stack = &fakestack;
stack->bts_blkno = BufferGetBlockNumber(pbuf);
@@ -2480,15 +2480,15 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
rootopaque->btpo_flags = BTP_ROOT;
- rootopaque->btpo.level =
- ((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo.level + 1;
+ rootopaque->btpo_level =
+ ((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo_level + 1;
rootopaque->btpo_cycleid = 0;
/* update metapage data */
metad->btm_root = rootblknum;
- metad->btm_level = rootopaque->btpo.level;
+ metad->btm_level = rootopaque->btpo_level;
metad->btm_fastroot = rootblknum;
- metad->btm_fastlevel = rootopaque->btpo.level;
+ metad->btm_fastlevel = rootopaque->btpo_level;
/*
* Insert the left page pointer into the new root page. The root page is
@@ -2548,7 +2548,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
md.level = metad->btm_level;
md.fastroot = rootblknum;
md.fastlevel = metad->btm_level;
- md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+ md.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
md.allequalimage = metad->btm_allequalimage;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 8c326a4774..8ae16428d7 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -37,7 +37,7 @@
static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
- TransactionId latestRemovedXid);
+ FullTransactionId latestRemovedFullXid);
static void _bt_delitems_delete(Relation rel, Buffer buf,
TransactionId latestRemovedXid,
OffsetNumber *deletable, int ndeletable,
@@ -50,7 +50,6 @@ static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
BlockNumber scanblkno,
bool *rightsib_empty,
- TransactionId *oldestBtpoXact,
uint32 *ndeleted);
static bool _bt_lock_subtree_parent(Relation rel, BlockNumber child,
BTStack stack,
@@ -78,7 +77,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
metad->btm_level = level;
metad->btm_fastroot = rootbknum;
metad->btm_fastlevel = level;
- metad->btm_oldest_btpo_xact = InvalidTransactionId;
+ metad->btm_last_cleanup_num_delpages = 0;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
metad->btm_allequalimage = allequalimage;
@@ -118,7 +117,7 @@ _bt_upgrademetapage(Page page)
/* Set version number and fill extra fields added into version 3 */
metad->btm_version = BTREE_NOVAC_VERSION;
- metad->btm_oldest_btpo_xact = InvalidTransactionId;
+ metad->btm_last_cleanup_num_delpages = 0;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
/* Only a REINDEX can set this field */
Assert(!metad->btm_allequalimage);
@@ -169,35 +168,61 @@ _bt_getmeta(Relation rel, Buffer metabuf)
}
/*
- * _bt_update_meta_cleanup_info() -- Update cleanup-related information in
- * the metapage.
+ * _bt_set_cleanup_info() -- Update metapage for btvacuumcleanup().
*
- * This routine checks if provided cleanup-related information is matching
- * to those written in the metapage. On mismatch, metapage is overwritten.
+ * This routine is called at the end of each VACUUM's btvacuumcleanup()
+ * call. Its purpose is to maintain the metapage fields that are used by
+ * _bt_vacuum_needs_cleanup() to decide whether or not a btvacuumscan()
+ * call should go ahead for an entire VACUUM operation.
+ *
+ * See btvacuumcleanup() and _bt_vacuum_needs_cleanup() for details of
+ * the two fields that we maintain here.
+ *
+ * The information that we maintain for btvacuumcleanup() describes the
+ * state of the index (as well as the table it indexes) just _after_ the
+ * ongoing VACUUM operation. The next _bt_vacuum_needs_cleanup() call
+ * will consider the information we saved for it during the next VACUUM
+ * operation (assuming that there will be no btbulkdelete() call during
+ * the next VACUUM operation -- if there is then the question of skipping
+ * btvacuumscan() doesn't even arise).
*/
void
-_bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
- float8 numHeapTuples)
+_bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
+ float8 num_heap_tuples)
{
Buffer metabuf;
Page metapg;
BTMetaPageData *metad;
- bool needsRewrite = false;
+ bool rewrite = false;
XLogRecPtr recptr;
- /* read the metapage and check if it needs rewrite */
+ /*
+ * On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
+ * field started out as a TransactionId field called btm_oldest_btpo_xact.
+ * Both "versions" are just uint32 fields. It was convenient to repurpose
+ * the field when we began to use 64-bit XIDs in deleted pages.
+ *
+ * It's possible that a pg_upgrade'd database will contain an XID value in
+ * what is now recognized as the metapage's btm_last_cleanup_num_delpages
+ * field. _bt_vacuum_needs_cleanup() may even believe that this value
+ * indicates that there are lots of pages that it needs to recycle, when
+ * in reality there are only one or two. The worst that can happen is
+ * that there will be a call to btvacuumscan a little earlier, which will
+ * set btm_last_cleanup_num_delpages to a sane value when we're called.
+ */
metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
- /* outdated version of metapage always needs rewrite */
+ /* Always dynamically upgrade index/metapage when still on BTREE_MIN_VERSION */
if (metad->btm_version < BTREE_NOVAC_VERSION)
- needsRewrite = true;
- else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
- metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
- needsRewrite = true;
+ rewrite = true;
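+ /* Otherwise only rewrite the metapage when the values we track change */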
+ else if (metad->btm_last_cleanup_num_delpages != num_delpages)
+ rewrite = true;
+ else if (metad->btm_last_cleanup_num_heap_tuples != num_heap_tuples)
+ rewrite = true;
- if (!needsRewrite)
+ if (!rewrite)
{
_bt_relbuf(rel, metabuf);
return;
@@ -214,8 +239,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
_bt_upgrademetapage(metapg);
/* update cleanup-related information */
- metad->btm_oldest_btpo_xact = oldestBtpoXact;
- metad->btm_last_cleanup_num_heap_tuples = numHeapTuples;
+ metad->btm_last_cleanup_num_delpages = num_delpages;
+ metad->btm_last_cleanup_num_heap_tuples = num_heap_tuples;
MarkBufferDirty(metabuf);
/* write wal record if needed */
@@ -232,8 +257,8 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
md.level = metad->btm_level;
md.fastroot = metad->btm_fastroot;
md.fastlevel = metad->btm_fastlevel;
- md.oldest_btpo_xact = oldestBtpoXact;
- md.last_cleanup_num_heap_tuples = numHeapTuples;
+ md.last_cleanup_num_delpages = num_delpages;
+ md.last_cleanup_num_heap_tuples = num_heap_tuples;
md.allequalimage = metad->btm_allequalimage;
XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
@@ -244,6 +269,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
}
END_CRIT_SECTION();
+
_bt_relbuf(rel, metabuf);
}
@@ -316,7 +342,7 @@ _bt_getroot(Relation rel, int access)
* because that's not set in a "fast root".
*/
if (!P_IGNORE(rootopaque) &&
- rootopaque->btpo.level == rootlevel &&
+ rootopaque->btpo_level == rootlevel &&
P_LEFTMOST(rootopaque) &&
P_RIGHTMOST(rootopaque))
{
@@ -377,7 +403,7 @@ _bt_getroot(Relation rel, int access)
rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
rootopaque->btpo_flags = (BTP_LEAF | BTP_ROOT);
- rootopaque->btpo.level = 0;
+ rootopaque->btpo_level = 0;
rootopaque->btpo_cycleid = 0;
/* Get raw page pointer for metapage */
metapg = BufferGetPage(metabuf);
@@ -393,7 +419,7 @@ _bt_getroot(Relation rel, int access)
metad->btm_level = 0;
metad->btm_fastroot = rootblkno;
metad->btm_fastlevel = 0;
- metad->btm_oldest_btpo_xact = InvalidTransactionId;
+ metad->btm_last_cleanup_num_delpages = 0;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
MarkBufferDirty(rootbuf);
@@ -416,7 +442,7 @@ _bt_getroot(Relation rel, int access)
md.level = 0;
md.fastroot = rootblkno;
md.fastlevel = 0;
- md.oldest_btpo_xact = InvalidTransactionId;
+ md.last_cleanup_num_delpages = 0;
md.last_cleanup_num_heap_tuples = -1.0;
md.allequalimage = metad->btm_allequalimage;
@@ -481,11 +507,10 @@ _bt_getroot(Relation rel, int access)
rootblkno = rootopaque->btpo_next;
}
- /* Note: can't check btpo.level on deleted pages */
- if (rootopaque->btpo.level != rootlevel)
+ if (rootopaque->btpo_level != rootlevel)
elog(ERROR, "root page %u of index \"%s\" has level %u, expected %u",
rootblkno, RelationGetRelationName(rel),
- rootopaque->btpo.level, rootlevel);
+ rootopaque->btpo_level, rootlevel);
}
/*
@@ -585,11 +610,10 @@ _bt_gettrueroot(Relation rel)
rootblkno = rootopaque->btpo_next;
}
- /* Note: can't check btpo.level on deleted pages */
- if (rootopaque->btpo.level != rootlevel)
+ if (rootopaque->btpo_level != rootlevel)
elog(ERROR, "root page %u of index \"%s\" has level %u, expected %u",
rootblkno, RelationGetRelationName(rel),
- rootopaque->btpo.level, rootlevel);
+ rootopaque->btpo_level, rootlevel);
return rootbuf;
}
@@ -762,7 +786,8 @@ _bt_checkpage(Relation rel, Buffer buf)
* Log the reuse of a page from the FSM.
*/
static void
-_bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedXid)
+_bt_log_reuse_page(Relation rel, BlockNumber blkno,
+ FullTransactionId latestRemovedFullXid)
{
xl_btree_reuse_page xlrec_reuse;
@@ -775,7 +800,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
/* XLOG stuff */
xlrec_reuse.node = rel->rd_node;
xlrec_reuse.block = blkno;
- xlrec_reuse.latestRemovedXid = latestRemovedXid;
+ xlrec_reuse.latestRemovedFullXid = latestRemovedFullXid;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec_reuse, SizeOfBtreeReusePage);
@@ -856,26 +881,34 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
if (_bt_conditionallockbuf(rel, buf))
{
page = BufferGetPage(buf);
- if (_bt_page_recyclable(page))
+
+ /*
+ * It's possible to find an all-zeroes page in an index. For
+ * example, a backend might successfully extend the relation
+ * one page and then crash before it is able to make a WAL
+ * entry for adding the page. If we find a zeroed page then
+ * reclaim it immediately.
+ */
+ if (PageIsNew(page))
+ {
+ /* Okay to use page. Initialize and return it. */
+ _bt_pageinit(page, BufferGetPageSize(buf));
+ return buf;
+ }
+
+ if (BTPageIsRecyclable(page))
{
/*
* If we are generating WAL for Hot Standby then create a
* WAL record that will allow us to conflict with queries
* running on standby, in case they have snapshots older
- * than btpo.xact. This can only apply if the page does
- * have a valid btpo.xact value, ie not if it's new. (We
- * must check that because an all-zero page has no special
- * space.)
+ * than the page's safexid value
*/
- if (XLogStandbyInfoActive() && RelationNeedsWAL(rel) &&
- !PageIsNew(page))
- {
- BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
+ _bt_log_reuse_page(rel, blkno,
+ BTPageGetDeleteXid(page));
- _bt_log_reuse_page(rel, blkno, opaque->btpo.xact);
- }
-
- /* Okay to use page. Re-initialize and return it */
+ /* Okay to use page. Re-initialize and return it. */
_bt_pageinit(page, BufferGetPageSize(buf));
return buf;
}
@@ -1073,40 +1106,6 @@ _bt_pageinit(Page page, Size size)
PageInit(page, size, sizeof(BTPageOpaqueData));
}
-/*
- * _bt_page_recyclable() -- Is an existing page recyclable?
- *
- * This exists to make sure _bt_getbuf and btvacuumscan have the same
- * policy about whether a page is safe to re-use. But note that _bt_getbuf
- * knows enough to distinguish the PageIsNew condition from the other one.
- * At some point it might be appropriate to redesign this to have a three-way
- * result value.
- */
-bool
-_bt_page_recyclable(Page page)
-{
- BTPageOpaque opaque;
-
- /*
- * It's possible to find an all-zeroes page in an index --- for example, a
- * backend might successfully extend the relation one page and then crash
- * before it is able to make a WAL entry for adding the page. If we find a
- * zeroed page then reclaim it.
- */
- if (PageIsNew(page))
- return true;
-
- /*
- * Otherwise, recycle if deleted and too old to have any processes
- * interested in it.
- */
- opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- if (P_ISDELETED(opaque) &&
- GlobalVisCheckRemovableXid(NULL, opaque->btpo.xact))
- return true;
- return false;
-}
-
/*
* Delete item(s) from a btree leaf page during VACUUM.
*
@@ -1768,16 +1767,12 @@ _bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
* that the btvacuumscan scan has yet to reach; they'll get counted later
* instead.
*
- * Maintains *oldestBtpoXact for any pages that get deleted. Caller is
- * responsible for maintaining *oldestBtpoXact in the case of pages that were
- * deleted by a previous VACUUM.
- *
* NOTE: this leaks memory. Rather than trying to clean up everything
* carefully, it's better to run it in a temp context that can be reset
* frequently.
*/
uint32
-_bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
+_bt_pagedel(Relation rel, Buffer leafbuf)
{
uint32 ndeleted = 0;
BlockNumber rightsib;
@@ -1985,8 +1980,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
{
/* Check for interrupts in _bt_unlink_halfdead_page */
if (!_bt_unlink_halfdead_page(rel, leafbuf, scanblkno,
- &rightsib_empty, oldestBtpoXact,
- &ndeleted))
+ &rightsib_empty, &ndeleted))
{
/*
* _bt_unlink_halfdead_page should never fail, since we
@@ -2002,8 +1996,6 @@ _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
}
Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque));
- Assert(TransactionIdFollowsOrEquals(opaque->btpo.xact,
- *oldestBtpoXact));
rightsib = opaque->btpo_next;
@@ -2264,12 +2256,6 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
* containing leafbuf. (We always set *rightsib_empty for caller, just to be
* consistent.)
*
- * We maintain *oldestBtpoXact for pages that are deleted by the current
- * VACUUM operation here. This must be handled here because we conservatively
- * assume that there needs to be a new call to ReadNextTransactionId() each
- * time a page gets deleted. See comments about the underlying assumption
- * below.
- *
* Must hold pin and lock on leafbuf at entry (read or write doesn't matter).
* On success exit, we'll be holding pin and write lock. On failure exit,
* we'll release both pin and lock before returning (we define it that way
@@ -2277,8 +2263,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
*/
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
- bool *rightsib_empty, TransactionId *oldestBtpoXact,
- uint32 *ndeleted)
+ bool *rightsib_empty, uint32 *ndeleted)
{
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
BlockNumber leafleftsib;
@@ -2294,12 +2279,12 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
BTMetaPageData *metad = NULL;
ItemId itemid;
Page page;
- PageHeader header;
BTPageOpaque opaque;
+ FullTransactionId safexid;
bool rightsib_is_rightmost;
- int targetlevel;
+ uint32 targetlevel;
IndexTuple leafhikey;
- BlockNumber nextchild;
+ BlockNumber topparent_in_target;
page = BufferGetPage(leafbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -2343,7 +2328,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
leftsib = opaque->btpo_prev;
- targetlevel = opaque->btpo.level;
+ targetlevel = opaque->btpo_level;
Assert(targetlevel > 0);
/*
@@ -2450,7 +2435,9 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
!P_ISLEAF(opaque) || !P_ISHALFDEAD(opaque))
elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
target, RelationGetRelationName(rel));
- nextchild = InvalidBlockNumber;
+
+ /* Leaf page is also target page: don't set topparent */
+ topparent_in_target = InvalidBlockNumber;
}
else
{
@@ -2459,11 +2446,13 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
target, RelationGetRelationName(rel));
- /* Remember the next non-leaf child down in the subtree */
+ /* Internal page is target: we'll set topparent in leaf page... */
itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
- nextchild = BTreeTupleGetDownLink((IndexTuple) PageGetItem(page, itemid));
- if (nextchild == leafblkno)
- nextchild = InvalidBlockNumber;
+ topparent_in_target =
+ BTreeTupleGetTopParent((IndexTuple) PageGetItem(page, itemid));
+ /* ...except when it would be a redundant pointer-to-self */
+ if (topparent_in_target == leafblkno)
+ topparent_in_target = InvalidBlockNumber;
}
/*
@@ -2553,13 +2542,13 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
* no lock was held.
*/
if (target != leafblkno)
- BTreeTupleSetTopParent(leafhikey, nextchild);
+ BTreeTupleSetTopParent(leafhikey, topparent_in_target);
/*
* Mark the page itself deleted. It can be recycled when all current
* transactions are gone. Storing GetTopTransactionId() would work, but
* we're in VACUUM and would not otherwise have an XID. Having already
- * updated links to the target, ReadNextTransactionId() suffices as an
+ * updated links to the target, ReadNextFullTransactionId() suffices as an
* upper bound. Any scan having retained a now-stale link is advertising
* in its PGPROC an xmin less than or equal to the value we read here. It
* will continue to do so, holding back the xmin horizon, for the duration
@@ -2568,17 +2557,14 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
Assert(P_ISHALFDEAD(opaque) || !P_ISLEAF(opaque));
- opaque->btpo_flags &= ~BTP_HALF_DEAD;
- opaque->btpo_flags |= BTP_DELETED;
- opaque->btpo.xact = ReadNextTransactionId();
/*
- * Remove the remaining tuples on the page. This keeps things simple for
- * WAL consistency checking.
+ * Store the upper bound XID that's used to determine when the deleted
+ * page is no longer needed as a tombstone
*/
- header = (PageHeader) page;
- header->pd_lower = SizeOfPageHeaderData;
- header->pd_upper = header->pd_special;
+ safexid = ReadNextFullTransactionId();
+ BTPageSetDeleted(page, safexid);
+ opaque->btpo_cycleid = 0;
/* And update the metapage, if needed */
if (BufferIsValid(metabuf))
@@ -2616,15 +2602,16 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (target != leafblkno)
XLogRegisterBuffer(3, leafbuf, REGBUF_WILL_INIT);
- /* information on the unlinked block */
+ /* information stored on the target/to-be-unlinked block */
xlrec.leftsib = leftsib;
xlrec.rightsib = rightsib;
- xlrec.btpo_xact = opaque->btpo.xact;
+ xlrec.level = targetlevel;
+ xlrec.safexid = safexid;
/* information needed to recreate the leaf block (if not the target) */
xlrec.leafleftsib = leafleftsib;
xlrec.leafrightsib = leafrightsib;
- xlrec.topparent = nextchild;
+ xlrec.topparent = topparent_in_target;
XLogRegisterData((char *) &xlrec, SizeOfBtreeUnlinkPage);
@@ -2638,7 +2625,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
xlmeta.level = metad->btm_level;
xlmeta.fastroot = metad->btm_fastroot;
xlmeta.fastlevel = metad->btm_fastlevel;
- xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+ xlmeta.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
xlmeta.allequalimage = metad->btm_allequalimage;
@@ -2681,9 +2668,9 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_relbuf(rel, lbuf);
_bt_relbuf(rel, rbuf);
- if (!TransactionIdIsValid(*oldestBtpoXact) ||
- TransactionIdPrecedes(opaque->btpo.xact, *oldestBtpoXact))
- *oldestBtpoXact = opaque->btpo.xact;
+ /* If the target is not leafbuf, we're done with it now -- release it */
+ if (target != leafblkno)
+ _bt_relbuf(rel, buf);
/*
* If btvacuumscan won't revisit this page in a future btvacuumpage call
@@ -2693,10 +2680,6 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (target <= scanblkno)
(*ndeleted)++;
- /* If the target is not leafbuf, we're done with it now -- release it */
- if (target != leafblkno)
- _bt_relbuf(rel, buf);
-
return true;
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 289bd3c15d..b5a674d9e0 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -46,8 +46,6 @@ typedef struct
IndexBulkDeleteCallback callback;
void *callback_state;
BTCycleId cycleid;
- BlockNumber totFreePages; /* true total # of free pages */
- TransactionId oldestBtpoXact;
MemoryContext pagedelcontext;
} BTVacState;
@@ -790,7 +788,7 @@ _bt_parallel_advance_array_keys(IndexScanDesc scan)
* _bt_vacuum_needs_cleanup() -- Checks if index needs cleanup
*
* Called by btvacuumcleanup when btbulkdelete was never called because no
- * tuples need to be deleted.
+ * tuples needed to be deleted by VACUUM.
*
* When we return false, VACUUM can even skip the cleanup-only call to
* btvacuumscan (i.e. there will be no btvacuumscan call for this index at
@@ -802,66 +800,72 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
Buffer metabuf;
Page metapg;
BTMetaPageData *metad;
- bool result = false;
+ BTOptions *relopts;
+ float8 cleanup_scale_factor;
+ uint32 btm_version;
+ BlockNumber prev_num_delpages;
+ float8 prev_num_heap_tuples;
+ /*
+ * Copy details from metapage to local variables quickly.
+ *
+ * Note that we deliberately avoid using cached version of metapage here.
+ */
metabuf = _bt_getbuf(info->index, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
+ btm_version = metad->btm_version;
+
+ if (btm_version < BTREE_NOVAC_VERSION)
+ {
+ /*
+ * Metapage needs to be dynamically upgraded to store fields that are
+ * only present when btm_version >= BTREE_NOVAC_VERSION
+ */
+ _bt_relbuf(info->index, metabuf);
+ return true;
+ }
+
+ prev_num_delpages = metad->btm_last_cleanup_num_delpages;
+ prev_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ _bt_relbuf(info->index, metabuf);
/*
- * XXX: If IndexVacuumInfo contained the heap relation, we could be more
- * aggressive about vacuuming non catalog relations by passing the table
- * to GlobalVisCheckRemovableXid().
+ * If the table receives enough insertions and no cleanup was performed,
+ * then the index would appear to have stale statistics.  If a scale
+ * factor is set, we avoid that by performing cleanup when the number of
+ * inserted tuples exceeds the vacuum_cleanup_index_scale_factor fraction
+ * of the original tuple count.
*/
+ relopts = (BTOptions *) info->index->rd_options;
+ cleanup_scale_factor = (relopts &&
+ relopts->vacuum_cleanup_index_scale_factor >= 0)
+ ? relopts->vacuum_cleanup_index_scale_factor
+ : vacuum_cleanup_index_scale_factor;
- if (metad->btm_version < BTREE_NOVAC_VERSION)
- {
- /*
- * Do cleanup if metapage needs upgrade, because we don't have
- * cleanup-related meta-information yet.
- */
- result = true;
- }
- else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
- GlobalVisCheckRemovableXid(NULL, metad->btm_oldest_btpo_xact))
- {
- /*
- * If any oldest btpo.xact from a previously deleted page in the index
- * is visible to everyone, then at least one deleted page can be
- * recycled -- don't skip cleanup.
- */
- result = true;
- }
- else
- {
- BTOptions *relopts;
- float8 cleanup_scale_factor;
- float8 prev_num_heap_tuples;
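+ /*
+ * Trigger cleanup when the scale factor setting is zero, when heap tuple
+ * counts are unavailable, or when the table has grown by more than the
+ * configured fraction since the last cleanup
+ */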
+ if (cleanup_scale_factor <= 0 ||
+ info->num_heap_tuples < 0 ||
+ prev_num_heap_tuples <= 0 ||
+ (info->num_heap_tuples - prev_num_heap_tuples) /
+ prev_num_heap_tuples >= cleanup_scale_factor)
+ return true;
- /*
- * If table receives enough insertions and no cleanup was performed,
- * then index would appear have stale statistics. If scale factor is
- * set, we avoid that by performing cleanup if the number of inserted
- * tuples exceeds vacuum_cleanup_index_scale_factor fraction of
- * original tuples count.
- */
- relopts = (BTOptions *) info->index->rd_options;
- cleanup_scale_factor = (relopts &&
- relopts->vacuum_cleanup_index_scale_factor >= 0)
- ? relopts->vacuum_cleanup_index_scale_factor
- : vacuum_cleanup_index_scale_factor;
- prev_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ /*
+ * Trigger cleanup in rare cases where prev_num_delpages exceeds 5% of the
+ * total size of the index. We can reasonably expect (though are not
+ * guaranteed) to be able to recycle this many pages if we decide to do a
+ * btvacuumscan call during the ongoing btvacuumcleanup.
+ *
+ * Our approach won't reliably avoid "wasted" cleanup-only btvacuumscan
+ * calls. That is, we can end up scanning the entire index without ever
+ * placing even 1 of the prev_num_delpages pages in the free space map, at
+ * least in certain narrow cases (see nbtree/README section on recycling
+ * deleted pages for details). This rarely matters in practice.
+ */
+ if (prev_num_delpages > RelationGetNumberOfBlocks(info->index) / 20)
+ return true;
- if (cleanup_scale_factor <= 0 ||
- info->num_heap_tuples < 0 ||
- prev_num_heap_tuples <= 0 ||
- (info->num_heap_tuples - prev_num_heap_tuples) /
- prev_num_heap_tuples >= cleanup_scale_factor)
- result = true;
- }
-
- _bt_relbuf(info->index, metabuf);
- return result;
+ return false;
}
/*
@@ -904,30 +908,62 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteResult *
btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
{
+ BlockNumber num_delpages;
+
/* No-op in ANALYZE ONLY mode */
if (info->analyze_only)
return stats;
/*
- * If btbulkdelete was called, we need not do anything, just return the
- * stats from the latest btbulkdelete call. If it wasn't called, we might
- * still need to do a pass over the index, to recycle any newly-recyclable
- * pages or to obtain index statistics. _bt_vacuum_needs_cleanup
- * determines if either are needed.
+ * If btbulkdelete was called, we need not do anything (we just maintain
+ * the information used within _bt_vacuum_needs_cleanup() by calling
+ * _bt_set_cleanup_info() below).
*
- * Since we aren't going to actually delete any leaf items, there's no
- * need to go through all the vacuum-cycle-ID pushups.
+ * If btbulkdelete was _not_ called, then we have a choice to make: we
+ * must decide whether or not a btvacuumscan() call is needed now (i.e.
+ * whether the entire ongoing VACUUM operation can entirely avoid a
+ * physical scan of the index). A call to _bt_vacuum_needs_cleanup()
+ * decides it for us now.
*/
if (stats == NULL)
{
- /* Check if we need a cleanup */
+ /* Check if VACUUM operation can entirely avoid btvacuumscan() call */
if (!_bt_vacuum_needs_cleanup(info))
return NULL;
+ /*
+ * Since we aren't going to actually delete any leaf items, there's no
+ * need to go through all the vacuum-cycle-ID pushups here
+ */
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
btvacuumscan(info, stats, NULL, NULL, 0);
}
+ /*
+ * By here, we know for sure that this VACUUM operation won't be skipping
+ * its btvacuumscan() call. Maintain the count of the current number of
+ * heap tuples in the metapage. Also maintain the num_delpages value.
+ * This information will be used by _bt_vacuum_needs_cleanup() during
+ * future VACUUM operations that don't need to call btbulkdelete().
+ *
+ * num_delpages is the number of deleted pages now in the index that were
+ * not safe to place in the FSM to be recycled just yet. We expect that
+ * it will almost certainly be possible to place all of these pages in the
+ * FSM during the next VACUUM operation. That factor alone might cause
+ * _bt_vacuum_needs_cleanup() to force the next VACUUM to proceed with a
+ * btvacuumscan() call.
+ *
+ * Note: We must delay the _bt_set_cleanup_info() call until this late
+ * stage of VACUUM (the btvacuumcleanup() phase), to keep num_heap_tuples
+ * accurate. The btbulkdelete()-time num_heap_tuples value is generally
+ * just pg_class.reltuples for the heap relation _before_ VACUUM began.
+ * In general cleanup info should describe the state of the index/table
+ * _after_ VACUUM finishes.
+ */
+ Assert(stats->pages_deleted >= stats->pages_free);
+ num_delpages = stats->pages_deleted - stats->pages_free;
+ _bt_set_cleanup_info(info->index, num_delpages, info->num_heap_tuples);
+
/*
* It's quite possible for us to be fooled by concurrent page splits into
* double-counting some index tuples, so disbelieve any total that exceeds
@@ -957,8 +993,6 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
* deleted, and looking for old deleted pages that can be recycled. Both
* btbulkdelete and btvacuumcleanup invoke this (the latter only if no
* btbulkdelete call occurred and _bt_vacuum_needs_cleanup returned true).
- * Note that this is also where the metadata used by _bt_vacuum_needs_cleanup
- * is maintained.
*
* The caller is responsible for initially allocating/zeroing a stats struct
* and for obtaining a vacuum cycle ID if necessary.
@@ -975,12 +1009,25 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
bool needLock;
/*
- * Reset counts that will be incremented during the scan; needed in case
- * of multiple scans during a single VACUUM command
+ * Reset fields that track information about the entire index now. This
+ * avoids double-counting in the case where a single VACUUM command
+ * requires multiple scans of the index.
+ *
+ * Avoid resetting the tuples_removed field here, since it tracks
+ * information about the VACUUM command, and so must last across each call
+ * to btvacuumscan().
+ *
+ * (Note that pages_free is treated as state about the whole index, not
+ * the current VACUUM. This is appropriate because RecordFreeIndexPage()
+ * calls are idempotent, and get repeated for the same deleted pages in
+ * some scenarios. The point for us is to track the number of recyclable
+ * pages in the index at the end of the VACUUM command.)
*/
+ stats->num_pages = 0;
stats->estimated_count = false;
stats->num_index_tuples = 0;
stats->pages_deleted = 0;
+ stats->pages_free = 0;
/* Set up info to pass down to btvacuumpage */
vstate.info = info;
@@ -988,8 +1035,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
vstate.callback = callback;
vstate.callback_state = callback_state;
vstate.cycleid = cycleid;
- vstate.totFreePages = 0;
- vstate.oldestBtpoXact = InvalidTransactionId;
/* Create a temporary memory context to run _bt_pagedel in */
vstate.pagedelcontext = AllocSetContextCreate(CurrentMemoryContext,
@@ -1048,6 +1093,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
}
}
+ /* Set statistics num_pages field to final size of index */
+ stats->num_pages = num_pages;
+
MemoryContextDelete(vstate.pagedelcontext);
/*
@@ -1062,27 +1110,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* Note that if no recyclable pages exist, we don't bother vacuuming the
* FSM at all.
*/
- if (vstate.totFreePages > 0)
+ if (stats->pages_free > 0)
IndexFreeSpaceMapVacuum(rel);
-
- /*
- * Maintain the oldest btpo.xact and a count of the current number of heap
- * tuples in the metapage (for the benefit of _bt_vacuum_needs_cleanup).
- *
- * The page with the oldest btpo.xact is typically a page deleted by this
- * VACUUM operation, since pages deleted by a previous VACUUM operation
- * tend to be placed in the FSM (by the current VACUUM operation) -- such
- * pages are not candidates to be the oldest btpo.xact. (Note that pages
- * placed in the FSM are reported as deleted pages in the bulk delete
- * statistics, despite not counting as deleted pages for the purposes of
- * determining the oldest btpo.xact.)
- */
- _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
- info->num_heap_tuples);
-
- /* update statistics */
- stats->num_pages = num_pages;
- stats->pages_free = vstate.totFreePages;
}
/*
@@ -1188,13 +1217,12 @@ backtrack:
}
}
- /* Page is valid, see what to do with it */
- if (_bt_page_recyclable(page))
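+ /*
+ * A NULL opaque means an all-zeroes (never initialized) page, which is
+ * always safe to recycle
+ */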
+ if (!opaque || BTPageIsRecyclable(page))
{
/* Okay to recycle this page (which could be leaf or internal) */
RecordFreeIndexPage(rel, blkno);
- vstate->totFreePages++;
stats->pages_deleted++;
+ stats->pages_free++;
}
else if (P_ISDELETED(opaque))
{
@@ -1203,17 +1231,12 @@ backtrack:
* recycle yet.
*/
stats->pages_deleted++;
-
- /* Maintain the oldest btpo.xact */
- if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
- TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
- vstate->oldestBtpoXact = opaque->btpo.xact;
}
else if (P_ISHALFDEAD(opaque))
{
/*
* Half-dead leaf page. Try to delete now. Might update
- * oldestBtpoXact and pages_deleted below.
+ * pages_deleted below.
*/
attempt_pagedel = true;
}
@@ -1430,7 +1453,7 @@ backtrack:
* count. There will be no double-counting.
*/
Assert(blkno == scanblkno);
- stats->pages_deleted += _bt_pagedel(rel, buf, &vstate->oldestBtpoXact);
+ stats->pages_deleted += _bt_pagedel(rel, buf);
MemoryContextSwitchTo(oldcontext);
/* pagedel released buffer, so we shouldn't */
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 2e3bda8171..d1177d8772 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -169,7 +169,7 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
* we're on the level 1 and asked to lock leaf page in write mode,
* then lock next page in write mode, because it must be a leaf.
*/
- if (opaque->btpo.level == 1 && access == BT_WRITE)
+ if (opaque->btpo_level == 1 && access == BT_WRITE)
page_access = BT_WRITE;
/* drop the read lock on the page, then acquire one on its child */
@@ -2341,9 +2341,9 @@ _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
}
/* Done? */
- if (opaque->btpo.level == level)
+ if (opaque->btpo_level == level)
break;
- if (opaque->btpo.level < level)
+ if (opaque->btpo_level < level)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("btree level %u not found in index \"%s\"",
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5683daa34d..2c4d7f6e25 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -620,7 +620,7 @@ _bt_blnewpage(uint32 level)
/* Initialize BT opaque state */
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_prev = opaque->btpo_next = P_NONE;
- opaque->btpo.level = level;
+ opaque->btpo_level = level;
opaque->btpo_flags = (level > 0) ? 0 : BTP_LEAF;
opaque->btpo_cycleid = 0;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c1d578cc01..bc6cc7c668 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -112,7 +112,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
md->btm_fastlevel = xlrec->fastlevel;
/* Cannot log BTREE_MIN_VERSION index metapage without upgrade */
Assert(md->btm_version >= BTREE_NOVAC_VERSION);
- md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
+ md->btm_last_cleanup_num_delpages = xlrec->last_cleanup_num_delpages;
md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
md->btm_allequalimage = xlrec->allequalimage;
@@ -297,7 +297,7 @@ btree_xlog_split(bool newitemonleft, XLogReaderState *record)
ropaque->btpo_prev = origpagenumber;
ropaque->btpo_next = spagenumber;
- ropaque->btpo.level = xlrec->level;
+ ropaque->btpo_level = xlrec->level;
ropaque->btpo_flags = isleaf ? BTP_LEAF : 0;
ropaque->btpo_cycleid = 0;
@@ -773,7 +773,7 @@ btree_xlog_mark_page_halfdead(uint8 info, XLogReaderState *record)
pageop->btpo_prev = xlrec->leftblk;
pageop->btpo_next = xlrec->rightblk;
- pageop->btpo.level = 0;
+ pageop->btpo_level = 0;
pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
pageop->btpo_cycleid = 0;
@@ -802,6 +802,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
BlockNumber leftsib;
BlockNumber rightsib;
+ uint32 level;
+ bool isleaf;
+ FullTransactionId safexid;
Buffer leftbuf;
Buffer target;
Buffer rightbuf;
@@ -810,6 +813,12 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
leftsib = xlrec->leftsib;
rightsib = xlrec->rightsib;
+ level = xlrec->level;
+ isleaf = (level == 0);
+ safexid = xlrec->safexid;
+
+ /* No topparent link for leaf page (level 0) or level 1 */
+ Assert(xlrec->topparent == InvalidBlockNumber || level > 1);
/*
* In normal operation, we would lock all the pages this WAL record
@@ -844,9 +853,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
pageop->btpo_prev = leftsib;
pageop->btpo_next = rightsib;
- pageop->btpo.xact = xlrec->btpo_xact;
- pageop->btpo_flags = BTP_DELETED;
- if (!BlockNumberIsValid(xlrec->topparent))
+ pageop->btpo_level = level;
+ BTPageSetDeleted(page, safexid);
+ if (isleaf)
pageop->btpo_flags |= BTP_LEAF;
pageop->btpo_cycleid = 0;
@@ -892,6 +901,8 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
Buffer leafbuf;
IndexTupleData trunctuple;
+ Assert(!isleaf);
+
leafbuf = XLogInitBufferForRedo(record, 3);
page = (Page) BufferGetPage(leafbuf);
@@ -901,7 +912,7 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
pageop->btpo_prev = xlrec->leafleftsib;
pageop->btpo_next = xlrec->leafrightsib;
- pageop->btpo.level = 0;
+ pageop->btpo_level = 0;
pageop->btpo_cycleid = 0;
/* Add a dummy hikey item */
@@ -942,7 +953,7 @@ btree_xlog_newroot(XLogReaderState *record)
pageop->btpo_flags = BTP_ROOT;
pageop->btpo_prev = pageop->btpo_next = P_NONE;
- pageop->btpo.level = xlrec->level;
+ pageop->btpo_level = xlrec->level;
if (xlrec->level == 0)
pageop->btpo_flags |= BTP_LEAF;
pageop->btpo_cycleid = 0;
@@ -969,20 +980,22 @@ btree_xlog_reuse_page(XLogReaderState *record)
xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) XLogRecGetData(record);
/*
- * Btree reuse_page records exist to provide a conflict point when we
- * reuse pages in the index via the FSM. That's all they do though.
+	 * latestRemovedFullXid was the deleted page's safexid value, set at the
+	 * point the page was deleted.  The safexid value is used once again at
+	 * the point that the page is about to be recycled for some unrelated new
+	 * page in the index.  That only happens when an xl_btree_reuse_page WAL
+	 * record must be written during original execution, because it might be
+	 * necessary for us to generate a conflict here (when in Hot Standby mode).
*
- * latestRemovedXid was the page's btpo.xact. The
- * GlobalVisCheckRemovableXid test in _bt_page_recyclable() conceptually
- * mirrors the pgxact->xmin > limitXmin test in
+ * GlobalVisCheckRemovableFullXid() tests are used to determine if it's
+ * safe to recycle a page (that was deleted by VACUUM earlier on) during
+ * original execution. This mirrors the PGPROC->xmin > limitXmin test in
* GetConflictingVirtualXIDs(). Consequently, one XID value achieves the
* same exclusion effect on primary and standby.
*/
if (InHotStandby)
- {
- ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
- xlrec->node);
- }
+ ResolveRecoveryConflictWithSnapshotFullXid(xlrec->latestRemovedFullXid,
+ xlrec->node);
}
void
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 6e0d6a2b72..5cce10a5b6 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -80,9 +80,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) rec;
- appendStringInfo(buf, "left %u; right %u; btpo_xact %u; ",
- xlrec->leftsib, xlrec->rightsib,
- xlrec->btpo_xact);
+ appendStringInfo(buf, "left %u; right %u; level %u; safexid %u:%u; ",
+ xlrec->leftsib, xlrec->rightsib, xlrec->level,
+ EpochFromFullTransactionId(xlrec->safexid),
+ XidFromFullTransactionId(xlrec->safexid));
appendStringInfo(buf, "leafleft %u; leafright %u; topparent %u",
xlrec->leafleftsib, xlrec->leafrightsib,
xlrec->topparent);
@@ -99,9 +100,11 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) rec;
- appendStringInfo(buf, "rel %u/%u/%u; latestRemovedXid %u",
+ appendStringInfo(buf, "rel %u/%u/%u; latestRemovedXid %u:%u",
xlrec->node.spcNode, xlrec->node.dbNode,
- xlrec->node.relNode, xlrec->latestRemovedXid);
+ xlrec->node.relNode,
+ EpochFromFullTransactionId(xlrec->latestRemovedFullXid),
+ XidFromFullTransactionId(xlrec->latestRemovedFullXid));
break;
}
case XLOG_BTREE_META_CLEANUP:
@@ -110,8 +113,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
xlrec = (xl_btree_metadata *) XLogRecGetBlockData(record, 0,
NULL);
- appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f",
- xlrec->oldest_btpo_xact,
+ appendStringInfo(buf, "last_cleanup_num_delpages %u; last_cleanup_num_heap_tuples: %f",
+ xlrec->last_cleanup_num_delpages,
xlrec->last_cleanup_num_heap_tuples);
break;
}
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 5877a60715..96ee839b3f 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -452,6 +452,34 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode
true);
}
+/*
+ * Variant of ResolveRecoveryConflictWithSnapshot that works with
+ * FullTransactionId values
+ */
+void
+ResolveRecoveryConflictWithSnapshotFullXid(FullTransactionId latestRemovedFullXid,
+ RelFileNode node)
+{
+ /*
+ * ResolveRecoveryConflictWithSnapshot operates on 32-bit TransactionIds,
+ * so truncate the logged FullTransactionId. If the logged value is very
+ * old, so that XID wrap-around already happened on it, there can't be any
+ * snapshots that still see it.
+ */
+ FullTransactionId nextXid = ReadNextFullTransactionId();
+ uint64 diff;
+
+ diff = U64FromFullTransactionId(nextXid) -
+ U64FromFullTransactionId(latestRemovedFullXid);
+ if (diff < MaxTransactionId / 2)
+ {
+ TransactionId latestRemovedXid;
+
+ latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid, node);
+ }
+}
+
void
ResolveRecoveryConflictWithTablespace(Oid tsid)
{
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 4db1a64d51..c1b755ad05 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -769,7 +769,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
P_FIRSTDATAKEY(opaque));
itup = (IndexTuple) PageGetItem(state->target, itemid);
nextleveldown.leftmost = BTreeTupleGetDownLink(itup);
- nextleveldown.level = opaque->btpo.level - 1;
+ nextleveldown.level = opaque->btpo_level - 1;
}
else
{
@@ -794,14 +794,14 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
if (opaque->btpo_prev != leftcurrent)
bt_recheck_sibling_links(state, opaque->btpo_prev, leftcurrent);
- /* Check level, which must be valid for non-ignorable page */
- if (level.level != opaque->btpo.level)
+ /* Check level */
+ if (level.level != opaque->btpo_level)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("leftmost down link for level points to block in index \"%s\" whose level is not one level down",
RelationGetRelationName(state->rel)),
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
- current, level.level, opaque->btpo.level)));
+ current, level.level, opaque->btpo_level)));
/* Verify invariants for page */
bt_target_page_check(state);
@@ -1167,7 +1167,7 @@ bt_target_page_check(BtreeCheckState *state)
bt_child_highkey_check(state,
offset,
NULL,
- topaque->btpo.level);
+ topaque->btpo_level);
}
continue;
}
@@ -1529,7 +1529,7 @@ bt_target_page_check(BtreeCheckState *state)
if (!P_ISLEAF(topaque) && P_RIGHTMOST(topaque) && state->readonly)
{
bt_child_highkey_check(state, InvalidOffsetNumber,
- NULL, topaque->btpo.level);
+ NULL, topaque->btpo_level);
}
}
@@ -1606,7 +1606,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
ereport(DEBUG1,
(errcode(ERRCODE_NO_DATA),
errmsg_internal("level %u leftmost page of index \"%s\" was found deleted or half dead",
- opaque->btpo.level, RelationGetRelationName(state->rel)),
+ opaque->btpo_level, RelationGetRelationName(state->rel)),
errdetail_internal("Deleted page found when building scankey from right sibling.")));
/* Be slightly more pro-active in freeing this memory, just in case */
@@ -1910,14 +1910,15 @@ bt_child_highkey_check(BtreeCheckState *state,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
- /* Check level for non-ignorable page */
- if (!P_IGNORE(opaque) && opaque->btpo.level != target_level - 1)
+ /* Do level sanity check */
+ if ((!P_ISDELETED(opaque) || P_HAS_FULLXID(opaque)) &&
+ opaque->btpo_level != target_level - 1)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block found while following rightlinks from child of index \"%s\" has invalid level",
RelationGetRelationName(state->rel)),
errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
- blkno, target_level - 1, opaque->btpo.level)));
+ blkno, target_level - 1, opaque->btpo_level)));
/* Try to detect circular links */
if ((!first && blkno == state->prevrightlink) || blkno == opaque->btpo_prev)
@@ -2145,7 +2146,7 @@ bt_child_check(BtreeCheckState *state, BTScanInsert targetkey,
* check for downlink connectivity.
*/
bt_child_highkey_check(state, downlinkoffnum,
- child, topaque->btpo.level);
+ child, topaque->btpo_level);
/*
* Since there cannot be a concurrent VACUUM operation in readonly mode,
@@ -2290,7 +2291,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
errmsg_internal("harmless interrupted page split detected in index %s",
RelationGetRelationName(state->rel)),
errdetail_internal("Block=%u level=%u left sibling=%u page lsn=%X/%X.",
- blkno, opaque->btpo.level,
+ blkno, opaque->btpo_level,
opaque->btpo_prev,
(uint32) (pagelsn >> 32),
(uint32) pagelsn)));
@@ -2321,7 +2322,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
elog(DEBUG1, "checking for interrupted multi-level deletion due to missing downlink in index \"%s\"",
RelationGetRelationName(state->rel));
- level = opaque->btpo.level;
+ level = opaque->btpo_level;
itemid = PageGetItemIdCareful(state, blkno, page, P_FIRSTDATAKEY(opaque));
itup = (IndexTuple) PageGetItem(page, itemid);
childblk = BTreeTupleGetDownLink(itup);
@@ -2336,16 +2337,16 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
break;
/* Do an extra sanity check in passing on internal pages */
- if (copaque->btpo.level != level - 1)
+ if (copaque->btpo_level != level - 1)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("downlink points to block in index \"%s\" whose level is not one level down",
RelationGetRelationName(state->rel)),
errdetail_internal("Top parent/under check block=%u block pointed to=%u expected level=%u level in pointed to block=%u.",
blkno, childblk,
- level - 1, copaque->btpo.level)));
+ level - 1, copaque->btpo_level)));
- level = copaque->btpo.level;
+ level = copaque->btpo_level;
itemid = PageGetItemIdCareful(state, childblk, child,
P_FIRSTDATAKEY(copaque));
itup = (IndexTuple) PageGetItem(child, itemid);
@@ -2407,7 +2408,7 @@ bt_downlink_missing_check(BtreeCheckState *state, bool rightsplit,
errmsg("internal index block lacks downlink in index \"%s\"",
RelationGetRelationName(state->rel)),
errdetail_internal("Block=%u level=%u page lsn=%X/%X.",
- blkno, opaque->btpo.level,
+ blkno, opaque->btpo_level,
(uint32) (pagelsn >> 32),
(uint32) pagelsn)));
}
@@ -3002,21 +3003,28 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
}
/*
- * Deleted pages have no sane "level" field, so can only check non-deleted
- * page level
+ * Deleted pages that still use the old 32-bit XID representation have no
+ * sane "level" field because they type pun the field, but all other pages
+ * (including pages deleted on Postgres 14+) have a valid value.
*/
- if (P_ISLEAF(opaque) && !P_ISDELETED(opaque) && opaque->btpo.level != 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("invalid leaf page level %u for block %u in index \"%s\"",
- opaque->btpo.level, blocknum, RelationGetRelationName(state->rel))));
+ if (!P_ISDELETED(opaque) || P_HAS_FULLXID(opaque))
+ {
+ /* Okay, no reason not to trust btpo_level field from page */
- if (!P_ISLEAF(opaque) && !P_ISDELETED(opaque) &&
- opaque->btpo.level == 0)
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("invalid internal page level 0 for block %u in index \"%s\"",
- blocknum, RelationGetRelationName(state->rel))));
+ if (P_ISLEAF(opaque) && opaque->btpo_level != 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("invalid leaf page level %u for block %u in index \"%s\"",
+ opaque->btpo_level, blocknum,
+ RelationGetRelationName(state->rel))));
+
+ if (!P_ISLEAF(opaque) && opaque->btpo_level == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("invalid internal page level 0 for block %u in index \"%s\"",
+ blocknum,
+ RelationGetRelationName(state->rel))));
+ }
/*
* Sanity checks for number of items on page.
@@ -3063,8 +3071,6 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
* state. This state is nonetheless treated as corruption by VACUUM on
* from version 9.4 on, so do the same here. See _bt_pagedel() for full
* details.
- *
- * Internal pages should never have garbage items, either.
*/
if (!P_ISLEAF(opaque) && P_ISHALFDEAD(opaque))
ereport(ERROR,
@@ -3073,11 +3079,27 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
blocknum, RelationGetRelationName(state->rel)),
errhint("This can be caused by an interrupted VACUUM in version 9.3 or older, before upgrade. Please REINDEX it.")));
+ /*
+ * Check that internal pages have no garbage items, and that no page has
+ * an invalid combination of page deletion related page level flags
+ */
if (!P_ISLEAF(opaque) && P_HAS_GARBAGE(opaque))
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg("internal page block %u in index \"%s\" has garbage items",
- blocknum, RelationGetRelationName(state->rel))));
+ errmsg_internal("internal page block %u in index \"%s\" has garbage items",
+ blocknum, RelationGetRelationName(state->rel))));
+
+ if (P_HAS_FULLXID(opaque) && !P_ISDELETED(opaque))
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("full transaction id page flag appears in non-deleted block %u in index \"%s\"",
+ blocknum, RelationGetRelationName(state->rel))));
+
+ if (P_ISDELETED(opaque) && P_ISHALFDEAD(opaque))
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("deleted page block %u in index \"%s\" is half-dead",
+ blocknum, RelationGetRelationName(state->rel))));
return page;
}
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8bb180bbbe..b7725b572f 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -75,11 +75,7 @@ typedef struct BTPageStat
/* opaque data */
BlockNumber btpo_prev;
BlockNumber btpo_next;
- union
- {
- uint32 level;
- TransactionId xact;
- } btpo;
+ uint32 btpo_level;
uint16 btpo_flags;
BTCycleId btpo_cycleid;
} BTPageStat;
@@ -112,9 +108,33 @@ GetBTPageStatistics(BlockNumber blkno, Buffer buffer, BTPageStat *stat)
/* page type (flags) */
if (P_ISDELETED(opaque))
{
- stat->type = 'd';
- stat->btpo.xact = opaque->btpo.xact;
- return;
+ /* We divide deleted pages into leaf ('d') or internal ('D') */
+ if (P_ISLEAF(opaque) || !P_HAS_FULLXID(opaque))
+ stat->type = 'd';
+ else
+ stat->type = 'D';
+
+ /*
+ * Report safexid in a deleted page.
+ *
+ * Handle pg_upgrade'd deleted pages that used the previous safexid
+ * representation in btpo_level field (this used to be a union type
+ * called "btpo").
+ */
+ if (P_HAS_FULLXID(opaque))
+ {
+ FullTransactionId safexid = BTPageGetDeleteXid(page);
+
+ elog(NOTICE, "deleted page from block %u has safexid %u:%u",
+ blkno, EpochFromFullTransactionId(safexid),
+ XidFromFullTransactionId(safexid));
+ }
+ else
+ elog(NOTICE, "deleted page from block %u has safexid %u",
+ blkno, opaque->btpo_level);
+
+ /* Don't interpret BTDeletedPageData as index tuples */
+ maxoff = InvalidOffsetNumber;
}
else if (P_IGNORE(opaque))
stat->type = 'e';
@@ -128,7 +148,7 @@ GetBTPageStatistics(BlockNumber blkno, Buffer buffer, BTPageStat *stat)
/* btpage opaque data */
stat->btpo_prev = opaque->btpo_prev;
stat->btpo_next = opaque->btpo_next;
- stat->btpo.level = opaque->btpo.level;
+ stat->btpo_level = opaque->btpo_level;
stat->btpo_flags = opaque->btpo_flags;
stat->btpo_cycleid = opaque->btpo_cycleid;
@@ -237,7 +257,7 @@ bt_page_stats_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
values[j++] = psprintf("%u", stat.free_size);
values[j++] = psprintf("%u", stat.btpo_prev);
values[j++] = psprintf("%u", stat.btpo_next);
- values[j++] = psprintf("%u", (stat.type == 'd') ? stat.btpo.xact : stat.btpo.level);
+ values[j++] = psprintf("%u", stat.btpo_level);
values[j++] = psprintf("%d", stat.btpo_flags);
tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
@@ -503,10 +523,14 @@ bt_page_items_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
opaque = (BTPageOpaque) PageGetSpecialPointer(uargs->page);
- if (P_ISDELETED(opaque))
- elog(NOTICE, "page is deleted");
-
- fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ if (!P_ISDELETED(opaque))
+ fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ else
+ {
+ /* Don't interpret BTDeletedPageData as index tuples */
+ elog(NOTICE, "page from block " INT64_FORMAT " is deleted", blkno);
+ fctx->max_calls = 0;
+ }
uargs->leafpage = P_ISLEAF(opaque);
uargs->rightmost = P_RIGHTMOST(opaque);
@@ -603,7 +627,14 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (P_ISDELETED(opaque))
elog(NOTICE, "page is deleted");
- fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ if (!P_ISDELETED(opaque))
+ fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ else
+ {
+ /* Don't interpret BTDeletedPageData as index tuples */
+ elog(NOTICE, "page from block is deleted");
+ fctx->max_calls = 0;
+ }
uargs->leafpage = P_ISLEAF(opaque);
uargs->rightmost = P_RIGHTMOST(opaque);
@@ -692,10 +723,7 @@ bt_metap(PG_FUNCTION_ARGS)
/*
* We need a kluge here to detect API versions prior to 1.8. Earlier
- * versions incorrectly used int4 for certain columns. This caused
- * various problems. For example, an int4 version of the "oldest_xact"
- * column would not work with TransactionId values that happened to exceed
- * PG_INT32_MAX.
+ * versions incorrectly used int4 for certain columns.
*
* There is no way to reliably avoid the problems created by the old
* function definition at this point, so insist that the user update the
@@ -723,7 +751,8 @@ bt_metap(PG_FUNCTION_ARGS)
*/
if (metad->btm_version >= BTREE_NOVAC_VERSION)
{
- values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
+ values[j++] = psprintf(INT64_FORMAT,
+ (int64) metad->btm_last_cleanup_num_delpages);
values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
values[j++] = metad->btm_allequalimage ? "t" : "f";
}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index a7632be36a..c60bc88560 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -3,16 +3,16 @@ INSERT INTO test1 VALUES (72057594037927937, 'text');
CREATE INDEX test1_a_idx ON test1 USING btree (a);
\x
SELECT * FROM bt_metap('test1_a_idx');
--[ RECORD 1 ]-----------+-------
-magic | 340322
-version | 4
-root | 1
-level | 0
-fastroot | 1
-fastlevel | 0
-oldest_xact | 0
-last_cleanup_num_tuples | -1
-allequalimage | t
+-[ RECORD 1 ]-------------+-------
+magic | 340322
+version | 4
+root | 1
+level | 0
+fastroot | 1
+fastlevel | 0
+last_cleanup_num_delpages | 0
+last_cleanup_num_tuples | -1
+allequalimage | t
SELECT * FROM bt_page_stats('test1_a_idx', -1);
ERROR: invalid block number
@@ -29,7 +29,7 @@ page_size | 8192
free_size | 8128
btpo_prev | 0
btpo_next | 0
-btpo | 0
+btpo_level | 0
btpo_flags | 3
SELECT * FROM bt_page_stats('test1_a_idx', 2);
diff --git a/contrib/pageinspect/pageinspect--1.8--1.9.sql b/contrib/pageinspect/pageinspect--1.8--1.9.sql
index 79a42a7b11..be89a64ca1 100644
--- a/contrib/pageinspect/pageinspect--1.8--1.9.sql
+++ b/contrib/pageinspect/pageinspect--1.8--1.9.sql
@@ -66,6 +66,23 @@ RETURNS smallint
AS 'MODULE_PATHNAME', 'page_checksum_1_9'
LANGUAGE C STRICT PARALLEL SAFE;
+--
+-- bt_metap()
+--
+DROP FUNCTION bt_metap(text);
+CREATE FUNCTION bt_metap(IN relname text,
+ OUT magic int4,
+ OUT version int4,
+ OUT root int8,
+ OUT level int8,
+ OUT fastroot int8,
+ OUT fastlevel int8,
+ OUT last_cleanup_num_delpages int8,
+ OUT last_cleanup_num_tuples float8,
+ OUT allequalimage boolean)
+AS 'MODULE_PATHNAME', 'bt_metap'
+LANGUAGE C STRICT PARALLEL SAFE;
+
--
-- bt_page_stats()
--
@@ -80,7 +97,7 @@ CREATE FUNCTION bt_page_stats(IN relname text, IN blkno int8,
OUT free_size int4,
OUT btpo_prev int8,
OUT btpo_next int8,
- OUT btpo int4,
+ OUT btpo_level int8,
OUT btpo_flags int4)
AS 'MODULE_PATHNAME', 'bt_page_stats_1_9'
LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index b1ce0d77d7..5368bb30f0 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -283,8 +283,12 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
page = BufferGetPage(buffer);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- /* Determine page type, and update totals */
-
+ /*
+ * Determine page type, and update totals.
+ *
+ * Note that we arbitrarily bucket deleted pages together without
+ * considering if they're leaf pages or internal pages.
+ */
if (P_ISDELETED(opaque))
indexStat.deleted_pages++;
else if (P_IGNORE(opaque))
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index c733341984..00d3a15454 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -298,16 +298,16 @@ test=# SELECT t_ctid, raw_flags, combined_flags
index's metapage. For example:
<screen>
test=# SELECT * FROM bt_metap('pg_cast_oid_index');
--[ RECORD 1 ]-----------+-------
-magic | 340322
-version | 4
-root | 1
-level | 0
-fastroot | 1
-fastlevel | 0
-oldest_xact | 582
-last_cleanup_num_tuples | 1000
-allequalimage | f
+-[ RECORD 1 ]-------------+-------
+magic | 340322
+version | 4
+root | 1
+level | 0
+fastroot | 1
+fastlevel | 0
+last_cleanup_num_delpages | 0
+last_cleanup_num_tuples | 230
+allequalimage | f
</screen>
</para>
</listitem>
@@ -337,7 +337,7 @@ page_size | 8192
free_size | 3668
btpo_prev | 0
btpo_next | 0
-btpo | 0
+btpo_level | 0
btpo_flags | 3
</screen>
</para>
--
2.27.0
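As a quick way to kick the tires on the pageinspect changes above (a
sketch only -- it assumes the updated 1.9 extension from this patch and
a scratch index named test1_a_idx, as in the regression test, with at
least one deleted page; block number 1 is purely illustrative):

-- The metapage no longer reports oldest_xact; it reports
-- last_cleanup_num_delpages instead.
SELECT last_cleanup_num_delpages, last_cleanup_num_tuples
FROM bt_metap('test1_a_idx');

-- A deleted page now reports type 'd' (leaf) or 'D' (internal), keeps a
-- sane btpo_level, and a NOTICE shows its safexid.
SELECT type, btpo_level, btpo_flags
FROM bt_page_stats('test1_a_idx', 1);

-- bt_page_items() no longer tries to interpret BTDeletedPageData as
-- index tuples, so it returns zero rows for a deleted page.
SELECT itemoffset, ctid FROM bt_page_items('test1_a_idx', 1);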
On Fri, Feb 19, 2021 at 3:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Feb 18, 2021 at 3:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Agreed. Thanks for your explanation.
Attached is v5, which has some of the changes I talked about. Changes
from v4 include:

* Now only updates metapage during btvacuumcleanup() in the first
patch, which is enough to fix the existing
IndexVacuumInfo.num_heap_tuples issue.

* Restored _bt_getbuf() page-from-FSM XID check. Out of sheer paranoia.

* The second patch in the series now respects work_mem when sizing the
BTPendingRecycle array.

* New enhancement to the XID GlobalVisCheckRemovableFullXid() test
used in the second patch, to allow it to recycle even more pages.
(Still unsure of some of the details here.)
Thank you for updating the patch!
I would like to commit the first patch in a few days -- I refer to the
big patch that makes deleted page XIDs 64-bit/full. Can you take a
look at that one, Masahiko? That would be helpful. I can produce a bug
fix for the IndexVacuumInfo.num_heap_tuples issue fairly easily, but I
think that that should be written after the first patch is finalized
and committed.
I'll look at the first patch first.
The second patch (the new recycling optimization) will require more
work and testing.
Then I'll also look at those patches.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Fri, Feb 19, 2021 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Feb 19, 2021 at 3:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Feb 18, 2021 at 3:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Agreed. Thanks for your explanation.
Attached is v5, which has some of the changes I talked about. Changes
from v4 include:

* Now only updates metapage during btvacuumcleanup() in the first
patch, which is enough to fix the existing
IndexVacuumInfo.num_heap_tuples issue.

* Restored _bt_getbuf() page-from-FSM XID check. Out of sheer paranoia.

* The second patch in the series now respects work_mem when sizing the
BTPendingRecycle array.

* New enhancement to the XID GlobalVisCheckRemovableFullXid() test
used in the second patch, to allow it to recycle even more pages.
(Still unsure of some of the details here.)

Thank you for updating the patch!

I would like to commit the first patch in a few days -- I refer to the
big patch that makes deleted page XIDs 64-bit/full. Can you take a
look at that one, Masahiko? That would be helpful. I can produce a bug
fix for the IndexVacuumInfo.num_heap_tuples issue fairly easily, but I
think that that should be written after the first patch is finalized
and committed.

I'll look at the first patch first.
The 0001 patch looks good to me. In the documentation, I think we need
to update the following paragraph in the description of
vacuum_cleanup_index_scale_factor:
If no tuples were deleted from the heap, B-tree indexes are still
scanned at the VACUUM cleanup stage when at least one of the following
conditions is met: the index statistics are stale, or the index
contains deleted pages that can be recycled during cleanup. Index
statistics are considered to be stale if the number of newly inserted
tuples exceeds the vacuum_cleanup_index_scale_factor fraction of the
total number of heap tuples detected by the previous statistics
collection. The total number of heap tuples is stored in the index
meta-page. Note that the meta-page does not include this data until
VACUUM finds no dead tuples, so B-tree index scan at the cleanup stage
can only be skipped if the second and subsequent VACUUM cycles detect
no dead tuples.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, Feb 22, 2021 at 4:21 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The 0001 patch looks good to me. In the documentation, I think we need
to update the following paragraph in the description of
vacuum_cleanup_index_scale_factor:
Good point. I think that the structure should make the page deletion
triggering condition have only secondary importance -- it is only
described at all to be complete and exhaustive. The
vacuum_cleanup_index_scale_factor-related threshold is all that users
will really care about in this area.
The reasons for this are: it's pretty rare to have many page
deletions, but never again delete/non-hot update even one single
tuple. But when that happens, it's *much* rarer still to *also* have
inserts, that might actually benefit from recycling the deleted page.
So it's very narrow.
I think that I'll add a "Note" box that talks about the page deletion
stuff, right at the end. It's actually kind of an awkward thing to
describe, and yet I think we still need to describe it.
I also think that the existing documentation should clearly point out
that the vacuum_cleanup_index_scale_factor only gets considered when
there are no updates or deletes since the last VACUUM -- that seems
like an existing problem worth fixing now. It's way too unclear that
this setting only really concerns append-only tables.
--
Peter Geoghegan
On Tue, Feb 23, 2021 at 7:55 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Feb 22, 2021 at 4:21 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The 0001 patch looks good to me. In the documentation, I think we need
to update the following paragraph in the description of
vacuum_cleanup_index_scale_factor:

Good point. I think that the structure should make the page deletion
triggering condition have only secondary importance -- it is only
described at all to be complete and exhaustive. The
vacuum_cleanup_index_scale_factor-related threshold is all that users
will really care about in this area.

The reasons for this are: it's pretty rare to have many page
deletions, but never again delete/non-hot update even one single
tuple. But when that happens, it's *much* rarer still to *also* have
inserts, that might actually benefit from recycling the deleted page.
So it's very narrow.

I think that I'll add a "Note" box that talks about the page deletion
stuff, right at the end. It's actually kind of an awkward thing to
describe, and yet I think we still need to describe it.
Yeah, triggering btvacuumscan() by having many deleted index pages
will become a rare case. Users are unlikely to experience it in
practice. But it's still worth describing it.
I also think that the existing documentation should clearly point out
that the vacuum_cleanup_index_scale_factor only gets considered when
there are no updates or deletes since the last VACUUM -- that seems
like an existing problem worth fixing now. It's way too unclear that
this setting only really concerns append-only tables.
+1
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, Feb 22, 2021 at 02:54:54PM -0800, Peter Geoghegan wrote:
On Mon, Feb 22, 2021 at 4:21 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
The 0001 patch looks good to me. In the documentation, I think we need
to update the following paragraph in the description of
vacuum_cleanup_index_scale_factor:

Good point. I think that the structure should make the page deletion
triggering condition have only secondary importance -- it is only
described at all to be complete and exhaustive. The
vacuum_cleanup_index_scale_factor-related threshold is all that users
will really care about in this area.

The reasons for this are: it's pretty rare to have many page
deletions, but never again delete/non-hot update even one single
tuple. But when that happens, it's *much* rarer still to *also* have
inserts, that might actually benefit from recycling the deleted page.
So it's very narrow.

I think that I'll add a "Note" box that talks about the page deletion
stuff, right at the end. It's actually kind of an awkward thing to
describe, and yet I think we still need to describe it.

I also think that the existing documentation should clearly point out
that the vacuum_cleanup_index_scale_factor only gets considered when
there are no updates or deletes since the last VACUUM -- that seems
like an existing problem worth fixing now. It's way too unclear that
this setting only really concerns append-only tables.
e5d8a999030418a1b9e53d5f15ccaca7ed674877
| I (pgeoghegan) have chosen to remove any mention of deleted pages in the
| documentation of the vacuum_cleanup_index_scale_factor GUC/param, since
| the presence of deleted (though unrecycled) pages is no longer of much
| concern to users. The vacuum_cleanup_index_scale_factor description in
| the docs now seems rather unclear in any case, and it should probably be
| rewritten in the near future. Perhaps some passing mention of page
| deletion will be added back at the same time.
I think 8e12f4a25 wasn't quite aggressive enough in its changes, and I had
another patch lying around. I rebased and came up with this.
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9851ca68b4..5da2e705b9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8522,24 +8522,26 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
Specifies [-the fraction-]{+a multiplier+} of the total number of heap tuples[-counted in-]
[- the previous statistics collection-] that can be
inserted [-without-]{+before+} incurring an index scan at the <command>VACUUM</command>
cleanup stage.
This setting currently applies to B-tree indexes only.
</para>
<para>
[-If-]{+During <command>VACUUM</command>, if there are+} no {+dead+} tuples [-were deleted from-]{+found while+}
{+ scanning+} the heap, [-B-tree-]{+then the index vacuum phase is skipped.+}
{+ However,+} indexes [-are-]{+might+} still {+be+} scanned [-at-]{+during+} the[-<command>VACUUM</command>-] cleanup [-stage when-]{+phase. Setting this+}
{+ parameter enables+} the [-index's-]{+possibility to skip scanning indexes during cleanup.+}
{+ Indexes will always be scanned when their+} statistics are stale.
Index statistics are considered {+to be+} stale if the number of newly
inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
[-fraction-]{+multiplier+} of the total number of heap tuples [-detected by-]{+at the time of+} the previous
[-statistics collection.-]{+vacuum cleanup.+} The total number of heap tuples is stored in
the index meta-page. Note that the meta-page does not include this data
until <command>VACUUM</command> finds no dead tuples, so B-tree index
[-scan-]{+scans+} at the cleanup stage [-can only-]{+cannot+} be skipped [-if the second and-]
[- subsequent <command>VACUUM</command> cycles detect-]{+until after a vacuum cycle+}
{+ which detects+} no dead tuples.
</para>
<para>
Attachments:
0001-fix-vacuum_cleanup_index_scale_factor.patch (text/x-diff)
From 44cf90c5b06fb21c2c4d379568b9fe86a54f1530 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Sun, 15 Mar 2020 13:06:02 -0500
Subject: [PATCH] fix: vacuum_cleanup_index_scale_factor
---
doc/src/sgml/config.sgml | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9851ca68b4..5da2e705b9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8522,24 +8522,26 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</term>
<listitem>
<para>
- Specifies the fraction of the total number of heap tuples counted in
- the previous statistics collection that can be inserted without
- incurring an index scan at the <command>VACUUM</command> cleanup stage.
+ Specifies a multiplier of the total number of heap tuples that can be
+ inserted before incurring an index scan at the <command>VACUUM</command>
+ cleanup stage.
This setting currently applies to B-tree indexes only.
</para>
<para>
- If no tuples were deleted from the heap, B-tree indexes are still
- scanned at the <command>VACUUM</command> cleanup stage when the
- index's statistics are stale. Index statistics are considered
- stale if the number of newly inserted tuples exceeds the
- <varname>vacuum_cleanup_index_scale_factor</varname>
- fraction of the total number of heap tuples detected by the previous
- statistics collection. The total number of heap tuples is stored in
+ During <command>VACUUM</command>, if there are no dead tuples found while
+ scanning the heap, then the index vacuum phase is skipped.
+ However, indexes might still be scanned during the cleanup phase. Setting this
+ parameter enables the possibility to skip scanning indexes during cleanup.
+ Indexes will always be scanned when their statistics are stale.
+ Index statistics are considered to be stale if the number of newly
+ inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
+ multiplier of the total number of heap tuples at the time of the previous
+ vacuum cleanup. The total number of heap tuples is stored in
the index meta-page. Note that the meta-page does not include this data
until <command>VACUUM</command> finds no dead tuples, so B-tree index
- scan at the cleanup stage can only be skipped if the second and
- subsequent <command>VACUUM</command> cycles detect no dead tuples.
+ scans at the cleanup stage cannot be skipped until after a vacuum cycle
+ which detects no dead tuples.
</para>
<para>
--
2.17.0
On Mon, Feb 22, 2021 at 2:54 PM Peter Geoghegan <pg@bowt.ie> wrote:
Good point. I think that the structure should make the page deletion
triggering condition have only secondary importance -- it is only
described at all to be complete and exhaustive. The
vacuum_cleanup_index_scale_factor-related threshold is all that users
will really care about in this area.
I pushed the main 64-bit XID commit just now. Thanks!
Attached is v6, with the two remaining patches. No real changes. Just
want to keep CFBot happy.
I would like to talk about vacuum_cleanup_index_scale_factor some
more. I didn't get very far with the vacuum_cleanup_index_scale_factor
documentation (I just removed the existing references to page
deletion). When I was working on the docs I suddenly wondered: is
vacuum_cleanup_index_scale_factor actually necessary? Can we not get
rid of it completely?
The amvacuumcleanup docs seems to suggest that that would be okay:
"It is OK to return NULL if the index was not changed at all during
the VACUUM operation, but otherwise correct stats should be returned."
Currently, _bt_vacuum_needs_cleanup() gets to decide whether or not
the index will change during VACUUM (assuming no deleted pages in the
case of Postgres 11 - 13, or assuming less than ~5% on Postgres 14).
So why even bother with the heap tuple stuff at all? Why not simply
remove the triggering logic that uses btm_last_cleanup_num_heap_tuples
+ vacuum_cleanup_index_scale_factor completely? We can rely on ANALYZE
to set pg_class.reltuples/pg_class.relpages instead. IIUC this is 100%
allowed by the amvacuumcleanup contract.
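(As a throwaway illustration of that fallback -- nothing to do with the
patches themselves, and using the scratch test1/test1_a_idx names from
the regression test only for the sake of example -- a plain ANALYZE
already refreshes those fields for the table's indexes:

ANALYZE test1;

SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname IN ('test1', 'test1_a_idx');

Any table with an index behaves the same way.)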
I think that the original design that made VACUUM set
pg_class.reltuples/pg_class.relpages in indexes (from 15+ years ago)
assumed that it was cheap to handle statistics in passing -- the
marginal cost was approximately zero, so why not just do it? It was
not because VACUUM thinks it is valuable or urgent, and yet
vacuum_cleanup_index_scale_factor seems to assume that it must.
Of course, it may actually be hard/expensive to update the statistics
due to the vacuum_cleanup_index_scale_factor stuff that was added to
Postgres 11. The autovacuum_vacuum_insert_threshold stuff that was
added to Postgres 13 also seems quite relevant. So I think that there
is an inconsistency here.
I can see one small problem with my plan of relying on ANALYZE to do
this: VACUUM ANALYZE trusts amvacuumcleanup/btvacuumcleanup (when
called by vacuumlazy.c) to set pg_class.reltuples/pg_class.relpages
within do_analyze_rel() -- even when amvacuumcleanup/btvacuumcleanup
returns NULL:
/*
* Same for indexes. Vacuum always scans all indexes, so if we're part of
* VACUUM ANALYZE, don't overwrite the accurate count already inserted by
* VACUUM.
*/
if (!inh && !(params->options & VACOPT_VACUUM))
{
for (ind = 0; ind < nindexes; ind++)
{
AnlIndexData *thisdata = &indexdata[ind];
double totalindexrows;
totalindexrows = ceil(thisdata->tupleFract * totalrows);
vac_update_relstats(Irel[ind],
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
false,
InvalidTransactionId,
InvalidMultiXactId,
in_outer_xact);
}
}
But this just seems like a very old bug to me. This bug can be fixed
separately by teaching VACUUM ANALYZE to recognize cases where indexes
did not have their stats updated in the way it expects.
BTW, note that btvacuumcleanup set pg_class.reltuples to 0 in all
cases following the deduplication commit until my bug fix commit
48e12913 (which was kind of a hack itself). This meant that the
statistics set by btvacuumcleanup (in the case where btbulkdelete
doesn't get called, the relevant case for
vacuum_cleanup_index_scale_factor) were simply wrong. So it was 100%
wrong for months before anybody noticed (or at least before anybody
complained).
Am I missing something here?
--
Peter Geoghegan
Attachments:
v6-0002-Show-pages-newly-deleted-in-VACUUM-VERBOSE-output.patch (text/x-patch)
From 115439b7d68a273ea31231b2a68ed23221f05f36 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 22 Feb 2021 21:40:50 -0800
Subject: [PATCH v6 2/2] Show "pages newly deleted" in VACUUM VERBOSE output.
Teach VACUUM VERBOSE to distinguish between pages that were deleted by
the current VACUUM operation and all deleted pages in the index (without
regard to when or how they became deleted). The latter metric has been
output by VACUUM verbose for many years. Showing both together seems
far more informative.
The new VACUUM VERBOSE field will be helpful to both PostgreSQL users
and PostgreSQL developers that want to understand when and how page
deletions are executed, and when and how free pages can actually be
recycled.
---
src/include/access/genam.h | 11 ++++++++---
src/backend/access/gin/ginvacuum.c | 1 +
src/backend/access/gist/gistvacuum.c | 19 ++++++++++++++++---
src/backend/access/heap/vacuumlazy.c | 4 +++-
src/backend/access/nbtree/nbtpage.c | 4 ++++
src/backend/access/nbtree/nbtree.c | 17 +++++++++++------
src/backend/access/spgist/spgvacuum.c | 1 +
7 files changed, 44 insertions(+), 13 deletions(-)
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index ffa1a4c80d..13971c8b2a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -63,8 +63,12 @@ typedef struct IndexVacuumInfo
* of which this is just the first field; this provides a way for ambulkdelete
* to communicate additional private data to amvacuumcleanup.
*
- * Note: pages_deleted and pages_free refer to free space within the index
- * file. Some index AMs may compute num_index_tuples by reference to
+ * Note: pages_newly_deleted is the number of pages in the index that were
+ * deleted by the current vacuum operation. pages_deleted and pages_free
+ * refer to free space within the index file (and so pages_deleted must be >=
+ * pages_newly_deleted).
+ *
+ * Note: Some index AMs may compute num_index_tuples by reference to
* num_heap_tuples, in which case they should copy the estimated_count field
* from IndexVacuumInfo.
*/
@@ -74,7 +78,8 @@ typedef struct IndexBulkDeleteResult
bool estimated_count; /* num_index_tuples is an estimate */
double num_index_tuples; /* tuples remaining */
double tuples_removed; /* # removed during vacuum operation */
- BlockNumber pages_deleted; /* # unused pages in index */
+ BlockNumber pages_newly_deleted; /* # pages marked deleted by us */
+ BlockNumber pages_deleted; /* # pages marked deleted (could be by us) */
BlockNumber pages_free; /* # pages available for reuse */
} IndexBulkDeleteResult;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a0453b36cd..a276eb020b 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -231,6 +231,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
END_CRIT_SECTION();
+ gvs->result->pages_newly_deleted++;
gvs->result->pages_deleted++;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index ddecb8ab18..0663193531 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -133,9 +133,21 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
MemoryContext oldctx;
/*
- * Reset counts that will be incremented during the scan; needed in case
- * of multiple scans during a single VACUUM command.
+ * Reset fields that track information about the entire index now. This
+ * avoids double-counting in the case where a single VACUUM command
+ * requires multiple scans of the index.
+ *
+ * Avoid resetting the tuples_removed and pages_newly_deleted fields here,
+ * since they track information about the VACUUM command, and so must last
+ * across each call to gistvacuumscan().
+ *
+ * (Note that pages_free is treated as state about the whole index, not
+ * the current VACUUM. This is appropriate because RecordFreeIndexPage()
+ * calls are idempotent, and get repeated for the same deleted pages in
+ * some scenarios. The point for us is to track the number of recyclable
+ * pages in the index at the end of the VACUUM command.)
*/
+ stats->num_pages = 0;
stats->estimated_count = false;
stats->num_index_tuples = 0;
stats->pages_deleted = 0;
@@ -281,8 +293,8 @@ restart:
{
/* Okay to recycle this page */
RecordFreeIndexPage(rel, blkno);
- vstate->stats->pages_free++;
vstate->stats->pages_deleted++;
+ vstate->stats->pages_free++;
}
else if (GistPageIsDeleted(page))
{
@@ -636,6 +648,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
/* mark the page as deleted */
MarkBufferDirty(leafBuffer);
GistPageSetDeleted(leafPage, txid);
+ stats->pages_newly_deleted++;
stats->pages_deleted++;
/* remove the downlink from the parent */
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 0bb78162f5..d8f847b0e6 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2521,9 +2521,11 @@ lazy_cleanup_index(Relation indrel,
(*stats)->num_index_tuples,
(*stats)->num_pages),
errdetail("%.0f index row versions were removed.\n"
- "%u index pages have been deleted, %u are currently reusable.\n"
+ "%u index pages were newly deleted.\n"
+ "%u index pages are currently deleted, of which %u are currently reusable.\n"
"%s.",
(*stats)->tuples_removed,
+ (*stats)->pages_newly_deleted,
(*stats)->pages_deleted, (*stats)->pages_free,
pg_rusage_show(&ru0))));
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 7cf9332be2..9d7d0186d0 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2675,11 +2675,15 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_relbuf(rel, buf);
/*
+ * Maintain pages_newly_deleted, which is simply the number of pages
+ * deleted by the ongoing VACUUM operation.
+ *
* Maintain pages_deleted in a way that takes into account how
* btvacuumpage() will count deleted pages that have yet to become
* scanblkno -- only count page when it's not going to get that treatment
* later on.
*/
+ stats->pages_newly_deleted++;
if (target <= scanblkno)
stats->pages_deleted++;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b25c8c5d5b..c7b05df2df 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -876,6 +876,9 @@ _bt_newly_deleted_pages_recycle(Relation rel, BTVacState *vstate)
IndexBulkDeleteResult *stats = vstate->stats;
Relation heapRel;
+ Assert(vstate->ndeleted > 0);
+ Assert(stats->pages_newly_deleted >= vstate->ndeleted);
+
/*
* Recompute VACUUM XID boundaries.
*
@@ -1078,9 +1081,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* avoids double-counting in the case where a single VACUUM command
* requires multiple scans of the index.
*
- * Avoid resetting the tuples_removed field here, since it tracks
- * information about the VACUUM command, and so must last across each call
- * to btvacuumscan().
+ * Avoid resetting the tuples_removed and pages_newly_deleted fields here,
+ * since they track information about the VACUUM command, and so must last
+ * across each call to btvacuumscan().
*
* (Note that pages_free is treated as state about the whole index, not
* the current VACUUM. This is appropriate because RecordFreeIndexPage()
@@ -1321,8 +1324,8 @@ backtrack:
else if (P_ISHALFDEAD(opaque))
{
/*
- * Half-dead leaf page. Try to delete now. Might update
- * pages_deleted below.
+ * Half-dead leaf page. Try to delete now. Might end up incrementing
+ * pages_newly_deleted/pages_deleted inside _bt_pagedel.
*/
attempt_pagedel = true;
}
@@ -1534,7 +1537,9 @@ backtrack:
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
/*
- * _bt_pagedel maintains the bulk delete stats on our behalf
+ * _bt_pagedel maintains the bulk delete stats on our behalf;
+ * pages_newly_deleted and pages_deleted are likely to be incremented
+ * during call
*/
Assert(blkno == scanblkno);
_bt_pagedel(rel, buf, vstate);
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0d02a02222..a9ffca5183 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -891,6 +891,7 @@ spgvacuumscan(spgBulkDeleteState *bds)
/* Report final stats */
bds->stats->num_pages = num_pages;
+ bds->stats->pages_newly_deleted = bds->stats->pages_deleted;
bds->stats->pages_free = bds->stats->pages_deleted;
}
--
2.27.0
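To make the VACUUM VERBOSE change in the patch above concrete, the
relevant part of the log output would now look roughly like the
following. The numbers are invented; only the message shape follows the
errmsg/errdetail strings touched in vacuumlazy.c:

VACUUM VERBOSE test1;
INFO:  index "test1_a_idx" now contains 1000 row versions in 43 pages
DETAIL:  500 index row versions were removed.
2 index pages were newly deleted.
38 index pages are currently deleted, of which 38 are currently reusable.
CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.02 s.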
v6-0001-Recycle-pages-deleted-during-same-VACUUM.patch (text/x-patch)
From 3f3a1b3d856c28d37d3d61ed7460943de03f8199 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 22 Feb 2021 21:40:50 -0800
Subject: [PATCH v6 1/2] Recycle pages deleted during same VACUUM.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wzk76_P=67iUscb1UN44-gyZL-KgpsXbSxq_bdcMa7Q+wQ@mail.gmail.com
---
src/include/access/nbtree.h | 38 +++++++++-
src/backend/access/nbtree/README | 31 ++++++++
src/backend/access/nbtree/nbtpage.c | 90 ++++++++++++++++------
src/backend/access/nbtree/nbtree.c | 111 ++++++++++++++++++++++++----
src/backend/access/nbtree/nbtxlog.c | 22 ++++++
5 files changed, 251 insertions(+), 41 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9ac90d7439..736d69b304 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -279,7 +279,8 @@ BTPageGetDeleteXid(Page page)
* Is an existing page recyclable?
*
* This exists to centralize the policy on which deleted pages are now safe to
- * re-use.
+ * re-use. The _bt_newly_deleted_pages_recycle() optimization behaves more
+ * aggressively, though that has certain known limitations.
*
* Note: PageIsNew() pages are always safe to recycle, but we can't deal with
* them here (caller is responsible for that case themselves). Caller might
@@ -312,6 +313,39 @@ BTPageIsRecyclable(Page page)
return false;
}
+/*
+ * BTVacState is nbtree.c state used during VACUUM. It is exported for use by
+ * page deletion related code in nbtpage.c.
+ */
+typedef struct BTPendingRecycle
+{
+ BlockNumber blkno;
+ FullTransactionId safexid;
+} BTPendingRecycle;
+
+typedef struct BTVacState
+{
+ /*
+ * VACUUM operation state
+ */
+ IndexVacuumInfo *info;
+ IndexBulkDeleteResult *stats;
+ IndexBulkDeleteCallback callback;
+ void *callback_state;
+ BTCycleId cycleid;
+
+ /*
+ * Page deletion state for VACUUM
+ */
+ MemoryContext pagedelcontext;
+ BTPendingRecycle *deleted;
+ bool grow;
+ bool full;
+ uint32 ndeletedspace;
+ uint64 maxndeletedspace;
+ uint32 ndeleted;
+} BTVacState;
+
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
* page. The high key is not a tuple that is used to visit the heap. It is
@@ -1181,7 +1215,7 @@ extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
Relation heapRel,
TM_IndexDeleteOp *delstate);
-extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf);
+extern void _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 46d49bf025..265814ea46 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -430,6 +430,37 @@ whenever it is subsequently taken from the FSM for reuse. The deleted
page's contents will be overwritten by the split operation (it will become
the new right sibling page).
+Prior to PostgreSQL 14, VACUUM was only able to recycle pages that were
+deleted by a previous VACUUM operation (VACUUM typically placed all pages
+deleted by the last VACUUM into the FSM, though there were and are no
+certainties here). This had the obvious disadvantage of creating
+uncertainty about when and how pages get recycled, especially with bursty
+workloads. It was naive, even within the constraints of the design, since
+there is no reason to think that it will take long for a deleted page to
+become recyclable. It's convenient to use XIDs to implement the drain
+technique, but that is totally unrelated to any of the other things that
+VACUUM needs to do with XIDs.
+
+VACUUM operations now consider if it's possible to recycle any pages that
+the same operation deleted after the physical scan of the index, the last
+point it's convenient to do one last check. This changes nothing about
+the basic design, and so it might still not be possible to recycle any
+pages at that time (e.g., there might not even be one single new
+transaction after an index page deletion, but before VACUUM ends). But
+we have little to lose and plenty to gain by trying. We only need to keep
+around a little information about recently deleted pages in local memory.
+We don't even have to access the deleted pages a second time.
+
+Currently VACUUM delays considering the possibility of recycling its own
+recently deleted page until the end of its btbulkdelete scan (or until the
+end of btvacuumcleanup in cases where there were no tuples to delete in
+the index). It would be slightly more effective if btbulkdelete page
+deletions were deferred until btvacuumcleanup, simply because more time
+will have passed. Our current approach works well enough in practice,
+especially in cases where it really matters: cases where we're vacuuming a
+large index, where recycling pages sooner rather than later is
+particularly likely to matter.
+
Fastpath For Index Insertion
----------------------------
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index a43805a7b0..7cf9332be2 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -50,7 +50,7 @@ static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
BlockNumber scanblkno,
bool *rightsib_empty,
- uint32 *ndeleted);
+ BTVacState *vstate);
static bool _bt_lock_subtree_parent(Relation rel, BlockNumber child,
BTStack stack,
Buffer *subtreeparent,
@@ -1760,20 +1760,22 @@ _bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
* should never pass a buffer containing an existing deleted page here. The
* lock and pin on caller's buffer will be dropped before we return.
*
- * Returns the number of pages successfully deleted (zero if page cannot
- * be deleted now; could be more than one if parent or right sibling pages
- * were deleted too). Note that this does not include pages that we delete
- * that the btvacuumscan scan has yet to reach; they'll get counted later
- * instead.
+ * Maintains bulk delete stats for caller, which are taken from vstate. We
+ * need to cooperate closely with caller here so that whole VACUUM operation
+ * reliably avoids any double counting of subsidiary-to-leafbuf pages that we
+ * delete in passing. If such pages happen to be from a block number that is
+ * ahead of the current scanblkno position, then caller is expected to count
+ * them directly later on. It's simpler for us to understand caller's
+ * requirements than it would be for caller to understand when or how a
+ * deleted page became deleted after the fact.
*
* NOTE: this leaks memory. Rather than trying to clean up everything
* carefully, it's better to run it in a temp context that can be reset
* frequently.
*/
-uint32
-_bt_pagedel(Relation rel, Buffer leafbuf)
+void
+_bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
{
- uint32 ndeleted = 0;
BlockNumber rightsib;
bool rightsib_empty;
Page page;
@@ -1781,7 +1783,8 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
/*
* Save original leafbuf block number from caller. Only deleted blocks
- * that are <= scanblkno get counted in ndeleted return value.
+ * that are <= scanblkno are added to bulk delete stat's pages_deleted
+ * count.
*/
BlockNumber scanblkno = BufferGetBlockNumber(leafbuf);
@@ -1843,7 +1846,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
RelationGetRelationName(rel))));
_bt_relbuf(rel, leafbuf);
- return ndeleted;
+ return;
}
/*
@@ -1873,7 +1876,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
Assert(!P_ISHALFDEAD(opaque));
_bt_relbuf(rel, leafbuf);
- return ndeleted;
+ return;
}
/*
@@ -1922,8 +1925,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
if (_bt_leftsib_splitflag(rel, leftsib, leafblkno))
{
ReleaseBuffer(leafbuf);
- Assert(ndeleted == 0);
- return ndeleted;
+ return;
}
/* we need an insertion scan key for the search, so build one */
@@ -1964,7 +1966,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
if (!_bt_mark_page_halfdead(rel, leafbuf, stack))
{
_bt_relbuf(rel, leafbuf);
- return ndeleted;
+ return;
}
}
@@ -1979,7 +1981,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
{
/* Check for interrupts in _bt_unlink_halfdead_page */
if (!_bt_unlink_halfdead_page(rel, leafbuf, scanblkno,
- &rightsib_empty, &ndeleted))
+ &rightsib_empty, vstate))
{
/*
* _bt_unlink_halfdead_page should never fail, since we
@@ -1990,7 +1992,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
* lock and pin on leafbuf for us.
*/
Assert(false);
- return ndeleted;
+ return;
}
}
@@ -2026,8 +2028,6 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
leafbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
}
-
- return ndeleted;
}
/*
@@ -2262,9 +2262,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
*/
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
- bool *rightsib_empty, uint32 *ndeleted)
+ bool *rightsib_empty, BTVacState *vstate)
{
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
+ IndexBulkDeleteResult *stats = vstate->stats;
BlockNumber leafleftsib;
BlockNumber leafrightsib;
BlockNumber target;
@@ -2674,12 +2675,53 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_relbuf(rel, buf);
/*
- * If btvacuumscan won't revisit this page in a future btvacuumpage call
- * and count it as deleted then, we count it as deleted by current
- * btvacuumpage call
+ * Maintain pages_deleted in a way that takes into account how
+ * btvacuumpage() will count deleted pages that have yet to become
+ * scanblkno -- only count page when it's not going to get that treatment
+ * later on.
*/
if (target <= scanblkno)
- (*ndeleted)++;
+ stats->pages_deleted++;
+
+ /*
+ * Maintain array of pages that were deleted during current btvacuumscan()
+ * call. We may well be able to recycle them in a separate pass at the
+ * end of the current btvacuumscan().
+ *
+ * Need to respect work_mem/maxndeletedspace limitation on size of deleted
+ * array. Our strategy when the array can no longer grow within the
+ * bounds of work_mem is simple: keep earlier entries (which are likelier
+ * to be recyclable in the end), but stop saving new entries.
+ */
+ if (vstate->full)
+ return true;
+
+ if (vstate->ndeleted >= vstate->ndeletedspace)
+ {
+ uint64 newndeletedspace;
+
+ if (!vstate->grow)
+ {
+ vstate->full = true;
+ return true;
+ }
+
+ newndeletedspace = vstate->ndeletedspace * 2;
+ if (newndeletedspace > vstate->maxndeletedspace)
+ {
+ newndeletedspace = vstate->maxndeletedspace;
+ vstate->grow = false;
+ }
+ vstate->ndeletedspace = newndeletedspace;
+
+ vstate->deleted =
+ repalloc(vstate->deleted,
+ sizeof(BTPendingRecycle) * vstate->ndeletedspace);
+ }
+
+ vstate->deleted[vstate->ndeleted].blkno = target;
+ vstate->deleted[vstate->ndeleted].safexid = safexid;
+ vstate->ndeleted++;
return true;
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3b2e0aa5cb..b25c8c5d5b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -21,7 +21,9 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
#include "access/relscan.h"
+#include "access/table.h"
#include "access/xlog.h"
+#include "catalog/index.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
@@ -32,23 +34,13 @@
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/index_selfuncs.h"
#include "utils/memutils.h"
-/* Working state needed by btvacuumpage */
-typedef struct
-{
- IndexVacuumInfo *info;
- IndexBulkDeleteResult *stats;
- IndexBulkDeleteCallback callback;
- void *callback_state;
- BTCycleId cycleid;
- MemoryContext pagedelcontext;
-} BTVacState;
-
/*
* BTPARALLEL_NOT_INITIALIZED indicates that the scan has not started.
*
@@ -871,6 +863,68 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
return false;
}
+/*
+ * _bt_newly_deleted_pages_recycle() -- Are _bt_pagedel pages recyclable now?
+ *
+ * Note that we assume that the array is ordered by safexid. No further
+ * entries can be safe to recycle once we encounter the first non-recyclable
+ * entry in the deleted array.
+ */
+static inline void
+_bt_newly_deleted_pages_recycle(Relation rel, BTVacState *vstate)
+{
+ IndexBulkDeleteResult *stats = vstate->stats;
+ Relation heapRel;
+
+ /*
+ * Recompute VACUUM XID boundaries.
+ *
+ * We don't actually care about the oldest non-removable XID. Computing
+ * the oldest such XID has a useful side-effect: It updates the procarray
+ * state that tracks XID horizon. This is not just an optimization; it's
+ * essential. It allows the GlobalVisCheckRemovableFullXid() calls we
+ * make here to notice if and when safexid values from pages this same
+ * VACUUM operation deleted are sufficiently old to allow recycling to
+ * take place safely.
+ */
+ GetOldestNonRemovableTransactionId(NULL);
+
+ /*
+ * Use the heap relation for GlobalVisCheckRemovableFullXid() calls (don't
+ * pass NULL rel argument).
+ *
+ * This is an optimization; it allows us to be much more aggressive in
+ * cases involving logical decoding (unless this happens to be a system
+ * catalog). We don't simply use BTPageIsRecyclable().
+ *
+ * XXX: The BTPageIsRecyclable() criteria creates problems for this
+ * optimization. Its safexid test is applied in a redundant manner within
+ * _bt_getbuf() (via its BTPageIsRecyclable() call). Consequently,
+ * _bt_getbuf() may believe that it is still unsafe to recycle a page that
+ * we know to be recycle safe -- in which case it is unnecessarily
+ * discarded.
+ *
+ * We should get around to fixing this _bt_getbuf() issue some day. For
+ * now we can still proceed in the hopes that BTPageIsRecyclable() will
+ * catch up with us before _bt_getbuf() ever reaches the page.
+ */
+ heapRel = table_open(IndexGetRelation(RelationGetRelid(rel), false),
+ AccessShareLock);
+ for (int i = 0; i < vstate->ndeleted; i++)
+ {
+ BlockNumber blkno = vstate->deleted[i].blkno;
+ FullTransactionId safexid = vstate->deleted[i].safexid;
+
+ if (!GlobalVisCheckRemovableFullXid(heapRel, safexid))
+ break;
+
+ RecordFreeIndexPage(rel, blkno);
+ stats->pages_free++;
+ }
+
+ table_close(heapRel, AccessShareLock);
+}
+
/*
* Bulk deletion of all index entries pointing to a set of heap tuples.
* The set of target tuples is specified via a callback routine that tells
@@ -956,6 +1010,14 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
* _bt_vacuum_needs_cleanup() to force the next VACUUM to proceed with a
* btvacuumscan() call.
*
+ * Note: Prior to PostgreSQL 14, we were completely reliant on the next
+ * VACUUM operation taking care of recycling whatever pages the current
+ * VACUUM operation found to be empty and then deleted. It is now usually
+ * possible for _bt_newly_deleted_pages_recycle() to recycle all of the
+ * pages that any given VACUUM operation deletes, as part of the same
+ * VACUUM operation. As a result, it is rare for num_delpages to actually
+ * exceed 0, including with indexes where page deletions are frequent.
+ *
* Note: We must delay the _bt_set_cleanup_info() call until this late
* stage of VACUUM (the btvacuumcleanup() phase), to keep num_heap_tuples
* accurate. The btbulkdelete()-time num_heap_tuples value is generally
@@ -1044,6 +1106,16 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
"_bt_pagedel",
ALLOCSET_DEFAULT_SIZES);
+ /* Allocate _bt_newly_deleted_pages_recycle related information */
+ vstate.ndeletedspace = 512;
+ vstate.grow = true;
+ vstate.full = false;
+ vstate.maxndeletedspace = ((work_mem * 1024L) / sizeof(BTPendingRecycle));
+ vstate.maxndeletedspace = Min(vstate.maxndeletedspace, MaxBlockNumber);
+ vstate.maxndeletedspace = Max(vstate.maxndeletedspace, vstate.ndeletedspace);
+ vstate.ndeleted = 0;
+ vstate.deleted = palloc(sizeof(BTPendingRecycle) * vstate.ndeletedspace);
+
/*
* The outer loop iterates over all index pages except the metapage, in
* physical order (we hope the kernel will cooperate in providing
@@ -1112,7 +1184,18 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*
* Note that if no recyclable pages exist, we don't bother vacuuming the
* FSM at all.
+ *
+ * Before vacuuming the FSM, try to make the most of the pages we
+ * ourselves deleted: see if they can be recycled already (try to avoid
+ * waiting until the next VACUUM operation to recycle). Our approach is
+ * to check the local array of pages that were newly deleted during this
+ * VACUUM.
*/
+ if (vstate.ndeleted > 0)
+ _bt_newly_deleted_pages_recycle(rel, &vstate);
+
+ pfree(vstate.deleted);
+
if (stats->pages_free > 0)
IndexFreeSpaceMapVacuum(rel);
}
@@ -1451,12 +1534,10 @@ backtrack:
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
/*
- * We trust the _bt_pagedel return value because it does not include
- * any page that a future call here from btvacuumscan is expected to
- * count. There will be no double-counting.
+ * _bt_pagedel maintains the bulk delete stats on our behalf
*/
Assert(blkno == scanblkno);
- stats->pages_deleted += _bt_pagedel(rel, buf);
+ _bt_pagedel(rel, buf, vstate);
MemoryContextSwitchTo(oldcontext);
/* pagedel released buffer, so we shouldn't */
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 8b7c143db4..6ab9af4a43 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -999,6 +999,28 @@ btree_xlog_newroot(XLogReaderState *record)
* the PGPROC->xmin > limitXmin test inside GetConflictingVirtualXIDs().
* Consequently, one XID value achieves the same exclusion effect on primary
* and standby.
+ *
+ * XXX It would make a great deal more sense if each nbtree index's FSM (or
+ * some equivalent structure) was completely crash-safe. Importantly, this
+ * would enable page recycling's REDO side to work in a way that naturally
+ * matches original execution.
+ *
+ * Page deletion has to be crash safe already, plus xl_btree_reuse_page
+ * records are logged any time a backend has to recycle -- full crash safety
+ * is unlikely to add much overhead, and has clear efficiency benefits. It
+ * would also simplify things by more explicitly decoupling page deletion and
+ * page recycling. The benefits for REDO all follow from that.
+ *
+ * Under this scheme, the whole question of recycle safety could be moved from
+ * VACUUM to the consumer side. That is, VACUUM would no longer have to defer
+ * placing a page that it deletes in the FSM until BTPageIsRecyclable() starts
+ * to return true -- _bt_getbuf() would handle all details of safely deferring
+ * recycling instead. _bt_getbuf() would use the improved/crash-safe FSM to
+ * explicitly find a free page whose safexid is sufficiently old for recycling
+ * to be safe from the point of view of backends that run during original
+ * execution. That just leaves the REDO side. Instead of xl_btree_reuse_page
+ * records, we'd have FSM "consume/recycle page from the FSM" records that are
+ * associated with FSM page buffers/blocks.
*/
static void
btree_xlog_reuse_page(XLogReaderState *record)
--
2.27.0
On Wed, Feb 24, 2021 at 8:13 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
I think 8e12f4a25 wasn't quite aggressive enough in its changes, and I had
another patch laying around. I rebased and came up with this.
See my remarks/questions about vacuum_cleanup_index_scale_factor
addressed to Masahiko from a little earlier. I think that it might
make sense to just remove it. It might even make sense to disable it
in the backbranches -- that approach might be better than trying to
fix the "IndexVacuumInfo.num_heap_tuples is only representative of the
heap relation at the end of the VACUUM when considered within
btvacuumcleanup()" bug. (Though I'm less confident on this second
point about a backpatchable fix.)
--
Peter Geoghegan
On Thu, Feb 25, 2021 at 1:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Feb 22, 2021 at 2:54 PM Peter Geoghegan <pg@bowt.ie> wrote:
Good point. I think that the structure should make the page deletion
triggering condition have only secondary importance -- it is only
described at all to be complete and exhaustive. The
vacuum_cleanup_index_scale_factor-related threshold is all that users
will really care about in this area.
I pushed the main 64-bit XID commit just now. Thanks!
Awesome!
Attached is v6, with the two remaining patches. No real changes. Just
want to keep CFBot happy.
Thank you for updating the patch. I'll have a look at them.
I would like to talk about vacuum_cleanup_index_scale_factor some
more. I didn't get very far with the vacuum_cleanup_index_scale_factor
documentation (I just removed the existing references to page
deletion). When I was working on the docs I suddenly wondered: is
vacuum_cleanup_index_scale_factor actually necessary? Can we not get
rid of it completely?
The amvacuumcleanup docs seem to suggest that that would be okay:
"It is OK to return NULL if the index was not changed at all during
the VACUUM operation, but otherwise correct stats should be returned."
Currently, _bt_vacuum_needs_cleanup() gets to decide whether or not
the index will change during VACUUM (assuming no deleted pages in the
case of Postgres 11 - 13, or assuming less than ~5% on Postgres 14).
So why even bother with the heap tuple stuff at all? Why not simply
remove the triggering logic that uses btm_last_cleanup_num_heap_tuples
+ vacuum_cleanup_index_scale_factor completely? We can rely on ANALYZE
to set pg_class.reltuples/pg_class.relpages instead. IIUC this is 100%
allowed by the amvacuumcleanup contract.
I think that the original design that made VACUUM set
pg_class.reltuples/pg_class.relpages in indexes (from 15+ years ago)
assumed that it was cheap to handle statistics in passing -- the
marginal cost was approximately zero, so why not just do it? It was
not because VACUUM thinks it is valuable or urgent, and yet
vacuum_cleanup_index_scale_factor seems to assume that it must.
Of course, it may actually be hard/expensive to update the statistics
due to the vacuum_cleanup_index_scale_factor stuff that was added to
Postgres 11. The autovacuum_vacuum_insert_threshold stuff that was
added to Postgres 13 also seems quite relevant. So I think that there
is an inconsistency here.
btvacuumcleanup() has been playing two roles: recycling deleted pages
and collecting index statistics. Before introducing
vacuum_cleanup_index_scale_factor, btvacuumcleanup() always scanned
the index for both purposes. So it was a problem that an anti-wraparound
vacuum would do an index scan even if the table had not been
changed at all. The motivation of vacuum_cleanup_index_scale_factor is
to decrease the frequency of collecting index statistics (but not to
eliminate it). Since deleted pages could be left by btvacuumcleanup()
skipping an index scan, we introduced btm_oldest_btpo_xact (and it
became unnecessary by commit e5d8a99903).
If we don't want btvacuumcleanup() to collect index statistics, we can
remove vacuum_cleanup_index_scale_factor (at least from btree
perspectives), as you mentioned. One thing that may be worth
mentioning is that the difference between the index statistics taken
by ANALYZE and btvacuumcleanup() is that the former is always an
estimate. That’s calculated by compute_index_stats()
whereas the latter uses the result of an index scan. If
btvacuumcleanup() doesn’t scan the index and always returns NULL, it
would become hard to get accurate index statistics, for example in a
static table case. I've not checked which cases index statistics
calculated by compute_index_stats() are inaccurate, though.
I can see one small problem with my plan of relying on ANALYZE to do
this: VACUUM ANALYZE trusts amvacuumcleanup/btvacuumcleanup (when
called by lazyvacuum.c) to set pg_class.reltuples/pg_class.relpages
within do_analyze_rel() -- even when amvacuumcleanup/btvacuumcleanup
returns NULL:
/*
* Same for indexes. Vacuum always scans all indexes, so if we're part of
* VACUUM ANALYZE, don't overwrite the accurate count already inserted by
* VACUUM.
*/
if (!inh && !(params->options & VACOPT_VACUUM))
{
for (ind = 0; ind < nindexes; ind++)
{
AnlIndexData *thisdata = &indexdata[ind];
double totalindexrows;
totalindexrows = ceil(thisdata->tupleFract * totalrows);
vac_update_relstats(Irel[ind],
RelationGetNumberOfBlocks(Irel[ind]),
totalindexrows,
0,
false,
InvalidTransactionId,
InvalidMultiXactId,
in_outer_xact);
}
}
But this just seems like a very old bug to me. This bug can be fixed
separately by teaching VACUUM ANALYZE to recognize cases where indexes
did not have their stats updated in the way it expects.
According to the doc, if amvacuumcleanup/btvacuumcleanup returns NULL,
it means the index is not changed at all. So do_analyze_rel() executed
by VACUUM ANALYZE also doesn't need to update the index statistics
even when amvacuumcleanup/btvacuumcleanup returns NULL. No?
BTW, note that btvacuumcleanup set pg_class.reltuples to 0 in all
cases following the deduplication commit until my bug fix commit
48e12913 (which was kind of a hack itself). This meant that the
statistics set by btvacuumcleanup (in the case where btbulkdelete
doesn't get called, the relevant case for
vacuum_cleanup_index_scale_factor) were broken -- 100% wrong for months
before anybody noticed (or at least until anybody complained).
Maybe we need more regression tests here.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Feb 25, 2021 at 5:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
btvacuumcleanup() has been playing two roles: recycling deleted pages
and collecting index statistics.
Right.
I pushed the VACUUM VERBOSE "index pages newly deleted"
instrumentation patch earlier - it really isn't complicated or
controversial, so I saw no reason to delay with that.
Attached is v7, which now only has the final patch -- the optimization
that makes it possible for VACUUM to recycle pages that were newly
deleted during the same VACUUM operation. Still no real changes.
Again, I just wanted to keep CFBot happy. I haven't thought about or
improved this final patch recently, and it clearly needs more work to
be ready to commit.
If we don't want btvacuumcleanup() to collect index statistics, we can
remove vacuum_cleanup_index_scale_factor (at least from btree
perspectives), as you mentioned. One thing that may be worth
mentioning is that the difference between the index statistics taken
by ANALYZE and btvacuumcleanup() is that the former statistics is
always an estimation. That’s calculated by compute_index_stats()
whereas the latter uses the result of an index scan. If
btvacuumcleanup() doesn’t scan the index and always returns NULL, it
would become hard to get accurate index statistics, for example in a
static table case. I've not checked which cases index statistics
calculated by compute_index_stats() are inaccurate, though.
The historic context makes it easier to understand what to do here --
it makes it clear that amvacuumcleanup() routine does not (or should
not) do any index scan when the index hasn't (and won't) be modified
by the current VACUUM operation. The relevant sgml doc sentence I
quoted to you recently ("It is OK to return NULL if the index was not
changed at all during the VACUUM operation...") was added by commit
e57345975cf in 2006. Much of the relevant 2006 discussion is here,
FWIW:
/messages/by-id/26433.1146598265@sss.pgh.pa.us
So now we have the formal rules for index AMs, as well as background
information about what various hackers (mostly Tom) were considering
when the rules were written.
According to the doc, if amvacuumcleanup/btvacuumcleanup returns NULL,
it means the index is not changed at all. So do_analyze_rel() executed
by VACUUM ANALYZE also doesn't need to update the index statistics
even when amvacuumcleanup/btvacuumcleanup returns NULL. No?
Consider hashvacuumcleanup() -- here it is in full (it hasn't really
changed since 2006, when it was updated by that same commit I cited):
/*
* Post-VACUUM cleanup.
*
* Result: a palloc'd struct containing statistical info for VACUUM displays.
*/
IndexBulkDeleteResult *
hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
{
Relation rel = info->index;
BlockNumber num_pages;
/* If hashbulkdelete wasn't called, return NULL signifying no change */
/* Note: this covers the analyze_only case too */
if (stats == NULL)
return NULL;
/* update statistics */
num_pages = RelationGetNumberOfBlocks(rel);
stats->num_pages = num_pages;
return stats;
}
Clearly hashvacuumcleanup() was considered by Tom when he revised the
documentation in 2006. Here are some observations about
hashvacuumcleanup() that seem relevant now:
* There is no "analyze_only" handling, just like nbtree.
"analyze_only" is only used by GIN, even now, 15+ years after it was
added. GIN uses it to make autovacuum workers (never VACUUM outside of
an AV worker) do pending list insertions for ANALYZE -- just to make
it happen more often. This is a niche thing -- clearly we don't have
to care about it in nbtree, even if we make btvacuumcleanup() (almost)
always return NULL when there was no btbulkdelete() call.
* num_pages (which will become pg_class.relpages for the index) is not
set when we return NULL -- hashvacuumcleanup() assumes that ANALYZE
will get to it eventually in the case where VACUUM does no real work
(when it just returns NULL).
* We also use RelationGetNumberOfBlocks() to set pg_class.relpages for
index relations during ANALYZE -- it's called when we call
vac_update_relstats() (I quoted this do_analyze_rel() code to you
directly in a recent email).
* In general, pg_class.relpages isn't an estimate (because we use
RelationGetNumberOfBlocks(), both in the VACUUM-updates case and the
ANALYZE-updates case) -- only pg_class.reltuples is truly an estimate
during ANALYZE, and so getting a "true count" seems to have only
limited practical importance.
I think that this sets a precedent in support of my view that we can
simply get rid of vacuum_cleanup_index_scale_factor without any
special effort to maintain pg_class.reltuples. As I said before, we
can safely make btvacuumcleanup() just like hashvacuumcleanup(),
except when there are known deleted-but-not-recycled pages, where a
full index scan really is necessary for reasons that are not related
to statistics at all (of course we still need the *logic* that was
added to nbtree by the vacuum_cleanup_index_scale_factor commit --
that is clearly necessary). My guess is that Tom would have made
btvacuumcleanup() look identical to hashvacuumcleanup() in 2006 if
nbtree didn't have page deletion to consider -- but that had to be
considered.
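To make the comparison concrete, here is a rough sketch of the shape
being argued for -- not an actual patch, and it glosses over details
like info->analyze_only and the exact triggering condition inside
_bt_vacuum_needs_cleanup():

IndexBulkDeleteResult *
btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
{
	/*
	 * If btbulkdelete was not called and there are no deleted pages in
	 * need of recycling, signal "no change", just as hashvacuumcleanup()
	 * does -- and leave pg_class.reltuples/relpages to ANALYZE
	 */
	if (stats == NULL)
	{
		if (!_bt_vacuum_needs_cleanup(info))
			return NULL;

		/* Deleted-but-not-recycled pages force a btvacuumscan() pass */
		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
		btvacuumscan(info, stats, NULL, NULL, 0);
	}

	stats->num_pages = RelationGetNumberOfBlocks(info->index);
	return stats;
}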
My reasoning here is also based on the tendency of the core code to
mostly think of hash indexes as very similar to nbtree indexes.
Even though "the letter of the law" favors removing the
vacuum_cleanup_index_scale_factor GUC + param in the way I have
outlined, that is not the only thing that matters -- we must also
consider "the spirit of the law". Realistically, hash indexes are far
less popular than nbtree indexes, and so even if I am 100% correct in
theory, the real world might not be so convinced by my legalistic
argument. We've already seen the issue with VACUUM ANALYZE (which has
not been truly consistent with the behavior of hashvacuumcleanup() for
many years). There might be more.
I suppose I could ask Tom what he thinks? The hardest question is what
to do in the backbranches...I really don't have a strong opinion right
now.
BTW, note that btvacuumcleanup set pg_class.reltuples to 0 in all
cases following the deduplication commit until my bug fix commit
48e12913 (which was kind of a hack itself). This meant that the
statistics set by btvacuumcleanup (in the case where btbulkdelete
doesn't get called, the relevant case for
vacuum_cleanup_index_scale_factor). So it was 100% wrong for months
before anybody noticed (or at least until anybody complained).
Maybe we need more regression tests here.
I agree, but my point was that even a 100% broken approach to stats
within btvacuumcleanup() is not that noticeable. This supports the
idea that it just doesn't matter very much if a cleanup-only scan of
the index never takes place (or only takes place when we need to
recycle deleted pages, which is generally rare but will become very
rare once I commit the attached patch).
Also, my fix for this bug (commit 48e12913) was actually pretty bad;
there are now cases where the btvacuumcleanup()-only VACUUM case will
set pg_class.reltuples to a value that is significantly below what it
should be (it all depends on how effective deduplication is with the
data). I probably should have made btvacuumcleanup()-only VACUUMs set
"stats->estimate_count = true", purely to make sure that the core code
doesn't trust the statistics too much (it's okay for VACUUM VERBOSE
output only). Right now we can get a pg_class.reltuples that is
"exactly wrong" -- it would actually be a big improvement if it was
"approximately correct".
Another new concern for me (another concern unique to Postgres 13) is
autovacuum_vacuum_insert_scale_factor-driven autovacuums.
--
Peter Geoghegan
Attachments:
v7-0001-Recycle-pages-deleted-during-same-VACUUM.patch (text/x-patch; charset=US-ASCII)
From 554f5ed05252d616641c05082bf3105d4d0d83f9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 25 Feb 2021 15:17:22 -0800
Subject: [PATCH v7] Recycle pages deleted during same VACUUM.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wzk76_P=67iUscb1UN44-gyZL-KgpsXbSxq_bdcMa7Q+wQ@mail.gmail.com
---
src/include/access/nbtree.h | 22 ++++++-
src/backend/access/nbtree/README | 31 +++++++++
src/backend/access/nbtree/nbtpage.c | 40 ++++++++++++
src/backend/access/nbtree/nbtree.c | 97 +++++++++++++++++++++++++++++
src/backend/access/nbtree/nbtxlog.c | 22 +++++++
5 files changed, 211 insertions(+), 1 deletion(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index b56b7b7868..876b8f3437 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -279,7 +279,8 @@ BTPageGetDeleteXid(Page page)
* Is an existing page recyclable?
*
* This exists to centralize the policy on which deleted pages are now safe to
- * re-use.
+ * re-use. The _bt_newly_deleted_pages_recycle() optimization behaves more
+ * aggressively, though that has certain known limitations.
*
* Note: PageIsNew() pages are always safe to recycle, but we can't deal with
* them here (caller is responsible for that case themselves). Caller might
@@ -316,14 +317,33 @@ BTPageIsRecyclable(Page page)
* BTVacState is private nbtree.c state used during VACUUM. It is exported
* for use by page deletion related code in nbtpage.c.
*/
+typedef struct BTPendingRecycle
+{
+ BlockNumber blkno;
+ FullTransactionId safexid;
+} BTPendingRecycle;
+
typedef struct BTVacState
{
+ /*
+ * VACUUM operation state
+ */
IndexVacuumInfo *info;
IndexBulkDeleteResult *stats;
IndexBulkDeleteCallback callback;
void *callback_state;
BTCycleId cycleid;
+
+ /*
+ * Page deletion state for VACUUM
+ */
MemoryContext pagedelcontext;
+ BTPendingRecycle *deleted;
+ bool grow;
+ bool full;
+ uint32 ndeletedspace;
+ uint64 maxndeletedspace;
+ uint32 ndeleted;
} BTVacState;
/*
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 46d49bf025..265814ea46 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -430,6 +430,37 @@ whenever it is subsequently taken from the FSM for reuse. The deleted
page's contents will be overwritten by the split operation (it will become
the new right sibling page).
+Prior to PostgreSQL 14, VACUUM was only able to recycle pages that were
+deleted by a previous VACUUM operation (VACUUM typically placed all pages
+deleted by the last VACUUM into the FSM, though there were and are no
+certainties here). This had the obvious disadvantage of creating
+uncertainty about when and how pages get recycled, especially with bursty
+workloads. It was naive, even within the constraints of the design, since
+there is no reason to think that it will take long for a deleted page to
+become recyclable. It's convenient to use XIDs to implement the drain
+technique, but that is totally unrelated to any of the other things that
+VACUUM needs to do with XIDs.
+
+VACUUM operations now consider if it's possible to recycle any pages that
+the same operation deleted, once the physical scan of the index is over (the
+last point where it's convenient to do one final check). This changes nothing about
+the basic design, and so it might still not be possible to recycle any
+pages at that time (e.g., there might not even be a single new
+transaction after an index page deletion, but before VACUUM ends). But
+we have little to lose and plenty to gain by trying. We only need to keep
+around a little information about recently deleted pages in local memory.
+We don't even have to access the deleted pages a second time.
+
+Currently VACUUM delays considering the possibility of recycling its own
+recently deleted pages until the end of its btbulkdelete scan (or until the
+end of btvacuumcleanup in cases where there were no tuples to delete in
+the index). It would be slightly more effective if btbulkdelete page
+deletions were deferred until btvacuumcleanup, simply because more time
+will have passed. Our current approach works well enough in practice,
+especially in cases where it really matters: cases where we're vacuuming a
+large index, where recycling pages sooner rather than later is
+particularly likely to matter.
+
Fastpath For Index Insertion
----------------------------
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 629a23628e..9d7d0186d0 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2687,6 +2687,46 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (target <= scanblkno)
stats->pages_deleted++;
+ /*
+ * Maintain array of pages that were deleted during current btvacuumscan()
+ * call. We may well be able to recycle them in a separate pass at the
+ * end of the current btvacuumscan().
+ *
+ * Need to respect work_mem/maxndeletedspace limitation on size of deleted
+ * array. Our strategy when the array can no longer grow within the
+ * bounds of work_mem is simple: keep earlier entries (which are likelier
+ * to be recyclable in the end), but stop saving new entries.
+ */
+ if (vstate->full)
+ return true;
+
+ if (vstate->ndeleted >= vstate->ndeletedspace)
+ {
+ uint64 newndeletedspace;
+
+ if (!vstate->grow)
+ {
+ vstate->full = true;
+ return true;
+ }
+
+ newndeletedspace = vstate->ndeletedspace * 2;
+ if (newndeletedspace > vstate->maxndeletedspace)
+ {
+ newndeletedspace = vstate->maxndeletedspace;
+ vstate->grow = false;
+ }
+ vstate->ndeletedspace = newndeletedspace;
+
+ vstate->deleted =
+ repalloc(vstate->deleted,
+ sizeof(BTPendingRecycle) * vstate->ndeletedspace);
+ }
+
+ vstate->deleted[vstate->ndeleted].blkno = target;
+ vstate->deleted[vstate->ndeleted].safexid = safexid;
+ vstate->ndeleted++;
+
return true;
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 504f5bef17..8aed93ff0a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -21,7 +21,9 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
#include "access/relscan.h"
+#include "access/table.h"
#include "access/xlog.h"
+#include "catalog/index.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
@@ -32,6 +34,7 @@
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/index_selfuncs.h"
@@ -860,6 +863,71 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
return false;
}
+/*
+ * _bt_newly_deleted_pages_recycle() -- Are _bt_pagedel pages recyclable now?
+ *
+ * Note that we assume that the array is ordered by safexid. No further
+ * entries can be safe to recycle once we encounter the first non-recyclable
+ * entry in the deleted array.
+ */
+static inline void
+_bt_newly_deleted_pages_recycle(Relation rel, BTVacState *vstate)
+{
+ IndexBulkDeleteResult *stats = vstate->stats;
+ Relation heapRel;
+
+ Assert(vstate->ndeleted > 0);
+ Assert(stats->pages_newly_deleted >= vstate->ndeleted);
+
+ /*
+ * Recompute VACUUM XID boundaries.
+ *
+ * We don't actually care about the oldest non-removable XID. Computing
+ * the oldest such XID has a useful side-effect: It updates the procarray
+ * state that tracks XID horizon. This is not just an optimization; it's
+ * essential. It allows the GlobalVisCheckRemovableFullXid() calls we
+ * make here to notice if and when safexid values from pages this same
+ * VACUUM operation deleted are sufficiently old to allow recycling to
+ * take place safely.
+ */
+ GetOldestNonRemovableTransactionId(NULL);
+
+ /*
+ * Use the heap relation for GlobalVisCheckRemovableFullXid() calls (don't
+ * pass NULL rel argument).
+ *
+ * This is an optimization; it allows us to be much more aggressive in
+ * cases involving logical decoding (unless this happens to be a system
+ * catalog). We don't simply use BTPageIsRecyclable().
+ *
+ * XXX: The BTPageIsRecyclable() criterion creates problems for this
+ * optimization. Its safexid test is applied in a redundant manner within
+ * _bt_getbuf() (via its BTPageIsRecyclable() call). Consequently,
+ * _bt_getbuf() may believe that it is still unsafe to recycle a page that
+ * we know to be recycle safe -- in which case it is unnecessarily
+ * discarded.
+ *
+ * We should get around to fixing this _bt_getbuf() issue some day. For
+ * now we can still proceed in the hopes that BTPageIsRecyclable() will
+ * catch up with us before _bt_getbuf() ever reaches the page.
+ */
+ heapRel = table_open(IndexGetRelation(RelationGetRelid(rel), false),
+ AccessShareLock);
+ for (int i = 0; i < vstate->ndeleted; i++)
+ {
+ BlockNumber blkno = vstate->deleted[i].blkno;
+ FullTransactionId safexid = vstate->deleted[i].safexid;
+
+ if (!GlobalVisCheckRemovableFullXid(heapRel, safexid))
+ break;
+
+ RecordFreeIndexPage(rel, blkno);
+ stats->pages_free++;
+ }
+
+ table_close(heapRel, AccessShareLock);
+}
+
/*
* Bulk deletion of all index entries pointing to a set of heap tuples.
* The set of target tuples is specified via a callback routine that tells
@@ -945,6 +1013,14 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
* _bt_vacuum_needs_cleanup() to force the next VACUUM to proceed with a
* btvacuumscan() call.
*
+ * Note: Prior to PostgreSQL 14, we were completely reliant on the next
+ * VACUUM operation taking care of recycling whatever pages the current
+ * VACUUM operation found to be empty and then deleted. It is now usually
+ * possible for _bt_newly_deleted_pages_recycle() to recycle all of the
+ * pages that any given VACUUM operation deletes, as part of the same
+ * VACUUM operation. As a result, it is rare for num_delpages to actually
+ * exceed 0, including with indexes where page deletions are frequent.
+ *
* Note: We must delay the _bt_set_cleanup_info() call until this late
* stage of VACUUM (the btvacuumcleanup() phase), to keep num_heap_tuples
* accurate. The btbulkdelete()-time num_heap_tuples value is generally
@@ -1033,6 +1109,16 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
"_bt_pagedel",
ALLOCSET_DEFAULT_SIZES);
+ /* Allocate _bt_newly_deleted_pages_recycle related information */
+ vstate.ndeletedspace = 512;
+ vstate.grow = true;
+ vstate.full = false;
+ vstate.maxndeletedspace = ((work_mem * 1024L) / sizeof(BTPendingRecycle));
+ vstate.maxndeletedspace = Min(vstate.maxndeletedspace, MaxBlockNumber);
+ vstate.maxndeletedspace = Max(vstate.maxndeletedspace, vstate.ndeletedspace);
+ vstate.ndeleted = 0;
+ vstate.deleted = palloc(sizeof(BTPendingRecycle) * vstate.ndeletedspace);
+
/*
* The outer loop iterates over all index pages except the metapage, in
* physical order (we hope the kernel will cooperate in providing
@@ -1101,7 +1187,18 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*
* Note that if no recyclable pages exist, we don't bother vacuuming the
* FSM at all.
+ *
+ * Before vacuuming the FSM, try to make the most of the pages we
+ * ourselves deleted: see if they can be recycled already (try to avoid
+ * waiting until the next VACUUM operation to recycle). Our approach is
+ * to check the local array of pages that were newly deleted during this
+ * VACUUM.
*/
+ if (vstate.ndeleted > 0)
+ _bt_newly_deleted_pages_recycle(rel, &vstate);
+
+ pfree(vstate.deleted);
+
if (stats->pages_free > 0)
IndexFreeSpaceMapVacuum(rel);
}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 8b7c143db4..6ab9af4a43 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -999,6 +999,28 @@ btree_xlog_newroot(XLogReaderState *record)
* the PGPROC->xmin > limitXmin test inside GetConflictingVirtualXIDs().
* Consequently, one XID value achieves the same exclusion effect on primary
* and standby.
+ *
+ * XXX It would make a great deal more sense if each nbtree index's FSM (or
+ * some equivalent structure) was completely crash-safe. Importantly, this
+ * would enable page recycling's REDO side to work in a way that naturally
+ * matches original execution.
+ *
+ * Page deletion has to be crash safe already, plus xl_btree_reuse_page
+ * records are logged any time a backend has to recycle -- full crash safety
+ * is unlikely to add much overhead, and has clear efficiency benefits. It
+ * would also simplify things by more explicitly decoupling page deletion and
+ * page recycling. The benefits for REDO all follow from that.
+ *
+ * Under this scheme, the whole question of recycle safety could be moved from
+ * VACUUM to the consumer side. That is, VACUUM would no longer have to defer
+ * placing a page that it deletes in the FSM until BTPageIsRecyclable() starts
+ * to return true -- _bt_getbuf() would handle all details of safely deferring
+ * recycling instead. _bt_getbuf() would use the improved/crash-safe FSM to
+ * explicitly find a free page whose safexid is sufficiently old for recycling
+ * to be safe from the point of view of backends that run during original
+ * execution. That just leaves the REDO side. Instead of xl_btree_reuse_page
+ * records, we'd have FSM "consume/recycle page from the FSM" records that are
+ * associated with FSM page buffers/blocks.
*/
static void
btree_xlog_reuse_page(XLogReaderState *record)
--
2.27.0
On Fri, Feb 26, 2021 at 9:58 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Feb 25, 2021 at 5:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
btvacuumcleanup() has been playing two roles: recycling deleted pages
and collecting index statistics.
Right.
I pushed the VACUUM VERBOSE "index pages newly deleted"
instrumentation patch earlier - it really isn't complicated or
controversial, so I saw no reason to delay with that.
Thanks!
I think we can improve bloom indexes in a separate patch so that they
use pages_newly_deleted.
Attached is v7, which now only has the final patch -- the optimization
that makes it possible for VACUUM to recycle pages that were newly
deleted during the same VACUUM operation. Still no real changes.
Again, I just wanted to keep CFBot happy. I haven't thought about or
improved this final patch recently, and it clearly needs more work to
be ready to commit.
I've looked at the patch. The patch is straightforward and I agreed
with the direction.
Here are some comments on v7 patch.
---
+ /* Allocate _bt_newly_deleted_pages_recycle related information */
+ vstate.ndeletedspace = 512;
Maybe add a #define for the value 512?
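For instance (the constant name here is made up, purely to illustrate
the suggestion; it is not from the patch):

#define BTVACUUM_PENDING_RECYCLE_INIT_SLOTS	512	/* hypothetical name */

	vstate.ndeletedspace = BTVACUUM_PENDING_RECYCLE_INIT_SLOTS;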
----
+ for (int i = 0; i < vstate->ndeleted; i++)
+ {
+ BlockNumber blkno = vstate->deleted[i].blkno;
+ FullTransactionId safexid = vstate->deleted[i].safexid;
+
+ if (!GlobalVisCheckRemovableFullXid(heapRel, safexid))
+ break;
+
+ RecordFreeIndexPage(rel, blkno);
+ stats->pages_free++;
+ }
Should we use 'continue' instead of 'break'? Or can we sort
vstate->deleted array by full XID and leave 'break'?
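A sketch of that second option, sorting the array before the recycle
loop. The comparator below is illustrative rather than from the patch;
it assumes plain qsort() and the FullTransactionIdPrecedes() macro:

/*
 * Order BTPendingRecycle entries by ascending safexid, so the existing
 * 'break' stays correct even if entries were not appended in safexid order
 */
static int
btpendingrecycle_cmp(const void *a, const void *b)
{
	const BTPendingRecycle *ra = (const BTPendingRecycle *) a;
	const BTPendingRecycle *rb = (const BTPendingRecycle *) b;

	if (FullTransactionIdPrecedes(ra->safexid, rb->safexid))
		return -1;
	if (FullTransactionIdPrecedes(rb->safexid, ra->safexid))
		return 1;
	return 0;
}

	/* ...called just before the loop in _bt_newly_deleted_pages_recycle(): */
	qsort(vstate->deleted, vstate->ndeleted, sizeof(BTPendingRecycle),
		  btpendingrecycle_cmp);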
---
Currently, the patch checks only newly-deleted-pages if they are
recyclable at the end of btvacuumscan. What do you think about the
idea of checking also pages that are deleted by previous vacuums
(i.g., pages already marked P_ISDELETED() but not
BTPageIsRecyclable())? There is still a little hope that such pages
become recyclable when we reached the end of btvacuumscan. We will end
up checking such pages twice (during btvacuumscan() and the end of
btvacuumscan()) but if the cost of collecting and checking pages is
not high it probably could expand the chance of recycling pages.
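A rough sketch of that idea, just to show how little extra state it
needs. This is not part of the patch: the btvacuumpage() variable names
and placement are assumed, and the array growth/work_mem cap checks from
_bt_unlink_halfdead_page() are omitted here:

	/*
	 * Page was deleted by an earlier VACUUM but is not yet recyclable:
	 * remember it too, so the end-of-scan pass gets another chance at it
	 */
	else if (P_ISDELETED(opaque) && !BTPageIsRecyclable(page) && !vstate->full)
	{
		vstate->deleted[vstate->ndeleted].blkno = blkno;
		vstate->deleted[vstate->ndeleted].safexid = BTPageGetDeleteXid(page);
		vstate->ndeleted++;
	}

One knock-on effect: the deleted array would presumably no longer be
appended in safexid order, which circles back to the 'break' versus sort
question above. A real version would also have to deal with deleted
pages left behind by earlier Postgres versions, which don't carry a full
64-bit safexid.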
I'm going to reply to the discussion vacuum_cleanup_index_scale_factor
in a separate mail. Or maybe it's better to start a new thread for
that so as get opinions from other hackers. It's no longer related to
the subject.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Fri, Feb 26, 2021 at 9:58 AM Peter Geoghegan <pg@bowt.ie> wrote:
If we don't want btvacuumcleanup() to collect index statistics, we can
remove vacuum_cleanup_index_scale_factor (at least from btree
perspectives), as you mentioned. One thing that may be worth
mentioning is that the difference between the index statistics taken
by ANALYZE and btvacuumcleanup() is that the former statistics is
always an estimation. That’s calculated by compute_index_stats()
whereas the latter uses the result of an index scan. If
btvacuumcleanup() doesn’t scan the index and always returns NULL, it
would become hard to get accurate index statistics, for example in a
static table case. I've not checked which cases index statistics
calculated by compute_index_stats() are inaccurate, though.
The historic context makes it easier to understand what to do here --
it makes it clear that amvacuumcleanup() routine does not (or should
not) do any index scan when the index hasn't (and won't) be modified
by the current VACUUM operation. The relevant sgml doc sentence I
quoted to you recently ("It is OK to return NULL if the index was not
changed at all during the VACUUM operation...") was added by commit
e57345975cf in 2006. Much of the relevant 2006 discussion is here,
FWIW:
/messages/by-id/26433.1146598265@sss.pgh.pa.us
So now we have the formal rules for index AMs, as well as background
information about what various hackers (mostly Tom) were considering
when the rules were written.
According to the doc, if amvacuumcleanup/btvacuumcleanup returns NULL,
it means the index is not changed at all. So do_analyze_rel() executed
by VACUUM ANALYZE also doesn't need to update the index statistics
even when amvacuumcleanup/btvacuumcleanup returns NULL. No?
Consider hashvacuumcleanup() -- here it is in full (it hasn't really
changed since 2006, when it was updated by that same commit I cited):
/*
* Post-VACUUM cleanup.
*
* Result: a palloc'd struct containing statistical info for VACUUM displays.
*/
IndexBulkDeleteResult *
hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
{
Relation rel = info->index;
BlockNumber num_pages;
/* If hashbulkdelete wasn't called, return NULL signifying no change */
/* Note: this covers the analyze_only case too */
if (stats == NULL)
return NULL;
/* update statistics */
num_pages = RelationGetNumberOfBlocks(rel);
stats->num_pages = num_pages;
return stats;
}
Clearly hashvacuumcleanup() was considered by Tom when he revised the
documentation in 2006. Here are some observations about
hashvacuumcleanup() that seem relevant now:
* There is no "analyze_only" handling, just like nbtree.
"analyze_only" is only used by GIN, even now, 15+ years after it was
added. GIN uses it to make autovacuum workers (never VACUUM outside of
an AV worker) do pending list insertions for ANALYZE -- just to make
it happen more often. This is a niche thing -- clearly we don't have
to care about it in nbtree, even if we make btvacuumcleanup() (almost)
always return NULL when there was no btbulkdelete() call.
* num_pages (which will become pg_class.relpages for the index) is not
set when we return NULL -- hashvacuumcleanup() assumes that ANALYZE
will get to it eventually in the case where VACUUM does no real work
(when it just returns NULL).
* We also use RelationGetNumberOfBlocks() to set pg_class.relpages for
index relations during ANALYZE -- it's called when we call
vac_update_relstats() (I quoted this do_analyze_rel() code to you
directly in a recent email).
* In general, pg_class.relpages isn't an estimate (because we use
RelationGetNumberOfBlocks(), both in the VACUUM-updates case and the
ANALYZE-updates case) -- only pg_class.reltuples is truly an estimate
during ANALYZE, and so getting a "true count" seems to have only
limited practical importance.
I think that this sets a precedent in support of my view that we can
simply get rid of vacuum_cleanup_index_scale_factor without any
special effort to maintain pg_class.reltuples. As I said before, we
can safely make btvacuumcleanup() just like hashvacuumcleanup(),
except when there are known deleted-but-not-recycled pages, where a
full index scan really is necessary for reasons that are not related
to statistics at all (of course we still need the *logic* that was
added to nbtree by the vacuum_cleanup_index_scale_factor commit --
that is clearly necessary). My guess is that Tom would have made
btvacuumcleanup() look identical to hashvacuumcleanup() in 2006 if
nbtree didn't have page deletion to consider -- but that had to be
considered.
Makes sense. If getting a true pg_class.reltuples is not important in
practice, it seems not to need btvacuumcleanup() do an index scan for
getting statistics purpose.
My reasoning here is also based on the tendency of the core code to
mostly think of hash indexes as very similar to nbtree indexes.
Even though "the letter of the law" favors removing the
vacuum_cleanup_index_scale_factor GUC + param in the way I have
outlined, that is not the only thing that matters -- we must also
consider "the spirit of the law". Realistically, hash indexes are far
less popular than nbtree indexes, and so even if I am 100% correct in
theory, the real world might not be so convinced by my legalistic
argument. We've already seen the issue with VACUUM ANALYZE (which has
not been truly consistent with the behavior hashvacuumcleanup() for
many years). There might be more.
I suppose I could ask Tom what he thinks?
+1
The hardest question is what
to do in the backbranches...I really don't have a strong opinion right
now.
Since it seems not a bug I personally think we don't need to do
anything for back branches. But if we want not to trigger an index
scan by vacuum_cleanup_index_scale_factor, we could change the default
value to a high value (say, to 10000) so that it can skip an index
scan in most cases.
BTW, note that btvacuumcleanup set pg_class.reltuples to 0 in all
cases following the deduplication commit until my bug fix commit
48e12913 (which was kind of a hack itself). This meant that the
statistics set by btvacuumcleanup (in the case where btbulkdelete
doesn't get called, the relevant case for
vacuum_cleanup_index_scale_factor). So it was 100% wrong for months
before anybody noticed (or at least until anybody complained).
Maybe we need more regression tests here.
I agree, but my point was that even a 100% broken approach to stats
within btvacuumcleanup() is not that noticeable. This supports the
idea that it just doesn't matter very much if a cleanup-only scan of
the index never takes place (or only takes place when we need to
recycle deleted pages, which is generally rare but will become very
rare once I commit the attached patch).
Also, my fix for this bug (commit 48e12913) was actually pretty bad;
there are now cases where the btvacuumcleanup()-only VACUUM case will
set pg_class.reltuples to a value that is significantly below what it
should be (it all depends on how effective deduplication is with the
data). I probably should have made btvacuumcleanup()-only VACUUMs set
"stats->estimate_count = true", purely to make sure that the core code
doesn't trust the statistics too much (it's okay for VACUUM VERBOSE
output only). Right now we can get a pg_class.reltuples that is
"exactly wrong" -- it would actually be a big improvement if it was
"approximately correct".
Understood. Thank you for your explanation.
Another new concern for me (another concern unique to Postgres 13) is
autovacuum_vacuum_insert_scale_factor-driven autovacuums.
IIUC the purpose of autovacuum_vacuum_insert_scale_factor is
visibility map maintenance. And as per this discussion, it seems not
necessary to do an index scan in btvacuumcleanup() triggered by
autovacuum_vacuum_insert_scale_factor.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Sun, Feb 28, 2021 at 8:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Even though "the letter of the law" favors removing the
vacuum_cleanup_index_scale_factor GUC + param in the way I have
outlined, that is not the only thing that matters -- we must also
consider "the spirit of the law".
I suppose I could ask Tom what he thinks?
+1
Are you going to start a new thread, or should I?
Since it seems not a bug I personally think we don't need to do
anything for back branches. But if we want not to trigger an index
scan by vacuum_cleanup_index_scale_factor, we could change the default
value to a high value (say, to 10000) so that it can skip an index
scan in most cases.
One reason to remove vacuum_cleanup_index_scale_factor in the back
branches is that it removes any need to fix the
"IndexVacuumInfo.num_heap_tuples is inaccurate outside of
btvacuumcleanup-only VACUUMs" bug -- it just won't matter if
btm_last_cleanup_num_heap_tuples is inaccurate anymore. (I am still
not sure about backpatch being a good idea, though.)
Another new concern for me (another concern unique to Postgres 13) is
autovacuum_vacuum_insert_scale_factor-driven autovacuums.
IIUC the purpose of autovacuum_vacuum_insert_scale_factor is
visibility map maintenance. And as per this discussion, it seems not
necessary to do an index scan in btvacuumcleanup() triggered by
autovacuum_vacuum_insert_scale_factor.
Arguably the question of skipping scanning the index should have been
considered by the autovacuum_vacuum_insert_scale_factor patch when it
was committed for Postgres 13 -- but it wasn't. There is a regression
that was tied to autovacuum_vacuum_insert_scale_factor in Postgres 13
by Mark Callaghan, which I suspect is relevant:
https://smalldatum.blogspot.com/2021/01/insert-benchmark-postgres-is-still.html
The blog post says: "Updates - To understand the small regression
mentioned above for the l.i1 test (more CPU & write IO) I repeated the
test with 100M rows using 2 configurations: one disabled index
deduplication and the other disabled insert-triggered autovacuum.
Disabling index deduplication had no effect and disabling
insert-triggered autovacuum resolves the regression."
This is quite specifically with an insert-only workload, with 4
indexes (that's from memory, but I'm pretty sure it's 4). I think that
the failure to account for skipping index scans is probably the big
problem here. Scanning the heap to set VM bits is unlikely to be
expensive compared to the full index scans. An insert-only workload is
going to find most of the heap blocks it scans to set VM bits in
shared_buffers. Not so for the indexes.
So in Postgres 13 we have this autovacuum_vacuum_insert_scale_factor
issue, in addition to the deduplication + btvacuumcleanup issue we
talked about (the problems left by my Postgres 13 bug fix commit
48e12913). These two issues make removing
vacuum_cleanup_index_scale_factor tempting, even in the back branches
-- it might actually be the more conservative approach, at least for
Postgres 13.
--
Peter Geoghegan
On Mon, Mar 1, 2021 at 1:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
Since it seems not a bug I personally think we don't need to do
anything for back branches. But if we want not to trigger an index
scan by vacuum_cleanup_index_scale_factor, we could change the default
value to a high value (say, to 10000) so that it can skip an index
scan in most cases.
One reason to remove vacuum_cleanup_index_scale_factor in the back
branches is that it removes any need to fix the
"IndexVacuumInfo.num_heap_tuples is inaccurate outside of
btvacuumcleanup-only VACUUMs" bug -- it just won't matter if
btm_last_cleanup_num_heap_tuples is inaccurate anymore. (I am still
not sure about backpatch being a good idea, though.)
Attached is v8 of the patch series, which has new patches. No real
changes compared to v7 for the first patch, though.
There are now two additional prototype patches to remove the
vacuum_cleanup_index_scale_factor GUC/param along the lines we've
discussed. This requires teaching VACUUM ANALYZE about when to trust
VACUUM cleanup to set the statistics (that's what v8-0002* does).
The general idea for VACUUM ANALYZE in v8-0002* is to assume that
cleanup-only VACUUMs won't set the statistics accurately -- so we need
to keep track of this during VACUUM (in case it's a VACUUM ANALYZE,
which now needs to know if index vacuuming was "cleanup only" or not).
This is not a new thing for hash indexes -- they never did anything in
the cleanup-only case (hashvacuumcleanup() just returns NULL). And now
nbtree does the same thing (usually). Not all AMs will, but the new
assumption is much better than the one it replaces.
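For reference, here is a hedged sketch of how do_analyze_rel() could
consult the new VacuumParams.indexvacuuming output parameter from
v8-0002 (the quoted diff below is truncated before the analyze.c hunk,
so the real condition may well differ). It reuses the existing index
loop that VACUUM ANALYZE already has:

	/*
	 * Also update index stats from the ANALYZE sample when VACUUM ran but
	 * its index vacuuming was cleanup-only (no ambulkdelete() calls)
	 */
	if (!inh &&
		(!(params->options & VACOPT_VACUUM) || !params->indexvacuuming))
	{
		for (ind = 0; ind < nindexes; ind++)
		{
			AnlIndexData *thisdata = &indexdata[ind];
			double		totalindexrows;

			totalindexrows = ceil(thisdata->tupleFract * totalrows);
			vac_update_relstats(Irel[ind],
								RelationGetNumberOfBlocks(Irel[ind]),
								totalindexrows,
								0, false,
								InvalidTransactionId,
								InvalidMultiXactId,
								in_outer_xact);
		}
	}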
I thought of another existing case that violated the faulty assumption
made by VACUUM ANALYZE (which v8-0002* fixes): VACUUM's INDEX_CLEANUP
feature (which was added to Postgres 12 by commit a96c41feec6) is
another case where VACUUM does nothing with indexes. VACUUM ANALYZE
mistakenly considers that index vacuuming must have run and set the
pg_class statistics to an accurate value (more accurate than it is
capable of). But with INDEX_CLEANUP we won't even call
amvacuumcleanup().
--
Peter Geoghegan
Attachments:
v8-0001-Recycle-pages-deleted-during-same-VACUUM.patch (text/x-patch; charset=US-ASCII)
From 967a057607ce2d0b648e324a9085ab4ccecd828e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 25 Feb 2021 15:17:22 -0800
Subject: [PATCH v8 1/3] Recycle pages deleted during same VACUUM.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wzk76_P=67iUscb1UN44-gyZL-KgpsXbSxq_bdcMa7Q+wQ@mail.gmail.com
---
src/include/access/nbtree.h | 22 ++++++-
src/backend/access/nbtree/README | 31 +++++++++
src/backend/access/nbtree/nbtpage.c | 40 ++++++++++++
src/backend/access/nbtree/nbtree.c | 97 +++++++++++++++++++++++++++++
src/backend/access/nbtree/nbtxlog.c | 22 +++++++
5 files changed, 211 insertions(+), 1 deletion(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index b56b7b7868..876b8f3437 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -279,7 +279,8 @@ BTPageGetDeleteXid(Page page)
* Is an existing page recyclable?
*
* This exists to centralize the policy on which deleted pages are now safe to
- * re-use.
+ * re-use. The _bt_newly_deleted_pages_recycle() optimization behaves more
+ * aggressively, though that has certain known limitations.
*
* Note: PageIsNew() pages are always safe to recycle, but we can't deal with
* them here (caller is responsible for that case themselves). Caller might
@@ -316,14 +317,33 @@ BTPageIsRecyclable(Page page)
* BTVacState is private nbtree.c state used during VACUUM. It is exported
* for use by page deletion related code in nbtpage.c.
*/
+typedef struct BTPendingRecycle
+{
+ BlockNumber blkno;
+ FullTransactionId safexid;
+} BTPendingRecycle;
+
typedef struct BTVacState
{
+ /*
+ * VACUUM operation state
+ */
IndexVacuumInfo *info;
IndexBulkDeleteResult *stats;
IndexBulkDeleteCallback callback;
void *callback_state;
BTCycleId cycleid;
+
+ /*
+ * Page deletion state for VACUUM
+ */
MemoryContext pagedelcontext;
+ BTPendingRecycle *deleted;
+ bool grow;
+ bool full;
+ uint32 ndeletedspace;
+ uint64 maxndeletedspace;
+ uint32 ndeleted;
} BTVacState;
/*
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 46d49bf025..265814ea46 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -430,6 +430,37 @@ whenever it is subsequently taken from the FSM for reuse. The deleted
page's contents will be overwritten by the split operation (it will become
the new right sibling page).
+Prior to PostgreSQL 14, VACUUM was only able to recycle pages that were
+deleted by a previous VACUUM operation (VACUUM typically placed all pages
+deleted by the last VACUUM into the FSM, though there were and are no
+certainties here). This had the obvious disadvantage of creating
+uncertainty about when and how pages get recycled, especially with bursty
+workloads. It was naive, even within the constraints of the design, since
+there is no reason to think that it will take long for a deleted page to
+become recyclable. It's convenient to use XIDs to implement the drain
+technique, but that is totally unrelated to any of the other things that
+VACUUM needs to do with XIDs.
+
+VACUUM operations now consider if it's possible to recycle any pages that
+the same operation deleted, once the physical scan of the index is over (the
+last point where it's convenient to do one final check). This changes nothing about
+the basic design, and so it might still not be possible to recycle any
+pages at that time (e.g., there might not even be a single new
+transaction after an index page deletion, but before VACUUM ends). But
+we have little to lose and plenty to gain by trying. We only need to keep
+around a little information about recently deleted pages in local memory.
+We don't even have to access the deleted pages a second time.
+
+Currently VACUUM delays considering the possibility of recycling its own
+recently deleted pages until the end of its btbulkdelete scan (or until the
+end of btvacuumcleanup in cases where there were no tuples to delete in
+the index). It would be slightly more effective if btbulkdelete page
+deletions were deferred until btvacuumcleanup, simply because more time
+will have passed. Our current approach works well enough in practice,
+especially in cases where it really matters: cases where we're vacuuming a
+large index, where recycling pages sooner rather than later is
+particularly likely to matter.
+
Fastpath For Index Insertion
----------------------------
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 629a23628e..9d7d0186d0 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2687,6 +2687,46 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (target <= scanblkno)
stats->pages_deleted++;
+ /*
+ * Maintain array of pages that were deleted during current btvacuumscan()
+ * call. We may well be able to recycle them in a separate pass at the
+ * end of the current btvacuumscan().
+ *
+ * Need to respect work_mem/maxndeletedspace limitation on size of deleted
+ * array. Our strategy when the array can no longer grow within the
+ * bounds of work_mem is simple: keep earlier entries (which are likelier
+ * to be recyclable in the end), but stop saving new entries.
+ */
+ if (vstate->full)
+ return true;
+
+ if (vstate->ndeleted >= vstate->ndeletedspace)
+ {
+ uint64 newndeletedspace;
+
+ if (!vstate->grow)
+ {
+ vstate->full = true;
+ return true;
+ }
+
+ newndeletedspace = vstate->ndeletedspace * 2;
+ if (newndeletedspace > vstate->maxndeletedspace)
+ {
+ newndeletedspace = vstate->maxndeletedspace;
+ vstate->grow = false;
+ }
+ vstate->ndeletedspace = newndeletedspace;
+
+ vstate->deleted =
+ repalloc(vstate->deleted,
+ sizeof(BTPendingRecycle) * vstate->ndeletedspace);
+ }
+
+ vstate->deleted[vstate->ndeleted].blkno = target;
+ vstate->deleted[vstate->ndeleted].safexid = safexid;
+ vstate->ndeleted++;
+
return true;
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 504f5bef17..8aed93ff0a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -21,7 +21,9 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
#include "access/relscan.h"
+#include "access/table.h"
#include "access/xlog.h"
+#include "catalog/index.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
@@ -32,6 +34,7 @@
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/index_selfuncs.h"
@@ -860,6 +863,71 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
return false;
}
+/*
+ * _bt_newly_deleted_pages_recycle() -- Are _bt_pagedel pages recyclable now?
+ *
+ * Note that we assume that the array is ordered by safexid. No further
+ * entries can be safe to recycle once we encounter the first non-recyclable
+ * entry in the deleted array.
+ */
+static inline void
+_bt_newly_deleted_pages_recycle(Relation rel, BTVacState *vstate)
+{
+ IndexBulkDeleteResult *stats = vstate->stats;
+ Relation heapRel;
+
+ Assert(vstate->ndeleted > 0);
+ Assert(stats->pages_newly_deleted >= vstate->ndeleted);
+
+ /*
+ * Recompute VACUUM XID boundaries.
+ *
+ * We don't actually care about the oldest non-removable XID. Computing
+ * the oldest such XID has a useful side-effect: It updates the procarray
+ * state that tracks XID horizon. This is not just an optimization; it's
+ * essential. It allows the GlobalVisCheckRemovableFullXid() calls we
+ * make here to notice if and when safexid values from pages this same
+ * VACUUM operation deleted are sufficiently old to allow recycling to
+ * take place safely.
+ */
+ GetOldestNonRemovableTransactionId(NULL);
+
+ /*
+ * Use the heap relation for GlobalVisCheckRemovableFullXid() calls (don't
+ * pass NULL rel argument).
+ *
+ * This is an optimization; it allows us to be much more aggressive in
+ * cases involving logical decoding (unless this happens to be a system
+ * catalog). We don't simply use BTPageIsRecyclable().
+ *
+ * XXX: The BTPageIsRecyclable() criterion creates problems for this
+ * optimization. Its safexid test is applied in a redundant manner within
+ * _bt_getbuf() (via its BTPageIsRecyclable() call). Consequently,
+ * _bt_getbuf() may believe that it is still unsafe to recycle a page that
+ * we know to be recycle safe -- in which case it is unnecessarily
+ * discarded.
+ *
+ * We should get around to fixing this _bt_getbuf() issue some day. For
+ * now we can still proceed in the hopes that BTPageIsRecyclable() will
+ * catch up with us before _bt_getbuf() ever reaches the page.
+ */
+ heapRel = table_open(IndexGetRelation(RelationGetRelid(rel), false),
+ AccessShareLock);
+ for (int i = 0; i < vstate->ndeleted; i++)
+ {
+ BlockNumber blkno = vstate->deleted[i].blkno;
+ FullTransactionId safexid = vstate->deleted[i].safexid;
+
+ if (!GlobalVisCheckRemovableFullXid(heapRel, safexid))
+ break;
+
+ RecordFreeIndexPage(rel, blkno);
+ stats->pages_free++;
+ }
+
+ table_close(heapRel, AccessShareLock);
+}
+
/*
* Bulk deletion of all index entries pointing to a set of heap tuples.
* The set of target tuples is specified via a callback routine that tells
@@ -945,6 +1013,14 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
* _bt_vacuum_needs_cleanup() to force the next VACUUM to proceed with a
* btvacuumscan() call.
*
+ * Note: Prior to PostgreSQL 14, we were completely reliant on the next
+ * VACUUM operation taking care of recycling whatever pages the current
+ * VACUUM operation found to be empty and then deleted. It is now usually
+ * possible for _bt_newly_deleted_pages_recycle() to recycle all of the
+ * pages that any given VACUUM operation deletes, as part of the same
+ * VACUUM operation. As a result, it is rare for num_delpages to actually
+ * exceed 0, including with indexes where page deletions are frequent.
+ *
* Note: We must delay the _bt_set_cleanup_info() call until this late
* stage of VACUUM (the btvacuumcleanup() phase), to keep num_heap_tuples
* accurate. The btbulkdelete()-time num_heap_tuples value is generally
@@ -1033,6 +1109,16 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
"_bt_pagedel",
ALLOCSET_DEFAULT_SIZES);
+ /* Allocate _bt_newly_deleted_pages_recycle related information */
+ vstate.ndeletedspace = 512;
+ vstate.grow = true;
+ vstate.full = false;
+ vstate.maxndeletedspace = ((work_mem * 1024L) / sizeof(BTPendingRecycle));
+ vstate.maxndeletedspace = Min(vstate.maxndeletedspace, MaxBlockNumber);
+ vstate.maxndeletedspace = Max(vstate.maxndeletedspace, vstate.ndeletedspace);
+ vstate.ndeleted = 0;
+ vstate.deleted = palloc(sizeof(BTPendingRecycle) * vstate.ndeletedspace);
+
/*
* The outer loop iterates over all index pages except the metapage, in
* physical order (we hope the kernel will cooperate in providing
@@ -1101,7 +1187,18 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*
* Note that if no recyclable pages exist, we don't bother vacuuming the
* FSM at all.
+ *
+ * Before vacuuming the FSM, try to make the most of the pages we
+ * ourselves deleted: see if they can be recycled already (try to avoid
+ * waiting until the next VACUUM operation to recycle). Our approach is
+ * to check the local array of pages that were newly deleted during this
+ * VACUUM.
*/
+ if (vstate.ndeleted > 0)
+ _bt_newly_deleted_pages_recycle(rel, &vstate);
+
+ pfree(vstate.deleted);
+
if (stats->pages_free > 0)
IndexFreeSpaceMapVacuum(rel);
}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 8b7c143db4..6ab9af4a43 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -999,6 +999,28 @@ btree_xlog_newroot(XLogReaderState *record)
* the PGPROC->xmin > limitXmin test inside GetConflictingVirtualXIDs().
* Consequently, one XID value achieves the same exclusion effect on primary
* and standby.
+ *
+ * XXX It would make a great deal more sense if each nbtree index's FSM (or
+ * some equivalent structure) was completely crash-safe. Importantly, this
+ * would enable page recycling's REDO side to work in a way that naturally
+ * matches original execution.
+ *
+ * Page deletion has to be crash safe already, plus xl_btree_reuse_page
+ * records are logged any time a backend has to recycle -- full crash safety
+ * is unlikely to add much overhead, and has clear efficiency benefits. It
+ * would also simplify things by more explicitly decoupling page deletion and
+ * page recycling. The benefits for REDO all follow from that.
+ *
+ * Under this scheme, the whole question of recycle safety could be moved from
+ * VACUUM to the consumer side. That is, VACUUM would no longer have to defer
+ * placing a page that it deletes in the FSM until BTPageIsRecyclable() starts
+ * to return true -- _bt_getbuf() would handle all details of safely deferring
+ * recycling instead. _bt_getbuf() would use the improved/crash-safe FSM to
+ * explicitly find a free page whose safexid is sufficiently old for recycling
+ * to be safe from the point of view of backends that run during original
+ * execution. That just leaves the REDO side. Instead of xl_btree_reuse_page
+ * records, we'd have FSM "consume/recycle page from the FSM" records that are
+ * associated with FSM page buffers/blocks.
*/
static void
btree_xlog_reuse_page(XLogReaderState *record)
--
2.27.0
Attachment: v8-0002-VACUUM-ANALYZE-Distrust-cleanup-only-stats.patch (text/x-patch)
From 304839183156a11dbb33812ef040e0317f9d614b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Mar 2021 14:40:57 -0800
Subject: [PATCH v8 2/3] VACUUM ANALYZE: Distrust cleanup-only stats.
Distrust the stats from VACUUM within VACUUM ANALYZE when we know that
index AMs must only have had amvacuumcleanup() calls, without any calls
to ambulkdelete().
This establishes the convention that amvacuumcleanup() usually only
gives an estimate for num_index_tuples.
---
src/include/commands/vacuum.h | 6 +++++
src/backend/access/heap/vacuumlazy.c | 15 +++++++++++-
src/backend/commands/analyze.c | 34 ++++++++++++++++++++++++----
src/backend/commands/vacuum.c | 1 +
src/backend/postmaster/autovacuum.c | 1 +
5 files changed, 51 insertions(+), 6 deletions(-)
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d029da5ac0..efacaf758a 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -221,6 +221,12 @@ typedef struct VacuumParams
VacOptTernaryValue truncate; /* Truncate empty pages at the end,
* default value depends on reloptions */
+ /* XXX: output param approach is grotty, breaks backbranch ABI */
+
+ bool indexvacuuming; /* Output param: VACUUM took place and
+ * performed ambulkdelete calls for
+ * indexes? */
+
/*
* The number of parallel vacuum workers. 0 by default which means choose
* based on the number of indexes. -1 indicates parallel vacuum is
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d8f847b0e6..8716d305d0 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1054,6 +1054,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
lazy_vacuum_all_indexes(onerel, Irel, indstats,
vacrelstats, lps, nindexes);
+ /*
+ * Remember that index vacuuming (not just index cleanup) has
+ * taken place
+ */
+ params->indexvacuuming = true;
+
/* Remove tuples from heap */
lazy_vacuum_heap(onerel, vacrelstats);
@@ -1711,6 +1717,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats,
lps, nindexes);
+ /*
+ * Remember that index vacuuming (not just index cleanup) has
+ * taken place
+ */
+ params->indexvacuuming = true;
+
/* Remove tuples from heap */
lazy_vacuum_heap(onerel, vacrelstats);
}
@@ -1737,7 +1749,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
end_parallel_vacuum(indstats, lps, nindexes);
/* Update index statistics */
- update_index_statistics(Irel, indstats, nindexes);
+ if (vacrelstats->useindex)
+ update_index_statistics(Irel, indstats, nindexes);
/* If no indexes, make log report that lazy_vacuum_heap would've made */
if (vacuumed_pages)
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7295cf0215..d28febdd4b 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -620,11 +620,21 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
}
/*
- * Same for indexes. Vacuum always scans all indexes, so if we're part of
- * VACUUM ANALYZE, don't overwrite the accurate count already inserted by
- * VACUUM.
+ * Same for indexes, at least in most cases.
+ *
+ * VACUUM usually scans all indexes. When we're part of VACUUM ANALYZE,
+ * and when VACUUM is known to have actually deleted index tuples, index
+ * AMs will generally give accurate reltuples -- so don't overwrite the
+ * accurate count already inserted by VACUUM.
+ *
+ * Most individual index AMs only give an estimate in the event of a
+ * cleanup-only VACUUM, though -- update stats in these cases, since our
+ * estimate will be at least as good anyway. (It's possible that
+ * individual index AMs will have accurate num_index_tuples statistics
+ * even for a cleanup-only VACUUM. We don't bother recognizing that; it's
+ * pretty rare.)
*/
- if (!inh && !(params->options & VACOPT_VACUUM))
+ if (!inh && !params->indexvacuuming)
{
for (ind = 0; ind < nindexes; ind++)
{
@@ -654,9 +664,23 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
pgstat_report_analyze(onerel, totalrows, totaldeadrows,
(va_cols == NIL));
- /* If this isn't part of VACUUM ANALYZE, let index AMs do cleanup */
+ /*
+ * If this isn't part of VACUUM ANALYZE, let index AMs do cleanup.
+ *
+ * Note that most index AMs perform a no-op as a matter of policy for
+ * amvacuumcleanup() when called in ANALYZE-only mode, so in practice this
+ * usually does no work (GIN indexes rely on ANALYZE cleanup calls).
+ *
+ * Do not confuse this no-op case with the !indexvacuuming VACUUM ANALYZE
+ * case, which is the case where ambulkdelete() wasn't called for any
+ * indexes during a VACUUM or a VACUUM ANALYZE. There probably _were_
+ * amvacuumcleanup() calls for VACUUM ANALYZE -- they probably did very
+ * little work, but they're not no-ops to the index AM generally.
+ */
if (!(params->options & VACOPT_VACUUM))
{
+ Assert(!params->indexvacuuming);
+
for (ind = 0; ind < nindexes; ind++)
{
IndexBulkDeleteResult *stats;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c064352e23..8e98ae98cd 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -110,6 +110,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
/* Set default value */
params.index_cleanup = VACOPT_TERNARY_DEFAULT;
params.truncate = VACOPT_TERNARY_DEFAULT;
+ params.indexvacuuming = false; /* For now */
/* By default parallel vacuum is enabled */
params.nworkers = 0;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c13e..27d87bac34 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2925,6 +2925,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
(!wraparound ? VACOPT_SKIP_LOCKED : 0);
tab->at_params.index_cleanup = VACOPT_TERNARY_DEFAULT;
tab->at_params.truncate = VACOPT_TERNARY_DEFAULT;
+ tab->at_params.indexvacuuming = false; /* for now */
/* As of now, we don't support parallel vacuum for autovacuum */
tab->at_params.nworkers = -1;
tab->at_params.freeze_min_age = freeze_min_age;
--
2.27.0
Attachment: v8-0003-Remove-vacuum_cleanup_index_scale_factor-GUC-para.patch (text/x-patch)
From b357737d75b4e0827be987e6290292e4f912942e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 1 Mar 2021 15:58:56 -0800
Subject: [PATCH v8 3/3] Remove vacuum_cleanup_index_scale_factor GUC + param.
Always skip a full index scan during a VACUUM for nbtree indexes in the
case where VACUUM never called btbulkdelete(), even when pg_class stats
for the index relation would be considered "stale" by criteria applied
using vacuum_cleanup_index_scale_factor (remove the GUC and storage
param entirely). It should be fine to rely on ANALYZE to keep
pg_class.reltuples up to date for nbtree indexes, which is the behavior
of hashvacuumcleanup()/hash indexes.
This still means that we can do a cleanup-only scan of the index for the
one remaining case where that makes sense: to recycle pages known to be
deleted but not yet recycled following a previous VACUUM. However,
cleanup-only nbtree VACUUMs that scan the index will now be very rare.
---
src/include/access/nbtree.h | 5 +-
src/include/access/nbtxlog.h | 1 -
src/include/miscadmin.h | 2 -
src/backend/access/common/reloptions.c | 9 ---
src/backend/access/nbtree/nbtinsert.c | 3 -
src/backend/access/nbtree/nbtpage.c | 40 ++++------
src/backend/access/nbtree/nbtree.c | 75 ++++++-------------
src/backend/access/nbtree/nbtutils.c | 2 -
src/backend/access/nbtree/nbtxlog.c | 2 +-
src/backend/access/rmgrdesc/nbtdesc.c | 5 +-
src/backend/utils/init/globals.c | 2 -
src/backend/utils/misc/guc.c | 10 ---
src/backend/utils/misc/postgresql.conf.sample | 3 -
src/bin/psql/tab-complete.c | 4 +-
doc/src/sgml/config.sgml | 40 ----------
doc/src/sgml/ref/create_index.sgml | 14 ----
src/test/regress/expected/btree_index.out | 29 -------
src/test/regress/sql/btree_index.sql | 19 -----
18 files changed, 43 insertions(+), 222 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 876b8f3437..0f1692fd07 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1087,8 +1087,6 @@ typedef struct BTOptions
{
int32 varlena_header_; /* varlena header (do not touch directly!) */
int fillfactor; /* page fill factor in percent (0..100) */
- /* fraction of newly inserted tuples needed to trigger index cleanup */
- float8 vacuum_cleanup_index_scale_factor;
bool deduplicate_items; /* Try to deduplicate items? */
} BTOptions;
@@ -1191,8 +1189,7 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page origpage,
*/
extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
bool allequalimage);
-extern void _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
- float8 num_heap_tuples);
+extern void _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages);
extern void _bt_upgrademetapage(Page page);
extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 3df34fcda2..0f7731856b 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -54,7 +54,6 @@ typedef struct xl_btree_metadata
BlockNumber fastroot;
uint32 fastlevel;
uint32 last_cleanup_num_delpages;
- float8 last_cleanup_num_heap_tuples;
bool allequalimage;
} xl_btree_metadata;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bdc97e308..54693e047a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -261,8 +261,6 @@ extern int64 VacuumPageDirty;
extern int VacuumCostBalance;
extern bool VacuumCostActive;
-extern double vacuum_cleanup_index_scale_factor;
-
/* in tcop/postgres.c */
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index c687d3ee9e..433e236722 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -461,15 +461,6 @@ static relopt_real realRelOpts[] =
},
0, -1.0, DBL_MAX
},
- {
- {
- "vacuum_cleanup_index_scale_factor",
- "Number of tuple inserts prior to index cleanup as a fraction of reltuples.",
- RELOPT_KIND_BTREE,
- ShareUpdateExclusiveLock
- },
- -1, 0.0, 1e10
- },
/* list terminator */
{{NULL}}
};
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 1edb9f9579..0bc86943eb 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1332,8 +1332,6 @@ _bt_insertonpg(Relation rel,
xlmeta.fastroot = metad->btm_fastroot;
xlmeta.fastlevel = metad->btm_fastlevel;
xlmeta.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
- xlmeta.last_cleanup_num_heap_tuples =
- metad->btm_last_cleanup_num_heap_tuples;
xlmeta.allequalimage = metad->btm_allequalimage;
XLogRegisterBuffer(2, metabuf,
@@ -2549,7 +2547,6 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
md.fastroot = rootblknum;
md.fastlevel = metad->btm_level;
md.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
- md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
md.allequalimage = metad->btm_allequalimage;
XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9d7d0186d0..97f6e39ab6 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -175,26 +175,15 @@ _bt_getmeta(Relation rel, Buffer metabuf)
* _bt_vacuum_needs_cleanup() to decide whether or not a btvacuumscan()
* call should go ahead for an entire VACUUM operation.
*
- * See btvacuumcleanup() and _bt_vacuum_needs_cleanup() for details of
- * the two fields that we maintain here.
- *
- * The information that we maintain for btvacuumcleanup() describes the
- * state of the index (as well as the table it indexes) just _after_ the
- * ongoing VACUUM operation. The next _bt_vacuum_needs_cleanup() call
- * will consider the information we saved for it during the next VACUUM
- * operation (assuming that there will be no btbulkdelete() call during
- * the next VACUUM operation -- if there is then the question of skipping
- * btvacuumscan() doesn't even arise).
+ * See btvacuumcleanup() and _bt_vacuum_needs_cleanup() for the
+ * definition of num_delpages.
*/
void
-_bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
- float8 num_heap_tuples)
+_bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
{
Buffer metabuf;
Page metapg;
BTMetaPageData *metad;
- bool rewrite = false;
- XLogRecPtr recptr;
/*
* On-disk compatibility note: The btm_last_cleanup_num_delpages metapage
@@ -209,21 +198,20 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
* in reality there are only one or two. The worst that can happen is
* that there will be a call to btvacuumscan a little earlier, which will
* set btm_last_cleanup_num_delpages to a sane value when we're called.
+ *
+ * Note also that the metapage's btm_last_cleanup_num_heap_tuples field is
+ * no longer used as of PostgreSQL 14. We set it to -1.0 on rewrite, just
+ * to be consistent.
*/
metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
- /* Always dynamically upgrade index/metapage when BTREE_MIN_VERSION */
- if (metad->btm_version < BTREE_NOVAC_VERSION)
- rewrite = true;
- else if (metad->btm_last_cleanup_num_delpages != num_delpages)
- rewrite = true;
- else if (metad->btm_last_cleanup_num_heap_tuples != num_heap_tuples)
- rewrite = true;
-
- if (!rewrite)
+ /* Don't miss chance to upgrade index/metapage when BTREE_MIN_VERSION */
+ if (metad->btm_version >= BTREE_NOVAC_VERSION &&
+ metad->btm_last_cleanup_num_delpages == num_delpages)
{
+ /* Usually means index continues to have num_delpages of 0 */
_bt_relbuf(rel, metabuf);
return;
}
@@ -240,13 +228,14 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
/* update cleanup-related information */
metad->btm_last_cleanup_num_delpages = num_delpages;
- metad->btm_last_cleanup_num_heap_tuples = num_heap_tuples;
+ metad->btm_last_cleanup_num_heap_tuples = -1.0;
MarkBufferDirty(metabuf);
/* write wal record if needed */
if (RelationNeedsWAL(rel))
{
xl_btree_metadata md;
+ XLogRecPtr recptr;
XLogBeginInsert();
XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
@@ -258,7 +247,6 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages,
md.fastroot = metad->btm_fastroot;
md.fastlevel = metad->btm_fastlevel;
md.last_cleanup_num_delpages = num_delpages;
- md.last_cleanup_num_heap_tuples = num_heap_tuples;
md.allequalimage = metad->btm_allequalimage;
XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
@@ -443,7 +431,6 @@ _bt_getroot(Relation rel, int access)
md.fastroot = rootblkno;
md.fastlevel = 0;
md.last_cleanup_num_delpages = 0;
- md.last_cleanup_num_heap_tuples = -1.0;
md.allequalimage = metad->btm_allequalimage;
XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
@@ -2628,7 +2615,6 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
xlmeta.fastroot = metad->btm_fastroot;
xlmeta.fastlevel = metad->btm_fastlevel;
xlmeta.last_cleanup_num_delpages = metad->btm_last_cleanup_num_delpages;
- xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
xlmeta.allequalimage = metad->btm_allequalimage;
XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 8aed93ff0a..89dfa005f0 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -792,11 +792,8 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
Buffer metabuf;
Page metapg;
BTMetaPageData *metad;
- BTOptions *relopts;
- float8 cleanup_scale_factor;
uint32 btm_version;
BlockNumber prev_num_delpages;
- float8 prev_num_heap_tuples;
/*
* Copy details from metapage to local variables quickly.
@@ -819,32 +816,8 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
}
prev_num_delpages = metad->btm_last_cleanup_num_delpages;
- prev_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
_bt_relbuf(info->index, metabuf);
- /*
- * If the underlying table has received a sufficiently high number of
- * insertions since the last VACUUM operation that called btvacuumscan(),
- * then have the current VACUUM operation call btvacuumscan() now. This
- * happens when the statistics are deemed stale.
- *
- * XXX: We should have a more principled way of determining what
- * "staleness" means. The vacuum_cleanup_index_scale_factor GUC (and the
- * index-level storage param) seem hard to tune in a principled way.
- */
- relopts = (BTOptions *) info->index->rd_options;
- cleanup_scale_factor = (relopts &&
- relopts->vacuum_cleanup_index_scale_factor >= 0)
- ? relopts->vacuum_cleanup_index_scale_factor
- : vacuum_cleanup_index_scale_factor;
-
- if (cleanup_scale_factor <= 0 ||
- info->num_heap_tuples < 0 ||
- prev_num_heap_tuples <= 0 ||
- (info->num_heap_tuples - prev_num_heap_tuples) /
- prev_num_heap_tuples >= cleanup_scale_factor)
- return true;
-
/*
* Trigger cleanup in rare cases where prev_num_delpages exceeds 5% of the
* total size of the index. We can reasonably expect (though are not
@@ -993,25 +966,36 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
/*
* Since we aren't going to actually delete any leaf items, there's no
- * need to go through all the vacuum-cycle-ID pushups here
+ * need to go through all the vacuum-cycle-ID pushups here.
+ *
+ * Posting list tuples are a source of inaccuracy for cleanup-only
+ * scans. btvacuumscan() will assume that the number of index tuples
+ * from each page can be used as num_index_tuples, even though
+ * num_index_tuples is supposed to represent the number of TIDs in the
+ * index. This naive approach can underestimate the number of tuples
+ * in the index significantly.
+ *
+ * We handle the problem by making num_index_tuples an estimate in the
+ * cleanup-only case.
*/
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+ stats->estimated_count = true;
btvacuumscan(info, stats, NULL, NULL, 0);
}
/*
* By here, we know for sure that this VACUUM operation won't be skipping
- * its btvacuumscan() call. Maintain the count of the current number of
- * heap tuples in the metapage. Also maintain the num_delpages value.
- * This information will be used by _bt_vacuum_needs_cleanup() during
- * future VACUUM operations that don't need to call btbulkdelete().
+ * its btvacuumscan() call. Maintain the num_delpages value. This
+ * information will be used by _bt_vacuum_needs_cleanup() during future
+ * VACUUM operations that don't need to call btbulkdelete().
*
* num_delpages is the number of deleted pages now in the index that were
* not safe to place in the FSM to be recycled just yet. We expect that
* it will almost certainly be possible to place all of these pages in the
- * FSM during the next VACUUM operation. That factor alone might cause
- * _bt_vacuum_needs_cleanup() to force the next VACUUM to proceed with a
- * btvacuumscan() call.
+ * FSM during the next VACUUM operation. _bt_vacuum_needs_cleanup() will
+ * force the next VACUUM to consider this before allowing btvacuumscan()
+ * to be skipped entirely. This should be rare -- cleanup-only VACUUMs
+ * almost always manage to skip btvacuumscan() in practice.
*
* Note: Prior to PostgreSQL 14, we were completely reliant on the next
* VACUUM operation taking care of recycling whatever pages the current
@@ -1020,29 +1004,16 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
* pages that any given VACUUM operation deletes, as part of the same
* VACUUM operation. As a result, it is rare for num_delpages to actually
* exceed 0, including with indexes where page deletions are frequent.
- *
- * Note: We must delay the _bt_set_cleanup_info() call until this late
- * stage of VACUUM (the btvacuumcleanup() phase), to keep num_heap_tuples
- * accurate. The btbulkdelete()-time num_heap_tuples value is generally
- * just pg_class.reltuples for the heap relation _before_ VACUUM began.
- * In general cleanup info should describe the state of the index/table
- * _after_ VACUUM finishes.
*/
Assert(stats->pages_deleted >= stats->pages_free);
num_delpages = stats->pages_deleted - stats->pages_free;
- _bt_set_cleanup_info(info->index, num_delpages, info->num_heap_tuples);
+ _bt_set_cleanup_info(info->index, num_delpages);
/*
* It's quite possible for us to be fooled by concurrent page splits into
* double-counting some index tuples, so disbelieve any total that exceeds
* the underlying heap's count ... if we know that accurately. Otherwise
* this might just make matters worse.
- *
- * Posting list tuples are another source of inaccuracy. Cleanup-only
- * btvacuumscan calls assume that the number of index tuples can be used
- * as num_index_tuples, even though num_index_tuples is supposed to
- * represent the number of TIDs in the index. This naive approach can
- * underestimate the number of tuples in the index.
*/
if (!info->estimated_count)
{
@@ -1092,7 +1063,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* pages in the index at the end of the VACUUM command.)
*/
stats->num_pages = 0;
- stats->estimated_count = false;
stats->num_index_tuples = 0;
stats->pages_deleted = 0;
stats->pages_free = 0;
@@ -1518,7 +1488,10 @@ backtrack:
* We don't count the number of live TIDs during cleanup-only calls to
* btvacuumscan (i.e. when callback is not set). We count the number
* of index tuples directly instead. This avoids the expense of
- * directly examining all of the tuples on each page.
+ * directly examining all of the tuples on each page. VACUUM will
+ * treat num_index_tuples as an estimate in the cleanup-only case, so it
+ * doesn't matter that this underestimates num_index_tuples
+ * significantly in some cases.
*/
if (minoff > maxoff)
attempt_pagedel = (blkno == scanblkno);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index d524310723..fdbe0da472 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2105,8 +2105,6 @@ btoptions(Datum reloptions, bool validate)
{
static const relopt_parse_elt tab[] = {
{"fillfactor", RELOPT_TYPE_INT, offsetof(BTOptions, fillfactor)},
- {"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
- offsetof(BTOptions, vacuum_cleanup_index_scale_factor)},
{"deduplicate_items", RELOPT_TYPE_BOOL,
offsetof(BTOptions, deduplicate_items)}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 6ab9af4a43..8ccf1be061 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -113,7 +113,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
/* Cannot log BTREE_MIN_VERSION index metapage without upgrade */
Assert(md->btm_version >= BTREE_NOVAC_VERSION);
md->btm_last_cleanup_num_delpages = xlrec->last_cleanup_num_delpages;
- md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+ md->btm_last_cleanup_num_heap_tuples = -1.0;
md->btm_allequalimage = xlrec->allequalimage;
pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index f7cc4dd3e6..710efbd36a 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -113,9 +113,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
xlrec = (xl_btree_metadata *) XLogRecGetBlockData(record, 0,
NULL);
- appendStringInfo(buf, "last_cleanup_num_delpages %u; last_cleanup_num_heap_tuples: %f",
- xlrec->last_cleanup_num_delpages,
- xlrec->last_cleanup_num_heap_tuples);
+ appendStringInfo(buf, "last_cleanup_num_delpages %u",
+ xlrec->last_cleanup_num_delpages);
break;
}
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index a5976ad5b1..73e0a672ae 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -148,5 +148,3 @@ int64 VacuumPageDirty = 0;
int VacuumCostBalance = 0; /* working state for vacuum */
bool VacuumCostActive = false;
-
-double vacuum_cleanup_index_scale_factor;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d626731723..783e2b0fc2 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3693,16 +3693,6 @@ static struct config_real ConfigureNamesReal[] =
NULL, NULL, NULL
},
- {
- {"vacuum_cleanup_index_scale_factor", PGC_USERSET, CLIENT_CONN_STATEMENT,
- gettext_noop("Number of tuple inserts prior to index cleanup as a fraction of reltuples."),
- NULL
- },
- &vacuum_cleanup_index_scale_factor,
- 0.1, 0.0, 1e10,
- NULL, NULL, NULL
- },
-
{
{"log_statement_sample_rate", PGC_SUSET, LOGGING_WHEN,
gettext_noop("Fraction of statements exceeding log_min_duration_sample to be logged."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee06528bb0..3736c972a8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -671,9 +671,6 @@
#vacuum_freeze_table_age = 150000000
#vacuum_multixact_freeze_min_age = 5000000
#vacuum_multixact_freeze_table_age = 150000000
-#vacuum_cleanup_index_scale_factor = 0.1 # fraction of total number of tuples
- # before index cleanup, 0 always performs
- # index cleanup
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 9f0208ac49..ecdb8d752b 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1789,14 +1789,14 @@ psql_completion(const char *text, int start, int end)
/* ALTER INDEX <foo> SET|RESET ( */
else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
COMPLETE_WITH("fillfactor",
- "vacuum_cleanup_index_scale_factor", "deduplicate_items", /* BTREE */
+ "deduplicate_items", /* BTREE */
"fastupdate", "gin_pending_list_limit", /* GIN */
"buffering", /* GiST */
"pages_per_range", "autosummarize" /* BRIN */
);
else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
COMPLETE_WITH("fillfactor =",
- "vacuum_cleanup_index_scale_factor =", "deduplicate_items =", /* BTREE */
+ "deduplicate_items =", /* BTREE */
"fastupdate =", "gin_pending_list_limit =", /* GIN */
"buffering =", /* GiST */
"pages_per_range =", "autosummarize =" /* BRIN */
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b5718fc136..3cf754a236 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8512,46 +8512,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</listitem>
</varlistentry>
- <varlistentry id="guc-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
- <term><varname>vacuum_cleanup_index_scale_factor</varname> (<type>floating point</type>)
- <indexterm>
- <primary><varname>vacuum_cleanup_index_scale_factor</varname></primary>
- <secondary>configuration parameter</secondary>
- </indexterm>
- </term>
- <listitem>
- <para>
- Specifies the fraction of the total number of heap tuples counted in
- the previous statistics collection that can be inserted without
- incurring an index scan at the <command>VACUUM</command> cleanup stage.
- This setting currently applies to B-tree indexes only.
- </para>
-
- <para>
- If no tuples were deleted from the heap, B-tree indexes are still
- scanned at the <command>VACUUM</command> cleanup stage when the
- index's statistics are stale. Index statistics are considered
- stale if the number of newly inserted tuples exceeds the
- <varname>vacuum_cleanup_index_scale_factor</varname>
- fraction of the total number of heap tuples detected by the previous
- statistics collection. The total number of heap tuples is stored in
- the index meta-page. Note that the meta-page does not include this data
- until <command>VACUUM</command> finds no dead tuples, so B-tree index
- scan at the cleanup stage can only be skipped if the second and
- subsequent <command>VACUUM</command> cycles detect no dead tuples.
- </para>
-
- <para>
- The value can range from <literal>0</literal> to
- <literal>10000000000</literal>.
- When <varname>vacuum_cleanup_index_scale_factor</varname> is set to
- <literal>0</literal>, index scans are never skipped during
- <command>VACUUM</command> cleanup. The default value is <literal>0.1</literal>.
- </para>
-
- </listitem>
- </varlistentry>
-
<varlistentry id="guc-bytea-output" xreflabel="bytea_output">
<term><varname>bytea_output</varname> (<type>enum</type>)
<indexterm>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 965dcf472c..b291b4dbc0 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -456,20 +456,6 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
</note>
</listitem>
</varlistentry>
-
- <varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
- <term><literal>vacuum_cleanup_index_scale_factor</literal> (<type>floating point</type>)
- <indexterm>
- <primary><varname>vacuum_cleanup_index_scale_factor</varname></primary>
- <secondary>storage parameter</secondary>
- </indexterm>
- </term>
- <listitem>
- <para>
- Per-index value for <xref linkend="guc-vacuum-cleanup-index-scale-factor"/>.
- </para>
- </listitem>
- </varlistentry>
</variablelist>
<para>
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index cfd4338e36..bc113a70b4 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -308,35 +308,6 @@ alter table btree_tall_tbl alter COLUMN t set storage plain;
create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
insert into btree_tall_tbl select g, repeat('x', 250)
from generate_series(1, 130) g;
---
--- Test vacuum_cleanup_index_scale_factor
---
--- Simple create
-create table btree_test(a int);
-create index btree_idx1 on btree_test(a) with (vacuum_cleanup_index_scale_factor = 40.0);
-select reloptions from pg_class WHERE oid = 'btree_idx1'::regclass;
- reloptions
-------------------------------------------
- {vacuum_cleanup_index_scale_factor=40.0}
-(1 row)
-
--- Fail while setting improper values
-create index btree_idx_err on btree_test(a) with (vacuum_cleanup_index_scale_factor = -10.0);
-ERROR: value -10.0 out of bounds for option "vacuum_cleanup_index_scale_factor"
-DETAIL: Valid values are between "0.000000" and "10000000000.000000".
-create index btree_idx_err on btree_test(a) with (vacuum_cleanup_index_scale_factor = 100.0);
-create index btree_idx_err on btree_test(a) with (vacuum_cleanup_index_scale_factor = 'string');
-ERROR: invalid value for floating point option "vacuum_cleanup_index_scale_factor": string
-create index btree_idx_err on btree_test(a) with (vacuum_cleanup_index_scale_factor = true);
-ERROR: invalid value for floating point option "vacuum_cleanup_index_scale_factor": true
--- Simple ALTER INDEX
-alter index btree_idx1 set (vacuum_cleanup_index_scale_factor = 70.0);
-select reloptions from pg_class WHERE oid = 'btree_idx1'::regclass;
- reloptions
-------------------------------------------
- {vacuum_cleanup_index_scale_factor=70.0}
-(1 row)
-
--
-- Test for multilevel page deletion
--
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 96f53818ff..c60312db2d 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -150,25 +150,6 @@ create index btree_tall_idx on btree_tall_tbl (t, id) with (fillfactor = 10);
insert into btree_tall_tbl select g, repeat('x', 250)
from generate_series(1, 130) g;
---
--- Test vacuum_cleanup_index_scale_factor
---
-
--- Simple create
-create table btree_test(a int);
-create index btree_idx1 on btree_test(a) with (vacuum_cleanup_index_scale_factor = 40.0);
-select reloptions from pg_class WHERE oid = 'btree_idx1'::regclass;
-
--- Fail while setting improper values
-create index btree_idx_err on btree_test(a) with (vacuum_cleanup_index_scale_factor = -10.0);
-create index btree_idx_err on btree_test(a) with (vacuum_cleanup_index_scale_factor = 100.0);
-create index btree_idx_err on btree_test(a) with (vacuum_cleanup_index_scale_factor = 'string');
-create index btree_idx_err on btree_test(a) with (vacuum_cleanup_index_scale_factor = true);
-
--- Simple ALTER INDEX
-alter index btree_idx1 set (vacuum_cleanup_index_scale_factor = 70.0);
-select reloptions from pg_class WHERE oid = 'btree_idx1'::regclass;
-
--
-- Test for multilevel page deletion
--
--
2.27.0
On Tue, Mar 2, 2021 at 6:40 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Sun, Feb 28, 2021 at 8:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Even though "the letter of the law" favors removing the
vacuum_cleanup_index_scale_factor GUC + param in the way I have
outlined, that is not the only thing that matters -- we must also
consider "the spirit of the law".I suppose I could ask Tom what he thinks?
+1
Are you going to start a new thread, or should I?
Ok, I'll start a new thread soon.
Since it seems not a bug I personally think we don't need to do
anything for back branches. But if we want not to trigger an index
scan by vacuum_cleanup_index_scale_factor, we could change the default
value to a high value (say, to 10000) so that it can skip an index
scan in most cases.
One reason to remove vacuum_cleanup_index_scale_factor in the back
branches is that it removes any need to fix the
"IndexVacuumInfo.num_heap_tuples is inaccurate outside of
btvacuumcleanup-only VACUUMs" bug -- it just won't matter if
btm_last_cleanup_num_heap_tuples is inaccurate anymore. (I am still
not sure about backpatch being a good idea, though.)
I think that removing vacuum_cleanup_index_scale_factor in the back
branches would affect existing installations too much. It would be
better to have btree indexes simply not use this parameter, without
changing the contents of the metapage. That is, just remove the check
related to vacuum_cleanup_index_scale_factor from
_bt_vacuum_needs_cleanup(). And I personally prefer to fix the
"IndexVacuumInfo.num_heap_tuples is inaccurate outside of
btvacuumcleanup-only VACUUMs" bug separately.
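To make that concrete, here is roughly what the back-branch change would
look like -- a simplified sketch of the 13-era _bt_vacuum_needs_cleanup(),
from memory, with only the scale-factor test taken out (treat the details
as illustrative rather than a finished patch):

static bool
_bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
{
    Buffer      metabuf;
    BTMetaPageData *metad;
    bool        result = false;

    metabuf = _bt_getbuf(info->index, BTREE_METAPAGE, BT_READ);
    metad = BTPageGetMeta(BufferGetPage(metabuf));

    if (metad->btm_version < BTREE_NOVAC_VERSION)
    {
        /* Metapage needs upgrade -- no cleanup-related info to go on yet */
        result = true;
    }
    else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
             TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
                                   RecentGlobalXmin))
    {
        /* Some previously deleted pages may now be recyclable */
        result = true;
    }

    /*
     * The vacuum_cleanup_index_scale_factor "stale stats" test is simply
     * gone: trust ANALYZE to keep pg_class.reltuples reasonable, and leave
     * the metapage contents alone.
     */

    _bt_relbuf(info->index, metabuf);
    return result;
}

The metapage would still be maintained exactly as today; it just wouldn't
be consulted for the staleness test anymore.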
Another new concern for me (another concern unique to Postgres 13) is
autovacuum_vacuum_insert_scale_factor-driven autovacuums.
IIUC the purpose of autovacuum_vacuum_insert_scale_factor is
visibility map maintenance. And as per this discussion, it seems not
necessary to do an index scan in btvacuumcleanup() triggered by
autovacuum_vacuum_insert_scale_factor.
Arguably the question of skipping scanning the index should have been
considered by the autovacuum_vacuum_insert_scale_factor patch when it
was committed for Postgres 13 -- but it wasn't. There is a regression
that was tied to autovacuum_vacuum_insert_scale_factor in Postgres 13
by Mark Callaghan, which I suspect is relevant:
https://smalldatum.blogspot.com/2021/01/insert-benchmark-postgres-is-still.html
The blog post says: "Updates - To understand the small regression
mentioned above for the l.i1 test (more CPU & write IO) I repeated the
test with 100M rows using 2 configurations: one disabled index
deduplication and the other disabled insert-triggered autovacuum.
Disabling index deduplication had no effect and disabling
insert-triggered autovacuum resolves the regression."
This is quite specifically with an insert-only workload, with 4
indexes (that's from memory, but I'm pretty sure it's 4). I think that
the failure to account for skipping index scans is probably the big
problem here. Scanning the heap to set VM bits is unlikely to be
expensive compared to the full index scans. An insert-only workload is
going to find most of the heap blocks it scans to set VM bits in
shared_buffers. Not so for the indexes.
So in Postgres 13 we have this autovacuum_vacuum_insert_scale_factor
issue, in addition to the deduplication + btvacuumcleanup issue we
talked about (the problems left by my Postgres 13 bug fix commit
48e12913). These two issues make removing
vacuum_cleanup_index_scale_factor tempting, even in the back branches
-- it might actually be the more conservative approach, at least for
Postgres 13.
Yeah, this argument makes sense to me. The default values of
autovacuum_vacuum_insert_scale_factor/threshold are 0.2 and 1000
respectively, whereas the default for vacuum_cleanup_index_scale_factor
is 0.1. That means that in an insert-only workload with default
settings, autovacuums triggered by autovacuum_vacuum_insert_scale_factor
always scan all btree indexes just to update the index statistics. I
think most users would not expect this behavior. As I mentioned above, I
think we can have nbtree not use this parameter, or increase its default
value, in the back branches.
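To put rough numbers on that (assuming default settings and a table whose
reltuples was 10 million when the metapage was last updated):

    insert-driven autovacuum fires after about
        1000 + 0.2 * 10,000,000 = 2,001,000 inserts   (~20% growth)
    the nbtree "stale stats" cleanup scan triggers once growth reaches
        0.1 * 10,000,000 = 1,000,000 inserts          (10% growth)

So by the time an insert-driven autovacuum runs, the 10% staleness test
has long since been crossed, and a cleanup-only scan of every btree index
on the table is guaranteed.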
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, Mar 1, 2021 at 8:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I think that removing vacuum_cleanup_index_scale_factor in the back
branches would affect the existing installation much. It would be
better to have btree indexes not use this parameter while not changing
the contents of meta page. That is, just remove the check related to
vacuum_cleanup_index_scale_factor from _bt_vacuum_needs_cleanup().
That's really what I meant -- we cannot just remove a GUC or storage
param in the backbranches, of course (it breaks postgresql.conf, stuff
like that). But we can disable GUCs at the code level.
And
I personally prefer to fix the "IndexVacuumInfo.num_heap_tuples is
inaccurate outside of btvacuumcleanup-only VACUUMs" bug separately.
I have not decided on my own position on the backbranches. Hopefully
there will be clear guidance from other hackers.
Yeah, this argument makes sense to me. The default values of
autovacuum_vacuum_insert_scale_factor/threshold are 0.2 and 1000
respectively whereas one of vacuum_cleanup_index_scale_factor is 0.1.
It means that in insert-only workload with default settings,
autovacuums triggered by autovacuum_vacuum_insert_scale_factor always
scan the all btree index to update the index statistics. I think most
users would not expect this behavior. As I mentioned above, I think we
can have nbtree not use this parameter or increase the default value
of vacuum_cleanup_index_scale_factor in back branches.
It's not just a problem when autovacuum_vacuum_insert_scale_factor
triggers a cleanup-only VACUUM in all indexes. It's also a problem
in cases where an autovacuum VACUUM triggered by
autovacuum_vacuum_insert_scale_factor finds a small number of dead
tuples. That will get index scans done by btbulkdelete() -- which are
more expensive than a VACUUM that only calls btvacuumcleanup().
Of course this is exactly what the patch you're working on for
Postgres 14 helps with. It's actually not very different (1 dead tuple
and 0 dead tuples are not very different). So it makes sense that we
ended up here -- vacuumlazy.c alone should be in control of this
stuff, because only vacuumlazy.c has the authority to see that 1 dead
tuple and 0 dead tuples should be considered the same thing (or almost
the same). So...maybe we can only truly fix the problem in Postgres 14
anyway, and should just accept that?
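Just to illustrate the shape of that idea (a purely hypothetical sketch,
not your actual patch -- the 2% threshold and the function name are made
up here):

static bool
skip_index_vacuuming(double dead_tuples, BlockNumber rel_pages)
{
    if (dead_tuples <= 0)
        return true;        /* nothing for ambulkdelete() to do anyway */

    /* Treat "very few dead tuples" the same as "no dead tuples" */
    return dead_tuples < rel_pages * 0.02;
}

vacuumlazy.c would consult something like this before calling
lazy_vacuum_all_indexes(), and simply leave the LP_DEAD line pointers
behind for a later VACUUM whenever it returns true.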
OTOH scanning the indexes for no reason when
autovacuum_vacuum_insert_scale_factor triggers an autovacuum VACUUM
does seem *particularly* silly. So I don't know what to think.
--
Peter Geoghegan
On Tue, Mar 2, 2021 at 1:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Mar 2, 2021 at 6:40 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Sun, Feb 28, 2021 at 8:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Even though "the letter of the law" favors removing the
vacuum_cleanup_index_scale_factor GUC + param in the way I have
outlined, that is not the only thing that matters -- we must also
consider "the spirit of the law".I suppose I could ask Tom what he thinks?
+1
Are you going to start a new thread, or should I?
Ok, I'll start a new thread soon.
I've started a new thread[1]/messages/by-id/CAD21AoA4WHthN5uU6+WScZ7+J_RcEjmcuH94qcoUPuB42ShXzg@mail.gmail.com. Please feel free to add your thoughts.
Regards,
[1]: /messages/by-id/CAD21AoA4WHthN5uU6+WScZ7+J_RcEjmcuH94qcoUPuB42ShXzg@mail.gmail.com
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Mar 2, 2021 at 1:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Mar 1, 2021 at 8:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I think that removing vacuum_cleanup_index_scale_factor in the back
branches would affect the existing installation much. It would be
better to have btree indexes not use this parameter while not changing
the contents of meta page. That is, just remove the check related to
vacuum_cleanup_index_scale_factor from _bt_vacuum_needs_cleanup().
That's really what I meant -- we cannot just remove a GUC or storage
param in the backbranches, of course (it breaks postgresql.conf, stuff
like that). But we can disable GUCs at the code level.
Oh ok, I misunderstood.
And
I personally prefer to fix the "IndexVacuumInfo.num_heap_tuples is
inaccurate outside of btvacuumcleanup-only VACUUMs" bug separately.
I have not decided on my own position on the backbranches. Hopefully
there will be clear guidance from other hackers.
+1
Yeah, this argument makes sense to me. The default values of
autovacuum_vacuum_insert_scale_factor/threshold are 0.2 and 1000
respectively whereas one of vacuum_cleanup_index_scale_factor is 0.1.
It means that in insert-only workload with default settings,
autovacuums triggered by autovacuum_vacuum_insert_scale_factor always
scan the all btree index to update the index statistics. I think most
users would not expect this behavior. As I mentioned above, I think we
can have nbtree not use this parameter or increase the default value
of vacuum_cleanup_index_scale_factor in back branches.
It's not just a problem when autovacuum_vacuum_insert_scale_factor
triggers a cleanup-only VACUUM in all indexes. It's also a problem
with cases where there is a small number of dead tuples by an
autovacuum VACUUM triggered by autovacuum_vacuum_insert_scale_factor.
It will get index scans done by btbulkdeletes() -- which are more
expensive than a VACUUM that only calls btvacuumcleanup().
Of course this is exactly what the patch you're working on for
Postgres 14 helps with. It's actually not very different (1 dead tuple
and 0 dead tuples are not very different). So it makes sense that we
ended up here -- vacuumlazy.c alone should be in control of this
stuff, because only vacuumlazy.c has the authority to see that 1 dead
tuple and 0 dead tuples should be considered the same thing (or almost
the same). So...maybe we can only truly fix the problem in Postgres 14
anyway, and should just accept that?
Yeah, I think that's right.
Perhaps we can do something so that autovacuums driven by
autovacuum_vacuum_insert_scale_factor are only triggered in the truly
insert-only case (e.g., by checking whether n_dead_tup is 0).
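To sketch where that check might live (illustrative only -- the variable
names loosely follow autovacuum.c's relation_needs_vacanalyze(), and the
extra condition is hypothetical):

    *dovacuum = force_vacuum ||
                vactuples > vacthresh ||
                (vac_ins_base_thresh >= 0 &&
                 vactuples == 0 &&      /* hypothetical: truly insert-only */
                 instuples > vacinsthresh);

Here vactuples is the table's dead-tuple count from the stats collector
and instuples is the number of inserts since the last vacuum; the only
change is the vactuples == 0 test.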
OTOH scanning the indexes for no reason when
autovacuum_vacuum_insert_scale_factor triggers an autovacuum VACUUM
does seem *particularly* silly.
Agreed.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Sun, Mar 7, 2021 at 8:52 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Yeah, I think that's right.
Perhaps we can do something so that autovacuums triggered by
autovacuum_vacuum_insert_scale_factor are triggered on only a true
insert-only case (e.g., by checking if n_dead_tup is 0).
Right -- that's really what it would mean to "remove
vacuum_cleanup_index_scale_factor in the backbranches".
I now think that it won't even be necessary to make many changes
within VACUUM ANALYZE to avoid unwanted side-effects from removing
vacuum_cleanup_index_scale_factor, per my mail to Tom today:
/messages/by-id/CAH2-WzknxdComjhqo4SUxVFk_Q1171GJO2ZgHZ1Y6pion6u8rA@mail.gmail.com
I'm starting to lean towards "removing
vacuum_cleanup_index_scale_factor" in Postgres 13 and master only,
purely to fix the two issues in Postgres 13 (the insert-driven vacuum
issue and the deduplication stats issue I go into in the mail I link
to). A much more conservative approach should be used to fix the more
superficial issue -- the issue of getting an accurate value (for
pg_class.reltuples) from "info->num_heap_tuples". As discussed
already, the conservative fix is to delay reading
"info->num_heap_tuples" until btvacuumcleanup(), even in cases where
there are btbulkdelete() calls for the VACUUM.
Then we can then revisit your patch to make vacuumlazy.c skip index
vacuuming when there are very few dead tuples, but more than 0 dead
tuples [1]/messages/by-id/CAD21AoAtZb4+HJT_8RoOXvu4HM-Zd4HKS3YSMCH6+-W=bDyh-w@mail.gmail.com -- Peter Geoghegan. I should be able to commit that for Postgres 14.
(I will probably finish off my other patch to make nbtree VACUUM
recycle pages deleted during the same VACUUM operation last of all.)
[1]: /messages/by-id/CAD21AoAtZb4+HJT_8RoOXvu4HM-Zd4HKS3YSMCH6+-W=bDyh-w@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan
On Mon, Mar 1, 2021 at 7:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v8 of the patch series, which has new patches. No real
changes compared to v7 for the first patch, though.
Here is another bitrot-fix-only revision, v9. Just the recycling patch again.
I'll commit this when we get your patch committed. Still haven't
decided on exactly how much more aggressive we should be. For example, the
use of the heap relation within _bt_newly_deleted_pages_recycle()
might have unintended consequences for recycling efficiency with some
workloads, since it doesn't agree with _bt_getbuf() (it is still "more
ambitious" than _bt_getbuf(), at least for now).
--
Peter Geoghegan
Attachments:
v9-0001-Recycle-pages-deleted-during-same-VACUUM.patch (application/octet-stream)
From 300a59a92e30305101103bd816d4f5c2d50cdf1d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 25 Feb 2021 15:17:22 -0800
Subject: [PATCH v9] Recycle pages deleted during same VACUUM.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzk76_P=67iUscb1UN44-gyZL-KgpsXbSxq_bdcMa7Q+wQ@mail.gmail.com
---
src/include/access/nbtree.h | 22 ++++++-
src/backend/access/nbtree/README | 31 +++++++++
src/backend/access/nbtree/nbtpage.c | 40 ++++++++++++
src/backend/access/nbtree/nbtree.c | 97 +++++++++++++++++++++++++++++
src/backend/access/nbtree/nbtxlog.c | 22 +++++++
5 files changed, 211 insertions(+), 1 deletion(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 5c66d1f366..8517b6026c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -279,7 +279,8 @@ BTPageGetDeleteXid(Page page)
* Is an existing page recyclable?
*
* This exists to centralize the policy on which deleted pages are now safe to
- * re-use.
+ * re-use. The _bt_newly_deleted_pages_recycle() optimization behaves more
+ * aggressively, though that has certain known limitations.
*
* Note: PageIsNew() pages are always safe to recycle, but we can't deal with
* them here (caller is responsible for that case themselves). Caller might
@@ -316,14 +317,33 @@ BTPageIsRecyclable(Page page)
* BTVacState is private nbtree.c state used during VACUUM. It is exported
* for use by page deletion related code in nbtpage.c.
*/
+typedef struct BTPendingRecycle
+{
+ BlockNumber blkno;
+ FullTransactionId safexid;
+} BTPendingRecycle;
+
typedef struct BTVacState
{
+ /*
+ * VACUUM operation state
+ */
IndexVacuumInfo *info;
IndexBulkDeleteResult *stats;
IndexBulkDeleteCallback callback;
void *callback_state;
BTCycleId cycleid;
+
+ /*
+ * Page deletion state for VACUUM
+ */
MemoryContext pagedelcontext;
+ BTPendingRecycle *deleted;
+ bool grow;
+ bool full;
+ uint32 ndeletedspace;
+ uint64 maxndeletedspace;
+ uint32 ndeleted;
} BTVacState;
/*
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 46d49bf025..265814ea46 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -430,6 +430,37 @@ whenever it is subsequently taken from the FSM for reuse. The deleted
page's contents will be overwritten by the split operation (it will become
the new right sibling page).
+Prior to PostgreSQL 14, VACUUM was only able to recycle pages that were
+deleted by a previous VACUUM operation (VACUUM typically placed all pages
+deleted by the last VACUUM into the FSM, though there were and are no
+certainties here). This had the obvious disadvantage of creating
+uncertainty about when and how pages get recycled, especially with bursty
+workloads. It was naive, even within the constraints of the design, since
+there is no reason to think that it will take long for a deleted page to
+become recyclable. It's convenient to use XIDs to implement the drain
+technique, but that is totally unrelated to any of the other things that
+VACUUM needs to do with XIDs.
+
+VACUUM operations now consider whether it's possible to recycle any pages
+that the same operation deleted, at the end of the physical scan of the
+index (the last convenient point for one final check). This changes nothing
+about the basic design, and so it might still not be possible to recycle any
+pages at that time (e.g., there might not be even a single new transaction
+after an index page deletion, but before VACUUM ends). But
+we have little to lose and plenty to gain by trying. We only need to keep
+around a little information about recently deleted pages in local memory.
+We don't even have to access the deleted pages a second time.
+
+Currently VACUUM delays considering the possibility of recycling its own
+recently deleted page until the end of its btbulkdelete scan (or until the
+end of btvacuumcleanup in cases where there were no tuples to delete in
+the index). It would be slightly more effective if btbulkdelete page
+deletions were deferred until btvacuumcleanup, simply because more time
+will have passed. Our current approach works well enough in practice,
+especially in cases where it really matters: cases where we're vacuuming a
+large index, where recycling pages sooner rather than later is
+particularly likely to matter.
+
Fastpath For Index Insertion
----------------------------
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 4a0578dff4..a6c43b940e 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2677,6 +2677,46 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (target <= scanblkno)
stats->pages_deleted++;
+ /*
+ * Maintain array of pages that were deleted during current btvacuumscan()
+ * call. We may well be able to recycle them in a separate pass at the
+ * end of the current btvacuumscan().
+ *
+ * Need to respect work_mem/maxndeletedspace limitation on size of deleted
+ * array. Our strategy when the array can no longer grow within the
+ * bounds of work_mem is simple: keep earlier entries (which are likelier
+ * to be recyclable in the end), but stop saving new entries.
+ */
+ if (vstate->full)
+ return true;
+
+ if (vstate->ndeleted >= vstate->ndeletedspace)
+ {
+ uint64 newndeletedspace;
+
+ if (!vstate->grow)
+ {
+ vstate->full = true;
+ return true;
+ }
+
+ newndeletedspace = vstate->ndeletedspace * 2;
+ if (newndeletedspace > vstate->maxndeletedspace)
+ {
+ newndeletedspace = vstate->maxndeletedspace;
+ vstate->grow = false;
+ }
+ vstate->ndeletedspace = newndeletedspace;
+
+ vstate->deleted =
+ repalloc(vstate->deleted,
+ sizeof(BTPendingRecycle) * vstate->ndeletedspace);
+ }
+
+ vstate->deleted[vstate->ndeleted].blkno = target;
+ vstate->deleted[vstate->ndeleted].safexid = safexid;
+ vstate->ndeleted++;
+
return true;
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c70647d6f3..bf50c7c265 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -21,7 +21,9 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
#include "access/relscan.h"
+#include "access/table.h"
#include "access/xlog.h"
+#include "catalog/index.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
@@ -32,6 +34,7 @@
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/index_selfuncs.h"
@@ -833,6 +836,71 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
return false;
}
+/*
+ * _bt_newly_deleted_pages_recycle() -- Are _bt_pagedel pages recyclable now?
+ *
+ * Note that we assume that the array is ordered by safexid. No further
+ * entries can be safe to recycle once we encounter the first non-recyclable
+ * entry in the deleted array.
+ */
+static inline void
+_bt_newly_deleted_pages_recycle(Relation rel, BTVacState *vstate)
+{
+ IndexBulkDeleteResult *stats = vstate->stats;
+ Relation heapRel;
+
+ Assert(vstate->ndeleted > 0);
+ Assert(stats->pages_newly_deleted >= vstate->ndeleted);
+
+ /*
+ * Recompute VACUUM XID boundaries.
+ *
+ * We don't actually care about the oldest non-removable XID. Computing
+ * the oldest such XID has a useful side-effect: It updates the procarray
+ * state that tracks XID horizon. This is not just an optimization; it's
+ * essential. It allows the GlobalVisCheckRemovableFullXid() calls we
+ * make here to notice if and when safexid values from pages this same
+ * VACUUM operation deleted are sufficiently old to allow recycling to
+ * take place safely.
+ */
+ GetOldestNonRemovableTransactionId(NULL);
+
+ /*
+ * Use the heap relation for GlobalVisCheckRemovableFullXid() calls (don't
+ * pass NULL rel argument).
+ *
+ * This is an optimization; it allows us to be much more aggressive in
+ * cases involving logical decoding (unless this happens to be a system
+ * catalog). We don't simply use BTPageIsRecyclable().
+ *
+ * XXX: The BTPageIsRecyclable() criteria creates problems for this
+ * optimization. Its safexid test is applied in a redundant manner within
+ * _bt_getbuf() (via its BTPageIsRecyclable() call). Consequently,
+ * _bt_getbuf() may believe that it is still unsafe to recycle a page that
+ * we know to be recycle safe -- in which case it is unnecessarily
+ * discarded.
+ *
+ * We should get around to fixing this _bt_getbuf() issue some day. For
+ * now we can still proceed in the hopes that BTPageIsRecyclable() will
+ * catch up with us before _bt_getbuf() ever reaches the page.
+ */
+ heapRel = table_open(IndexGetRelation(RelationGetRelid(rel), false),
+ AccessShareLock);
+ for (int i = 0; i < vstate->ndeleted; i++)
+ {
+ BlockNumber blkno = vstate->deleted[i].blkno;
+ FullTransactionId safexid = vstate->deleted[i].safexid;
+
+ if (!GlobalVisCheckRemovableFullXid(heapRel, safexid))
+ break;
+
+ RecordFreeIndexPage(rel, blkno);
+ stats->pages_free++;
+ }
+
+ table_close(heapRel, AccessShareLock);
+}
+
/*
* Bulk deletion of all index entries pointing to a set of heap tuples.
* The set of target tuples is specified via a callback routine that tells
@@ -927,6 +995,14 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
* FSM during the next VACUUM operation. _bt_vacuum_needs_cleanup() will
* force the next VACUUM to consider this before allowing btvacuumscan()
* to be skipped entirely.
+ *
+ * Note: Prior to PostgreSQL 14, we were completely reliant on the next
+ * VACUUM operation taking care of recycling whatever pages the current
+ * VACUUM operation found to be empty and then deleted. It is now usually
+ * possible for _bt_newly_deleted_pages_recycle() to recycle all of the
+ * pages that any given VACUUM operation deletes, as part of the same
+ * VACUUM operation. As a result, it is rare for num_delpages to actually
+ * exceed 0, including with indexes where page deletions are frequent.
*/
Assert(stats->pages_deleted >= stats->pages_free);
num_delpages = stats->pages_deleted - stats->pages_free;
@@ -1002,6 +1078,16 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
"_bt_pagedel",
ALLOCSET_DEFAULT_SIZES);
+ /* Allocate _bt_newly_deleted_pages_recycle related information */
+ vstate.ndeletedspace = 512;
+ vstate.grow = true;
+ vstate.full = false;
+ vstate.maxndeletedspace = ((work_mem * 1024L) / sizeof(BTPendingRecycle));
+ vstate.maxndeletedspace = Min(vstate.maxndeletedspace, MaxBlockNumber);
+ vstate.maxndeletedspace = Max(vstate.maxndeletedspace, vstate.ndeletedspace);
+ vstate.ndeleted = 0;
+ vstate.deleted = palloc(sizeof(BTPendingRecycle) * vstate.ndeletedspace);
+
/*
* The outer loop iterates over all index pages except the metapage, in
* physical order (we hope the kernel will cooperate in providing
@@ -1070,7 +1156,18 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*
* Note that if no recyclable pages exist, we don't bother vacuuming the
* FSM at all.
+ *
+ * Before vacuuming the FSM, try to make the most of the pages we
+ * ourselves deleted: see if they can be recycled already (try to avoid
+ * waiting until the next VACUUM operation to recycle). Our approach is
+ * to check the local array of pages that were newly deleted during this
+ * VACUUM.
*/
+ if (vstate.ndeleted > 0)
+ _bt_newly_deleted_pages_recycle(rel, &vstate);
+
+ pfree(vstate.deleted);
+
if (stats->pages_free > 0)
IndexFreeSpaceMapVacuum(rel);
}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 1779b6ba47..192c6e03ce 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -999,6 +999,28 @@ btree_xlog_newroot(XLogReaderState *record)
* the PGPROC->xmin > limitXmin test inside GetConflictingVirtualXIDs().
* Consequently, one XID value achieves the same exclusion effect on primary
* and standby.
+ *
+ * XXX It would make a great deal more sense if each nbtree index's FSM (or
+ * some equivalent structure) was completely crash-safe. Importantly, this
+ * would enable page recycling's REDO side to work in a way that naturally
+ * matches original execution.
+ *
+ * Page deletion has to be crash safe already, plus xl_btree_reuse_page
+ * records are logged any time a backend has to recycle -- full crash safety
+ * is unlikely to add much overhead, and has clear efficiency benefits. It
+ * would also simplify things by more explicitly decoupling page deletion and
+ * page recycling. The benefits for REDO all follow from that.
+ *
+ * Under this scheme, the whole question of recycle safety could be moved from
+ * VACUUM to the consumer side. That is, VACUUM would no longer have to defer
+ * placing a page that it deletes in the FSM until BTPageIsRecyclable() starts
+ * to return true -- _bt_getbut() would handle all details of safely deferring
+ * recycling instead. _bt_getbut() would use the improved/crash-safe FSM to
+ * explicitly find a free page whose safexid is sufficiently old for recycling
+ * to be safe from the point of view of backends that run during original
+ * execution. That just leaves the REDO side. Instead of xl_btree_reuse_page
+ * records, we'd have FSM "consume/recycle page from the FSM" records that are
+ * associated with FSM page buffers/blocks.
*/
static void
btree_xlog_reuse_page(XLogReaderState *record)
base-commit: 5f8727f5a679452f7bbdd6966a1586934dcaa84f
--
2.27.0
On Wed, Mar 10, 2021 at 5:34 PM Peter Geoghegan <pg@bowt.ie> wrote:
Here is another bitrot-fix-only revision, v9. Just the recycling patch again.
I committed the final nbtree page deletion patch just now -- the one
that attempts to make recycling happen for newly deleted pages. Thanks
for all your work on patch review, Masahiko!
I'll close out the CF item for this patch series now.
--
Peter Geoghegan
On Mon, Mar 22, 2021 at 7:27 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Mar 10, 2021 at 5:34 PM Peter Geoghegan <pg@bowt.ie> wrote:
Here is another bitrot-fix-only revision, v9. Just the recycling patch again.
I committed the final nbtree page deletion patch just now -- the one
that attempts to make recycling happen for newly deleted pages. Thanks
for all your work on patch review, Masahiko!
You're welcome! Those are really good improvements.
With this patch series, btree indexes have become like hash indexes in
terms of amvacuumcleanup. We only do an index scan in btvacuumcleanup()
in two cases: a metapage upgrade, and more than 5%
deleted-but-not-yet-recycled pages. Both cases seem rare. So do we want
to disable parallel index cleanup for btree indexes, as we do for hash
indexes? That is, remove VACUUM_OPTION_PARALLEL_COND_CLEANUP from
amparallelvacuumoptions. IMO we can live with the current
configuration, just in case a user does run into such rare situations
(especially the latter case). In most cases parallel vacuum workers for
index cleanup might exit as a no-op, but the side-effects (wasted
resources, overhead, etc.) would not be big. If we wanted to enable it
only in particular cases, we would need another way for an index AM to
tell lazy vacuum whether or not to allow a parallel worker to process
the index at that time. What do you think? I'm not sure we need any
change, but I think it's worth discussing here.
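For concreteness, the change being floated here would presumably amount
to dropping one flag from the assignment in bthandler() -- a sketch
only, assuming the flag layout in nbtree.c hasn't changed:

	/* bthandler(), as it stands today */
	amroutine->amparallelvacuumoptions =
		VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;

	/* what matching hash indexes' behavior would look like */
	amroutine->amparallelvacuumoptions =
		VACUUM_OPTION_PARALLEL_BULKDEL;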
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Mar 23, 2021 at 8:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
By this patch series, btree indexes became like hash indexes in terms
of amvacuumcleanup. We do an index scan at btvacuumcleanup() in the
two cases: metapage upgrading and more than 5%
deleted-but-not-yet-recycled pages. Both cases seem rare cases. So do
we want to disable parallel index cleanup for btree indexes like hash
indexes? That is, remove VACUUM_OPTION_PARALLEL_COND_CLEANUP from
amparallelvacuumoptions.
My recent "Recycle nbtree pages deleted during same VACUUM" commit
improved the efficiency of recycling, but I still think that it was a
bit of a hack. Or at least it didn't go far enough in fixing the old
design, which is itself a bit of a hack.
As I said back on February 15, a truly good design for nbtree page
deletion + recycling would have crash safety built in. If page
deletion itself is crash safe, it really makes sense to make
everything crash safe (especially because we're managing large chunks
of equisized free space, unlike in heapam). And as I also said back
then, a 100% crash-safe design could naturally shift the problem of
nbtree page recycle safety from the producer/VACUUM side, to the
consumer/_bt_getbuf() side. It should be completely separated from
when VACUUM runs, and what VACUUM can discover about recycle safety in
passing, at the end.
That approach would completely eliminate the need to do any work in
btvacuumcleanup(), which would make it natural to remove
VACUUM_OPTION_PARALLEL_COND_CLEANUP from nbtree -- the implementation
of btvacuumcleanup() would just look like hashvacuumcleanup() does now
-- it could do practically nothing, making this 100% okay.
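To make the shape of that design a bit more concrete, the consumer side
might look something like the sketch below. Everything here is
hypothetical -- GetFreePageWithSafeXid() in particular stands in for an
imaginary crash-safe FSM interface that stores each free page's
safexid:

	static Buffer
	_bt_getbuf_recycle_sketch(Relation rel, Relation heapRel)
	{
		BlockNumber blkno;
		FullTransactionId safexid;

		/*
		 * Hypothetical crash-safe FSM API: returns a deleted page plus
		 * the safexid recorded for it when it was deleted, or
		 * InvalidBlockNumber when there is no such page.
		 */
		blkno = GetFreePageWithSafeXid(rel, &safexid);

		/*
		 * Recycle only once no backend can still care about the page's
		 * old contents (a real version would also have to put a
		 * not-yet-safe page back for a later caller).
		 */
		if (blkno != InvalidBlockNumber &&
			GlobalVisCheckRemovableFullXid(heapRel, safexid))
			return ReadBuffer(rel, blkno);

		return InvalidBuffer;	/* caller extends the relation instead */
	}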
For now I have my doubts that it is appropriate to make this change.
It seems as if the question of whether or not
VACUUM_OPTION_PARALLEL_COND_CLEANUP should be used is basically the
same question as "Does the vacuumcleanup() callback for this index AM
look exactly like hashvacuumcleanup()?".
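For reference, hashvacuumcleanup() amounts to little more than
refreshing the page count in the stats struct -- roughly this (a
simplified paraphrase, not the verbatim source):

	IndexBulkDeleteResult *
	hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
	{
		/* If hashbulkdelete() wasn't called, return NULL: nothing changed */
		if (stats == NULL)
			return NULL;

		/* Just refresh the page count used by VACUUM's statistics */
		stats->num_pages = RelationGetNumberOfBlocks(info->index);

		return stats;
	}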
IMO we can live with the current
configuration just in case where the user runs into such rare
situations (especially for the latter case). In most cases, parallel
vacuum workers for index cleanup might exit with no-op but the
side-effect (wasting resources and overhead etc) would not be big. If
we want to enable it only in particular cases, we would need to have
another way for index AM to tell lazy vacuum whether or not to allow a
parallel worker to process the index at that time. What do you think?
I am concerned about unintended consequences, like never noticing that
we should really recycle known deleted pages not yet placed in the FSM
(it's hard to think through very rare cases like this with
confidence). Is it really so bad if we launch parallel workers that we
don't really need for a parallel VACUUM?
--
Peter Geoghegan
On Wed, Mar 24, 2021 at 12:10 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Mar 23, 2021 at 8:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
By this patch series, btree indexes became like hash indexes in terms
of amvacuumcleanup. We do an index scan at btvacuumcleanup() in the
two cases: metapage upgrading and more than 5%
deleted-but-not-yet-recycled pages. Both cases seem rare cases. So do
we want to disable parallel index cleanup for btree indexes like hash
indexes? That is, remove VACUUM_OPTION_PARALLEL_COND_CLEANUP from
amparallelvacuumoptions.
My recent "Recycle nbtree pages deleted during same VACUUM" commit
improved the efficiency of recycling, but I still think that it was a
bit of a hack. Or at least it didn't go far enough in fixing the old
design, which is itself a bit of a hack.
As I said back on February 15, a truly good design for nbtree page
deletion + recycling would have crash safety built in. If page
deletion itself is crash safe, it really makes sense to make
everything crash safe (especially because we're managing large chunks
of equisized free space, unlike in heapam). And as I also said back
then, a 100% crash-safe design could naturally shift the problem of
nbtree page recycle safety from the producer/VACUUM side, to the
consumer/_bt_getbuf() side. It should be completely separated from
when VACUUM runs, and what VACUUM can discover about recycle safety in
passing, at the end.
That approach would completely eliminate the need to do any work in
btvacuumcleanup(), which would make it natural to remove
VACUUM_OPTION_PARALLEL_COND_CLEANUP from nbtree -- the implementation
of btvacuumcleanup() would just look like hashvacuumcleanup() does now
-- it could do practically nothing, making this 100% okay.
For now I have my doubts that it is appropriate to make this change.
It seems as if the question of whether or not
VACUUM_OPTION_PARALLEL_COND_CLEANUP should be used is basically the
same question as "Does the vacuumcleanup() callback for this index AM
look exactly like hashvacuumcleanup()?".IMO we can live with the current
configuration just in case where the user runs into such rare
situations (especially for the latter case). In most cases, parallel
vacuum workers for index cleanup might exit with no-op but the
side-effect (wasting resources and overhead etc) would not be big. If
we want to enable it only in particular cases, we would need to have
another way for index AM to tell lazy vacuum whether or not to allow a
parallel worker to process the index at that time. What do you think?
I am concerned about unintended consequences, like never noticing that
we should really recycle known deleted pages not yet placed in the FSM
(it's hard to think through very rare cases like this with
confidence). Is it really so bad if we launch parallel workers that we
don't really need for a parallel VACUUM?
I don't think it's too bad even if we launch parallel workers for
indexes that don't really need to be processed in parallel. Parallel
workers exit immediately after all indexes are vacuumed, so it would
not affect other parallel operations. Nothing changes in terms of DSM
usage either, since btree indexes already support parallel bulkdelete.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/