New IndexAM API controlling index vacuum strategies

Started by Masahiko Sawada about 5 years ago · 130 messages
#1 Masahiko Sawada
sawada.mshk@gmail.com
1 attachment(s)

Hi all,

I've started this separate thread from [1] for discussing the general
API design of index vacuum.

Summary:

* Call ambulkdelete and amvacuumcleanup even when INDEX_CLEANUP is
false, and leave it to the index AM whether or not to skip them.
* Add a new index AM API, amvacuumstrategy(), that asks the index AM
for its bulk-deletion strategy before calling ambulkdelete.
* Whether or not to remove garbage tuples from the heap depends on
multiple factors, including the INDEX_CLEANUP option and the answer of
amvacuumstrategy() from each index AM.

The first point is to fix the inappropriate behavior discussed in thread [1].

The second and third points introduce a general framework for future
extensibility. This patch does not change user-visible behavior.

The new index AM API, amvacuumstrategy(), is called before
bulkdelete() for each index and asks the index for its bulk-deletion
strategy. Through this API, lazy vacuum asks, "Hey index X, I
collected garbage heap tuples during heap scanning, how urgent is
vacuuming for you?", and the index answers either "it's urgent" when
it wants to do bulk-deletion or "it's not urgent, I can skip it". The
point of this proposal is to decouple heap vacuum and index vacuum so
that we can employ a different strategy for each index. Lazy vacuum
can then decide whether or not to do the heap clean based on the
answers from the indexes.
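
For reference, here is the shape of the new callback as it appears in
the attached PoC patch (the enum, the nbtree implementation, and the
amroutine hook are copied from the patch, condensed):

/* Result value for amvacuumstrategy */
typedef enum IndexVacuumStrategy
{
	INDEX_VACUUM_NONE,		/* No-op, skip bulk-deletion in this vacuum cycle */
	INDEX_VACUUM_BULKDELETE	/* Do ambulkdelete */
} IndexVacuumStrategy;

/* nbtree currently always asks for bulk-deletion */
IndexVacuumStrategy
btvacuumstrategy(IndexVacuumInfo *info)
{
	return INDEX_VACUUM_BULKDELETE;
}

/* registered in bthandler() next to the existing vacuum callbacks */
amroutine->amvacuumstrategy = btvacuumstrategy;
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;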

By default, if all indexes answer 'yes' (meaning they will do
bulkdelete()), lazy vacuum does the heap clean. On the other hand, if
even one index answers 'no' (meaning it will not do bulkdelete()),
lazy vacuum skips the heap clean. Lazy vacuum would also be able to
require indexes to do bulkdelete() for some reason, such as the user
specifying the INDEX_CLEANUP option. It's something like saying "Hey
index X, you answered not to do bulkdelete(), but since heap clean is
necessary for me, please don't skip bulkdelete()".
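
In vacuumlazy.c this decision is encoded roughly as follows (condensed
from choose_vacuum_strategy() in the attached patch):

static void
choose_vacuum_strategy(LVRelStats *vacrelstats, VacuumParams *params,
					   Relation *Irel, int nindexes)
{
	bool	vacuum_heap = true;
	int		i;

	if (params->index_cleanup == VACOPT_TERNARY_ENABLED)
		vacuum_heap = true;			/* user explicitly enabled INDEX_CLEANUP */
	else if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
		vacuum_heap = false;		/* user explicitly disabled it */
	else
	{
		/* 'smart' mode: ask every index via amvacuumstrategy() */
		for (i = 0; i < nindexes; i++)
		{
			IndexVacuumInfo ivinfo;

			ivinfo.index = Irel[i];	/* other fields elided */
			if (index_vacuum_strategy(&ivinfo) == INDEX_VACUUM_NONE)
			{
				vacuum_heap = false;	/* one "no" skips the heap clean */
				break;
			}
		}
	}

	vacrelstats->vacuum_heap = vacuum_heap;
}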

Currently, if the INDEX_CLEANUP option is not set (i.e.
VACOPT_TERNARY_DEFAULT in the code), it's treated as true and we do
the heap clean. But with this patch the default becomes a neutral
state ('smart' mode). This neutral state resolves to "on" or "off"
depending on several factors, including the answers of
amvacuumstrategy(), the table status, and the user's request. In this
context, specifying INDEX_CLEANUP means forcing the neutral state to
"on" or "off" at the user's request. Concretely, the table status
that could influence the decision might be, for instance:

* Removing the LP_DEAD items that accumulate in the heap when
bulkdelete() has been skipped for a long time.
* Making pages all-visible for index-only scans.

There are also potential enhancements enabled by this API:

* If the bottom-up index deletion feature [2] is introduced,
individual indexes on the same table could be in quite different
situations in terms of dead tuple accumulation; some indexes can
delete their garbage index tuples without bulkdelete(). Doing
bulkdelete() on such indexes would then be inefficient. This proposal
solves that problem because we can do bulkdelete() for only a subset
of the indexes on the table.

* If the retail index deletion feature [3] is introduced, we can make
the return value of amvacuumstrategy() a ternary value:
"do_bulkdelete", "do_indexscandelete", and "no" (see the first sketch
after this list).

* We could also introduce a threshold on the number of dead tuples to
control whether or not to do index tuple bulk-deletion (a bulkdelete()
counterpart of vacuum_cleanup_index_scale_factor). In the case where
the amount of dead tuples is only slightly larger than
maintenance_work_mem, the second call to bulkdelete() happens with a
small number of dead tuples, which is inefficient. This proposal also
solves that problem by allowing a subset of indexes to skip
bulkdelete() if their number of dead tuples doesn't exceed the
threshold (see the second sketch after this list).
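
To illustrate the second item above, the return type could simply grow
a third value; the INDEX_VACUUM_INDEXSCANDELETE name below is only a
placeholder, not something in the attached patch:

typedef enum IndexVacuumStrategy
{
	INDEX_VACUUM_NONE,				/* "no": skip index vacuuming entirely */
	INDEX_VACUUM_INDEXSCANDELETE,	/* hypothetical: delete entries by
									 * retail index lookups */
	INDEX_VACUUM_BULKDELETE			/* "do_bulkdelete": full index scan */
} IndexVacuumStrategy;

And for the third item, an individual AM's amvacuumstrategy() could
look roughly like this. This is only a sketch: the
vacuum_bulkdel_scale_factor parameter and the num_dead_tuples field
are hypothetical and do not exist in the attached patch:

IndexVacuumStrategy
btvacuumstrategy(IndexVacuumInfo *info)
{
	/*
	 * Hypothetical: skip bulk-deletion when the dead tuples collected so
	 * far are few relative to the (estimated) number of heap tuples.
	 */
	if (info->num_dead_tuples <
		vacuum_bulkdel_scale_factor * info->num_heap_tuples)
		return INDEX_VACUUM_NONE;

	return INDEX_VACUUM_BULKDELETE;
}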

I've attached a PoC patch for the above idea. By default, lazy vacuum
chooses the bulk-deletion strategy based on the answers of
amvacuumstrategy(), so it can be either true or false (although it's
always true in the current patch). For amvacuumcleanup(), on the other
hand, there is no neutral state; lazy vacuum treats the default as
true.
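
Concretely, the PoC patch passes these decisions down to the index AMs
through two new IndexVacuumInfo fields; for example, the bloom and
GiST changes boil down to (copied from the patch, condensed):

/* at the top of e.g. blbulkdelete() / gistbulkdelete() */
if (info->bulkdelete_skippable)
	return NULL;	/* the corresponding heap tuples won't be deleted,
					 * so index bulk-deletion can be skipped safely */

/* at the top of e.g. blvacuumcleanup() / gistvacuumcleanup() */
if (info->analyze_only || !info->vacuumcleanup_requested)
	return stats;	/* ANALYZE only, or INDEX_CLEANUP is disabled */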

Comments and feedback are very welcome.

Regards,

[1]: /messages/by-id/20200415233848.saqp72pcjv2y6ryi@alap3.anarazel.de
[2]: /messages/by-id/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com
[3]: /messages/by-id/425db134-8bba-005c-b59d-56e50de3b41e@postgrespro.ru

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

Attachments:

poc_vacuumstrategy.patch (application/octet-stream)
diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index 23aa7ac441..e07b71a336 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -201,6 +201,7 @@ extern void blendscan(IndexScanDesc scan);
 extern IndexBuildResult *blbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void blbuildempty(Relation index);
+extern IndexVacuumStrategy blvacuumstrategy(IndexVacuumInfo *info);
 extern IndexBulkDeleteResult *blbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
 										   void *callback_state);
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 26b9927c3a..4ea0cfc1d8 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -131,6 +131,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = blbuild;
 	amroutine->ambuildempty = blbuildempty;
 	amroutine->aminsert = blinsert;
+	amroutine->amvacuumstrategy = blvacuumstrategy;
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
index 3282adde03..32150493ee 100644
--- a/contrib/bloom/blvacuum.c
+++ b/contrib/bloom/blvacuum.c
@@ -23,6 +23,15 @@
 #include "storage/lmgr.h"
 
 
+/*
+ * Choose the vacuum strategy. Currently always do ambulkdelete.
+ */
+IndexVacuumStrategy
+blvacuumstrategy(IndexVacuumInfo *info)
+{
+	return INDEX_VACUUM_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
@@ -45,6 +54,13 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	BloomMetaPageData *metaData;
 	GenericXLogState *gxlogState;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted.
+	 */
+	if (info->bulkdelete_skippable)
+		return NULL;
+
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
 
@@ -172,7 +188,7 @@ blvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	BlockNumber npages,
 				blkno;
 
-	if (info->analyze_only)
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	if (stats == NULL)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1f72562c60..707c096e81 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -112,6 +112,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = brinbuild;
 	amroutine->ambuildempty = brinbuildempty;
 	amroutine->aminsert = brininsert;
+	amroutine->amvacuumstrategy = brinvacuumstrategy;
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
@@ -770,10 +771,20 @@ brinbuildempty(Relation index)
 	UnlockReleaseBuffer(metabuf);
 }
 
+/*
+ * Choose the vacuum strategy. Currently always do ambulkdelete.
+ */
+IndexVacuumStrategy
+brinvacuumstrategy(IndexVacuumInfo *info)
+{
+	return INDEX_VACUUM_BULKDELETE;
+}
+
 /*
  * brinbulkdelete
  *		Since there are no per-heap-tuple index tuples in BRIN indexes,
- *		there's not a lot we can do here.
+ *		there's not a lot we can do here regardless of
+ *		info->bulkdelete_skippable.
  *
  * XXX we could mark item tuples as "dirty" (when a minimum or maximum heap
  * tuple is deleted), meaning the need to re-run summarization on the affected
@@ -799,8 +810,11 @@ brinvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
 	Relation	heapRel;
 
-	/* No-op in ANALYZE ONLY mode */
-	if (info->analyze_only)
+	/*
+	 * No-op in ANALYZE ONLY mode or when user requests to disable index
+	 * cleanup.
+	 */
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	if (!stats)
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index ef9b56fd36..09d1cf5694 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -63,6 +63,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = ginbuild;
 	amroutine->ambuildempty = ginbuildempty;
 	amroutine->aminsert = gininsert;
+	amroutine->amvacuumstrategy = ginvacuumstrategy;
 	amroutine->ambulkdelete = ginbulkdelete;
 	amroutine->amvacuumcleanup = ginvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 0935a6d9e5..bcb804f3ce 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -560,6 +560,15 @@ ginVacuumEntryPage(GinVacuumState *gvs, Buffer buffer, BlockNumber *roots, uint3
 	return (tmppage == origpage) ? NULL : tmppage;
 }
 
+/*
+ * Choose the vacuum strategy. Currently always do ambulkdelete.
+ */
+IndexVacuumStrategy
+ginvacuumstrategy(IndexVacuumInfo *info)
+{
+	return INDEX_VACUUM_BULKDELETE;
+}
+
 IndexBulkDeleteResult *
 ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			  IndexBulkDeleteCallback callback, void *callback_state)
@@ -571,6 +580,13 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	BlockNumber rootOfPostingTree[BLCKSZ / (sizeof(IndexTupleData) + sizeof(ItemId))];
 	uint32		nRoot;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted.
+	 */
+	if (info->bulkdelete_skippable)
+		return NULL;
+
 	gvs.tmpCxt = AllocSetContextCreate(CurrentMemoryContext,
 									   "Gin vacuum temporary context",
 									   ALLOCSET_DEFAULT_SIZES);
@@ -708,6 +724,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		return stats;
 	}
 
+	/* Skip index cleanup if user requests to disable */
+	if (!info->vacuumcleanup_requested)
+		return stats;
+
 	/*
 	 * Set up all-zero stats and cleanup pending inserts if ginbulkdelete
 	 * wasn't called
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 3f2b416ce1..f7d100255d 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,6 +84,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = gistbuild;
 	amroutine->ambuildempty = gistbuildempty;
 	amroutine->aminsert = gistinsert;
+	amroutine->amvacuumstrategy = gistvacuumstrategy;
 	amroutine->ambulkdelete = gistbulkdelete;
 	amroutine->amvacuumcleanup = gistvacuumcleanup;
 	amroutine->amcanreturn = gistcanreturn;
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index a9c616c772..40ff75b1ad 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -52,6 +52,15 @@ static bool gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						   Buffer buffer, OffsetNumber downlink,
 						   Buffer leafBuffer);
 
+/*
+ * Choose the vacuum strategy. Currently always do ambulkdelete.
+ */
+IndexVacuumStrategy
+gistvacuumstrategy(IndexVacuumInfo *info)
+{
+	return INDEX_VACUUM_BULKDELETE;
+}
+
 /*
  * VACUUM bulkdelete stage: remove index entries.
  */
@@ -59,6 +68,13 @@ IndexBulkDeleteResult *
 gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted.
+	 */
+	if (info->bulkdelete_skippable)
+		return NULL;
+
 	/* allocate stats if first time through, else re-use existing struct */
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
@@ -74,8 +90,11 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
-	/* No-op in ANALYZE ONLY mode */
-	if (info->analyze_only)
+	/*
+	 * No-op in ANALYZE ONLY mode or when user requests to disable index
+	 * cleanup.
+	 */
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	/*
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 7c9ccf446c..0ed2bd6717 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -81,6 +81,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = hashbuild;
 	amroutine->ambuildempty = hashbuildempty;
 	amroutine->aminsert = hashinsert;
+	amroutine->amvacuumstrategy = hashvacuumstrategy;
 	amroutine->ambulkdelete = hashbulkdelete;
 	amroutine->amvacuumcleanup = hashvacuumcleanup;
 	amroutine->amcanreturn = NULL;
@@ -443,6 +444,15 @@ hashendscan(IndexScanDesc scan)
 	scan->opaque = NULL;
 }
 
+/*
+ * Choose the vacuum strategy. Currently always do ambulkdelete.
+ */
+IndexVacuumStrategy
+hashvacuumstrategy(IndexVacuumInfo *info)
+{
+	return INDEX_VACUUM_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
@@ -468,6 +478,13 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted.
+	 */
+	if (info->bulkdelete_skippable)
+		return NULL;
+
 	tuples_removed = 0;
 	num_index_tuples = 0;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 25f2d5df1b..93c4488e39 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -214,6 +214,18 @@ typedef struct LVShared
 	double		reltuples;
 	bool		estimated_count;
 
+	/*
+	 * Copied from LVRelStats. It tells index AM that lazy vacuum will remove
+	 * dead tuples from the heap after index vacuum.
+	 */
+	bool vacuum_heap;
+
+	/*
+	 * Copied from LVRelStats. It tells index AM whether amvacuumcleanup is
+	 * requested or not.
+	 */
+	bool vacuumcleanup_requested;
+
 	/*
 	 * In single process lazy vacuum we could consume more memory during index
 	 * vacuuming or cleanup apart from the memory for heap scanning.  In
@@ -293,8 +305,8 @@ typedef struct LVRelStats
 {
 	char	   *relnamespace;
 	char	   *relname;
-	/* useindex = true means two-pass strategy; false means one-pass */
-	bool		useindex;
+	/* hasindex = true means two-pass strategy; false means one-pass */
+	bool		hasindex;
 	/* Overall statistics about rel */
 	BlockNumber old_rel_pages;	/* previous value of pg_class.relpages */
 	BlockNumber rel_pages;		/* total number of pages */
@@ -310,9 +322,11 @@ typedef struct LVRelStats
 	double		tuples_deleted;
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	LVDeadTuples *dead_tuples;
+	bool		vacuum_heap;	/* do we remove dead tuples from the heap? */
 	int			num_index_scans;
 	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
+	bool		vacuumcleanup_requested; /* false if INDEX_CLEANUP is set to false */
 
 	/* Used for error callback */
 	char	   *indname;
@@ -343,6 +357,12 @@ static BufferAccessStrategy vac_strategy;
 static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 						   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
 						   bool aggressive);
+static void choose_vacuum_strategy(LVRelStats *vacrelstats, VacuumParams *params,
+								   Relation *Irel, int nindexes);
+static void lazy_vacuum_table_and_indexes(Relation onerel, VacuumParams *params,
+										  LVRelStats *vacrelstats, Relation *Irel,
+										  int nindexes, IndexBulkDeleteResult **stats,
+										  LVParallelState *lps);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
@@ -442,7 +462,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	ErrorContextCallback errcallback;
 
 	Assert(params != NULL);
-	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
 	/* not every AM requires these to be valid, but heap does */
@@ -501,8 +520,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 
 	/* Open all indexes of the relation */
 	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
-	vacrelstats->useindex = (nindexes > 0 &&
-							 params->index_cleanup == VACOPT_TERNARY_ENABLED);
+	vacrelstats->hasindex = (nindexes > 0);
 
 	/*
 	 * Setup error traceback support for ereport().  The idea is to set up an
@@ -811,14 +829,23 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	vacrelstats->nonempty_pages = 0;
 	vacrelstats->latestRemovedXid = InvalidTransactionId;
 
+	/* index vacuum cleanup is enabled if index cleanup is not
+	 * disabled, i.e., either default or enabled.
+	 */
+	vacrelstats->vacuumcleanup_requested =
+		(params->index_cleanup != VACOPT_TERNARY_DISABLED);
+
 	vistest = GlobalVisTestFor(onerel);
 
 	/*
 	 * Initialize state for a parallel vacuum.  As of now, only one worker can
 	 * be used for an index, so we invoke parallelism only if there are at
-	 * least two indexes on a table.
+	 * least two indexes on a table. When the index cleanup is disabled,
+	 * since index bulk-deletions are likely to be no-op we disable a parallel
+	 * vacuum.
 	 */
-	if (params->nworkers >= 0 && vacrelstats->useindex && nindexes > 1)
+	if (params->nworkers >= 0 && nindexes > 1 &&
+		params->index_cleanup != VACOPT_TERNARY_DISABLED)
 	{
 		/*
 		 * Since parallel workers cannot access data in temporary tables, we
@@ -1050,19 +1077,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				vmbuffer = InvalidBuffer;
 			}
 
-			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(onerel, Irel, indstats,
-									vacrelstats, lps, nindexes);
-
-			/* Remove tuples from heap */
-			lazy_vacuum_heap(onerel, vacrelstats);
-
-			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
-			 */
-			dead_tuples->num_tuples = 0;
+			/* Vacuum the table and its indexes */
+			lazy_vacuum_table_and_indexes(onerel, params, vacrelstats,
+										  Irel, nindexes, indstats,
+										  lps);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
@@ -1515,29 +1533,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * doing a second scan. Also we don't do that but forget dead tuples
 		 * when index cleanup is disabled.
 		 */
-		if (!vacrelstats->useindex && dead_tuples->num_tuples > 0)
+		if (!vacrelstats->hasindex && dead_tuples->num_tuples > 0)
 		{
-			if (nindexes == 0)
-			{
-				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
-				vacuumed_pages++;
-				has_dead_tuples = false;
-			}
-			else
-			{
-				/*
-				 * Here, we have indexes but index cleanup is disabled.
-				 * Instead of vacuuming the dead tuples on the heap, we just
-				 * forget them.
-				 *
-				 * Note that vacrelstats->dead_tuples could have tuples which
-				 * became dead after HOT-pruning but are not marked dead yet.
-				 * We do not process them because it's a very rare condition,
-				 * and the next vacuum will process them anyway.
-				 */
-				Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
-			}
+			Assert(nindexes == 0);
+
+			/* Remove tuples from heap if the table has no index */
+			lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+			vacuumed_pages++;
+			has_dead_tuples = false;
 
 			/*
 			 * Forget the now-vacuumed tuples, and press on, but be careful
@@ -1702,14 +1705,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/* If any tuples need to be deleted, perform final vacuum cycle */
 	/* XXX put a threshold on min number of tuples here? */
 	if (dead_tuples->num_tuples > 0)
-	{
-		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats,
-								lps, nindexes);
-
-		/* Remove tuples from heap */
-		lazy_vacuum_heap(onerel, vacrelstats);
-	}
+		lazy_vacuum_table_and_indexes(onerel, params, vacrelstats,
+									  Irel, nindexes, indstats,
+									  lps);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
@@ -1722,7 +1720,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Do post-vacuum cleanup */
-	if (vacrelstats->useindex)
+	if (vacrelstats->hasindex)
 		lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, nindexes);
 
 	/*
@@ -1775,6 +1773,103 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	pfree(buf.data);
 }
 
+/*
+ * Remove the collected garbage tuples from the table and its indexes.
+ */
+static void
+lazy_vacuum_table_and_indexes(Relation onerel, VacuumParams *params,
+							  LVRelStats *vacrelstats, Relation *Irel,
+							  int nindexes, IndexBulkDeleteResult **indstats,
+							  LVParallelState *lps)
+{
+	/*
+	 * Choose the vacuum strategy for this vacuum cycle.
+	 * choose_vacuum_strategy will set the decision to
+	 * vacrelstats->vacuum_heap.
+	 */
+	choose_vacuum_strategy(vacrelstats, params, Irel, nindexes);
+
+	/* Work on all the indexes, then the heap */
+	lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats, lps,
+							nindexes);
+
+	if (vacrelstats->vacuum_heap)
+	{
+		/* Remove tuples from heap */
+		lazy_vacuum_heap(onerel, vacrelstats);
+	}
+	else
+	{
+		/*
+		 * Here, we don't do heap vacuum in this cycle.
+		 *
+		 * Note that vacrelstats->dead_tuples could have tuples which
+		 * became dead after HOT-pruning but are not marked dead yet.
+		 * We do not process them because it's a very rare condition,
+		 * and the next vacuum will process them anyway.
+		 */
+		Assert(params->index_cleanup != VACOPT_TERNARY_ENABLED);
+	}
+
+	/*
+	 * Forget the now-vacuumed tuples, and press on, but be careful
+	 * not to reset latestRemovedXid since we want that value to be
+	 * valid.
+	 */
+	vacrelstats->dead_tuples->num_tuples = 0;
+}
+
+/*
+ * Decide whether or not we remove the collected garbage tuples from the
+ * heap.
+ */
+static void
+choose_vacuum_strategy(LVRelStats *vacrelstats, VacuumParams *params,
+					   Relation *Irel, int nindexes)
+{
+	bool vacuum_heap = true;
+
+	/*
+	 * If index cleanup option is specified, we use it.
+	 *
+	 * XXX: should we call amvacuumstrategy even if INDEX_CLEANUP
+	 * is specified?
+	 */
+	if (params->index_cleanup == VACOPT_TERNARY_ENABLED)
+		vacuum_heap = true;
+	else if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		vacuum_heap = false;
+	else
+	{
+		int i;
+
+		/*
+		 * If index cleanup option is not specified, we decide the vacuum
+		 * strategy based on the returned values from amvacuumstrategy.
+		 * If even one index returns 'none', we skip heap vacuum in this
+		 * vacuum cycle.
+		 */
+		for (i = 0; i < nindexes; i++)
+		{
+			IndexVacuumStrategy ivacstrat;
+			IndexVacuumInfo ivinfo;
+
+			ivinfo.index = Irel[i];
+			/* XXX: fill other fields */
+
+			ivacstrat = index_vacuum_strategy(&ivinfo);
+
+			if (ivacstrat == INDEX_VACUUM_NONE)
+			{
+				vacuum_heap = false;
+				break;
+			}
+		}
+	}
+
+	vacrelstats->vacuum_heap = vacuum_heap;
+}
+
 /*
  *	lazy_vacuum_all_indexes() -- vacuum all indexes of relation.
  *
@@ -2120,6 +2215,10 @@ lazy_parallel_vacuum_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
 	 */
 	nworkers = Min(nworkers, lps->pcxt->nworkers);
 
+	/* Copy the information to the shared state */
+	lps->lvshared->vacuum_heap = vacrelstats->vacuum_heap;
+	lps->lvshared->vacuumcleanup_requested = vacrelstats->vacuumcleanup_requested;
+
 	/* Setup the shared cost-based vacuum delay and launch workers */
 	if (nworkers > 0)
 	{
@@ -2444,6 +2543,13 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	ivinfo.message_level = elevel;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vac_strategy;
+	ivinfo.vacuumcleanup_requested = vacrelstats->vacuumcleanup_requested;
+
+	/*
+	 * index bulk-deletion can be skipped safely if we won't delete
+	 * garbage tuples from the heap.
+	 */
+	ivinfo.bulkdelete_skippable = !(vacrelstats->vacuum_heap);
 
 	/*
 	 * Update error traceback information.
@@ -2461,11 +2567,16 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	*stats = index_bulk_delete(&ivinfo, *stats,
 							   lazy_tid_reaped, (void *) dead_tuples);
 
-	ereport(elevel,
-			(errmsg("scanned index \"%s\" to remove %d row versions",
-					vacrelstats->indname,
-					dead_tuples->num_tuples),
-			 errdetail_internal("%s", pg_rusage_show(&ru0))));
+	/*
+	 * XXX: we don't want to report if ambulkdelete was no-op because of
+	 * bulkdelete_skippable. But we cannot know it was or not.
+	 */
+	if (*stats)
+		ereport(elevel,
+				(errmsg("scanned index \"%s\" to remove %d row versions",
+						vacrelstats->indname,
+						dead_tuples->num_tuples),
+				 errdetail_internal("%s", pg_rusage_show(&ru0))));
 
 	/* Revert to the previous phase information for error traceback */
 	restore_vacuum_error_info(vacrelstats, &saved_err_info);
@@ -2495,9 +2606,10 @@ lazy_cleanup_index(Relation indrel,
 	ivinfo.report_progress = false;
 	ivinfo.estimated_count = estimated_count;
 	ivinfo.message_level = elevel;
-
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vac_strategy;
+	ivinfo.bulkdelete_skippable = false;
+	ivinfo.vacuumcleanup_requested = vacrelstats->vacuumcleanup_requested;
 
 	/*
 	 * Update error traceback information.
@@ -2844,14 +2956,14 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
  * Return the maximum number of dead tuples we can record.
  */
 static long
-compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
+compute_max_dead_tuples(BlockNumber relblocks, bool hasindex)
 {
 	long		maxtuples;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
 	autovacuum_work_mem != -1 ?
 	autovacuum_work_mem : maintenance_work_mem;
 
-	if (useindex)
+	if (hasindex)
 	{
 		maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
 		maxtuples = Min(maxtuples, INT_MAX);
@@ -2881,7 +2993,7 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 	LVDeadTuples *dead_tuples = NULL;
 	long		maxtuples;
 
-	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->useindex);
+	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->hasindex);
 
 	dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
 	dead_tuples->num_tuples = 0;
@@ -3573,6 +3685,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	vacrelstats.indname = NULL;
 	vacrelstats.phase = VACUUM_ERRCB_PHASE_UNKNOWN; /* Not yet processing */
 
+	vacrelstats.vacuum_heap = lvshared->vacuum_heap;
+	vacrelstats.vacuumcleanup_requested = lvshared->vacuumcleanup_requested;
+
 	/* Setup error traceback support for ereport() */
 	errcallback.callback = vacuum_error_callback;
 	errcallback.arg = &vacrelstats;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 3fb8688f8f..8df683c640 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -676,6 +676,25 @@ index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap)
 	return ntids;
 }
 
+/* ----------------
+ *		index_vacuum_strategy - decide whether or not to bulkdelete
+ *
+ * This callback routine is called just before calling ambulkdelete.
+ * Returns IndexVacuumStrategy to tell the lazy vacuum whether we do
+ * bulkdelete.
+ * ----------------
+ */
+IndexVacuumStrategy
+index_vacuum_strategy(IndexVacuumInfo *info)
+{
+	Relation	indexRelation = info->index;
+
+	RELATION_CHECKS;
+	CHECK_REL_PROCEDURE(amvacuumstrategy);
+
+	return indexRelation->rd_indam->amvacuumstrategy(info);
+}
+
 /* ----------------
  *		index_bulk_delete - do mass deletion of index entries
  *
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 0abec10798..38d6a60199 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -133,6 +133,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = btbuild;
 	amroutine->ambuildempty = btbuildempty;
 	amroutine->aminsert = btinsert;
+	amroutine->amvacuumstrategy = btvacuumstrategy;
 	amroutine->ambulkdelete = btbulkdelete;
 	amroutine->amvacuumcleanup = btvacuumcleanup;
 	amroutine->amcanreturn = btcanreturn;
@@ -821,6 +822,18 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 		 */
 		result = true;
 	}
+	else if (!info->vacuumcleanup_requested)
+	{
+		/*
+		 * Skip cleanup if INDEX_CLEANUP is set to false, even if there might
+		 * be a deleted page that can be recycled. If INDEX_CLEANUP continues
+		 * to be disabled, recyclable pages could be left by XID wraparound.
+		 * But in practice it's not so harmful since such workload doesn't need
+		 * to delete and recycle pages in any case and deletion of btree index
+		 * pages is relatively rare.
+		 */
+		result = false;
+	}
 	else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
 			 GlobalVisCheckRemovableXid(NULL, metad->btm_oldest_btpo_xact))
 	{
@@ -863,6 +876,15 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	return result;
 }
 
+/*
+ * Choose the vacuum strategy. Currently always do ambulkdelete.
+ */
+IndexVacuumStrategy
+btvacuumstrategy(IndexVacuumInfo *info)
+{
+	return INDEX_VACUUM_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
@@ -877,6 +899,13 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Relation	rel = info->index;
 	BTCycleId	cycleid;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted.
+	 */
+	if (info->bulkdelete_skippable)
+		return NULL;
+
 	/* allocate stats if first time through, else re-use existing struct */
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 64d3ba8288..b18858a50e 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -66,6 +66,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = spgbuild;
 	amroutine->ambuildempty = spgbuildempty;
 	amroutine->aminsert = spginsert;
+	amroutine->amvacuumstrategy = spgvacuumstrategy;
 	amroutine->ambulkdelete = spgbulkdelete;
 	amroutine->amvacuumcleanup = spgvacuumcleanup;
 	amroutine->amcanreturn = spgcanreturn;
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index e1c58933f9..9aafcf9347 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -894,6 +894,15 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	bds->stats->pages_free = bds->stats->pages_deleted;
 }
 
+/*
+ * Choose the vacuum strategy. Currently always do ambulkdelete.
+ */
+IndexVacuumStrategy
+spgvacuumstrategy(IndexVacuumInfo *info)
+{
+	return INDEX_VACUUM_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
@@ -907,6 +916,13 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 {
 	spgBulkDeleteState bds;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted.
+	 */
+	if (info->bulkdelete_skippable)
+		return NULL;
+
 	/* allocate stats if first time through, else re-use existing struct */
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
@@ -937,8 +953,11 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
 	spgBulkDeleteState bds;
 
-	/* No-op in ANALYZE ONLY mode */
-	if (info->analyze_only)
+	/*
+	 * No-op in ANALYZE ONLY mode or when user requests to disable index
+	 * cleanup.
+	 */
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 731610c701..abd8d1844e 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3401,6 +3401,8 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.bulkdelete_skippable = false;
+	ivinfo.vacuumcleanup_requested = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..4e46e920cf 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -668,6 +668,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.vacuumcleanup_requested = true;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 98270a1049..6a182ba9cd 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1870,14 +1870,18 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	onerelid = onerel->rd_lockInfo.lockRelId;
 	LockRelationIdForSession(&onerelid, lmode);
 
-	/* Set index cleanup option based on reloptions if not yet */
+	/* Set index cleanup option if either reloptions or INDEX_CLEANUP vacuum
+	 * command option is set.
+	 */
 	if (params->index_cleanup == VACOPT_TERNARY_DEFAULT)
 	{
-		if (onerel->rd_options == NULL ||
-			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
-			params->index_cleanup = VACOPT_TERNARY_ENABLED;
-		else
-			params->index_cleanup = VACOPT_TERNARY_DISABLED;
+		if (onerel->rd_options != NULL)
+		{
+			if (((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
+				params->index_cleanup = VACOPT_TERNARY_ENABLED;
+			else
+				params->index_cleanup = VACOPT_TERNARY_DISABLED;
+		}
 	}
 
 	/* Set truncate option based on reloptions if not yet */
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 85b4766016..f885c6ac67 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -111,6 +111,8 @@ typedef bool (*aminsert_function) (Relation indexRelation,
 								   Relation heapRelation,
 								   IndexUniqueCheck checkUnique,
 								   struct IndexInfo *indexInfo);
+/* vacuum strategy */
+typedef IndexVacuumStrategy (*amvacuumstrategy_function) (IndexVacuumInfo *info);
 
 /* bulk delete */
 typedef IndexBulkDeleteResult *(*ambulkdelete_function) (IndexVacuumInfo *info,
@@ -258,6 +260,7 @@ typedef struct IndexAmRoutine
 	ambuild_function ambuild;
 	ambuildempty_function ambuildempty;
 	aminsert_function aminsert;
+	amvacuumstrategy_function amvacuumstrategy;
 	ambulkdelete_function ambulkdelete;
 	amvacuumcleanup_function amvacuumcleanup;
 	amcanreturn_function amcanreturn;	/* can be NULL */
diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h
index 9ffc9100c0..cdf98489cf 100644
--- a/src/include/access/brin_internal.h
+++ b/src/include/access/brin_internal.h
@@ -97,6 +97,7 @@ extern int64 bringetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void brinrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					   ScanKey orderbys, int norderbys);
 extern void brinendscan(IndexScanDesc scan);
+extern IndexVacuumStrategy brinvacuumstrategy(IndexVacuumInfo *info);
 extern IndexBulkDeleteResult *brinbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
 											 IndexBulkDeleteCallback callback,
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 68d90f5141..eea3a28411 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -34,7 +34,8 @@ typedef struct IndexBuildResult
 } IndexBuildResult;
 
 /*
- * Struct for input arguments passed to ambulkdelete and amvacuumcleanup
+ * Struct for input arguments passed to amvacuumstrategy, ambulkdelete
+ * and amvacuumcleanup
  *
  * num_heap_tuples is accurate only when estimated_count is false;
  * otherwise it's just an estimate (currently, the estimate is the
@@ -47,6 +48,22 @@ typedef struct IndexVacuumInfo
 	bool		analyze_only;	/* ANALYZE (without any actual vacuum) */
 	bool		report_progress;	/* emit progress.h status reports */
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
+
+	/*
+	 * Is this ambulkdelete call skippable? If true, since lazy vacuum
+	 * won't delete the garbage tuples from the heap, the index AM can
+	 * skip index bulk-deletion safely. This field is used only when
+	 * ambulkdelete.
+	 */
+	bool		bulkdelete_skippable;
+
+	/*
+	 * amvacuumcleanup is requested by lazy vacuum. If false, the index AM
+	 * can skip index cleanup. This can be false if INDEX_CLEANUP vacuum option
+	 * is set to false. This field is used only when amvacuumcleanup.
+	 */
+	bool		vacuumcleanup_requested;
+
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
@@ -125,6 +142,13 @@ typedef struct IndexOrderByDistance
 	bool		isnull;
 } IndexOrderByDistance;
 
+/* Result value for amvacuumstrategy */
+typedef enum IndexVacuumStrategy
+{
+	INDEX_VACUUM_NONE,		/* No-op, skip bulk-deletion in this vacuum cycle */
+	INDEX_VACUUM_BULKDELETE	/* Do ambulkdelete */
+} IndexVacuumStrategy;
+
 /*
  * generalized index_ interface routines (in indexam.c)
  */
@@ -173,6 +197,7 @@ extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 							   struct TupleTableSlot *slot);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
+extern IndexVacuumStrategy index_vacuum_strategy(IndexVacuumInfo *info);
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
 												IndexBulkDeleteResult *stats,
 												IndexBulkDeleteCallback callback,
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 5cb2f72e4c..21e7282e36 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -396,6 +396,7 @@ extern int64 gingetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void ginInitConsistentFunction(GinState *ginstate, GinScanKey key);
 
 /* ginvacuum.c */
+extern IndexVacuumStrategy ginvacuumstrategy(IndexVacuumInfo *info);
 extern IndexBulkDeleteResult *ginbulkdelete(IndexVacuumInfo *info,
 											IndexBulkDeleteResult *stats,
 											IndexBulkDeleteCallback callback,
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index b68c01a5f2..3d191f241d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -532,6 +532,7 @@ extern void gistMakeUnionKey(GISTSTATE *giststate, int attno,
 extern XLogRecPtr gistGetFakeLSN(Relation rel);
 
 /* gistvacuum.c */
+extern IndexVacuumStrategy gistvacuumstrategy(IndexVacuumInfo *info);
 extern IndexBulkDeleteResult *gistbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
 											 IndexBulkDeleteCallback callback,
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index bab4d9f1b0..a9b99a6fa3 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -371,6 +371,7 @@ extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
 extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					   ScanKey orderbys, int norderbys);
 extern void hashendscan(IndexScanDesc scan);
+extern IndexVacuumStrategy hashvacuumstrategy(IndexVacuumInfo *info);
 extern IndexBulkDeleteResult *hashbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
 											 IndexBulkDeleteCallback callback,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index e8fecc6026..7f74066b44 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1008,6 +1008,7 @@ extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
+extern IndexVacuumStrategy btvacuumstrategy(IndexVacuumInfo *info);
 extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats,
 										   IndexBulkDeleteCallback callback,
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index 9f2ccc1730..33cc62f489 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -211,6 +211,7 @@ extern bool spggettuple(IndexScanDesc scan, ScanDirection dir);
 extern bool spgcanreturn(Relation index, int attno);
 
 /* spgvacuum.c */
+extern IndexVacuumStrategy spgvacuumstrategy(IndexVacuumInfo *info);
 extern IndexBulkDeleteResult *spgbulkdelete(IndexVacuumInfo *info,
 											IndexBulkDeleteResult *stats,
 											IndexBulkDeleteCallback callback,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index a4cd721400..d96e6b6239 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -218,8 +218,10 @@ typedef struct VacuumParams
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which  verbose logs are activated, -1
 									 * to use default */
-	VacOptTernaryValue index_cleanup;	/* Do index vacuum and cleanup,
-										 * default value depends on reloptions */
+	VacOptTernaryValue index_cleanup;	/* Do index vacuum and cleanup. In
+										 * default mode, it's decided based on
+										 * multiple factors. See
+										 * choose_vacuum_strategy. */
 	VacOptTernaryValue truncate;	/* Truncate empty pages at the end,
 									 * default value depends on reloptions */
 
#2 Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#1)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Dec 22, 2020 at 2:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've started this separate thread from [1] for discussing the general
API design of index vacuum.

This is a very difficult and very important problem. Clearly defining
the problem is probably the hardest part. This prototype patch seems
like a good start, though.

Private discussion between Masahiko and myself led to a shared
understanding of what the best *general* direction is for VACUUM now.
It is necessary to deal with several problems all at once here, and to
at least think about several more problems that will need to be solved
later. If anybody reading the thread initially finds it hard to see
the connection between the specific items that Masahiko has
introduced, they should note that that's *expected*.

Summary:

* Call ambulkdelete and amvacuumcleanup even when INDEX_CLEANUP is
false, and leave it to the index AM whether or not to skip them.

Makes sense. I like the way you unify INDEX_CLEANUP and the
vacuum_cleanup_index_scale_factor stuff in a way that is now quite
explicit and obvious in the code.

The second and third points are to introduce a general framework for
future extensibility. User-visible behavior is not changed by this
change.

In some ways the ideas in your patch might be considered radical, or
at least novel: they introduce the idea that bloat can be a
qualitative thing. But at the same time the design is quite
conservative: these are fairly isolated changes, at least code-wise. I
am not 100% sure that this approach will be successful in
vacuumlazy.c, in the end (I'm ~95% sure). But I am 100% sure that our
collective understanding of the problems in this area will be
significantly improved by this effort. A fundamental rethink does not
necessarily require a fundamental redesign, and yet it might be just
as effective.

This is certainly what I see when testing my bottom-up index deletion
patch, which adds an incremental index deletion mechanism that merely
intervenes in a precise, isolated way. Despite my patch's simplicity,
it manages to practically eliminate an entire important *class* of
index bloat (at least once you make certain mild assumptions about the
duration of snapshots). Sometimes it is possible to solve a hard
problem by thinking about it only *slightly* differently.

This is a tantalizing possibility for VACUUM, too. I'm willing to risk
sounding grandiose if that's what it takes to get more hackers
interested in these questions. With that in mind, here is a summary of
the high level hypothesis behind this VACUUM patch:

VACUUM can and should be reimagined as a top-down mechanism that
complements various bottom-up mechanisms (including the stuff from my
deletion patch, heap pruning, and possibly an enhanced version of heap
pruning based on similar principles). This will be possible without
changing any of the fundamental invariants of the current vacuumlazy.c
design. VACUUM's problems are largely pathological behaviors of one
kind or another, that can be fixed with specific well-targeted
interventions. Workload characteristics can naturally determine how
much of the cleanup is done by VACUUM itself -- large variations are
possible within a single database, and even across indexes on the same
table.

The new index AM API, amvacuumstrategy(), which is called before
bulkdelete() for each index and asks the index bulk-deletion strategy.
On this API, lazy vacuum asks, "Hey index X, I collected garbage heap
tuples during heap scanning, how urgent is vacuuming for you?", and
the index answers either "it's urgent" when it wants to do
bulk-deletion or "it's not urgent, I can skip it". The point of this
proposal is to isolate heap vacuum and index vacuum for each index so
that we can employ different strategies for each index. Lazy vacuum
can decide whether or not to do heap clean based on the answers from
the indexes.

Right -- workload characteristics (plus appropriate optimizations at
the local level) make it possible that amvacuumstrategy() will give
*very* different answers from different indexes *on the same table*.
The idea that all indexes on the table are more or less equally
bloated at any given point in time is mostly wrong. Actually,
*sometimes* it really is correct! But other times it is *dramatically
wrong* -- it all depends on workload characteristics. What is likely
to be true *on average* across all tables/indexes is *irrelevant* (the
mean/average is simply not a useful concept, in fact).

The basic lazy vacuum design needs to recognize this important
difference, and other similar issues. That's the point of
amvacuumstrategy().

Currently, if INDEX_CLEANUP option is not set (i.e.
VACOPT_TERNARY_DEFAULT in the code), it's treated as true and will do
heap clean. But with this patch we use the default as a neutral state
('smart' mode). This neutral state could be "on" and "off" depending
on several factors including the answers of amvacuumstrategy(), the
table status, and user's request. In this context, specifying
INDEX_CLEANUP would mean making the neutral state "on" or "off" by
user's request. The table status that could influence the decision
could concretely be, for instance:

* Removing LP_DEAD accumulation due to skipping bulkdelete() for a long time.
* Making pages all-visible for index-only scan.

So you have several different kinds of back pressure - 'smart' mode
really is smart.

Also there are potential enhancements using this API:

* If retail index deletion feature[3] is introduced, we can make the
return value of amvacuumstrategy() a ternary value: "do_bulkdelete",
"do_indexscandelete", and "no".

Makes sense.

* We probably can introduce a threshold of the number of dead tuples
to control whether or not to do index tuple bulk-deletion (like
bulkdelete() version of vacuum_cleanup_index_scale_factor). In the
case where the amount of dead tuples is slightly larger than
maintenance_work_mem the second time calling to bulkdelete will be
called with a small number of dead tuples, which is inefficient. This
problem is also solved by this proposal by allowing a subset of
indexes to skip bulkdelete() if the number of dead tuple doesn't
exceed the threshold.

Good idea. I bet other people can come up with other ideas a little
like this just by thinking about it. The "untangling" performed by
your patch creates many possibilities.

I’ve attached the PoC patch for the above idea. By default, lazy
vacuum chooses the vacuum bulkdelete strategy based on answers of
amvacuumstrategy() so it can be either true or false (although it’s
always true in the current patch). But for amvacuumcleanup() there is
no neutral state, lazy vacuum treats the default as true.

As you said, the next question must be: How do we teach lazy vacuum to
not do what gets requested by amvacuumcleanup() when it cannot respect
the wishes of one individual index, for example when the
accumulation of LP_DEAD items in the heap becomes a big problem in
itself? That really could be the thing that forces full heap
vacuuming, even with several indexes.

I will need to experiment in order to improve my understanding of how
to make this cooperate with bottom-up index deletion. But that's
mostly just a question for my patch (and a relatively easy one).

--
Peter Geoghegan

#3 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#2)
Re: New IndexAM API controlling index vacuum strategies

On Thu, Dec 24, 2020 at 12:59 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Dec 22, 2020 at 2:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've started this separate thread from [1] for discussing the general
API design of index vacuum.

This is a very difficult and very important problem. Clearly defining
the problem is probably the hardest part. This prototype patch seems
like a good start, though.

Private discussion between Masahiko and myself led to a shared
understanding of what the best *general* direction is for VACUUM now.
It is necessary to deal with several problems all at once here, and to
at least think about several more problems that will need to be solved
later. If anybody reading the thread initially finds it hard to see
the connection between the specific items that Masahiko has
introduced, they should note that that's *expected*.

Summary:

* Call ambulkdelete and amvacuumcleanup even when INDEX_CLEANUP is
false, and leave it to the index AM whether or not to skip them.

Makes sense. I like the way you unify INDEX_CLEANUP and the
vacuum_cleanup_index_scale_factor stuff in a way that is now quite
explicit and obvious in the code.

The second and third points are to introduce a general framework for
future extensibility. User-visible behavior is not changed by this
change.

In some ways the ideas in your patch might be considered radical, or
at least novel: they introduce the idea that bloat can be a
qualitative thing. But at the same time the design is quite
conservative: these are fairly isolated changes, at least code-wise. I
am not 100% sure that this approach will be successful in
vacuumlazy.c, in the end (I'm ~95% sure). But I am 100% sure that our
collective understanding of the problems in this area will be
significantly improved by this effort. A fundamental rethink does not
necessarily require a fundamental redesign, and yet it might be just
as effective.

This is certainly what I see when testing my bottom-up index deletion
patch, which adds an incremental index deletion mechanism that merely
intervenes in a precise, isolated way. Despite my patch's simplicity,
it manages to practically eliminate an entire important *class* of
index bloat (at least once you make certain mild assumptions about the
duration of snapshots). Sometimes it is possible to solve a hard
problem by thinking about it only *slightly* differently.

This is a tantalizing possibility for VACUUM, too. I'm willing to risk
sounding grandiose if that's what it takes to get more hackers
interested in these questions. With that in mind, here is a summary of
the high level hypothesis behind this VACUUM patch:

VACUUM can and should be reimagined as a top-down mechanism that
complements various bottom-up mechanisms (including the stuff from my
deletion patch, heap pruning, and possibly an enhanced version of heap
pruning based on similar principles). This will be possible without
changing any of the fundamental invariants of the current vacuumlazy.c
design. VACUUM's problems are largely pathological behaviors of one
kind or another, that can be fixed with specific well-targeted
interventions. Workload characteristics can naturally determine how
much of the cleanup is done by VACUUM itself -- large variations are
possible within a single database, and even across indexes on the same
table.

Agreed.

Ideally, the bottom-up mechanisms work well and reclaim almost all
garbage. VACUUM should be a feature that complements them when the
bottom-up mechanisms cannot work well for some reason, and that is
also used to make sure that all collected garbage has been vacuumed.
For heaps, we already have such mechanisms: opportunistic HOT pruning
and lazy vacuum. For indexes, especially btree indexes, the bottom-up
index deletion and ambulkdelete() would have a similar relationship.

The new index AM API, amvacuumstrategy(), which is called before
bulkdelete() for each index and asks the index bulk-deletion strategy.
On this API, lazy vacuum asks, "Hey index X, I collected garbage heap
tuples during heap scanning, how urgent is vacuuming for you?", and
the index answers either "it's urgent" when it wants to do
bulk-deletion or "it's not urgent, I can skip it". The point of this
proposal is to isolate heap vacuum and index vacuum for each index so
that we can employ different strategies for each index. Lazy vacuum
can decide whether or not to do heap clean based on the answers from
the indexes.

Right -- workload characteristics (plus appropriate optimizations at
the local level) make it possible that amvacuumstrategy() will give
*very* different answers from different indexes *on the same table*.
The idea that all indexes on the table are more or less equally
bloated at any given point in time is mostly wrong. Actually,
*sometimes* it really is correct! But other times it is *dramatically
wrong* -- it all depends on workload characteristics. What is likely
to be true *on average* across all tables/indexes is *irrelevant* (the
mean/average is simply not a useful concept, in fact).

The basic lazy vacuum design needs to recognize this important
difference, and other similar issues. That's the point of
amvacuumstrategy().

Agreed.

In terms of bloat, the characteristics of the index AM also bring
such differences (e.g., btree vs. BRIN). With the bottom-up index
deletion feature, even btree indexes on the same table will differ
from each other.

Currently, if INDEX_CLEANUP option is not set (i.e.
VACOPT_TERNARY_DEFAULT in the code), it's treated as true and will do
heap clean. But with this patch we use the default as a neutral state
('smart' mode). This neutral state could be "on" and "off" depending
on several factors including the answers of amvacuumstrategy(), the
table status, and user's request. In this context, specifying
INDEX_CLEANUP would mean making the neutral state "on" or "off" by
user's request. The table status that could influence the decision
could concretely be, for instance:

* Removing LP_DEAD accumulation due to skipping bulkdelete() for a long time.
* Making pages all-visible for index-only scan.

So you have several different kinds of back pressure - 'smart' mode
really is smart.

Also there are potential enhancements using this API:

* If retail index deletion feature[3] is introduced, we can make the
return value of amvacuumstrategy() a ternary value: "do_bulkdelete",
"do_indexscandelete", and "no".

Makes sense.

* We probably can introduce a threshold of the number of dead tuples
to control whether or not to do index tuple bulk-deletion (like
bulkdelete() version of vacuum_cleanup_index_scale_factor). In the
case where the amount of dead tuples is slightly larger than
maitenance_work_mem the second time calling to bulkdelete will be
called with a small number of dead tuples, which is inefficient. This
problem is also solved by this proposal by allowing a subset of
indexes to skip bulkdelete() if the number of dead tuple doesn't
exceed the threshold.

Good idea. I bet other people can come up with other ideas a little
like this just by thinking about it. The "untangling" performed by
your patch creates many possibilities.

I've attached the PoC patch for the above idea. By default, lazy
vacuum chooses the bulkdelete strategy based on the answers from
amvacuumstrategy(), so it can be either true or false (although it's
always true in the current patch). But for amvacuumcleanup() there is
no neutral state; lazy vacuum treats the default as true.
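
To illustrate the shape of that decision, the logic is roughly the
following (a simplified sketch, not the exact patch code;
index_vacuum_strategy() here is just a stand-in for however lazy
vacuum ends up invoking the amvacuumstrategy() callback):

    bool    vacuum_heap = true;

    for (int i = 0; i < nindexes; i++)
    {
        /* hypothetical wrapper around the index AM's amvacuumstrategy() */
        IndexVacuumStrategy ivstrat = index_vacuum_strategy(Irel[i]);

        /*
         * If even one index says it will skip bulk deletion, its dead index
         * tuples may still point at the dead heap tuples, so we cannot do
         * the second heap pass (unless something else forces it, such as
         * INDEX_CLEANUP being set to on).
         */
        if (ivstrat == INDEX_VACUUM_NONE)
        {
            vacuum_heap = false;
            break;
        }
    }

    vacrelstats->vacuum_heap = vacuum_heap;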

As you said, the next question must be: How do we teach lazy vacuum to
not do what gets requested by amvacuumcleanup() when it cannot respect
the wishes of one individual indexes, for example when the
accumulation of LP_DEAD items in the heap becomes a big problem in
itself? That really could be the thing that forces full heap
vacuuming, even with several indexes.

You mean requested by amvacuumstrategy(), not by amvacuumcleanup()? I
think amvacuumstrategy() affects only ambulkdelete(). But when all
ambulkdelete() calls are skipped at the request of the index AMs, we
might want to skip amvacuumcleanup() as well.

I will need to experiment in order to improve my understanding of how
to make this cooperate with bottom-up index deletion. But that's
mostly just a question for my patch (and a relatively easy one).

Yeah, I think we might need something like statistics about garbage
per index so that each index can make a different decision based on
its status. For example, a btree index might want to skip
ambulkdelete() if it has only a few dead index tuples in its leaf
pages. Such statistics could live in the stats collector or in the
btree meta page.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#4Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#3)
Re: New IndexAM API controlling index vacuum strategies

On Sun, Dec 27, 2020 at 10:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

As you said, the next question must be: How do we teach lazy vacuum to
not do what gets requested by amvacuumcleanup() when it cannot respect
the wishes of one individual indexes, for example when the
accumulation of LP_DEAD items in the heap becomes a big problem in
itself? That really could be the thing that forces full heap
vacuuming, even with several indexes.

You mean requested by amvacuumstreategy(), not by amvacuumcleanup()? I
think amvacuumstrategy() affects only ambulkdelete(). But when all
ambulkdelete() were skipped by the requests by index AMs we might want
to skip amvacuumcleanup() as well.

No, I was asking about how we should decide to do a real VACUUM (a
real ambulkdelete() call) even when no index asks for it because
bottom-up deletion works very well in every index. Clearly we will
need to eventually remove the remaining LP_DEAD items from the heap at
some point if nothing else happens -- eventually LP_DEAD items in the
heap alone will force a traditional heap vacuum (which will still have
to go through indexes that have not grown, just to be safe and avoid
recycling a TID that's still in the index).

Postgres heap fillfactor is 100 by default, though I believe it's 90
in another well known DB system. If you set Postgres heap fill factor
to 90 you can fit a little over 200 LP_DEAD items in the "extra space"
left behind in each heap page after initial bulk loading/INSERTs take
place that respect our lower fill factor setting. This is about 4x the
number of initial heap tuples in the pgbench_accounts table -- it's
quite a lot!
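
(The arithmetic, roughly: with heap fill factor 90 about 10% of each
8192 byte page is reserved, i.e. ~819 bytes, and once a dead tuple has
been pruned an LP_DEAD item only occupies its 4 byte line pointer, so
819 / 4 is ~204 items. That ignores alignment and assumes the reserved
space isn't used for anything else, so treat it as a ballpark figure
rather than an exact one.)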

If we pessimistically assume that all updates are non-HOT updates,
we'll still usually have enough space for each logical row to get
updated several times before the heap page "overflows". Even when
there is significant skew in the UPDATEs, the skew is not noticeable
at the level of individual heap pages. We have a surprisingly large
general capacity to temporarily "absorb" extra garbage LP_DEAD items
in heap pages this way. Nobody really cared about this extra capacity
very much before now, because it did not help with the big problem of
index bloat that you naturally see with this workload. But that big
problem may go away soon, and so this extra capacity may become
important at the same time.

I think that it could make sense for lazy_scan_heap() to maintain
statistics about the number of LP_DEAD items remaining in each heap
page (just local stack variables). From there, it can pass the
statistics to the choose_vacuum_strategy() function from your patch.
Perhaps choose_vacuum_strategy() will notice that the heap page with
the most LP_DEAD items encountered within lazy_scan_heap() (among
those encountered so far in the event of multiple index passes) has
too many LP_DEAD items -- this indicates that there is a danger that
some heap pages will start to "overflow" soon, which is now a problem
that lazy_scan_heap() must think about. Maybe if the "extra space"
left by applying heap fill factor (with settings below 100) is
insufficient to fit perhaps 2/3 of the LP_DEAD items needed on the
heap page that has the most LP_DEAD items (among all heap pages), we
stop caring about what amvacuumstrategy()/the indexes say. So we do
the right thing for the heap pages, while still mostly avoiding index
vacuuming and the final heap pass.
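
Concretely, I imagine the check looking something like this (just a
sketch; maxdeadpage stands for the per-page maximum tracked by
lazy_scan_heap(), and fillfactor would come from the relation's
reloptions):

    /* free space reserved on each heap page by a non-default fillfactor */
    Size    reserved = BLCKSZ * (100 - fillfactor) / 100;
    /* number of LP_DEAD line pointers that fit in that space (~204 at ff 90) */
    int     lpdead_capacity = reserved / sizeof(ItemIdData);

    /* force index vacuuming + lazy_vacuum_heap() before heap pages overflow */
    if (maxdeadpage > lpdead_capacity * 2 / 3)
        vacuum_heap = true;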

I experimented with this today, and I think that it is a good way to
do it. I like the idea of choose_vacuum_strategy() understanding that
heap pages that are subject to many non-HOT updates have a "natural
extra capacity for LP_DEAD items" that it must care about directly (at
least with non-default heap fill factor settings). My early testing
shows that it will often take a surprisingly long time for the most
heavily updated heap page to have more than about 100 LP_DEAD items.

I will need to experiment in order to improve my understanding of how
to make this cooperate with bottom-up index deletion. But that's
mostly just a question for my patch (and a relatively easy one).

Yeah, I think we might need something like statistics about garbage
per index so that individual index can make a different decision based
on their status. For example, a btree index might want to skip
ambulkdelete() if it has a few dead index tuples in its leaf pages. It
could be on stats collector or on btree's meta page.

Right. I think that even a very conservative approach could work well.
For example, maybe we teach nbtree's amvacuumstrategy() routine to ask
to do a real ambulkdelete(), except in the extreme case where the
index is *exactly* the same size as it was after the last VACUUM.
This will happen regularly with bottom-up index deletion. Maybe that
approach is a bit too conservative, though.
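
In code, that might look something like this (a sketch only;
_bt_prev_vacuum_nblocks() is a made-up helper that would read a value
remembered somewhere, e.g. in the metapage):

    IndexVacuumStrategy
    btvacuumstrategy(IndexVacuumInfo *info)
    {
        Relation    rel = info->index;
        BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
        /* made-up helper: index size recorded at the end of the last VACUUM */
        BlockNumber prev_nblocks = _bt_prev_vacuum_nblocks(rel);

        /* Only skip bulk deletion if the index did not grow at all */
        if (nblocks == prev_nblocks)
            return INDEX_VACUUM_NONE;

        return INDEX_VACUUM_BULKDELETE;
    }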

--
Peter Geoghegan

#5Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#4)
1 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan <pg@bowt.ie> wrote:

I experimented with this today, and I think that it is a good way to
do it. I like the idea of choose_vacuum_strategy() understanding that
heap pages that are subject to many non-HOT updates have a "natural
extra capacity for LP_DEAD items" that it must care about directly (at
least with non-default heap fill factor settings). My early testing
shows that it will often take a surprisingly long time for the most
heavily updated heap page to have more than about 100 LP_DEAD items.

Attached is a rough patch showing what I did here. It was applied on
top of my bottom-up index deletion patch series and your
poc_vacuumstrategy.patch patch. This patch was written as a quick and
dirty way of simulating what I thought would work best for bottom-up
index deletion for one specific benchmark/test, which was
non-hot-update heavy. This consists of a variant pgbench with several
indexes on pgbench_accounts (almost the same as most other bottom-up
deletion benchmarks I've been running). Only one index is "logically
modified" by the updates, but of course we still physically modify all
indexes on every update. I set fill factor to 90 for this benchmark,
which is an important factor for how your VACUUM patch works during
the benchmark.

This rough supplementary patch includes VACUUM logic that assumes (but
doesn't check) that the table has heap fill factor set to 90 -- see my
changes to choose_vacuum_strategy(). This benchmark is really about
stability over time more than performance (though performance is also
improved significantly). I wanted to keep both the table/heap and the
logically unmodified indexes (i.e. 3 out of 4 indexes on
pgbench_accounts) exactly the same size *forever*.

Does this make sense?

Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that
pg_stat_database shows an almost ~28% reduction in blks_read after an
overnight run for the patch series (it was 508,820,699 for the
patches, 705,282,975 for the master branch). I think that the VACUUM
component is responsible for some of that reduction. There were 11
VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap()
(these 7 VACUUM operations each did only a btbulkdelete() call for the
one problematic index on the table, named "abalance_ruin", which my
supplementary patch has hard-coded knowledge of).

--
Peter Geoghegan

Attachments:

0007-btvacuumstrategy-bottom-up-index-deletion-changes.patch (application/octet-stream)
From 5ae5dde505ded1f555324382f9db6e7fbd114492 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 23 Dec 2020 20:42:53 -0800
Subject: [PATCH 7/8] btvacuumstrategy() bottom-up index deletion changes

---
 src/backend/access/heap/vacuumlazy.c | 69 +++++++++++++++++++++++++---
 src/backend/access/nbtree/nbtree.c   | 35 ++++++++++++--
 src/backend/commands/vacuum.c        |  6 +++
 3 files changed, 100 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 93c4488e39..c45c49d561 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -358,11 +358,13 @@ static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 						   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
 						   bool aggressive);
 static void choose_vacuum_strategy(LVRelStats *vacrelstats, VacuumParams *params,
-								   Relation *Irel, int nindexes);
+								   Relation *Irel, int nindexes, double live_tuples,
+								   int maxdeadpage);
 static void lazy_vacuum_table_and_indexes(Relation onerel, VacuumParams *params,
 										  LVRelStats *vacrelstats, Relation *Irel,
 										  int nindexes, IndexBulkDeleteResult **stats,
-										  LVParallelState *lps);
+										  LVParallelState *lps, double live_tuples,
+										  int maxdeadpage);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
@@ -781,6 +783,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber empty_pages,
 				vacuumed_pages,
 				next_fsm_block_to_vacuum;
+	int			maxdeadpage = 0; /* controls if we skip heap vacuum scan */
 	double		num_tuples,		/* total number of nonremovable tuples */
 				live_tuples,	/* live tuples (reltuples estimate) */
 				tups_vacuumed,	/* tuples cleaned up by vacuum */
@@ -1080,7 +1083,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			/* Vacuum the table and its indexes */
 			lazy_vacuum_table_and_indexes(onerel, params, vacrelstats,
 										  Irel, nindexes, indstats,
-										  lps);
+										  lps, live_tuples, maxdeadpage);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
@@ -1666,6 +1669,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (dead_tuples->num_tuples == prev_dead_count)
 			RecordPageWithFreeSpace(onerel, blkno, freespace);
+		else
+			maxdeadpage = Max(maxdeadpage,
+							  dead_tuples->num_tuples - prev_dead_count);
 	}
 
 	/* report that everything is scanned and vacuumed */
@@ -1707,7 +1713,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	if (dead_tuples->num_tuples > 0)
 		lazy_vacuum_table_and_indexes(onerel, params, vacrelstats,
 									  Irel, nindexes, indstats,
-									  lps);
+									  lps, live_tuples, maxdeadpage);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
@@ -1780,14 +1786,16 @@ static void
 lazy_vacuum_table_and_indexes(Relation onerel, VacuumParams *params,
 							  LVRelStats *vacrelstats, Relation *Irel,
 							  int nindexes, IndexBulkDeleteResult **indstats,
-							  LVParallelState *lps)
+							  LVParallelState *lps, double live_tuples,
+							  int maxdeadpage)
 {
 	/*
 	 * Choose the vacuum strategy for this vacuum cycle.
 	 * choose_vacuum_strategy will set the decision to
 	 * vacrelstats->vacuum_heap.
 	 */
-	choose_vacuum_strategy(vacrelstats, params, Irel, nindexes);
+	choose_vacuum_strategy(vacrelstats, params, Irel, nindexes, live_tuples,
+						   maxdeadpage);
 
 	/* Work on all the indexes, then the heap */
 	lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats, lps,
@@ -1825,7 +1833,8 @@ lazy_vacuum_table_and_indexes(Relation onerel, VacuumParams *params,
  */
 static void
 choose_vacuum_strategy(LVRelStats *vacrelstats, VacuumParams *params,
-					   Relation *Irel, int nindexes)
+					   Relation *Irel, int nindexes, double live_tuples,
+					   int maxdeadpage)
 {
 	bool vacuum_heap = true;
 
@@ -1865,6 +1874,52 @@ choose_vacuum_strategy(LVRelStats *vacrelstats, VacuumParams *params,
 				break;
 			}
 		}
+
+		/*
+		 * XXX: This 130 test is for the maximum number of LP_DEAD items on
+		 * any one heap page encountered during heap scan by caller.  The
+		 * general idea here is to preserve the original pristine state of the
+		 * table when it is subject to constant non-HOT updates when heap fill
+		 * factor is reduced from its default.
+		 *
+		 * If we do this right (and with bottom-up index deletion), the
+		 * overall effect for non-HOT-update heavy workloads is that both
+		 * table and indexes (or at least a subset of indexes on the table
+		 * that are never logically modified by the updates) never grow even
+		 * by one block.  We can actually make those things perfectly stable
+		 * over time in the absence of queries that hold open MVCC snapshots
+		 * for a long time.  Stability is perhaps the most important thing
+		 * here (not performance per se).
+		 *
+		 * The exact number used here (130) is based on the assumption that
+		 * heap fillfactor is set to 90 in this table -- we can fit roughly
+		 * 200 "extra" LP_DEAD items on heap pages before they start to
+		 * "overflow" with that setting (e.g. before a pgbench_accounts table
+		 * that is subject to constant non-HOT updates needs to allocate new
+		 * pages just for new versions).  We're trying to avoid having VACUUM
+		 * call lazy_vacuum_heap() in most cases, but we don't want to be too
+		 * aggressive: it would be risky to make the value we test for much
+		 * higher/closer to ~200, since it might be too late by the time we
+		 * actually call lazy_vacuum_heap().  (Unsure of this, but that's the
+		 * idea, at least.)
+		 *
+		 * Since we're mostly worried about stability over time here, we have
+		 * to be worried about "small" effects.  If there are just a few heap
+		 * page overflows in each VACUUM cycle, that still means that heap
+		 * page overflows are _possible_.  It is perhaps only a matter of time
+		 * until the heap becomes almost as fragmented as it would with a heap
+		 * fill factor of 100 -- so "small" effects may be really important.
+		 * (Just guessing here, but I can say for sure that the bottom-up
+		 * deletion patch works that way, so it is an "educated guess".)
+		 */
+		if (!vacuum_heap)
+		{
+			if (maxdeadpage > 130 ||
+				/* Also check if maintenance_work_mem space is running out */
+				vacrelstats->dead_tuples->num_tuples >
+				vacrelstats->dead_tuples->max_tuples / 2)
+				vacuum_heap = true;
+		}
 	}
 
 	vacrelstats->vacuum_heap = vacuum_heap;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 420457c1a2..ee071cb463 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -878,12 +878,35 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 }
 
 /*
- * Choose the vacuum strategy. Currently always do ambulkdelete.
+ * Choose the vacuum strategy
  */
 IndexVacuumStrategy
 btvacuumstrategy(IndexVacuumInfo *info)
 {
-	return INDEX_VACUUM_BULKDELETE;
+	Relation	rel = info->index;
+
+	/*
+	 * This strcmp() is a quick and dirty prototype of logic that decides
+	 * whether or not the index needs to get a bulk deletion during this
+	 * VACUUM.  A real version of this logic could work by remembering the
+	 * size of the index during the last VACUUM.  It would only return
+	 * INDEX_VACUUM_BULKDELETE to choose_vacuum_strategy()/vacuumlazy.c iff it
+	 * found that the index is now larger than it was last time around, even
+	 * by one single block.  (It could get a lot more sophisticated than that,
+	 * for example by trying to understand UPDATEs vs DELETEs, but a very
+	 * simple approach is probably almost as useful to users.)
+	 *
+	 * Further details on the strcmp() and my benchmarking:
+	 *
+	 * The index named abalance_ruin is the only index that receives logical
+	 * changes in my pgbench benchmarks.  It is one index among several on
+	 * pgbench_accounts.  It covers the abalance column, which makes almost
+	 * 100% of all UPDATEs non-HOT UPDATEs.
+	 */
+	if (strcmp(RelationGetRelationName(rel), "abalance_ruin") == 0)
+		return INDEX_VACUUM_BULKDELETE;
+
+	return INDEX_VACUUM_NONE;
 }
 
 /*
@@ -903,8 +926,14 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	/*
 	 * Skip deleting index entries if the corresponding heap tuples will
 	 * not be deleted.
+	 *
+	 * XXX: Maybe we need to remember the decision made in btvacuumstrategy()
+	 * in an AM-generic way, or using some standard idiom that is owned by the
+	 * index AM?  The strcmp() here repeats work done in btvacuumstrategy(),
+	 * which is not ideal.
 	 */
-	if (info->bulkdelete_skippable)
+	if (info->bulkdelete_skippable &&
+		strcmp(RelationGetRelationName(rel), "abalance_ruin") != 0)
 		return NULL;
 
 	/* allocate stats if first time through, else re-use existing struct */
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 6a182ba9cd..223b7cb820 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1875,6 +1875,11 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	 */
 	if (params->index_cleanup == VACOPT_TERNARY_DEFAULT)
 	{
+		/*
+		 * XXX had to comment this out to get choose_vacuum_strategy() to do
+		 * the right thing
+		 */
+#if 0
 		if (onerel->rd_options != NULL)
 		{
 			if (((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
@@ -1882,6 +1887,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 			else
 				params->index_cleanup = VACOPT_TERNARY_DISABLED;
 		}
+#endif
 	}
 
 	/* Set truncate option based on reloptions if not yet */
-- 
2.27.0

#6Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#4)
1 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Dec 28, 2020 at 4:42 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Dec 27, 2020 at 10:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

As you said, the next question must be: How do we teach lazy vacuum to
not do what gets requested by amvacuumcleanup() when it cannot respect
the wishes of one individual indexes, for example when the
accumulation of LP_DEAD items in the heap becomes a big problem in
itself? That really could be the thing that forces full heap
vacuuming, even with several indexes.

You mean requested by amvacuumstreategy(), not by amvacuumcleanup()? I
think amvacuumstrategy() affects only ambulkdelete(). But when all
ambulkdelete() were skipped by the requests by index AMs we might want
to skip amvacuumcleanup() as well.

No, I was asking about how we should decide to do a real VACUUM even
(a real ambulkdelete() call) when no index asks for it because
bottom-up deletion works very well in every index. Clearly we will
need to eventually remove remaining LP_DEAD items from the heap at
some point if nothing else happens -- eventually LP_DEAD items in the
heap alone will force a traditional heap vacuum (which will still have
to go through indexes that have not grown, just to be safe/avoid
recycling a TID that's still in the index).

Postgres heap fillfactor is 100 by default, though I believe it's 90
in another well known DB system. If you set Postgres heap fill factor
to 90 you can fit a little over 200 LP_DEAD items in the "extra space"
left behind in each heap page after initial bulk loading/INSERTs take
place that respect our lower fill factor setting. This is about 4x the
number of initial heap tuples in the pgbench_accounts table -- it's
quite a lot!

If we pessimistically assume that all updates are non-HOT updates,
we'll still usually have enough space for each logical row to get
updated several times before the heap page "overflows". Even when
there is significant skew in the UPDATEs, the skew is not noticeable
at the level of individual heap pages. We have a surprisingly large
general capacity to temporarily "absorb" extra garbage LP_DEAD items
in heap pages this way. Nobody really cared about this extra capacity
very much before now, because it did not help with the big problem of
index bloat that you naturally see with this workload. But that big
problem may go away soon, and so this extra capacity may become
important at the same time.

I think that it could make sense for lazy_scan_heap() to maintain
statistics about the number of LP_DEAD items remaining in each heap
page (just local stack variables). From there, it can pass the
statistics to the choose_vacuum_strategy() function from your patch.
Perhaps choose_vacuum_strategy() will notice that the heap page with
the most LP_DEAD items encountered within lazy_scan_heap() (among
those encountered so far in the event of multiple index passes) has
too many LP_DEAD items -- this indicates that there is a danger that
some heap pages will start to "overflow" soon, which is now a problem
that lazy_scan_heap() must think about. Maybe if the "extra space"
left by applying heap fill factor (with settings below 100) is
insufficient to fit perhaps 2/3 of the LP_DEAD items needed on the
heap page that has the most LP_DEAD items (among all heap pages), we
stop caring about what amvacuumstrategy()/the indexes say. So we do
the right thing for the heap pages, while still mostly avoiding index
vacuuming and the final heap pass.

Agreed. I like the idea of calculating how many LP_DEAD items we can
absorb based on the extra space left by applying the fill factor.
Since there is a limit on the maximum number of line pointers in a
heap page, we might need to take that limit into account in the
calculation.

From another point of view, given that the maximum number of heap
tuples in one 8kB heap page (MaxHeapTuplesPerPage) is 291, I think how
bad it is to store LP_DEAD items in a heap page varies depending on
the tuple size.

For example, suppose the tuple size is 200: we can store 40 tuples in
one heap page if there are no LP_DEAD items at all. Even if there are
150 LP_DEAD items on the page, we are still able to store 37 tuples,
because we can still have up to 141 line pointers, which is more than
enough for the 40 tuples we could store anyway, and we have (8192 -
(4 * 150)) bytes of space to store tuples (with their line pointers).
That is, having 150 LP_DEAD items ends up causing an overflow of only
3 tuples. On the other hand, suppose the tuple size is 40: we can
store about 204 tuples in one heap page if there are no LP_DEAD items
at all. If there are 150 LP_DEAD items on the page, we are able to
store only 141 tuples. That is, having 150 LP_DEAD items ends up
causing an overflow of 63 tuples. I think the impact of absorbing
LP_DEAD items on table bloat is larger in the latter case.

The larger the tuple size, the more LP_DEAD items a heap page can
absorb with little ill effect. Considering a 32-byte tuple, roughly
the minimum heap tuple size including the tuple header, absorbing up
to approximately 70 LP_DEAD items would not matter much in terms of
bloat. In other words, if a heap page has more than 70 LP_DEAD items,
absorbing LP_DEAD items may start to become a table bloat problem.
This threshold of 70 LP_DEAD items is a conservative value and would
probably be a lower bound. If the tuple size is larger, we may be able
to absorb more LP_DEAD items.
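
Spelled out, the rough model behind these numbers is something like
the following (my own simplification that ignores the page header and
tuple alignment and charges a line pointer per stored tuple, so the
figures are approximate):

    #include <stdio.h>

    #define BLCKSZ              8192
    #define LP_SIZE             4       /* size of one line pointer */
    #define MAX_TUPLES_PER_PAGE 291     /* MaxHeapTuplesPerPage for 8kB pages */

    /* How many heap tuples still fit on a page holding ndead LP_DEAD items? */
    static int
    storable_tuples(int tupsize, int ndead)
    {
        int     by_space = (BLCKSZ - ndead * LP_SIZE) / (tupsize + LP_SIZE);
        int     by_lp = MAX_TUPLES_PER_PAGE - ndead;

        return (by_space < by_lp) ? by_space : by_lp;
    }

    int
    main(void)
    {
        printf("%d\n", storable_tuples(200, 0));    /* 40 */
        printf("%d\n", storable_tuples(200, 150));  /* 37 */
        printf("%d\n", storable_tuples(40, 150));   /* 141 */
        return 0;
    }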

FYI I've attached a graph showing how the number of LP_DEAD items on
one heap page affects the maximum number of heap tuples on the same
heap page. The X-axis is the number of LP_DEAD items in one heap page
and the Y-axis is the number of heap tuples that can be stored on the
page; each line corresponds to a different heap tuple size. For
example, in the pgbench workload, since the tuple size is about 120
bytes, page bloat accelerates if we leave more than about 230 LP_DEAD
items in a heap page.

I experimented with this today, and I think that it is a good way to
do it. I like the idea of choose_vacuum_strategy() understanding that
heap pages that are subject to many non-HOT updates have a "natural
extra capacity for LP_DEAD items" that it must care about directly (at
least with non-default heap fill factor settings). My early testing
shows that it will often take a surprisingly long time for the most
heavily updated heap page to have more than about 100 LP_DEAD items.

Agreed.

I will need to experiment in order to improve my understanding of how
to make this cooperate with bottom-up index deletion. But that's
mostly just a question for my patch (and a relatively easy one).

Yeah, I think we might need something like statistics about garbage
per index so that individual index can make a different decision based
on their status. For example, a btree index might want to skip
ambulkdelete() if it has a few dead index tuples in its leaf pages. It
could be on stats collector or on btree's meta page.

Right. I think that even a very conservative approach could work well.
For example, maybe we teach nbtree's amvacuumstrategy() routine to ask
to do a real ambulkdelete(), except in the extreme case where the
index is *exactly* the same size as it was after the last VACUUM.
This will happen regularly with bottom-up index deletion. Maybe that
approach is a bit too conservative, though.

Agreed.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

Attachments:

lp_dead.png (image/png) -- graph of the maximum number of heap tuples per page (Y-axis) against the number of LP_DEAD items per page (X-axis), one line per heap tuple size
#7Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#5)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Dec 29, 2020 at 7:06 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan <pg@bowt.ie> wrote:

I experimented with this today, and I think that it is a good way to
do it. I like the idea of choose_vacuum_strategy() understanding that
heap pages that are subject to many non-HOT updates have a "natural
extra capacity for LP_DEAD items" that it must care about directly (at
least with non-default heap fill factor settings). My early testing
shows that it will often take a surprisingly long time for the most
heavily updated heap page to have more than about 100 LP_DEAD items.

Attached is a rough patch showing what I did here. It was applied on
top of my bottom-up index deletion patch series and your
poc_vacuumstrategy.patch patch. This patch was written as a quick and
dirty way of simulating what I thought would work best for bottom-up
index deletion for one specific benchmark/test, which was
non-hot-update heavy. This consists of a variant pgbench with several
indexes on pgbench_accounts (almost the same as most other bottom-up
deletion benchmarks I've been running). Only one index is "logically
modified" by the updates, but of course we still physically modify all
indexes on every update. I set fill factor to 90 for this benchmark,
which is an important factor for how your VACUUM patch works during
the benchmark.

This rough supplementary patch includes VACUUM logic that assumes (but
doesn't check) that the table has heap fill factor set to 90 -- see my
changes to choose_vacuum_strategy(). This benchmark is really about
stability over time more than performance (though performance is also
improved significantly). I wanted to keep both the table/heap and the
logically unmodified indexes (i.e. 3 out of 4 indexes on
pgbench_accounts) exactly the same size *forever*.

Does this make sense?

Thank you for sharing the patch. That makes sense.

+        if (!vacuum_heap)
+        {
+            if (maxdeadpage > 130 ||
+                /* Also check if maintenance_work_mem space is running out */
+                vacrelstats->dead_tuples->num_tuples >
+                vacrelstats->dead_tuples->max_tuples / 2)
+                vacuum_heap = true;
+        }

The second test, checking if maintenance_work_mem space is running
out, also makes sense to me. Perhaps another idea would be to compare
the number of collected garbage tuples to the total number of heap
tuples, so that we do lazy_vacuum_heap() only when we're likely to
reclaim a certain amount of garbage in the table.
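
For example, something like this in choose_vacuum_strategy() (just a
sketch; the 2% threshold is an arbitrary number for illustration):

    /* Also vacuum the heap if we would reclaim a meaningful fraction of it */
    if (vacrelstats->dead_tuples->num_tuples > live_tuples * 0.02)
        vacuum_heap = true;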

Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that
pg_stat_database shows an almost ~28% reduction in blks_read after an
overnight run for the patch series (it was 508,820,699 for the
patches, 705,282,975 for the master branch). I think that the VACUUM
component is responsible for some of that reduction. There were 11
VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap()
(these 7 VACUUM operations all only dead a btbulkdelete() call for the
one problematic index on the table, named "abalance_ruin", which my
supplementary patch has hard-coded knowledge of).

That's a very good result in terms of skipping lazy_vacuum_heap(). How
much did the table and indexes bloat? Also, I'm curious about which
test in choose_vacuum_strategy() turned vacuum_heap on: the 130 test
or the maintenance_work_mem test? And what was the impact on clearing
all-visible bits?

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#8Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#7)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Dec 28, 2020 at 10:26 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

The second test checking if maintenane_work_mem space is running out
also makes sense to me. Perhaps another idea would be to compare the
number of collected garbage tuple to the total number of heap tuples
so that we do lazy_vacuum_heap() only when we’re likely to reclaim a
certain amount of garbage in the table.

Right. Or it might be nice to consider whether this is an
anti-wraparound VACUUM -- maybe we should skip index vacuuming +
lazy_vacuum_heap() if and only if we're under pressure to advance
datfrozenxid for the whole DB, and really need to hurry up. (I think
that we could both probably think of way too many ideas like this
one.)
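
Something like this, perhaps (just a sketch of the idea, using the
existing is_wraparound flag in VacuumParams; it ignores the
TID-recycling caveat I mentioned earlier):

    /*
     * Under wraparound pressure, hurry: prefer to skip the second heap pass
     * (and let the indexes skip ambulkdelete() too), and concentrate on
     * getting relfrozenxid advanced.
     */
    if (params->is_wraparound)
        vacuum_heap = false;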

That's a very good result in terms of skipping lazy_vacuum_heap(). How
much the table and indexes bloated? Also, I'm curious about that which
tests in choose_vacuum_strategy() turned vacuum_heap on: 130 test or
test if maintenance_work_mem space is running out? And what was the
impact on clearing all-visible bits?

The pgbench_accounts heap table and 3 out of 4 of its indexes (i.e.
all indexes except "abalance_ruin") had zero growth. They did not even
become larger by 1 block. As I often say when talking about work in
this area, this is not a quantitative difference -- it's a qualitative
difference. (If they grew even a tiny amount, say by only 1 block,
further growth is likely to follow.)

The "abalance_ruin" index was smaller with the patch. Its size started
off at 253,779 blocks with both the patch and master branch (which is
very small, because of B-Tree deduplication). By the end of 2 pairs of
runs for the patch (2 3 hour runs) the size grew to 502,016 blocks.
But with the master branch it grew to 540,090 blocks. (For reference,
the primary key on pgbench_accounts started out at 822,573 blocks.)

My guess is that this would compare favorably with "magic VACUUM" [1]
(I refer to a thought experiment that is useful for understanding the
principles behind bottom-up index deletion). The fact that
"abalance_ruin" becomes bloated probably doesn't have that much to do
with MVCC versioning. In other words, I suspect that the index
wouldn't be that much smaller in a traditional two-phase locking
database system with the same workload. Words like "bloat" and
"fragmentation"
have always been overloaded/ambiguous in highly confusing ways, which
is why I find it useful to compare a real world workload/benchmark to
some kind of theoretical ideal behavior.

This test wasn't particularly sympathetic to the patch because most of
the indexes (all but the PK) were useless -- they did not get used by
query plans. So the final size of "abalance_ruin" (or any other index)
isn't even the truly important thing IMV (the benchmark doesn't
advertise the truly important thing for me). The truly important thing
is that the worst case number of versions *per logical row* is tightly
controlled. It doesn't necessarily matter all that much if 30% of an
index's tuples are garbage, as long as the garbage tuples are evenly
spread across all logical rows in the table (in practice it's pretty
unlikely that that would actually happen, but it's possible in theory,
and if it did happen it really wouldn't be so bad).

[1]: /messages/by-id/CAH2-Wz=rPkB5vXS7AZ+v8t3VE75v_dKGro+w4nWd64E9yiCLEQ@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan

#9Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#8)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Dec 28, 2020 at 11:20 PM Peter Geoghegan <pg@bowt.ie> wrote:

That's a very good result in terms of skipping lazy_vacuum_heap(). How
much the table and indexes bloated? Also, I'm curious about that which
tests in choose_vacuum_strategy() turned vacuum_heap on: 130 test or
test if maintenance_work_mem space is running out? And what was the
impact on clearing all-visible bits?

The pgbench_accounts heap table and 3 out of 4 of its indexes (i.e.
all indexes except "abalance_ruin") had zero growth. They did not even
become larger by 1 block. As I often say when talking about work in
this area, this is not a quantitative difference -- it's a qualitative
difference. (If they grew even a tiny amount, say by only 1 block,
further growth is likely to follow.)

I forgot to say: I don't know what the exact impact was on the VM bit
setting, but I doubt that it was noticeably worse for the patch. It
cannot have been better, though.

It's inherently almost impossible to keep most of the VM bits set for
long with this workload. Perhaps VM bit setting would be improved with
workloads that have some HOT updates, but as I mentioned this workload
only had non-HOT updates (except in a tiny number of cases where
abalance did not change, just by random luck).

I also forgot to say that the maintenance_work_mem test wasn't that
relevant, though I believe it triggered once. maintenance_work_mem was
set very high (5GB).

Here is a link with more detailed information, in case that is
interesting: https://drive.google.com/file/d/1TqpAQnqb4SMMuhehD8ELpf6Cv9A8ux2E/view?usp=sharing

--
Peter Geoghegan

#10Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#7)
3 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Dec 29, 2020 at 3:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 29, 2020 at 7:06 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan <pg@bowt.ie> wrote:

I experimented with this today, and I think that it is a good way to
do it. I like the idea of choose_vacuum_strategy() understanding that
heap pages that are subject to many non-HOT updates have a "natural
extra capacity for LP_DEAD items" that it must care about directly (at
least with non-default heap fill factor settings). My early testing
shows that it will often take a surprisingly long time for the most
heavily updated heap page to have more than about 100 LP_DEAD items.
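
As a rough sanity check of that ballpark (back-of-envelope arithmetic
only, assuming the default 8 kB block size, 4-byte line pointers, and
heap fillfactor = 90 -- none of these values are taken from the patches):

#include <stdio.h>

/*
 * Not code from any of the patches; just the arithmetic behind the
 * "roughly 100 LP_DEAD items" observation and the hard-coded 130 test.
 * Assumes BLCKSZ = 8192, sizeof(ItemIdData) = 4, heap fillfactor = 90.
 */
int
main(void)
{
	int		freespace = 8192 * (100 - 90) / 100;	/* 819 bytes reserved by fillfactor */
	int		lp_dead_capacity = freespace / 4;		/* ~204 LP_DEAD line pointers fit */

	printf("capacity = %d, 0.7 * capacity = %d\n",
		   lp_dead_capacity, (int) (lp_dead_capacity * 0.7));	/* prints 204 and 142 */
	return 0;
}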

Attached is a rough patch showing what I did here. It was applied on
top of my bottom-up index deletion patch series and your
poc_vacuumstrategy.patch patch. This patch was written as a quick and
dirty way of simulating what I thought would work best for bottom-up
index deletion for one specific benchmark/test, which was
non-hot-update heavy. This consists of a variant pgbench with several
indexes on pgbench_accounts (almost the same as most other bottom-up
deletion benchmarks I've been running). Only one index is "logically
modified" by the updates, but of course we still physically modify all
indexes on every update. I set fill factor to 90 for this benchmark,
which is an important factor for how your VACUUM patch works during
the benchmark.

This rough supplementary patch includes VACUUM logic that assumes (but
doesn't check) that the table has heap fill factor set to 90 -- see my
changes to choose_vacuum_strategy(). This benchmark is really about
stability over time more than performance (though performance is also
improved significantly). I wanted to keep both the table/heap and the
logically unmodified indexes (i.e. 3 out of 4 indexes on
pgbench_accounts) exactly the same size *forever*.

Does this make sense?

Thank you for sharing the patch. That makes sense.

+        if (!vacuum_heap)
+        {
+            if (maxdeadpage > 130 ||
+                /* Also check if maintenance_work_mem space is running out */
+                vacrelstats->dead_tuples->num_tuples >
+                vacrelstats->dead_tuples->max_tuples / 2)
+                vacuum_heap = true;
+        }

The second test, checking if maintenance_work_mem space is running out,
also makes sense to me. Perhaps another idea would be to compare the
number of collected garbage tuples to the total number of heap tuples
so that we do lazy_vacuum_heap() only when we’re likely to reclaim a
certain amount of garbage in the table.
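
A minimal sketch of that idea (illustrative only, not part of the
attached patches; the function name and the 2% threshold are made up):

#include <stdbool.h>

/*
 * Hypothetical extra condition for choose_vacuum_strategy(): only do
 * lazy_vacuum_heap() when the collected dead tuples are a meaningful
 * fraction of the table's tuples.
 */
#define GARBAGE_RATIO_THRESHOLD	0.02	/* hypothetical: 2% of heap tuples */

static bool
garbage_worth_vacuuming(double num_dead_tuples, double num_heap_tuples)
{
	if (num_heap_tuples <= 0)
		return true;			/* no estimate available; don't skip heap vacuum */

	return (num_dead_tuples / num_heap_tuples) >= GARBAGE_RATIO_THRESHOLD;
}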

Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that
pg_stat_database shows an almost ~28% reduction in blks_read after an
overnight run for the patch series (it was 508,820,699 for the
patches, 705,282,975 for the master branch). I think that the VACUUM
component is responsible for some of that reduction. There were 11
VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap()
(these 7 VACUUM operations all only did a btbulkdelete() call for the
one problematic index on the table, named "abalance_ruin", which my
supplementary patch has hard-coded knowledge of).

That's a very good result in terms of skipping lazy_vacuum_heap(). How
much did the table and indexes bloat? Also, I'm curious about which
test in choose_vacuum_strategy() turned vacuum_heap on: the 130 test, or
the test of whether maintenance_work_mem space is running out? And what
was the impact on clearing all-visible bits?

I merged these patches and polished them.

In the 0002 patch, we calculate how many LP_DEAD items can be
accumulated in the space left on a single heap page by fillfactor. I
increased MaxHeapTuplesPerPage so that we can accumulate more LP_DEAD
items on a heap page, because otherwise accumulating LP_DEAD items
unnecessarily constrains the number of heap tuples that fit on a single
page, especially when tuples are small, as I mentioned before.
Previously, we constrained the number of line pointers to avoid
excessive line-pointer bloat and to avoid requiring an increase in the
size of the work array. However, once the amvacuumstrategy stuff enters
the picture, accumulating line pointers has value. Also, we might want
to store the returned value of amvacuumstrategy so that the index AM
can refer to it during index deletion.

The 0003 patch has btree indexes skip bulk-deletion if the index hasn't
grown since the last bulk-deletion. I stored the number of blocks in
the meta page but didn't implement meta page upgrading.
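
For illustration, the shape of that check is roughly the following
sketch (not the code in the attached 0003 patch; the helper
_bt_get_last_bulkdelete_nblocks() is hypothetical and stubbed out,
standing in for reading the block count stored in the btree meta page):

#include "postgres.h"
#include "access/genam.h"
#include "access/nbtree.h"
#include "storage/bufmgr.h"

/*
 * Hypothetical helper, standing in for reading the block count that the
 * 0003 patch remembers in the btree meta page at bulkdelete time.
 */
static BlockNumber
_bt_get_last_bulkdelete_nblocks(Relation rel)
{
	return 0;					/* placeholder only */
}

IndexVacuumStrategy
btvacuumstrategy(IndexVacuumInfo *info)
{
	Relation	rel = info->index;
	BlockNumber	nblocks = RelationGetNumberOfBlocks(rel);

	/* Skip bulk-deletion when the index has not grown since the last one. */
	if (nblocks <= _bt_get_last_bulkdelete_nblocks(rel))
		return INDEX_VACUUM_STRATEGY_NONE;

	return INDEX_VACUUM_STRATEGY_BULKDELETE;
}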

I've attached the draft version patches. Note that the documentation
update is still lacking.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

Attachments:

0001-Introduce-IndexAM-API-for-choosing-index-vacuum-stra.patchapplication/octet-stream; name=0001-Introduce-IndexAM-API-for-choosing-index-vacuum-stra.patchDownload
From d0f2cb5ab0f565dea02cad2b91583c4f985dcf4d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 4 Jan 2021 13:34:10 +0900
Subject: [PATCH 1/3] Introduce IndexAM API for choosing index vacuum strategy.

---
 contrib/bloom/bloom.h                 |  1 +
 contrib/bloom/blutils.c               |  1 +
 contrib/bloom/blvacuum.c              |  9 +++++++++
 src/backend/access/brin/brin.c        | 10 ++++++++++
 src/backend/access/gin/ginutil.c      |  1 +
 src/backend/access/gin/ginvacuum.c    |  9 +++++++++
 src/backend/access/gist/gist.c        |  1 +
 src/backend/access/gist/gistvacuum.c  |  9 +++++++++
 src/backend/access/hash/hash.c        | 10 ++++++++++
 src/backend/access/index/indexam.c    | 19 +++++++++++++++++++
 src/backend/access/nbtree/nbtree.c    | 10 ++++++++++
 src/backend/access/spgist/spgutils.c  |  1 +
 src/backend/access/spgist/spgvacuum.c |  9 +++++++++
 src/include/access/amapi.h            |  3 +++
 src/include/access/brin_internal.h    |  1 +
 src/include/access/genam.h            | 12 +++++++++++-
 src/include/access/gin_private.h      |  1 +
 src/include/access/gist_private.h     |  1 +
 src/include/access/hash.h             |  1 +
 src/include/access/nbtree.h           |  1 +
 src/include/access/spgist.h           |  1 +
 21 files changed, 110 insertions(+), 1 deletion(-)

diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index 436bd43209..6d1fab05ee 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -201,6 +201,7 @@ extern void blendscan(IndexScanDesc scan);
 extern IndexBuildResult *blbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void blbuildempty(Relation index);
+extern IndexVacuumStrategy blvacuumstrategy(IndexVacuumInfo *info);
 extern IndexBulkDeleteResult *blbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
 										   void *callback_state);
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 1e505b1da5..8098d75c82 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -131,6 +131,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = blbuild;
 	amroutine->ambuildempty = blbuildempty;
 	amroutine->aminsert = blinsert;
+	amroutine->amvacuumstrategy = blvacuumstrategy;
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
index 88b0a6d290..982ebf97e6 100644
--- a/contrib/bloom/blvacuum.c
+++ b/contrib/bloom/blvacuum.c
@@ -23,6 +23,15 @@
 #include "storage/lmgr.h"
 
 
+/*
+ * Choose the vacuum strategy. Currently always do ambulkdelete.
+ */
+IndexVacuumStrategy
+blvacuumstrategy(IndexVacuumInfo *info)
+{
+	return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 58fe109d2d..776ac3f64c 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -112,6 +112,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = brinbuild;
 	amroutine->ambuildempty = brinbuildempty;
 	amroutine->aminsert = brininsert;
+	amroutine->amvacuumstrategy = brinvacuumstrategy;
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
@@ -770,6 +771,15 @@ brinbuildempty(Relation index)
 	UnlockReleaseBuffer(metabuf);
 }
 
+/*
+ * Choose the vacuum strategy. Currently always do ambulkdelete.
+ */
+IndexVacuumStrategy
+brinvacuumstrategy(IndexVacuumInfo *info)
+{
+	return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * brinbulkdelete
  *		Since there are no per-heap-tuple index tuples in BRIN indexes,
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf42..fc375332fc 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -63,6 +63,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = ginbuild;
 	amroutine->ambuildempty = ginbuildempty;
 	amroutine->aminsert = gininsert;
+	amroutine->amvacuumstrategy = ginvacuumstrategy;
 	amroutine->ambulkdelete = ginbulkdelete;
 	amroutine->amvacuumcleanup = ginvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 35b85a9bff..4bd6d32435 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -560,6 +560,15 @@ ginVacuumEntryPage(GinVacuumState *gvs, Buffer buffer, BlockNumber *roots, uint3
 	return (tmppage == origpage) ? NULL : tmppage;
 }
 
+/*
+ * Choose the vacuum strategy. Currently always do ambulkdelete.
+ */
+IndexVacuumStrategy
+ginvacuumstrategy(IndexVacuumInfo *info)
+{
+	return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 IndexBulkDeleteResult *
 ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			  IndexBulkDeleteCallback callback, void *callback_state)
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index e4b251a58f..4b9efecd9c 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,6 +84,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = gistbuild;
 	amroutine->ambuildempty = gistbuildempty;
 	amroutine->aminsert = gistinsert;
+	amroutine->amvacuumstrategy = gistvacuumstrategy;
 	amroutine->ambulkdelete = gistbulkdelete;
 	amroutine->amvacuumcleanup = gistvacuumcleanup;
 	amroutine->amcanreturn = gistcanreturn;
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 94a7e12763..7dc8c3d860 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -52,6 +52,15 @@ static bool gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						   Buffer buffer, OffsetNumber downlink,
 						   Buffer leafBuffer);
 
+/*
+ * Choose the vacuum strategy. Currently always do ambulkdelete.
+ */
+IndexVacuumStrategy
+gistvacuumstrategy(IndexVacuumInfo *info)
+{
+	return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * VACUUM bulkdelete stage: remove index entries.
  */
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 263ae23ab0..fd4626dfab 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -81,6 +81,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = hashbuild;
 	amroutine->ambuildempty = hashbuildempty;
 	amroutine->aminsert = hashinsert;
+	amroutine->amvacuumstrategy = hashvacuumstrategy;
 	amroutine->ambulkdelete = hashbulkdelete;
 	amroutine->amvacuumcleanup = hashvacuumcleanup;
 	amroutine->amcanreturn = NULL;
@@ -443,6 +444,15 @@ hashendscan(IndexScanDesc scan)
 	scan->opaque = NULL;
 }
 
+/*
+ * Choose the vacuum strategy. Currently always do ambulkdelete.
+ */
+IndexVacuumStrategy
+hashvacuumstrategy(IndexVacuumInfo *info)
+{
+	return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index c2b98e8a72..a58cdaf161 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -676,6 +676,25 @@ index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap)
 	return ntids;
 }
 
+/* ----------------
+ *		index_vacuum_strategy - ask index vacuum strategy
+ *
+ * This callback routine is called just before vacuuming the heap.
+ * Returns IndexVacuumStrategy value to tell the lazy vacuum whether to
+ * do index deletion.
+ * ----------------
+ */
+IndexVacuumStrategy
+index_vacuum_strategy(IndexVacuumInfo *info)
+{
+	Relation	indexRelation = info->index;
+
+	RELATION_CHECKS;
+	CHECK_REL_PROCEDURE(amvacuumstrategy);
+
+	return indexRelation->rd_indam->amvacuumstrategy(info);
+}
+
 /* ----------------
  *		index_bulk_delete - do mass deletion of index entries
  *
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index ba79a7f3e9..800f7a14b6 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -133,6 +133,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = btbuild;
 	amroutine->ambuildempty = btbuildempty;
 	amroutine->aminsert = btinsert;
+	amroutine->amvacuumstrategy = btvacuumstrategy;
 	amroutine->ambulkdelete = btbulkdelete;
 	amroutine->amvacuumcleanup = btvacuumcleanup;
 	amroutine->amcanreturn = btcanreturn;
@@ -863,6 +864,15 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	return result;
 }
 
+/*
+ * Choose the vacuum strategy. Currently always do ambulkdelete.
+ */
+IndexVacuumStrategy
+btvacuumstrategy(IndexVacuumInfo *info)
+{
+	return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index d8b1815061..7b2313590a 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -66,6 +66,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = spgbuild;
 	amroutine->ambuildempty = spgbuildempty;
 	amroutine->aminsert = spginsert;
+	amroutine->amvacuumstrategy = spgvacuumstrategy;
 	amroutine->ambulkdelete = spgbulkdelete;
 	amroutine->amvacuumcleanup = spgvacuumcleanup;
 	amroutine->amcanreturn = spgcanreturn;
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0d02a02222..1df6dfd5da 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -894,6 +894,15 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	bds->stats->pages_free = bds->stats->pages_deleted;
 }
 
+/*
+ * Choose the vacuum strategy. Currently always do ambulkdelete.
+ */
+IndexVacuumStrategy
+spgvacuumstrategy(IndexVacuumInfo *info)
+{
+	return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index de758cab0b..5f784c0af1 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -111,6 +111,8 @@ typedef bool (*aminsert_function) (Relation indexRelation,
 								   Relation heapRelation,
 								   IndexUniqueCheck checkUnique,
 								   struct IndexInfo *indexInfo);
+/* vacuum strategy */
+typedef IndexVacuumStrategy (*amvacuumstrategy_function) (IndexVacuumInfo *info);
 
 /* bulk delete */
 typedef IndexBulkDeleteResult *(*ambulkdelete_function) (IndexVacuumInfo *info,
@@ -258,6 +260,7 @@ typedef struct IndexAmRoutine
 	ambuild_function ambuild;
 	ambuildempty_function ambuildempty;
 	aminsert_function aminsert;
+	amvacuumstrategy_function amvacuumstrategy;
 	ambulkdelete_function ambulkdelete;
 	amvacuumcleanup_function amvacuumcleanup;
 	amcanreturn_function amcanreturn;	/* can be NULL */
diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h
index 85c612e490..6695ab75d9 100644
--- a/src/include/access/brin_internal.h
+++ b/src/include/access/brin_internal.h
@@ -97,6 +97,7 @@ extern int64 bringetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void brinrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					   ScanKey orderbys, int norderbys);
 extern void brinendscan(IndexScanDesc scan);
+extern IndexVacuumStrategy brinvacuumstrategy(IndexVacuumInfo *info);
 extern IndexBulkDeleteResult *brinbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
 											 IndexBulkDeleteCallback callback,
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index aa8ff360da..112b90e4cf 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -34,7 +34,8 @@ typedef struct IndexBuildResult
 } IndexBuildResult;
 
 /*
- * Struct for input arguments passed to ambulkdelete and amvacuumcleanup
+ * Struct for input arguments passed to amvacuumstrategy, ambulkdelete
+ * and amvacuumcleanup
  *
  * num_heap_tuples is accurate only when estimated_count is false;
  * otherwise it's just an estimate (currently, the estimate is the
@@ -125,6 +126,14 @@ typedef struct IndexOrderByDistance
 	bool		isnull;
 } IndexOrderByDistance;
 
+/* Result value for amvacuumstrategy */
+typedef enum IndexVacuumStrategy
+{
+	INDEX_VACUUM_STRATEGY_NONE,			/* No-op, skip bulk-deletion in this
+										 * vacuum cycle */
+	INDEX_VACUUM_STRATEGY_BULKDELETE	/* Do ambulkdelete */
+} IndexVacuumStrategy;
+
 /*
  * generalized index_ interface routines (in indexam.c)
  */
@@ -173,6 +182,7 @@ extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 							   struct TupleTableSlot *slot);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
+extern IndexVacuumStrategy index_vacuum_strategy(IndexVacuumInfo *info);
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
 												IndexBulkDeleteResult *stats,
 												IndexBulkDeleteCallback callback,
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index a7a71ae1b4..ed511548ff 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -396,6 +396,7 @@ extern int64 gingetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void ginInitConsistentFunction(GinState *ginstate, GinScanKey key);
 
 /* ginvacuum.c */
+extern IndexVacuumStrategy ginvacuumstrategy(IndexVacuumInfo *info);
 extern IndexBulkDeleteResult *ginbulkdelete(IndexVacuumInfo *info,
 											IndexBulkDeleteResult *stats,
 											IndexBulkDeleteCallback callback,
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index e899e81749..6ffc0730ea 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -532,6 +532,7 @@ extern void gistMakeUnionKey(GISTSTATE *giststate, int attno,
 extern XLogRecPtr gistGetFakeLSN(Relation rel);
 
 /* gistvacuum.c */
+extern IndexVacuumStrategy gistvacuumstrategy(IndexVacuumInfo *info);
 extern IndexBulkDeleteResult *gistbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
 											 IndexBulkDeleteCallback callback,
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 22a99e7083..8f8437ab4c 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -371,6 +371,7 @@ extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
 extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					   ScanKey orderbys, int norderbys);
 extern void hashendscan(IndexScanDesc scan);
+extern IndexVacuumStrategy hashvacuumstrategy(IndexVacuumInfo *info);
 extern IndexBulkDeleteResult *hashbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
 											 IndexBulkDeleteCallback callback,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index b793dab9fa..b8247537fd 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1008,6 +1008,7 @@ extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
+extern IndexVacuumStrategy btvacuumstrategy(IndexVacuumInfo *info);
 extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats,
 										   IndexBulkDeleteCallback callback,
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index 38a5902202..76bedf2b97 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -211,6 +211,7 @@ extern bool spggettuple(IndexScanDesc scan, ScanDirection dir);
 extern bool spgcanreturn(Relation index, int attno);
 
 /* spgvacuum.c */
+extern IndexVacuumStrategy spgvacuumstrategy(IndexVacuumInfo *info);
 extern IndexBulkDeleteResult *spgbulkdelete(IndexVacuumInfo *info,
 											IndexBulkDeleteResult *stats,
 											IndexBulkDeleteCallback callback,
-- 
2.27.0

0002-Choose-index-vacuum-strategy-based-on-amvacuumstrate.patchapplication/octet-stream; name=0002-Choose-index-vacuum-strategy-based-on-amvacuumstrate.patchDownload
From c1a20d5c4de27f0059e4928405ed81c298c123d3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 4 Jan 2021 13:35:35 +0900
Subject: [PATCH 2/3] Choose index vacuum strategy based on amvacuumstrategy
 IndexAM API.

If the index_cleanup option is specified by neither the VACUUM command
nor the storage option, lazy vacuum asks each index for its vacuum
strategy before heap vacuum and decides whether or not to remove the
collected garbage tuples from the heap, based on both the answers from
amvacuumstrategy and how many LP_DEAD items can be accumulated in the
space of a heap page left by fillfactor.

The decision made by lazy vacuum is passed to ambulkdelete. Then each
index can choose whether or not to skip index bulk-deletion
accordingly.
---
 contrib/bloom/blvacuum.c                |   9 +-
 src/backend/access/brin/brin.c          |  10 +-
 src/backend/access/common/reloptions.c  |  35 ++-
 src/backend/access/gin/ginpostinglist.c |  30 +--
 src/backend/access/gin/ginvacuum.c      |  11 +
 src/backend/access/gist/gistvacuum.c    |  14 +-
 src/backend/access/hash/hash.c          |   7 +
 src/backend/access/heap/vacuumlazy.c    | 273 ++++++++++++++++++------
 src/backend/access/nbtree/nbtree.c      |  19 ++
 src/backend/access/spgist/spgvacuum.c   |  14 +-
 src/backend/catalog/index.c             |   2 +
 src/backend/commands/analyze.c          |   2 +
 src/backend/commands/vacuum.c           |  23 +-
 src/include/access/genam.h              |  15 ++
 src/include/access/htup_details.h       |  17 +-
 src/include/commands/vacuum.h           |  20 +-
 src/include/utils/rel.h                 |  17 +-
 17 files changed, 388 insertions(+), 130 deletions(-)

diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
index 982ebf97e6..9f8bfc2413 100644
--- a/contrib/bloom/blvacuum.c
+++ b/contrib/bloom/blvacuum.c
@@ -54,6 +54,13 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	BloomMetaPageData *metaData;
 	GenericXLogState *gxlogState;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted.
+	 */
+	if (!info->will_vacuum_heap)
+		return NULL;
+
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
 
@@ -181,7 +188,7 @@ blvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	BlockNumber npages,
 				blkno;
 
-	if (info->analyze_only)
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	if (stats == NULL)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 776ac3f64c..a429468c57 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -783,7 +783,8 @@ brinvacuumstrategy(IndexVacuumInfo *info)
 /*
  * brinbulkdelete
  *		Since there are no per-heap-tuple index tuples in BRIN indexes,
- *		there's not a lot we can do here.
+ *		there's not a lot we can do here regardless of
+ *		info->will_vacuum_heap.
  *
  * XXX we could mark item tuples as "dirty" (when a minimum or maximum heap
  * tuple is deleted), meaning the need to re-run summarization on the affected
@@ -809,8 +810,11 @@ brinvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
 	Relation	heapRel;
 
-	/* No-op in ANALYZE ONLY mode */
-	if (info->analyze_only)
+	/*
+	 * No-op in ANALYZE ONLY mode or when user requests to disable index
+	 * cleanup.
+	 */
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	if (!stats)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index c687d3ee9e..3080cbbc6b 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -27,6 +27,7 @@
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/tablespace.h"
+#include "commands/vacuum.h"
 #include "commands/view.h"
 #include "nodes/makefuncs.h"
 #include "postmaster/postmaster.h"
@@ -140,15 +141,6 @@ static relopt_bool boolRelOpts[] =
 		},
 		false
 	},
-	{
-		{
-			"vacuum_index_cleanup",
-			"Enables index vacuuming and index cleanup",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
-			ShareUpdateExclusiveLock
-		},
-		true
-	},
 	{
 		{
 			"vacuum_truncate",
@@ -492,6 +484,18 @@ relopt_enum_elt_def viewCheckOptValues[] =
 	{(const char *) NULL}		/* list terminator */
 };
 
+/* values from VacOptTernaryValue */
+relopt_enum_elt_def vacOptTernaryOptValues[] =
+{
+	{"default", VACOPT_TERNARY_DEFAULT},
+	{"true", VACOPT_TERNARY_ENABLED},
+	{"false", VACOPT_TERNARY_DISABLED},
+	{"on", VACOPT_TERNARY_ENABLED},
+	{"off", VACOPT_TERNARY_DISABLED},
+	{"1", VACOPT_TERNARY_ENABLED},
+	{"0", VACOPT_TERNARY_DISABLED}
+};
+
 static relopt_enum enumRelOpts[] =
 {
 	{
@@ -516,6 +520,17 @@ static relopt_enum enumRelOpts[] =
 		VIEW_OPTION_CHECK_OPTION_NOT_SET,
 		gettext_noop("Valid values are \"local\" and \"cascaded\".")
 	},
+	{
+		{
+			"vacuum_index_cleanup",
+			"Enables index vacuuming and index cleanup",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		vacOptTernaryOptValues,
+		VACOPT_TERNARY_DEFAULT,
+		gettext_noop("Valid values are \"on\", \"off\", and \"default\".")
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1856,7 +1871,7 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, user_catalog_table)},
 		{"parallel_workers", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, parallel_workers)},
-		{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
+		{"vacuum_index_cleanup", RELOPT_TYPE_ENUM,
 		offsetof(StdRdOptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
 		offsetof(StdRdOptions, vacuum_truncate)}
diff --git a/src/backend/access/gin/ginpostinglist.c b/src/backend/access/gin/ginpostinglist.c
index 216b2b9a2c..e49c94b860 100644
--- a/src/backend/access/gin/ginpostinglist.c
+++ b/src/backend/access/gin/ginpostinglist.c
@@ -22,29 +22,29 @@
 
 /*
  * For encoding purposes, item pointers are represented as 64-bit unsigned
- * integers. The lowest 11 bits represent the offset number, and the next
- * lowest 32 bits are the block number. That leaves 21 bits unused, i.e.
- * only 43 low bits are used.
+ * integers. The lowest 13 bits represent the offset number, and the next
+ * lowest 32 bits are the block number. That leaves 19 bits unused, i.e.
+ * only 45 low bits are used.
  *
- * 11 bits is enough for the offset number, because MaxHeapTuplesPerPage <
- * 2^11 on all supported block sizes. We are frugal with the bits, because
+ * 13 bits is enough for the offset number, because MaxHeapTuplesPerPage <
+ * 2^13 on all supported block sizes. We are frugal with the bits, because
  * smaller integers use fewer bytes in the varbyte encoding, saving disk
  * space. (If we get a new table AM in the future that wants to use the full
  * range of possible offset numbers, we'll need to change this.)
  *
- * These 43-bit integers are encoded using varbyte encoding. In each byte,
+ * These 45-bit integers are encoded using varbyte encoding. In each byte,
  * the 7 low bits contain data, while the highest bit is a continuation bit.
  * When the continuation bit is set, the next byte is part of the same
- * integer, otherwise this is the last byte of this integer. 43 bits need
+ * integer, otherwise this is the last byte of this integer. 45 bits need
  * at most 7 bytes in this encoding:
  *
  * 0XXXXXXX
- * 1XXXXXXX 0XXXXYYY
- * 1XXXXXXX 1XXXXYYY 0YYYYYYY
- * 1XXXXXXX 1XXXXYYY 1YYYYYYY 0YYYYYYY
- * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
- * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
- * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0uuuuuuY
+ * 1XXXXXXX 0XXXXXXY
+ * 1XXXXXXX 1XXXXXXY 0YYYYYYY
+ * 1XXXXXXX 1XXXXXXY 1YYYYYYY 0YYYYYYY
+ * 1XXXXXXX 1XXXXXXY 1YYYYYYY 1YYYYYYY 0YYYYYYY
+ * 1XXXXXXX 1XXXXXXY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
+ * 1XXXXXXX 1XXXXXXY 1YYYYYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0uuuuYYY
  *
  * X = bits used for offset number
  * Y = bits used for block number
@@ -73,12 +73,12 @@
 
 /*
  * How many bits do you need to encode offset number? OffsetNumber is a 16-bit
- * integer, but you can't fit that many items on a page. 11 ought to be more
+ * integer, but you can't fit that many items on a page. 13 ought to be more
  * than enough. It's tempting to derive this from MaxHeapTuplesPerPage, and
  * use the minimum number of bits, but that would require changing the on-disk
  * format if MaxHeapTuplesPerPage changes. Better to leave some slack.
  */
-#define MaxHeapTuplesPerPageBits		11
+#define MaxHeapTuplesPerPageBits		13
 
 /* Max. number of bytes needed to encode the largest supported integer. */
 #define MaxBytesPerInteger				7
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 4bd6d32435..3972c758d0 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -580,6 +580,13 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	BlockNumber rootOfPostingTree[BLCKSZ / (sizeof(IndexTupleData) + sizeof(ItemId))];
 	uint32		nRoot;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted.
+	 */
+	if (!info->will_vacuum_heap)
+		return NULL;
+
 	gvs.tmpCxt = AllocSetContextCreate(CurrentMemoryContext,
 									   "Gin vacuum temporary context",
 									   ALLOCSET_DEFAULT_SIZES);
@@ -717,6 +724,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		return stats;
 	}
 
+	/* Skip index cleanup if user requests to disable */
+	if (!info->vacuumcleanup_requested)
+		return stats;
+
 	/*
 	 * Set up all-zero stats and cleanup pending inserts if ginbulkdelete
 	 * wasn't called
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 7dc8c3d860..883d2e9d8d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -68,6 +68,13 @@ IndexBulkDeleteResult *
 gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted.
+	 */
+	if (!info->will_vacuum_heap)
+		return NULL;
+
 	/* allocate stats if first time through, else re-use existing struct */
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
@@ -83,8 +90,11 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
-	/* No-op in ANALYZE ONLY mode */
-	if (info->analyze_only)
+	/*
+	 * No-op in ANALYZE ONLY mode or when user requests to disable index
+	 * cleanup.
+	 */
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	/*
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index fd4626dfab..07348b39e0 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -478,6 +478,13 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted.
+	 */
+	if (!info->will_vacuum_heap)
+		return NULL;
+
 	tuples_removed = 0;
 	num_index_tuples = 0;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad..d77616a7a1 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -214,6 +214,18 @@ typedef struct LVShared
 	double		reltuples;
 	bool		estimated_count;
 
+	/*
+	 * Copied from LVRelStats. It tells index AM that lazy vacuum will remove
+	 * dead tuples from the heap after index vacuum.
+	 */
+	bool vacuum_heap;
+
+	/*
+	 * Copied from LVRelStats. It tells index AM whether amvacuumcleanup is
+	 * requested or not.
+	 */
+	bool vacuumcleanup_requested;
+
 	/*
 	 * In single process lazy vacuum we could consume more memory during index
 	 * vacuuming or cleanup apart from the memory for heap scanning.  In
@@ -293,8 +305,8 @@ typedef struct LVRelStats
 {
 	char	   *relnamespace;
 	char	   *relname;
-	/* useindex = true means two-pass strategy; false means one-pass */
-	bool		useindex;
+	/* hasindex = true means two-pass strategy; false means one-pass */
+	bool		hasindex;
 	/* Overall statistics about rel */
 	BlockNumber old_rel_pages;	/* previous value of pg_class.relpages */
 	BlockNumber rel_pages;		/* total number of pages */
@@ -313,6 +325,8 @@ typedef struct LVRelStats
 	int			num_index_scans;
 	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
+	bool		vacuum_heap;	/* do we remove dead tuples from the heap? */
+	bool		vacuumcleanup_requested; /* false if INDEX_CLEANUP is off */
 
 	/* Used for error callback */
 	char	   *indname;
@@ -343,6 +357,13 @@ static BufferAccessStrategy vac_strategy;
 static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 						   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
 						   bool aggressive);
+static void choose_vacuum_strategy(Relation onerel, LVRelStats *vacrelstats,
+								   VacuumParams *params, Relation *Irel,
+								   int nindexes, int ndeaditems);
+static void lazy_vacuum_table_and_indexes(Relation onerel, VacuumParams *params,
+										  LVRelStats *vacrelstats, Relation *Irel,
+										  int nindexes, IndexBulkDeleteResult **stats,
+										  LVParallelState *lps, int *maxdeadtups);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
@@ -442,7 +463,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	ErrorContextCallback errcallback;
 
 	Assert(params != NULL);
-	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
 	/* not every AM requires these to be valid, but heap does */
@@ -501,8 +521,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 
 	/* Open all indexes of the relation */
 	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
-	vacrelstats->useindex = (nindexes > 0 &&
-							 params->index_cleanup == VACOPT_TERNARY_ENABLED);
+	vacrelstats->hasindex = (nindexes > 0);
 
 	/*
 	 * Setup error traceback support for ereport().  The idea is to set up an
@@ -763,6 +782,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber empty_pages,
 				vacuumed_pages,
 				next_fsm_block_to_vacuum;
+	int			maxdeadtups = 0;	/* maximum # of dead tuples in a single page */
 	double		num_tuples,		/* total number of nonremovable tuples */
 				live_tuples,	/* live tuples (reltuples estimate) */
 				tups_vacuumed,	/* tuples cleaned up by vacuum */
@@ -811,14 +831,26 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	vacrelstats->nonempty_pages = 0;
 	vacrelstats->latestRemovedXid = InvalidTransactionId;
 
+	/*
+	 * index vacuum cleanup is enabled if index cleanup is not disabled,
+	 * i.e., either default or enabled. For index bulk-deletion, it will
+	 * be decided by choose_vacuum_strategy() when INDEX_CLEANUP option is
+	 * default.
+	 */
+	vacrelstats->vacuumcleanup_requested =
+		(params->index_cleanup != VACOPT_TERNARY_DISABLED);
+
 	vistest = GlobalVisTestFor(onerel);
 
 	/*
 	 * Initialize state for a parallel vacuum.  As of now, only one worker can
 	 * be used for an index, so we invoke parallelism only if there are at
-	 * least two indexes on a table.
+	 * least two indexes on a table. When index cleanup is disabled, we
+	 * disable parallel vacuum, since index bulk-deletion is then likely
+	 * to be a no-op.
 	 */
-	if (params->nworkers >= 0 && vacrelstats->useindex && nindexes > 1)
+	if (params->nworkers >= 0 && nindexes > 1 &&
+		params->index_cleanup != VACOPT_TERNARY_DISABLED)
 	{
 		/*
 		 * Since parallel workers cannot access data in temporary tables, we
@@ -1050,19 +1082,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				vmbuffer = InvalidBuffer;
 			}
 
-			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(onerel, Irel, indstats,
-									vacrelstats, lps, nindexes);
-
-			/* Remove tuples from heap */
-			lazy_vacuum_heap(onerel, vacrelstats);
-
-			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
-			 */
-			dead_tuples->num_tuples = 0;
+			/* Vacuum the table and its indexes */
+			lazy_vacuum_table_and_indexes(onerel, params, vacrelstats,
+										  Irel, nindexes, indstats,
+										  lps, &maxdeadtups);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
@@ -1512,32 +1535,16 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		/*
 		 * If there are no indexes we can vacuum the page right now instead of
-		 * doing a second scan. Also we don't do that but forget dead tuples
-		 * when index cleanup is disabled.
+		 * doing a second scan.
 		 */
-		if (!vacrelstats->useindex && dead_tuples->num_tuples > 0)
+		if (!vacrelstats->hasindex && dead_tuples->num_tuples > 0)
 		{
-			if (nindexes == 0)
-			{
-				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
-				vacuumed_pages++;
-				has_dead_tuples = false;
-			}
-			else
-			{
-				/*
-				 * Here, we have indexes but index cleanup is disabled.
-				 * Instead of vacuuming the dead tuples on the heap, we just
-				 * forget them.
-				 *
-				 * Note that vacrelstats->dead_tuples could have tuples which
-				 * became dead after HOT-pruning but are not marked dead yet.
-				 * We do not process them because it's a very rare condition,
-				 * and the next vacuum will process them anyway.
-				 */
-				Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
-			}
+			Assert(nindexes == 0);
+
+			/* Remove tuples from heap if the table has no index */
+			lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+			vacuumed_pages++;
+			has_dead_tuples = false;
 
 			/*
 			 * Forget the now-vacuumed tuples, and press on, but be careful
@@ -1663,6 +1670,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (dead_tuples->num_tuples == prev_dead_count)
 			RecordPageWithFreeSpace(onerel, blkno, freespace);
+		else
+			maxdeadtups = Max(maxdeadtups,
+							  dead_tuples->num_tuples - prev_dead_count);
 	}
 
 	/* report that everything is scanned and vacuumed */
@@ -1702,14 +1712,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/* If any tuples need to be deleted, perform final vacuum cycle */
 	/* XXX put a threshold on min number of tuples here? */
 	if (dead_tuples->num_tuples > 0)
-	{
-		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats,
-								lps, nindexes);
-
-		/* Remove tuples from heap */
-		lazy_vacuum_heap(onerel, vacrelstats);
-	}
+		lazy_vacuum_table_and_indexes(onerel, params, vacrelstats,
+									  Irel, nindexes, indstats,
+									  lps, &maxdeadtups);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
@@ -1722,7 +1727,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Do post-vacuum cleanup */
-	if (vacrelstats->useindex)
+	if (vacrelstats->hasindex)
 		lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, nindexes);
 
 	/*
@@ -1775,6 +1780,134 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	pfree(buf.data);
 }
 
+/*
+ * Remove the collected garbage tuples from the table and its indexes.
+ */
+static void
+lazy_vacuum_table_and_indexes(Relation onerel, VacuumParams *params,
+							  LVRelStats *vacrelstats, Relation *Irel,
+							  int nindexes, IndexBulkDeleteResult **indstats,
+							  LVParallelState *lps, int *maxdeadtups)
+{
+	/*
+	 * Choose the vacuum strategy for this vacuum cycle.
+	 * choose_vacuum_strategy() will set the decision to
+	 * vacrelstats->vacuum_heap.
+	 */
+	choose_vacuum_strategy(onerel, vacrelstats, params, Irel, nindexes,
+						   *maxdeadtups);
+
+	/* Work on all the indexes, then the heap */
+	lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats, lps,
+							nindexes);
+
+	if (vacrelstats->vacuum_heap)
+	{
+		/* Remove tuples from heap */
+		lazy_vacuum_heap(onerel, vacrelstats);
+	}
+	else
+	{
+		/*
+		 * Here, we don't do heap vacuum in this cycle.
+		 *
+		 * Note that vacrelstats->dead_tuples could have tuples which
+		 * became dead after HOT-pruning but are not marked dead yet.
+		 * We do not process them because it's a very rare condition,
+		 * and the next vacuum will process them anyway.
+		 */
+		Assert(params->index_cleanup != VACOPT_TERNARY_ENABLED);
+	}
+
+	/*
+	 * Forget the now-vacuumed tuples, and press on, but be careful
+	 * not to reset latestRemovedXid since we want that value to be
+	 * valid.
+	 */
+	vacrelstats->dead_tuples->num_tuples = 0;
+	*maxdeadtups = 0;
+}
+
+/*
+ * Decide whether or not we remove the collected garbage tuples from the
+ * heap. The decision is stored in vacrelstats->vacuum_heap. ndeaditems is the
+ * maximum number of LP_DEAD items on any one heap page encountered during
+ * heap scan.
+ */
+static void
+choose_vacuum_strategy(Relation onerel, LVRelStats *vacrelstats,
+					   VacuumParams *params, Relation *Irel, int nindexes,
+					   int ndeaditems)
+{
+	bool vacuum_heap = true;
+
+	/*
+	 * If index cleanup option is specified, we use it.
+	 *
+	 * XXX: should we call amvacuumstrategy even if INDEX_CLEANUP
+	 * is specified?
+	 */
+	if (params->index_cleanup == VACOPT_TERNARY_ENABLED)
+		vacuum_heap = true;
+	else if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		vacuum_heap = false;
+	else
+	{
+		int i;
+
+		/*
+		 * If index cleanup option is not specified, we decide the vacuum
+		 * strategy based on the returned values from amvacuumstrategy.
+		 * If even one index returns 'none', we skip heap vacuum in this
+		 * vacuum cycle.
+		 */
+		for (i = 0; i < nindexes; i++)
+		{
+			IndexVacuumStrategy ivacstrat;
+			IndexVacuumInfo ivinfo;
+
+			ivinfo.index = Irel[i];
+			ivinfo.message_level = elevel;
+
+			ivacstrat = index_vacuum_strategy(&ivinfo);
+
+			if (ivacstrat == INDEX_VACUUM_STRATEGY_NONE)
+			{
+				vacuum_heap = false;
+				break;
+			}
+		}
+
+		if (!vacuum_heap)
+		{
+			Size freespace = RelationGetTargetPageFreeSpace(onerel,
+															HEAP_DEFAULT_FILLFACTOR);
+			int ndeaditems_limit = (int) ((freespace / sizeof(ItemIdData)) * 0.7);
+
+			/*
+			 * The ndeaditems_limit test applies to the maximum number of LP_DEAD
+			 * items found on any one heap page during the caller's heap scan.
+			 * The general idea here is to preserve the original pristine state of
+			 * the table when it is subject to constant non-HOT updates when heap
+			 * fill factor is reduced from its default.
+			 *
+			 * ndeaditems_limit is calculated by using the freespace left by
+			 * fillfactor -- we can fit (freespace / sizeof(ItemIdData)) LP_DEAD
+			 * items on heap pages before they start to "overflow" with that setting.
+			 * We're trying to avoid having VACUUM call lazy_vacuum_heap() in most
+			 * cases, but we don't want to be too aggressive: it would be risky to
+			 * make the value we test for much higher, since it might be too late
+			 * by the time we actually call lazy_vacuum_heap(). So multiplying by 0.7
+			 * is the safety factor.
+			 */
+			if (ndeaditems > ndeaditems_limit)
+				vacuum_heap = true;
+		}
+	}
+
+	vacrelstats->vacuum_heap = vacuum_heap;
+}
+
 /*
  *	lazy_vacuum_all_indexes() -- vacuum all indexes of relation.
  *
@@ -1827,7 +1960,6 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 								 vacrelstats->num_index_scans);
 }
 
-
 /*
  *	lazy_vacuum_heap() -- second pass over the heap
  *
@@ -2120,6 +2252,10 @@ lazy_parallel_vacuum_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
 	 */
 	nworkers = Min(nworkers, lps->pcxt->nworkers);
 
+	/* Copy the information to the shared state */
+	lps->lvshared->vacuum_heap = vacrelstats->vacuum_heap;
+	lps->lvshared->vacuumcleanup_requested = vacrelstats->vacuumcleanup_requested;
+
 	/* Setup the shared cost-based vacuum delay and launch workers */
 	if (nworkers > 0)
 	{
@@ -2444,6 +2580,8 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	ivinfo.message_level = elevel;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vac_strategy;
+	ivinfo.will_vacuum_heap = vacrelstats->vacuum_heap;
+	ivinfo.vacuumcleanup_requested = false;
 
 	/*
 	 * Update error traceback information.
@@ -2461,11 +2599,16 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	*stats = index_bulk_delete(&ivinfo, *stats,
 							   lazy_tid_reaped, (void *) dead_tuples);
 
-	ereport(elevel,
-			(errmsg("scanned index \"%s\" to remove %d row versions",
-					vacrelstats->indname,
-					dead_tuples->num_tuples),
-			 errdetail_internal("%s", pg_rusage_show(&ru0))));
+	/*
+	 * XXX: we don't want to report if ambulkdelete was no-op because of
+	 * will_vacuum_heap. But we cannot know it was or not.
+	 */
+	if (*stats)
+		ereport(elevel,
+				(errmsg("scanned index \"%s\" to remove %d row versions",
+						vacrelstats->indname,
+						dead_tuples->num_tuples),
+				 errdetail_internal("%s", pg_rusage_show(&ru0))));
 
 	/* Revert to the previous phase information for error traceback */
 	restore_vacuum_error_info(vacrelstats, &saved_err_info);
@@ -2495,9 +2638,10 @@ lazy_cleanup_index(Relation indrel,
 	ivinfo.report_progress = false;
 	ivinfo.estimated_count = estimated_count;
 	ivinfo.message_level = elevel;
-
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vac_strategy;
+	ivinfo.will_vacuum_heap = false;
+	ivinfo.vacuumcleanup_requested = vacrelstats->vacuumcleanup_requested;
 
 	/*
 	 * Update error traceback information.
@@ -2844,14 +2988,14 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
  * Return the maximum number of dead tuples we can record.
  */
 static long
-compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
+compute_max_dead_tuples(BlockNumber relblocks, bool hasindex)
 {
 	long		maxtuples;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
 	autovacuum_work_mem != -1 ?
 	autovacuum_work_mem : maintenance_work_mem;
 
-	if (useindex)
+	if (hasindex)
 	{
 		maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
 		maxtuples = Min(maxtuples, INT_MAX);
@@ -2881,7 +3025,7 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 	LVDeadTuples *dead_tuples = NULL;
 	long		maxtuples;
 
-	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->useindex);
+	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->hasindex);
 
 	dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
 	dead_tuples->num_tuples = 0;
@@ -3573,6 +3717,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	vacrelstats.indname = NULL;
 	vacrelstats.phase = VACUUM_ERRCB_PHASE_UNKNOWN; /* Not yet processing */
 
+	vacrelstats.vacuum_heap = lvshared->vacuum_heap;
+	vacrelstats.vacuumcleanup_requested = lvshared->vacuumcleanup_requested;
+
 	/* Setup error traceback support for ereport() */
 	errcallback.callback = vacuum_error_callback;
 	errcallback.arg = &vacrelstats;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 800f7a14b6..c9a177d5e1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -822,6 +822,18 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 		 */
 		result = true;
 	}
+	else if (!info->vacuumcleanup_requested)
+	{
+		/*
+		 * Skip cleanup if INDEX_CLEANUP is set to false, even if there might
+		 * be a deleted page that can be recycled. If INDEX_CLEANUP continues
+		 * to be disabled, recyclable pages could be left by XID wraparound.
+		 * But in practice it's not so harmful, since such a workload doesn't
+		 * need to delete and recycle pages in any case, and deletion of btree index
+		 * pages is relatively rare.
+		 */
+		result = false;
+	}
 	else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
 			 GlobalVisCheckRemovableXid(NULL, metad->btm_oldest_btpo_xact))
 	{
@@ -887,6 +899,13 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Relation	rel = info->index;
 	BTCycleId	cycleid;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted.
+	 */
+	if (!info->will_vacuum_heap)
+		return NULL;
+
 	/* allocate stats if first time through, else re-use existing struct */
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 1df6dfd5da..cc645435ea 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -916,6 +916,13 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 {
 	spgBulkDeleteState bds;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted.
+	 */
+	if (!info->will_vacuum_heap)
+		return NULL;
+
 	/* allocate stats if first time through, else re-use existing struct */
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
@@ -946,8 +953,11 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
 	spgBulkDeleteState bds;
 
-	/* No-op in ANALYZE ONLY mode */
-	if (info->analyze_only)
+	/*
+	 * No-op in ANALYZE ONLY mode or when user requests to disable index
+	 * cleanup.
+	 */
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cffbc0ac38..b0fde8eaad 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3401,6 +3401,8 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.will_vacuum_heap = true;
+	ivinfo.vacuumcleanup_requested = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7295cf0215..5f9960c710 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -668,6 +668,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.will_vacuum_heap = false;
+			ivinfo.vacuumcleanup_requested = true;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index b97d48ee01..079f0a44c9 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1870,17 +1870,20 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	onerelid = onerel->rd_lockInfo.lockRelId;
 	LockRelationIdForSession(&onerelid, lmode);
 
-	/* Set index cleanup option based on reloptions if not yet */
-	if (params->index_cleanup == VACOPT_TERNARY_DEFAULT)
-	{
-		if (onerel->rd_options == NULL ||
-			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
-			params->index_cleanup = VACOPT_TERNARY_ENABLED;
-		else
-			params->index_cleanup = VACOPT_TERNARY_DISABLED;
-	}
+	/*
+	 * Set index cleanup option if vacuum_index_cleanup reloption is set.
+	 * Otherwise we leave it as 'default', which means that we choose vacuum
+	 * strategy based on the table and index status. See choose_vacuum_strategy().
+	 */
+	if (params->index_cleanup == VACOPT_TERNARY_DEFAULT &&
+		onerel->rd_options != NULL)
+		params->index_cleanup =
+			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup;
 
-	/* Set truncate option based on reloptions if not yet */
+	/*
+	 * Set truncate option based on reloptions if not yet. Truncate option
+	 * is true by default.
+	 */
 	if (params->truncate == VACOPT_TERNARY_DEFAULT)
 	{
 		if (onerel->rd_options == NULL ||
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 112b90e4cf..0798f39d39 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -48,6 +48,21 @@ typedef struct IndexVacuumInfo
 	bool		analyze_only;	/* ANALYZE (without any actual vacuum) */
 	bool		report_progress;	/* emit progress.h status reports */
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
+
+	/*
+	 * True if lazy vacuum will delete the collected garbage tuples from
+	 * the heap.  If false, the index AM can skip index bulk-deletion
+	 * safely. This field is used only by ambulkdelete.
+	 */
+	bool		will_vacuum_heap;
+
+	/*
+	 * True if amvacuumcleanup is requested by lazy vacuum. If false, the index
+	 * AM can skip index cleanup; this happens when the INDEX_CLEANUP vacuum
+	 * option is set to false. This field is used only by amvacuumcleanup.
+	 */
+	bool		vacuumcleanup_requested;
+
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 7c62852e7f..038e7cd580 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -563,17 +563,18 @@ do { \
 /*
  * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can
  * fit on one heap page.  (Note that indexes could have more, because they
- * use a smaller tuple header.)  We arrive at the divisor because each tuple
- * must be maxaligned, and it must have an associated line pointer.
+ * use a smaller tuple header.)  We arrive at the divisor because each line
+ * pointer must be maxaligned.
  *
- * Note: with HOT, there could theoretically be more line pointers (not actual
- * tuples) than this on a heap page.  However we constrain the number of line
- * pointers to this anyway, to avoid excessive line-pointer bloat and not
- * require increases in the size of work arrays.
+ * We used to constrain the number of line pointers to avoid excessive
+ * line-pointer bloat and not require increases in the size of work arrays.
+ * But now that the index vacuum strategy has entered the picture, accumulating
+ * LP_DEAD line pointers has value, because it lets us skip index deletion.
+ *
+ * XXX: allowing a heap page to fill up with nothing but line pointers seems like overkill.
  */
 #define MaxHeapTuplesPerPage	\
-	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
-			(MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))
+	((int) ((BLCKSZ - SizeOfPageHeaderData) / (MAXALIGN(sizeof(ItemIdData)))))
 
 /*
  * MaxAttrSize is a somewhat arbitrary upper limit on the declared size of
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 857509287d..c3ea4e1cb8 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -21,6 +21,7 @@
 #include "parser/parse_node.h"
 #include "storage/buf.h"
 #include "storage/lock.h"
+#include "utils/rel.h"
 #include "utils/relcache.h"
 
 /*
@@ -186,19 +187,6 @@ typedef enum VacuumOption
 	VACOPT_DISABLE_PAGE_SKIPPING = 1 << 7	/* don't skip any pages */
 } VacuumOption;
 
-/*
- * A ternary value used by vacuum parameters.
- *
- * DEFAULT value is used to determine the value based on other
- * configurations, e.g. reloptions.
- */
-typedef enum VacOptTernaryValue
-{
-	VACOPT_TERNARY_DEFAULT = 0,
-	VACOPT_TERNARY_DISABLED,
-	VACOPT_TERNARY_ENABLED,
-} VacOptTernaryValue;
-
 /*
  * Parameters customizing behavior of VACUUM and ANALYZE.
  *
@@ -218,8 +206,10 @@ typedef struct VacuumParams
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which  verbose logs are activated, -1
 									 * to use default */
-	VacOptTernaryValue index_cleanup;	/* Do index vacuum and cleanup,
-										 * default value depends on reloptions */
+	VacOptTernaryValue index_cleanup;	/* Do index vacuum and cleanup. In
+										 * default mode, it's decided based on
+										 * multiple factors. See
+										 * choose_vacuum_strategy. */
 	VacOptTernaryValue truncate;	/* Truncate empty pages at the end,
 									 * default value depends on reloptions */
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 10b63982c0..168dc5d466 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -295,6 +295,20 @@ typedef struct AutoVacOpts
 	float8		analyze_scale_factor;
 } AutoVacOpts;
 
+/*
+ * A ternary value used by vacuum parameters. This value is also used
+ * for VACUUM command options.
+ *
+ * DEFAULT value is used to determine the value based on other
+ * configurations, e.g. reloptions.
+ */
+typedef enum VacOptTernaryValue
+{
+	VACOPT_TERNARY_DEFAULT = 0,
+	VACOPT_TERNARY_DISABLED,
+	VACOPT_TERNARY_ENABLED,
+} VacOptTernaryValue;
+
 typedef struct StdRdOptions
 {
 	int32		vl_len_;		/* varlena header (do not touch directly!) */
@@ -304,7 +318,8 @@ typedef struct StdRdOptions
 	AutoVacOpts autovacuum;		/* autovacuum-related options */
 	bool		user_catalog_table; /* use as an additional catalog relation */
 	int			parallel_workers;	/* max number of parallel workers */
-	bool		vacuum_index_cleanup;	/* enables index vacuuming and cleanup */
+	VacOptTernaryValue	vacuum_index_cleanup;	/* enables index vacuuming
+												 * and cleanup */
 	bool		vacuum_truncate;	/* enables vacuum to truncate a relation */
 } StdRdOptions;
 
-- 
2.27.0

0003-Skip-btree-bulkdelete-if-the-index-doesn-t-grow.patch (application/octet-stream)
From 7b72e34e931dca1c0c8ea77f182b33a739dc2eba Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 5 Jan 2021 09:47:49 +0900
Subject: [PATCH 3/3] Skip btree bulkdelete if the index doesn't grow.

In amvacuumstrategy, btree indexes return INDEX_VACUUM_STRATEGY_NONE
if the index hasn't grown since the last bulk-deletion. To remember that,
this change adds a new field to the btree meta page that stores the
number of index blocks at the time of the last bulk-deletion.

XXX: need to upgrade the meta page version.
---
 contrib/pageinspect/Makefile                  |  3 +-
 contrib/pageinspect/btreefuncs.c              |  5 +++
 contrib/pageinspect/pageinspect--1.8--1.9.sql | 22 +++++++++++++
 contrib/pageinspect/pageinspect.control       |  2 +-
 src/backend/access/nbtree/nbtpage.c           |  9 ++++-
 src/backend/access/nbtree/nbtree.c            | 33 ++++++++++++++++++-
 src/backend/access/nbtree/nbtxlog.c           |  1 +
 src/backend/access/rmgrdesc/nbtdesc.c         |  5 +--
 src/include/access/nbtree.h                   |  2 ++
 src/include/access/nbtxlog.h                  |  1 +
 10 files changed, 77 insertions(+), 6 deletions(-)
 create mode 100644 contrib/pageinspect/pageinspect--1.8--1.9.sql

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index d9d8177116..a0760afa4e 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -12,7 +12,8 @@ OBJS = \
 	rawpage.o
 
 EXTENSION = pageinspect
-DATA =  pageinspect--1.7--1.8.sql pageinspect--1.6--1.7.sql \
+DATA = pageinspect--1.8--1.9.sql \
+	pageinspect--1.7--1.8.sql pageinspect--1.6--1.7.sql \
 	pageinspect--1.5.sql pageinspect--1.5--1.6.sql \
 	pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
 	pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 445605db58..94f648118f 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -692,6 +692,11 @@ bt_metap(PG_FUNCTION_ARGS)
 		values[j++] = "f";
 	}
 
+	if (metad->btm_version >= BTREE_VERSION)
+		values[j++] = psprintf(INT64_FORMAT, (int64) metad->btm_last_deletion_nblocks);
+	else
+		values[j++] = "-1";
+
 	tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
 								   values);
 
diff --git a/contrib/pageinspect/pageinspect--1.8--1.9.sql b/contrib/pageinspect/pageinspect--1.8--1.9.sql
new file mode 100644
index 0000000000..bd1752cf35
--- /dev/null
+++ b/contrib/pageinspect/pageinspect--1.8--1.9.sql
@@ -0,0 +1,22 @@
+/* contrib/pageinspect/pageinspect--1.8--1.9.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pageinspect UPDATE TO '1.9'" to load this file. \quit
+
+--
+-- bt_metap()
+--
+DROP FUNCTION bt_metap(text);
+CREATE FUNCTION bt_metap(IN relname text,
+    OUT magic int4,
+    OUT version int4,
+    OUT root int8,
+    OUT level int8,
+    OUT fastroot int8,
+    OUT fastlevel int8,
+    OUT oldest_xact xid,
+    OUT last_cleanup_num_tuples float8,
+    OUT allequalimage boolean,
+    OUT last_deletion_nblocks int8)
+AS 'MODULE_PATHNAME', 'bt_metap'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/contrib/pageinspect/pageinspect.control b/contrib/pageinspect/pageinspect.control
index f8cdf526c6..bd716769a1 100644
--- a/contrib/pageinspect/pageinspect.control
+++ b/contrib/pageinspect/pageinspect.control
@@ -1,5 +1,5 @@
 # pageinspect extension
 comment = 'inspect the contents of database pages at a low level'
-default_version = '1.8'
+default_version = '1.9'
 module_pathname = '$libdir/pageinspect'
 relocatable = true
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 89eb66a8a6..eac78d3b7e 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -76,6 +76,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	metad->btm_allequalimage = allequalimage;
+	metad->btm_last_deletion_nblocks = InvalidBlockNumber;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -115,6 +116,7 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+
 	/* Only a REINDEX can set this field */
 	Assert(!metad->btm_allequalimage);
 	metad->btm_allequalimage = false;
@@ -179,17 +181,20 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	BlockNumber nblocks;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
+	nblocks = RelationGetNumberOfBlocks(rel);
 
 	/* outdated version of metapage always needs rewrite */
 	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		needsRewrite = true;
 	else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
-			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
+			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples ||
+			 metad->btm_last_deletion_nblocks != nblocks)
 		needsRewrite = true;
 
 	if (!needsRewrite)
@@ -211,6 +216,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	/* update cleanup-related information */
 	metad->btm_oldest_btpo_xact = oldestBtpoXact;
 	metad->btm_last_cleanup_num_heap_tuples = numHeapTuples;
+	metad->btm_last_deletion_nblocks = nblocks;
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
@@ -230,6 +236,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
 		md.allequalimage = metad->btm_allequalimage;
+		md.last_deletion_nblocks = metad->btm_last_deletion_nblocks;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9a177d5e1..7409c23a5c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -882,7 +882,38 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 IndexVacuumStrategy
 btvacuumstrategy(IndexVacuumInfo *info)
 {
-	return INDEX_VACUUM_STRATEGY_BULKDELETE;
+	Buffer		metabuf;
+	Page		metapg;
+	BTMetaPageData *metad;
+	IndexVacuumStrategy result = INDEX_VACUUM_STRATEGY_NONE;
+
+	metabuf = _bt_getbuf(info->index, BTREE_METAPAGE, BT_READ);
+	metapg = BufferGetPage(metabuf);
+	metad = BTPageGetMeta(metapg);
+
+	if (metad->btm_version < BTREE_VERSION)
+	{
+		/*
+		 * Do bulk-deletion if metapage needs upgrade, because we don't
+		 * have meta-information yet.
+		 */
+		result = INDEX_VACUUM_STRATEGY_BULKDELETE;
+	}
+	else
+	{
+		BlockNumber	nblocks = RelationGetNumberOfBlocks(info->index);
+
+		/*
+		 * Do bulk-deletion if the index has grown since the last deletion
+		 * or if this is the first time.
+		 */
+		if (!BlockNumberIsValid(metad->btm_last_deletion_nblocks) ||
+			 nblocks > metad->btm_last_deletion_nblocks)
+			result = INDEX_VACUUM_STRATEGY_BULKDELETE;
+	}
+
+	_bt_relbuf(info->index, metabuf);
+	return result;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 45313d924c..65e537211c 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -115,6 +115,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
 	md->btm_allequalimage = xlrec->allequalimage;
+	md->btm_last_deletion_nblocks = xlrec->last_deletion_nblocks;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4c4af9fce0..462838682e 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -110,9 +110,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 
 				xlrec = (xl_btree_metadata *) XLogRecGetBlockData(record, 0,
 																  NULL);
-				appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f",
+				appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f; last_deletion_nblocks: %u",
 								 xlrec->oldest_btpo_xact,
-								 xlrec->last_cleanup_num_heap_tuples);
+								 xlrec->last_cleanup_num_heap_tuples,
+								 xlrec->last_deletion_nblocks);
 				break;
 			}
 	}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index b8247537fd..a56baea310 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -109,6 +109,8 @@ typedef struct BTMetaPageData
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
 	bool		btm_allequalimage;	/* are all columns "equalimage"? */
+	BlockNumber	btm_last_deletion_nblocks;	/* number of blocks during last
+											 * bulk-deletion */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index f5d3e9f5e0..45f01a3dc9 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -55,6 +55,7 @@ typedef struct xl_btree_metadata
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
 	bool		allequalimage;
+	BlockNumber last_deletion_nblocks;
 } xl_btree_metadata;
 
 /*
-- 
2.27.0

#11Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#10)
3 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Jan 5, 2021 at 10:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 29, 2020 at 3:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 29, 2020 at 7:06 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan <pg@bowt.ie> wrote:

I experimented with this today, and I think that it is a good way to
do it. I like the idea of choose_vacuum_strategy() understanding that
heap pages that are subject to many non-HOT updates have a "natural
extra capacity for LP_DEAD items" that it must care about directly (at
least with non-default heap fill factor settings). My early testing
shows that it will often take a surprisingly long time for the most
heavily updated heap page to have more than about 100 LP_DEAD items.

Attached is a rough patch showing what I did here. It was applied on
top of my bottom-up index deletion patch series and your
poc_vacuumstrategy.patch patch. This patch was written as a quick and
dirty way of simulating what I thought would work best for bottom-up
index deletion for one specific benchmark/test, which was
non-hot-update heavy. This consists of a variant pgbench with several
indexes on pgbench_accounts (almost the same as most other bottom-up
deletion benchmarks I've been running). Only one index is "logically
modified" by the updates, but of course we still physically modify all
indexes on every update. I set fill factor to 90 for this benchmark,
which is an important factor for how your VACUUM patch works during
the benchmark.

This rough supplementary patch includes VACUUM logic that assumes (but
doesn't check) that the table has heap fill factor set to 90 -- see my
changes to choose_vacuum_strategy(). This benchmark is really about
stability over time more than performance (though performance is also
improved significantly). I wanted to keep both the table/heap and the
logically unmodified indexes (i.e. 3 out of 4 indexes on
pgbench_accounts) exactly the same size *forever*.

Does this make sense?

Thank you for sharing the patch. That makes sense.

+        if (!vacuum_heap)
+        {
+            if (maxdeadpage > 130 ||
+                /* Also check if maintenance_work_mem space is running out */
+                vacrelstats->dead_tuples->num_tuples >
+                vacrelstats->dead_tuples->max_tuples / 2)
+                vacuum_heap = true;
+        }

The second test, checking whether maintenance_work_mem space is running out,
also makes sense to me. Perhaps another idea would be to compare the
number of collected garbage tuples to the total number of heap tuples
so that we do lazy_vacuum_heap() only when we're likely to reclaim a
certain amount of garbage in the table.

Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that
pg_stat_database shows an almost ~28% reduction in blks_read after an
overnight run for the patch series (it was 508,820,699 for the
patches, 705,282,975 for the master branch). I think that the VACUUM
component is responsible for some of that reduction. There were 11
VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap()
(these 7 VACUUM operations all only had a btbulkdelete() call for the
one problematic index on the table, named "abalance_ruin", which my
supplementary patch has hard-coded knowledge of).

That's a very good result in terms of skipping lazy_vacuum_heap(). How
much did the table and indexes bloat? Also, I'm curious about which
test in choose_vacuum_strategy() turned vacuum_heap on: the 130 test or
the test for maintenance_work_mem space running out? And what was the
impact on clearing all-visible bits?

I merged these patches and polished the result.

In the 0002 patch, we calculate how many LP_DEAD items can be
accumulated in the space left on a single heap page by fillfactor. I
increased MaxHeapTuplesPerPage so that we can accumulate more LP_DEAD
items on a heap page, because otherwise accumulating LP_DEAD items
unnecessarily constrains the number of heap tuples on a single page,
especially with small tuples, as I mentioned before. Previously, we
constrained the number of line pointers to avoid excessive
line-pointer bloat and to avoid increasing the size of the work
array. However, once the amvacuumstrategy machinery enters the picture,
accumulating line pointers has value. Also, we might want to store the
value returned by amvacuumstrategy so that the index AM can refer to it
during index deletion.
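
For illustration, here is the back-of-the-envelope arithmetic behind that
change (the concrete numbers assume 8 kB pages, 8-byte MAXALIGN, a 24-byte
page header, 4-byte line pointers, and a 24-byte maxaligned heap tuple
header; they are not spelled out in the patch itself):

/* usable space on a heap page: BLCKSZ - SizeOfPageHeaderData */
#define USABLE_SPACE    (8192 - 24)                 /* 8168 bytes */

/* old limit: each item needs a maxaligned tuple header plus a line pointer */
#define OLD_MAX_ITEMS   (USABLE_SPACE / (24 + 4))   /* 291 items */

/* new limit: only the maxaligned line pointer is counted */
#define NEW_MAX_ITEMS   (USABLE_SPACE / 8)          /* 1021 items */

/*
 * With heap fillfactor = 90, inserts leave roughly 10% of the page free,
 * i.e. about 819 bytes, which is room for roughly 200 LP_DEAD line
 * pointers (819 / 4) before the page is genuinely full.
 */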

The 0003 patch has btree indexes skip bulk-deletion if the index
hasn't grown since the last bulk-deletion. I stored the number of blocks
in the meta page but didn't implement meta page upgrading.

After more thought, I think that ambulkdelete needs to be able to
refer to the answer it gave via amvacuumstrategy. That way, the index can
skip bulk-deletion when lazy vacuum doesn't vacuum the heap and the index
also doesn't want to do it.
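
To make that concrete, here is a minimal sketch (not code from the attached
patches; xxbulkdelete and stored_strategy_for are placeholder names, while
will_vacuum_heap is the IndexVacuumInfo field added by the 0002 patch):

IndexBulkDeleteResult *
xxbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
             IndexBulkDeleteCallback callback, void *callback_state)
{
    /*
     * If lazy vacuum is not going to remove the dead tuples from the heap
     * anyway, and this index answered INDEX_VACUUM_STRATEGY_NONE earlier,
     * skip the physical index scan; the dead TIDs stay reapable in a
     * later vacuum cycle.
     */
    if (!info->will_vacuum_heap &&
        stored_strategy_for(info->index) == INDEX_VACUUM_STRATEGY_NONE)
        return stats;

    /* ... otherwise fall through to the normal bulk-deletion scan ... */
    return stats;
}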

I've attached the updated version of the patch, which includes the following changes:

* Store the answers to amvacuumstrategy in either local memory or DSM
(in the parallel vacuum case) so that ambulkdelete can refer to the
answer from amvacuumstrategy (a rough sketch of this idea follows below).
* Fix regression failures.
* Update the documentation and comments.
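
As a rough sketch of the first item (the struct and function names here are
made up for illustration; in the parallel vacuum case the array lives in DSM
rather than backend-local memory):

typedef struct LVIndStrategies
{
    int                 nindexes;
    IndexVacuumStrategy strategies[FLEXIBLE_ARRAY_MEMBER];
} LVIndStrategies;

static LVIndStrategies *
collect_index_vacuum_strategies(Relation *Irel, int nindexes,
                                IndexVacuumInfo *ivinfo, VacuumParams *params)
{
    LVIndStrategies *ivs;

    ivs = palloc(offsetof(LVIndStrategies, strategies) +
                 nindexes * sizeof(IndexVacuumStrategy));
    ivs->nindexes = nindexes;

    /* ask every index for its strategy before deciding what to do */
    for (int i = 0; i < nindexes; i++)
    {
        ivinfo->index = Irel[i];
        ivs->strategies[i] = index_vacuum_strategy(ivinfo, params);
    }

    return ivs;
}

Later, when lazy vacuum (or a parallel worker, via the DSM copy) calls
ambulkdelete for index i, it can pass strategies[i] back down so that the
AM can act on its own earlier answer.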

Note that the 0003 patch is still of PoC quality, lacking the btree meta page
version upgrade.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

Attachments:

v2-0003-Skip-btree-bulkdelete-if-the-index-doesn-t-grow.patch (application/octet-stream)
From aa09db083ddb9efa56eb3e37efd70ad95384ecc7 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 5 Jan 2021 09:47:49 +0900
Subject: [PATCH v2 3/3] Skip btree bulkdelete if the index doesn't grow.

In amvacuumstrategy, btree indexes return INDEX_VACUUM_STRATEGY_NONE
if the index hasn't grown since the last bulk-deletion. To remember that,
this change adds a new field to the btree meta page that stores the
number of index blocks at the time of the last bulk-deletion.

XXX: need to upgrade the meta page version.
---
 contrib/pageinspect/btreefuncs.c              |  5 +++
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.8--1.9.sql | 18 +++++++++
 src/backend/access/nbtree/nbtpage.c           |  9 ++++-
 src/backend/access/nbtree/nbtree.c            | 40 +++++++++++++++++--
 src/backend/access/nbtree/nbtxlog.c           |  1 +
 src/backend/access/rmgrdesc/nbtdesc.c         |  5 ++-
 src/include/access/nbtree.h                   |  2 +
 src/include/access/nbtxlog.h                  |  1 +
 9 files changed, 77 insertions(+), 7 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 445605db58..94f648118f 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -692,6 +692,11 @@ bt_metap(PG_FUNCTION_ARGS)
 		values[j++] = "f";
 	}
 
+	if (metad->btm_version >= BTREE_VERSION)
+		values[j++] = psprintf(INT64_FORMAT, (int64) metad->btm_last_deletion_nblocks);
+	else
+		values[j++] = "-1";
+
 	tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
 								   values);
 
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 17bf0c5470..5362bcb475 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -3,7 +3,7 @@ INSERT INTO test1 VALUES (72057594037927937, 'text');
 CREATE INDEX test1_a_idx ON test1 USING btree (a);
 \x
 SELECT * FROM bt_metap('test1_a_idx');
--[ RECORD 1 ]-----------+-------
+-[ RECORD 1 ]-----------+-----------
 magic                   | 340322
 version                 | 4
 root                    | 1
@@ -13,6 +13,7 @@ fastlevel               | 0
 oldest_xact             | 0
 last_cleanup_num_tuples | -1
 allequalimage           | t
+last_deletion_nblocks   | 4294967295
 
 SELECT * FROM bt_page_stats('test1_a_idx', 0);
 ERROR:  block 0 is a meta page
diff --git a/contrib/pageinspect/pageinspect--1.8--1.9.sql b/contrib/pageinspect/pageinspect--1.8--1.9.sql
index 9dc342fabc..a87b74ce2a 100644
--- a/contrib/pageinspect/pageinspect--1.8--1.9.sql
+++ b/contrib/pageinspect/pageinspect--1.8--1.9.sql
@@ -39,3 +39,21 @@ CREATE FUNCTION gist_page_items(IN page bytea,
 RETURNS SETOF record
 AS 'MODULE_PATHNAME', 'gist_page_items'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_metap()
+--
+DROP FUNCTION bt_metap(text);
+CREATE FUNCTION bt_metap(IN relname text,
+    OUT magic int4,
+    OUT version int4,
+    OUT root int8,
+    OUT level int8,
+    OUT fastroot int8,
+    OUT fastlevel int8,
+    OUT oldest_xact xid,
+    OUT last_cleanup_num_tuples float8,
+    OUT allequalimage boolean,
+    OUT last_deletion_nblocks int8)
+AS 'MODULE_PATHNAME', 'bt_metap'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index e230f912c2..d686f25a7a 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -82,6 +82,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	metad->btm_allequalimage = allequalimage;
+	metad->btm_last_deletion_nblocks = InvalidBlockNumber;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -121,6 +122,7 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+
 	/* Only a REINDEX can set this field */
 	Assert(!metad->btm_allequalimage);
 	metad->btm_allequalimage = false;
@@ -185,17 +187,20 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	BlockNumber nblocks;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
+	nblocks = RelationGetNumberOfBlocks(rel);
 
 	/* outdated version of metapage always needs rewrite */
 	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		needsRewrite = true;
 	else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
-			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
+			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples ||
+			 metad->btm_last_deletion_nblocks != nblocks)
 		needsRewrite = true;
 
 	if (!needsRewrite)
@@ -217,6 +222,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	/* update cleanup-related information */
 	metad->btm_oldest_btpo_xact = oldestBtpoXact;
 	metad->btm_last_cleanup_num_heap_tuples = numHeapTuples;
+	metad->btm_last_deletion_nblocks = nblocks;
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
@@ -236,6 +242,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
 		md.allequalimage = metad->btm_allequalimage;
+		md.last_deletion_nblocks = metad->btm_last_deletion_nblocks;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e00e5fe0a4..56162cf41c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -878,16 +878,50 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 }
 
 /*
- * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
- * is specified to off.
+ * Choose the vacuum strategy: do bulk-deletion, or nothing.
  */
 IndexVacuumStrategy
 btvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
 {
+	Buffer		metabuf;
+	Page		metapg;
+	BTMetaPageData *metad;
+	IndexVacuumStrategy result = INDEX_VACUUM_STRATEGY_NONE;
+
+	/*
+	 * Don't do bulk-deletion if index cleanup is disabled by user
+	 * request.
+	 */
 	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
 		return INDEX_VACUUM_STRATEGY_NONE;
+
+	metabuf = _bt_getbuf(info->index, BTREE_METAPAGE, BT_READ);
+	metapg = BufferGetPage(metabuf);
+	metad = BTPageGetMeta(metapg);
+
+	if (metad->btm_version < BTREE_VERSION)
+	{
+		/*
+		 * Do bulk-deletion if metapage needs upgrade, because we don't
+		 * have meta-information yet.
+		 */
+		result = INDEX_VACUUM_STRATEGY_BULKDELETE;
+	}
 	else
-		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+	{
+		BlockNumber	nblocks = RelationGetNumberOfBlocks(info->index);
+
+		/*
+		 * Do bulk-deletion if the index has grown since the last deletion,
+		 * even by one block, or if this is the first time.
+		 */
+		if (!BlockNumberIsValid(metad->btm_last_deletion_nblocks) ||
+			 nblocks > metad->btm_last_deletion_nblocks)
+			result = INDEX_VACUUM_STRATEGY_BULKDELETE;
+	}
+
+	_bt_relbuf(info->index, metabuf);
+	return result;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c1d578cc01..37546f566d 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -115,6 +115,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
 	md->btm_allequalimage = xlrec->allequalimage;
+	md->btm_last_deletion_nblocks = xlrec->last_deletion_nblocks;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 6e0d6a2b72..4e58b0bc07 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -110,9 +110,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 
 				xlrec = (xl_btree_metadata *) XLogRecGetBlockData(record, 0,
 																  NULL);
-				appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f",
+				appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f; last_deletion_nblocks: %u",
 								 xlrec->oldest_btpo_xact,
-								 xlrec->last_cleanup_num_heap_tuples);
+								 xlrec->last_cleanup_num_heap_tuples,
+								 xlrec->last_deletion_nblocks);
 				break;
 			}
 	}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ba120d4a80..f116e29735 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -110,6 +110,8 @@ typedef struct BTMetaPageData
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
 	bool		btm_allequalimage;	/* are all columns "equalimage"? */
+	BlockNumber	btm_last_deletion_nblocks;	/* number of blocks during last
+											 * bulk-deletion */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 7ae5c98c2b..bc0c52a779 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -55,6 +55,7 @@ typedef struct xl_btree_metadata
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
 	bool		allequalimage;
+	BlockNumber last_deletion_nblocks;
 } xl_btree_metadata;
 
 /*
-- 
2.27.0

v2-0003-PoC-skip-btree-bulkdelete-if-the-index-doesn-t-gr.patch (application/octet-stream)
From 551f645ec91ed721f1b9c79b79235472e59a3c4d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 5 Jan 2021 09:47:49 +0900
Subject: [PATCH v2 3/3] PoC: skip btree bulkdelete if the index doesn't grow.

In amvacuumstrategy, btree indexes return INDEX_VACUUM_STRATEGY_NONE
if the index hasn't grown since the last bulk-deletion. To remember that,
this change adds a new field to the btree meta page that stores the
number of index blocks at the time of the last bulk-deletion.

XXX: need to upgrade the meta page version.
---
 contrib/pageinspect/btreefuncs.c              |  5 +++
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.8--1.9.sql | 18 +++++++++
 src/backend/access/nbtree/nbtpage.c           |  9 ++++-
 src/backend/access/nbtree/nbtree.c            | 40 +++++++++++++++++--
 src/backend/access/nbtree/nbtxlog.c           |  1 +
 src/backend/access/rmgrdesc/nbtdesc.c         |  5 ++-
 src/include/access/nbtree.h                   |  2 +
 src/include/access/nbtxlog.h                  |  1 +
 9 files changed, 77 insertions(+), 7 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 445605db58..94f648118f 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -692,6 +692,11 @@ bt_metap(PG_FUNCTION_ARGS)
 		values[j++] = "f";
 	}
 
+	if (metad->btm_version >= BTREE_VERSION)
+		values[j++] = psprintf(INT64_FORMAT, (int64) metad->btm_last_deletion_nblocks);
+	else
+		values[j++] = "-1";
+
 	tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
 								   values);
 
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 17bf0c5470..5362bcb475 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -3,7 +3,7 @@ INSERT INTO test1 VALUES (72057594037927937, 'text');
 CREATE INDEX test1_a_idx ON test1 USING btree (a);
 \x
 SELECT * FROM bt_metap('test1_a_idx');
--[ RECORD 1 ]-----------+-------
+-[ RECORD 1 ]-----------+-----------
 magic                   | 340322
 version                 | 4
 root                    | 1
@@ -13,6 +13,7 @@ fastlevel               | 0
 oldest_xact             | 0
 last_cleanup_num_tuples | -1
 allequalimage           | t
+last_deletion_nblocks   | 4294967295
 
 SELECT * FROM bt_page_stats('test1_a_idx', 0);
 ERROR:  block 0 is a meta page
diff --git a/contrib/pageinspect/pageinspect--1.8--1.9.sql b/contrib/pageinspect/pageinspect--1.8--1.9.sql
index 9dc342fabc..a87b74ce2a 100644
--- a/contrib/pageinspect/pageinspect--1.8--1.9.sql
+++ b/contrib/pageinspect/pageinspect--1.8--1.9.sql
@@ -39,3 +39,21 @@ CREATE FUNCTION gist_page_items(IN page bytea,
 RETURNS SETOF record
 AS 'MODULE_PATHNAME', 'gist_page_items'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_metap()
+--
+DROP FUNCTION bt_metap(text);
+CREATE FUNCTION bt_metap(IN relname text,
+    OUT magic int4,
+    OUT version int4,
+    OUT root int8,
+    OUT level int8,
+    OUT fastroot int8,
+    OUT fastlevel int8,
+    OUT oldest_xact xid,
+    OUT last_cleanup_num_tuples float8,
+    OUT allequalimage boolean,
+    OUT last_deletion_nblocks int8)
+AS 'MODULE_PATHNAME', 'bt_metap'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index e230f912c2..d686f25a7a 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -82,6 +82,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	metad->btm_allequalimage = allequalimage;
+	metad->btm_last_deletion_nblocks = InvalidBlockNumber;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -121,6 +122,7 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+
 	/* Only a REINDEX can set this field */
 	Assert(!metad->btm_allequalimage);
 	metad->btm_allequalimage = false;
@@ -185,17 +187,20 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	BlockNumber nblocks;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
+	nblocks = RelationGetNumberOfBlocks(rel);
 
 	/* outdated version of metapage always needs rewrite */
 	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		needsRewrite = true;
 	else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
-			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
+			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples ||
+			 metad->btm_last_deletion_nblocks != nblocks)
 		needsRewrite = true;
 
 	if (!needsRewrite)
@@ -217,6 +222,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	/* update cleanup-related information */
 	metad->btm_oldest_btpo_xact = oldestBtpoXact;
 	metad->btm_last_cleanup_num_heap_tuples = numHeapTuples;
+	metad->btm_last_deletion_nblocks = nblocks;
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
@@ -236,6 +242,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
 		md.allequalimage = metad->btm_allequalimage;
+		md.last_deletion_nblocks = metad->btm_last_deletion_nblocks;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e00e5fe0a4..56162cf41c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -878,16 +878,50 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 }
 
 /*
- * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
- * is specified to off.
+ * Choose the vacuum strategy: do bulk-deletion, or nothing.
  */
 IndexVacuumStrategy
 btvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
 {
+	Buffer		metabuf;
+	Page		metapg;
+	BTMetaPageData *metad;
+	IndexVacuumStrategy result = INDEX_VACUUM_STRATEGY_NONE;
+
+	/*
+	 * Don't do bulk-deletion if index cleanup is disabled by user
+	 * request.
+	 */
 	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
 		return INDEX_VACUUM_STRATEGY_NONE;
+
+	metabuf = _bt_getbuf(info->index, BTREE_METAPAGE, BT_READ);
+	metapg = BufferGetPage(metabuf);
+	metad = BTPageGetMeta(metapg);
+
+	if (metad->btm_version < BTREE_VERSION)
+	{
+		/*
+		 * Do bulk-deletion if metapage needs upgrade, because we don't
+		 * have meta-information yet.
+		 */
+		result = INDEX_VACUUM_STRATEGY_BULKDELETE;
+	}
 	else
-		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+	{
+		BlockNumber	nblocks = RelationGetNumberOfBlocks(info->index);
+
+		/*
+		 * Do bulk-deletion if the index has grown since the last deletion,
+		 * even by one block, or if this is the first time.
+		 */
+		if (!BlockNumberIsValid(metad->btm_last_deletion_nblocks) ||
+			 nblocks > metad->btm_last_deletion_nblocks)
+			result = INDEX_VACUUM_STRATEGY_BULKDELETE;
+	}
+
+	_bt_relbuf(info->index, metabuf);
+	return result;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c1d578cc01..37546f566d 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -115,6 +115,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
 	md->btm_allequalimage = xlrec->allequalimage;
+	md->btm_last_deletion_nblocks = xlrec->last_deletion_nblocks;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 6e0d6a2b72..4e58b0bc07 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -110,9 +110,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 
 				xlrec = (xl_btree_metadata *) XLogRecGetBlockData(record, 0,
 																  NULL);
-				appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f",
+				appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f; last_deletion_nblocks: %u",
 								 xlrec->oldest_btpo_xact,
-								 xlrec->last_cleanup_num_heap_tuples);
+								 xlrec->last_cleanup_num_heap_tuples,
+								 xlrec->last_deletion_nblocks);
 				break;
 			}
 	}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ba120d4a80..f116e29735 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -110,6 +110,8 @@ typedef struct BTMetaPageData
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
 	bool		btm_allequalimage;	/* are all columns "equalimage"? */
+	BlockNumber	btm_last_deletion_nblocks;	/* number of blocks during last
+											 * bulk-deletion */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 7ae5c98c2b..bc0c52a779 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -55,6 +55,7 @@ typedef struct xl_btree_metadata
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
 	bool		allequalimage;
+	BlockNumber last_deletion_nblocks;
 } xl_btree_metadata;
 
 /*
-- 
2.27.0

v2-0001-Introduce-IndexAM-API-for-choosing-index-vacuum-s.patch (application/octet-stream)
From 57c5c5199255227bc75ed7ba1b2a5f70136f17f8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 4 Jan 2021 13:34:10 +0900
Subject: [PATCH v2 1/3] Introduce IndexAM API for choosing index vacuum
 strategy.

---
 contrib/bloom/bloom.h                 |  2 ++
 contrib/bloom/blutils.c               |  1 +
 contrib/bloom/blvacuum.c              | 13 +++++++++++++
 src/backend/access/brin/brin.c        |  1 +
 src/backend/access/gin/ginutil.c      |  1 +
 src/backend/access/gin/ginvacuum.c    | 13 +++++++++++++
 src/backend/access/gist/gist.c        |  1 +
 src/backend/access/gist/gistvacuum.c  | 13 +++++++++++++
 src/backend/access/hash/hash.c        | 14 ++++++++++++++
 src/backend/access/index/indexam.c    | 22 ++++++++++++++++++++++
 src/backend/access/nbtree/nbtree.c    | 14 ++++++++++++++
 src/backend/access/spgist/spgutils.c  |  1 +
 src/backend/access/spgist/spgvacuum.c | 13 +++++++++++++
 src/include/access/amapi.h            |  7 ++++++-
 src/include/access/genam.h            | 16 ++++++++++++++--
 src/include/access/gin_private.h      |  2 ++
 src/include/access/gist_private.h     |  2 ++
 src/include/access/hash.h             |  2 ++
 src/include/access/nbtree.h           |  2 ++
 src/include/access/spgist.h           |  2 ++
 20 files changed, 139 insertions(+), 3 deletions(-)

diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index a22a6dfa40..8395d31450 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -202,6 +202,8 @@ extern void blendscan(IndexScanDesc scan);
 extern IndexBuildResult *blbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void blbuildempty(Relation index);
+extern IndexVacuumStrategy blvacuumstrategy(IndexVacuumInfo *info,
+											struct VacuumParams *params);
 extern IndexBulkDeleteResult *blbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
 										   void *callback_state);
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 1e505b1da5..8098d75c82 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -131,6 +131,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = blbuild;
 	amroutine->ambuildempty = blbuildempty;
 	amroutine->aminsert = blinsert;
+	amroutine->amvacuumstrategy = blvacuumstrategy;
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
index 88b0a6d290..b5b8df34ed 100644
--- a/contrib/bloom/blvacuum.c
+++ b/contrib/bloom/blvacuum.c
@@ -23,6 +23,19 @@
 #include "storage/lmgr.h"
 
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+blvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 27ba596c6e..181dc51268 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -112,6 +112,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = brinbuild;
 	amroutine->ambuildempty = brinbuildempty;
 	amroutine->aminsert = brininsert;
+	amroutine->amvacuumstrategy = NULL;
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf42..fc375332fc 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -63,6 +63,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = ginbuild;
 	amroutine->ambuildempty = ginbuildempty;
 	amroutine->aminsert = gininsert;
+	amroutine->amvacuumstrategy = ginvacuumstrategy;
 	amroutine->ambulkdelete = ginbulkdelete;
 	amroutine->amvacuumcleanup = ginvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 35b85a9bff..985fb27ba1 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -560,6 +560,19 @@ ginVacuumEntryPage(GinVacuumState *gvs, Buffer buffer, BlockNumber *roots, uint3
 	return (tmppage == origpage) ? NULL : tmppage;
 }
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+ginvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 IndexBulkDeleteResult *
 ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			  IndexBulkDeleteCallback callback, void *callback_state)
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 992936cfa8..6d047b9f87 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,6 +84,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = gistbuild;
 	amroutine->ambuildempty = gistbuildempty;
 	amroutine->aminsert = gistinsert;
+	amroutine->amvacuumstrategy = gistvacuumstrategy;
 	amroutine->ambulkdelete = gistbulkdelete;
 	amroutine->amvacuumcleanup = gistvacuumcleanup;
 	amroutine->amcanreturn = gistcanreturn;
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 94a7e12763..d462984b3d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -52,6 +52,19 @@ static bool gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						   Buffer buffer, OffsetNumber downlink,
 						   Buffer leafBuffer);
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+gistvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * VACUUM bulkdelete stage: remove index entries.
  */
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a9..fb439d23a8 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -81,6 +81,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = hashbuild;
 	amroutine->ambuildempty = hashbuildempty;
 	amroutine->aminsert = hashinsert;
+	amroutine->amvacuumstrategy = hashvacuumstrategy;
 	amroutine->ambulkdelete = hashbulkdelete;
 	amroutine->amvacuumcleanup = hashvacuumcleanup;
 	amroutine->amcanreturn = NULL;
@@ -444,6 +445,19 @@ hashendscan(IndexScanDesc scan)
 	scan->opaque = NULL;
 }
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+hashvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 3d2dbed708..171ba5c2fa 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -678,6 +678,28 @@ index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap)
 	return ntids;
 }
 
+/* ----------------
+ *		index_vacuum_strategy - ask the index AM for its vacuum strategy
+ *
+ * This callback routine is called just before vacuuming the heap.
+ * Returns an IndexVacuumStrategy value that tells lazy vacuum whether
+ * the index wants to do bulk-deletion.
+ * ----------------
+ */
+IndexVacuumStrategy
+index_vacuum_strategy(IndexVacuumInfo *info, struct VacuumParams *params)
+{
+	Relation	indexRelation = info->index;
+
+	RELATION_CHECKS;
+
+	/* amvacuumstrategy is optional; assume bulk-deletion */
+	if (indexRelation->rd_indam->amvacuumstrategy == NULL)
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+
+	return indexRelation->rd_indam->amvacuumstrategy(info, params);
+}
+
 /* ----------------
  *		index_bulk_delete - do mass deletion of index entries
  *
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 289bd3c15d..863430f910 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -133,6 +133,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = btbuild;
 	amroutine->ambuildempty = btbuildempty;
 	amroutine->aminsert = btinsert;
+	amroutine->amvacuumstrategy = btvacuumstrategy;
 	amroutine->ambulkdelete = btbulkdelete;
 	amroutine->amvacuumcleanup = btvacuumcleanup;
 	amroutine->amcanreturn = btcanreturn;
@@ -864,6 +865,19 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	return result;
 }
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+btvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index d8b1815061..7b2313590a 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -66,6 +66,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = spgbuild;
 	amroutine->ambuildempty = spgbuildempty;
 	amroutine->aminsert = spginsert;
+	amroutine->amvacuumstrategy = spgvacuumstrategy;
 	amroutine->ambulkdelete = spgbulkdelete;
 	amroutine->amvacuumcleanup = spgvacuumcleanup;
 	amroutine->amcanreturn = spgcanreturn;
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0d02a02222..5de6dd0fdf 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -894,6 +894,19 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	bds->stats->pages_free = bds->stats->pages_deleted;
 }
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+spgvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index d357ebb559..548f2033a4 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -22,8 +22,9 @@
 struct PlannerInfo;
 struct IndexPath;
 
-/* Likewise, this file shouldn't depend on execnodes.h. */
+/* Likewise, this file shouldn't depend on execnodes.h and vacuum.h. */
 struct IndexInfo;
+struct VacuumParams;
 
 
 /*
@@ -112,6 +113,9 @@ typedef bool (*aminsert_function) (Relation indexRelation,
 								   IndexUniqueCheck checkUnique,
 								   bool indexUnchanged,
 								   struct IndexInfo *indexInfo);
+/* vacuum strategy */
+typedef IndexVacuumStrategy (*amvacuumstrategy_function) (IndexVacuumInfo *info,
+														  struct VacuumParams *params);
 
 /* bulk delete */
 typedef IndexBulkDeleteResult *(*ambulkdelete_function) (IndexVacuumInfo *info,
@@ -259,6 +263,7 @@ typedef struct IndexAmRoutine
 	ambuild_function ambuild;
 	ambuildempty_function ambuildempty;
 	aminsert_function aminsert;
+	amvacuumstrategy_function amvacuumstrategy;
 	ambulkdelete_function ambulkdelete;
 	amvacuumcleanup_function amvacuumcleanup;
 	amcanreturn_function amcanreturn;	/* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 0eab1508d3..6c1c4798e3 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -21,8 +21,9 @@
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
 
-/* We don't want this file to depend on execnodes.h. */
+/* We don't want this file to depend on execnodes.h and vacuum.h. */
 struct IndexInfo;
+struct VacuumParams;
 
 /*
  * Struct for statistics returned by ambuild
@@ -34,7 +35,8 @@ typedef struct IndexBuildResult
 } IndexBuildResult;
 
 /*
- * Struct for input arguments passed to ambulkdelete and amvacuumcleanup
+ * Struct for input arguments passed to amvacuumstrategy, ambulkdelete
+ * and amvacuumcleanup
  *
  * num_heap_tuples is accurate only when estimated_count is false;
  * otherwise it's just an estimate (currently, the estimate is the
@@ -125,6 +127,14 @@ typedef struct IndexOrderByDistance
 	bool		isnull;
 } IndexOrderByDistance;
 
+/* Result value for amvacuumstrategy */
+typedef enum IndexVacuumStrategy
+{
+	INDEX_VACUUM_STRATEGY_NONE,			/* No-op, skip bulk-deletion in this
+										 * vacuum cycle */
+	INDEX_VACUUM_STRATEGY_BULKDELETE	/* Do ambulkdelete */
+} IndexVacuumStrategy;
+
 /*
  * generalized index_ interface routines (in indexam.c)
  */
@@ -174,6 +184,8 @@ extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 							   struct TupleTableSlot *slot);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
+extern IndexVacuumStrategy index_vacuum_strategy(IndexVacuumInfo *info,
+												 struct VacuumParams *params);
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
 												IndexBulkDeleteResult *stats,
 												IndexBulkDeleteCallback callback,
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 670a40b4be..5c48a48917 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -397,6 +397,8 @@ extern int64 gingetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void ginInitConsistentFunction(GinState *ginstate, GinScanKey key);
 
 /* ginvacuum.c */
+extern IndexVacuumStrategy ginvacuumstrategy(IndexVacuumInfo *info,
+											 struct VacuumParams *params);
 extern IndexBulkDeleteResult *ginbulkdelete(IndexVacuumInfo *info,
 											IndexBulkDeleteResult *stats,
 											IndexBulkDeleteCallback callback,
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 553d364e2d..303a18da4d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -533,6 +533,8 @@ extern void gistMakeUnionKey(GISTSTATE *giststate, int attno,
 extern XLogRecPtr gistGetFakeLSN(Relation rel);
 
 /* gistvacuum.c */
+extern IndexVacuumStrategy gistvacuumstrategy(IndexVacuumInfo *info,
+											  struct VacuumParams *params);
 extern IndexBulkDeleteResult *gistbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
 											 IndexBulkDeleteCallback callback,
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 1cce865be2..4c7e064708 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -372,6 +372,8 @@ extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
 extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					   ScanKey orderbys, int norderbys);
 extern void hashendscan(IndexScanDesc scan);
+extern IndexVacuumStrategy hashvacuumstrategy(IndexVacuumInfo *info,
+											  struct VacuumParams *params);
 extern IndexBulkDeleteResult *hashbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
 											 IndexBulkDeleteCallback callback,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index cad4f2bdeb..ba120d4a80 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1011,6 +1011,8 @@ extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
+extern IndexVacuumStrategy btvacuumstrategy(IndexVacuumInfo *info,
+											struct VacuumParams *params);
 extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats,
 										   IndexBulkDeleteCallback callback,
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index 2eb2f421a8..f591b21ef1 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -212,6 +212,8 @@ extern bool spggettuple(IndexScanDesc scan, ScanDirection dir);
 extern bool spgcanreturn(Relation index, int attno);
 
 /* spgvacuum.c */
+extern IndexVacuumStrategy spgvacuumstrategy(IndexVacuumInfo *info,
+											 struct VacuumParams *params);
 extern IndexBulkDeleteResult *spgbulkdelete(IndexVacuumInfo *info,
 											IndexBulkDeleteResult *stats,
 											IndexBulkDeleteCallback callback,
-- 
2.27.0

#12Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#11)
3 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Jan 18, 2021 at 2:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jan 5, 2021 at 10:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 29, 2020 at 3:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 29, 2020 at 7:06 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan <pg@bowt.ie> wrote:

I experimented with this today, and I think that it is a good way to
do it. I like the idea of choose_vacuum_strategy() understanding that
heap pages that are subject to many non-HOT updates have a "natural
extra capacity for LP_DEAD items" that it must care about directly (at
least with non-default heap fill factor settings). My early testing
shows that it will often take a surprisingly long time for the most
heavily updated heap page to have more than about 100 LP_DEAD items.

Attached is a rough patch showing what I did here. It was applied on
top of my bottom-up index deletion patch series and your
poc_vacuumstrategy.patch patch. This patch was written as a quick and
dirty way of simulating what I thought would work best for bottom-up
index deletion for one specific benchmark/test, which was
non-hot-update heavy. This consists of a variant pgbench with several
indexes on pgbench_accounts (almost the same as most other bottom-up
deletion benchmarks I've been running). Only one index is "logically
modified" by the updates, but of course we still physically modify all
indexes on every update. I set fill factor to 90 for this benchmark,
which is an important factor for how your VACUUM patch works during
the benchmark.

This rough supplementary patch includes VACUUM logic that assumes (but
doesn't check) that the table has heap fill factor set to 90 -- see my
changes to choose_vacuum_strategy(). This benchmark is really about
stability over time more than performance (though performance is also
improved significantly). I wanted to keep both the table/heap and the
logically unmodified indexes (i.e. 3 out of 4 indexes on
pgbench_accounts) exactly the same size *forever*.

Does this make sense?

Thank you for sharing the patch. That makes sense.

+        if (!vacuum_heap)
+        {
+            if (maxdeadpage > 130 ||
+                /* Also check if maintenance_work_mem space is running out */
+                vacrelstats->dead_tuples->num_tuples >
+                vacrelstats->dead_tuples->max_tuples / 2)
+                vacuum_heap = true;
+        }

The second test, checking if maintenance_work_mem space is running out,
also makes sense to me. Perhaps another idea would be to compare the
number of collected garbage tuples to the total number of heap tuples
so that we do lazy_vacuum_heap() only when we’re likely to reclaim a
certain amount of garbage in the table.
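
(As a rough sketch of that idea, not taken from any of the posted
patches: the 2% threshold is arbitrary, and the dead_tuples and
old_live_tuples fields of LVRelStats are assumed to be available at
this point.)

    /* Hypothetical extra test for choose_vacuum_strategy(): also
     * vacuum the heap once the collected dead tuples are a noticeable
     * fraction of the table's live tuples. */
    if (!vacuum_heap &&
        vacrelstats->dead_tuples->num_tuples >
        vacrelstats->old_live_tuples * 0.02)
        vacuum_heap = true;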

Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that
pg_stat_database shows a roughly 28% reduction in blks_read after an
overnight run for the patch series (it was 508,820,699 for the
patches, 705,282,975 for the master branch). I think that the VACUUM
component is responsible for some of that reduction. There were 11
VACUUMs for the patch, 7 of which did not call lazy_vacuum_heap()
(these 7 VACUUM operations all only did a btbulkdelete() call for the
one problematic index on the table, named "abalance_ruin", which my
supplementary patch has hard-coded knowledge of).

That's a very good result in terms of skipping lazy_vacuum_heap(). How
much did the table and indexes bloat? Also, I'm curious which test in
choose_vacuum_strategy() turned vacuum_heap on: the 130-LP_DEAD-items
test or the test for maintenance_work_mem space running out? And what
was the impact on clearing all-visible bits?

I merged these patches and polished them.

In the 0002 patch, we calculate how many LP_DEAD items can be
accumulated in the space on a single heap page left by fillfactor. I
increased MaxHeapTuplesPerPage so that we can accumulate more LP_DEAD
items on a heap page; otherwise, accumulating LP_DEAD items
unnecessarily constrains the number of heap tuples on a single page,
especially with small tuples, as I mentioned before. Previously, we
constrained the number of line pointers to avoid excessive
line-pointer bloat and to avoid having to increase the size of the
work array. However, once the amvacuumstrategy stuff entered the
picture, accumulating line pointers has value. Also, we might want to
store the returned value of amvacuumstrategy so that the index AM can
refer to it during index deletion.
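
(As a rough sketch of the calculation described above, not the patch's
actual code: the page-header and line-pointer arithmetic are
simplified, and "fillfactor" stands for the table's heap fillfactor as
a percentage.)

    /* Rough sketch: how many LP_DEAD line pointers could fit in the
     * space that a non-default fillfactor leaves unused on a heap
     * page. */
    Size    usable = BLCKSZ - SizeOfPageHeaderData;
    Size    reserved = usable - (usable * fillfactor) / 100;
    int     max_lp_dead_items = reserved / sizeof(ItemIdData);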

The 0003 patch has btree indexes skip bulk-deletion if the index
hasn't grown since the last bulk-deletion. I stored the number of
blocks in the meta page but didn't implement meta page upgrading.

After more thought, I think that ambulkdelete needs to be able to
refer to the answer of amvacuumstrategy. That way, the index can skip
bulk-deletion when lazy vacuum doesn't vacuum the heap and the index
doesn't want to do it either.
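
(Concretely, in the attached v2-0002 patch this takes the form of an
early-return guard at the top of each ambulkdelete implementation, for
example in blbulkdelete():)

    /*
     * Skip deleting index entries if the corresponding heap tuples
     * will not be deleted and this index chose to skip bulk-deletion.
     */
    if (!info->will_vacuum_heap &&
        info->indvac_strategy == INDEX_VACUUM_STRATEGY_NONE)
        return stats;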

I’ve attached the updated version of the patch, which includes the following changes:

* Store the answers to amvacuumstrategy in either local memory or DSM
(in the parallel vacuum case) so that ambulkdelete can refer to the
answer of amvacuumstrategy.
* Fix regression failures.
* Update the documentation and comments.

Note that the 0003 patch is still PoC quality, lacking the btree meta
page version upgrade.

Sorry, I missed the 0002 patch. I've attached the patch set again.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

v2-0003-PoC-skip-btree-bulkdelete-if-the-index-doesn-t-gr.patch (application/octet-stream)
From 551f645ec91ed721f1b9c79b79235472e59a3c4d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 5 Jan 2021 09:47:49 +0900
Subject: [PATCH v2 3/3] PoC: skip btree bulkdelete if the index doesn't grow.

On amvacuumstrategy, btree indexes return INDEX_VACUUM_STRATEGY_NONE
if the index hasn't grown since the last bulk-deletion. To remember that,
this change adds a new field in the btree meta page to store the
number of blocks at the last bulkdelete time.

XXX: need to upgrade the meta page version.
---
 contrib/pageinspect/btreefuncs.c              |  5 +++
 contrib/pageinspect/expected/btree.out        |  3 +-
 contrib/pageinspect/pageinspect--1.8--1.9.sql | 18 +++++++++
 src/backend/access/nbtree/nbtpage.c           |  9 ++++-
 src/backend/access/nbtree/nbtree.c            | 40 +++++++++++++++++--
 src/backend/access/nbtree/nbtxlog.c           |  1 +
 src/backend/access/rmgrdesc/nbtdesc.c         |  5 ++-
 src/include/access/nbtree.h                   |  2 +
 src/include/access/nbtxlog.h                  |  1 +
 9 files changed, 77 insertions(+), 7 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 445605db58..94f648118f 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -692,6 +692,11 @@ bt_metap(PG_FUNCTION_ARGS)
 		values[j++] = "f";
 	}
 
+	if (metad->btm_version >= BTREE_VERSION)
+		values[j++] = psprintf(INT64_FORMAT, (int64) metad->btm_last_deletion_nblocks);
+	else
+		values[j++] = "-1";
+
 	tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
 								   values);
 
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 17bf0c5470..5362bcb475 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -3,7 +3,7 @@ INSERT INTO test1 VALUES (72057594037927937, 'text');
 CREATE INDEX test1_a_idx ON test1 USING btree (a);
 \x
 SELECT * FROM bt_metap('test1_a_idx');
--[ RECORD 1 ]-----------+-------
+-[ RECORD 1 ]-----------+-----------
 magic                   | 340322
 version                 | 4
 root                    | 1
@@ -13,6 +13,7 @@ fastlevel               | 0
 oldest_xact             | 0
 last_cleanup_num_tuples | -1
 allequalimage           | t
+last_deletion_nblocks   | 4294967295
 
 SELECT * FROM bt_page_stats('test1_a_idx', 0);
 ERROR:  block 0 is a meta page
diff --git a/contrib/pageinspect/pageinspect--1.8--1.9.sql b/contrib/pageinspect/pageinspect--1.8--1.9.sql
index 9dc342fabc..a87b74ce2a 100644
--- a/contrib/pageinspect/pageinspect--1.8--1.9.sql
+++ b/contrib/pageinspect/pageinspect--1.8--1.9.sql
@@ -39,3 +39,21 @@ CREATE FUNCTION gist_page_items(IN page bytea,
 RETURNS SETOF record
 AS 'MODULE_PATHNAME', 'gist_page_items'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_metap()
+--
+DROP FUNCTION bt_metap(text);
+CREATE FUNCTION bt_metap(IN relname text,
+    OUT magic int4,
+    OUT version int4,
+    OUT root int8,
+    OUT level int8,
+    OUT fastroot int8,
+    OUT fastlevel int8,
+    OUT oldest_xact xid,
+    OUT last_cleanup_num_tuples float8,
+    OUT allequalimage boolean,
+    OUT last_deletion_nblocks int8)
+AS 'MODULE_PATHNAME', 'bt_metap'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index e230f912c2..d686f25a7a 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -82,6 +82,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	metad->btm_allequalimage = allequalimage;
+	metad->btm_last_deletion_nblocks = InvalidBlockNumber;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -121,6 +122,7 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+
 	/* Only a REINDEX can set this field */
 	Assert(!metad->btm_allequalimage);
 	metad->btm_allequalimage = false;
@@ -185,17 +187,20 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	BlockNumber nblocks;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
+	nblocks = RelationGetNumberOfBlocks(rel);
 
 	/* outdated version of metapage always needs rewrite */
 	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		needsRewrite = true;
 	else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
-			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
+			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples ||
+			 metad->btm_last_deletion_nblocks != nblocks)
 		needsRewrite = true;
 
 	if (!needsRewrite)
@@ -217,6 +222,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	/* update cleanup-related information */
 	metad->btm_oldest_btpo_xact = oldestBtpoXact;
 	metad->btm_last_cleanup_num_heap_tuples = numHeapTuples;
+	metad->btm_last_deletion_nblocks = nblocks;
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
@@ -236,6 +242,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
 		md.allequalimage = metad->btm_allequalimage;
+		md.last_deletion_nblocks = metad->btm_last_deletion_nblocks;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e00e5fe0a4..56162cf41c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -878,16 +878,50 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 }
 
 /*
- * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
- * is specified to off.
+ * Choose the vacuum strategy. Do bulk-deletion or nothing
  */
 IndexVacuumStrategy
 btvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
 {
+	Buffer		metabuf;
+	Page		metapg;
+	BTMetaPageData *metad;
+	IndexVacuumStrategy result = INDEX_VACUUM_STRATEGY_NONE;
+
+	/*
+	 * Don't want to do bulk-deletion if index cleanup is disabled
+	 * by the user request.
+	 */
 	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
 		return INDEX_VACUUM_STRATEGY_NONE;
+
+	metabuf = _bt_getbuf(info->index, BTREE_METAPAGE, BT_READ);
+	metapg = BufferGetPage(metabuf);
+	metad = BTPageGetMeta(metapg);
+
+	if (metad->btm_version < BTREE_VERSION)
+	{
+		/*
+		 * Do bulk-deletion if metapage needs upgrade, because we don't
+		 * have meta-information yet.
+		 */
+		result = INDEX_VACUUM_STRATEGY_BULKDELETE;
+	}
 	else
-		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+	{
+		BlockNumber	nblocks = RelationGetNumberOfBlocks(info->index);
+
+		/*
+		 * Do deletion if the index has grown since the last deletion,
+		 * even by one block, or if this is the first time.
+		 */
+		if (!BlockNumberIsValid(metad->btm_last_deletion_nblocks) ||
+			 nblocks > metad->btm_last_deletion_nblocks)
+			result = INDEX_VACUUM_STRATEGY_BULKDELETE;
+	}
+
+	_bt_relbuf(info->index, metabuf);
+	return result;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c1d578cc01..37546f566d 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -115,6 +115,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
 	md->btm_allequalimage = xlrec->allequalimage;
+	md->btm_last_deletion_nblocks = xlrec->last_deletion_nblocks;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 6e0d6a2b72..4e58b0bc07 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -110,9 +110,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 
 				xlrec = (xl_btree_metadata *) XLogRecGetBlockData(record, 0,
 																  NULL);
-				appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f",
+				appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f; last_deletion_nblocks: %u",
 								 xlrec->oldest_btpo_xact,
-								 xlrec->last_cleanup_num_heap_tuples);
+								 xlrec->last_cleanup_num_heap_tuples,
+								 xlrec->last_deletion_nblocks);
 				break;
 			}
 	}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ba120d4a80..f116e29735 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -110,6 +110,8 @@ typedef struct BTMetaPageData
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
 	bool		btm_allequalimage;	/* are all columns "equalimage"? */
+	BlockNumber	btm_last_deletion_nblocks;	/* number of blocks during last
+											 * bulk-deletion */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 7ae5c98c2b..bc0c52a779 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -55,6 +55,7 @@ typedef struct xl_btree_metadata
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
 	bool		allequalimage;
+	BlockNumber last_deletion_nblocks;
 } xl_btree_metadata;
 
 /*
-- 
2.27.0

v2-0001-Introduce-IndexAM-API-for-choosing-index-vacuum-s.patch (application/octet-stream)
From 57c5c5199255227bc75ed7ba1b2a5f70136f17f8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 4 Jan 2021 13:34:10 +0900
Subject: [PATCH v2 1/3] Introduce IndexAM API for choosing index vacuum
 strategy.

---
 contrib/bloom/bloom.h                 |  2 ++
 contrib/bloom/blutils.c               |  1 +
 contrib/bloom/blvacuum.c              | 13 +++++++++++++
 src/backend/access/brin/brin.c        |  1 +
 src/backend/access/gin/ginutil.c      |  1 +
 src/backend/access/gin/ginvacuum.c    | 13 +++++++++++++
 src/backend/access/gist/gist.c        |  1 +
 src/backend/access/gist/gistvacuum.c  | 13 +++++++++++++
 src/backend/access/hash/hash.c        | 14 ++++++++++++++
 src/backend/access/index/indexam.c    | 22 ++++++++++++++++++++++
 src/backend/access/nbtree/nbtree.c    | 14 ++++++++++++++
 src/backend/access/spgist/spgutils.c  |  1 +
 src/backend/access/spgist/spgvacuum.c | 13 +++++++++++++
 src/include/access/amapi.h            |  7 ++++++-
 src/include/access/genam.h            | 16 ++++++++++++++--
 src/include/access/gin_private.h      |  2 ++
 src/include/access/gist_private.h     |  2 ++
 src/include/access/hash.h             |  2 ++
 src/include/access/nbtree.h           |  2 ++
 src/include/access/spgist.h           |  2 ++
 20 files changed, 139 insertions(+), 3 deletions(-)

diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index a22a6dfa40..8395d31450 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -202,6 +202,8 @@ extern void blendscan(IndexScanDesc scan);
 extern IndexBuildResult *blbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void blbuildempty(Relation index);
+extern IndexVacuumStrategy blvacuumstrategy(IndexVacuumInfo *info,
+											struct VacuumParams *params);
 extern IndexBulkDeleteResult *blbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
 										   void *callback_state);
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 1e505b1da5..8098d75c82 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -131,6 +131,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = blbuild;
 	amroutine->ambuildempty = blbuildempty;
 	amroutine->aminsert = blinsert;
+	amroutine->amvacuumstrategy = blvacuumstrategy;
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
index 88b0a6d290..b5b8df34ed 100644
--- a/contrib/bloom/blvacuum.c
+++ b/contrib/bloom/blvacuum.c
@@ -23,6 +23,19 @@
 #include "storage/lmgr.h"
 
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+blvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 27ba596c6e..181dc51268 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -112,6 +112,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = brinbuild;
 	amroutine->ambuildempty = brinbuildempty;
 	amroutine->aminsert = brininsert;
+	amroutine->amvacuumstrategy = NULL;
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf42..fc375332fc 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -63,6 +63,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = ginbuild;
 	amroutine->ambuildempty = ginbuildempty;
 	amroutine->aminsert = gininsert;
+	amroutine->amvacuumstrategy = ginvacuumstrategy;
 	amroutine->ambulkdelete = ginbulkdelete;
 	amroutine->amvacuumcleanup = ginvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 35b85a9bff..985fb27ba1 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -560,6 +560,19 @@ ginVacuumEntryPage(GinVacuumState *gvs, Buffer buffer, BlockNumber *roots, uint3
 	return (tmppage == origpage) ? NULL : tmppage;
 }
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+ginvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 IndexBulkDeleteResult *
 ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			  IndexBulkDeleteCallback callback, void *callback_state)
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 992936cfa8..6d047b9f87 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,6 +84,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = gistbuild;
 	amroutine->ambuildempty = gistbuildempty;
 	amroutine->aminsert = gistinsert;
+	amroutine->amvacuumstrategy = gistvacuumstrategy;
 	amroutine->ambulkdelete = gistbulkdelete;
 	amroutine->amvacuumcleanup = gistvacuumcleanup;
 	amroutine->amcanreturn = gistcanreturn;
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 94a7e12763..d462984b3d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -52,6 +52,19 @@ static bool gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						   Buffer buffer, OffsetNumber downlink,
 						   Buffer leafBuffer);
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+gistvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * VACUUM bulkdelete stage: remove index entries.
  */
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a9..fb439d23a8 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -81,6 +81,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = hashbuild;
 	amroutine->ambuildempty = hashbuildempty;
 	amroutine->aminsert = hashinsert;
+	amroutine->amvacuumstrategy = hashvacuumstrategy;
 	amroutine->ambulkdelete = hashbulkdelete;
 	amroutine->amvacuumcleanup = hashvacuumcleanup;
 	amroutine->amcanreturn = NULL;
@@ -444,6 +445,19 @@ hashendscan(IndexScanDesc scan)
 	scan->opaque = NULL;
 }
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+hashvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 3d2dbed708..171ba5c2fa 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -678,6 +678,28 @@ index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap)
 	return ntids;
 }
 
+/* ----------------
+ *		index_vacuum_strategy - ask index vacuum strategy
+ *
+ * This callback routine is called just before vacuuming the heap.
+ * Returns IndexVacuumStrategy value to tell the lazy vacuum whether to
+ * do index deletion.
+ * ----------------
+ */
+IndexVacuumStrategy
+index_vacuum_strategy(IndexVacuumInfo *info, struct VacuumParams *params)
+{
+	Relation	indexRelation = info->index;
+
+	RELATION_CHECKS;
+
+	/* amvacuumstrategy is optional; assume do bulk-deletion */
+	if (indexRelation->rd_indam->amvacuumstrategy == NULL)
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+
+	return indexRelation->rd_indam->amvacuumstrategy(info, params);
+}
+
 /* ----------------
  *		index_bulk_delete - do mass deletion of index entries
  *
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 289bd3c15d..863430f910 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -133,6 +133,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = btbuild;
 	amroutine->ambuildempty = btbuildempty;
 	amroutine->aminsert = btinsert;
+	amroutine->amvacuumstrategy = btvacuumstrategy;
 	amroutine->ambulkdelete = btbulkdelete;
 	amroutine->amvacuumcleanup = btvacuumcleanup;
 	amroutine->amcanreturn = btcanreturn;
@@ -864,6 +865,19 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	return result;
 }
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+btvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index d8b1815061..7b2313590a 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -66,6 +66,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = spgbuild;
 	amroutine->ambuildempty = spgbuildempty;
 	amroutine->aminsert = spginsert;
+	amroutine->amvacuumstrategy = spgvacuumstrategy;
 	amroutine->ambulkdelete = spgbulkdelete;
 	amroutine->amvacuumcleanup = spgvacuumcleanup;
 	amroutine->amcanreturn = spgcanreturn;
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0d02a02222..5de6dd0fdf 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -894,6 +894,19 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	bds->stats->pages_free = bds->stats->pages_deleted;
 }
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+spgvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index d357ebb559..548f2033a4 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -22,8 +22,9 @@
 struct PlannerInfo;
 struct IndexPath;
 
-/* Likewise, this file shouldn't depend on execnodes.h. */
+/* Likewise, this file shouldn't depend on execnodes.h and vacuum.h. */
 struct IndexInfo;
+struct VacuumParams;
 
 
 /*
@@ -112,6 +113,9 @@ typedef bool (*aminsert_function) (Relation indexRelation,
 								   IndexUniqueCheck checkUnique,
 								   bool indexUnchanged,
 								   struct IndexInfo *indexInfo);
+/* vacuum strategy */
+typedef IndexVacuumStrategy (*amvacuumstrategy_function) (IndexVacuumInfo *info,
+														  struct VacuumParams *params);
 
 /* bulk delete */
 typedef IndexBulkDeleteResult *(*ambulkdelete_function) (IndexVacuumInfo *info,
@@ -259,6 +263,7 @@ typedef struct IndexAmRoutine
 	ambuild_function ambuild;
 	ambuildempty_function ambuildempty;
 	aminsert_function aminsert;
+	amvacuumstrategy_function amvacuumstrategy;
 	ambulkdelete_function ambulkdelete;
 	amvacuumcleanup_function amvacuumcleanup;
 	amcanreturn_function amcanreturn;	/* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 0eab1508d3..6c1c4798e3 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -21,8 +21,9 @@
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
 
-/* We don't want this file to depend on execnodes.h. */
+/* We don't want this file to depend on execnodes.h and vacuum.h. */
 struct IndexInfo;
+struct VacuumParams;
 
 /*
  * Struct for statistics returned by ambuild
@@ -34,7 +35,8 @@ typedef struct IndexBuildResult
 } IndexBuildResult;
 
 /*
- * Struct for input arguments passed to ambulkdelete and amvacuumcleanup
+ * Struct for input arguments passed to amvacuumstrategy, ambulkdelete
+ * and amvacuumcleanup
  *
  * num_heap_tuples is accurate only when estimated_count is false;
  * otherwise it's just an estimate (currently, the estimate is the
@@ -125,6 +127,14 @@ typedef struct IndexOrderByDistance
 	bool		isnull;
 } IndexOrderByDistance;
 
+/* Result value for amvacuumstrategy */
+typedef enum IndexVacuumStrategy
+{
+	INDEX_VACUUM_STRATEGY_NONE,			/* No-op, skip bulk-deletion in this
+										 * vacuum cycle */
+	INDEX_VACUUM_STRATEGY_BULKDELETE	/* Do ambulkdelete */
+} IndexVacuumStrategy;
+
 /*
  * generalized index_ interface routines (in indexam.c)
  */
@@ -174,6 +184,8 @@ extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 							   struct TupleTableSlot *slot);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
+extern IndexVacuumStrategy index_vacuum_strategy(IndexVacuumInfo *info,
+												 struct VacuumParams *params);
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
 												IndexBulkDeleteResult *stats,
 												IndexBulkDeleteCallback callback,
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 670a40b4be..5c48a48917 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -397,6 +397,8 @@ extern int64 gingetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void ginInitConsistentFunction(GinState *ginstate, GinScanKey key);
 
 /* ginvacuum.c */
+extern IndexVacuumStrategy ginvacuumstrategy(IndexVacuumInfo *info,
+											 struct VacuumParams *params);
 extern IndexBulkDeleteResult *ginbulkdelete(IndexVacuumInfo *info,
 											IndexBulkDeleteResult *stats,
 											IndexBulkDeleteCallback callback,
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 553d364e2d..303a18da4d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -533,6 +533,8 @@ extern void gistMakeUnionKey(GISTSTATE *giststate, int attno,
 extern XLogRecPtr gistGetFakeLSN(Relation rel);
 
 /* gistvacuum.c */
+extern IndexVacuumStrategy gistvacuumstrategy(IndexVacuumInfo *info,
+											  struct VacuumParams *params);
 extern IndexBulkDeleteResult *gistbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
 											 IndexBulkDeleteCallback callback,
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 1cce865be2..4c7e064708 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -372,6 +372,8 @@ extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
 extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					   ScanKey orderbys, int norderbys);
 extern void hashendscan(IndexScanDesc scan);
+extern IndexVacuumStrategy hashvacuumstrategy(IndexVacuumInfo *info,
+											  struct VacuumParams *params);
 extern IndexBulkDeleteResult *hashbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
 											 IndexBulkDeleteCallback callback,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index cad4f2bdeb..ba120d4a80 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1011,6 +1011,8 @@ extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
+extern IndexVacuumStrategy btvacuumstrategy(IndexVacuumInfo *info,
+											struct VacuumParams *params);
 extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats,
 										   IndexBulkDeleteCallback callback,
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index 2eb2f421a8..f591b21ef1 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -212,6 +212,8 @@ extern bool spggettuple(IndexScanDesc scan, ScanDirection dir);
 extern bool spgcanreturn(Relation index, int attno);
 
 /* spgvacuum.c */
+extern IndexVacuumStrategy spgvacuumstrategy(IndexVacuumInfo *info,
+											 struct VacuumParams *params);
 extern IndexBulkDeleteResult *spgbulkdelete(IndexVacuumInfo *info,
 											IndexBulkDeleteResult *stats,
 											IndexBulkDeleteCallback callback,
-- 
2.27.0

v2-0002-Choose-index-vacuum-strategy-based-on-amvacuumstr.patch (application/octet-stream)
From 0972e0e2eb70b00a4965c0ee64fc61e6b06c1d67 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 4 Jan 2021 13:35:35 +0900
Subject: [PATCH v2 2/3] Choose index vacuum strategy based on amvacuumstrategy
 IndexAM API.

If the index_cleanup option is specified by neither the VACUUM command
nor the storage option, lazy vacuum asks each index for its vacuum
strategy before heap vacuum and decides whether or not to remove the
collected garbage tuples from the heap, based on both the answers from
amvacuumstrategy and how many LP_DEAD items can be accumulated in the
space of a heap page left by fillfactor.

The decision made by lazy vacuum and the answer returned from
amvacuumstrategy are passed to ambulkdelete. Then each index can
choose whether or not to skip index bulk-deletion accordingly.
---
 contrib/bloom/blvacuum.c                      |  10 +-
 doc/src/sgml/indexam.sgml                     |  25 ++
 doc/src/sgml/ref/create_table.sgml            |  16 +-
 src/backend/access/brin/brin.c                |   7 +-
 src/backend/access/common/reloptions.c        |  35 +-
 src/backend/access/gin/ginpostinglist.c       |  30 +-
 src/backend/access/gin/ginvacuum.c            |  12 +
 src/backend/access/gist/gistvacuum.c          |  15 +-
 src/backend/access/hash/hash.c                |   8 +
 src/backend/access/heap/vacuumlazy.c          | 356 ++++++++++++++----
 src/backend/access/nbtree/nbtree.c            |  20 +
 src/backend/access/spgist/spgvacuum.c         |  14 +-
 src/backend/catalog/index.c                   |   2 +
 src/backend/commands/analyze.c                |   1 +
 src/backend/commands/vacuum.c                 |  23 +-
 src/include/access/genam.h                    |  36 +-
 src/include/access/htup_details.h             |  17 +-
 src/include/commands/vacuum.h                 |  20 +-
 src/include/utils/rel.h                       |  17 +-
 .../expected/test_ginpostinglist.out          |   6 +-
 20 files changed, 508 insertions(+), 162 deletions(-)

diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
index b5b8df34ed..c356ec9e85 100644
--- a/contrib/bloom/blvacuum.c
+++ b/contrib/bloom/blvacuum.c
@@ -58,6 +58,14 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	BloomMetaPageData *metaData;
 	GenericXLogState *gxlogState;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted and this index chose to skip bulk-deletion.
+	 */
+	if (!info->will_vacuum_heap &&
+		info->indvac_strategy == INDEX_VACUUM_STRATEGY_NONE)
+		return stats;
+
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
 
@@ -185,7 +193,7 @@ blvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	BlockNumber npages,
 				blkno;
 
-	if (info->analyze_only)
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	if (stats == NULL)
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index ec5741df6d..2b0538fa67 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
     ambuild_function ambuild;
     ambuildempty_function ambuildempty;
     aminsert_function aminsert;
+    amvacuumstrategy_function amvacuumstrategy;
     ambulkdelete_function ambulkdelete;
     amvacuumcleanup_function amvacuumcleanup;
     amcanreturn_function amcanreturn;   /* can be NULL */
@@ -346,6 +347,30 @@ aminsert (Relation indexRelation,
 
   <para>
 <programlisting>
+IndexVacuumStrategy
+amvacuumstrategy (IndexVacuumInfo *info);
+</programlisting>
+   Tell <command>VACUUM</command> whether or not the index is willing to
+   delete index tuples.  This callback is called before
+   <function>ambulkdelete</function>.  Possible return values are
+   <literal>INDEX_VACUUM_STRATEGY_NONE</literal> and
+   <literal>INDEX_VACUUM_STRATEGY_BULKDELETE</literal>.  From the index
+   point of view, if the index doesn't need to delete index tuples, it
+   must return <literal>INDEX_VACUUM_STRATEGY_NONE</literal>.  The returned
+   value can be referred to in <function>ambulkdelete</function> by checking
+   <literal>info-&gt;indvac_strategy</literal>.
+  </para>
+  <para>
+   <command>VACUUM</command> will decide whether or not to delete garbage tuples
+   from the heap based on these returned values from each index and several other
+   factors.  Therefore, if the index refers to heap TIDs and <command>VACUUM</command>
+   decides to delete garbage tuples from the heap, please note that the index must
+   also delete its index tuples even if it returned
+   <literal>INDEX_VACUUM_STRATEGY_NONE</literal>.
+  </para>
+
+  <para>
+<programlisting>
 IndexBulkDeleteResult *
 ambulkdelete (IndexVacuumInfo *info,
               IndexBulkDeleteResult *stats,
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 569f4c9da7..e5c616470b 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1441,13 +1441,15 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </term>
     <listitem>
      <para>
-      Enables or disables index cleanup when <command>VACUUM</command> is
-      run on this table.  The default value is <literal>true</literal>.
-      Disabling index cleanup can speed up <command>VACUUM</command> very
-      significantly, but may also lead to severely bloated indexes if table
-      modifications are frequent.  The <literal>INDEX_CLEANUP</literal>
-      parameter of <link linkend="sql-vacuum"><command>VACUUM</command></link>, if specified, overrides
-      the value of this option.
+      Specifies the index cleanup option used when <command>VACUUM</command> is
+      run on this table.  The default value is <literal>auto</literal>, which
+      determines whether to enable or disable index cleanup based on the indexes
+      and the heap.  Disabling index cleanup can speed up
+      <command>VACUUM</command> very significantly, but may also lead to severely
+      bloated indexes if table modifications are frequent.  The
+      <literal>INDEX_CLEANUP</literal> parameter of
+      <link linkend="sql-vacuum"><command>VACUUM</command></link>, if specified,
+      overrides the value of this option.
      </para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 181dc51268..fb70234112 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -801,8 +801,11 @@ brinvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
 	Relation	heapRel;
 
-	/* No-op in ANALYZE ONLY mode */
-	if (info->analyze_only)
+	/*
+	 * No-op in ANALYZE ONLY mode or when user requests to disable index
+	 * cleanup.
+	 */
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	if (!stats)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index c687d3ee9e..f6b1046485 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -27,6 +27,7 @@
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/tablespace.h"
+#include "commands/vacuum.h"
 #include "commands/view.h"
 #include "nodes/makefuncs.h"
 #include "postmaster/postmaster.h"
@@ -140,15 +141,6 @@ static relopt_bool boolRelOpts[] =
 		},
 		false
 	},
-	{
-		{
-			"vacuum_index_cleanup",
-			"Enables index vacuuming and index cleanup",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
-			ShareUpdateExclusiveLock
-		},
-		true
-	},
 	{
 		{
 			"vacuum_truncate",
@@ -492,6 +484,18 @@ relopt_enum_elt_def viewCheckOptValues[] =
 	{(const char *) NULL}		/* list terminator */
 };
 
+/* values from VacOptTernaryValue */
+relopt_enum_elt_def vacOptTernaryOptValues[] =
+{
+	{"auto", VACOPT_TERNARY_DEFAULT},
+	{"true", VACOPT_TERNARY_ENABLED},
+	{"false", VACOPT_TERNARY_DISABLED},
+	{"on", VACOPT_TERNARY_ENABLED},
+	{"off", VACOPT_TERNARY_DISABLED},
+	{"1", VACOPT_TERNARY_ENABLED},
+	{"0", VACOPT_TERNARY_DISABLED}
+};
+
 static relopt_enum enumRelOpts[] =
 {
 	{
@@ -516,6 +520,17 @@ static relopt_enum enumRelOpts[] =
 		VIEW_OPTION_CHECK_OPTION_NOT_SET,
 		gettext_noop("Valid values are \"local\" and \"cascaded\".")
 	},
+	{
+		{
+			"vacuum_index_cleanup",
+			"Enables index vacuuming and index cleanup",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		vacOptTernaryOptValues,
+		VACOPT_TERNARY_DEFAULT,
+		gettext_noop("Valid values are \"on\", \"off\", and \"auto\".")
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1856,7 +1871,7 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, user_catalog_table)},
 		{"parallel_workers", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, parallel_workers)},
-		{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
+		{"vacuum_index_cleanup", RELOPT_TYPE_ENUM,
 		offsetof(StdRdOptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
 		offsetof(StdRdOptions, vacuum_truncate)}
diff --git a/src/backend/access/gin/ginpostinglist.c b/src/backend/access/gin/ginpostinglist.c
index 216b2b9a2c..e49c94b860 100644
--- a/src/backend/access/gin/ginpostinglist.c
+++ b/src/backend/access/gin/ginpostinglist.c
@@ -22,29 +22,29 @@
 
 /*
  * For encoding purposes, item pointers are represented as 64-bit unsigned
- * integers. The lowest 11 bits represent the offset number, and the next
- * lowest 32 bits are the block number. That leaves 21 bits unused, i.e.
- * only 43 low bits are used.
+ * integers. The lowest 13 bits represent the offset number, and the next
+ * lowest 32 bits are the block number. That leaves 19 bits unused, i.e.
+ * only 45 low bits are used.
  *
- * 11 bits is enough for the offset number, because MaxHeapTuplesPerPage <
- * 2^11 on all supported block sizes. We are frugal with the bits, because
+ * 13 bits is enough for the offset number, because MaxHeapTuplesPerPage <
+ * 2^13 on all supported block sizes. We are frugal with the bits, because
  * smaller integers use fewer bytes in the varbyte encoding, saving disk
  * space. (If we get a new table AM in the future that wants to use the full
  * range of possible offset numbers, we'll need to change this.)
  *
- * These 43-bit integers are encoded using varbyte encoding. In each byte,
+ * These 45-bit integers are encoded using varbyte encoding. In each byte,
  * the 7 low bits contain data, while the highest bit is a continuation bit.
  * When the continuation bit is set, the next byte is part of the same
- * integer, otherwise this is the last byte of this integer. 43 bits need
+ * integer, otherwise this is the last byte of this integer. 45 bits need
  * at most 7 bytes in this encoding:
  *
  * 0XXXXXXX
- * 1XXXXXXX 0XXXXYYY
- * 1XXXXXXX 1XXXXYYY 0YYYYYYY
- * 1XXXXXXX 1XXXXYYY 1YYYYYYY 0YYYYYYY
- * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
- * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
- * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0uuuuuuY
+ * 1XXXXXXX 0XXXXXXY
+ * 1XXXXXXX 1XXXXXXY 0YYYYYYY
+ * 1XXXXXXX 1XXXXXXY 1YYYYYYY 0YYYYYYY
+ * 1XXXXXXX 1XXXXXXY 1YYYYYYY 1YYYYYYY 0YYYYYYY
+ * 1XXXXXXX 1XXXXXXY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
+ * 1XXXXXXX 1XXXXXXY 1YYYYYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0uuuuYYY
  *
  * X = bits used for offset number
  * Y = bits used for block number
@@ -73,12 +73,12 @@
 
 /*
  * How many bits do you need to encode offset number? OffsetNumber is a 16-bit
- * integer, but you can't fit that many items on a page. 11 ought to be more
+ * integer, but you can't fit that many items on a page. 13 ought to be more
  * than enough. It's tempting to derive this from MaxHeapTuplesPerPage, and
  * use the minimum number of bits, but that would require changing the on-disk
  * format if MaxHeapTuplesPerPage changes. Better to leave some slack.
  */
-#define MaxHeapTuplesPerPageBits		11
+#define MaxHeapTuplesPerPageBits		13
 
 /* Max. number of bytes needed to encode the largest supported integer. */
 #define MaxBytesPerInteger				7
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 985fb27ba1..68bec5238a 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -584,6 +584,14 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	BlockNumber rootOfPostingTree[BLCKSZ / (sizeof(IndexTupleData) + sizeof(ItemId))];
 	uint32		nRoot;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted and this index chose to skip bulk-deletion.
+	 */
+	if (!info->will_vacuum_heap &&
+		info->indvac_strategy == INDEX_VACUUM_STRATEGY_NONE)
+		return stats;
+
 	gvs.tmpCxt = AllocSetContextCreate(CurrentMemoryContext,
 									   "Gin vacuum temporary context",
 									   ALLOCSET_DEFAULT_SIZES);
@@ -721,6 +729,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		return stats;
 	}
 
+	/* Skip index cleanup if user requests to disable */
+	if (!info->vacuumcleanup_requested)
+		return stats;
+
 	/*
 	 * Set up all-zero stats and cleanup pending inserts if ginbulkdelete
 	 * wasn't called
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index d462984b3d..706454b2f0 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -72,6 +72,14 @@ IndexBulkDeleteResult *
 gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted and this index chose to skip bulk-deletion.
+	 */
+	if (!info->will_vacuum_heap &&
+		info->indvac_strategy == INDEX_VACUUM_STRATEGY_NONE)
+		return stats;
+
 	/* allocate stats if first time through, else re-use existing struct */
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
@@ -87,8 +95,11 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
-	/* No-op in ANALYZE ONLY mode */
-	if (info->analyze_only)
+	/*
+	 * No-op in ANALYZE ONLY mode or when user requests to disable index
+	 * cleanup.
+	 */
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	/*
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index fb439d23a8..0449638cb3 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -483,6 +483,14 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted and this index chose to skip bulk-deletion.
+	 */
+	if (!info->will_vacuum_heap &&
+		info->indvac_strategy == INDEX_VACUUM_STRATEGY_NONE)
+		return stats;
+
 	tuples_removed = 0;
 	num_index_tuples = 0;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad..8cd8b42846 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -140,6 +140,7 @@
 #define PARALLEL_VACUUM_KEY_QUERY_TEXT		3
 #define PARALLEL_VACUUM_KEY_BUFFER_USAGE	4
 #define PARALLEL_VACUUM_KEY_WAL_USAGE		5
+#define PARALLEL_VACUUM_KEY_IND_STRATEGY	6
 
 /*
  * Macro to check if we are in a parallel vacuum.  If true, we are in the
@@ -214,6 +215,18 @@ typedef struct LVShared
 	double		reltuples;
 	bool		estimated_count;
 
+	/*
+	 * Copy of LVRelStats.vacuum_heap. It tells the index AM whether lazy vacuum
+	 * will remove dead tuples from the heap after index vacuum.
+	 */
+	bool vacuum_heap;
+
+	/*
+	 * Copy of LVRelStats.indexcleanup_requested. It tells index AM whether
+	 * amvacuumcleanup is requested or not.
+	 */
+	bool indexcleanup_requested;
+
 	/*
 	 * In single process lazy vacuum we could consume more memory during index
 	 * vacuuming or cleanup apart from the memory for heap scanning.  In
@@ -293,8 +306,8 @@ typedef struct LVRelStats
 {
 	char	   *relnamespace;
 	char	   *relname;
-	/* useindex = true means two-pass strategy; false means one-pass */
-	bool		useindex;
+	/* hasindex = true means two-pass strategy; false means one-pass */
+	bool		hasindex;
 	/* Overall statistics about rel */
 	BlockNumber old_rel_pages;	/* previous value of pg_class.relpages */
 	BlockNumber rel_pages;		/* total number of pages */
@@ -313,6 +326,15 @@ typedef struct LVRelStats
 	int			num_index_scans;
 	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
+	bool		vacuum_heap;	/* do we remove dead tuples from the heap? */
+	bool		indexcleanup_requested; /* INDEX_CLEANUP is not set to false */
+
+	/*
+	 * The array of index vacuum strategies for each index returned from
+	 * amvacuumstrategy. This is allocated in the DSM segment in parallel
+	 * mode and in local memory in non-parallel mode.
+	 */
+	IndexVacuumStrategy *ivstrategies;
 
 	/* Used for error callback */
 	char	   *indname;
@@ -320,6 +342,8 @@ typedef struct LVRelStats
 	OffsetNumber offnum;		/* used only for heap operations */
 	VacErrPhase phase;
 } LVRelStats;
+#define SizeOfIndVacStrategies(nindexes) \
+	(mul_size(sizeof(IndexVacuumStrategy), (nindexes)))
 
 /* Struct for saving and restoring vacuum error information. */
 typedef struct LVSavedErrInfo
@@ -343,6 +367,13 @@ static BufferAccessStrategy vac_strategy;
 static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 						   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
 						   bool aggressive);
+static void choose_vacuum_strategy(Relation onerel, LVRelStats *vacrelstats,
+								   VacuumParams *params, Relation *Irel,
+								   int nindexes, int ndeaditems);
+static void lazy_vacuum_table_and_indexes(Relation onerel, VacuumParams *params,
+										  LVRelStats *vacrelstats, Relation *Irel,
+										  int nindexes, IndexBulkDeleteResult **stats,
+										  LVParallelState *lps, int *maxdeadtups);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
@@ -351,7 +382,8 @@ static void lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 									LVRelStats *vacrelstats, LVParallelState *lps,
 									int nindexes);
 static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
-							  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats);
+							  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats,
+							  IndexVacuumStrategy ivstrat);
 static void lazy_cleanup_index(Relation indrel,
 							   IndexBulkDeleteResult **stats,
 							   double reltuples, bool estimated_count, LVRelStats *vacrelstats);
@@ -362,7 +394,8 @@ static bool should_attempt_truncation(VacuumParams *params,
 static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
 static BlockNumber count_nondeletable_pages(Relation onerel,
 											LVRelStats *vacrelstats);
-static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
+static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks,
+							 int nindexes);
 static void lazy_record_dead_tuple(LVDeadTuples *dead_tuples,
 								   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
@@ -381,7 +414,8 @@ static void vacuum_indexes_leader(Relation *Irel, IndexBulkDeleteResult **stats,
 								  int nindexes);
 static void vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
 							 LVShared *lvshared, LVSharedIndStats *shared_indstats,
-							 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats);
+							 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats,
+							 IndexVacuumStrategy ivstrat);
 static void lazy_cleanup_all_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
 									 LVRelStats *vacrelstats, LVParallelState *lps,
 									 int nindexes);
@@ -442,7 +476,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	ErrorContextCallback errcallback;
 
 	Assert(params != NULL);
-	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
 	/* not every AM requires these to be valid, but heap does */
@@ -501,8 +534,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 
 	/* Open all indexes of the relation */
 	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
-	vacrelstats->useindex = (nindexes > 0 &&
-							 params->index_cleanup == VACOPT_TERNARY_ENABLED);
+	vacrelstats->hasindex = (nindexes > 0);
 
 	/*
 	 * Setup error traceback support for ereport().  The idea is to set up an
@@ -763,6 +795,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber empty_pages,
 				vacuumed_pages,
 				next_fsm_block_to_vacuum;
+	int			maxdeadtups = 0;	/* maximum # of dead tuples in a single page */
 	double		num_tuples,		/* total number of nonremovable tuples */
 				live_tuples,	/* live tuples (reltuples estimate) */
 				tups_vacuumed,	/* tuples cleaned up by vacuum */
@@ -811,14 +844,24 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	vacrelstats->nonempty_pages = 0;
 	vacrelstats->latestRemovedXid = InvalidTransactionId;
 
+	/*
+	 * Index vacuum cleanup is requested if index cleanup is not disabled,
+	 * i.e., it's true when the option is either default or enabled.
+	 */
+	vacrelstats->indexcleanup_requested =
+		(params->index_cleanup != VACOPT_TERNARY_DISABLED);
+
 	vistest = GlobalVisTestFor(onerel);
 
 	/*
 	 * Initialize state for a parallel vacuum.  As of now, only one worker can
 	 * be used for an index, so we invoke parallelism only if there are at
-	 * least two indexes on a table.
+	 * least two indexes on a table. When index cleanup is disabled, index
+	 * bulk-deletion is likely to be a no-op, so we also disable parallel
+	 * vacuum.
 	 */
-	if (params->nworkers >= 0 && vacrelstats->useindex && nindexes > 1)
+	if (params->nworkers >= 0 && nindexes > 1 &&
+		params->index_cleanup != VACOPT_TERNARY_DISABLED)
 	{
 		/*
 		 * Since parallel workers cannot access data in temporary tables, we
@@ -846,7 +889,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * initialized.
 	 */
 	if (!ParallelVacuumIsActive(lps))
-		lazy_space_alloc(vacrelstats, nblocks);
+		lazy_space_alloc(vacrelstats, nblocks, nindexes);
 
 	dead_tuples = vacrelstats->dead_tuples;
 	frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
@@ -1050,19 +1093,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				vmbuffer = InvalidBuffer;
 			}
 
-			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(onerel, Irel, indstats,
-									vacrelstats, lps, nindexes);
-
-			/* Remove tuples from heap */
-			lazy_vacuum_heap(onerel, vacrelstats);
-
-			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
-			 */
-			dead_tuples->num_tuples = 0;
+			/* Vacuum the table and its indexes */
+			lazy_vacuum_table_and_indexes(onerel, params, vacrelstats,
+										  Irel, nindexes, indstats,
+										  lps, &maxdeadtups);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
@@ -1512,32 +1546,16 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		/*
 		 * If there are no indexes we can vacuum the page right now instead of
-		 * doing a second scan. Also we don't do that but forget dead tuples
-		 * when index cleanup is disabled.
+		 * doing a second scan.
 		 */
-		if (!vacrelstats->useindex && dead_tuples->num_tuples > 0)
+		if (!vacrelstats->hasindex && dead_tuples->num_tuples > 0)
 		{
-			if (nindexes == 0)
-			{
-				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
-				vacuumed_pages++;
-				has_dead_tuples = false;
-			}
-			else
-			{
-				/*
-				 * Here, we have indexes but index cleanup is disabled.
-				 * Instead of vacuuming the dead tuples on the heap, we just
-				 * forget them.
-				 *
-				 * Note that vacrelstats->dead_tuples could have tuples which
-				 * became dead after HOT-pruning but are not marked dead yet.
-				 * We do not process them because it's a very rare condition,
-				 * and the next vacuum will process them anyway.
-				 */
-				Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
-			}
+			Assert(nindexes == 0);
+
+			/* Remove tuples from heap if the table has no index */
+			lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+			vacuumed_pages++;
+			has_dead_tuples = false;
 
 			/*
 			 * Forget the now-vacuumed tuples, and press on, but be careful
@@ -1663,6 +1681,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (dead_tuples->num_tuples == prev_dead_count)
 			RecordPageWithFreeSpace(onerel, blkno, freespace);
+		else
+			maxdeadtups = Max(maxdeadtups,
+							  dead_tuples->num_tuples - prev_dead_count);
 	}
 
 	/* report that everything is scanned and vacuumed */
@@ -1702,14 +1723,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/* If any tuples need to be deleted, perform final vacuum cycle */
 	/* XXX put a threshold on min number of tuples here? */
 	if (dead_tuples->num_tuples > 0)
-	{
-		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats,
-								lps, nindexes);
-
-		/* Remove tuples from heap */
-		lazy_vacuum_heap(onerel, vacrelstats);
-	}
+		lazy_vacuum_table_and_indexes(onerel, params, vacrelstats,
+									  Irel, nindexes, indstats,
+									  lps, &maxdeadtups);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
@@ -1722,7 +1738,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Do post-vacuum cleanup */
-	if (vacrelstats->useindex)
+	if (vacrelstats->hasindex)
 		lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, nindexes);
 
 	/*
@@ -1775,6 +1791,128 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	pfree(buf.data);
 }
 
+/*
+ * Remove the collected garbage tuples from the table and its indexes.
+ */
+static void
+lazy_vacuum_table_and_indexes(Relation onerel, VacuumParams *params,
+							  LVRelStats *vacrelstats, Relation *Irel,
+							  int nindexes, IndexBulkDeleteResult **indstats,
+							  LVParallelState *lps, int *maxdeadtups)
+{
+	/*
+	 * Choose the vacuum strategy for this vacuum cycle.
+	 * choose_vacuum_strategy() will set the decision to
+	 * vacrelstats->vacuum_heap.
+	 */
+	choose_vacuum_strategy(onerel, vacrelstats, params, Irel, nindexes,
+						   *maxdeadtups);
+
+	/* Work on all the indexes, then the heap */
+	lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats, lps,
+							nindexes);
+
+	if (vacrelstats->vacuum_heap)
+	{
+		/* Remove tuples from heap */
+		lazy_vacuum_heap(onerel, vacrelstats);
+	}
+	else
+	{
+		/*
+		 * Here, we don't do heap vacuum in this cycle.
+		 *
+		 * Note that vacrelstats->dead_tuples could have tuples which
+		 * became dead after HOT-pruning but are not marked dead yet.
+		 * We do not process them because it's a very rare condition,
+		 * and the next vacuum will process them anyway.
+		 */
+		Assert(params->index_cleanup != VACOPT_TERNARY_ENABLED);
+	}
+
+	/*
+	 * Forget the now-vacuumed tuples, and press on, but be careful
+	 * not to reset latestRemovedXid since we want that value to be
+	 * valid.
+	 */
+	vacrelstats->dead_tuples->num_tuples = 0;
+	*maxdeadtups = 0;
+}
+
+/*
+ * Decide whether or not we remove the collected garbage tuples from the
+ * heap. The decision is set to vacrelstats->vacuum_heap. ndeaditems is
+ * the maximum number of LP_DEAD items on any one heap page encountered
+ * during the heap scan.
+ */
+static void
+choose_vacuum_strategy(Relation onerel, LVRelStats *vacrelstats,
+					   VacuumParams *params, Relation *Irel, int nindexes,
+					   int ndeaditems)
+{
+	bool vacuum_heap = true;
+	int i;
+
+	/*
+	 * Ask each index for its vacuum strategy, and save the answers. If even
+	 * one index returns 'none', we can skip heap vacuum in this cycle, at
+	 * least from the index strategies' point of view. This decision might be
+	 * overridden by other factors; see below.
+	 */
+	for (i = 0; i < nindexes; i++)
+	{
+		IndexVacuumStrategy ivstrat;
+		IndexVacuumInfo ivinfo;
+
+		ivinfo.index = Irel[i];
+		ivinfo.message_level = elevel;
+
+		ivstrat = index_vacuum_strategy(&ivinfo, params);
+
+		/* Save the returned value */
+		vacrelstats->ivstrategies[i] = ivstrat;
+
+		if (ivstrat == INDEX_VACUUM_STRATEGY_NONE)
+			vacuum_heap = false;
+	}
+
+	/* If the index cleanup option is specified, override the decision */
+	if (params->index_cleanup == VACOPT_TERNARY_ENABLED)
+		vacuum_heap = true;
+	else if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		vacuum_heap = false;
+	else if (!vacuum_heap)
+	{
+		Size freespace = RelationGetTargetPageFreeSpace(onerel,
+														HEAP_DEFAULT_FILLFACTOR);
+		int ndeaditems_limit = (int) ((freespace / sizeof(ItemIdData)) * 0.7);
+
+		/*
+		 * Check whether we need to delete the collected garbage from the heap,
+		 * from the heap point of view.
+		 *
+		 * The test of ndeaditems_limit is for the maximum number of LP_DEAD
+		 * items on any one heap page encountered during heap scan by caller.
+		 * The general idea here is to preserve the original pristine state of
+		 * the table when it is subject to constant non-HOT updates and the
+		 * heap fill factor is reduced from its default.
+		 *
+		 * ndeaditems_limit is calculated by using the freespace left by
+		 * fillfactor -- we can fit (freespace / sizeof(ItemIdData)) LP_DEAD
+		 * items on heap pages before they start to "overflow" with that setting.
+		 * We're trying to avoid having VACUUM call lazy_vacuum_heap() in most
+		 * cases, but we don't want to be too aggressive: it would be risky to
+		 * make the value we test for much higher, since it might be too late
+		 * by the time we actually call lazy_vacuum_heap(). We therefore
+		 * multiply by 0.7 as a safety factor.
+		 */
+		if (ndeaditems > ndeaditems_limit)
+			vacuum_heap = true;
+	}
+
+	vacrelstats->vacuum_heap = vacuum_heap;
+}
+
 /*
  *	lazy_vacuum_all_indexes() -- vacuum all indexes of relation.
  *
@@ -1818,7 +1956,8 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 
 		for (idx = 0; idx < nindexes; idx++)
 			lazy_vacuum_index(Irel[idx], &stats[idx], vacrelstats->dead_tuples,
-							  vacrelstats->old_live_tuples, vacrelstats);
+							  vacrelstats->old_live_tuples, vacrelstats,
+							  vacrelstats->ivstrategies[idx]);
 	}
 
 	/* Increase and report the number of index scans */
@@ -1827,7 +1966,6 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 								 vacrelstats->num_index_scans);
 }
 
-
 /*
  *	lazy_vacuum_heap() -- second pass over the heap
  *
@@ -2092,7 +2230,7 @@ lazy_parallel_vacuum_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
 							 LVRelStats *vacrelstats, LVParallelState *lps,
 							 int nindexes)
 {
-	int			nworkers;
+	int			nworkers = 0;
 
 	Assert(!IsParallelWorker());
 	Assert(ParallelVacuumIsActive(lps));
@@ -2108,10 +2246,32 @@ lazy_parallel_vacuum_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
 			nworkers = lps->nindexes_parallel_cleanup;
 	}
 	else
-		nworkers = lps->nindexes_parallel_bulkdel;
+	{
+		if (vacrelstats->vacuum_heap)
+			nworkers = lps->nindexes_parallel_bulkdel;
+		else
+		{
+			int i;
+
+			/*
+			 * If we don't vacuum the heap, index bulk-deletion can be skipped
+			 * for some indexes. So we count how many indexes will do index
+			 * bulk-deletion based on their answers to amvacuumstrategy.
+			 */
+			for (i = 0; i < nindexes; i++)
+			{
+				uint8 vacoptions = Irel[i]->rd_indam->amparallelvacuumoptions;
+
+				if ((vacoptions & VACUUM_OPTION_PARALLEL_BULKDEL) != 0 &&
+					vacrelstats->ivstrategies[i] == INDEX_VACUUM_STRATEGY_BULKDELETE)
+					nworkers++;
+			}
+		}
+	}
 
 	/* The leader process will participate */
-	nworkers--;
+	if (nworkers > 0)
+		nworkers--;
 
 	/*
 	 * It is possible that parallel context is initialized with fewer workers
@@ -2120,6 +2280,10 @@ lazy_parallel_vacuum_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
 	 */
 	nworkers = Min(nworkers, lps->pcxt->nworkers);
 
+	/* Copy the information to the shared state */
+	lps->lvshared->vacuum_heap = vacrelstats->vacuum_heap;
+	lps->lvshared->indexcleanup_requested = vacrelstats->indexcleanup_requested;
+
 	/* Setup the shared cost-based vacuum delay and launch workers */
 	if (nworkers > 0)
 	{
@@ -2254,7 +2418,8 @@ parallel_vacuum_index(Relation *Irel, IndexBulkDeleteResult **stats,
 
 		/* Do vacuum or cleanup of the index */
 		vacuum_one_index(Irel[idx], &(stats[idx]), lvshared, shared_indstats,
-						 dead_tuples, vacrelstats);
+						 dead_tuples, vacrelstats,
+						 vacrelstats->ivstrategies[idx]);
 	}
 
 	/*
@@ -2295,7 +2460,7 @@ vacuum_indexes_leader(Relation *Irel, IndexBulkDeleteResult **stats,
 			skip_parallel_vacuum_index(Irel[i], lps->lvshared))
 			vacuum_one_index(Irel[i], &(stats[i]), lps->lvshared,
 							 shared_indstats, vacrelstats->dead_tuples,
-							 vacrelstats);
+							 vacrelstats, vacrelstats->ivstrategies[i]);
 	}
 
 	/*
@@ -2315,7 +2480,8 @@ vacuum_indexes_leader(Relation *Irel, IndexBulkDeleteResult **stats,
 static void
 vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
 				 LVShared *lvshared, LVSharedIndStats *shared_indstats,
-				 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats)
+				 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats,
+				 IndexVacuumStrategy ivstrat)
 {
 	IndexBulkDeleteResult *bulkdelete_res = NULL;
 
@@ -2338,7 +2504,7 @@ vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
 						   lvshared->estimated_count, vacrelstats);
 	else
 		lazy_vacuum_index(indrel, stats, dead_tuples,
-						  lvshared->reltuples, vacrelstats);
+						  lvshared->reltuples, vacrelstats, ivstrat);
 
 	/*
 	 * Copy the index bulk-deletion result returned from ambulkdelete and
@@ -2429,7 +2595,8 @@ lazy_cleanup_all_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
  */
 static void
 lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
-				  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats)
+				  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats,
+				  IndexVacuumStrategy ivstrat)
 {
 	IndexVacuumInfo ivinfo;
 	PGRUsage	ru0;
@@ -2443,7 +2610,9 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	ivinfo.estimated_count = true;
 	ivinfo.message_level = elevel;
 	ivinfo.num_heap_tuples = reltuples;
-	ivinfo.strategy = vac_strategy;
+	ivinfo.strategy = vac_strategy; /* buffer access strategy */
+	ivinfo.will_vacuum_heap = vacrelstats->vacuum_heap;
+	ivinfo.indvac_strategy = ivstrat; /* index vacuum strategy */
 
 	/*
 	 * Update error traceback information.
@@ -2461,11 +2630,17 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	*stats = index_bulk_delete(&ivinfo, *stats,
 							   lazy_tid_reaped, (void *) dead_tuples);
 
-	ereport(elevel,
-			(errmsg("scanned index \"%s\" to remove %d row versions",
-					vacrelstats->indname,
-					dead_tuples->num_tuples),
-			 errdetail_internal("%s", pg_rusage_show(&ru0))));
+	/*
+	 * Report the index bulk-deletion stats. If the index returned the
+	 * statistics and we are going to vacuum the heap, we can assume it
+	 * has done the index bulk-deletion.
+	 */
+	if (*stats && vacrelstats->vacuum_heap)
+		ereport(elevel,
+				(errmsg("scanned index \"%s\" to remove %d row versions",
+						vacrelstats->indname,
+						dead_tuples->num_tuples),
+				 errdetail_internal("%s", pg_rusage_show(&ru0))));
 
 	/* Revert to the previous phase information for error traceback */
 	restore_vacuum_error_info(vacrelstats, &saved_err_info);
@@ -2495,9 +2670,9 @@ lazy_cleanup_index(Relation indrel,
 	ivinfo.report_progress = false;
 	ivinfo.estimated_count = estimated_count;
 	ivinfo.message_level = elevel;
-
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vac_strategy;
+	ivinfo.vacuumcleanup_requested = vacrelstats->indexcleanup_requested;
 
 	/*
 	 * Update error traceback information.
@@ -2844,14 +3019,14 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
  * Return the maximum number of dead tuples we can record.
  */
 static long
-compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
+compute_max_dead_tuples(BlockNumber relblocks, bool hasindex)
 {
 	long		maxtuples;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
 	autovacuum_work_mem != -1 ?
 	autovacuum_work_mem : maintenance_work_mem;
 
-	if (useindex)
+	if (hasindex)
 	{
 		maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
 		maxtuples = Min(maxtuples, INT_MAX);
@@ -2876,18 +3051,21 @@ compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
  * See the comments at the head of this file for rationale.
  */
 static void
-lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
+lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks,
+				 int nindexes)
 {
 	LVDeadTuples *dead_tuples = NULL;
 	long		maxtuples;
 
-	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->useindex);
+	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->hasindex);
 
 	dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
 	dead_tuples->num_tuples = 0;
 	dead_tuples->max_tuples = (int) maxtuples;
 
 	vacrelstats->dead_tuples = dead_tuples;
+	vacrelstats->ivstrategies =
+		(IndexVacuumStrategy *) palloc0(SizeOfIndVacStrategies(nindexes));
 }
 
 /*
@@ -3223,10 +3401,12 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	LVDeadTuples *dead_tuples;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
+	IndexVacuumStrategy *ivstrats;
 	bool	   *can_parallel_vacuum;
 	long		maxtuples;
 	Size		est_shared;
 	Size		est_deadtuples;
+	Size		est_ivstrategies;
 	int			nindexes_mwm = 0;
 	int			parallel_workers = 0;
 	int			querylen;
@@ -3320,6 +3500,13 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 						   mul_size(sizeof(WalUsage), pcxt->nworkers));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+	/*
+	 * Estimate space for IndexVacuumStrategy -- PARALLEL_VACUUM_KEY_IND_STRATEGY.
+	 */
+	est_ivstrategies = MAXALIGN(SizeOfIndVacStrategies(nindexes));
+	shm_toc_estimate_chunk(&pcxt->estimator, est_ivstrategies);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 	/* Finally, estimate PARALLEL_VACUUM_KEY_QUERY_TEXT space */
 	if (debug_query_string)
 	{
@@ -3372,6 +3559,11 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_WAL_USAGE, wal_usage);
 	lps->wal_usage = wal_usage;
 
+	/* Allocate space for each index strategy */
+	ivstrats = shm_toc_allocate(pcxt->toc, est_ivstrategies);
+	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_IND_STRATEGY, ivstrats);
+	vacrelstats->ivstrategies = ivstrats;
+
 	/* Store query string for workers */
 	if (debug_query_string)
 	{
@@ -3507,6 +3699,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	Relation   *indrels;
 	LVShared   *lvshared;
 	LVDeadTuples *dead_tuples;
+	IndexVacuumStrategy *ivstrats;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
 	int			nindexes;
@@ -3548,6 +3741,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 												  PARALLEL_VACUUM_KEY_DEAD_TUPLES,
 												  false);
 
+	/* Set vacuum strategy space */
+	ivstrats = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_IND_STRATEGY, false);
+	vacrelstats.ivstrategies = ivstrats;
+
 	/* Set cost-based vacuum delay */
 	VacuumCostActive = (VacuumCostDelay > 0);
 	VacuumCostBalance = 0;
@@ -3573,6 +3770,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	vacrelstats.indname = NULL;
 	vacrelstats.phase = VACUUM_ERRCB_PHASE_UNKNOWN; /* Not yet processing */
 
+	vacrelstats.vacuum_heap = lvshared->vacuum_heap;
+	vacrelstats.indexcleanup_requested = lvshared->indexcleanup_requested;
+
 	/* Setup error traceback support for ereport() */
 	errcallback.callback = vacuum_error_callback;
 	errcallback.arg = &vacrelstats;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 863430f910..e00e5fe0a4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -823,6 +823,18 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 		 */
 		result = true;
 	}
+	else if (!info->vacuumcleanup_requested)
+	{
+		/*
+		 * Skip cleanup if INDEX_CLEANUP is set to false, even if there might
+		 * be a deleted page that can be recycled. If INDEX_CLEANUP remains
+		 * disabled, deleted pages could be left unrecycled past XID wraparound.
+		 * In practice that is not very harmful, since such a workload doesn't
+		 * need to delete and recycle pages in any case, and deletion of btree
+		 * index pages is relatively rare.
+		 */
+		result = false;
+	}
 	else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
 			 GlobalVisCheckRemovableXid(NULL, metad->btm_oldest_btpo_xact))
 	{
@@ -892,6 +904,14 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Relation	rel = info->index;
 	BTCycleId	cycleid;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted and this index opted to skip bulk-deletion.
+	 */
+	if (!info->will_vacuum_heap &&
+		info->indvac_strategy == INDEX_VACUUM_STRATEGY_NONE)
+		return stats;
+
 	/* allocate stats if first time through, else re-use existing struct */
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 5de6dd0fdf..f44043d94f 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -920,6 +920,13 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 {
 	spgBulkDeleteState bds;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted.
+	 */
+	if (!info->will_vacuum_heap)
+		return NULL;
+
 	/* allocate stats if first time through, else re-use existing struct */
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
@@ -950,8 +957,11 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
 	spgBulkDeleteState bds;
 
-	/* No-op in ANALYZE ONLY mode */
-	if (info->analyze_only)
+	/*
+	 * No-op in ANALYZE ONLY mode or when user requests to disable index
+	 * cleanup.
+	 */
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b8cd35e995..30b48d6ccb 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3401,6 +3401,8 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.will_vacuum_heap = true;
+	ivinfo.indvac_strategy = INDEX_VACUUM_STRATEGY_BULKDELETE;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7295cf0215..111addbd6c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -668,6 +668,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.vacuumcleanup_requested = true;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 462f9a0f82..4ab20b77e6 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1870,17 +1870,20 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	onerelid = onerel->rd_lockInfo.lockRelId;
 	LockRelationIdForSession(&onerelid, lmode);
 
-	/* Set index cleanup option based on reloptions if not yet */
-	if (params->index_cleanup == VACOPT_TERNARY_DEFAULT)
-	{
-		if (onerel->rd_options == NULL ||
-			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
-			params->index_cleanup = VACOPT_TERNARY_ENABLED;
-		else
-			params->index_cleanup = VACOPT_TERNARY_DISABLED;
-	}
+	/*
+	 * Set index cleanup option if vacuum_index_cleanup reloption is set.
+	 * Otherwise we leave it as 'default', which means that we choose vacuum
+	 * strategy based on the table and index status. See choose_vacuum_strategy().
+	 */
+	if (params->index_cleanup == VACOPT_TERNARY_DEFAULT &&
+		onerel->rd_options != NULL)
+		params->index_cleanup =
+			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup;
 
-	/* Set truncate option based on reloptions if not yet */
+	/*
+	 * Set the truncate option based on reloptions if not yet set. The
+	 * truncate option is true by default.
+	 */
 	if (params->truncate == VACOPT_TERNARY_DEFAULT)
 	{
 		if (onerel->rd_options == NULL ||
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 6c1c4798e3..f164ec1a54 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -34,6 +34,14 @@ typedef struct IndexBuildResult
 	double		index_tuples;	/* # of tuples inserted into index */
 } IndexBuildResult;
 
+/* Result value for amvacuumstrategy */
+typedef enum IndexVacuumStrategy
+{
+	INDEX_VACUUM_STRATEGY_NONE,			/* No-op, skip bulk-deletion in this
+										 * vacuum cycle */
+	INDEX_VACUUM_STRATEGY_BULKDELETE	/* Do ambulkdelete */
+} IndexVacuumStrategy;
+
 /*
  * Struct for input arguments passed to amvacuumstrategy, ambulkdelete
  * and amvacuumcleanup
@@ -52,6 +60,26 @@ typedef struct IndexVacuumInfo
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
+
+	/*
+	 * True if lazy vacuum will delete the collected garbage tuples from
+	 * the heap.  If it's false, the index AM can safely skip index
+	 * bulk-deletion.  This field is used only for ambulkdelete.
+	 */
+	bool		will_vacuum_heap;
+
+	/*
+	 * The answer to amvacuumstrategy asked before executing ambulkdelete.
+	 * This field is used only for ambulkdelete.
+	 */
+	IndexVacuumStrategy indvac_strategy;
+
+	/*
+	 * True if lazy vacuum requests amvacuumcleanup. If false, the index AM
+	 * can skip index cleanup; this can happen when the INDEX_CLEANUP vacuum
+	 * option is set to false. This field is used only for amvacuumcleanup.
+	 */
+	bool		vacuumcleanup_requested;
 } IndexVacuumInfo;
 
 /*
@@ -127,14 +155,6 @@ typedef struct IndexOrderByDistance
 	bool		isnull;
 } IndexOrderByDistance;
 
-/* Result value for amvacuumstrategy */
-typedef enum IndexVacuumStrategy
-{
-	INDEX_VACUUM_STRATEGY_NONE,			/* No-op, skip bulk-deletion in this
-										 * vacuum cycle */
-	INDEX_VACUUM_STRATEGY_BULKDELETE	/* Do ambulkdelete */
-} IndexVacuumStrategy;
-
 /*
  * generalized index_ interface routines (in indexam.c)
  */
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 7c62852e7f..038e7cd580 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -563,17 +563,18 @@ do { \
 /*
  * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can
  * fit on one heap page.  (Note that indexes could have more, because they
- * use a smaller tuple header.)  We arrive at the divisor because each tuple
- * must be maxaligned, and it must have an associated line pointer.
+ * use a smaller tuple header.)  We arrive at the divisor because each line
+ * pointer must be maxaligned.
  *
- * Note: with HOT, there could theoretically be more line pointers (not actual
- * tuples) than this on a heap page.  However we constrain the number of line
- * pointers to this anyway, to avoid excessive line-pointer bloat and not
- * require increases in the size of work arrays.
+ * We used to constrain the number of line pointers to avoid excessive
+ * line-pointer bloat and not require increases in the size of work arrays.
+ * But since index vacuum strategy had entered the picture, accumulating
+ * LP_DEAD line pointer has value of skipping index deletion.
+ *
+ * XXX: allowing to fill the heap page with only line pointer seems a overkill.
  */
 #define MaxHeapTuplesPerPage	\
-	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
-			(MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))
+	((int) ((BLCKSZ - SizeOfPageHeaderData) / (MAXALIGN(sizeof(ItemIdData)))))
 
 /*
  * MaxAttrSize is a somewhat arbitrary upper limit on the declared size of
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 191cbbd004..f2590c3b6e 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -21,6 +21,7 @@
 #include "parser/parse_node.h"
 #include "storage/buf.h"
 #include "storage/lock.h"
+#include "utils/rel.h"
 #include "utils/relcache.h"
 
 /*
@@ -184,19 +185,6 @@ typedef struct VacAttrStats
 #define VACOPT_SKIPTOAST 0x40	/* don't process the TOAST table, if any */
 #define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip any pages */
 
-/*
- * A ternary value used by vacuum parameters.
- *
- * DEFAULT value is used to determine the value based on other
- * configurations, e.g. reloptions.
- */
-typedef enum VacOptTernaryValue
-{
-	VACOPT_TERNARY_DEFAULT = 0,
-	VACOPT_TERNARY_DISABLED,
-	VACOPT_TERNARY_ENABLED,
-} VacOptTernaryValue;
-
 /*
  * Parameters customizing behavior of VACUUM and ANALYZE.
  *
@@ -216,8 +204,10 @@ typedef struct VacuumParams
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which  verbose logs are activated, -1
 									 * to use default */
-	VacOptTernaryValue index_cleanup;	/* Do index vacuum and cleanup,
-										 * default value depends on reloptions */
+	VacOptTernaryValue index_cleanup;	/* Do index vacuum and cleanup. In
+										 * default mode, it's decided based on
+										 * multiple factors. See
+										 * choose_vacuum_strategy. */
 	VacOptTernaryValue truncate;	/* Truncate empty pages at the end,
 									 * default value depends on reloptions */
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 10b63982c0..168dc5d466 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -295,6 +295,20 @@ typedef struct AutoVacOpts
 	float8		analyze_scale_factor;
 } AutoVacOpts;
 
+/*
+ * A ternary value used by vacuum parameters. This value is also used
+ * for VACUUM command options.
+ *
+ * DEFAULT value is used to determine the value based on other
+ * configurations, e.g. reloptions.
+ */
+typedef enum VacOptTernaryValue
+{
+	VACOPT_TERNARY_DEFAULT = 0,
+	VACOPT_TERNARY_DISABLED,
+	VACOPT_TERNARY_ENABLED,
+} VacOptTernaryValue;
+
 typedef struct StdRdOptions
 {
 	int32		vl_len_;		/* varlena header (do not touch directly!) */
@@ -304,7 +318,8 @@ typedef struct StdRdOptions
 	AutoVacOpts autovacuum;		/* autovacuum-related options */
 	bool		user_catalog_table; /* use as an additional catalog relation */
 	int			parallel_workers;	/* max number of parallel workers */
-	bool		vacuum_index_cleanup;	/* enables index vacuuming and cleanup */
+	VacOptTernaryValue	vacuum_index_cleanup;	/* enables index vacuuming
+												 * and cleanup */
 	bool		vacuum_truncate;	/* enables vacuum to truncate a relation */
 } StdRdOptions;
 
diff --git a/src/test/modules/test_ginpostinglist/expected/test_ginpostinglist.out b/src/test/modules/test_ginpostinglist/expected/test_ginpostinglist.out
index 4d0beaecea..f883eb2601 100644
--- a/src/test/modules/test_ginpostinglist/expected/test_ginpostinglist.out
+++ b/src/test/modules/test_ginpostinglist/expected/test_ginpostinglist.out
@@ -6,11 +6,11 @@ CREATE EXTENSION test_ginpostinglist;
 SELECT test_ginpostinglist();
 NOTICE:  testing with (0, 1), (0, 2), max 14 bytes
 NOTICE:  encoded 2 item pointers to 10 bytes
-NOTICE:  testing with (0, 1), (0, 291), max 14 bytes
+NOTICE:  testing with (0, 1), (0, 1021), max 14 bytes
 NOTICE:  encoded 2 item pointers to 10 bytes
-NOTICE:  testing with (0, 1), (4294967294, 291), max 14 bytes
+NOTICE:  testing with (0, 1), (4294967294, 1021), max 14 bytes
 NOTICE:  encoded 1 item pointers to 8 bytes
-NOTICE:  testing with (0, 1), (4294967294, 291), max 16 bytes
+NOTICE:  testing with (0, 1), (4294967294, 1021), max 16 bytes
 NOTICE:  encoded 2 item pointers to 16 bytes
  test_ginpostinglist 
 ---------------------
-- 
2.27.0

#13Zhihong Yu
zyu@yugabyte.com
In reply to: Masahiko Sawada (#12)
Re: New IndexAM API controlling index vacuum strategies

Hi, Masahiko-san:

For v2-0001-Introduce-IndexAM-API-for-choosing-index-vacuum-s.patch :

For blvacuumstrategy():

+   if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+       return INDEX_VACUUM_STRATEGY_NONE;
+   else
+       return INDEX_VACUUM_STRATEGY_BULKDELETE;

The 'else' can be omitted.

Similar comment for ginvacuumstrategy().
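
That is, the same result without the 'else' (just a sketch of the shape I mean):

    if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
        return INDEX_VACUUM_STRATEGY_NONE;

    return INDEX_VACUUM_STRATEGY_BULKDELETE;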

For v2-0002-Choose-index-vacuum-strategy-based-on-amvacuumstr.patch :

If index_cleanup option is specified neither VACUUM command nor
storage option

I think this is what you meant (by not using passive voice):

If index_cleanup option specifies neither VACUUM command nor
storage option,

- * integer, but you can't fit that many items on a page. 11 ought to be more
+ * integer, but you can't fit that many items on a page. 13 ought to be more

It would be nice to add a note why the number of bits is increased.

For choose_vacuum_strategy():

+ IndexVacuumStrategy ivstrat;

The variable is only used inside the loop. You can
use vacrelstats->ivstrategies[i] directly and omit the variable.

+ int ndeaditems_limit = (int) ((freespace / sizeof(ItemIdData)) * 0.7);

How was the factor of 0.7 determined? The comment below only mentions 'safety
factor' but not how it was chosen.
I also wonder if this factor should be exposed as a GUC.

+ if (nworkers > 0)
+ nworkers--;

Should a log message / assert be added when nworkers is <= 0?
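
For example (purely illustrative; the message text is made up):

    if (nworkers > 0)
        nworkers--;
    else
        elog(DEBUG2, "no indexes are eligible for parallel vacuuming in this cycle");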

+ * XXX: allowing to fill the heap page with only line pointer seems a
overkill.

'a overkill' -> 'an overkill'

Cheers

#14Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#11)
Re: New IndexAM API controlling index vacuum strategies

On Sun, Jan 17, 2021 at 9:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

After more thought, I think that ambulkdelete needs to be able to
refer the answer to amvacuumstrategy. That way, the index can skip
bulk-deletion when lazy vacuum doesn't vacuum heap and it also doesn’t
want to do that.

Makes sense.

BTW, your patch has bitrot already. Peter E's recent pageinspect
commit happens to conflict with this patch. It might make sense to
produce a new version that just fixes the bitrot, so that other people
don't have to deal with it each time.

I’ve attached the updated version patch that includes the following changes:

Looks good. I'll give this version a review now. I will do a lot more
soon. I need to come up with a good benchmark for this, that I can
return to again and again during review as needed.

Some feedback on the first patch:

* Just so you know: I agree with you about handling
VACOPT_TERNARY_DISABLED in the index AM's amvacuumstrategy routine. I
think that it's better to do that there, even though this choice may
have some downsides.

* Can you add some "stub" sgml doc changes for this? Doesn't have to
be complete in any way. Just a placeholder for later, that has the
correct general "shape" to orientate the reader of the patch. It can
just be a FIXME comment, plus basic mechanical stuff -- details of the
new amvacuumstrategy_function routine and its signature.

Some feedback on the second patch:

* Why do you move around IndexVacuumStrategy in the second patch?
Looks like a rebasing oversight.

* Actually, do we really need the first and second patches to be
separate patches? I agree that the nbtree patch should be a separate
patch, but dividing the first two sets of changes doesn't seem like it
adds much. Did I miss something?

* Is the "MAXALIGN(sizeof(ItemIdData)))" change in the definition of
MaxHeapTuplesPerPage appropriate? Here is the relevant section from
the patch:

diff --git a/src/include/access/htup_details.h
b/src/include/access/htup_details.h
index 7c62852e7f..038e7cd580 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -563,17 +563,18 @@ do { \
 /*
  * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can
  * fit on one heap page.  (Note that indexes could have more, because they
- * use a smaller tuple header.)  We arrive at the divisor because each tuple
- * must be maxaligned, and it must have an associated line pointer.
+ * use a smaller tuple header.)  We arrive at the divisor because each line
+ * pointer must be maxaligned.
*** SNIP ***
 #define MaxHeapTuplesPerPage    \
-    ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
-            (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))
+    ((int) ((BLCKSZ - SizeOfPageHeaderData) / (MAXALIGN(sizeof(ItemIdData)))))

It's true that ItemIdData structs (line pointers) are aligned, but
they're not MAXALIGN()'d. If they were then the on-disk size of line
pointers would generally be 8 bytes, not 4 bytes.

* Maybe it would be better if you just changed the definition such
that "MAXALIGN(SizeofHeapTupleHeader)" became "MAXIMUM_ALIGNOF", with
no other changes? (Some variant of this suggestion might be better,
not sure.)

For some reason that feels a bit safer: we still have an "imaginary
tuple header", but it's just 1 MAXALIGN() quantum now. This is still
much less than the current 3 MAXALIGN() quantums (i.e. what
MaxHeapTuplesPerPage treats as the tuple header size). Do you think
that this alternative approach will be noticeably less effective
within vacuumlazy.c?
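
In other words, something like this (a sketch of the suggestion, not a tested
definition):

    #define MaxHeapTuplesPerPage \
        ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
                (MAXIMUM_ALIGNOF + sizeof(ItemIdData))))

With 8KB pages and 8-byte maximum alignment that works out to (8192 - 24) /
(8 + 4) = 680 items per page, versus 291 today and 1021 with the patch as
posted (which is what the test_ginpostinglist expected output now shows).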

Note that you probably understand the issue with MaxHeapTuplesPerPage
for vacuumlazy.c better than I do currently. I'm still trying to
understand your choices, and to understand what is really important
here.

* Maybe add a #define for the value 0.7? (I refer to the value used in
choose_vacuum_strategy() to calculate a "this is the number of LP_DEAD
line pointers that we consider too many" cut off point, which is to be
applied throughout lazy_scan_heap() processing.)
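
For example (the name is only a placeholder):

    /* Safety factor applied to the per-page LP_DEAD item limit */
    #define VACUUM_DEAD_ITEMS_SAFETY_FACTOR 0.7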

* I notice that your new lazy_vacuum_table_and_indexes() function is
the only place that calls lazy_vacuum_table_and_indexes(). I think
that you should merge them together -- replace the only remaining call
to lazy_vacuum_table_and_indexes() with the body of the function
itself. Having a separate lazy_vacuum_table_and_indexes() function
doesn't seem useful to me -- it doesn't actually hide complexity, and
might even be harder to maintain.

* I suggest thinking about what the last item will mean for the
reporting that currently takes place in
lazy_vacuum_table_and_indexes(), but will now go in an expanded
lazy_vacuum_table_and_indexes() -- how do we count the total number of
index scans now?

I don't actually believe that the logic needs to change, but some kind
of consolidation and streamlining seems like it might be helpful.
Maybe just a comment that says "note that all index scans might just
be no-ops because..." -- stuff like that.

* Any idea about how hard it will be to teach is_wraparound VACUUMs to
skip index cleanup automatically, based on some practical/sensible
criteria?

It would be nice to have a basic PoC for that, even if it remains a
PoC for the foreseeable future (i.e. even if it cannot be committed to
Postgres 14). This feature should definitely be something that your
patch series *enables*. I'd feel good about having covered that
question as part of this basic design work if there was a PoC. That
alone should make it 100% clear that it's easy to do (or no harder
than it should be -- it should ideally be compatible with your basic
design). The exact criteria that we use for deciding whether or not to
skip index cleanup (which probably should not just be "this VACUUM is
is_wraparound, good enough" in the final version) may need to be
debated at length on pgsql-hackers. Even still, it is "just a detail"
in the code. Whereas being *able* to do that with your design (now or
in the future) seems essential now.

* Store the answers to amvacuumstrategy into either the local memory
or DSM (in parallel vacuum case) so that ambulkdelete can refer the
answer to amvacuumstrategy.
* Fix regression failures.
* Update the documentation and commments.

Note that 0003 patch is still PoC quality, lacking the btree meta page
version upgrade.

This patch is not the hard part, of course -- there really isn't that
much needed here compared to vacuumlazy.c. So this patch seems like
the simplest 1 out of the 3 (at least to me).

Some feedback on the third patch:

* The new btm_last_deletion_nblocks metapage field should use P_NONE
(which is 0) to indicate never having been vacuumed -- not
InvalidBlockNumber (which is 0xFFFFFFFF).

This is more idiomatic in nbtree, which is nice, but it has a very
significant practical advantage: it ensures that every heapkeyspace
nbtree index (i.e. those on recent nbtree versions) can be treated as
if it has the new btm_last_deletion_nblocks field all along, even when
it actually built on Postgres 12 or 13. This trick will let you avoid
dealing with the headache of bumping BTREE_VERSION, which is a huge
advantage.

Note that this is the same trick I used to avoid bumping BTREE_VERSION
when the btm_allequalimage field needed to be added (for the nbtree
deduplication feature added to Postgres 13).

* Forgot to do this in the third patch (think I made this same mistake
once myself):

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8bb180bbbe..88dfea9af3 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -653,7 +653,7 @@ bt_metap(PG_FUNCTION_ARGS)
     BTMetaPageData *metad;
     TupleDesc   tupleDesc;
     int         j;
-    char       *values[9];
+    char       *values[10];
     Buffer      buffer;
     Page        page;
     HeapTuple   tuple;
@@ -734,6 +734,11 @@ bt_metap(PG_FUNCTION_ARGS)

That's all I have for now...
--
Peter Geoghegan

#15Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#14)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Jan 19, 2021 at 2:57 PM Peter Geoghegan <pg@bowt.ie> wrote:

* Maybe it would be better if you just changed the definition such
that "MAXALIGN(SizeofHeapTupleHeader)" became "MAXIMUM_ALIGNOF", with
no other changes? (Some variant of this suggestion might be better,
not sure.)

For some reason that feels a bit safer: we still have an "imaginary
tuple header", but it's just 1 MAXALIGN() quantum now. This is still
much less than the current 3 MAXALIGN() quantums (i.e. what
MaxHeapTuplesPerPage treats as the tuple header size). Do you think
that this alternative approach will be noticeably less effective
within vacuumlazy.c?

BTW, I think that increasing MaxHeapTuplesPerPage will make it
necessary to consider tidbitmap.c. Comments at the top of that file
say that it is assumed that MaxHeapTuplesPerPage is about 256. So
there is a risk of introducing performance regressions affecting
bitmap scans here.

Apparently some other DB systems make the equivalent of
MaxHeapTuplesPerPage dynamically configurable at the level of heap
tables. It usually doesn't matter, but it can matter with on-disk
bitmap indexes, where the bitmap must be encoded from raw TIDs (this
must happen before the bitmap is compressed -- there must be a simple
mapping from every possible TID to some bit in a bitmap first). The
item offset component of each heap TID is not usually very large, so
there is a trade-off between keeping the representation of bitmaps
efficient and not unduly restricting the number of distinct heap
tuples on each heap page. I think that there might be a similar
consideration here, in tidbitmap.c (even though it's not concerned
about on-disk bitmaps).

--
Peter Geoghegan

#16Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#15)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Jan 19, 2021 at 4:45 PM Peter Geoghegan <pg@bowt.ie> wrote:

BTW, I think that increasing MaxHeapTuplesPerPage will make it
necessary to consider tidbitmap.c. Comments at the top of that file
say that it is assumed that MaxHeapTuplesPerPage is about 256. So
there is a risk of introducing performance regressions affecting
bitmap scans here.

More concretely, WORDS_PER_PAGE increases from 5 on the master branch
to 16 with the latest version of the patch series on most platforms
(while WORDS_PER_CHUNK is 4 with or without the patches).
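
For reference, if I'm reading tidbitmap.c correctly, that follows directly from
how the per-page bitmaps are sized (assuming 64-bit bitmapwords):

    WORDS_PER_PAGE = (MaxHeapTuplesPerPage - 1) / BITS_PER_BITMAPWORD + 1

    master:  (291 - 1) / 64 + 1  = 5
    patched: (1021 - 1) / 64 + 1 = 16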

--
Peter Geoghegan

#17Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#15)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Jan 20, 2021 at 9:45 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Jan 19, 2021 at 2:57 PM Peter Geoghegan <pg@bowt.ie> wrote:

* Maybe it would be better if you just changed the definition such
that "MAXALIGN(SizeofHeapTupleHeader)" became "MAXIMUM_ALIGNOF", with
no other changes? (Some variant of this suggestion might be better,
not sure.)

For some reason that feels a bit safer: we still have an "imaginary
tuple header", but it's just 1 MAXALIGN() quantum now. This is still
much less than the current 3 MAXALIGN() quantums (i.e. what
MaxHeapTuplesPerPage treats as the tuple header size). Do you think
that this alternative approach will be noticeably less effective
within vacuumlazy.c?

BTW, I think that increasing MaxHeapTuplesPerPage will make it
necessary to consider tidbitmap.c. Comments at the top of that file
say that it is assumed that MaxHeapTuplesPerPage is about 256. So
there is a risk of introducing performance regressions affecting
bitmap scans here.

Apparently some other DB systems make the equivalent of
MaxHeapTuplesPerPage dynamically configurable at the level of heap
tables. It usually doesn't matter, but it can matter with on-disk
bitmap indexes, where the bitmap must be encoded from raw TIDs (this
must happen before the bitmap is compressed -- there must be a simple
mapping from every possible TID to some bit in a bitmap first). The
item offset component of each heap TID is not usually very large, so
there is a trade-off between keeping the representation of bitmaps
efficient and not unduly restricting the number of distinct heap
tuples on each heap page. I think that there might be a similar
consideration here, in tidbitmap.c (even though it's not concerned
about on-disk bitmaps).

That's a good point. With the patch, MaxHeapTuplesPerPage increases to
2042 with an 8k page and to 8186 with a 32k page, whereas it's
currently 291 with an 8k page and 1169 with a 32k page. So it is
likely to be a problem as you pointed out. If we change
"MAXALIGN(SizeofHeapTupleHeader)" to "MAXIMUM_ALIGNOF", it's 680 with
an 8k page and 2728 with a 32k page, which seems much better.

The purpose of increasing MaxHeapTuplesPerPage in the patch is to let
a heap page accumulate more LP_DEAD line pointers. As I explained
before, because of the MaxHeapTuplesPerPage limit, we cannot calculate
how many LP_DEAD line pointers can be accumulated in the space left by
fillfactor simply as ((the space left by fillfactor) / (size of a line
pointer)). We need to consider both how many line pointers are
available for LP_DEAD and how much space is available for LP_DEAD.

For example, suppose the tuple size is 50 bytes and fillfactor is 80.
Each page then has 1633 bytes (= (8192-24)*0.2) of free space left by
fillfactor, where 408 line pointers can fit. However, if we store 250
LP_DEAD line pointers in that space, only 41 tuples can be stored on
the page, even though the remaining 6534 bytes (= (8192-24)*0.8) have
room for 121 tuples (plus line pointers), because MaxHeapTuplesPerPage
is 291. So in this case, with a tuple size of 50 and fillfactor 80, we
can accumulate only up to about 170 LP_DEAD line pointers while
storing 121 tuples. Increasing MaxHeapTuplesPerPage raises this 291
limit and lets us ignore it when calculating the maximum number of
LP_DEAD line pointers that can be accumulated on a single page.

An alternative approach would be to calculate it using the average
tuple size. I think that if we know the tuple size, the maximum number
of LP_DEAD line pointers that can be accumulated on a single page is
the minimum of the following two formulas:

(1) MaxHeapTuplesPerPage - ((BLCKSZ - SizeOfPageHeaderData) *
(fillfactor/100)) / (sizeof(ItemIdData) + tuple_size); // how many
line pointers are available for LP_DEAD?

(2) ((BLCKSZ - SizeOfPageHeaderData) * ((100 - fillfactor)/100)) /
sizeof(ItemIdData); // how much space is available for LP_DEAD?

But I'd prefer to increase MaxHeapTuplesPerPage but not to affect the
bitmap much rather than introducing a complex theory.
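
Just to make the two formulas above concrete with the example numbers (the
variable names are made up; this is only a sketch):

    int         tuple_size = 50;
    int         fillfactor = 80;
    Size        space = BLCKSZ - SizeOfPageHeaderData;     /* 8168 with 8kB pages */

    /* (1) line pointers left over after the tuples that fit below fillfactor */
    int         limit_by_slots = MaxHeapTuplesPerPage -
        (int) ((space * fillfactor / 100) / (sizeof(ItemIdData) + tuple_size));    /* 291 - 121 = 170 */

    /* (2) line pointers that fit in the space left free by fillfactor */
    int         limit_by_space =
        (int) ((space * (100 - fillfactor) / 100) / sizeof(ItemIdData));           /* 408 */

    int         max_lp_dead = Min(limit_by_slots, limit_by_space);                 /* 170 */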

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#18Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#14)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Jan 20, 2021 at 7:58 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Jan 17, 2021 at 9:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

After more thought, I think that ambulkdelete needs to be able to
refer the answer to amvacuumstrategy. That way, the index can skip
bulk-deletion when lazy vacuum doesn't vacuum heap and it also doesn’t
want to do that.

Makes sense.

BTW, your patch has bitrot already. Peter E's recent pageinspect
commit happens to conflict with this patch. It might make sense to
produce a new version that just fixes the bitrot, so that other people
don't have to deal with it each time.

I’ve attached the updated version patch that includes the following changes:

Looks good. I'll give this version a review now. I will do a lot more
soon. I need to come up with a good benchmark for this, that I can
return to again and again during review as needed.

Thank you for reviewing the patches.

Some feedback on the first patch:

* Just so you know: I agree with you about handling
VACOPT_TERNARY_DISABLED in the index AM's amvacuumstrategy routine. I
think that it's better to do that there, even though this choice may
have some downsides.

* Can you add some "stub" sgml doc changes for this? Doesn't have to
be complete in any way. Just a placeholder for later, that has the
correct general "shape" to orientate the reader of the patch. It can
just be a FIXME comment, plus basic mechanical stuff -- details of the
new amvacuumstrategy_function routine and its signature.

The 0002 patch had the doc update (I mistakenly included it in the 0002
patch). Is that update what you meant?

Some feedback on the second patch:

* Why do you move around IndexVacuumStrategy in the second patch?
Looks like a rebasing oversight.

Check.

* Actually, do we really need the first and second patches to be
separate patches? I agree that the nbtree patch should be a separate
patch, but dividing the first two sets of changes doesn't seem like it
adds much. Did I miss something?

I separated the patches since I used to have separate patches when
adding other index AM options required by parallel vacuum. But I
agreed to merge the first two patches.

* Is the "MAXALIGN(sizeof(ItemIdData)))" change in the definition of
MaxHeapTuplesPerPage appropriate? Here is the relevant section from
the patch:

diff --git a/src/include/access/htup_details.h
b/src/include/access/htup_details.h
index 7c62852e7f..038e7cd580 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -563,17 +563,18 @@ do { \
/*
* MaxHeapTuplesPerPage is an upper bound on the number of tuples that can
* fit on one heap page.  (Note that indexes could have more, because they
- * use a smaller tuple header.)  We arrive at the divisor because each tuple
- * must be maxaligned, and it must have an associated line pointer.
+ * use a smaller tuple header.)  We arrive at the divisor because each line
+ * pointer must be maxaligned.
*** SNIP ***
#define MaxHeapTuplesPerPage    \
-    ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
-            (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))
+    ((int) ((BLCKSZ - SizeOfPageHeaderData) / (MAXALIGN(sizeof(ItemIdData)))))

It's true that ItemIdData structs (line pointers) are aligned, but
they're not MAXALIGN()'d. If they were then the on-disk size of line
pointers would generally be 8 bytes, not 4 bytes.

You're right. Will fix.

* Maybe it would be better if you just changed the definition such
that "MAXALIGN(SizeofHeapTupleHeader)" became "MAXIMUM_ALIGNOF", with
no other changes? (Some variant of this suggestion might be better,
not sure.)

For some reason that feels a bit safer: we still have an "imaginary
tuple header", but it's just 1 MAXALIGN() quantum now. This is still
much less than the current 3 MAXALIGN() quantums (i.e. what
MaxHeapTuplesPerPage treats as the tuple header size). Do you think
that this alternative approach will be noticeably less effective
within vacuumlazy.c?

Note that you probably understand the issue with MaxHeapTuplesPerPage
for vacuumlazy.c better than I do currently. I'm still trying to
understand your choices, and to understand what is really important
here.

Yeah, using MAXIMUM_ALIGNOF seems better for safety. I shared my
thoughts on the issue with MaxHeapTuplesPerPage yesterday. I think we
need to discuss how to deal with that.

* Maybe add a #define for the value 0.7? (I refer to the value used in
choose_vacuum_strategy() to calculate a "this is the number of LP_DEAD
line pointers that we consider too many" cut off point, which is to be
applied throughout lazy_scan_heap() processing.)

Agreed.

* I notice that your new lazy_vacuum_table_and_indexes() function is
the only place that calls lazy_vacuum_table_and_indexes(). I think
that you should merge them together -- replace the only remaining call
to lazy_vacuum_table_and_indexes() with the body of the function
itself. Having a separate lazy_vacuum_table_and_indexes() function
doesn't seem useful to me -- it doesn't actually hide complexity, and
might even be harder to maintain.

lazy_vacuum_table_and_indexes() is called from two places: after
maintenance_work_mem runs out (around L1097) and at the end of
lazy_scan_heap() (around L1726). I defined this function to bundle the
operations such as choosing a strategy, vacuuming indexes and
vacuuming the heap. Without this function, wouldn't we end up writing
the same code twice there?

* I suggest thinking about what the last item will mean for the
reporting that currently takes place in
lazy_vacuum_table_and_indexes(), but will now go in an expanded
lazy_vacuum_table_and_indexes() -- how do we count the total number of
index scans now?

I don't actually believe that the logic needs to change, but some kind
of consolidation and streamlining seems like it might be helpful.
Maybe just a comment that says "note that all index scans might just
be no-ops because..." -- stuff like that.

What do you mean by "the last item", and which report? I think
lazy_vacuum_table_and_indexes() itself doesn't report anything, and
vacrelstats->num_index_scans counts the total number of index scans.

* Any idea about how hard it will be to teach is_wraparound VACUUMs to
skip index cleanup automatically, based on some practical/sensible
criteria?

One simple idea would be to have a to-prevent-wraparound autovacuum
worker disable index cleanup (i.e., setting index_cleanup to
VACOPT_TERNARY_DISABLED). A downside (though not a common case) is
that, since a to-prevent-wraparound vacuum is not necessarily an
aggressive vacuum, it could skip index cleanup even though it cannot
move relfrozenxid forward.
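
A minimal sketch of that idea (hypothetical, not the PoC itself) could live in
vacuum_rel(), something like:

    /* Hypothetical: anti-wraparound autovacuum skips index cleanup by default */
    if (params->is_wraparound &&
        params->index_cleanup == VACOPT_TERNARY_DEFAULT)
        params->index_cleanup = VACOPT_TERNARY_DISABLED;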

It would be nice to have a basic PoC for that, even if it remains a
PoC for the foreseeable future (i.e. even if it cannot be committed to
Postgres 14). This feature should definitely be something that your
patch series *enables*. I'd feel good about having covered that
question as part of this basic design work if there was a PoC. That
alone should make it 100% clear that it's easy to do (or no harder
than it should be -- it should ideally be compatible with your basic
design). The exact criteria that we use for deciding whether or not to
skip index cleanup (which probably should not just be "this VACUUM is
is_wraparound, good enough" in the final version) may need to be
debated at length on pgsql-hackers. Even still, it is "just a detail"
in the code. Whereas being *able* to do that with your design (now or
in the future) seems essential now.

Agreed. I'll write a PoC patch for that.

* Store the answers to amvacuumstrategy into either local memory or
DSM (in the parallel vacuum case) so that ambulkdelete can refer to the
answer of amvacuumstrategy.
* Fix regression failures.
* Update the documentation and comments.

Note that 0003 patch is still PoC quality, lacking the btree meta page
version upgrade.

This patch is not the hard part, of course -- there really isn't that
much needed here compared to vacuumlazy.c. So this patch seems like
the simplest 1 out of the 3 (at least to me).

Some feedback on the third patch:

* The new btm_last_deletion_nblocks metapage field should use P_NONE
(which is 0) to indicate never having been vacuumed -- not
InvalidBlockNumber (which is 0xFFFFFFFF).

This is more idiomatic in nbtree, which is nice, but it has a very
significant practical advantage: it ensures that every heapkeyspace
nbtree index (i.e. those on recent nbtree versions) can be treated as
if it has the new btm_last_deletion_nblocks field all along, even when
it was actually built on Postgres 12 or 13. This trick will let you avoid
dealing with the headache of bumping BTREE_VERSION, which is a huge
advantage.

Note that this is the same trick I used to avoid bumping BTREE_VERSION
when the btm_allequalimage field needed to be added (for the nbtree
deduplication feature added to Postgres 13).

That's a nice way with a great advantage. I'll use P_NONE.
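
To make the trick concrete, a minimal sketch of the reader-side
convention it enables (the helper and the zero-fill reasoning are
assumptions spelled out here, not code from the patch): metapage bytes
beyond the old end of BTMetaPageData read back as zeroes on indexes
built before the field existed, so P_NONE (0) can double as "never
bulk-deleted" on any heapkeyspace index without a BTREE_VERSION bump.

#include "postgres.h"
#include "access/nbtree.h"

/*
 * Hypothetical helper, assuming the new btm_last_deletion_nblocks field:
 * on a metapage written before the field existed the bytes read back as
 * zero, so P_NONE means "never bulk-deleted" either way.
 */
static bool
bt_never_bulkdeleted(BTMetaPageData *metad)
{
	Assert(metad->btm_version >= BTREE_NOVAC_VERSION);

	return metad->btm_last_deletion_nblocks == P_NONE;
}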

* Forgot to do this in the third patch (think I made this same mistake
once myself):

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8bb180bbbe..88dfea9af3 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -653,7 +653,7 @@ bt_metap(PG_FUNCTION_ARGS)
BTMetaPageData *metad;
TupleDesc   tupleDesc;
int         j;
-    char       *values[9];
+    char       *values[10];
Buffer      buffer;
Page        page;
HeapTuple   tuple;
@@ -734,6 +734,11 @@ bt_metap(PG_FUNCTION_ARGS)

Check.

I'm updating and testing the patch. I'll submit the updated version
patches tomorrow.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#19Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#14)
3 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Jan 19, 2021 at 2:57 PM Peter Geoghegan <pg@bowt.ie> wrote:

Looks good. I'll give this version a review now. I will do a lot more
soon. I need to come up with a good benchmark for this, that I can
return to again and again during review as needed.

I performed another benchmark, similar to the last one but with the
latest version (v2), and over a much longer period. Attached is a
summary of the whole benchmark, and log_autovacuum output from the
logs of both the master branch and the patch.

This was pgbench scale 2000, 4 indexes on pgbench_accounts, and a
transaction with one update and two selects. Each run was 4 hours, and
we alternated between patch and master for each run, and alternated
between 16 and 32 clients. There were 8 four-hour runs in total, meaning
the entire set of runs took 8 * 4 hours = 32 hours (not including
initial load time and a few other small things like that). I used a
10k TPS rate limit, so TPS isn't interesting here. Latency is
interesting -- we see a nice improvement in latency (i.e. a reduction)
for the patch (see all.summary.out).

The benefits of the patch are clearly visible when I drill down and
look at the details. Each pgbench_accounts autovacuum VACUUM operation
can finish faster with the patch because it can often skip at least
some indexes (usually the PK, sometimes 3 out of 4 indexes total). But
it's more subtle than some might assume. We're skipping indexes that
VACUUM actually would have deleted *some* index tuples from, which is
very good. Bottom-up index deletion is usually lazy, and only
occasionally very eager, so you still have plenty of "floating
garbage" index tuples in most pages. And now we see VACUUM behave a
little more like bottom-up index deletion -- it is lazy when that is
appropriate (with indexes that really only have floating garbage that
is spread diffusely throughout the index structure), and eager when
that is appropriate (with indexes that get much more garbage).

The benefit is not really that we're avoiding doing I/O for index
vacuuming (though that is one of the smaller benefits here). The real
benefit is that VACUUM is not dirtying pages, since it skips indexes
when it would be "premature" to vacuum them from an efficiency point
of view. This is important because we know that Postgres throughput is
very often limited by page cleaning. Also, the "economics" of this new
behavior make perfect sense -- obviously it's more efficient to delay
garbage cleanup until the point when the same page will be modified by
a backend anyway -- in the case of this benchmark via bottom-up index
deletion (which deletes all garbage tuples in the leaf page at the
point that it runs for a subset of pointed-to heap pages -- it's not
using an oldestXmin cutoff from 30 minutes ago). So whenever we dirty
a page, we now get more value per additional-page-dirtied.

I believe that controlling the number of pages dirtied by VACUUM is
usually much more important than reducing the amount of read I/O from
VACUUM, for reasons I go into on the recent "vacuum_cost_page_miss
default value and modern hardware" thread. As a further consequence of
all this, VACUUM can go faster safely and sustainably (since the cost
limit is not affected so much by vacuum_cost_page_miss), which has its
own benefits (e.g. oldestXmin cutoff doesn't get so old towards the
end).

Another closely related huge improvement that we see here is that the
number of FPIs generated by VACUUM can be significantly reduced. This
cost is closely related to the cost of dirtying pages, but it's worth
mentioning separately. You'll see some of that in the log_autovacuum
log output I attached.

There is an archive with much more detailed information, including
dumps from most pg_stat_* views at key intervals. This has way more
information than anybody is likely to want:

https://drive.google.com/file/d/1OTiErELKRZmYnuJuczO2Tfcm1-cBYITd/view?usp=sharing

I did notice a problem, though. I now think that the criterion for
skipping an index vacuum in the third patch from the series is too
conservative, and that this led to an excessive number of index
vacuums with the patch. This is probably because there was a tiny
number of page splits in some of the indexes that were not really
supposed to grow. I believe that this was caused by ANALYZE running --
I think that it prevented bottom-up deletion from keeping a few of the
hottest pages from splitting (that can take 5 or 6 seconds) at a few
points over the 32 hour run. For example, the index named "tenner"
grew by 9 blocks, starting out at 230,701 and ending up at 230,710 (to
see this, extract the files from the archive and "diff
patch.r1c16.initial_pg_relation_size.out
patch.r2c32.after_pg_relation_size.out").

I now think that 0 blocks added is unnecessarily restrictive -- a
small tolerance still seems like a good idea, though (let's still be
somewhat conservative about it).

Maybe a better criterion would be for nbtree to always proceed with
index vacuuming when the index size is less than 2048 blocks (16MiB
with 8KiB BLCKSZ). If an index is larger than that, then compare the
last/old block count to the current block count (at the point that we
decide if index vacuuming is going to go ahead) by rounding up both
values to the next highest 2048 block increment. This formula is
pretty arbitrary, but probably works as well as most others. It's a
good iteration for the next version of the patch/further testing, at
least.
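
Spelled out, that would be something like the following sketch (names,
placement, and the exact skip condition are assumptions layered on the
description above):

#include "postgres.h"
#include "storage/block.h"

/* 16MiB worth of blocks with the default 8KiB BLCKSZ */
#define BULKDELETE_SKIP_BLOCK_QUANTUM	((BlockNumber) 2048)

static BlockNumber
round_up_to_quantum(BlockNumber nblocks)
{
	return ((nblocks + BULKDELETE_SKIP_BLOCK_QUANTUM - 1) /
			BULKDELETE_SKIP_BLOCK_QUANTUM) * BULKDELETE_SKIP_BLOCK_QUANTUM;
}

/*
 * Always bulk-delete small indexes; for larger ones, only treat growth
 * as significant once it crosses a 2048-block boundary.
 */
static bool
bt_bulkdelete_wanted(BlockNumber last_nblocks, BlockNumber cur_nblocks)
{
	if (cur_nblocks < BULKDELETE_SKIP_BLOCK_QUANTUM)
		return true;

	return round_up_to_quantum(cur_nblocks) > round_up_to_quantum(last_nblocks);
}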

BTW, it would be nice if there was more instrumentation, say in the
log output produced when log_autovacuum is on. That would make it
easier to run these benchmarks -- I could verify my understanding of
the work done for each particular av operation represented in the log.
Though the default log_autovacuum log output is quite informative, it
would be nice if the specifics were more obvious (maybe this could
just be for the review/testing, but it might become something for
users if it seems useful).

--
Peter Geoghegan

Attachments:

patch_vacuum_logs.txttext/plain; charset=US-ASCII; name=patch_vacuum_logs.txtDownload
all.summary.outapplication/octet-stream; name=all.summary.outDownload
master_vacuum_logs.txttext/plain; charset=US-ASCII; name=master_vacuum_logs.txtDownload
#20Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#18)
3 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Thu, Jan 21, 2021 at 11:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 20, 2021 at 7:58 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Jan 17, 2021 at 9:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

After more thought, I think that ambulkdelete needs to be able to
refer to the answer of amvacuumstrategy. That way, the index can skip
bulk-deletion when lazy vacuum doesn't vacuum the heap and the index
also doesn't want to do it.

Makes sense.

BTW, your patch has bitrot already. Peter E's recent pageinspect
commit happens to conflict with this patch. It might make sense to
produce a new version that just fixes the bitrot, so that other people
don't have to deal with it each time.

I’ve attached the updated version patch that includes the following changes:

Looks good. I'll give this version a review now. I will do a lot more
soon. I need to come up with a good benchmark for this, that I can
return to again and again during review as needed.

Thank you for reviewing the patches.

Some feedback on the first patch:

* Just so you know: I agree with you about handling
VACOPT_TERNARY_DISABLED in the index AM's amvacuumstrategy routine. I
think that it's better to do that there, even though this choice may
have some downsides.

* Can you add some "stub" sgml doc changes for this? Doesn't have to
be complete in any way. Just a placeholder for later, that has the
correct general "shape" to orientate the reader of the patch. It can
just be a FIXME comment, plus basic mechanical stuff -- details of the
new amvacuumstrategy_function routine and its signature.

The 0002 patch had the doc update (I mistakenly included it in the 0002
patch). Is that update what you meant?

Some feedback on the second patch:

* Why do you move around IndexVacuumStrategy in the second patch?
Looks like a rebasing oversight.

Check.

* Actually, do we really need the first and second patches to be
separate patches? I agree that the nbtree patch should be a separate
patch, but dividing the first two sets of changes doesn't seem like it
adds much. Did I miss some something?

I separated the patches since I used to have separate patches when
adding other index AM options required by parallel vacuum. But I
agreed to merge the first two patches.

* Is the "MAXALIGN(sizeof(ItemIdData)))" change in the definition of
MaxHeapTuplesPerPage appropriate? Here is the relevant section from
the patch:

diff --git a/src/include/access/htup_details.h
b/src/include/access/htup_details.h
index 7c62852e7f..038e7cd580 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -563,17 +563,18 @@ do { \
/*
* MaxHeapTuplesPerPage is an upper bound on the number of tuples that can
* fit on one heap page.  (Note that indexes could have more, because they
- * use a smaller tuple header.)  We arrive at the divisor because each tuple
- * must be maxaligned, and it must have an associated line pointer.
+ * use a smaller tuple header.)  We arrive at the divisor because each line
+ * pointer must be maxaligned.
*** SNIP ***
#define MaxHeapTuplesPerPage    \
-    ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
-            (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))
+    ((int) ((BLCKSZ - SizeOfPageHeaderData) / (MAXALIGN(sizeof(ItemIdData)))))

It's true that ItemIdData structs (line pointers) are aligned, but
they're not MAXALIGN()'d. If they were then the on-disk size of line
pointers would generally be 8 bytes, not 4 bytes.

You're right. Will fix.

* Maybe it would be better if you just changed the definition such
that "MAXALIGN(SizeofHeapTupleHeader)" became "MAXIMUM_ALIGNOF", with
no other changes? (Some variant of this suggestion might be better,
not sure.)

For some reason that feels a bit safer: we still have an "imaginary
tuple header", but it's just 1 MAXALIGN() quantum now. This is still
much less than the current 3 MAXALIGN() quantums (i.e. what
MaxHeapTuplesPerPage treats as the tuple header size). Do you think
that this alternative approach will be noticeably less effective
within vacuumlazy.c?

Note that you probably understand the issue with MaxHeapTuplesPerPage
for vacuumlazy.c better than I do currently. I'm still trying to
understand your choices, and to understand what is really important
here.

Yeah, using MAXIMUM_ALIGNOF seems better for safety. I shared my
thoughts on the issue with MaxHeapTuplesPerPage yesterday. I think we
need to discuss how to deal with that.

* Maybe add a #define for the value 0.7? (I refer to the value used in
choose_vacuum_strategy() to calculate a "this is the number of LP_DEAD
line pointers that we consider too many" cut off point, which is to be
applied throughout lazy_scan_heap() processing.)

Agreed.

* I notice that there seems to be only one place that calls your new
lazy_vacuum_table_and_indexes() function. I think that you should merge
it into its caller -- replace the only remaining call to
lazy_vacuum_table_and_indexes() with the body of the function itself.
Having a separate lazy_vacuum_table_and_indexes() function doesn't seem
useful to me -- it doesn't actually hide complexity, and might even be
harder to maintain.

lazy_vacuum_table_and_indexes() is called at two places: after
maintenance_work_mem runs out (around L1097) and at the end of
lazy_scan_heap() (around L1726). I defined this function to pack
together the operations of choosing a strategy, vacuuming indexes, and
vacuuming the heap. Without this function, wouldn't we end up writing
the same code twice in those places?

* I suggest thinking about what the last item will mean for the
reporting that currently takes place in
lazy_vacuum_table_and_indexes(), but would now go in its expanded
replacement -- how do we count the total number of index scans now?

I don't actually believe that the logic needs to change, but some kind
of consolidation and streamlining seems like it might be helpful.
Maybe just a comment that says "note that all index scans might just
be no-ops because..." -- stuff like that.

What do you mean by the last item, and which report? I think
lazy_vacuum_table_and_indexes() itself doesn't report anything, and
vacrelstats->num_index_scans counts the total number of index scans.

* Any idea about how hard it will be to teach is_wraparound VACUUMs to
skip index cleanup automatically, based on some practical/sensible
criteria?

One simple idea would be to have a to-prevent-wraparound autovacuum
worker disable index cleanup (i.e., set index_cleanup to
VACOPT_TERNARY_DISABLED). But a downside (though not a common case) is
that since a to-prevent-wraparound vacuum is not necessarily an
aggressive vacuum, it could skip index cleanup even though it cannot
move relfrozenxid forward.

It would be nice to have a basic PoC for that, even if it remains a
PoC for the foreseeable future (i.e. even if it cannot be committed to
Postgres 14). This feature should definitely be something that your
patch series *enables*. I'd feel good about having covered that
question as part of this basic design work if there was a PoC. That
alone should make it 100% clear that it's easy to do (or no harder
than it should be -- it should ideally be compatible with your basic
design). The exact criteria that we use for deciding whether or not to
skip index cleanup (which probably should not just be "this VACUUM is
is_wraparound, good enough" in the final version) may need to be
debated at length on pgsql-hackers. Even still, it is "just a detail"
in the code. Whereas being *able* to do that with your design (now or
in the future) seems essential now.

Agreed. I'll write a PoC patch for that.

* Store the answers to amvacuumstrategy into either local memory or
DSM (in the parallel vacuum case) so that ambulkdelete can refer to the
answer of amvacuumstrategy.
* Fix regression failures.
* Update the documentation and comments.

Note that 0003 patch is still PoC quality, lacking the btree meta page
version upgrade.

This patch is not the hard part, of course -- there really isn't that
much needed here compared to vacuumlazy.c. So this patch seems like
the simplest 1 out of the 3 (at least to me).

Some feedback on the third patch:

* The new btm_last_deletion_nblocks metapage field should use P_NONE
(which is 0) to indicate never having been vacuumed -- not
InvalidBlockNumber (which is 0xFFFFFFFF).

This is more idiomatic in nbtree, which is nice, but it has a very
significant practical advantage: it ensures that every heapkeyspace
nbtree index (i.e. those on recent nbtree versions) can be treated as
if it has the new btm_last_deletion_nblocks field all along, even when
it was actually built on Postgres 12 or 13. This trick will let you avoid
dealing with the headache of bumping BTREE_VERSION, which is a huge
advantage.

Note that this is the same trick I used to avoid bumping BTREE_VERSION
when the btm_allequalimage field needed to be added (for the nbtree
deduplication feature added to Postgres 13).

That's a nice way with a great advantage. I'll use P_NONE.

* Forgot to do this in the third patch (think I made this same mistake
once myself):

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8bb180bbbe..88dfea9af3 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -653,7 +653,7 @@ bt_metap(PG_FUNCTION_ARGS)
BTMetaPageData *metad;
TupleDesc   tupleDesc;
int         j;
-    char       *values[9];
+    char       *values[10];
Buffer      buffer;
Page        page;
HeapTuple   tuple;
@@ -734,6 +734,11 @@ bt_metap(PG_FUNCTION_ARGS)

Check.

I'm updating and testing the patch. I'll submit the updated version
patches tomorrow.

Sorry for the delay.

I've attached the updated version patches that incorporate the comments
I got so far.

I merged the previous 0001 and 0002 patches. The 0003 patch is now
another PoC patch that disables index cleanup automatically when an
autovacuum is both to prevent xid-wraparound and aggressive.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

v3-0002-Skip-btree-bulkdelete-if-the-index-doesn-t-grow.patchapplication/octet-stream; name=v3-0002-Skip-btree-bulkdelete-if-the-index-doesn-t-grow.patchDownload
From 08bdea5d66f5ae11f564b1c1d638518bcf110f60 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 5 Jan 2021 09:47:49 +0900
Subject: [PATCH v3 2/3] Skip btree bulkdelete if the index doesn't grow.

In amvacuumstrategy, btree indexes return INDEX_VACUUM_STRATEGY_NONE
if the index hasn't grown since the last bulk-deletion. To remember
that, this change adds a new field to the btree meta page that stores
the number of blocks at the time of the last bulkdelete.

No bump in BTREE_VERSION, since there are no changes to the on-disk
representation of nbtree indexes. The new field,
btm_last_deletion_nblocks, is P_NONE (0) if not set yet.
---
 contrib/pageinspect/btreefuncs.c              |  4 ++-
 contrib/pageinspect/expected/btree.out        |  1 +
 contrib/pageinspect/pageinspect--1.8--1.9.sql | 18 +++++++++++
 src/backend/access/nbtree/nbtpage.c           |  9 +++++-
 src/backend/access/nbtree/nbtree.c            | 31 ++++++++++++++++---
 src/backend/access/nbtree/nbtxlog.c           |  1 +
 src/backend/access/rmgrdesc/nbtdesc.c         |  5 +--
 src/include/access/nbtree.h                   |  3 ++
 src/include/access/nbtxlog.h                  |  1 +
 9 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8bb180bbbe..30b1892222 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -653,7 +653,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	BTMetaPageData *metad;
 	TupleDesc	tupleDesc;
 	int			j;
-	char	   *values[9];
+	char	   *values[10];
 	Buffer		buffer;
 	Page		page;
 	HeapTuple	tuple;
@@ -726,12 +726,14 @@ bt_metap(PG_FUNCTION_ARGS)
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
 		values[j++] = metad->btm_allequalimage ? "t" : "f";
+		values[j++] = psprintf(INT64_FORMAT, (int64) metad->btm_last_deletion_nblocks);
 	}
 	else
 	{
 		values[j++] = "0";
 		values[j++] = "-1";
 		values[j++] = "f";
+		values[j++] = "0";
 	}
 
 	tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index a7632be36a..ae1aea8a6f 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -13,6 +13,7 @@ fastlevel               | 0
 oldest_xact             | 0
 last_cleanup_num_tuples | -1
 allequalimage           | t
+last_deletion_nblocks   | 0
 
 SELECT * FROM bt_page_stats('test1_a_idx', -1);
 ERROR:  invalid block number
diff --git a/contrib/pageinspect/pageinspect--1.8--1.9.sql b/contrib/pageinspect/pageinspect--1.8--1.9.sql
index b4248d791f..63725f8522 100644
--- a/contrib/pageinspect/pageinspect--1.8--1.9.sql
+++ b/contrib/pageinspect/pageinspect--1.8--1.9.sql
@@ -116,3 +116,21 @@ CREATE FUNCTION brin_page_items(IN page bytea, IN index_oid regclass,
 RETURNS SETOF record
 AS 'MODULE_PATHNAME', 'brin_page_items'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_metap()
+--
+DROP FUNCTION bt_metap(text);
+CREATE FUNCTION bt_metap(IN relname text,
+    OUT magic int4,
+    OUT version int4,
+    OUT root int8,
+    OUT level int8,
+    OUT fastroot int8,
+    OUT fastlevel int8,
+    OUT oldest_xact xid,
+    OUT last_cleanup_num_tuples float8,
+    OUT allequalimage boolean,
+    OUT last_deletion_nblocks int8)
+AS 'MODULE_PATHNAME', 'bt_metap'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index e230f912c2..0a16e9db9b 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -82,6 +82,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	metad->btm_allequalimage = allequalimage;
+	metad->btm_last_deletion_nblocks = P_NONE;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -121,6 +122,7 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+
 	/* Only a REINDEX can set this field */
 	Assert(!metad->btm_allequalimage);
 	metad->btm_allequalimage = false;
@@ -185,17 +187,20 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	BlockNumber nblocks;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
+	nblocks = RelationGetNumberOfBlocks(rel);
 
 	/* outdated version of metapage always needs rewrite */
 	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		needsRewrite = true;
 	else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
-			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
+			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples ||
+			 metad->btm_last_deletion_nblocks != nblocks)
 		needsRewrite = true;
 
 	if (!needsRewrite)
@@ -217,6 +222,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	/* update cleanup-related information */
 	metad->btm_oldest_btpo_xact = oldestBtpoXact;
 	metad->btm_last_cleanup_num_heap_tuples = numHeapTuples;
+	metad->btm_last_deletion_nblocks = nblocks;
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
@@ -236,6 +242,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
 		md.allequalimage = metad->btm_allequalimage;
+		md.last_deletion_nblocks = metad->btm_last_deletion_nblocks;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e00e5fe0a4..0db6b6b632 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -878,16 +878,39 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 }
 
 /*
- * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
- * is specified to off.
+ * Choose the vacuum strategy. Do bulk-deletion or nothing
  */
 IndexVacuumStrategy
 btvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
 {
+	Buffer		metabuf;
+	Page		metapg;
+	BTMetaPageData *metad;
+	BlockNumber	nblocks;
+	IndexVacuumStrategy result = INDEX_VACUUM_STRATEGY_NONE;
+
+	/*
+	 * Don't want to do bulk-deletion if index cleanup is disabled
+	 * by the user request.
+	 */
 	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
 		return INDEX_VACUUM_STRATEGY_NONE;
-	else
-		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+
+	metabuf = _bt_getbuf(info->index, BTREE_METAPAGE, BT_READ);
+	metapg = BufferGetPage(metabuf);
+	metad = BTPageGetMeta(metapg);
+	nblocks = RelationGetNumberOfBlocks(info->index);
+
+	/*
+	 * Do deletion if the index has grown since the last deletion by
+	 * even one block, or if this is the first time.
+	 */
+	if (metad->btm_last_deletion_nblocks == P_NONE ||
+		nblocks > metad->btm_last_deletion_nblocks)
+		result = INDEX_VACUUM_STRATEGY_BULKDELETE;
+
+	_bt_relbuf(info->index, metabuf);
+	return result;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c1d578cc01..37546f566d 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -115,6 +115,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
 	md->btm_allequalimage = xlrec->allequalimage;
+	md->btm_last_deletion_nblocks = xlrec->last_deletion_nblocks;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 6e0d6a2b72..4e58b0bc07 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -110,9 +110,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 
 				xlrec = (xl_btree_metadata *) XLogRecGetBlockData(record, 0,
 																  NULL);
-				appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f",
+				appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f; last_deletion_nblocks: %u",
 								 xlrec->oldest_btpo_xact,
-								 xlrec->last_cleanup_num_heap_tuples);
+								 xlrec->last_cleanup_num_heap_tuples,
+								 xlrec->last_deletion_nblocks);
 				break;
 			}
 	}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ba120d4a80..35c6858573 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -110,6 +110,9 @@ typedef struct BTMetaPageData
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
 	bool		btm_allequalimage;	/* are all columns "equalimage"? */
+	BlockNumber	btm_last_deletion_nblocks;	/* number of blocks during last
+											 * bulk-deletion. P_NONE if not
+											 * set. */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 7ae5c98c2b..bc0c52a779 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -55,6 +55,7 @@ typedef struct xl_btree_metadata
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
 	bool		allequalimage;
+	BlockNumber last_deletion_nblocks;
 } xl_btree_metadata;
 
 /*
-- 
2.27.0

v3-0003-PoC-disable-index-cleanup-when-an-anti-wraparound.patchapplication/octet-stream; name=v3-0003-PoC-disable-index-cleanup-when-an-anti-wraparound.patchDownload
From a6359a3227d55cf68a9cfd137baad394137536c6 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jan 2021 16:20:37 +0900
Subject: [PATCH v3 3/3] PoC: disable index cleanup when an anti-wraparound and
 aggressive vacuum.

---
 src/backend/access/heap/vacuumlazy.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b99b7e51f4..8ed8a17ec2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -529,6 +529,23 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 		aggressive = true;
 
+	/*
+	 * If the vacuum is initiated to prevent xid-wraparound and is an aggressive
+	 * scan, we disable index cleanup to make freezing heap tuples and moving
+	 * relfrozenxid forward complete faster.
+	 *
+	 * Note that this applies only to autovacuums, as is_wraparound can be
+	 * true only in autovacuums.
+	 *
+	 * XXX: should we not disable index cleanup if vacuum_index_cleanup reloption
+	 * is on?
+	 */
+	if (aggressive && params->is_wraparound)
+	{
+		Assert(IsAutoVacuumWorkerProcess());
+		params->index_cleanup = VACOPT_TERNARY_DISABLED;
+	}
+
 	vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
 
 	vacrelstats->relnamespace = get_namespace_name(RelationGetNamespace(onerel));
-- 
2.27.0

v3-0001-Choose-vacuum-strategy-before-heap-and-index-vacu.patchapplication/octet-stream; name=v3-0001-Choose-vacuum-strategy-before-heap-and-index-vacu.patchDownload
From 8171d5a4b7e6a483f4213a0804cd52d429874f13 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 4 Jan 2021 13:34:10 +0900
Subject: [PATCH v3 1/3] Choose vacuum strategy before heap and index vacuums.

If the index_cleanup option is specified by neither the VACUUM command
nor the storage option, lazy vacuum asks each index for its vacuum
strategy before heap vacuum and decides whether or not to remove the
collected garbage tuples from the heap based on both the answers of
amvacuumstrategy, a new index AM API introduced in this commit, and how
many LP_DEAD items can be accumulated in the space of a heap page left
by fillfactor.

The decision made by lazy vacuum and the answer returned from
amvacuumstrategy are passed to ambulkdelete. Then each index can
choose whether or not to skip index bulk-deletion accordingly.
---
 contrib/bloom/bloom.h                         |   2 +
 contrib/bloom/blutils.c                       |   1 +
 contrib/bloom/blvacuum.c                      |  23 +-
 doc/src/sgml/indexam.sgml                     |  25 ++
 doc/src/sgml/ref/create_table.sgml            |  19 +-
 src/backend/access/brin/brin.c                |   8 +-
 src/backend/access/common/reloptions.c        |  40 +-
 src/backend/access/gin/ginpostinglist.c       |  30 +-
 src/backend/access/gin/ginutil.c              |   1 +
 src/backend/access/gin/ginvacuum.c            |  25 ++
 src/backend/access/gist/gist.c                |   1 +
 src/backend/access/gist/gistvacuum.c          |  28 +-
 src/backend/access/hash/hash.c                |  22 +
 src/backend/access/heap/vacuumlazy.c          | 376 ++++++++++++++----
 src/backend/access/index/indexam.c            |  22 +
 src/backend/access/nbtree/nbtree.c            |  34 ++
 src/backend/access/spgist/spgutils.c          |   1 +
 src/backend/access/spgist/spgvacuum.c         |  27 +-
 src/backend/catalog/index.c                   |   2 +
 src/backend/commands/analyze.c                |   1 +
 src/backend/commands/vacuum.c                 |  23 +-
 src/include/access/amapi.h                    |   7 +-
 src/include/access/genam.h                    |  36 +-
 src/include/access/gin_private.h              |   2 +
 src/include/access/gist_private.h             |   2 +
 src/include/access/hash.h                     |   2 +
 src/include/access/htup_details.h             |  21 +-
 src/include/access/nbtree.h                   |   2 +
 src/include/access/spgist.h                   |   2 +
 src/include/commands/vacuum.h                 |  20 +-
 src/include/utils/rel.h                       |  17 +-
 .../expected/test_ginpostinglist.out          |   6 +-
 32 files changed, 672 insertions(+), 156 deletions(-)

diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index a22a6dfa40..8395d31450 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -202,6 +202,8 @@ extern void blendscan(IndexScanDesc scan);
 extern IndexBuildResult *blbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void blbuildempty(Relation index);
+extern IndexVacuumStrategy blvacuumstrategy(IndexVacuumInfo *info,
+											struct VacuumParams *params);
 extern IndexBulkDeleteResult *blbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
 										   void *callback_state);
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 1e505b1da5..8098d75c82 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -131,6 +131,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = blbuild;
 	amroutine->ambuildempty = blbuildempty;
 	amroutine->aminsert = blinsert;
+	amroutine->amvacuumstrategy = blvacuumstrategy;
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
index 88b0a6d290..c356ec9e85 100644
--- a/contrib/bloom/blvacuum.c
+++ b/contrib/bloom/blvacuum.c
@@ -23,6 +23,19 @@
 #include "storage/lmgr.h"
 
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+blvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
@@ -45,6 +58,14 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	BloomMetaPageData *metaData;
 	GenericXLogState *gxlogState;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted and we want to skip it.
+	 */
+	if (!info->will_vacuum_heap &&
+		info->indvac_strategy == INDEX_VACUUM_STRATEGY_NONE)
+		return stats;
+
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
 
@@ -172,7 +193,7 @@ blvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	BlockNumber npages,
 				blkno;
 
-	if (info->analyze_only)
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	if (stats == NULL)
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index ec5741df6d..9f881303f6 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
     ambuild_function ambuild;
     ambuildempty_function ambuildempty;
     aminsert_function aminsert;
+    amvacuumstrategy_function amvacuumstrategy;
     ambulkdelete_function ambulkdelete;
     amvacuumcleanup_function amvacuumcleanup;
     amcanreturn_function amcanreturn;   /* can be NULL */
@@ -346,6 +347,30 @@ aminsert (Relation indexRelation,
 
   <para>
 <programlisting>
+IndexVacuumStrategy
+amvacuumstrategy (IndexVacuumInfo *info);
+</programlisting>
+   Tell <command>VACUUM</command> whether or not the index is willing to
+   delete index tuples.  This callback is called before
+   <function>ambulkdelete</function>.  Possible return values are
+   <literal>INDEX_VACUUM_STRATEGY_NONE</literal> and
+   <literal>INDEX_VACUUM_STRATEGY_BULKDELETE</literal>.  From the index's
+   point of view, if the index doesn't need to delete index tuples, it
+   must return <literal>INDEX_VACUUM_STRATEGY_NONE</literal>.  The returned
+   value can be referred to in <function>ambulkdelete</function> by checking
+   <literal>info-&gt;indvac_strategy</literal>.
+  </para>
+  <para>
+   <command>VACUUM</command> will decide whether or not to delete garbage tuples
+   from the heap based on these returned values from each index and several other
+   factors.  Therefore, if the index refers to heap TIDs and <command>VACUUM</command>
+   decides to delete garbage tuples from the heap, please note that the index must
+   also delete index tuples even if it returned
+   <literal>INDEX_VACUUM_STRATEGY_NONE</literal>.
+  </para>
+
+  <para>
+<programlisting>
 IndexBulkDeleteResult *
 ambulkdelete (IndexVacuumInfo *info,
               IndexBulkDeleteResult *stats,
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 569f4c9da7..c45cdcb292 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1434,20 +1434,23 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
    </varlistentry>
 
    <varlistentry id="reloption-vacuum-index-cleanup" xreflabel="vacuum_index_cleanup">
-    <term><literal>vacuum_index_cleanup</literal>, <literal>toast.vacuum_index_cleanup</literal> (<type>boolean</type>)
+    <term><literal>vacuum_index_cleanup</literal>, <literal>toast.vacuum_index_cleanup</literal> (<type>enum</type>)
     <indexterm>
      <primary><varname>vacuum_index_cleanup</varname> storage parameter</primary>
     </indexterm>
     </term>
     <listitem>
      <para>
-      Enables or disables index cleanup when <command>VACUUM</command> is
-      run on this table.  The default value is <literal>true</literal>.
-      Disabling index cleanup can speed up <command>VACUUM</command> very
-      significantly, but may also lead to severely bloated indexes if table
-      modifications are frequent.  The <literal>INDEX_CLEANUP</literal>
-      parameter of <link linkend="sql-vacuum"><command>VACUUM</command></link>, if specified, overrides
-      the value of this option.
+      Specifies the index cleanup behavior when <command>VACUUM</command> is
+      run on this table.  The default value is <literal>auto</literal>, which
+      determines whether to enable or disable index cleanup based on the indexes
+      and the heap.  With <literal>off</literal> index cleanup is disabled, with
+      <literal>on</literal> it is enabled. Disabling index cleanup can speed up
+      <command>VACUUM</command> very significantly, but may also lead to severely
+      bloated indexes if table modifications are frequent.  The
+      <literal>INDEX_CLEANUP</literal> parameter of
+      <link linkend="sql-vacuum"><command>VACUUM</command></link>, if specified,
+      overrides the value of this option.
      </para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 27ba596c6e..fb70234112 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -112,6 +112,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = brinbuild;
 	amroutine->ambuildempty = brinbuildempty;
 	amroutine->aminsert = brininsert;
+	amroutine->amvacuumstrategy = NULL;
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
@@ -800,8 +801,11 @@ brinvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
 	Relation	heapRel;
 
-	/* No-op in ANALYZE ONLY mode */
-	if (info->analyze_only)
+	/*
+	 * No-op in ANALYZE ONLY mode or when user requests to disable index
+	 * cleanup.
+	 */
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	if (!stats)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index c687d3ee9e..692455d617 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -27,6 +27,7 @@
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/tablespace.h"
+#include "commands/vacuum.h"
 #include "commands/view.h"
 #include "nodes/makefuncs.h"
 #include "postmaster/postmaster.h"
@@ -140,15 +141,6 @@ static relopt_bool boolRelOpts[] =
 		},
 		false
 	},
-	{
-		{
-			"vacuum_index_cleanup",
-			"Enables index vacuuming and index cleanup",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
-			ShareUpdateExclusiveLock
-		},
-		true
-	},
 	{
 		{
 			"vacuum_truncate",
@@ -492,6 +484,23 @@ relopt_enum_elt_def viewCheckOptValues[] =
 	{(const char *) NULL}		/* list terminator */
 };
 
+/*
+ * Values from VacOptTernaryValue for the index_cleanup option.
+ * Boolean spellings other than "on" and "off" are allowed for
+ * backward compatibility, as the option used to be a
+ * boolean.
+ */
+relopt_enum_elt_def vacOptTernaryOptValues[] =
+{
+	{"auto", VACOPT_TERNARY_DEFAULT},
+	{"true", VACOPT_TERNARY_ENABLED},
+	{"false", VACOPT_TERNARY_DISABLED},
+	{"on", VACOPT_TERNARY_ENABLED},
+	{"off", VACOPT_TERNARY_DISABLED},
+	{"1", VACOPT_TERNARY_ENABLED},
+	{"0", VACOPT_TERNARY_DISABLED}
+};
+
 static relopt_enum enumRelOpts[] =
 {
 	{
@@ -516,6 +525,17 @@ static relopt_enum enumRelOpts[] =
 		VIEW_OPTION_CHECK_OPTION_NOT_SET,
 		gettext_noop("Valid values are \"local\" and \"cascaded\".")
 	},
+	{
+		{
+			"vacuum_index_cleanup",
+			"Enables index vacuuming and index cleanup",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		vacOptTernaryOptValues,
+		VACOPT_TERNARY_DEFAULT,
+		gettext_noop("Valid values are \"on\", \"off\", and \"auto\".")
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1856,7 +1876,7 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, user_catalog_table)},
 		{"parallel_workers", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, parallel_workers)},
-		{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
+		{"vacuum_index_cleanup", RELOPT_TYPE_ENUM,
 		offsetof(StdRdOptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
 		offsetof(StdRdOptions, vacuum_truncate)}
diff --git a/src/backend/access/gin/ginpostinglist.c b/src/backend/access/gin/ginpostinglist.c
index 216b2b9a2c..0322a1736e 100644
--- a/src/backend/access/gin/ginpostinglist.c
+++ b/src/backend/access/gin/ginpostinglist.c
@@ -22,29 +22,29 @@
 
 /*
  * For encoding purposes, item pointers are represented as 64-bit unsigned
- * integers. The lowest 11 bits represent the offset number, and the next
- * lowest 32 bits are the block number. That leaves 21 bits unused, i.e.
- * only 43 low bits are used.
+ * integers. The lowest 12 bits represent the offset number, and the next
+ * lowest 32 bits are the block number. That leaves 20 bits unused, i.e.
+ * only 44 low bits are used.
  *
- * 11 bits is enough for the offset number, because MaxHeapTuplesPerPage <
- * 2^11 on all supported block sizes. We are frugal with the bits, because
+ * 12 bits is enough for the offset number, because MaxHeapTuplesPerPage <
+ * 2^12 on all supported block sizes. We are frugal with the bits, because
  * smaller integers use fewer bytes in the varbyte encoding, saving disk
  * space. (If we get a new table AM in the future that wants to use the full
  * range of possible offset numbers, we'll need to change this.)
  *
- * These 43-bit integers are encoded using varbyte encoding. In each byte,
+ * These 44-bit integers are encoded using varbyte encoding. In each byte,
  * the 7 low bits contain data, while the highest bit is a continuation bit.
  * When the continuation bit is set, the next byte is part of the same
- * integer, otherwise this is the last byte of this integer. 43 bits need
+ * integer, otherwise this is the last byte of this integer. 44 bits need
  * at most 7 bytes in this encoding:
  *
  * 0XXXXXXX
- * 1XXXXXXX 0XXXXYYY
- * 1XXXXXXX 1XXXXYYY 0YYYYYYY
- * 1XXXXXXX 1XXXXYYY 1YYYYYYY 0YYYYYYY
- * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
- * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
- * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0uuuuuuY
+ * 1XXXXXXX 0XXXXXYY
+ * 1XXXXXXX 1XXXXXYY 0YYYYYYY
+ * 1XXXXXXX 1XXXXXYY 1YYYYYYY 0YYYYYYY
+ * 1XXXXXXX 1XXXXXYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
+ * 1XXXXXXX 1XXXXXYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
+ * 1XXXXXXX 1XXXXXYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0uuuuuYY
  *
  * X = bits used for offset number
  * Y = bits used for block number
@@ -73,12 +73,12 @@
 
 /*
  * How many bits do you need to encode offset number? OffsetNumber is a 16-bit
- * integer, but you can't fit that many items on a page. 11 ought to be more
+ * integer, but you can't fit that many items on a page. 12 ought to be more
  * than enough. It's tempting to derive this from MaxHeapTuplesPerPage, and
  * use the minimum number of bits, but that would require changing the on-disk
  * format if MaxHeapTuplesPerPage changes. Better to leave some slack.
  */
-#define MaxHeapTuplesPerPageBits		11
+#define MaxHeapTuplesPerPageBits		12
 
 /* Max. number of bytes needed to encode the largest supported integer. */
 #define MaxBytesPerInteger				7
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf42..fc375332fc 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -63,6 +63,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = ginbuild;
 	amroutine->ambuildempty = ginbuildempty;
 	amroutine->aminsert = gininsert;
+	amroutine->amvacuumstrategy = ginvacuumstrategy;
 	amroutine->ambulkdelete = ginbulkdelete;
 	amroutine->amvacuumcleanup = ginvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 35b85a9bff..68bec5238a 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -560,6 +560,19 @@ ginVacuumEntryPage(GinVacuumState *gvs, Buffer buffer, BlockNumber *roots, uint3
 	return (tmppage == origpage) ? NULL : tmppage;
 }
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+ginvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 IndexBulkDeleteResult *
 ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			  IndexBulkDeleteCallback callback, void *callback_state)
@@ -571,6 +584,14 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	BlockNumber rootOfPostingTree[BLCKSZ / (sizeof(IndexTupleData) + sizeof(ItemId))];
 	uint32		nRoot;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted and we want to skip it.
+	 */
+	if (!info->will_vacuum_heap &&
+		info->indvac_strategy == INDEX_VACUUM_STRATEGY_NONE)
+		return stats;
+
 	gvs.tmpCxt = AllocSetContextCreate(CurrentMemoryContext,
 									   "Gin vacuum temporary context",
 									   ALLOCSET_DEFAULT_SIZES);
@@ -708,6 +729,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		return stats;
 	}
 
+	/* Skip index cleanup if user requests to disable */
+	if (!info->vacuumcleanup_requested)
+		return stats;
+
 	/*
 	 * Set up all-zero stats and cleanup pending inserts if ginbulkdelete
 	 * wasn't called
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index f203bb594c..cddcdd83be 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,6 +84,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = gistbuild;
 	amroutine->ambuildempty = gistbuildempty;
 	amroutine->aminsert = gistinsert;
+	amroutine->amvacuumstrategy = gistvacuumstrategy;
 	amroutine->ambulkdelete = gistbulkdelete;
 	amroutine->amvacuumcleanup = gistvacuumcleanup;
 	amroutine->amcanreturn = gistcanreturn;
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 94a7e12763..706454b2f0 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -52,6 +52,19 @@ static bool gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						   Buffer buffer, OffsetNumber downlink,
 						   Buffer leafBuffer);
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+gistvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * VACUUM bulkdelete stage: remove index entries.
  */
@@ -59,6 +72,14 @@ IndexBulkDeleteResult *
 gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted and we want to skip it.
+	 */
+	if (!info->will_vacuum_heap &&
+		info->indvac_strategy == INDEX_VACUUM_STRATEGY_NONE)
+		return stats;
+
 	/* allocate stats if first time through, else re-use existing struct */
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
@@ -74,8 +95,11 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
-	/* No-op in ANALYZE ONLY mode */
-	if (info->analyze_only)
+	/*
+	 * No-op in ANALYZE ONLY mode or when user requests to disable index
+	 * cleanup.
+	 */
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	/*
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a9..0449638cb3 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -81,6 +81,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = hashbuild;
 	amroutine->ambuildempty = hashbuildempty;
 	amroutine->aminsert = hashinsert;
+	amroutine->amvacuumstrategy = hashvacuumstrategy;
 	amroutine->ambulkdelete = hashbulkdelete;
 	amroutine->amvacuumcleanup = hashvacuumcleanup;
 	amroutine->amcanreturn = NULL;
@@ -444,6 +445,19 @@ hashendscan(IndexScanDesc scan)
 	scan->opaque = NULL;
 }
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+hashvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
@@ -469,6 +483,14 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted and we want to skip it.
+	 */
+	if (!info->will_vacuum_heap &&
+		info->indvac_strategy == INDEX_VACUUM_STRATEGY_NONE)
+		return stats;
+
 	tuples_removed = 0;
 	num_index_tuples = 0;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad..b99b7e51f4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -130,6 +130,15 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+/*
+ * Safety ratio of how many LP_DEAD items can be stored in a single heap
+ * page before it starts to overflow.  We're trying to avoid having VACUUM
+ * call lazy_vacuum_heap() in most cases, but we don't want to be too
+ * aggressive: it would be risky to make the value we test for much higher,
+ * since it might be too late by the time we actually call lazy_vacuum_heap().
+ */
+#define DEAD_ITEMS_ON_PAGE_LIMIT_SAFETY_RATIO	0.7
+
 /*
  * DSM keys for parallel vacuum.  Unlike other parallel execution code, since
  * we don't need to worry about DSM keys conflicting with plan_node_id we can
@@ -140,6 +149,7 @@
 #define PARALLEL_VACUUM_KEY_QUERY_TEXT		3
 #define PARALLEL_VACUUM_KEY_BUFFER_USAGE	4
 #define PARALLEL_VACUUM_KEY_WAL_USAGE		5
+#define PARALLEL_VACUUM_KEY_IND_STRATEGY	6
 
 /*
  * Macro to check if we are in a parallel vacuum.  If true, we are in the
@@ -214,6 +224,18 @@ typedef struct LVShared
 	double		reltuples;
 	bool		estimated_count;
 
+	/*
+	 * Copy of LVRelStats.vacuum_heap. It tells the index AM that lazy vacuum
+	 * will remove dead tuples from the heap after index vacuum.
+	 */
+	bool vacuum_heap;
+
+	/*
+	 * Copy of LVRelStats.indexcleanup_requested. It tells the index AM whether
+	 * amvacuumcleanup is requested or not.
+	 */
+	bool indexcleanup_requested;
+
 	/*
 	 * In single process lazy vacuum we could consume more memory during index
 	 * vacuuming or cleanup apart from the memory for heap scanning.  In
@@ -293,8 +315,8 @@ typedef struct LVRelStats
 {
 	char	   *relnamespace;
 	char	   *relname;
-	/* useindex = true means two-pass strategy; false means one-pass */
-	bool		useindex;
+	/* hasindex = true means two-pass strategy; false means one-pass */
+	bool		hasindex;
 	/* Overall statistics about rel */
 	BlockNumber old_rel_pages;	/* previous value of pg_class.relpages */
 	BlockNumber rel_pages;		/* total number of pages */
@@ -313,6 +335,15 @@ typedef struct LVRelStats
 	int			num_index_scans;
 	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
+	bool		vacuum_heap;	/* do we remove dead tuples from the heap? */
+	bool		indexcleanup_requested; /* INDEX_CLEANUP is not set to off */
+
+	/*
+	 * The array of index vacuum strategies for each index returned from
+	 * amvacuumstrategy. This is allocated in the DSM segment in parallel
+	 * mode and in local memory in non-parallel mode.
+	 */
+	IndexVacuumStrategy *ivstrategies;
 
 	/* Used for error callback */
 	char	   *indname;
@@ -320,6 +351,8 @@ typedef struct LVRelStats
 	OffsetNumber offnum;		/* used only for heap operations */
 	VacErrPhase phase;
 } LVRelStats;
+#define SizeOfIndVacStrategies(nindexes) \
+	(mul_size(sizeof(IndexVacuumStrategy), (nindexes)))
 
 /* Struct for saving and restoring vacuum error information. */
 typedef struct LVSavedErrInfo
@@ -343,6 +376,13 @@ static BufferAccessStrategy vac_strategy;
 static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 						   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
 						   bool aggressive);
+static void choose_vacuum_strategy(Relation onerel, LVRelStats *vacrelstats,
+								   VacuumParams *params, Relation *Irel,
+								   int nindexes, int ndeaditems);
+static void lazy_vacuum_table_and_indexes(Relation onerel, VacuumParams *params,
+										  LVRelStats *vacrelstats, Relation *Irel,
+										  int nindexes, IndexBulkDeleteResult **stats,
+										  LVParallelState *lps, int *maxdeadtups);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
@@ -351,7 +391,8 @@ static void lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 									LVRelStats *vacrelstats, LVParallelState *lps,
 									int nindexes);
 static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
-							  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats);
+							  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats,
+							  IndexVacuumStrategy ivstrat);
 static void lazy_cleanup_index(Relation indrel,
 							   IndexBulkDeleteResult **stats,
 							   double reltuples, bool estimated_count, LVRelStats *vacrelstats);
@@ -362,7 +403,8 @@ static bool should_attempt_truncation(VacuumParams *params,
 static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
 static BlockNumber count_nondeletable_pages(Relation onerel,
 											LVRelStats *vacrelstats);
-static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
+static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks,
+							 int nindexes);
 static void lazy_record_dead_tuple(LVDeadTuples *dead_tuples,
 								   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
@@ -381,7 +423,8 @@ static void vacuum_indexes_leader(Relation *Irel, IndexBulkDeleteResult **stats,
 								  int nindexes);
 static void vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
 							 LVShared *lvshared, LVSharedIndStats *shared_indstats,
-							 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats);
+							 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats,
+							 IndexVacuumStrategy ivstrat);
 static void lazy_cleanup_all_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
 									 LVRelStats *vacrelstats, LVParallelState *lps,
 									 int nindexes);
@@ -442,7 +485,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	ErrorContextCallback errcallback;
 
 	Assert(params != NULL);
-	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
 	/* not every AM requires these to be valid, but heap does */
@@ -501,8 +543,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 
 	/* Open all indexes of the relation */
 	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
-	vacrelstats->useindex = (nindexes > 0 &&
-							 params->index_cleanup == VACOPT_TERNARY_ENABLED);
+	vacrelstats->hasindex = (nindexes > 0);
 
 	/*
 	 * Setup error traceback support for ereport().  The idea is to set up an
@@ -763,6 +804,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber empty_pages,
 				vacuumed_pages,
 				next_fsm_block_to_vacuum;
+	int			maxdeadtups = 0;	/* maximum # of dead tuples in a single page */
 	double		num_tuples,		/* total number of nonremovable tuples */
 				live_tuples,	/* live tuples (reltuples estimate) */
 				tups_vacuumed,	/* tuples cleaned up by vacuum */
@@ -811,14 +853,24 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	vacrelstats->nonempty_pages = 0;
 	vacrelstats->latestRemovedXid = InvalidTransactionId;
 
+	/*
+	 * Index vacuum cleanup is enabled unless index cleanup is disabled,
+	 * i.e., it's true when INDEX_CLEANUP is either default or enabled.
+	 */
+	vacrelstats->indexcleanup_requested =
+		(params->index_cleanup != VACOPT_TERNARY_DISABLED);
+
 	vistest = GlobalVisTestFor(onerel);
 
 	/*
 	 * Initialize state for a parallel vacuum.  As of now, only one worker can
 	 * be used for an index, so we invoke parallelism only if there are at
-	 * least two indexes on a table.
+	 * least two indexes on a table. When index cleanup is disabled, index
+	 * bulk-deletion is likely to be a no-op, so we don't use a parallel
+	 * vacuum.
 	 */
-	if (params->nworkers >= 0 && vacrelstats->useindex && nindexes > 1)
+	if (params->nworkers >= 0 && nindexes > 1 &&
+		params->index_cleanup != VACOPT_TERNARY_DISABLED)
 	{
 		/*
 		 * Since parallel workers cannot access data in temporary tables, we
@@ -846,7 +898,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * initialized.
 	 */
 	if (!ParallelVacuumIsActive(lps))
-		lazy_space_alloc(vacrelstats, nblocks);
+		lazy_space_alloc(vacrelstats, nblocks, nindexes);
 
 	dead_tuples = vacrelstats->dead_tuples;
 	frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
@@ -1050,19 +1102,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				vmbuffer = InvalidBuffer;
 			}
 
-			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(onerel, Irel, indstats,
-									vacrelstats, lps, nindexes);
-
-			/* Remove tuples from heap */
-			lazy_vacuum_heap(onerel, vacrelstats);
-
-			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
-			 */
-			dead_tuples->num_tuples = 0;
+			/* Vacuum the table and its indexes */
+			lazy_vacuum_table_and_indexes(onerel, params, vacrelstats,
+										  Irel, nindexes, indstats,
+										  lps, &maxdeadtups);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
@@ -1512,32 +1555,16 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		/*
 		 * If there are no indexes we can vacuum the page right now instead of
-		 * doing a second scan. Also we don't do that but forget dead tuples
-		 * when index cleanup is disabled.
+		 * doing a second scan.
 		 */
-		if (!vacrelstats->useindex && dead_tuples->num_tuples > 0)
+		if (!vacrelstats->hasindex && dead_tuples->num_tuples > 0)
 		{
-			if (nindexes == 0)
-			{
-				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
-				vacuumed_pages++;
-				has_dead_tuples = false;
-			}
-			else
-			{
-				/*
-				 * Here, we have indexes but index cleanup is disabled.
-				 * Instead of vacuuming the dead tuples on the heap, we just
-				 * forget them.
-				 *
-				 * Note that vacrelstats->dead_tuples could have tuples which
-				 * became dead after HOT-pruning but are not marked dead yet.
-				 * We do not process them because it's a very rare condition,
-				 * and the next vacuum will process them anyway.
-				 */
-				Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
-			}
+			Assert(nindexes == 0);
+
+			/* Remove tuples from heap if the table has no index */
+			lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+			vacuumed_pages++;
+			has_dead_tuples = false;
 
 			/*
 			 * Forget the now-vacuumed tuples, and press on, but be careful
@@ -1663,6 +1690,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (dead_tuples->num_tuples == prev_dead_count)
 			RecordPageWithFreeSpace(onerel, blkno, freespace);
+		else
+			maxdeadtups = Max(maxdeadtups,
+							  dead_tuples->num_tuples - prev_dead_count);
 	}
 
 	/* report that everything is scanned and vacuumed */
@@ -1702,14 +1732,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/* If any tuples need to be deleted, perform final vacuum cycle */
 	/* XXX put a threshold on min number of tuples here? */
 	if (dead_tuples->num_tuples > 0)
-	{
-		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats,
-								lps, nindexes);
-
-		/* Remove tuples from heap */
-		lazy_vacuum_heap(onerel, vacrelstats);
-	}
+		lazy_vacuum_table_and_indexes(onerel, params, vacrelstats,
+									  Irel, nindexes, indstats,
+									  lps, &maxdeadtups);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
@@ -1722,7 +1747,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Do post-vacuum cleanup */
-	if (vacrelstats->useindex)
+	if (vacrelstats->hasindex)
 		lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, nindexes);
 
 	/*
@@ -1775,6 +1800,140 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	pfree(buf.data);
 }
 
+/*
+ * Remove the collected garbage tuples from the table and its indexes.
+ */
+static void
+lazy_vacuum_table_and_indexes(Relation onerel, VacuumParams *params,
+							  LVRelStats *vacrelstats, Relation *Irel,
+							  int nindexes, IndexBulkDeleteResult **indstats,
+							  LVParallelState *lps, int *maxdeadtups)
+{
+	/*
+	 * Choose the vacuum strategy for this vacuum cycle.
+	 * choose_vacuum_strategy() records the decision in
+	 * vacrelstats->vacuum_heap.
+	 */
+	choose_vacuum_strategy(onerel, vacrelstats, params, Irel, nindexes,
+						   *maxdeadtups);
+
+	/* Work on all the indexes, then the heap */
+	lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats, lps,
+							nindexes);
+
+	if (vacrelstats->vacuum_heap)
+	{
+		/* Remove tuples from heap */
+		lazy_vacuum_heap(onerel, vacrelstats);
+	}
+	else
+	{
+		/*
+		 * Here, we don't do heap vacuum in this cycle.
+		 * Here, we don't vacuum the heap in this cycle.
+		 * Note that vacrelstats->dead_tuples could have tuples which
+		 * became dead after HOT-pruning but are not marked dead yet.
+		 * We do not process them because it's a very rare condition,
+		 * and the next vacuum will process them anyway.
+		 */
+		Assert(params->index_cleanup != VACOPT_TERNARY_ENABLED);
+	}
+
+	/*
+	 * Forget the now-vacuumed tuples, and press on, but be careful
+	 * not to reset latestRemovedXid since we want that value to be
+	 * valid.
+	 */
+	vacrelstats->dead_tuples->num_tuples = 0;
+	*maxdeadtups = 0;
+}
+
+/*
+ * Decide whether or not we remove the collected garbage tuples from the
+ * heap. The decision is stored in vacrelstats->vacuum_heap. ndeaditems is
+ * the maximum number of LP_DEAD items on any one heap page encountered
+ * during the heap scan.
+ */
+static void
+choose_vacuum_strategy(Relation onerel, LVRelStats *vacrelstats,
+					   VacuumParams *params, Relation *Irel, int nindexes,
+					   int ndeaditems)
+{
+	bool vacuum_heap = true;
+	int i;
+
+	/*
+	 * Ask each index for its vacuum strategy, and save the answers. If even
+	 * one index returns 'none', we can skip heap vacuum in this cycle, at
+	 * least from the index strategies' point of view. That decision might be
+	 * overridden by other factors; see below.
+	 */
+	for (i = 0; i < nindexes; i++)
+	{
+		IndexVacuumInfo ivinfo;
+
+		ivinfo.index = Irel[i];
+		ivinfo.message_level = elevel;
+
+		/* Save the returned value */
+		vacrelstats->ivstrategies[i] = index_vacuum_strategy(&ivinfo, params);
+
+		if (vacrelstats->ivstrategies[i] == INDEX_VACUUM_STRATEGY_NONE)
+			vacuum_heap = false;
+	}
+
+	/* If the index cleanup option is specified, it overrides the decision */
+	if (params->index_cleanup == VACOPT_TERNARY_ENABLED)
+		vacuum_heap = true;
+	else if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		vacuum_heap = false;
+	else if (!vacuum_heap)
+	{
+		Size freespace = RelationGetTargetPageFreeSpace(onerel,
+														HEAP_DEFAULT_FILLFACTOR);
+		int ndeaditems_limit = (int) ((freespace / sizeof(ItemIdData)) *
+									  DEAD_ITEMS_ON_PAGE_LIMIT_SAFETY_RATIO);
+
+		/*
+		 * Check whether we need to delete the collected garbage from the heap,
+		 * from the heap's point of view.
+		 *
+		 * The ndeaditems_limit test applies to the maximum number of LP_DEAD
+		 * items on any one heap page encountered during the caller's heap scan.
+		 * The general idea here is to preserve the original pristine state of
+		 * the table when it is subject to constant non-HOT updates and the heap
+		 * fill factor has been reduced from its default.
+		 *
+		 * To calculate how many LP_DEAD line pointers can be stored in the
+		 * space of a heap page left by fillfactor, we need to consider two
+		 * aspects: the space left by fillfactor and the maximum number of
+		 * heap tuples per page, i.e., MaxHeapTuplesPerPage.  ndeaditems_limit
+		 * is calculated using the freespace left by fillfactor -- we can fit
+		 * (freespace / sizeof(ItemIdData)) LP_DEAD items on a heap page before
+		 * they start to "overflow" with that setting, from the perspective of
+		 * the space.  However, we cannot always store the calculated number of
+		 * LP_DEAD line pointers because of MaxHeapTuplesPerPage -- the total
+		 * number of line pointers in a heap page cannot exceed
+		 * MaxHeapTuplesPerPage. For example, with small tuples we can store
+		 * more tuples in a heap page, consuming more line pointers for heap
+		 * tuples. So leaving line pointers as LP_DEAD could consume line
+		 * pointers that are supposed to store heap tuples, resulting in an
+		 * overflow.
+		 *
+		 * The calculation below, however, considers only the former aspect,
+		 * the space, because (1) MaxHeapTuplesPerPage is defined with headroom
+		 * for accumulating a certain number of LP_DEAD line pointers, and
+		 * (2) it simplifies the calculation. Thanks to (1) we don't need to
+		 * consider the upper bound in most cases.
+		 */
+		if (ndeaditems > ndeaditems_limit)
+			vacuum_heap = true;
+	}
+
+	vacrelstats->vacuum_heap = vacuum_heap;
+}
+
+
 /*
  *	lazy_vacuum_all_indexes() -- vacuum all indexes of relation.
  *
@@ -1818,7 +1977,8 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 
 		for (idx = 0; idx < nindexes; idx++)
 			lazy_vacuum_index(Irel[idx], &stats[idx], vacrelstats->dead_tuples,
-							  vacrelstats->old_live_tuples, vacrelstats);
+							  vacrelstats->old_live_tuples, vacrelstats,
+							  vacrelstats->ivstrategies[idx]);
 	}
 
 	/* Increase and report the number of index scans */
@@ -1827,7 +1987,6 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 								 vacrelstats->num_index_scans);
 }
 
-
 /*
  *	lazy_vacuum_heap() -- second pass over the heap
  *
@@ -2092,7 +2251,7 @@ lazy_parallel_vacuum_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
 							 LVRelStats *vacrelstats, LVParallelState *lps,
 							 int nindexes)
 {
-	int			nworkers;
+	int			nworkers = 0;
 
 	Assert(!IsParallelWorker());
 	Assert(ParallelVacuumIsActive(lps));
@@ -2108,10 +2267,32 @@ lazy_parallel_vacuum_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
 			nworkers = lps->nindexes_parallel_cleanup;
 	}
 	else
-		nworkers = lps->nindexes_parallel_bulkdel;
+	{
+		if (vacrelstats->vacuum_heap)
+			nworkers = lps->nindexes_parallel_bulkdel;
+		else
+		{
+			int i;
+
+			/*
+			 * If we don't vacuum the heap, index bulk-deletion could be
+			 * skipped for some indexes. So we count how many indexes will do
+			 * index bulk-deletion based on their answers to amvacuumstrategy.
+			 */
+			for (i = 0; i < nindexes; i++)
+			{
+				uint8 vacoptions = Irel[i]->rd_indam->amparallelvacuumoptions;
+
+				if ((vacoptions & VACUUM_OPTION_PARALLEL_BULKDEL) != 0 &&
+					vacrelstats->ivstrategies[i] == INDEX_VACUUM_STRATEGY_BULKDELETE)
+					nworkers++;
+			}
+		}
+	}
 
 	/* The leader process will participate */
-	nworkers--;
+	if (nworkers > 0)
+		nworkers--;
 
 	/*
 	 * It is possible that parallel context is initialized with fewer workers
@@ -2120,6 +2301,10 @@ lazy_parallel_vacuum_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
 	 */
 	nworkers = Min(nworkers, lps->pcxt->nworkers);
 
+	/* Copy the information to the shared state */
+	lps->lvshared->vacuum_heap = vacrelstats->vacuum_heap;
+	lps->lvshared->indexcleanup_requested = vacrelstats->indexcleanup_requested;
+
 	/* Setup the shared cost-based vacuum delay and launch workers */
 	if (nworkers > 0)
 	{
@@ -2254,7 +2439,8 @@ parallel_vacuum_index(Relation *Irel, IndexBulkDeleteResult **stats,
 
 		/* Do vacuum or cleanup of the index */
 		vacuum_one_index(Irel[idx], &(stats[idx]), lvshared, shared_indstats,
-						 dead_tuples, vacrelstats);
+						 dead_tuples, vacrelstats,
+						 vacrelstats->ivstrategies[idx]);
 	}
 
 	/*
@@ -2295,7 +2481,7 @@ vacuum_indexes_leader(Relation *Irel, IndexBulkDeleteResult **stats,
 			skip_parallel_vacuum_index(Irel[i], lps->lvshared))
 			vacuum_one_index(Irel[i], &(stats[i]), lps->lvshared,
 							 shared_indstats, vacrelstats->dead_tuples,
-							 vacrelstats);
+							 vacrelstats, vacrelstats->ivstrategies[i]);
 	}
 
 	/*
@@ -2315,7 +2501,8 @@ vacuum_indexes_leader(Relation *Irel, IndexBulkDeleteResult **stats,
 static void
 vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
 				 LVShared *lvshared, LVSharedIndStats *shared_indstats,
-				 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats)
+				 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats,
+				 IndexVacuumStrategy ivstrat)
 {
 	IndexBulkDeleteResult *bulkdelete_res = NULL;
 
@@ -2338,7 +2525,7 @@ vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
 						   lvshared->estimated_count, vacrelstats);
 	else
 		lazy_vacuum_index(indrel, stats, dead_tuples,
-						  lvshared->reltuples, vacrelstats);
+						  lvshared->reltuples, vacrelstats, ivstrat);
 
 	/*
 	 * Copy the index bulk-deletion result returned from ambulkdelete and
@@ -2429,7 +2616,8 @@ lazy_cleanup_all_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
  */
 static void
 lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
-				  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats)
+				  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats,
+				  IndexVacuumStrategy ivstrat)
 {
 	IndexVacuumInfo ivinfo;
 	PGRUsage	ru0;
@@ -2443,7 +2631,9 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	ivinfo.estimated_count = true;
 	ivinfo.message_level = elevel;
 	ivinfo.num_heap_tuples = reltuples;
-	ivinfo.strategy = vac_strategy;
+	ivinfo.strategy = vac_strategy; /* buffer access strategy */
+	ivinfo.will_vacuum_heap = vacrelstats->vacuum_heap;
+	ivinfo.indvac_strategy = ivstrat; /* index vacuum strategy */
 
 	/*
 	 * Update error traceback information.
@@ -2461,11 +2651,17 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	*stats = index_bulk_delete(&ivinfo, *stats,
 							   lazy_tid_reaped, (void *) dead_tuples);
 
-	ereport(elevel,
-			(errmsg("scanned index \"%s\" to remove %d row versions",
-					vacrelstats->indname,
-					dead_tuples->num_tuples),
-			 errdetail_internal("%s", pg_rusage_show(&ru0))));
+	/*
+	 * Report the index bulk-deletion stats. If the index returned
+	 * statistics and we will vacuum the heap, we can assume it has
+	 * done the index bulk-deletion.
+	 */
+	if (*stats && vacrelstats->vacuum_heap)
+		ereport(elevel,
+				(errmsg("scanned index \"%s\" to remove %d row versions",
+						vacrelstats->indname,
+						dead_tuples->num_tuples),
+				 errdetail_internal("%s", pg_rusage_show(&ru0))));
 
 	/* Revert to the previous phase information for error traceback */
 	restore_vacuum_error_info(vacrelstats, &saved_err_info);
@@ -2498,6 +2694,7 @@ lazy_cleanup_index(Relation indrel,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vac_strategy;
+	ivinfo.vacuumcleanup_requested = vacrelstats->indexcleanup_requested;
 
 	/*
 	 * Update error traceback information.
@@ -2844,14 +3041,14 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
  * Return the maximum number of dead tuples we can record.
  */
 static long
-compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
+compute_max_dead_tuples(BlockNumber relblocks, bool hasindex)
 {
 	long		maxtuples;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
 	autovacuum_work_mem != -1 ?
 	autovacuum_work_mem : maintenance_work_mem;
 
-	if (useindex)
+	if (hasindex)
 	{
 		maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
 		maxtuples = Min(maxtuples, INT_MAX);
@@ -2876,18 +3073,21 @@ compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
  * See the comments at the head of this file for rationale.
  */
 static void
-lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
+lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks,
+				 int nindexes)
 {
 	LVDeadTuples *dead_tuples = NULL;
 	long		maxtuples;
 
-	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->useindex);
+	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->hasindex);
 
 	dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
 	dead_tuples->num_tuples = 0;
 	dead_tuples->max_tuples = (int) maxtuples;
 
 	vacrelstats->dead_tuples = dead_tuples;
+	vacrelstats->ivstrategies =
+		(IndexVacuumStrategy *) palloc0(SizeOfIndVacStrategies(nindexes));
 }
 
 /*
@@ -3223,10 +3423,12 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	LVDeadTuples *dead_tuples;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
+	IndexVacuumStrategy *ivstrats;
 	bool	   *can_parallel_vacuum;
 	long		maxtuples;
 	Size		est_shared;
 	Size		est_deadtuples;
+	Size		est_ivstrategies;
 	int			nindexes_mwm = 0;
 	int			parallel_workers = 0;
 	int			querylen;
@@ -3320,6 +3522,13 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 						   mul_size(sizeof(WalUsage), pcxt->nworkers));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+	/*
+	 * Estimate space for IndexVacuumStrategy -- PARALLEL_VACUUM_KEY_IND_STRATEGY.
+	 */
+	est_ivstrategies = MAXALIGN(SizeOfIndVacStrategies(nindexes));
+	shm_toc_estimate_chunk(&pcxt->estimator, est_ivstrategies);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 	/* Finally, estimate PARALLEL_VACUUM_KEY_QUERY_TEXT space */
 	if (debug_query_string)
 	{
@@ -3372,6 +3581,11 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_WAL_USAGE, wal_usage);
 	lps->wal_usage = wal_usage;
 
+	/* Allocate space for each index strategy */
+	ivstrats = shm_toc_allocate(pcxt->toc, est_ivstrategies);
+	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_IND_STRATEGY, ivstrats);
+	vacrelstats->ivstrategies = ivstrats;
+
 	/* Store query string for workers */
 	if (debug_query_string)
 	{
@@ -3507,6 +3721,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	Relation   *indrels;
 	LVShared   *lvshared;
 	LVDeadTuples *dead_tuples;
+	IndexVacuumStrategy *ivstrats;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
 	int			nindexes;
@@ -3548,6 +3763,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 												  PARALLEL_VACUUM_KEY_DEAD_TUPLES,
 												  false);
 
+	/* Set vacuum strategy space */
+	ivstrats = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_IND_STRATEGY, false);
+	vacrelstats.ivstrategies = ivstrats;
+
 	/* Set cost-based vacuum delay */
 	VacuumCostActive = (VacuumCostDelay > 0);
 	VacuumCostBalance = 0;
@@ -3573,6 +3792,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	vacrelstats.indname = NULL;
 	vacrelstats.phase = VACUUM_ERRCB_PHASE_UNKNOWN; /* Not yet processing */
 
+	vacrelstats.vacuum_heap = lvshared->vacuum_heap;
+	vacrelstats.indexcleanup_requested = lvshared->indexcleanup_requested;
+
 	/* Setup error traceback support for ereport() */
 	errcallback.callback = vacuum_error_callback;
 	errcallback.arg = &vacrelstats;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 3d2dbed708..171ba5c2fa 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -678,6 +678,28 @@ index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap)
 	return ntids;
 }
 
+/* ----------------
+ *		index_vacuum_strategy - ask the index for its vacuum strategy
+ *
+ * This callback routine is called just before vacuuming the heap.
+ * It returns an IndexVacuumStrategy value telling lazy vacuum whether
+ * the index wants to do index bulk-deletion.
+ * ----------------
+ */
+IndexVacuumStrategy
+index_vacuum_strategy(IndexVacuumInfo *info, struct VacuumParams *params)
+{
+	Relation	indexRelation = info->index;
+
+	RELATION_CHECKS;
+
+	/* amvacuumstrategy is optional; assume bulk-deletion */
+	if (indexRelation->rd_indam->amvacuumstrategy == NULL)
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+
+	return indexRelation->rd_indam->amvacuumstrategy(info, params);
+}
+
 /* ----------------
  *		index_bulk_delete - do mass deletion of index entries
  *
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 289bd3c15d..e00e5fe0a4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -133,6 +133,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = btbuild;
 	amroutine->ambuildempty = btbuildempty;
 	amroutine->aminsert = btinsert;
+	amroutine->amvacuumstrategy = btvacuumstrategy;
 	amroutine->ambulkdelete = btbulkdelete;
 	amroutine->amvacuumcleanup = btvacuumcleanup;
 	amroutine->amcanreturn = btcanreturn;
@@ -822,6 +823,18 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 		 */
 		result = true;
 	}
+	else if (!info->vacuumcleanup_requested)
+	{
+		/*
+		 * Skip cleanup if INDEX_CLEANUP is set to false, even if there might
+		 * be deleted pages that can be recycled. If INDEX_CLEANUP remains
+		 * disabled, recyclable pages could be left unrecycled past XID
+		 * wraparound. But in practice that's not so harmful, since such a
+		 * workload doesn't need to delete and recycle pages in any case, and
+		 * deletion of btree index pages is relatively rare.
+		 */
+		result = false;
+	}
 	else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
 			 GlobalVisCheckRemovableXid(NULL, metad->btm_oldest_btpo_xact))
 	{
@@ -864,6 +877,19 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	return result;
 }
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is set to off.
+ */
+IndexVacuumStrategy
+btvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
@@ -878,6 +904,14 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Relation	rel = info->index;
 	BTCycleId	cycleid;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted and this index also wants to skip bulk-deletion.
+	 */
+	if (!info->will_vacuum_heap &&
+		info->indvac_strategy == INDEX_VACUUM_STRATEGY_NONE)
+		return stats;
+
 	/* allocate stats if first time through, else re-use existing struct */
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index d8b1815061..7b2313590a 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -66,6 +66,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = spgbuild;
 	amroutine->ambuildempty = spgbuildempty;
 	amroutine->aminsert = spginsert;
+	amroutine->amvacuumstrategy = spgvacuumstrategy;
 	amroutine->ambulkdelete = spgbulkdelete;
 	amroutine->amvacuumcleanup = spgvacuumcleanup;
 	amroutine->amcanreturn = spgcanreturn;
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0d02a02222..f44043d94f 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -894,6 +894,19 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	bds->stats->pages_free = bds->stats->pages_deleted;
 }
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is set to off.
+ */
+IndexVacuumStrategy
+spgvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
@@ -907,6 +920,13 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 {
 	spgBulkDeleteState bds;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted.
+	 */
+	if (!info->will_vacuum_heap)
+		return NULL;
+
 	/* allocate stats if first time through, else re-use existing struct */
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
@@ -937,8 +957,11 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
 	spgBulkDeleteState bds;
 
-	/* No-op in ANALYZE ONLY mode */
-	if (info->analyze_only)
+	/*
+	 * No-op in ANALYZE ONLY mode or when the user requests disabling
+	 * index cleanup.
+	 */
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b8cd35e995..30b48d6ccb 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3401,6 +3401,8 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.will_vacuum_heap = true;
+	ivinfo.indvac_strategy = INDEX_VACUUM_STRATEGY_BULKDELETE;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7295cf0215..111addbd6c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -668,6 +668,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.vacuumcleanup_requested = true;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 462f9a0f82..4ab20b77e6 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1870,17 +1870,20 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	onerelid = onerel->rd_lockInfo.lockRelId;
 	LockRelationIdForSession(&onerelid, lmode);
 
-	/* Set index cleanup option based on reloptions if not yet */
-	if (params->index_cleanup == VACOPT_TERNARY_DEFAULT)
-	{
-		if (onerel->rd_options == NULL ||
-			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
-			params->index_cleanup = VACOPT_TERNARY_ENABLED;
-		else
-			params->index_cleanup = VACOPT_TERNARY_DISABLED;
-	}
+	/*
+	 * Set the index cleanup option if the vacuum_index_cleanup reloption is set.
+	 * Otherwise we leave it as 'default', which means we choose the vacuum
+	 * strategy based on the table and index status. See choose_vacuum_strategy().
+	 */
+	if (params->index_cleanup == VACOPT_TERNARY_DEFAULT &&
+		onerel->rd_options != NULL)
+		params->index_cleanup =
+			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup;
 
-	/* Set truncate option based on reloptions if not yet */
+	/*
+	 * Set the truncate option based on reloptions if not yet set. The truncate
+	 * option is true by default.
+	 */
 	if (params->truncate == VACOPT_TERNARY_DEFAULT)
 	{
 		if (onerel->rd_options == NULL ||
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index d357ebb559..548f2033a4 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -22,8 +22,9 @@
 struct PlannerInfo;
 struct IndexPath;
 
-/* Likewise, this file shouldn't depend on execnodes.h. */
+/* Likewise, this file shouldn't depend on execnodes.h and vacuum.h. */
 struct IndexInfo;
+struct VacuumParams;
 
 
 /*
@@ -112,6 +113,9 @@ typedef bool (*aminsert_function) (Relation indexRelation,
 								   IndexUniqueCheck checkUnique,
 								   bool indexUnchanged,
 								   struct IndexInfo *indexInfo);
+/* vacuum strategy */
+typedef IndexVacuumStrategy (*amvacuumstrategy_function) (IndexVacuumInfo *info,
+														  struct VacuumParams *params);
 
 /* bulk delete */
 typedef IndexBulkDeleteResult *(*ambulkdelete_function) (IndexVacuumInfo *info,
@@ -259,6 +263,7 @@ typedef struct IndexAmRoutine
 	ambuild_function ambuild;
 	ambuildempty_function ambuildempty;
 	aminsert_function aminsert;
+	amvacuumstrategy_function amvacuumstrategy;
 	ambulkdelete_function ambulkdelete;
 	amvacuumcleanup_function amvacuumcleanup;
 	amcanreturn_function amcanreturn;	/* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 0eab1508d3..f164ec1a54 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -21,8 +21,9 @@
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
 
-/* We don't want this file to depend on execnodes.h. */
+/* We don't want this file to depend on execnodes.h and vacuum.h. */
 struct IndexInfo;
+struct VacuumParams;
 
 /*
  * Struct for statistics returned by ambuild
@@ -33,8 +34,17 @@ typedef struct IndexBuildResult
 	double		index_tuples;	/* # of tuples inserted into index */
 } IndexBuildResult;
 
+/* Result value for amvacuumstrategy */
+typedef enum IndexVacuumStrategy
+{
+	INDEX_VACUUM_STRATEGY_NONE,			/* No-op, skip bulk-deletion in this
+										 * vacuum cycle */
+	INDEX_VACUUM_STRATEGY_BULKDELETE	/* Do ambulkdelete */
+} IndexVacuumStrategy;
+
 /*
- * Struct for input arguments passed to ambulkdelete and amvacuumcleanup
+ * Struct for input arguments passed to amvacuumstrategy, ambulkdelete
+ * and amvacuumcleanup
  *
  * num_heap_tuples is accurate only when estimated_count is false;
  * otherwise it's just an estimate (currently, the estimate is the
@@ -50,6 +60,26 @@ typedef struct IndexVacuumInfo
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
+
+	/*
+	 * True if lazy vacuum will delete the collected garbage tuples from the
+	 * heap.  If it's false, the index AM can safely skip index bulk-deletion.
+	 * This field is used only for ambulkdelete.
+	 */
+	bool		will_vacuum_heap;
+
+	/*
+	 * The answer returned by amvacuumstrategy before executing ambulkdelete.
+	 * This field is used only for ambulkdelete.
+	 */
+	IndexVacuumStrategy indvac_strategy;
+
+	/*
+	 * True if amvacuumcleanup is requested by lazy vacuum. If false, the index
+	 * AM can skip index cleanup; this happens when the INDEX_CLEANUP vacuum
+	 * option is set to false. This field is used only for amvacuumcleanup.
+	 */
+	bool		vacuumcleanup_requested;
 } IndexVacuumInfo;
 
 /*
@@ -174,6 +204,8 @@ extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 							   struct TupleTableSlot *slot);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
+extern IndexVacuumStrategy index_vacuum_strategy(IndexVacuumInfo *info,
+												 struct VacuumParams *params);
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
 												IndexBulkDeleteResult *stats,
 												IndexBulkDeleteCallback callback,
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 670a40b4be..5c48a48917 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -397,6 +397,8 @@ extern int64 gingetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void ginInitConsistentFunction(GinState *ginstate, GinScanKey key);
 
 /* ginvacuum.c */
+extern IndexVacuumStrategy ginvacuumstrategy(IndexVacuumInfo *info,
+											 struct VacuumParams *params);
 extern IndexBulkDeleteResult *ginbulkdelete(IndexVacuumInfo *info,
 											IndexBulkDeleteResult *stats,
 											IndexBulkDeleteCallback callback,
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 553d364e2d..303a18da4d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -533,6 +533,8 @@ extern void gistMakeUnionKey(GISTSTATE *giststate, int attno,
 extern XLogRecPtr gistGetFakeLSN(Relation rel);
 
 /* gistvacuum.c */
+extern IndexVacuumStrategy gistvacuumstrategy(IndexVacuumInfo *info,
+											  struct VacuumParams *params);
 extern IndexBulkDeleteResult *gistbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
 											 IndexBulkDeleteCallback callback,
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 1cce865be2..4c7e064708 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -372,6 +372,8 @@ extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
 extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					   ScanKey orderbys, int norderbys);
 extern void hashendscan(IndexScanDesc scan);
+extern IndexVacuumStrategy hashvacuumstrategy(IndexVacuumInfo *info,
+											  struct VacuumParams *params);
 extern IndexBulkDeleteResult *hashbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
 											 IndexBulkDeleteCallback callback,
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 7c62852e7f..9615194db6 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -563,17 +563,24 @@ do { \
 /*
  * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can
  * fit on one heap page.  (Note that indexes could have more, because they
- * use a smaller tuple header.)  We arrive at the divisor because each tuple
- * must be maxaligned, and it must have an associated line pointer.
+ * use a smaller tuple header.)
  *
- * Note: with HOT, there could theoretically be more line pointers (not actual
- * tuples) than this on a heap page.  However we constrain the number of line
- * pointers to this anyway, to avoid excessive line-pointer bloat and not
- * require increases in the size of work arrays.
+ * We used to constrain the number of line pointers to avoid excessive
+ * line-pointer bloat and to not require increases in the size of work arrays,
+ * calculating the limit from the aligned heap tuple header size. But now
+ * that the index vacuum strategy has entered the picture, accumulating
+ * LP_DEAD line pointers in a heap page has value for skipping index
+ * deletion. So we relaxed the limitation to allow a certain number of line
+ * pointers in a heap page that have no heap tuple, calculating the limit
+ * from 1 MAXALIGN() quantum instead of the aligned heap tuple header size
+ * of 3 MAXALIGN() quantums.
+ *
+ * Note that increasing this value also affects the TID bitmap. There is a
+ * risk of introducing a performance regression affecting bitmap scans.
  */
 #define MaxHeapTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
-			(MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))
+			(MAXIMUM_ALIGNOF + sizeof(ItemIdData))))
 
 /*
  * MaxAttrSize is a somewhat arbitrary upper limit on the declared size of
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index cad4f2bdeb..ba120d4a80 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1011,6 +1011,8 @@ extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
+extern IndexVacuumStrategy btvacuumstrategy(IndexVacuumInfo *info,
+											struct VacuumParams *params);
 extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats,
 										   IndexBulkDeleteCallback callback,
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index 2eb2f421a8..f591b21ef1 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -212,6 +212,8 @@ extern bool spggettuple(IndexScanDesc scan, ScanDirection dir);
 extern bool spgcanreturn(Relation index, int attno);
 
 /* spgvacuum.c */
+extern IndexVacuumStrategy spgvacuumstrategy(IndexVacuumInfo *info,
+											 struct VacuumParams *params);
 extern IndexBulkDeleteResult *spgbulkdelete(IndexVacuumInfo *info,
 											IndexBulkDeleteResult *stats,
 											IndexBulkDeleteCallback callback,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 191cbbd004..f2590c3b6e 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -21,6 +21,7 @@
 #include "parser/parse_node.h"
 #include "storage/buf.h"
 #include "storage/lock.h"
+#include "utils/rel.h"
 #include "utils/relcache.h"
 
 /*
@@ -184,19 +185,6 @@ typedef struct VacAttrStats
 #define VACOPT_SKIPTOAST 0x40	/* don't process the TOAST table, if any */
 #define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip any pages */
 
-/*
- * A ternary value used by vacuum parameters.
- *
- * DEFAULT value is used to determine the value based on other
- * configurations, e.g. reloptions.
- */
-typedef enum VacOptTernaryValue
-{
-	VACOPT_TERNARY_DEFAULT = 0,
-	VACOPT_TERNARY_DISABLED,
-	VACOPT_TERNARY_ENABLED,
-} VacOptTernaryValue;
-
 /*
  * Parameters customizing behavior of VACUUM and ANALYZE.
  *
@@ -216,8 +204,10 @@ typedef struct VacuumParams
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which  verbose logs are activated, -1
 									 * to use default */
-	VacOptTernaryValue index_cleanup;	/* Do index vacuum and cleanup,
-										 * default value depends on reloptions */
+	VacOptTernaryValue index_cleanup;	/* Do index vacuum and cleanup. In
+										 * default mode, it's decided based on
+										 * multiple factors. See
+										 * choose_vacuum_strategy. */
 	VacOptTernaryValue truncate;	/* Truncate empty pages at the end,
 									 * default value depends on reloptions */
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 10b63982c0..168dc5d466 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -295,6 +295,20 @@ typedef struct AutoVacOpts
 	float8		analyze_scale_factor;
 } AutoVacOpts;
 
+/*
+ * A ternary value used by vacuum parameters. This value is also used
+ * for VACUUM command options.
+ *
+ * DEFAULT value is used to determine the value based on other
+ * configurations, e.g. reloptions.
+ */
+typedef enum VacOptTernaryValue
+{
+	VACOPT_TERNARY_DEFAULT = 0,
+	VACOPT_TERNARY_DISABLED,
+	VACOPT_TERNARY_ENABLED,
+} VacOptTernaryValue;
+
 typedef struct StdRdOptions
 {
 	int32		vl_len_;		/* varlena header (do not touch directly!) */
@@ -304,7 +318,8 @@ typedef struct StdRdOptions
 	AutoVacOpts autovacuum;		/* autovacuum-related options */
 	bool		user_catalog_table; /* use as an additional catalog relation */
 	int			parallel_workers;	/* max number of parallel workers */
-	bool		vacuum_index_cleanup;	/* enables index vacuuming and cleanup */
+	VacOptTernaryValue	vacuum_index_cleanup;	/* enables index vacuuming
+												 * and cleanup */
 	bool		vacuum_truncate;	/* enables vacuum to truncate a relation */
 } StdRdOptions;
 
diff --git a/src/test/modules/test_ginpostinglist/expected/test_ginpostinglist.out b/src/test/modules/test_ginpostinglist/expected/test_ginpostinglist.out
index 4d0beaecea..8ad3e998e1 100644
--- a/src/test/modules/test_ginpostinglist/expected/test_ginpostinglist.out
+++ b/src/test/modules/test_ginpostinglist/expected/test_ginpostinglist.out
@@ -6,11 +6,11 @@ CREATE EXTENSION test_ginpostinglist;
 SELECT test_ginpostinglist();
 NOTICE:  testing with (0, 1), (0, 2), max 14 bytes
 NOTICE:  encoded 2 item pointers to 10 bytes
-NOTICE:  testing with (0, 1), (0, 291), max 14 bytes
+NOTICE:  testing with (0, 1), (0, 680), max 14 bytes
 NOTICE:  encoded 2 item pointers to 10 bytes
-NOTICE:  testing with (0, 1), (4294967294, 291), max 14 bytes
+NOTICE:  testing with (0, 1), (4294967294, 680), max 14 bytes
 NOTICE:  encoded 1 item pointers to 8 bytes
-NOTICE:  testing with (0, 1), (4294967294, 291), max 16 bytes
+NOTICE:  testing with (0, 1), (4294967294, 680), max 16 bytes
 NOTICE:  encoded 2 item pointers to 16 bytes
  test_ginpostinglist 
 ---------------------
-- 
2.27.0

#21Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Zhihong Yu (#13)
Re: New IndexAM API controlling index vacuum strategies

(Please avoid top-posting on the mailing lists[1]https://en.wikipedia.org/wiki/Posting_style#Top-posting: top-posting breaks
the logic of a thread.)

On Tue, Jan 19, 2021 at 12:02 AM Zhihong Yu <zyu@yugabyte.com> wrote:

Hi, Masahiko-san:

Thank you for reviewing the patch!

For v2-0001-Introduce-IndexAM-API-for-choosing-index-vacuum-s.patch :

For blvacuumstrategy():

+   if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+       return INDEX_VACUUM_STRATEGY_NONE;
+   else
+       return INDEX_VACUUM_STRATEGY_BULKDELETE;

The 'else' can be omitted.

Yes, but I'd prefer to leave it as it is: explicitly returning BULKDELETE
when index cleanup is enabled is more readable, and there is no performance
side effect.

Similar comment for ginvacuumstrategy().

For v2-0002-Choose-index-vacuum-strategy-based-on-amvacuumstr.patch :

If index_cleanup option is specified neither VACUUM command nor
storage option

I think this is what you meant (by not using passive voice):

If index_cleanup option specifies neither VACUUM command nor
storage option,

- * integer, but you can't fit that many items on a page. 11 ought to be more
+ * integer, but you can't fit that many items on a page. 13 ought to be more

It would be nice to add a note why the number of bits is increased.

I think that it might be better to mention such update history in the
commit log rather than in the source code, because most readers are
likely to be interested in why 12 bits for the offset number is enough
rather than why this value has been increased. In the source code
comment, we describe why 12 bits for the offset number is enough. We can
mention in the commit log that since the commit changes
MaxHeapTuplesPerPage, the encoding of the gin posting list is also changed.
What do you think?

For choose_vacuum_strategy():

+ IndexVacuumStrategy ivstrat;

The variable is only used inside the loop. You can use vacrelstats->ivstrategies[i] directly and omit the variable.

Fixed.

+ int ndeaditems_limit = (int) ((freespace / sizeof(ItemIdData)) * 0.7);

How was the factor of 0.7 determined ? Comment below only mentioned 'safety factor' but not how it was chosen.
I also wonder if this factor should be exposed as GUC.

Fixed.
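
To make the cutoff concrete, here is how the limit works out for one plausible
setting (a standalone sketch, not patch code; it assumes BLCKSZ = 8192, a
4-byte ItemIdData, a table with fillfactor = 90, and that the safety factor
stays at 0.7 behind the DEAD_ITEMS_ON_PAGE_LIMIT_SAFETY_RATIO constant):

#include <stdio.h>

int
main(void)
{
    int     blcksz = 8192;          /* assumed BLCKSZ */
    int     fillfactor = 90;        /* example table reloption */
    int     itemid_size = 4;        /* sizeof(ItemIdData) */
    double  safety_ratio = 0.7;     /* assumed DEAD_ITEMS_ON_PAGE_LIMIT_SAFETY_RATIO */

    /* RelationGetTargetPageFreeSpace() equivalent: 8192 * 10 / 100 = 819 */
    int     freespace = blcksz * (100 - fillfactor) / 100;

    /* same shape as choose_vacuum_strategy(): (819 / 4) * 0.7 = 204 * 0.7 -> 142 */
    int     ndeaditems_limit = (int) ((freespace / itemid_size) * safety_ratio);

    printf("ndeaditems_limit = %d\n", ndeaditems_limit);   /* prints 142 */
    return 0;
}

With the default fillfactor of 100 the computed freespace is 0, so in that
case any LP_DEAD item already exceeds the limit and forces heap vacuuming
from the heap's point of view.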

+ if (nworkers > 0)
+ nworkers--;

Should log / assert be added when nworkers is <= 0 ?

Hmm I don't think so. As far as I read the code, there is no
possibility that nworkers can be lower than 0 (we always increment it)
and actually, the code works fine even if nworkers is a negative
value.

+ * XXX: allowing to fill the heap page with only line pointer seems a overkill.

'a overkill' -> 'an overkill'

Fixed.

The above comments are incorporated into the latest patch I just posted[2]/messages/by-id/CAD21AoCS94vK1fs-_=R5J3tp2DsZPq9zdcUu2pk6fbr7BS7quA@mail.gmail.com.

[1]: https://en.wikipedia.org/wiki/Posting_style#Top-posting
[2]: /messages/by-id/CAD21AoCS94vK1fs-_=R5J3tp2DsZPq9zdcUu2pk6fbr7BS7quA@mail.gmail.com

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#22Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#20)
3 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Jan 25, 2021 at 5:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Jan 21, 2021 at 11:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Jan 20, 2021 at 7:58 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Jan 17, 2021 at 9:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

After more thought, I think that ambulkdelete needs to be able to
refer to the answer to amvacuumstrategy. That way, the index can skip
bulk-deletion when lazy vacuum doesn't vacuum the heap and the index
also doesn't want to do it.

Makes sense.

BTW, your patch has bitrot already. Peter E's recent pageinspect
commit happens to conflict with this patch. It might make sense to
produce a new version that just fixes the bitrot, so that other people
don't have to deal with it each time.

I’ve attached the updated version patch that includes the following changes:

Looks good. I'll give this version a review now. I will do a lot more
soon. I need to come up with a good benchmark for this, that I can
return to again and again during review as needed.

Thank you for reviewing the patches.

Some feedback on the first patch:

* Just so you know: I agree with you about handling
VACOPT_TERNARY_DISABLED in the index AM's amvacuumstrategy routine. I
think that it's better to do that there, even though this choice may
have some downsides.

* Can you add some "stub" sgml doc changes for this? Doesn't have to
be complete in any way. Just a placeholder for later, that has the
correct general "shape" to orientate the reader of the patch. It can
just be a FIXME comment, plus basic mechanical stuff -- details of the
new amvacuumstrategy_function routine and its signature.

0002 patch had the doc update (I mistakenly included it to 0002
patch). Is that update what you meant?

Some feedback on the second patch:

* Why do you move around IndexVacuumStrategy in the second patch?
Looks like a rebasing oversight.

Check.

* Actually, do we really need the first and second patches to be
separate patches? I agree that the nbtree patch should be a separate
patch, but dividing the first two sets of changes doesn't seem like it
adds much. Did I miss some something?

I separated the patches since I used to have separate patches when
adding other index AM options required by parallel vacuum. But I
agreed to merge the first two patches.

* Is the "MAXALIGN(sizeof(ItemIdData)))" change in the definition of
MaxHeapTuplesPerPage appropriate? Here is the relevant section from
the patch:

diff --git a/src/include/access/htup_details.h
b/src/include/access/htup_details.h
index 7c62852e7f..038e7cd580 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -563,17 +563,18 @@ do { \
/*
* MaxHeapTuplesPerPage is an upper bound on the number of tuples that can
* fit on one heap page.  (Note that indexes could have more, because they
- * use a smaller tuple header.)  We arrive at the divisor because each tuple
- * must be maxaligned, and it must have an associated line pointer.
+ * use a smaller tuple header.)  We arrive at the divisor because each line
+ * pointer must be maxaligned.
*** SNIP ***
#define MaxHeapTuplesPerPage    \
-    ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
-            (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))
+    ((int) ((BLCKSZ - SizeOfPageHeaderData) / (MAXALIGN(sizeof(ItemIdData)))))

It's true that ItemIdData structs (line pointers) are aligned, but
they're not MAXALIGN()'d. If they were then the on-disk size of line
pointers would generally be 8 bytes, not 4 bytes.

You're right. Will fix.

* Maybe it would be better if you just changed the definition such
that "MAXALIGN(SizeofHeapTupleHeader)" became "MAXIMUM_ALIGNOF", with
no other changes? (Some variant of this suggestion might be better,
not sure.)

For some reason that feels a bit safer: we still have an "imaginary
tuple header", but it's just 1 MAXALIGN() quantum now. This is still
much less than the current 3 MAXALIGN() quantums (i.e. what
MaxHeapTuplesPerPage treats as the tuple header size). Do you think
that this alternative approach will be noticeably less effective
within vacuumlazy.c?

Note that you probably understand the issue with MaxHeapTuplesPerPage
for vacuumlazy.c better than I do currently. I'm still trying to
understand your choices, and to understand what is really important
here.

Yeah, using MAXIMUM_ALIGNOF seems better for safety. I shared my
thoughts on the issue with MaxHeapTuplesPerPage yesterday. I think we
need to discuss how to deal with that.
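
To spell out what that change does to the ceiling, here is a standalone
arithmetic sketch (assuming BLCKSZ = 8192, MAXIMUM_ALIGNOF = 8, a 24-byte
page header, a 23-byte heap tuple header and a 4-byte line pointer; the
resulting 291 and 680 are the same offsets that change in the
test_ginpostinglist expected output):

#include <stdio.h>

#define ALIGN8(len)   (((len) + 7) & ~7)    /* MAXALIGN with MAXIMUM_ALIGNOF = 8 */

int
main(void)
{
    int blcksz = 8192;          /* assumed BLCKSZ */
    int page_header = 24;       /* SizeOfPageHeaderData */
    int tuple_header = 23;      /* SizeofHeapTupleHeader */
    int itemid_size = 4;        /* sizeof(ItemIdData) */

    /* current formula: (8192 - 24) / (MAXALIGN(23) + 4) = 8168 / 28 = 291 */
    int old_max = (blcksz - page_header) / (ALIGN8(tuple_header) + itemid_size);

    /* patched formula: (8192 - 24) / (MAXIMUM_ALIGNOF + 4) = 8168 / 12 = 680 */
    int new_max = (blcksz - page_header) / (8 + itemid_size);

    printf("MaxHeapTuplesPerPage: current = %d, patched = %d\n", old_max, new_max);
    return 0;
}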

* Maybe add a #define for the value 0.7? (I refer to the value used in
choose_vacuum_strategy() to calculate a "this is the number of LP_DEAD
line pointers that we consider too many" cut off point, which is to be
applied throughout lazy_scan_heap() processing.)

Agreed.

* I notice that your new lazy_vacuum_table_and_indexes() function is
the only place that calls lazy_vacuum_table_and_indexes(). I think
that you should merge them together -- replace the only remaining call
to lazy_vacuum_table_and_indexes() with the body of the function
itself. Having a separate lazy_vacuum_table_and_indexes() function
doesn't seem useful to me -- it doesn't actually hide complexity, and
might even be harder to maintain.

lazy_vacuum_table_and_indexes() is called at two places: after
maintenance_work_mem runs out (around L1097) and at the end of
lazy_scan_heap() (around L1726). I defined this function to pack
together the operations of choosing a strategy, vacuuming indexes and
vacuuming the heap. Without this function, wouldn't we end up writing
the same code twice there?

* I suggest thinking about what the last item will mean for the
reporting that currently takes place in
lazy_vacuum_table_and_indexes(), but will now go in an expanded
lazy_vacuum_table_and_indexes() -- how do we count the total number of
index scans now?

I don't actually believe that the logic needs to change, but some kind
of consolidation and streamlining seems like it might be helpful.
Maybe just a comment that says "note that all index scans might just
be no-ops because..." -- stuff like that.

What do you mean by the last item and what report? I think
lazy_vacuum_table_and_indexes() itself doesn't report anything and
vacrelstats->num_index_scan counts the total number of index scans.

* Any idea about how hard it will be to teach is_wraparound VACUUMs to
skip index cleanup automatically, based on some practical/sensible
criteria?

One simple idea would be to have a to-prevent-wraparound autovacuum
worker disable index cleanup (i.e., set index_cleanup to
VACOPT_TERNARY_DISABLED). But a downside (though not a common case) is
that, since a to-prevent-wraparound vacuum is not necessarily an
aggressive vacuum, it could skip index cleanup even though it cannot
move relfrozenxid forward.

It would be nice to have a basic PoC for that, even if it remains a
PoC for the foreseeable future (i.e. even if it cannot be committed to
Postgres 14). This feature should definitely be something that your
patch series *enables*. I'd feel good about having covered that
question as part of this basic design work if there was a PoC. That
alone should make it 100% clear that it's easy to do (or no harder
than it should be -- it should ideally be compatible with your basic
design). The exact criteria that we use for deciding whether or not to
skip index cleanup (which probably should not just be "this VACUUM is
is_wraparound, good enough" in the final version) may need to be
debated at length on pgsql-hackers. Even still, it is "just a detail"
in the code. Whereas being *able* to do that with your design (now or
in the future) seems essential now.

Agreed. I'll write a PoC patch for that.

* Store the answers to amvacuumstrategy into either the local memory
or DSM (in parallel vacuum case) so that ambulkdelete can refer the
answer to amvacuumstrategy.
* Fix regression failures.
* Update the documentation and commments.

Note that 0003 patch is still PoC quality, lacking the btree meta page
version upgrade.

This patch is not the hard part, of course -- there really isn't that
much needed here compared to vacuumlazy.c. So this patch seems like
the simplest 1 out of the 3 (at least to me).

Some feedback on the third patch:

* The new btm_last_deletion_nblocks metapage field should use P_NONE
(which is 0) to indicate never having been vacuumed -- not
InvalidBlockNumber (which is 0xFFFFFFFF).

This is more idiomatic in nbtree, which is nice, but it has a very
significant practical advantage: it ensures that every heapkeyspace
nbtree index (i.e. those on recent nbtree versions) can be treated as
if it has the new btm_last_deletion_nblocks field all along, even when
it actually built on Postgres 12 or 13. This trick will let you avoid
dealing with the headache of bumping BTREE_VERSION, which is a huge
advantage.

Note that this is the same trick I used to avoid bumping BTREE_VERSION
when the btm_allequalimage field needed to be added (for the nbtree
deduplication feature added to Postgres 13).

That's a nice way with a great advantage. I'll use P_NONE.
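
Just to sketch the decision that field enables (this is not the actual patch
code: the metapage read is elided, the helper name is made up, and 0 / P_NONE
stands for "never bulk-deleted"):

/*
 * Hypothetical sketch only; the real code reads btm_last_deletion_nblocks
 * from the metapage and also honors INDEX_CLEANUP as in btvacuumstrategy.
 */
static IndexVacuumStrategy
bt_strategy_sketch(BlockNumber current_nblocks, BlockNumber last_deletion_nblocks)
{
    /* Index hasn't grown since the last bulk-deletion: nothing urgent to do */
    if (last_deletion_nblocks != 0 &&       /* 0 == P_NONE, never recorded */
        current_nblocks <= last_deletion_nblocks)
        return INDEX_VACUUM_STRATEGY_NONE;

    return INDEX_VACUUM_STRATEGY_BULKDELETE;
}

ambulkdelete can then skip its scan when this returns
INDEX_VACUUM_STRATEGY_NONE and lazy vacuum is not going to vacuum the heap.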

* Forgot to do this in the third patch (think I made this same mistake
once myself):

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8bb180bbbe..88dfea9af3 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -653,7 +653,7 @@ bt_metap(PG_FUNCTION_ARGS)
BTMetaPageData *metad;
TupleDesc   tupleDesc;
int         j;
-    char       *values[9];
+    char       *values[10];
Buffer      buffer;
Page        page;
HeapTuple   tuple;
@@ -734,6 +734,11 @@ bt_metap(PG_FUNCTION_ARGS)

Check.

I'm updating and testing the patch. I'll submit the updated version
patches tomorrow.

Sorry for the delay.

I've attached the updated version patch that incorporates the comments
I got so far.

I merged the previous 0001 and 0002 patches. The 0003 patch is now another
PoC patch that disables index cleanup automatically when an autovacuum is
both to prevent xid-wraparound and an aggressive vacuum.

Since I found some bugs in the v3 patch, I've attached updated version patches.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

v4-0003-PoC-disable-index-cleanup-when-an-anti-wraparound.patchapplication/octet-stream; name=v4-0003-PoC-disable-index-cleanup-when-an-anti-wraparound.patchDownload
From 3f4ab89c338ad7be4b09b407ff8541f642d7b1ff Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 25 Jan 2021 16:20:37 +0900
Subject: [PATCH v4 3/3] PoC: disable index cleanup when an anti-wraparound and
 aggressive vacuum.

---
 src/backend/access/heap/vacuumlazy.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 079359951e..ea074ad11d 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -530,6 +530,23 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 		aggressive = true;
 
+	/*
+	 * If the vacuum is initiated to prevent xid wraparound and is an
+	 * aggressive scan, we disable index cleanup so that freezing heap
+	 * tuples and advancing relfrozenxid complete faster.
+	 *
+	 * Note that this applies only to autovacuums, as is_wraparound can be
+	 * true only in autovacuums.
+	 *
+	 * XXX: should we keep index cleanup enabled if the vacuum_index_cleanup
+	 * reloption is on?
+	 */
+	if (aggressive && params->is_wraparound)
+	{
+		Assert(IsAutoVacuumWorkerProcess());
+		params->index_cleanup = VACOPT_TERNARY_DISABLED;
+	}
+
 	vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
 
 	vacrelstats->relnamespace = get_namespace_name(RelationGetNamespace(onerel));
-- 
2.27.0

v4-0002-Skip-btree-bulkdelete-if-the-index-doesn-t-grow.patchapplication/octet-stream; name=v4-0002-Skip-btree-bulkdelete-if-the-index-doesn-t-grow.patchDownload
From dbf0ae552ce0907281951b10deed7a1e90f90fe9 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 5 Jan 2021 09:47:49 +0900
Subject: [PATCH v4 2/3] Skip btree bulkdelete if the index doesn't grow.

In amvacuumstrategy, btree indexes return INDEX_VACUUM_STRATEGY_NONE
if the index hasn't grown since the last bulk-deletion. To remember
that, this change adds a new field to the btree meta page that stores
the number of blocks at the time of the last bulk-deletion.

No bump in BTREE_VERSION, since there are no changes to the on-disk
representation of nbtree indexes. The new field,
btm_last_deletion_nblocks, reads as P_NONE (0) if it has never been set.
---
 contrib/pageinspect/btreefuncs.c              |  4 ++-
 contrib/pageinspect/expected/btree.out        |  1 +
 contrib/pageinspect/pageinspect--1.8--1.9.sql | 18 +++++++++++
 src/backend/access/nbtree/nbtpage.c           |  9 +++++-
 src/backend/access/nbtree/nbtree.c            | 31 ++++++++++++++++---
 src/backend/access/nbtree/nbtxlog.c           |  1 +
 src/backend/access/rmgrdesc/nbtdesc.c         |  5 +--
 src/include/access/nbtree.h                   |  3 ++
 src/include/access/nbtxlog.h                  |  1 +
 9 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8bb180bbbe..30b1892222 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -653,7 +653,7 @@ bt_metap(PG_FUNCTION_ARGS)
 	BTMetaPageData *metad;
 	TupleDesc	tupleDesc;
 	int			j;
-	char	   *values[9];
+	char	   *values[10];
 	Buffer		buffer;
 	Page		page;
 	HeapTuple	tuple;
@@ -726,12 +726,14 @@ bt_metap(PG_FUNCTION_ARGS)
 		values[j++] = psprintf("%u", metad->btm_oldest_btpo_xact);
 		values[j++] = psprintf("%f", metad->btm_last_cleanup_num_heap_tuples);
 		values[j++] = metad->btm_allequalimage ? "t" : "f";
+		values[j++] = psprintf(INT64_FORMAT, (int64) metad->btm_last_deletion_nblocks);
 	}
 	else
 	{
 		values[j++] = "0";
 		values[j++] = "-1";
 		values[j++] = "f";
+		values[j++] = "0";
 	}
 
 	tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index a7632be36a..ae1aea8a6f 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -13,6 +13,7 @@ fastlevel               | 0
 oldest_xact             | 0
 last_cleanup_num_tuples | -1
 allequalimage           | t
+last_deletion_nblocks   | 0
 
 SELECT * FROM bt_page_stats('test1_a_idx', -1);
 ERROR:  invalid block number
diff --git a/contrib/pageinspect/pageinspect--1.8--1.9.sql b/contrib/pageinspect/pageinspect--1.8--1.9.sql
index b4248d791f..63725f8522 100644
--- a/contrib/pageinspect/pageinspect--1.8--1.9.sql
+++ b/contrib/pageinspect/pageinspect--1.8--1.9.sql
@@ -116,3 +116,21 @@ CREATE FUNCTION brin_page_items(IN page bytea, IN index_oid regclass,
 RETURNS SETOF record
 AS 'MODULE_PATHNAME', 'brin_page_items'
 LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_metap()
+--
+DROP FUNCTION bt_metap(text);
+CREATE FUNCTION bt_metap(IN relname text,
+    OUT magic int4,
+    OUT version int4,
+    OUT root int8,
+    OUT level int8,
+    OUT fastroot int8,
+    OUT fastlevel int8,
+    OUT oldest_xact xid,
+    OUT last_cleanup_num_tuples float8,
+    OUT allequalimage boolean,
+    OUT last_deletion_nblocks int8)
+AS 'MODULE_PATHNAME', 'bt_metap'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index e230f912c2..0a16e9db9b 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -82,6 +82,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
 	metad->btm_allequalimage = allequalimage;
+	metad->btm_last_deletion_nblocks = P_NONE;
 
 	metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	metaopaque->btpo_flags = BTP_META;
@@ -121,6 +122,7 @@ _bt_upgrademetapage(Page page)
 	metad->btm_version = BTREE_NOVAC_VERSION;
 	metad->btm_oldest_btpo_xact = InvalidTransactionId;
 	metad->btm_last_cleanup_num_heap_tuples = -1.0;
+
 	/* Only a REINDEX can set this field */
 	Assert(!metad->btm_allequalimage);
 	metad->btm_allequalimage = false;
@@ -185,17 +187,20 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	BTMetaPageData *metad;
 	bool		needsRewrite = false;
 	XLogRecPtr	recptr;
+	BlockNumber nblocks;
 
 	/* read the metapage and check if it needs rewrite */
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
 	metapg = BufferGetPage(metabuf);
 	metad = BTPageGetMeta(metapg);
+	nblocks = RelationGetNumberOfBlocks(rel);
 
 	/* outdated version of metapage always needs rewrite */
 	if (metad->btm_version < BTREE_NOVAC_VERSION)
 		needsRewrite = true;
 	else if (metad->btm_oldest_btpo_xact != oldestBtpoXact ||
-			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples)
+			 metad->btm_last_cleanup_num_heap_tuples != numHeapTuples ||
+			 metad->btm_last_deletion_nblocks != nblocks)
 		needsRewrite = true;
 
 	if (!needsRewrite)
@@ -217,6 +222,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 	/* update cleanup-related information */
 	metad->btm_oldest_btpo_xact = oldestBtpoXact;
 	metad->btm_last_cleanup_num_heap_tuples = numHeapTuples;
+	metad->btm_last_deletion_nblocks = nblocks;
 	MarkBufferDirty(metabuf);
 
 	/* write wal record if needed */
@@ -236,6 +242,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
 		md.oldest_btpo_xact = oldestBtpoXact;
 		md.last_cleanup_num_heap_tuples = numHeapTuples;
 		md.allequalimage = metad->btm_allequalimage;
+		md.last_deletion_nblocks = metad->btm_last_deletion_nblocks;
 
 		XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e00e5fe0a4..e8e7bd76e1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -878,16 +878,39 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 }
 
 /*
- * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
- * is specified to off.
+ * Choose the vacuum strategy. Do bulk-deletion or nothing
  */
 IndexVacuumStrategy
 btvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
 {
+	Buffer		metabuf;
+	Page		metapg;
+	BTMetaPageData *metad;
+	BlockNumber	nblocks;
+	IndexVacuumStrategy result = INDEX_VACUUM_STRATEGY_NONE;
+
+	/*
+	 * Don't do bulk-deletion if index cleanup has been disabled
+	 * by user request.
+	 */
 	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
 		return INDEX_VACUUM_STRATEGY_NONE;
-	else
-		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+
+	metabuf = _bt_getbuf(info->index, BTREE_METAPAGE, BT_READ);
+	metapg = BufferGetPage(metabuf);
+	metad = BTPageGetMeta(metapg);
+	nblocks = RelationGetNumberOfBlocks(info->index);
+
+	/*
+	 * Do bulk-deletion if the index has grown by even one block since
+	 * the last deletion, or if it has never been bulk-deleted.
+	 */
+	if (metad->btm_last_deletion_nblocks == P_NONE ||
+		nblocks > metad->btm_last_deletion_nblocks)
+		result = INDEX_VACUUM_STRATEGY_BULKDELETE;
+
+	_bt_relbuf(info->index, metabuf);
+	return result;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c1d578cc01..37546f566d 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -115,6 +115,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
 	md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
 	md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
 	md->btm_allequalimage = xlrec->allequalimage;
+	md->btm_last_deletion_nblocks = xlrec->last_deletion_nblocks;
 
 	pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
 	pageop->btpo_flags = BTP_META;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 6e0d6a2b72..4e58b0bc07 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -110,9 +110,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 
 				xlrec = (xl_btree_metadata *) XLogRecGetBlockData(record, 0,
 																  NULL);
-				appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f",
+				appendStringInfo(buf, "oldest_btpo_xact %u; last_cleanup_num_heap_tuples: %f; last_deletion_nblocks: %u",
 								 xlrec->oldest_btpo_xact,
-								 xlrec->last_cleanup_num_heap_tuples);
+								 xlrec->last_cleanup_num_heap_tuples,
+								 xlrec->last_deletion_nblocks);
 				break;
 			}
 	}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ba120d4a80..35c6858573 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -110,6 +110,9 @@ typedef struct BTMetaPageData
 	float8		btm_last_cleanup_num_heap_tuples;	/* number of heap tuples
 													 * during last cleanup */
 	bool		btm_allequalimage;	/* are all columns "equalimage"? */
+	BlockNumber	btm_last_deletion_nblocks;	/* number of blocks during last
+											 * bulk-deletion. P_NONE if not
+											 * set. */
 } BTMetaPageData;
 
 #define BTPageGetMeta(p) \
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 7ae5c98c2b..bc0c52a779 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -55,6 +55,7 @@ typedef struct xl_btree_metadata
 	TransactionId oldest_btpo_xact;
 	float8		last_cleanup_num_heap_tuples;
 	bool		allequalimage;
+	BlockNumber last_deletion_nblocks;
 } xl_btree_metadata;
 
 /*
-- 
2.27.0

v4-0001-Choose-vacuum-strategy-before-heap-and-index-vacu.patchapplication/octet-stream; name=v4-0001-Choose-vacuum-strategy-before-heap-and-index-vacu.patchDownload
From 024b7cc2869995510b1be564a7314f4522d7b1f7 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 4 Jan 2021 13:34:10 +0900
Subject: [PATCH v4 1/3] Choose vacuum strategy before heap and index vacuums.

If the index_cleanup option is specified by neither the VACUUM command
nor the storage option, lazy vacuum asks each index for its vacuum
strategy before heap vacuum and decides whether or not to remove the
collected garbage tuples from the heap, based on both the answers from
amvacuumstrategy, a new index AM API introduced in this commit, and how
many LP_DEAD items can be accumulated in the space of a heap page left
by fillfactor.

The decision made by lazy vacuum and the answer returned from
amvacuumstrategy are passed to ambulkdelete. Then each index can
choose whether or not to skip index bulk-deletion accordingly.
---
 contrib/bloom/bloom.h                         |   2 +
 contrib/bloom/blutils.c                       |   1 +
 contrib/bloom/blvacuum.c                      |  23 +-
 doc/src/sgml/indexam.sgml                     |  25 ++
 doc/src/sgml/ref/create_table.sgml            |  19 +-
 src/backend/access/brin/brin.c                |   8 +-
 src/backend/access/common/reloptions.c        |  40 +-
 src/backend/access/gin/ginpostinglist.c       |  30 +-
 src/backend/access/gin/ginutil.c              |   1 +
 src/backend/access/gin/ginvacuum.c            |  25 ++
 src/backend/access/gist/gist.c                |   1 +
 src/backend/access/gist/gistvacuum.c          |  28 +-
 src/backend/access/hash/hash.c                |  22 +
 src/backend/access/heap/vacuumlazy.c          | 401 ++++++++++++++----
 src/backend/access/index/indexam.c            |  22 +
 src/backend/access/nbtree/nbtree.c            |  34 ++
 src/backend/access/spgist/spgutils.c          |   1 +
 src/backend/access/spgist/spgvacuum.c         |  27 +-
 src/backend/catalog/index.c                   |   2 +
 src/backend/commands/analyze.c                |   1 +
 src/backend/commands/vacuum.c                 |  23 +-
 src/include/access/amapi.h                    |   7 +-
 src/include/access/genam.h                    |  36 +-
 src/include/access/gin_private.h              |   2 +
 src/include/access/gist_private.h             |   2 +
 src/include/access/hash.h                     |   2 +
 src/include/access/htup_details.h             |  21 +-
 src/include/access/nbtree.h                   |   2 +
 src/include/access/spgist.h                   |   2 +
 src/include/commands/vacuum.h                 |  20 +-
 src/include/utils/rel.h                       |  17 +-
 .../expected/test_ginpostinglist.out          |   6 +-
 32 files changed, 691 insertions(+), 162 deletions(-)

diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index a22a6dfa40..8395d31450 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -202,6 +202,8 @@ extern void blendscan(IndexScanDesc scan);
 extern IndexBuildResult *blbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void blbuildempty(Relation index);
+extern IndexVacuumStrategy blvacuumstrategy(IndexVacuumInfo *info,
+											struct VacuumParams *params);
 extern IndexBulkDeleteResult *blbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
 										   void *callback_state);
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 1e505b1da5..8098d75c82 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -131,6 +131,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = blbuild;
 	amroutine->ambuildempty = blbuildempty;
 	amroutine->aminsert = blinsert;
+	amroutine->amvacuumstrategy = blvacuumstrategy;
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
index 88b0a6d290..c356ec9e85 100644
--- a/contrib/bloom/blvacuum.c
+++ b/contrib/bloom/blvacuum.c
@@ -23,6 +23,19 @@
 #include "storage/lmgr.h"
 
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+blvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
@@ -45,6 +58,14 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	BloomMetaPageData *metaData;
 	GenericXLogState *gxlogState;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted and this index chose to skip bulk-deletion.
+	 */
+	if (!info->will_vacuum_heap &&
+		info->indvac_strategy == INDEX_VACUUM_STRATEGY_NONE)
+		return stats;
+
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
 
@@ -172,7 +193,7 @@ blvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	BlockNumber npages,
 				blkno;
 
-	if (info->analyze_only)
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	if (stats == NULL)
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index ec5741df6d..9f881303f6 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -135,6 +135,7 @@ typedef struct IndexAmRoutine
     ambuild_function ambuild;
     ambuildempty_function ambuildempty;
     aminsert_function aminsert;
+    amvacuumstrategy_function amvacuumstrategy;
     ambulkdelete_function ambulkdelete;
     amvacuumcleanup_function amvacuumcleanup;
     amcanreturn_function amcanreturn;   /* can be NULL */
@@ -346,6 +347,30 @@ aminsert (Relation indexRelation,
 
   <para>
 <programlisting>
+IndexVacuumStrategy
+amvacuumstrategy (IndexVacuumInfo *info);
+</programlisting>
+   Tell <command>VACUUM</command> whether or not the index is willing to
+   delete index tuples.  This callback is called before
+   <function>ambulkdelete</function>.  Possible return values are
+   <literal>INDEX_VACUUM_STRATEGY_NONE</literal> and
+   <literal>INDEX_VACUUM_STRATEGY_BULKDELETE</literal>.  From the index's
+   point of view, if the index doesn't need to delete index tuples, it
+   must return <literal>INDEX_VACUUM_STRATEGY_NONE</literal>.  The returned
+   value can be referred to from <function>ambulkdelete</function> by checking
+   <literal>info-&gt;indvac_strategy</literal>.
+  </para>
+  <para>
+   <command>VACUUM</command> will decide whether or not to delete garbage tuples
+   from the heap based on the values returned from each index and several other
+   factors.  Therefore, if the index refers to heap TIDs and <command>VACUUM</command>
+   decides to delete garbage tuples from the heap, note that the index must still
+   delete index tuples even if it returned
+   <literal>INDEX_VACUUM_STRATEGY_NONE</literal>.
+  </para>
+
+  <para>
+<programlisting>
 IndexBulkDeleteResult *
 ambulkdelete (IndexVacuumInfo *info,
               IndexBulkDeleteResult *stats,
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 569f4c9da7..c45cdcb292 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1434,20 +1434,23 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
    </varlistentry>
 
    <varlistentry id="reloption-vacuum-index-cleanup" xreflabel="vacuum_index_cleanup">
-    <term><literal>vacuum_index_cleanup</literal>, <literal>toast.vacuum_index_cleanup</literal> (<type>boolean</type>)
+    <term><literal>vacuum_index_cleanup</literal>, <literal>toast.vacuum_index_cleanup</literal> (<type>enum</type>)
     <indexterm>
      <primary><varname>vacuum_index_cleanup</varname> storage parameter</primary>
     </indexterm>
     </term>
     <listitem>
      <para>
-      Enables or disables index cleanup when <command>VACUUM</command> is
-      run on this table.  The default value is <literal>true</literal>.
-      Disabling index cleanup can speed up <command>VACUUM</command> very
-      significantly, but may also lead to severely bloated indexes if table
-      modifications are frequent.  The <literal>INDEX_CLEANUP</literal>
-      parameter of <link linkend="sql-vacuum"><command>VACUUM</command></link>, if specified, overrides
-      the value of this option.
+      Specifies the index cleanup behavior when <command>VACUUM</command> is
+      run on this table.  The default value is <literal>auto</literal>, which
+      determines whether to enable or disable index cleanup based on the indexes
+      and the heap.  With <literal>off</literal> index cleanup is disabled; with
+      <literal>on</literal> it is enabled.  Disabling index cleanup can speed up
+      <command>VACUUM</command> very significantly, but may also lead to severely
+      bloated indexes if table modifications are frequent.  The
+      <literal>INDEX_CLEANUP</literal> parameter of
+      <link linkend="sql-vacuum"><command>VACUUM</command></link>, if specified,
+      overrides the value of this option.
      </para>
     </listitem>
    </varlistentry>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 27ba596c6e..fb70234112 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -112,6 +112,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = brinbuild;
 	amroutine->ambuildempty = brinbuildempty;
 	amroutine->aminsert = brininsert;
+	amroutine->amvacuumstrategy = NULL;
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
@@ -800,8 +801,11 @@ brinvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
 	Relation	heapRel;
 
-	/* No-op in ANALYZE ONLY mode */
-	if (info->analyze_only)
+	/*
+	 * No-op in ANALYZE ONLY mode or when user requests to disable index
+	 * cleanup.
+	 */
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	if (!stats)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index c687d3ee9e..692455d617 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -27,6 +27,7 @@
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/tablespace.h"
+#include "commands/vacuum.h"
 #include "commands/view.h"
 #include "nodes/makefuncs.h"
 #include "postmaster/postmaster.h"
@@ -140,15 +141,6 @@ static relopt_bool boolRelOpts[] =
 		},
 		false
 	},
-	{
-		{
-			"vacuum_index_cleanup",
-			"Enables index vacuuming and index cleanup",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
-			ShareUpdateExclusiveLock
-		},
-		true
-	},
 	{
 		{
 			"vacuum_truncate",
@@ -492,6 +484,23 @@ relopt_enum_elt_def viewCheckOptValues[] =
 	{(const char *) NULL}		/* list terminator */
 };
 
+/*
+ * Values from VacOptTernaryValue for the index_cleanup option.
+ * Accepting boolean-style values other than "on" and "off" is
+ * for backward compatibility, as the option used to be a
+ * boolean.
+ */
+relopt_enum_elt_def vacOptTernaryOptValues[] =
+{
+	{"auto", VACOPT_TERNARY_DEFAULT},
+	{"true", VACOPT_TERNARY_ENABLED},
+	{"false", VACOPT_TERNARY_DISABLED},
+	{"on", VACOPT_TERNARY_ENABLED},
+	{"off", VACOPT_TERNARY_DISABLED},
+	{"1", VACOPT_TERNARY_ENABLED},
+	{"0", VACOPT_TERNARY_DISABLED}
+};
+
 static relopt_enum enumRelOpts[] =
 {
 	{
@@ -516,6 +525,17 @@ static relopt_enum enumRelOpts[] =
 		VIEW_OPTION_CHECK_OPTION_NOT_SET,
 		gettext_noop("Valid values are \"local\" and \"cascaded\".")
 	},
+	{
+		{
+			"vacuum_index_cleanup",
+			"Enables index vacuuming and index cleanup",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		vacOptTernaryOptValues,
+		VACOPT_TERNARY_DEFAULT,
+		gettext_noop("Valid values are \"on\", \"off\", and \"auto\".")
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1856,7 +1876,7 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, user_catalog_table)},
 		{"parallel_workers", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, parallel_workers)},
-		{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
+		{"vacuum_index_cleanup", RELOPT_TYPE_ENUM,
 		offsetof(StdRdOptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
 		offsetof(StdRdOptions, vacuum_truncate)}
diff --git a/src/backend/access/gin/ginpostinglist.c b/src/backend/access/gin/ginpostinglist.c
index 216b2b9a2c..0322a1736e 100644
--- a/src/backend/access/gin/ginpostinglist.c
+++ b/src/backend/access/gin/ginpostinglist.c
@@ -22,29 +22,29 @@
 
 /*
  * For encoding purposes, item pointers are represented as 64-bit unsigned
- * integers. The lowest 11 bits represent the offset number, and the next
- * lowest 32 bits are the block number. That leaves 21 bits unused, i.e.
- * only 43 low bits are used.
+ * integers. The lowest 12 bits represent the offset number, and the next
+ * lowest 32 bits are the block number. That leaves 20 bits unused, i.e.
+ * only 44 low bits are used.
  *
- * 11 bits is enough for the offset number, because MaxHeapTuplesPerPage <
- * 2^11 on all supported block sizes. We are frugal with the bits, because
+ * 12 bits is enough for the offset number, because MaxHeapTuplesPerPage <
+ * 2^12 on all supported block sizes. We are frugal with the bits, because
  * smaller integers use fewer bytes in the varbyte encoding, saving disk
  * space. (If we get a new table AM in the future that wants to use the full
  * range of possible offset numbers, we'll need to change this.)
  *
- * These 43-bit integers are encoded using varbyte encoding. In each byte,
+ * These 44-bit integers are encoded using varbyte encoding. In each byte,
  * the 7 low bits contain data, while the highest bit is a continuation bit.
  * When the continuation bit is set, the next byte is part of the same
- * integer, otherwise this is the last byte of this integer. 43 bits need
+ * integer, otherwise this is the last byte of this integer. 44 bits need
  * at most 7 bytes in this encoding:
  *
  * 0XXXXXXX
- * 1XXXXXXX 0XXXXYYY
- * 1XXXXXXX 1XXXXYYY 0YYYYYYY
- * 1XXXXXXX 1XXXXYYY 1YYYYYYY 0YYYYYYY
- * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
- * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
- * 1XXXXXXX 1XXXXYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0uuuuuuY
+ * 1XXXXXXX 0XXXXXYY
+ * 1XXXXXXX 1XXXXXYY 0YYYYYYY
+ * 1XXXXXXX 1XXXXXYY 1YYYYYYY 0YYYYYYY
+ * 1XXXXXXX 1XXXXXYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
+ * 1XXXXXXX 1XXXXXYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0YYYYYYY
+ * 1XXXXXXX 1XXXXXYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 1YYYYYYY 0uuuuuYY
  *
  * X = bits used for offset number
  * Y = bits used for block number
@@ -73,12 +73,12 @@
 
 /*
  * How many bits do you need to encode offset number? OffsetNumber is a 16-bit
- * integer, but you can't fit that many items on a page. 11 ought to be more
+ * integer, but you can't fit that many items on a page. 12 ought to be more
  * than enough. It's tempting to derive this from MaxHeapTuplesPerPage, and
  * use the minimum number of bits, but that would require changing the on-disk
  * format if MaxHeapTuplesPerPage changes. Better to leave some slack.
  */
-#define MaxHeapTuplesPerPageBits		11
+#define MaxHeapTuplesPerPageBits		12
 
 /* Max. number of bytes needed to encode the largest supported integer. */
 #define MaxBytesPerInteger				7
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b9b04cf42..fc375332fc 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -63,6 +63,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = ginbuild;
 	amroutine->ambuildempty = ginbuildempty;
 	amroutine->aminsert = gininsert;
+	amroutine->amvacuumstrategy = ginvacuumstrategy;
 	amroutine->ambulkdelete = ginbulkdelete;
 	amroutine->amvacuumcleanup = ginvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 35b85a9bff..68bec5238a 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -560,6 +560,19 @@ ginVacuumEntryPage(GinVacuumState *gvs, Buffer buffer, BlockNumber *roots, uint3
 	return (tmppage == origpage) ? NULL : tmppage;
 }
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+ginvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 IndexBulkDeleteResult *
 ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			  IndexBulkDeleteCallback callback, void *callback_state)
@@ -571,6 +584,14 @@ ginbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	BlockNumber rootOfPostingTree[BLCKSZ / (sizeof(IndexTupleData) + sizeof(ItemId))];
 	uint32		nRoot;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted and this index chose to skip bulk-deletion.
+	 */
+	if (!info->will_vacuum_heap &&
+		info->indvac_strategy == INDEX_VACUUM_STRATEGY_NONE)
+		return stats;
+
 	gvs.tmpCxt = AllocSetContextCreate(CurrentMemoryContext,
 									   "Gin vacuum temporary context",
 									   ALLOCSET_DEFAULT_SIZES);
@@ -708,6 +729,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		return stats;
 	}
 
+	/* Skip index cleanup if user requests to disable */
+	if (!info->vacuumcleanup_requested)
+		return stats;
+
 	/*
 	 * Set up all-zero stats and cleanup pending inserts if ginbulkdelete
 	 * wasn't called
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index f203bb594c..cddcdd83be 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,6 +84,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = gistbuild;
 	amroutine->ambuildempty = gistbuildempty;
 	amroutine->aminsert = gistinsert;
+	amroutine->amvacuumstrategy = gistvacuumstrategy;
 	amroutine->ambulkdelete = gistbulkdelete;
 	amroutine->amvacuumcleanup = gistvacuumcleanup;
 	amroutine->amcanreturn = gistcanreturn;
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 94a7e12763..706454b2f0 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -52,6 +52,19 @@ static bool gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						   Buffer buffer, OffsetNumber downlink,
 						   Buffer leafBuffer);
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+gistvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * VACUUM bulkdelete stage: remove index entries.
  */
@@ -59,6 +72,14 @@ IndexBulkDeleteResult *
 gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted and this index chose to skip bulk-deletion.
+	 */
+	if (!info->will_vacuum_heap &&
+		info->indvac_strategy == INDEX_VACUUM_STRATEGY_NONE)
+		return stats;
+
 	/* allocate stats if first time through, else re-use existing struct */
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
@@ -74,8 +95,11 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
-	/* No-op in ANALYZE ONLY mode */
-	if (info->analyze_only)
+	/*
+	 * No-op in ANALYZE ONLY mode or when user requests to disable index
+	 * cleanup.
+	 */
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	/*
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0752fb38a9..0449638cb3 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -81,6 +81,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = hashbuild;
 	amroutine->ambuildempty = hashbuildempty;
 	amroutine->aminsert = hashinsert;
+	amroutine->amvacuumstrategy = hashvacuumstrategy;
 	amroutine->ambulkdelete = hashbulkdelete;
 	amroutine->amvacuumcleanup = hashvacuumcleanup;
 	amroutine->amcanreturn = NULL;
@@ -444,6 +445,19 @@ hashendscan(IndexScanDesc scan)
 	scan->opaque = NULL;
 }
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is specified to off.
+ */
+IndexVacuumStrategy
+hashvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
@@ -469,6 +483,14 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	HashMetaPage metap;
 	HashMetaPage cachedmetap;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted and this index chose to skip bulk-deletion.
+	 */
+	if (!info->will_vacuum_heap &&
+		info->indvac_strategy == INDEX_VACUUM_STRATEGY_NONE)
+		return stats;
+
 	tuples_removed = 0;
 	num_index_tuples = 0;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad..079359951e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -130,6 +130,15 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+/*
+ * Safety ratio of how many LP_DEAD items can be stored in a single heap
+ * page before it starts to overflow.  We're trying to avoid having VACUUM
+ * call lazy_vacuum_heap() in most cases, but we don't want to be too
+ * aggressive: it would be risky to make the value we test for much higher,
+ * since it might be too late by the time we actually call lazy_vacuum_heap().
+ */
+#define DEAD_ITEMS_ON_PAGE_LIMIT_SAFETY_RATIO	0.7
+
 /*
  * DSM keys for parallel vacuum.  Unlike other parallel execution code, since
  * we don't need to worry about DSM keys conflicting with plan_node_id we can
@@ -140,6 +149,7 @@
 #define PARALLEL_VACUUM_KEY_QUERY_TEXT		3
 #define PARALLEL_VACUUM_KEY_BUFFER_USAGE	4
 #define PARALLEL_VACUUM_KEY_WAL_USAGE		5
+#define PARALLEL_VACUUM_KEY_IND_STRATEGY	6
 
 /*
  * Macro to check if we are in a parallel vacuum.  If true, we are in the
@@ -214,6 +224,18 @@ typedef struct LVShared
 	double		reltuples;
 	bool		estimated_count;
 
+	/*
+	 * Copy of LVRelStats.vacuum_heap. It tells the index AM whether lazy
+	 * vacuum will remove dead tuples from the heap after index vacuum.
+	 */
+	bool vacuum_heap;
+
+	/*
+	 * Copy of LVRelStats.indexcleanup_requested. It tells index AM whether
+	 * amvacuumcleanup is requested or not.
+	 */
+	bool indexcleanup_requested;
+
 	/*
 	 * In single process lazy vacuum we could consume more memory during index
 	 * vacuuming or cleanup apart from the memory for heap scanning.  In
@@ -293,8 +315,8 @@ typedef struct LVRelStats
 {
 	char	   *relnamespace;
 	char	   *relname;
-	/* useindex = true means two-pass strategy; false means one-pass */
-	bool		useindex;
+	/* hasindex = true means two-pass strategy; false means one-pass */
+	bool		hasindex;
 	/* Overall statistics about rel */
 	BlockNumber old_rel_pages;	/* previous value of pg_class.relpages */
 	BlockNumber rel_pages;		/* total number of pages */
@@ -313,6 +335,15 @@ typedef struct LVRelStats
 	int			num_index_scans;
 	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
+	bool		vacuum_heap;	/* do we remove dead tuples from the heap? */
+	bool		indexcleanup_requested; /* true unless INDEX_CLEANUP is set to false */
+
+	/*
+	 * The array of index vacuum strategies for each index returned from
+	 * amvacuumstrategy. This is allocated in the DSM segment in parallel
+	 * mode and in local memory in non-parallel mode.
+	 */
+	IndexVacuumStrategy *ivstrategies;
 
 	/* Used for error callback */
 	char	   *indname;
@@ -320,6 +351,8 @@ typedef struct LVRelStats
 	OffsetNumber offnum;		/* used only for heap operations */
 	VacErrPhase phase;
 } LVRelStats;
+#define SizeOfIndVacStrategies(nindexes) \
+	(mul_size(sizeof(IndexVacuumStrategy), (nindexes)))
 
 /* Struct for saving and restoring vacuum error information. */
 typedef struct LVSavedErrInfo
@@ -343,6 +376,13 @@ static BufferAccessStrategy vac_strategy;
 static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 						   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
 						   bool aggressive);
+static void choose_vacuum_strategy(Relation onerel, LVRelStats *vacrelstats,
+								   VacuumParams *params, Relation *Irel,
+								   int nindexes, int ndeaditems);
+static void lazy_vacuum_table_and_indexes(Relation onerel, VacuumParams *params,
+										  LVRelStats *vacrelstats, Relation *Irel,
+										  int nindexes, IndexBulkDeleteResult **stats,
+										  LVParallelState *lps, int *maxdeadtups);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
@@ -351,7 +391,8 @@ static void lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 									LVRelStats *vacrelstats, LVParallelState *lps,
 									int nindexes);
 static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
-							  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats);
+							  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats,
+							  IndexVacuumStrategy ivstrat);
 static void lazy_cleanup_index(Relation indrel,
 							   IndexBulkDeleteResult **stats,
 							   double reltuples, bool estimated_count, LVRelStats *vacrelstats);
@@ -362,7 +403,8 @@ static bool should_attempt_truncation(VacuumParams *params,
 static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
 static BlockNumber count_nondeletable_pages(Relation onerel,
 											LVRelStats *vacrelstats);
-static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
+static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks,
+							 int nindexes);
 static void lazy_record_dead_tuple(LVDeadTuples *dead_tuples,
 								   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
@@ -381,7 +423,8 @@ static void vacuum_indexes_leader(Relation *Irel, IndexBulkDeleteResult **stats,
 								  int nindexes);
 static void vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
 							 LVShared *lvshared, LVSharedIndStats *shared_indstats,
-							 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats);
+							 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats,
+							 IndexVacuumStrategy ivstrat);
 static void lazy_cleanup_all_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
 									 LVRelStats *vacrelstats, LVParallelState *lps,
 									 int nindexes);
@@ -398,7 +441,8 @@ static LVParallelState *begin_parallel_vacuum(Oid relid, Relation *Irel,
 static void end_parallel_vacuum(IndexBulkDeleteResult **stats,
 								LVParallelState *lps, int nindexes);
 static LVSharedIndStats *get_indstats(LVShared *lvshared, int n);
-static bool skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared);
+static bool skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared,
+									   IndexVacuumStrategy ivstrat);
 static void vacuum_error_callback(void *arg);
 static void update_vacuum_error_info(LVRelStats *errinfo, LVSavedErrInfo *saved_err_info,
 									 int phase, BlockNumber blkno,
@@ -442,7 +486,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	ErrorContextCallback errcallback;
 
 	Assert(params != NULL);
-	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
 	/* not every AM requires these to be valid, but heap does */
@@ -501,8 +544,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 
 	/* Open all indexes of the relation */
 	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
-	vacrelstats->useindex = (nindexes > 0 &&
-							 params->index_cleanup == VACOPT_TERNARY_ENABLED);
+	vacrelstats->hasindex = (nindexes > 0);
 
 	/*
 	 * Setup error traceback support for ereport().  The idea is to set up an
@@ -763,6 +805,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	BlockNumber empty_pages,
 				vacuumed_pages,
 				next_fsm_block_to_vacuum;
+	int			maxdeadtups = 0;	/* maximum # of dead tuples in a single page */
 	double		num_tuples,		/* total number of nonremovable tuples */
 				live_tuples,	/* live tuples (reltuples estimate) */
 				tups_vacuumed,	/* tuples cleaned up by vacuum */
@@ -811,14 +854,24 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	vacrelstats->nonempty_pages = 0;
 	vacrelstats->latestRemovedXid = InvalidTransactionId;
 
+	/*
+	 * Index vacuum cleanup is requested unless index cleanup is disabled,
+	 * i.e., it's true when index_cleanup is either default or enabled.
+	 */
+	vacrelstats->indexcleanup_requested =
+		(params->index_cleanup != VACOPT_TERNARY_DISABLED);
+
 	vistest = GlobalVisTestFor(onerel);
 
 	/*
 	 * Initialize state for a parallel vacuum.  As of now, only one worker can
 	 * be used for an index, so we invoke parallelism only if there are at
-	 * least two indexes on a table.
+	 * least two indexes on a table.  When index cleanup is disabled, index
+	 * bulk-deletion is likely to be a no-op, so we also disable parallel
+	 * vacuum.
 	 */
-	if (params->nworkers >= 0 && vacrelstats->useindex && nindexes > 1)
+	if (params->nworkers >= 0 && nindexes > 1 &&
+		params->index_cleanup != VACOPT_TERNARY_DISABLED)
 	{
 		/*
 		 * Since parallel workers cannot access data in temporary tables, we
@@ -846,7 +899,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * initialized.
 	 */
 	if (!ParallelVacuumIsActive(lps))
-		lazy_space_alloc(vacrelstats, nblocks);
+		lazy_space_alloc(vacrelstats, nblocks, nindexes);
 
 	dead_tuples = vacrelstats->dead_tuples;
 	frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
@@ -1050,19 +1103,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				vmbuffer = InvalidBuffer;
 			}
 
-			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(onerel, Irel, indstats,
-									vacrelstats, lps, nindexes);
-
-			/* Remove tuples from heap */
-			lazy_vacuum_heap(onerel, vacrelstats);
-
-			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
-			 */
-			dead_tuples->num_tuples = 0;
+			/* Vacuum the table and its indexes */
+			lazy_vacuum_table_and_indexes(onerel, params, vacrelstats,
+										  Irel, nindexes, indstats,
+										  lps, &maxdeadtups);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
@@ -1512,32 +1556,16 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		/*
 		 * If there are no indexes we can vacuum the page right now instead of
-		 * doing a second scan. Also we don't do that but forget dead tuples
-		 * when index cleanup is disabled.
+		 * doing a second scan.
 		 */
-		if (!vacrelstats->useindex && dead_tuples->num_tuples > 0)
+		if (!vacrelstats->hasindex && dead_tuples->num_tuples > 0)
 		{
-			if (nindexes == 0)
-			{
-				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
-				vacuumed_pages++;
-				has_dead_tuples = false;
-			}
-			else
-			{
-				/*
-				 * Here, we have indexes but index cleanup is disabled.
-				 * Instead of vacuuming the dead tuples on the heap, we just
-				 * forget them.
-				 *
-				 * Note that vacrelstats->dead_tuples could have tuples which
-				 * became dead after HOT-pruning but are not marked dead yet.
-				 * We do not process them because it's a very rare condition,
-				 * and the next vacuum will process them anyway.
-				 */
-				Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
-			}
+			Assert(nindexes == 0);
+
+			/* Remove tuples from heap if the table has no index */
+			lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+			vacuumed_pages++;
+			has_dead_tuples = false;
 
 			/*
 			 * Forget the now-vacuumed tuples, and press on, but be careful
@@ -1663,6 +1691,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		if (dead_tuples->num_tuples == prev_dead_count)
 			RecordPageWithFreeSpace(onerel, blkno, freespace);
+		else
+			maxdeadtups = Max(maxdeadtups,
+							  dead_tuples->num_tuples - prev_dead_count);
 	}
 
 	/* report that everything is scanned and vacuumed */
@@ -1702,14 +1733,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/* If any tuples need to be deleted, perform final vacuum cycle */
 	/* XXX put a threshold on min number of tuples here? */
 	if (dead_tuples->num_tuples > 0)
-	{
-		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats,
-								lps, nindexes);
-
-		/* Remove tuples from heap */
-		lazy_vacuum_heap(onerel, vacrelstats);
-	}
+		lazy_vacuum_table_and_indexes(onerel, params, vacrelstats,
+									  Irel, nindexes, indstats,
+									  lps, &maxdeadtups);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
@@ -1722,7 +1748,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Do post-vacuum cleanup */
-	if (vacrelstats->useindex)
+	if (vacrelstats->hasindex)
 		lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, nindexes);
 
 	/*
@@ -1775,6 +1801,140 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	pfree(buf.data);
 }
 
+/*
+ * Remove the collected garbage tuples from the table and its indexes.
+ */
+static void
+lazy_vacuum_table_and_indexes(Relation onerel, VacuumParams *params,
+							  LVRelStats *vacrelstats, Relation *Irel,
+							  int nindexes, IndexBulkDeleteResult **indstats,
+							  LVParallelState *lps, int *maxdeadtups)
+{
+	/*
+	 * Choose the vacuum strategy for this vacuum cycle.
+	 * choose_vacuum_strategy() stores the decision in
+	 * vacrelstats->vacuum_heap.
+	 */
+	choose_vacuum_strategy(onerel, vacrelstats, params, Irel, nindexes,
+						   *maxdeadtups);
+
+	/* Work on all the indexes, then the heap */
+	lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats, lps,
+							nindexes);
+
+	if (vacrelstats->vacuum_heap)
+	{
+		/* Remove tuples from heap */
+		lazy_vacuum_heap(onerel, vacrelstats);
+	}
+	else
+	{
+		/*
+		 * Here, we don't do heap vacuum in this cycle.
+		 *
+		 * Note that vacrelstats->dead_tuples could have tuples which
+		 * became dead after HOT-pruning but are not marked dead yet.
+		 * We do not process them because it's a very rare condition,
+		 * and the next vacuum will process them anyway.
+		 */
+		Assert(params->index_cleanup != VACOPT_TERNARY_ENABLED);
+	}
+
+	/*
+	 * Forget the now-vacuumed tuples, and press on, but be careful
+	 * not to reset latestRemovedXid since we want that value to be
+	 * valid.
+	 */
+	vacrelstats->dead_tuples->num_tuples = 0;
+	*maxdeadtups = 0;
+}
+
+/*
+ * Decide whether or not we remove the collected garbage tuples from the
+ * heap. The decision is stored in vacrelstats->vacuum_heap.  ndeaditems is
+ * the maximum number of LP_DEAD items on any one heap page encountered
+ * during the heap scan.
+ */
+static void
+choose_vacuum_strategy(Relation onerel, LVRelStats *vacrelstats,
+					   VacuumParams *params, Relation *Irel, int nindexes,
+					   int ndeaditems)
+{
+	bool vacuum_heap = true;
+	int i;
+
+	/*
+	 * Ask each index for its vacuum strategy, and save the answers. If even
+	 * one index returns 'none', we can skip heap vacuum in this cycle, at
+	 * least from the index strategies' point of view. This decision might
+	 * be overridden by other factors, see below.
+	 */
+	for (i = 0; i < nindexes; i++)
+	{
+		IndexVacuumInfo ivinfo;
+
+		ivinfo.index = Irel[i];
+		ivinfo.message_level = elevel;
+
+		/* Save the returned value */
+		vacrelstats->ivstrategies[i] = index_vacuum_strategy(&ivinfo, params);
+
+		if (vacrelstats->ivstrategies[i] == INDEX_VACUUM_STRATEGY_NONE)
+			vacuum_heap = false;
+	}
+
+	/* If the index_cleanup option was specified, it overrides the decision */
+	if (params->index_cleanup == VACOPT_TERNARY_ENABLED)
+		vacuum_heap = true;
+	else if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		vacuum_heap = false;
+	else if (!vacuum_heap)
+	{
+		Size freespace = RelationGetTargetPageFreeSpace(onerel,
+														HEAP_DEFAULT_FILLFACTOR);
+		int ndeaditems_limit = (int) ((freespace / sizeof(ItemIdData)) *
+									  DEAD_ITEMS_ON_PAGE_LIMIT_SAFETY_RATIO);
+
+		/*
+		 * Check whether we need to delete the collected garbage from the heap,
+		 * from the heap point of view.
+		 *
+		 * The test of ndeaditems_limit is for the maximum number of LP_DEAD
+		 * items on any one heap page encountered during heap scan by caller.
+		 * The general idea here is to preserve the original pristine state of
+		 * the table when it is subject to constant non-HOT updates and the
+		 * heap fill factor has been reduced from its default.
+		 *
+		 * To calculate how many LP_DEAD line pointers can be stored into the
+		 * space of a heap page left by fillfactor, we need to consider it from
+		 * two aspects: the size left by fillfactor and the maximum number of
+		 * heap tuples per page, i.e., MaxHeapTuplesPerPage.  ndeaditems_limit
+		 * is calculated using the free space left by fillfactor -- we can fit
+		 * (freespace / sizeof(ItemIdData)) LP_DEAD items on a heap page before
+		 * they start to "overflow" with that setting, from the perspective
+		 * of space alone.  However, we cannot always store the calculated
+		 * number of LP_DEAD line pointers because of MaxHeapTuplesPerPage --
+		 * the total number of line pointers in a heap page cannot exceed
+		 * MaxHeapTuplesPerPage.  For example, with small tuples we can store
+		 * more tuples in a heap page, which consumes more of the free line
+		 * pointers to store heap tuples.  So leaving line pointers as LP_DEAD
+		 * could consume line pointers that are supposed to store heap tuples,
+		 * resulting in an overflow.
+		 *
+		 * The calculation below, however, considers only the former aspect,
+		 * the space, because (1) MaxHeapTuplesPerPage is defined with room
+		 * for accumulating a certain number of LP_DEAD line pointers and
+		 * (2) it keeps the calculation simple.  Thanks to (1) we don't need
+		 * to consider the upper bound in most cases.
+		 */
+		if (ndeaditems > ndeaditems_limit)
+			vacuum_heap = true;
+	}
+
+	vacrelstats->vacuum_heap = vacuum_heap;
+}
+
+
 /*
  *	lazy_vacuum_all_indexes() -- vacuum all indexes of relation.
  *
@@ -1818,7 +1978,8 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 
 		for (idx = 0; idx < nindexes; idx++)
 			lazy_vacuum_index(Irel[idx], &stats[idx], vacrelstats->dead_tuples,
-							  vacrelstats->old_live_tuples, vacrelstats);
+							  vacrelstats->old_live_tuples, vacrelstats,
+							  vacrelstats->ivstrategies[idx]);
 	}
 
 	/* Increase and report the number of index scans */
@@ -1827,7 +1988,6 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 								 vacrelstats->num_index_scans);
 }
 
-
 /*
  *	lazy_vacuum_heap() -- second pass over the heap
  *
@@ -2092,7 +2252,7 @@ lazy_parallel_vacuum_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
 							 LVRelStats *vacrelstats, LVParallelState *lps,
 							 int nindexes)
 {
-	int			nworkers;
+	int			nworkers = 0;
 
 	Assert(!IsParallelWorker());
 	Assert(ParallelVacuumIsActive(lps));
@@ -2108,10 +2268,32 @@ lazy_parallel_vacuum_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
 			nworkers = lps->nindexes_parallel_cleanup;
 	}
 	else
-		nworkers = lps->nindexes_parallel_bulkdel;
+	{
+		if (vacrelstats->vacuum_heap)
+			nworkers = lps->nindexes_parallel_bulkdel;
+		else
+		{
+			int i;
+
+			/*
+			 * If we don't vacuum the heap, index bulk-deletion could be skipped
+			 * for some indexes.  So we count how many indexes will do index
+			 * bulk-deletion based on their answers to amvacuumstrategy.
+			 */
+			for (i = 0; i < nindexes; i++)
+			{
+				uint8 vacoptions = Irel[i]->rd_indam->amparallelvacuumoptions;
+
+				if ((vacoptions & VACUUM_OPTION_PARALLEL_BULKDEL) != 0 &&
+					vacrelstats->ivstrategies[i] == INDEX_VACUUM_STRATEGY_BULKDELETE)
+					nworkers++;
+			}
+		}
+	}
 
 	/* The leader process will participate */
-	nworkers--;
+	if (nworkers > 0)
+		nworkers--;
 
 	/*
 	 * It is possible that parallel context is initialized with fewer workers
@@ -2120,6 +2302,10 @@ lazy_parallel_vacuum_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
 	 */
 	nworkers = Min(nworkers, lps->pcxt->nworkers);
 
+	/* Copy the information to the shared state */
+	lps->lvshared->vacuum_heap = vacrelstats->vacuum_heap;
+	lps->lvshared->indexcleanup_requested = vacrelstats->indexcleanup_requested;
+
 	/* Setup the shared cost-based vacuum delay and launch workers */
 	if (nworkers > 0)
 	{
@@ -2249,12 +2435,14 @@ parallel_vacuum_index(Relation *Irel, IndexBulkDeleteResult **stats,
 		 * operation
 		 */
 		if (shared_indstats == NULL ||
-			skip_parallel_vacuum_index(Irel[idx], lvshared))
+			skip_parallel_vacuum_index(Irel[idx], lvshared,
+									   vacrelstats->ivstrategies[idx]))
 			continue;
 
 		/* Do vacuum or cleanup of the index */
 		vacuum_one_index(Irel[idx], &(stats[idx]), lvshared, shared_indstats,
-						 dead_tuples, vacrelstats);
+						 dead_tuples, vacrelstats,
+						 vacrelstats->ivstrategies[idx]);
 	}
 
 	/*
@@ -2292,10 +2480,11 @@ vacuum_indexes_leader(Relation *Irel, IndexBulkDeleteResult **stats,
 
 		/* Process the indexes skipped by parallel workers */
 		if (shared_indstats == NULL ||
-			skip_parallel_vacuum_index(Irel[i], lps->lvshared))
+			skip_parallel_vacuum_index(Irel[i], lps->lvshared,
+									   vacrelstats->ivstrategies[i]))
 			vacuum_one_index(Irel[i], &(stats[i]), lps->lvshared,
 							 shared_indstats, vacrelstats->dead_tuples,
-							 vacrelstats);
+							 vacrelstats, vacrelstats->ivstrategies[i]);
 	}
 
 	/*
@@ -2315,7 +2504,8 @@ vacuum_indexes_leader(Relation *Irel, IndexBulkDeleteResult **stats,
 static void
 vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
 				 LVShared *lvshared, LVSharedIndStats *shared_indstats,
-				 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats)
+				 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats,
+				 IndexVacuumStrategy ivstrat)
 {
 	IndexBulkDeleteResult *bulkdelete_res = NULL;
 
@@ -2338,7 +2528,7 @@ vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
 						   lvshared->estimated_count, vacrelstats);
 	else
 		lazy_vacuum_index(indrel, stats, dead_tuples,
-						  lvshared->reltuples, vacrelstats);
+						  lvshared->reltuples, vacrelstats, ivstrat);
 
 	/*
 	 * Copy the index bulk-deletion result returned from ambulkdelete and
@@ -2429,7 +2619,8 @@ lazy_cleanup_all_indexes(Relation *Irel, IndexBulkDeleteResult **stats,
  */
 static void
 lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
-				  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats)
+				  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats,
+				  IndexVacuumStrategy ivstrat)
 {
 	IndexVacuumInfo ivinfo;
 	PGRUsage	ru0;
@@ -2443,7 +2634,9 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	ivinfo.estimated_count = true;
 	ivinfo.message_level = elevel;
 	ivinfo.num_heap_tuples = reltuples;
-	ivinfo.strategy = vac_strategy;
+	ivinfo.strategy = vac_strategy; /* buffer access strategy */
+	ivinfo.will_vacuum_heap = vacrelstats->vacuum_heap;
+	ivinfo.indvac_strategy = ivstrat; /* index vacuum strategy */
 
 	/*
 	 * Update error traceback information.
@@ -2461,11 +2654,17 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	*stats = index_bulk_delete(&ivinfo, *stats,
 							   lazy_tid_reaped, (void *) dead_tuples);
 
-	ereport(elevel,
-			(errmsg("scanned index \"%s\" to remove %d row versions",
-					vacrelstats->indname,
-					dead_tuples->num_tuples),
-			 errdetail_internal("%s", pg_rusage_show(&ru0))));
+	/*
+	 * Report the index bulk-deletion stats.  If the index returned
+	 * statistics and we are going to vacuum the heap, we can assume it
+	 * has done the index bulk-deletion.
+	 */
+	if (*stats && vacrelstats->vacuum_heap)
+		ereport(elevel,
+				(errmsg("scanned index \"%s\" to remove %d row versions",
+						vacrelstats->indname,
+						dead_tuples->num_tuples),
+				 errdetail_internal("%s", pg_rusage_show(&ru0))));
 
 	/* Revert to the previous phase information for error traceback */
 	restore_vacuum_error_info(vacrelstats, &saved_err_info);
@@ -2498,6 +2697,7 @@ lazy_cleanup_index(Relation indrel,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vac_strategy;
+	ivinfo.vacuumcleanup_requested = vacrelstats->indexcleanup_requested;
 
 	/*
 	 * Update error traceback information.
@@ -2844,14 +3044,14 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
  * Return the maximum number of dead tuples we can record.
  */
 static long
-compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
+compute_max_dead_tuples(BlockNumber relblocks, bool hasindex)
 {
 	long		maxtuples;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
 	autovacuum_work_mem != -1 ?
 	autovacuum_work_mem : maintenance_work_mem;
 
-	if (useindex)
+	if (hasindex)
 	{
 		maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
 		maxtuples = Min(maxtuples, INT_MAX);
@@ -2876,18 +3076,21 @@ compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
  * See the comments at the head of this file for rationale.
  */
 static void
-lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
+lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks,
+				 int nindexes)
 {
 	LVDeadTuples *dead_tuples = NULL;
 	long		maxtuples;
 
-	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->useindex);
+	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->hasindex);
 
 	dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
 	dead_tuples->num_tuples = 0;
 	dead_tuples->max_tuples = (int) maxtuples;
 
 	vacrelstats->dead_tuples = dead_tuples;
+	vacrelstats->ivstrategies =
+		(IndexVacuumStrategy *) palloc0(SizeOfIndVacStrategies(nindexes));
 }
 
 /*
@@ -3223,10 +3426,12 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	LVDeadTuples *dead_tuples;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
+	IndexVacuumStrategy *ivstrats;
 	bool	   *can_parallel_vacuum;
 	long		maxtuples;
 	Size		est_shared;
 	Size		est_deadtuples;
+	Size		est_ivstrategies;
 	int			nindexes_mwm = 0;
 	int			parallel_workers = 0;
 	int			querylen;
@@ -3320,6 +3525,13 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 						   mul_size(sizeof(WalUsage), pcxt->nworkers));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+	/*
+	 * Estimate space for IndexVacuumStrategy -- PARALLEL_VACUUM_KEY_IND_STRATEGY.
+	 */
+	est_ivstrategies = MAXALIGN(SizeOfIndVacStrategies(nindexes));
+	shm_toc_estimate_chunk(&pcxt->estimator, est_ivstrategies);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 	/* Finally, estimate PARALLEL_VACUUM_KEY_QUERY_TEXT space */
 	if (debug_query_string)
 	{
@@ -3372,6 +3584,11 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_WAL_USAGE, wal_usage);
 	lps->wal_usage = wal_usage;
 
+	/* Allocate space for each index strategy */
+	ivstrats = shm_toc_allocate(pcxt->toc, est_ivstrategies);
+	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_IND_STRATEGY, ivstrats);
+	vacrelstats->ivstrategies = ivstrats;
+
 	/* Store query string for workers */
 	if (debug_query_string)
 	{
@@ -3461,7 +3678,8 @@ get_indstats(LVShared *lvshared, int n)
  * or parallel index cleanup, false, otherwise.
  */
 static bool
-skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared)
+skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared,
+						   IndexVacuumStrategy ivstrat)
 {
 	uint8		vacoptions = indrel->rd_indam->amparallelvacuumoptions;
 
@@ -3485,9 +3703,18 @@ skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared)
 			((vacoptions & VACUUM_OPTION_PARALLEL_COND_CLEANUP) != 0))
 			return true;
 	}
-	else if ((vacoptions & VACUUM_OPTION_PARALLEL_BULKDEL) == 0)
+	else if ((vacoptions & VACUUM_OPTION_PARALLEL_BULKDEL) == 0 ||
+			 (!lvshared->vacuum_heap &&
+			  ivstrat == INDEX_VACUUM_STRATEGY_NONE))
 	{
-		/* Skip if the index does not support parallel bulk deletion */
+		/*
+		 * Skip if the index does not support parallel bulk deletion.
+		 *
+		 * Also skip if we don't require the index to delete garbage and it
+		 * doesn't want to do so either.  Since the index bulk-deletion is
+		 * likely to be a no-op, we don't launch parallel workers for it.
+		 */
 		return true;
 	}
 
@@ -3507,6 +3734,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	Relation   *indrels;
 	LVShared   *lvshared;
 	LVDeadTuples *dead_tuples;
+	IndexVacuumStrategy *ivstrats;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
 	int			nindexes;
@@ -3548,6 +3776,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 												  PARALLEL_VACUUM_KEY_DEAD_TUPLES,
 												  false);
 
+	/* Set vacuum strategy space */
+	ivstrats = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_IND_STRATEGY, false);
+	vacrelstats.ivstrategies = ivstrats;
+
 	/* Set cost-based vacuum delay */
 	VacuumCostActive = (VacuumCostDelay > 0);
 	VacuumCostBalance = 0;
@@ -3573,6 +3805,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	vacrelstats.indname = NULL;
 	vacrelstats.phase = VACUUM_ERRCB_PHASE_UNKNOWN; /* Not yet processing */
 
+	vacrelstats.vacuum_heap = lvshared->vacuum_heap;
+	vacrelstats.indexcleanup_requested = lvshared->indexcleanup_requested;
+
 	/* Setup error traceback support for ereport() */
 	errcallback.callback = vacuum_error_callback;
 	errcallback.arg = &vacrelstats;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 3d2dbed708..171ba5c2fa 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -678,6 +678,28 @@ index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap)
 	return ntids;
 }
 
+/* ----------------
+ *		index_vacuum_strategy - ask index vacuum strategy
+ *
+ * This callback routine is called just before vacuuming the heap.
+ * Returns an IndexVacuumStrategy value that tells lazy vacuum whether
+ * the index wants to do bulk-deletion.
+ * ----------------
+ */
+IndexVacuumStrategy
+index_vacuum_strategy(IndexVacuumInfo *info, struct VacuumParams *params)
+{
+	Relation	indexRelation = info->index;
+
+	RELATION_CHECKS;
+
+	/* amvacuumstrategy is optional; assume bulk-deletion */
+	if (indexRelation->rd_indam->amvacuumstrategy == NULL)
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+
+	return indexRelation->rd_indam->amvacuumstrategy(info, params);
+}
+
 /* ----------------
  *		index_bulk_delete - do mass deletion of index entries
  *
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 289bd3c15d..e00e5fe0a4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -133,6 +133,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = btbuild;
 	amroutine->ambuildempty = btbuildempty;
 	amroutine->aminsert = btinsert;
+	amroutine->amvacuumstrategy = btvacuumstrategy;
 	amroutine->ambulkdelete = btbulkdelete;
 	amroutine->amvacuumcleanup = btvacuumcleanup;
 	amroutine->amcanreturn = btcanreturn;
@@ -822,6 +823,18 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 		 */
 		result = true;
 	}
+	else if (!info->vacuumcleanup_requested)
+	{
+		/*
+		 * Skip cleanup if INDEX_CLEANUP is set to false, even if there might
+		 * be a deleted page that can be recycled. If INDEX_CLEANUP continues
+		 * to be disabled, recyclable pages could be left unrecycled due to
+		 * XID wraparound.  But in practice that's not so harmful, since such
+		 * a workload doesn't need to delete and recycle pages anyway, and
+		 * deletion of btree index pages is relatively rare.
+		 */
+		result = false;
+	}
 	else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
 			 GlobalVisCheckRemovableXid(NULL, metad->btm_oldest_btpo_xact))
 	{
@@ -864,6 +877,19 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	return result;
 }
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is set to off.
+ */
+IndexVacuumStrategy
+btvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
@@ -878,6 +904,14 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	Relation	rel = info->index;
 	BTCycleId	cycleid;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be removed and this index chose to skip bulk-deletion.
+	 */
+	if (!info->will_vacuum_heap &&
+		info->indvac_strategy == INDEX_VACUUM_STRATEGY_NONE)
+		return stats;
+
 	/* allocate stats if first time through, else re-use existing struct */
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index d8b1815061..7b2313590a 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -66,6 +66,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = spgbuild;
 	amroutine->ambuildempty = spgbuildempty;
 	amroutine->aminsert = spginsert;
+	amroutine->amvacuumstrategy = spgvacuumstrategy;
 	amroutine->ambulkdelete = spgbulkdelete;
 	amroutine->amvacuumcleanup = spgvacuumcleanup;
 	amroutine->amcanreturn = spgcanreturn;
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0d02a02222..f44043d94f 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -894,6 +894,19 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	bds->stats->pages_free = bds->stats->pages_deleted;
 }
 
+/*
+ * Choose the vacuum strategy. Do bulk-deletion unless index cleanup
+ * is set to off.
+ */
+IndexVacuumStrategy
+spgvacuumstrategy(IndexVacuumInfo *info, VacuumParams *params)
+{
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+		return INDEX_VACUUM_STRATEGY_NONE;
+	else
+		return INDEX_VACUUM_STRATEGY_BULKDELETE;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
@@ -907,6 +920,13 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 {
 	spgBulkDeleteState bds;
 
+	/*
+	 * Skip deleting index entries if the corresponding heap tuples will
+	 * not be deleted.
+	 */
+	if (!info->will_vacuum_heap)
+		return NULL;
+
 	/* allocate stats if first time through, else re-use existing struct */
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
@@ -937,8 +957,11 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
 	spgBulkDeleteState bds;
 
-	/* No-op in ANALYZE ONLY mode */
-	if (info->analyze_only)
+	/*
+	 * No-op in ANALYZE ONLY mode or when the user requests to disable index
+	 * cleanup.
+	 */
+	if (info->analyze_only || !info->vacuumcleanup_requested)
 		return stats;
 
 	/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b8cd35e995..30b48d6ccb 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3401,6 +3401,8 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.will_vacuum_heap = true;
+	ivinfo.indvac_strategy = INDEX_VACUUM_STRATEGY_BULKDELETE;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7295cf0215..111addbd6c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -668,6 +668,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.vacuumcleanup_requested = true;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 462f9a0f82..4ab20b77e6 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1870,17 +1870,20 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	onerelid = onerel->rd_lockInfo.lockRelId;
 	LockRelationIdForSession(&onerelid, lmode);
 
-	/* Set index cleanup option based on reloptions if not yet */
-	if (params->index_cleanup == VACOPT_TERNARY_DEFAULT)
-	{
-		if (onerel->rd_options == NULL ||
-			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
-			params->index_cleanup = VACOPT_TERNARY_ENABLED;
-		else
-			params->index_cleanup = VACOPT_TERNARY_DISABLED;
-	}
+	/*
+	 * Set the index cleanup option if the vacuum_index_cleanup reloption is
+	 * set.  Otherwise we leave it as 'default', which means that we choose
+	 * the vacuum strategy based on the table and index status.  See
+	 * choose_vacuum_strategy().
+	 */
+	if (params->index_cleanup == VACOPT_TERNARY_DEFAULT &&
+		onerel->rd_options != NULL)
+		params->index_cleanup =
+			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup;
 
-	/* Set truncate option based on reloptions if not yet */
+	/*
+	 * Set the truncate option based on reloptions if not yet set.  The
+	 * truncate option is true by default.
+	 */
 	if (params->truncate == VACOPT_TERNARY_DEFAULT)
 	{
 		if (onerel->rd_options == NULL ||
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index d357ebb559..548f2033a4 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -22,8 +22,9 @@
 struct PlannerInfo;
 struct IndexPath;
 
-/* Likewise, this file shouldn't depend on execnodes.h. */
+/* Likewise, this file shouldn't depend on execnodes.h and vacuum.h. */
 struct IndexInfo;
+struct VacuumParams;
 
 
 /*
@@ -112,6 +113,9 @@ typedef bool (*aminsert_function) (Relation indexRelation,
 								   IndexUniqueCheck checkUnique,
 								   bool indexUnchanged,
 								   struct IndexInfo *indexInfo);
+/* vacuum strategy */
+typedef IndexVacuumStrategy (*amvacuumstrategy_function) (IndexVacuumInfo *info,
+														  struct VacuumParams *params);
 
 /* bulk delete */
 typedef IndexBulkDeleteResult *(*ambulkdelete_function) (IndexVacuumInfo *info,
@@ -259,6 +263,7 @@ typedef struct IndexAmRoutine
 	ambuild_function ambuild;
 	ambuildempty_function ambuildempty;
 	aminsert_function aminsert;
+	amvacuumstrategy_function amvacuumstrategy;
 	ambulkdelete_function ambulkdelete;
 	amvacuumcleanup_function amvacuumcleanup;
 	amcanreturn_function amcanreturn;	/* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 0eab1508d3..f164ec1a54 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -21,8 +21,9 @@
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
 
-/* We don't want this file to depend on execnodes.h. */
+/* We don't want this file to depend on execnodes.h and vacuum.h. */
 struct IndexInfo;
+struct VacuumParams;
 
 /*
  * Struct for statistics returned by ambuild
@@ -33,8 +34,17 @@ typedef struct IndexBuildResult
 	double		index_tuples;	/* # of tuples inserted into index */
 } IndexBuildResult;
 
+/* Result value for amvacuumstrategy */
+typedef enum IndexVacuumStrategy
+{
+	INDEX_VACUUM_STRATEGY_NONE,			/* No-op, skip bulk-deletion in this
+										 * vacuum cycle */
+	INDEX_VACUUM_STRATEGY_BULKDELETE	/* Do ambulkdelete */
+} IndexVacuumStrategy;
+
 /*
- * Struct for input arguments passed to ambulkdelete and amvacuumcleanup
+ * Struct for input arguments passed to amvacuumstrategy, ambulkdelete
+ * and amvacuumcleanup
  *
  * num_heap_tuples is accurate only when estimated_count is false;
  * otherwise it's just an estimate (currently, the estimate is the
@@ -50,6 +60,26 @@ typedef struct IndexVacuumInfo
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
+
+	/*
+	 * True if lazy vacuum will delete the collected garbage tuples from
+	 * the heap.  If it's false, the index AM can skip index bulk-deletion
+	 * safely.  This field is used only for ambulkdelete.
+	 */
+	bool		will_vacuum_heap;
+
+	/*
+	 * The answer from amvacuumstrategy, which is asked before executing
+	 * ambulkdelete.  This field is used only for ambulkdelete.
+	 */
+	IndexVacuumStrategy indvac_strategy;
+
+	/*
+	 * True if amvacuumcleanup is requested by lazy vacuum.  If false, the
+	 * index AM can skip index cleanup.  This can be false when the
+	 * INDEX_CLEANUP vacuum option is set to false.  This field is used only
+	 * for amvacuumcleanup.
+	 */
+	bool		vacuumcleanup_requested;
 } IndexVacuumInfo;
 
 /*
@@ -174,6 +204,8 @@ extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 							   struct TupleTableSlot *slot);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
+extern IndexVacuumStrategy index_vacuum_strategy(IndexVacuumInfo *info,
+												 struct VacuumParams *params);
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
 												IndexBulkDeleteResult *stats,
 												IndexBulkDeleteCallback callback,
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 670a40b4be..5c48a48917 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -397,6 +397,8 @@ extern int64 gingetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void ginInitConsistentFunction(GinState *ginstate, GinScanKey key);
 
 /* ginvacuum.c */
+extern IndexVacuumStrategy ginvacuumstrategy(IndexVacuumInfo *info,
+											 struct VacuumParams *params);
 extern IndexBulkDeleteResult *ginbulkdelete(IndexVacuumInfo *info,
 											IndexBulkDeleteResult *stats,
 											IndexBulkDeleteCallback callback,
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 553d364e2d..303a18da4d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -533,6 +533,8 @@ extern void gistMakeUnionKey(GISTSTATE *giststate, int attno,
 extern XLogRecPtr gistGetFakeLSN(Relation rel);
 
 /* gistvacuum.c */
+extern IndexVacuumStrategy gistvacuumstrategy(IndexVacuumInfo *info,
+											  struct VacuumParams *params);
 extern IndexBulkDeleteResult *gistbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
 											 IndexBulkDeleteCallback callback,
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 1cce865be2..4c7e064708 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -372,6 +372,8 @@ extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
 extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					   ScanKey orderbys, int norderbys);
 extern void hashendscan(IndexScanDesc scan);
+extern IndexVacuumStrategy hashvacuumstrategy(IndexVacuumInfo *info,
+											  struct VacuumParams *params);
 extern IndexBulkDeleteResult *hashbulkdelete(IndexVacuumInfo *info,
 											 IndexBulkDeleteResult *stats,
 											 IndexBulkDeleteCallback callback,
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 7c62852e7f..9615194db6 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -563,17 +563,24 @@ do { \
 /*
  * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can
  * fit on one heap page.  (Note that indexes could have more, because they
- * use a smaller tuple header.)  We arrive at the divisor because each tuple
- * must be maxaligned, and it must have an associated line pointer.
+ * use a smaller tuple header.)
  *
- * Note: with HOT, there could theoretically be more line pointers (not actual
- * tuples) than this on a heap page.  However we constrain the number of line
- * pointers to this anyway, to avoid excessive line-pointer bloat and not
- * require increases in the size of work arrays.
+ * We used to constrain the number of line pointers, to avoid excessive
+ * line-pointer bloat and to avoid increasing the size of work arrays, by
+ * calculating the limit using the aligned size of the heap tuple header.
+ * But now that index vacuum strategies have entered the picture,
+ * accumulating LP_DEAD line pointers in a heap page is valuable for
+ * skipping index deletion.  So we relaxed the limitation to allow for a
+ * certain number of line pointers in a heap page that have no heap tuple,
+ * calculating the divisor using 1 MAXALIGN() quantum instead of the
+ * aligned size of the heap tuple header (3 MAXALIGN() quantums).
+ *
+ * Note that increasing this value also affects the TID bitmap, so there is
+ * a risk of introducing a performance regression for bitmap scans.
  */
 #define MaxHeapTuplesPerPage	\
 	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
-			(MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))
+			(MAXIMUM_ALIGNOF + sizeof(ItemIdData))))
 
 /*
  * MaxAttrSize is a somewhat arbitrary upper limit on the declared size of
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index cad4f2bdeb..ba120d4a80 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1011,6 +1011,8 @@ extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
+extern IndexVacuumStrategy btvacuumstrategy(IndexVacuumInfo *info,
+											struct VacuumParams *params);
 extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
 										   IndexBulkDeleteResult *stats,
 										   IndexBulkDeleteCallback callback,
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index 2eb2f421a8..f591b21ef1 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -212,6 +212,8 @@ extern bool spggettuple(IndexScanDesc scan, ScanDirection dir);
 extern bool spgcanreturn(Relation index, int attno);
 
 /* spgvacuum.c */
+extern IndexVacuumStrategy spgvacuumstrategy(IndexVacuumInfo *info,
+											 struct VacuumParams *params);
 extern IndexBulkDeleteResult *spgbulkdelete(IndexVacuumInfo *info,
 											IndexBulkDeleteResult *stats,
 											IndexBulkDeleteCallback callback,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 191cbbd004..f2590c3b6e 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -21,6 +21,7 @@
 #include "parser/parse_node.h"
 #include "storage/buf.h"
 #include "storage/lock.h"
+#include "utils/rel.h"
 #include "utils/relcache.h"
 
 /*
@@ -184,19 +185,6 @@ typedef struct VacAttrStats
 #define VACOPT_SKIPTOAST 0x40	/* don't process the TOAST table, if any */
 #define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip any pages */
 
-/*
- * A ternary value used by vacuum parameters.
- *
- * DEFAULT value is used to determine the value based on other
- * configurations, e.g. reloptions.
- */
-typedef enum VacOptTernaryValue
-{
-	VACOPT_TERNARY_DEFAULT = 0,
-	VACOPT_TERNARY_DISABLED,
-	VACOPT_TERNARY_ENABLED,
-} VacOptTernaryValue;
-
 /*
  * Parameters customizing behavior of VACUUM and ANALYZE.
  *
@@ -216,8 +204,10 @@ typedef struct VacuumParams
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which  verbose logs are activated, -1
 									 * to use default */
-	VacOptTernaryValue index_cleanup;	/* Do index vacuum and cleanup,
-										 * default value depends on reloptions */
+	VacOptTernaryValue index_cleanup;	/* Do index vacuum and cleanup. In
+										 * default mode, it's decided based on
+										 * multiple factors. See
+										 * choose_vacuum_strategy. */
 	VacOptTernaryValue truncate;	/* Truncate empty pages at the end,
 									 * default value depends on reloptions */
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 10b63982c0..168dc5d466 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -295,6 +295,20 @@ typedef struct AutoVacOpts
 	float8		analyze_scale_factor;
 } AutoVacOpts;
 
+/*
+ * A ternary value used by vacuum parameters.  This value is also used
+ * for VACUUM command options.
+ *
+ * DEFAULT value is used to determine the value based on other
+ * configurations, e.g. reloptions.
+ */
+typedef enum VacOptTernaryValue
+{
+	VACOPT_TERNARY_DEFAULT = 0,
+	VACOPT_TERNARY_DISABLED,
+	VACOPT_TERNARY_ENABLED,
+} VacOptTernaryValue;
+
 typedef struct StdRdOptions
 {
 	int32		vl_len_;		/* varlena header (do not touch directly!) */
@@ -304,7 +318,8 @@ typedef struct StdRdOptions
 	AutoVacOpts autovacuum;		/* autovacuum-related options */
 	bool		user_catalog_table; /* use as an additional catalog relation */
 	int			parallel_workers;	/* max number of parallel workers */
-	bool		vacuum_index_cleanup;	/* enables index vacuuming and cleanup */
+	VacOptTernaryValue	vacuum_index_cleanup;	/* enables index vacuuming
+												 * and cleanup */
 	bool		vacuum_truncate;	/* enables vacuum to truncate a relation */
 } StdRdOptions;
 
diff --git a/src/test/modules/test_ginpostinglist/expected/test_ginpostinglist.out b/src/test/modules/test_ginpostinglist/expected/test_ginpostinglist.out
index 4d0beaecea..8ad3e998e1 100644
--- a/src/test/modules/test_ginpostinglist/expected/test_ginpostinglist.out
+++ b/src/test/modules/test_ginpostinglist/expected/test_ginpostinglist.out
@@ -6,11 +6,11 @@ CREATE EXTENSION test_ginpostinglist;
 SELECT test_ginpostinglist();
 NOTICE:  testing with (0, 1), (0, 2), max 14 bytes
 NOTICE:  encoded 2 item pointers to 10 bytes
-NOTICE:  testing with (0, 1), (0, 291), max 14 bytes
+NOTICE:  testing with (0, 1), (0, 680), max 14 bytes
 NOTICE:  encoded 2 item pointers to 10 bytes
-NOTICE:  testing with (0, 1), (4294967294, 291), max 14 bytes
+NOTICE:  testing with (0, 1), (4294967294, 680), max 14 bytes
 NOTICE:  encoded 1 item pointers to 8 bytes
-NOTICE:  testing with (0, 1), (4294967294, 291), max 16 bytes
+NOTICE:  testing with (0, 1), (4294967294, 680), max 16 bytes
 NOTICE:  encoded 2 item pointers to 16 bytes
  test_ginpostinglist 
 ---------------------
-- 
2.27.0

#23Zhihong Yu
zyu@yugabyte.com
In reply to: Masahiko Sawada (#21)
Re: New IndexAM API controlling index vacuum strategies

Hi,
bq. We can mention in the commit log that since the commit changes
MaxHeapTuplesPerPage the encoding in gin posting list is also changed.

Yes - this is fine.

Thanks

On Mon, Jan 25, 2021 at 12:28 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

(Please avoid top-posting on the mailing lists[1]: top-posting breaks
the logic of a thread.)

On Tue, Jan 19, 2021 at 12:02 AM Zhihong Yu <zyu@yugabyte.com> wrote:

Hi, Masahiko-san:

Thank you for reviewing the patch!

For v2-0001-Introduce-IndexAM-API-for-choosing-index-vacuum-s.patch :

For blvacuumstrategy():

+   if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+       return INDEX_VACUUM_STRATEGY_NONE;
+   else
+       return INDEX_VACUUM_STRATEGY_BULKDELETE;

The 'else' can be omitted.

Yes, but I'd prefer to leave it as it is because it's more readable
without any performance side effect that we return BULKDELETE if index
cleanup is enabled.

Similar comment for ginvacuumstrategy().

For v2-0002-Choose-index-vacuum-strategy-based-on-amvacuumstr.patch :

If index_cleanup option is specified neither VACUUM command nor
storage option

I think this is what you meant (by not using passive voice):

If index_cleanup option specifies neither VACUUM command nor
storage option,

- * integer, but you can't fit that many items on a page. 11 ought to be more

+ * integer, but you can't fit that many items on a page. 13 ought to be more

It would be nice to add a note why the number of bits is increased.

I think that it might be better to mention such update history in the
commit log rather than in the source code. Because most readers are
likely to be interested in why 12 bits for offset number is enough,
rather than why this value has been increased. In the source code
comment, we describe why 12 bits for offset number is enough. We can
mention in the commit log that since the commit changes
MaxHeapTuplesPerPage the encoding in gin posting list is also changed.
What do you think?

For choose_vacuum_strategy():

+ IndexVacuumStrategy ivstrat;

The variable is only used inside the loop. You can use

vacrelstats->ivstrategies[i] directly and omit the variable.

Fixed.

+ int ndeaditems_limit = (int) ((freespace / sizeof(ItemIdData)) * 0.7);

How was the factor of 0.7 determined? Comment below only mentioned
'safety factor' but not how it was chosen.

I also wonder if this factor should be exposed as GUC.

Fixed.

+ if (nworkers > 0)
+ nworkers--;

Should log / assert be added when nworkers is <= 0 ?

Hmm I don't think so. As far as I read the code, there is no
possibility that nworkers can be lower than 0 (we always increment it)
and actually, the code works fine even if nworkers is a negative
value.

+ * XXX: allowing to fill the heap page with only line pointer seems a overkill.

'a overkill' -> 'an overkill'

Fixed.

The above comments are incorporated into the latest patch I just posted[2].

[1] https://en.wikipedia.org/wiki/Posting_style#Top-posting
[2]
/messages/by-id/CAD21AoCS94vK1fs-_=R5J3tp2DsZPq9zdcUu2pk6fbr7BS7quA@mail.gmail.com

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#24Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#19)
Re: New IndexAM API controlling index vacuum strategies

On Fri, Jan 22, 2021 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Jan 19, 2021 at 2:57 PM Peter Geoghegan <pg@bowt.ie> wrote:

Looks good. I'll give this version a review now. I will do a lot more
soon. I need to come up with a good benchmark for this, that I can
return to again and again during review as needed.

I performed another benchmark, similar to the last one but with the
latest version (v2), and over a much longer period. Attached is a
summary of the whole benchmark, and log_autovacuum output from the
logs of both the master branch and the patch.

Thank you for performing the benchmark!

This was pgbench scale 2000, 4 indexes on pgbench_accounts, and a
transaction with one update and two selects. Each run was 4 hours, and
we alternate between patch and master for each run, and alternate
between 16 and 32 clients. There were 8 4 hour runs in total, meaning
the entire set of runs took 8 * 4 hours = 32 hours (not including
initial load time and a few other small things like that). I used a
10k TPS rate limit, so TPS isn't interesting here. Latency is
interesting -- we see a nice improvement in latency (i.e. a reduction)
for the patch (see all.summary.out).

What value is set to fillfactor?

The benefits of the patch are clearly visible when I drill down and
look at the details. Each pgbench_accounts autovacuum VACUUM operation
can finish faster with the patch because they can often skip at least
some indexes (usually the PK, sometimes 3 out of 4 indexes total). But
it's more subtle than some might assume. We're skipping indexes that
VACUUM actually would have deleted *some* index tuples from, which is
very good. Bottom-up index deletion is usually lazy, and only
occasionally very eager, so you still have plenty of "floating
garbage" index tuples in most pages. And now we see VACUUM behave a
little more like bottom-up index deletion -- it is lazy when that is
appropriate (with indexes that really only have floating garbage that
is spread diffusely throughout the index structure), and eager when
that is appropriate (with indexes that get much more garbage).

That's very good. I'm happy that this patch efficiently utilizes
bottom-up index deletion feature.

Looking at the relation size growth, there is almost no difference
between master and patched in spite of skipping some vacuums in the
patched test, which is also good.

The benefit is not really that we're avoiding doing I/O for index
vacuuming (though that is one of the smaller benefits here). The real
benefit is that VACUUM is not dirtying pages, since it skips indexes
when it would be "premature" to vacuum them from an efficiency point
of view. This is important because we know that Postgres throughput is
very often limited by page cleaning. Also, the "economics" of this new
behavior make perfect sense -- obviously it's more efficient to delay
garbage cleanup until the point when the same page will be modified by
a backend anyway -- in the case of this benchmark via bottom-up index
deletion (which deletes all garbage tuples in the leaf page at the
point that it runs for a subset of pointed-to heap pages -- it's not
using an oldestXmin cutoff from 30 minutes ago). So whenever we dirty
a page, we now get more value per additional-page-dirtied.

I believe that controlling the number of pages dirtied by VACUUM is
usually much more important than reducing the amount of read I/O from
VACUUM, for reasons I go into on the recent "vacuum_cost_page_miss
default value and modern hardware" thread. As a further consequence of
all this, VACUUM can go faster safely and sustainably (since the cost
limit is not affected so much by vacuum_cost_page_miss), which has its
own benefits (e.g. oldestXmin cutoff doesn't get so old towards the
end).

Another closely related huge improvement that we see here is that the
number of FPIs generated by VACUUM can be significantly reduced. This
cost is closely related to the cost of dirtying pages, but it's worth
mentioning separately. You'll see some of that in the log_autovacuum
log output I attached.

Makes sense.

There is an archive with much more detailed information, including
dumps from most pg_stat_* views at key intervals. This has way more
information than anybody is likely to want:

https://drive.google.com/file/d/1OTiErELKRZmYnuJuczO2Tfcm1-cBYITd/view?usp=sharing

I did notice a problem, though. I now think that the criteria for
skipping an index vacuum in the third patch from the series is too
conservative, and that this led to an excessive number of index
vacuums with the patch.

Maybe that's why there are 5 autovacuum runs on pgbench_accounts in
the master branch whereas there are 7 runs with the patch?

This is probably because there was a tiny
number of page splits in some of the indexes that were not really
supposed to grow. I believe that this is caused by ANALYZE running --
I think that it prevented bottom-up deletion from keeping a few of the
hottest pages from splitting (that can take 5 or 6 seconds) at a few
points over the 32 hour run. For example, the index named "tenner"
grew by 9 blocks, starting out at 230,701 and ending up at 230,710 (to
see this, extract the files from the archive and "diff
patch.r1c16.initial_pg_relation_size.out
patch.r2c32.after_pg_relation_size.out").

I now think that 0 blocks added is unnecessarily restrictive -- a
small tolerance still seems like a good idea, though (let's still be
somewhat conservative about it).

Agreed.

Maybe a better criteria would be for nbtree to always proceed with
index vacuuming when the index size is less than 2048 blocks (16MiB
with 8KiB BLCKSZ). If an index is larger than that, then compare the
last/old block count to the current block count (at the point that we
decide if index vacuuming is going to go ahead) by rounding up both
values to the next highest 2048 block increment. This formula is
pretty arbitrary, but probably works as well as most others. It's a
good iteration for the next version of the patch/further testing, at
least.
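
To make that rule concrete, here is a rough sketch of how it might look
(illustrative only -- the helper and names are invented, not taken from
any posted patch):

/*
 * Hypothetical sketch of the "2048-block increments" idea above: small
 * indexes always do bulk-deletion, larger indexes only do it when they
 * have grown into a new 2048-block size bucket since the last VACUUM.
 */
#define VACUUM_SIZE_BUCKET	2048	/* 16MiB with 8KiB BLCKSZ */

static BlockNumber
round_up_to_bucket(BlockNumber nblocks)
{
	return ((nblocks + VACUUM_SIZE_BUCKET - 1) / VACUUM_SIZE_BUCKET) *
		VACUUM_SIZE_BUCKET;
}

static bool
btvacuum_wants_bulkdelete(BlockNumber last_nblocks, BlockNumber cur_nblocks)
{
	/* Always proceed with bulk-deletion for small indexes */
	if (cur_nblocks < VACUUM_SIZE_BUCKET)
		return true;

	/* Otherwise, only proceed if the index grew into a new bucket */
	return round_up_to_bucket(cur_nblocks) > round_up_to_bucket(last_nblocks);
}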

Also makes sense to me. The patch I recently submitted doesn't include
it but I'll do that in the next version patch.

Maybe the same is true for the heap? I mean that skipping heap
vacuuming on a too-small table brings no benefit, only bloat. I think
we could proceed with heap vacuuming if a table is smaller than a
threshold, even if one of the indexes wanted to skip.

BTW, it would be nice if there was more instrumentation, say in the
log output produced when log_autovacuum is on. That would make it
easier to run these benchmarks -- I could verify my understanding of
the work done for each particular av operation represented in the log.
Though the default log_autovacuum log output is quite informative, it
would be nice if the specifics were more obvious (maybe this could
just be for the review/testing, but it might become something for
users if it seems useful).

Yeah, I think the following information would also be helpful:

* whether we vacuumed the heap or skipped it
* how many indexes did/didn't do bulk-deletion
* time spent in each vacuum phase
etc
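
Just to illustrate, such extra log_autovacuum detail could look roughly
like this (a sketch only -- vacuum_heap comes from the patch, but
nindexes_bulkdel is a made-up counter that lazy vacuum would have to
track):

/* Hypothetical extra log_autovacuum detail, not in the posted patch */
ereport(LOG,
		(errmsg("table \"%s\": heap vacuum: %s, index bulk-deletion: %d of %d indexes",
				vacrelstats->relname,
				vacrelstats->vacuum_heap ? "done" : "skipped",
				nindexes_bulkdel, nindexes)));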

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In reply to: Masahiko Sawada (#24)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Jan 26, 2021 at 10:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

What value is set to fillfactor?

90, same as before.

That's very good. I'm happy that this patch efficiently utilizes
bottom-up index deletion feature.

Me too!

Looking at the relation size growth, there is almost no difference
between master and patched in spite of skipping some vacuums in the
patched test, which is also good.

Right. Stability is everything here. Actually I think that most
performance problems in Postgres are mostly about stability if you
really look into it.

I did notice a problem, though. I now think that the criteria for
skipping an index vacuum in the third patch from the series is too
conservative, and that this led to an excessive number of index
vacuums with the patch.

Maybe that's why there are 5 autovacuum runs on pgbench_accounts in
the master branch whereas there are 7 runs in the patched?

Probably, but it might also be due to some other contributing factor.
There is still very little growth in the size of the indexes, and the
PK still has zero growth. The workload consists of 32 hours of a
10ktps workload, so I imagine that there is opportunity for some
extremely rare event to happen a few times. Too tired to think about
it in detail right now.

It might also be related to the simple fact that only one VACUUM
process may run against a table at any given time! With a big enough
table, and several indexes, and reasonably aggressive av settings,
it's probably almost impossible for autovacuum to "keep up" (in the
exact sense that the user asks for by having certain av settings).
This must be taken into account in some general way --

It's a bit tricky to interpret results here, generally speaking,
because there are probably a few things like that. To me, the most
important thing is that the new behavior "makes sense" in some kind of
general way, that applies across a variety of workloads. It may not be
possible to directly compare master and patch like this and arrive at
one simple number that is fair. If we really wanted one simple
benchmark number, maybe we'd have to tune the patch and master
separately -- which doesn't *seem* fair.

Also makes sense to me. The patch I recently submitted doesn't include
it but I'll do that in the next version patch.

Great!

Maybe the same is true for heap? I mean that skipping heap vacuum on a
too-small table will not bring the benefit but bloat. I think we could
proceed with heap vacuum if a table is smaller than a threshold, even
if one of the indexes wanted to skip.

I think that you're probably right about that. It isn't a problem for
v2 in practice because the bloat will reliably cause LP_DEAD line
pointers to accumulate in heap pages, so you VACUUM anyway -- this is
certainly what you *always* see in the small pgbench tables with the
default workload. But even then -- why not be careful? I agree that
there should be some kind of limit on table size that applies here --
a size at which we'll never apply any of these optimizations, no
matter what.

Yeah, I think the following information would also be helpful:

* did vacuum heap? or skipped it?
* how many indexes did/didn't bulk-deletion?
* time spent for each vacuum phase.

That list looks good -- in general I don't like that log_autovacuum
cannot ever have the VACUUM VERBOSE per-index output -- maybe that
could be revisited soon? I remember reading your e-mail about this on
a recent thread, and I imagine that you already saw the connection
yourself.

It'll be essential to have good instrumentation as we do more
benchmarking. We're probably going to have to make subjective
assessments of benchmark results, based on multiple factors. That will
probably be the only practical way to assess how much better (or
worse) the patch is compared to master. This patch is more about
efficiency and predictability than performance per se. Which is good,
because that's where most of the real world problems actually are.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#25)
Re: New IndexAM API controlling index vacuum strategies

On Fri, Jan 29, 2021 at 5:26 PM Peter Geoghegan <pg@bowt.ie> wrote:

It'll be essential to have good instrumentation as we do more
benchmarking. We're probably going to have to make subjective
assessments of benchmark results, based on multiple factors. That will
probably be the only practical way to assess how much better (or
worse) the patch is compared to master. This patch is more about
efficiency and predictability than performance per se. Which is good,
because that's where most of the real world problems actually are.

I've been thinking about how to get this patch committed for
PostgreSQL 14. This will probably require cutting scope, so that the
initial commit is not so ambitious. I think that "incremental VACUUM"
could easily take up a lot of my time for Postgres 15, and maybe even
Postgres 16.

I'm starting to think that the right short term goal should not
directly involve bottom-up index deletion. We should instead return to
the idea of "unifying" the vacuum_cleanup_index_scale_factor feature
with the INDEX_CLEANUP feature, which is kind of where this whole idea
started out at. This short term goal is much more than mere
refactoring. It is still a whole new user-visible feature. The patch
would teach VACUUM to skip doing any real index work within both
ambulkdelete() and amvacuumcleanup() in many important cases.

Here is a more detailed explanation:

Today we can skip all significant work in ambulkdelete() and
amvacuumcleanup() when there are zero dead tuples in the table. But
why is the threshold *precisely* zero? If we could treat cases that
have "practically zero" dead tuples in the same way (or almost the
same way) as cases with *exactly* zero dead tuple, that's still a big
improvement. And it still sets an important precedent that is crucial
for the wider "incremental VACUUM" project: the criteria for
triggering index vacuuming becomes truly "fuzzy" for the first time.
It is "fuzzy" in the sense that index vacuuming might not happen
during VACUUM at all now, even when the user didn't explicitly use
VACUUUM's INDEX_CLEANUP option, and even when more than *precisely*
zero dead index tuples are involved (though not *much* more than zero,
can't be too aggressive). That really is a big change.

A recap on vacuum_cleanup_index_scale_factor, just to avoid confusion:

The reader should note that this is very different to Masahiko's
vacuum_cleanup_index_scale_factor project, which skips *cleanup* in
VACUUM (not bulk delete), a question which only comes up when there
are definitely zero dead index tuples. The unifying work I'm talking
about now implies that we completely avoid scanning indexes during
vacuum, even when they are known to have at least a few dead index
tuples, and even when VACUUM's INDEX_CLEANUP emergency option is not
in use. Which, as I just said, is a big change.

Thoughts on triggering criteria for new "unified" design, ~99.9%
append-only tables:

Actually, in *one* sense the difference between "precisely zero" and
"practically zero" here *is* small. But it's still probably going to
result in skipping reading indexes during VACUUM in many important
cases. Like when you must VACUUM a table that is ~99.9% append-only.
In the real world things are rarely in exact discrete categories, even
when we imagine that they are. It's too easy to be wrong about one
tiny detail -- like one tiny UPDATE from 4 weeks ago, perhaps. Having
a tiny amount of "forgiveness" here is actually a *huge* improvement
on having precisely zero forgiveness. Small and big.

This should help cases that get big surprising spikes due to
anti-wraparound vacuums that must vacuum indexes for the first time in
ages -- indexes may be vacuumed despite only having a tiny absolute
number of dead tuples. I don't think that it's necessary to treat
anti-wraparound vacuums as special at all (not in Postgres 14 and
probably not ever), because simply considering cases where the table
has "practically zero" dead tuples alone should be enough. Vacuuming a
10GB index to delete only 10 tuples simply makes no sense. It doesn't
necessarily matter how we end up there, it just shouldn't happen.

The ~99.9% append-only table case is likely to be important and common
in the real world. We should start there for Postgres 14 because it's
easier, that's all. It's not fundamentally different to what happens
in workloads involving lots of bottom-up deletion -- it's just
simpler, and easier to reason about. Bottom-up deletion is an
important piece of the big puzzle here, but some variant of
"incremental VACUUM" really would still make sense in a world where
bottom-up index deletion does not exist. (In fact, I started thinking
about "incremental VACUUM" before bottom-up index deletion, and didn't
make any connection between them until work on bottom-up deletion had
already progressed significantly.)

Here is how the triggering criteria could work: maybe skipping
accessing all indexes during VACUUM happens when less than 1% or
10,000 of the items from the table are to be removed by VACUUM --
whichever is greater. Of course this is just the first thing I thought
of. It's a starting point for further discussion.
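
To make the arithmetic concrete, a minimal sketch of that check (the
numbers are just the ones floated above, and the function name is
invented for illustration):

/*
 * Hypothetical sketch only.  Skip index vacuuming when the number of
 * dead items is below 1% of the table's tuples, or below 10,000 --
 * whichever is greater.
 */
static bool
index_vacuuming_worthwhile(double reltuples, long num_dead_tuples)
{
	double		threshold = Max(reltuples * 0.01, 10000);

	return num_dead_tuples >= threshold;
}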

My concerns won't be a surprise to you, Masahiko, but I'll list them
for the record. The bottom-up index deletion related complexity that I
want to avoid dealing with for Postgres 14 is in the following areas
(areas that Masahiko's patch dealt with):

* No need to teach indexes to do the amvacuumstrategy() stuff in
Postgres 14 -- so no need to worry about the exact criteria used
within AMs like nbtree to determine whether or not index vacuuming
seems appropriate from the "selfish" perspective of one particular
index.

I'm concerned that factors like bulk DELETEs, that may complicate
things for the amvacuumstrategy() routine -- doing something
relatively simple based on the recent growth of the index might have
downsides. Balancing competing considerations is hard.

* No need to change MaxHeapTuplesPerPage for now, since that only
really makes sense in cases that heavily involve bottom-up deletion,
where we care about the *concentration* of LP_DEAD line pointers in
heap pages (and not just the absolute number in the entire table),
which is qualitative, not quantitative (somewhat like bottom-up
deletion).

The change to MaxHeapTuplesPerPage that Masahiko has proposed does
make sense -- there are good reasons to increase it. Of course there
are also good reasons to not do so. I'm concerned that we won't have
time to think through all the possible consequences.

* Since "practically zero" dead tuples from a table still isn't very
many, the risk of "leaking" many deleted pages due to a known issue
with INDEX_CLEANUP in nbtree [1]/messages/by-id/CA+TgmoYD7Xpr1DWEWWXxiw4-WC1NBJf3Rb9D2QGpVYH9ejz9fA@mail.gmail.com is much less significant. (FWIW I
doubt that skipping index vacuuming is the only way that we can fail
to recycle deleted pages anyway -- the FSM is not crash safe, of
course, plus I think that _bt_page_recyclable() might be broken in
other ways.)

In short: we can cut scope and de-risk the patch for Postgres 14 by
following this plan, while still avoiding unnecessary index vacuuming
within VACUUM in certain important cases. The high-level goal for this
patch has always been to recognize that index vacuuming is basically
wasted effort in certain cases. Cutting scope here merely means
addressing the relatively easy cases first, where simple triggering
logic will clearly be effective. I still strongly believe in
"incremental VACUUM".

What do you think of cutting scope like this for Postgres 14,
Masahiko? Sorry to change my mind, but I had to see the prototype to
come to this decision.

[1]: /messages/by-id/CA+TgmoYD7Xpr1DWEWWXxiw4-WC1NBJf3Rb9D2QGpVYH9ejz9fA@mail.gmail.com
--
Peter Geoghegan

In reply to: Peter Geoghegan (#26)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Feb 1, 2021 at 7:17 PM Peter Geoghegan <pg@bowt.ie> wrote:

Here is how the triggering criteria could work: maybe skipping
accessing all indexes during VACUUM happens when less than 1% or
10,000 of the items from the table are to be removed by VACUUM --
whichever is greater. Of course this is just the first thing I thought
of. It's a starting point for further discussion.

And now here is the second thing I thought of, which is much better:

Sometimes 1% of the dead tuples in a heap relation will be spread
across 90%+ of the pages. With other workloads 1% of dead tuples might
be highly concentrated, and appear in no more than 1% of all heap
pages. Obviously the distinction between these two cases/workloads
matters a lot. And so the triggering criteria must be quantitative
*and* qualitative. It should not be based on counting dead tuples,
since that alone won't differentiate these two extreme cases - both of
which are probably quite common (in the real world extremes are
actually the normal and common case IME).

I like the idea of basing it on counting *heap blocks*, not dead
tuples. We can count heap blocks that have *at least* one dead tuple
(of course it doesn't matter how they're dead, whether it was this
VACUUM operation or some earlier opportunistic pruning). Note in
particular that it should not matter if it's a heap block that has
only one LP_DEAD line pointer or a heap page that is near the
MaxHeapTuplesPerPage limit for the page -- we count either type of
page towards the heap-page based limit used to decide if index
vacuuming goes ahead for all indexes during VACUUM.
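
A sketch of that block-based variant of the same check (again purely
illustrative -- the counter would have to be maintained while scanning
the heap, and the names are invented):

/*
 * Hypothetical sketch only.  pages_with_dead_items counts heap pages
 * that contain at least one dead item (whether a single LP_DEAD line
 * pointer or many dead tuples), accumulated during the heap scan.
 */
static bool
index_vacuuming_worthwhile(BlockNumber rel_pages, BlockNumber pages_with_dead_items)
{
	/* Skip index vacuuming when < 1% of heap pages have any dead items */
	return pages_with_dead_items >= rel_pages * 0.01;
}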

This removes almost all of the issues with not setting visibility map
bits reliably (another concern of mine that I forget to mention
earlier), but it is still very likely to do the right thing in the
"~99.9% append-only table" case I mentioned in the last email. We can
be relatively aggressive (say by triggering index skipping when less
than 1% of all heap pages have dead tuples). Plus the new nbtree index
tuple deletion stuff is naturally very good at deleting index tuples
in the event of dead tuples becoming concentrated in relatively few
table blocks -- that helps too. This is true of the enhanced simple
deletion mechanism (which has been taught to be clever about table
block dead tuple concentrations/table layout), as well as bottom-up
index deletion.

--
Peter Geoghegan

#28Victor Yegorov
vyegorov@gmail.com
In reply to: Peter Geoghegan (#27)
Re: New IndexAM API controlling index vacuum strategies

вт, 2 февр. 2021 г. в 05:27, Peter Geoghegan <pg@bowt.ie>:

And now here is the second thing I thought of, which is much better:

Sometimes 1% of the dead tuples in a heap relation will be spread
across 90%+ of the pages. With other workloads 1% of dead tuples might
be highly concentrated, and appear in no more than 1% of all heap
pages. Obviously the distinction between these two cases/workloads
matters a lot. And so the triggering criteria must be quantitative
*and* qualitative. It should not be based on counting dead tuples,
since that alone won't differentiate these two extreme cases - both of
which are probably quite common (in the real world extremes are
actually the normal and common case IME).

I like the idea of basing it on counting *heap blocks*, not dead
tuples. We can count heap blocks that have *at least* one dead tuple
(of course it doesn't matter how they're dead, whether it was this
VACUUM operation or some earlier opportunistic pruning). Note in
particular that it should not matter if it's a heap block that has
only one LP_DEAD line pointer or a heap page that is near the
MaxHeapTuplesPerPage limit for the page -- we count either type of
page towards the heap-page based limit used to decide if index
vacuuming goes ahead for all indexes during VACUUM.

I really like this idea!

It resembles the approach used in bottom-up index deletion: block-based
accounting provides a better estimate of the usefulness of the operation.

I suppose that 1% threshold should be configurable as a cluster-wide GUC
and also as a table storage parameter?

--
Victor Yegorov

In reply to: Victor Yegorov (#28)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Feb 2, 2021 at 6:28 AM Victor Yegorov <vyegorov@gmail.com> wrote:

I really like this idea!

Cool!

It resembles the approach used in bottom-up index deletion, block-based
accounting provides a better estimate for the usefulness of the operation.

It does resemble bottom-up index deletion, in one important general
sense: it is somewhat qualitative (though *also* somewhat quantitive).
This is new for vacuumlazy.c. But the idea now is to deemphasize
bottom-up index deletion heavy workloads in the first version of this
patch -- just to cut scope.

The design I described yesterday centers around "~99.9% append-only
table" workloads, where anti-wraparound vacuums that scan indexes are
a big source of unnecessary work (in practice it is always
anti-wraparound vacuums, simply because there will never be enough
garbage to trigger a regular autovacuum run). But it now occurs to me
that there is another very important case that it will also help,
without making the triggering condition for index vacuuming any more
complicated: it will help cases where HOT updates are expected
(because all updates don't modify indexed columns).

It's practically impossible for HOT updates to occur 100% of the time,
even with workloads whose updates never modify indexed columns. You
can clearly see this by looking at the stats from pg_stat_user_tables
with a standard pgbench workload. It does get better with lower heap
fill factor, but I think that HOT is never 100% effective (i.e. 100%
of updates are HOT updates) in the real world -- unless maybe you set
heap fillfactor as low as 50, which is very rare. HOT might well be
95% effective, or 99% effective, but it's never truly 100% effective.
And so this is another important workload where the difference between
"practically zero dead tuples" and "precisely zero dead tuples"
*really* matters when deciding if a VACUUM operation needs to go
ahead.

Once again, a small difference, but also a big difference. Forgive me
for repeating myself so much, but: paying attention to cost/benefit
asymmetries like this one sometimes allow us to recognize an
optimization that is an "excellent deal". We saw this with bottom-up
index deletion. Seems good to keep an eye out for that.

I suppose that 1% threshold should be configurable as a cluster-wide GUC
and also as a table storage parameter?

Possibly. I'm concerned about making any user-visible interface (say a
GUC) compatible with an improved version that is smarter about
bottom-up index deletion (in particular, one that can vacuum only a
subset of the indexes on a table, which now seems too ambitious for
Postgres 14).

--
Peter Geoghegan

#30Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#26)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Feb 2, 2021 at 12:17 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Fri, Jan 29, 2021 at 5:26 PM Peter Geoghegan <pg@bowt.ie> wrote:

It'll be essential to have good instrumentation as we do more
benchmarking. We're probably going to have to make subjective
assessments of benchmark results, based on multiple factors. That will
probably be the only practical way to assess how much better (or
worse) the patch is compared to master. This patch is more about
efficiency and predictability than performance per se. Which is good,
because that's where most of the real world problems actually are.

I've been thinking about how to get this patch committed for
PostgreSQL 14. This will probably require cutting scope, so that the
initial commit is not so ambitious. I think that "incremental VACUUM"
could easily take up a lot of my time for Postgres 15, and maybe even
Postgres 16.

I'm starting to think that the right short term goal should not
directly involve bottom-up index deletion. We should instead return to
the idea of "unifying" the vacuum_cleanup_index_scale_factor feature
with the INDEX_CLEANUP feature, which is kind of where this whole idea
started out at. This short term goal is much more than mere
refactoring. It is still a whole new user-visible feature. The patch
would teach VACUUM to skip doing any real index work within both
ambulkdelete() and amvacuumcleanup() in many important cases.

I agree to cut the scope. I've also been thinking about the impact of
this patch on users.

I also think we still have a lot of things to consider. For example,
we need to evaluate how incremental vacuum works with larger tuples or
a larger fillfactor, and we need to discuss further whether leaving
LP_DEAD items in the space reserved by fillfactor is a good idea at
all. Also, we need to discuss the changes this patch makes to nbtree.
Since bottom-up index deletion is new code in PG14, any problem there
could be made worse by this feature, which relies on it. Perhaps we
would need some safeguard, and that also takes time. From that point
of view, I think it's a good idea to introduce these features in
different major versions. Given the current situation, I agree that
two months is too short to do all of this.

Here is a more detailed explanation:

Today we can skip all significant work in ambulkdelete() and
amvacuumcleanup() when there are zero dead tuples in the table. But
why is the threshold *precisely* zero? If we could treat cases that
have "practically zero" dead tuples in the same way (or almost the
same way) as cases with *exactly* zero dead tuple, that's still a big
improvement. And it still sets an important precedent that is crucial
for the wider "incremental VACUUM" project: the criteria for
triggering index vacuuming becomes truly "fuzzy" for the first time.
It is "fuzzy" in the sense that index vacuuming might not happen
during VACUUM at all now, even when the user didn't explicitly use
VACUUUM's INDEX_CLEANUP option, and even when more than *precisely*
zero dead index tuples are involved (though not *much* more than zero,
can't be too aggressive). That really is a big change.

A recap on vacuum_cleanup_index_scale_factor, just to avoid confusion:

The reader should note that this is very different to Masahiko's
vacuum_cleanup_index_scale_factor project, which skips *cleanup* in
VACUUM (not bulk delete), a question which only comes up when there
are definitely zero dead index tuples. The unifying work I'm talking
about now implies that we completely avoid scanning indexes during
vacuum, even when they are known to have at least a few dead index
tuples, and even when VACUUM's INDEX_CLEANUP emergency option is not
in use. Which, as I just said, is a big change.

If vacuum skips both ambulkdelete and amvacuumcleanup in that case,
I'm concerned that this could increase the number of users affected by
the known issue of leaking deleted pages. Currently, the only users
affected by that issue are those who run with INDEX_CLEANUP off. But
if we enable this feature by default, all users are potentially
affected by that issue.

Thoughts on triggering criteria for new "unified" design, ~99.9%
append-only tables:

Actually, in *one* sense the difference between "precisely zero" and
"practically zero" here *is* small. But it's still probably going to
result in skipping reading indexes during VACUUM in many important
cases. Like when you must VACUUM a table that is ~99.9% append-only.
In the real world things are rarely in exact discrete categories, even
when we imagine that they are. It's too easy to be wrong about one
tiny detail -- like one tiny UPDATE from 4 weeks ago, perhaps. Having
a tiny amount of "forgiveness" here is actually a *huge* improvement
on having precisely zero forgiveness. Small and big.

This should help cases that get big surprising spikes due to
anti-wraparound vacuums that must vacuum indexes for the first time in
ages -- indexes may be vacuumed despite only having a tiny absolute
number of dead tuples. I don't think that it's necessary to treat
anti-wraparound vacuums as special at all (not in Postgres 14 and
probably not ever), because simply considering cases where the table
has "practically zero" dead tuples alone should be enough. Vacuuming a
10GB index to delete only 10 tuples simply makes no sense. It doesn't
necessarily matter how we end up there, it just shouldn't happen.

Yeah, doing bulkdelete to delete only 10 tuples makes no sense. It
also dirties caches, which is bad.

To improve index tuple deletion in that case, skipping bulkdelete is a
good idea, although retail index deletion is also a possible solution.
I had thought retail index deletion would be appropriate for this
case, but since some index AMs cannot support it, skipping index scans
is a good solution anyway.

Given that autovacuum won't run on a table that has only 10 dead
tuples, we can assume that this case is likely an anti-wraparound
vacuum. So skipping all index scans during VACUUM only in the
anti-wraparound case (and only if the table has practically zero dead
tuples) could also be an option. This would reduce the opportunities
to skip index scans during vacuum, but it would also reduce the risk
of leaking deleted pages in nbtree.

The ~99.9% append-only table case is likely to be important and common
in the real world. We should start there for Postgres 14 because it's
easier, that's all. It's not fundamentally different to what happens
in workloads involving lots of bottom-up deletion -- it's just
simpler, and easier to reason about. Bottom-up deletion is an
important piece of the big puzzle here, but some variant of
"incremental VACUUM" really would still make sense in a world where
bottom-up index deletion does not exist. (In fact, I started thinking
about "incremental VACUUM" before bottom-up index deletion, and didn't
make any connection between them until work on bottom-up deletion had
already progressed significantly.)

Here is how the triggering criteria could work: maybe skipping
accessing all indexes during VACUUM happens when less than 1% or
10,000 of the items from the table are to be removed by VACUUM --
whichever is greater. Of course this is just the first thing I thought
of. It's a starting point for further discussion.

I also prefer your second idea :)
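
To make the quoted "1% or 10,000 items, whichever is greater" criteria
concrete, here is a minimal sketch (the standalone helper and its names
are hypothetical; the real logic would presumably live in vacuumlazy.c):

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the proposed triggering criteria: skip index vacuuming when
 * fewer than 1% of the table's items -- or 10,000 items, whichever is
 * greater -- would be removed by this VACUUM.
 */
static bool
skip_index_vacuuming(int64_t items_to_remove, int64_t total_items)
{
    double  threshold = total_items * 0.01;

    if (threshold < 10000)
        threshold = 10000;

    return items_to_remove < threshold;
}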

My concerns won't be a surprise to you, Masahiko, but I'll list them
for the record. The bottom-up index deletion related complexity that I
want to avoid dealing with for Postgres 14 is in the following areas
(areas that Masahiko's patch dealt with):

* No need to teach indexes to do the amvacuumstrategy() stuff in
Postgres 14 -- so no need to worry about the exact criteria used
within AMs like nbtree to determine whether or not index vacuuming
seems appropriate from the "selfish" perspective of one particular
index.

I'm concerned about factors like bulk DELETEs that may complicate
things for the amvacuumstrategy() routine -- doing something
relatively simple based on the recent growth of the index might have
downsides. Balancing competing considerations is hard.

Agreed.

* No need to change MaxHeapTuplesPerPage for now, since that only
really makes sense in cases that heavily involve bottom-up deletion,
where we care about the *concentration* of LP_DEAD line pointers in
heap pages (and not just the absolute number in the entire table),
which is qualitative, not quantitative (somewhat like bottom-up
deletion).

The change to MaxHeapTuplesPerPage that Masahiko has proposed does
make sense -- there are good reasons to increase it. Of course there
are also good reasons to not do so. I'm concerned that we won't have
time to think through all the possible consequences.

Agreed.

* Since "practically zero" dead tuples from a table still isn't very
many, the risk of "leaking" many deleted pages due to a known issue
with INDEX_CLEANUP in nbtree [1] is much less significant. (FWIW I
doubt that skipping index vacuuming is the only way that we can fail
to recycle deleted pages anyway -- the FSM is not crash safe, of
course, plus I think that _bt_page_recyclable() might be broken in
other ways.)

As I mentioned above, I'm still concerned that the set of users
affected by the issue of leaking deleted pages could expand.
Currently, we don't have a way to detect how many index pages are
leaked. If there are cases where many deleted pages are leaked, this
feature would make things worse.

In short: we can cut scope and de-risk the patch for Postgres 14 by
following this plan, while still avoiding unnecessary index vacuuming
within VACUUM in certain important cases. The high-level goal for this
patch has always been to recognize that index vacuuming is basically
wasted effort in certain cases. Cutting scope here merely means
addressing the relatively easy cases first, where simple triggering
logic will clearly be effective. I still strongly believe in
"incremental VACUUM".

What do you think of cutting scope like this for Postgres 14,
Masahiko? Sorry to change my mind, but I had to see the prototype to
come to this decision.

I agreed to cut the scope for PG14. It would be good if we could
improve index vacuum while cutting the scope for PG14 and not
expanding the extent of the impact of this issue.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#31Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#30)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Feb 3, 2021 at 8:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I'm starting to think that the right short term goal should not
directly involve bottom-up index deletion. We should instead return to
the idea of "unifying" the vacuum_cleanup_index_scale_factor feature
with the INDEX_CLEANUP feature, which is kind of where this whole idea
started out at. This short term goal is much more than mere
refactoring. It is still a whole new user-visible feature. The patch
would teach VACUUM to skip doing any real index work within both
ambulkdelete() and amvacuumcleanup() in many important cases.

I agree to cut the scope. I've also been thinking about the impact of
this patch on users.

It's probably also true that on balance users care more about the
"~99.9% append-only table" case (as well as the HOT updates workload I
brought up in response to Victor on February 2) than making VACUUM
very sensitive to how well bottom-up deletion is working. Yes, it's
annoying that VACUUM still wastes effort on indexes where bottom-up
deletion alone can do all required garbage collection. But that's not
going to be a huge surprise to users. Whereas the "~99.9% append-only
table" case causes huge surprises to users -- users hate this kind of
thing.

If vacuum skips both ambulkdelete and amvacuumcleanup in that case,
I'm concerned that this could increase the number of users affected by
the known issue of leaking deleted pages. Currently, the only users
affected by that issue are those who run with INDEX_CLEANUP off. But
if we enable this feature by default, all users are potentially
affected by that issue.

FWIW I think that it's unfair to blame INDEX_CLEANUP for any problems
in this area. The truth is that the design of the
deleted-page-recycling stuff has always caused leaked pages, even with
workloads that should not be challenging to the implementation in any
way. See my later remarks.

To improve index tuple deletion in that case, skipping bulkdelete is a
good idea, although retail index deletion is also a possible solution.
I had thought retail index deletion would be appropriate for this
case, but since some index AMs cannot support it, skipping index scans
is a good solution anyway.

The big problem with retail index tuple deletion is that it is not
possible once heap pruning takes place (opportunistic pruning, or
pruning performed by VACUUM). Pruning will destroy the information
that retail deletion needs to find the index tuple (the column
values).

I think that we probably will end up using retail index tuple
deletion, but it will only be one strategy among several. We'll never
be able to rely on it, even within nbtree. My personal opinion is that
completely eliminating VACUUM is not a useful long term goal.

Here is how the triggering criteria could work: maybe skipping
accessing all indexes during VACUUM happens when less than 1% or
10,000 of the items from the table are to be removed by VACUUM --
whichever is greater. Of course this is just the first thing I thought
of. It's a starting point for further discussion.

I also prefer your second idea :)

Cool. Yeah, I always like it when the potential downside of a design
is obviously low, and the potential upside is obviously very high. I
am much less concerned about any uncertainty around when and how users
will get the big upside. I like it when my problems are "luxury
problems". :-)

As I mentioned above, I'm still concerned that the set of users
affected by the issue of leaking deleted pages could expand.
Currently, we don't have a way to detect how many index pages are
leaked. If there are cases where many deleted pages are leaked, this
feature would make things worse.

The existing problems here were significant even before you added
INDEX_CLEANUP. For example, let's say VACUUM deletes a page, and then
later recycles it in the normal/correct way -- this is the simplest
possible case for page deletion. The page is now in the FSM, ready to
be recycled during the next page split. Or is it? Even in this case
there are no guarantees! This is because _bt_getbuf() does not fully
trust the FSM to give it a 100% recycle-safe page for its page split
caller -- _bt_getbuf() checks the page using _bt_page_recyclable()
(which is the same check that VACUUM does to decide a deleted page is
now recyclable). Obviously this means that the FSM can "leak" a page,
just because there happened to be no page splits before wraparound
occurred (and so now _bt_page_recyclable() thinks the very old page is
very new/in the future).

In general the recycling stuff feels ridiculously over engineered to
me. It is true that page deletion is intrinsically complicated, and is
worth having -- that makes sense to me. But the complexity of the
recycling stuff seems ridiculous.

There is only one sensible solution: follow the example of commit
6655a7299d8 in nbtree. This commit fully fixed exactly the same
problem in GiST by storing an epoch alongside the XID. This nbtree fix
is even anticipated by the commit message of 6655a7299d8. I can take
care of this myself for Postgres 14.
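
To illustrate the shape of that fix (in the spirit of commit
6655a7299d8; the type and function below are made up for illustration,
not the actual nbtree code):

#include <stdbool.h>
#include <stdint.h>

/* 64-bit "epoch + XID" value, in the spirit of FullTransactionId */
typedef uint64_t FullXid;       /* (epoch << 32) | xid */

/*
 * With only a 32-bit XID stamped on a deleted page, the page can appear
 * to be "in the future" after XID wraparound, so it is never recycled
 * (leaked).  Comparing full 64-bit values removes that ambiguity.
 */
static bool
deleted_page_safe_to_recycle(FullXid page_deleted_fxid,
                             FullXid oldest_relevant_fxid)
{
    return page_deleted_fxid < oldest_relevant_fxid;
}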

I agreed to cut the scope for PG14. It would be good if we could
improve index vacuum while cutting the scope for PG14 and not
expanding the extent of the impact of this issue.

Great! Well, if I take care of the _bt_page_recyclable()
wraparound/epoch issue in a general kind of way then AFAICT there is
no added risk.

--
Peter Geoghegan

#32Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#31)
Re: New IndexAM API controlling index vacuum strategies

On Sat, Feb 6, 2021 at 5:02 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Feb 3, 2021 at 8:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I'm starting to think that the right short term goal should not
directly involve bottom-up index deletion. We should instead return to
the idea of "unifying" the vacuum_cleanup_index_scale_factor feature
with the INDEX_CLEANUP feature, which is kind of where this whole idea
started out at. This short term goal is much more than mere
refactoring. It is still a whole new user-visible feature. The patch
would teach VACUUM to skip doing any real index work within both
ambulkdelete() and amvacuumcleanup() in many important cases.

I agree to cut the scope. I've also been thinking about the impact of
this patch on users.

It's probably also true that on balance users care more about the
"~99.9% append-only table" case (as well as the HOT updates workload I
brought up in response to Victor on February 2) than making VACUUM
very sensitive to how well bottom-up deletion is working. Yes, it's
annoying that VACUUM still wastes effort on indexes where bottom-up
deletion alone can do all required garbage collection. But that's not
going to be a huge surprise to users. Whereas the "~99.9% append-only
table" case causes huge surprises to users -- users hate this kind of
thing.

Agreed.

If vacuum skips both ambulkdelete and amvacuumcleanup in that case,
I'm concerned that this could increase the number of users affected by
the known issue of leaking deleted pages. Currently, the only users
affected by that issue are those who run with INDEX_CLEANUP off. But
if we enable this feature by default, all users are potentially
affected by that issue.

FWIW I think that it's unfair to blame INDEX_CLEANUP for any problems
in this area. The truth is that the design of the
deleted-page-recycling stuff has always caused leaked pages, even with
workloads that should not be challenging to the implementation in any
way. See my later remarks.

To improve index tuple deletion in that case, skipping bulkdelete is a
good idea, although retail index deletion is also a possible solution.
I had thought retail index deletion would be appropriate for this
case, but since some index AMs cannot support it, skipping index scans
is a good solution anyway.

The big problem with retail index tuple deletion is that it is not
possible once heap pruning takes place (opportunistic pruning, or
pruning performed by VACUUM). Pruning will destroy the information
that retail deletion needs to find the index tuple (the column
values).

Right.

I think that we probably will end up using retail index tuple
deletion, but it will only be one strategy among several. We'll never
be able to rely on it, even within nbtree. My personal opinion is that
completely eliminating VACUUM is not a useful long term goal.

Totally agreed. We are not able to rely on it. It would be a good way
to delete a small amount of index garbage tuples, but its usage is limited.

Here is how the triggering criteria could work: maybe skipping
accessing all indexes during VACUUM happens when less than 1% or
10,000 of the items from the table are to be removed by VACUUM --
whichever is greater. Of course this is just the first thing I thought
of. It's a starting point for further discussion.

I also prefer your second idea :)

Cool. Yeah, I always like it when the potential downside of a design
is obviously low, and the potential upside is obviously very high. I
am much less concerned about any uncertainty around when and how users
will get the big upside. I like it when my problems are "luxury
problems". :-)

As I mentioned above, I'm still concerned that the set of users
affected by the issue of leaking deleted pages could expand.
Currently, we don't have a way to detect how many index pages are
leaked. If there are cases where many deleted pages are leaked, this
feature would make things worse.

The existing problems here were significant even before you added
INDEX_CLEANUP. For example, let's say VACUUM deletes a page, and then
later recycles it in the normal/correct way -- this is the simplest
possible case for page deletion. The page is now in the FSM, ready to
be recycled during the next page split. Or is it? Even in this case
there are no guarantees! This is because _bt_getbuf() does not fully
trust the FSM to give it a 100% recycle-safe page for its page split
caller -- _bt_getbuf() checks the page using _bt_page_recyclable()
(which is the same check that VACUUM does to decide a deleted page is
now recyclable). Obviously this means that the FSM can "leak" a page,
just because there happened to be no page splits before wraparound
occurred (and so now _bt_page_recyclable() thinks the very old page is
very new/in the future).

In general the recycling stuff feels ridiculously over engineered to
me. It is true that page deletion is intrinsically complicated, and is
worth having -- that makes sense to me. But the complexity of the
recycling stuff seems ridiculous.

There is only one sensible solution: follow the example of commit
6655a7299d8 in nbtree. This commit fully fixed exactly the same
problem in GiST by storing an epoch alongside the XID. This nbtree fix
is even anticipated by the commit message of 6655a7299d8. I can take
care of this myself for Postgres 14.

Thanks. I think it's very good if we resolve this recycling stuff
first and then try the new approach to skip index vacuuming in more
cases. That way, even if the vacuum strategy stuff took a very long
time to get committed over several major versions, we would not be
affected by the deleted nbtree page recycling problem (at least for
built-in index AMs). Also, the approach of 6655a7299d8 itself is a good
improvement and seems straightforward to me.

I agreed to cut the scope for PG14. It would be good if we could
improve index vacuum while cutting the scope for PG14 and not
expanding the extent of the impact of this issue.

Great! Well, if I take care of the _bt_page_recyclable()
wraparound/epoch issue in a general kind of way then AFAICT there is
no added risk.

Agreed!

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#33Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#32)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Feb 9, 2021 at 6:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Thanks. I think it's very good if we resolve this recycling stuff
first and then try the new approach to skip index vacuuming in more
cases. That way, even if the vacuum strategy stuff took a very long
time to get committed over several major versions, we would not be
affected by the deleted nbtree page recycling problem (at least for
built-in index AMs). Also, the approach of 6655a7299d8 itself is a good
improvement and seems straightforward to me.

I'm glad that you emphasized this issue, because I came up with a
solution that turns out to not be very invasive. At the same time it
has unexpected advantages, like improving amcheck coverage for
deleted pages.

--
Peter Geoghegan

#34Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#33)
1 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Feb 10, 2021 at 4:12 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Feb 9, 2021 at 6:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Thanks. I think it's very good if we resolve this recycling stuff
first and then try the new approach to skip index vacuuming in more
cases. That way, even if the vacuum strategy stuff took a very long
time to get committed over several major versions, we would not be
affected by the deleted nbtree page recycling problem (at least for
built-in index AMs). Also, the approach of 6655a7299d8 itself is a good
improvement and seems straightforward to me.

I'm glad that you emphasized this issue, because I came up with a
solution that turns out to not be very invasive. At the same time it
has unexpected advantages, like improving amcheck coverage for
deleted pages.

Sorry for the late response.

I've attached a patch that adds a check of whether or not to do index
vacuum (and heap vacuum): both are done only if more than 1% of all
heap pages have an LP_DEAD line pointer.

While developing this feature, I realized the following two things:

1. While skipping index vacuum and heap vacuum is a very attractive
improvement, if we skip them by default I wonder if we need a way to
disable it. Vacuum plays a role in cleaning and diagnosing tables in
practice. So in a case where the table is in a bad state and the user
wants to clean all heap pages, it would be good to have a way to
disable this skipping behavior. One solution would be for the
index_cleanup option to have three different behaviors: on, auto (or
smart), and off. We would enable this skipping behavior by default in
'auto' mode, but specifying "INDEX_CLEANUP true" would mean enforcing
index vacuum and therefore disabling it (a rough sketch of such a
three-valued option appears after point 2 below).

---
2.
@@ -1299,6 +1303,7 @@ lazy_scan_heap(Relation onerel, VacuumParams
*params, LVRelStats *vacrelstats,
{
lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
all_visible = false;
+ has_dead_tuples = true;
continue;
}

I added the above change in the patch to count the number of heap
pages having at least one LP_DEAD line pointer. But it's weird to me
that we have never set has_dead_tuples to true when we found an
LP_DEAD line pointer. Currently, we set it to true only in the
'tupgone' case, but it seems to me that we should do that in this case
as well since we use this flag in the following check:

else if (PageIsAllVisible(page) && has_dead_tuples)
{
elog(WARNING, "page containing dead tuples is marked as
all-visible in relation \"%s\" page %u",
vacrelstats->relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
visibilitymap_clear(onerel, blkno, vmbuffer,
VISIBILITYMAP_VALID_BITS);
}
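
A minimal sketch of the three-valued index_cleanup option described
under point 1 above (a hypothetical enum and parser only; the actual
reloption and VACUUM-option plumbing would look different):

#include <string.h>

typedef enum IndexCleanupMode
{
    INDEX_CLEANUP_AUTO,     /* default: let VACUUM decide ("smart") */
    INDEX_CLEANUP_ON,       /* INDEX_CLEANUP true: always vacuum indexes */
    INDEX_CLEANUP_OFF       /* INDEX_CLEANUP false: never vacuum indexes */
} IndexCleanupMode;

static IndexCleanupMode
parse_index_cleanup(const char *value)
{
    if (value == NULL || strcmp(value, "auto") == 0)
        return INDEX_CLEANUP_AUTO;
    if (strcmp(value, "on") == 0 || strcmp(value, "true") == 0)
        return INDEX_CLEANUP_ON;
    return INDEX_CLEANUP_OFF;   /* "off"/"false"; error handling omitted */
}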

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

skip_index_vacuum.patch (application/octet-stream)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 0bb78162f5..25c747df7f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -130,6 +130,12 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+/*
+ * The threshold of the percentage of heap blocks having LP_DEAD line pointer
+ * to trigger both table vacuum and index vacuum.
+ */
+#define SKIP_VACUUM_PAGES_RATIO		0.01
+
 /*
  * DSM keys for parallel vacuum.  Unlike other parallel execution code, since
  * we don't need to worry about DSM keys conflicting with plan_node_id we can
@@ -343,6 +349,10 @@ static BufferAccessStrategy vac_strategy;
 static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 						   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
 						   bool aggressive);
+static void lazy_vacuum_table_and_indexes(Relation onerel, LVRelStats *vacrelstats,
+										  Relation *Irel, IndexBulkDeleteResult **indstats,
+										  int nindexes, LVParallelState *lps,
+										  double *npages_deadlp);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
@@ -768,6 +778,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				tups_vacuumed,	/* tuples cleaned up by current vacuum */
 				nkeep,			/* dead-but-not-removable tuples */
 				nunused;		/* # existing unused line pointers */
+	double		npages_deadlp;
 	IndexBulkDeleteResult **indstats;
 	int			i;
 	PGRUsage	ru0;
@@ -800,6 +811,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	empty_pages = vacuumed_pages = 0;
 	next_fsm_block_to_vacuum = (BlockNumber) 0;
 	num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
+	npages_deadlp = 0;
 
 	indstats = (IndexBulkDeleteResult **)
 		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
@@ -1050,23 +1062,15 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				vmbuffer = InvalidBuffer;
 			}
 
-			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(onerel, Irel, indstats,
-									vacrelstats, lps, nindexes);
-
-			/* Remove tuples from heap */
-			lazy_vacuum_heap(onerel, vacrelstats);
-
-			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
-			 */
-			dead_tuples->num_tuples = 0;
+			/* Remove the collected garbage tuples from table and indexes */
+			lazy_vacuum_table_and_indexes(onerel, vacrelstats, Irel, indstats,
+										  nindexes, lps, &npages_deadlp);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
 			 * upper-level FSM pages.  Note we have not yet processed blkno.
+			 * Even if we skipped heap vacuum, FSM vacuuming could be worthwhile
+			 * since we could have updated the freespace of empty pages.
 			 */
 			FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
 			next_fsm_block_to_vacuum = blkno;
@@ -1299,6 +1303,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			{
 				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
 				all_visible = false;
+				has_dead_tuples = true;
 				continue;
 			}
 
@@ -1658,6 +1663,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		if (hastup)
 			vacrelstats->nonempty_pages = blkno + 1;
 
+		/* Remember the number of pages having at least one LP_DEAD line pointer */
+		if (has_dead_tuples)
+			npages_deadlp += 1;
+
 		/*
 		 * If we remembered any tuples for deletion, then the page will be
 		 * visited again by lazy_vacuum_heap, which will compute and record
@@ -1704,16 +1713,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	}
 
 	/* If any tuples need to be deleted, perform final vacuum cycle */
-	/* XXX put a threshold on min number of tuples here? */
 	if (dead_tuples->num_tuples > 0)
-	{
-		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats,
-								lps, nindexes);
-
-		/* Remove tuples from heap */
-		lazy_vacuum_heap(onerel, vacrelstats);
-	}
+		lazy_vacuum_table_and_indexes(onerel, vacrelstats, Irel, indstats,
+									  nindexes, lps, &npages_deadlp);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
@@ -1775,6 +1777,52 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	pfree(buf.data);
 }
 
+
+/*
+ * Remove the collected garbage tuples from the table and its indexes.
+ */
+static void
+lazy_vacuum_table_and_indexes(Relation onerel, LVRelStats *vacrelstats,
+							  Relation *Irel, IndexBulkDeleteResult **indstats,
+							  int nindexes, LVParallelState *lps,
+							  double *npages_deadlp)
+{
+	/*
+	 * Check whether or not to do index vacuum and heap vacuum.
+	 *
+	 * We do both index vacuum and heap vacuum if more than
+	 * SKIP_VACUUM_PAGES_RATIO of all heap pages have at least one LP_DEAD
+	 * line pointer.  This is normally a case where dead tuples on the heap
+	 * are highly concentrated in relatively few heap blocks, where the
+	 * index's enhanced deletion mechanism that is clever about heap block
+	 * dead tuple concentrations including btree's bottom-up index deletion
+	 * works well.  Also, since we can clean only a few heap blocks, it would
+	 * be a less negative impact in terms of visibility map update.
+	 *
+	 * If we skip vacuum, we just ignore the collected dead tuples.  Note that
+	 * vacrelstats->dead_tuples could have tuples which became dead after
+	 * HOT-pruning but are not marked dead yet.  We do not process them because
+	 * it's a very rare condition, and the next vacuum will process them anyway.
+	 */
+	if (*npages_deadlp > RelationGetNumberOfBlocks(onerel) * SKIP_VACUUM_PAGES_RATIO)
+	{
+		/* Work on all the indexes, then the heap */
+		lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats, lps, nindexes);
+
+		/* Remove tuples from heap */
+		lazy_vacuum_heap(onerel, vacrelstats);
+	}
+
+	/*
+	 * Forget the now-vacuumed tuples, and press on, but be careful
+	 * not to reset latestRemovedXid since we want that value to be
+	 * valid.
+	 */
+	vacrelstats->dead_tuples->num_tuples = 0;
+	*npages_deadlp = 0;
+}
+
+
 /*
  *	lazy_vacuum_all_indexes() -- vacuum all indexes of relation.
  *
#35Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#34)
Re: New IndexAM API controlling index vacuum strategies

On Sun, Feb 21, 2021 at 10:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Sorry for the late response.

Me too!

1. While skipping index vacuum and heap vacuum is a very attractive
improvement, if we skip them by default I wonder if we need a way to
disable it. Vacuum plays a role in cleaning and diagnosing tables in
practice. So in a case where the table is in a bad state and the user
wants to clean all heap pages, it would be good to have a way to
disable this skipping behavior. One solution would be for the
index_cleanup option to have three different behaviors: on, auto (or
smart), and off. We would enable this skipping behavior by default in
'auto' mode, but specifying "INDEX_CLEANUP true" would mean enforcing
index vacuum and therefore disabling it.

Sounds reasonable to me. Maybe users should express the skipping
behavior that they desire in terms of the *proportion* of all heap
blocks with LP_DEAD line pointers that we're willing to have while
still skipping index vacuuming + lazy_vacuum_heap() heap scan. In
other words, it can be a "scale" type GUC/param (though based on heap
blocks *at the end* of the first heap scan, not tuples at the point
the av launcher considers launching AV workers).

@@ -1299,6 +1303,7 @@ lazy_scan_heap(Relation onerel, VacuumParams
*params, LVRelStats *vacrelstats,
{
lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
all_visible = false;
+ has_dead_tuples = true;
continue;
}

I added the above change in the patch to count the number of heap
pages having at least one LP_DEAD line pointer. But it's weird to me
that we have never set has_dead_tuples true when we found an LP_DEAD
line pointer.

I think that you're right. However, in practice it isn't harmful
because has_dead_tuples is only used when "all_visible = true", and
only to detect corruption (which should never happen). I think that it
should be fixed as part of this work, though.

Lots of stuff in this area is kind of weird already. Sometimes this is
directly exposed to users, even. This came up recently, when I was
working on VACUUM VERBOSE stuff. (This touched the precise piece of
code you've patched in the quoted diff snippet, so perhaps you know
some of the story I will tell you now already.)

I recently noticed that VACUUM VERBOSE can report a very low
tups_vacuumed/"removable heap tuples" when run against tables where
most pruning is opportunistic pruning rather than VACUUM pruning
(which is very common), provided there are no HOT updates (which is
common but not very common). This can be very confusing, because
VACUUM VERBOSE will report a "tuples_deleted" for the heap relation
that is far far less than the "tuples_removed" it reports for indexes
on the same table -- even though both fields have values that are
technically accurate (e.g., not very affected by concurrent activity
during VACUUM, nothing like that).

This came to my attention when I was running BenchmarkSQL for the
64-bit XID deleted pages patch. It showed up in one of the BenchmarkSQL
tables (though only one -- the table whose UPDATEs are not HOT safe,
which is unique among the BenchmarkSQL/TPC-C tables). I pushed a commit with comment
changes [1] to make that aspect of VACUUM VERBOSE a little less
confusing. (I was actually running a quick-and-dirty hack that made
log_autovacuum show VACUUM VERBOSE index stuff -- I would probably
have missed the weird difference between heap tups_vacuumed and index
tuples_removed without this custom log_autovacuum hack.)

Just to be clear (I think you agree already): we should base any
triggering logic for skipping index vacuuming/lazy_vacuum_heap() on
logic that does not care *when* heap pages first contained LP_DEAD
line pointers (could be that they were counted in tups_vacuumed due to
being pruned during this VACUUM operation, could be from an earlier
opportunistic pruning, etc).

Currently, we set it to true only in the 'tupgone' case, but
it seems to me that we should do that in this case as well since we
use this flag in the following check:

else if (PageIsAllVisible(page) && has_dead_tuples)
{
elog(WARNING, "page containing dead tuples is marked as
all-visible in relation \"%s\" page %u",
vacrelstats->relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
visibilitymap_clear(onerel, blkno, vmbuffer,
VISIBILITYMAP_VALID_BITS);
}

The "tupgone = true"/HEAPTUPLE_DEAD-race case is *extremely* weird. It
has zero test coverage according to coverage.postgresql.org [2]https://coverage.postgresql.org/src/backend/access/heap/vacuumlazy.c.gcov.html -- Peter Geoghegan,
despite being very complicated.

3 points on the "tupgone = true" weirdness (I'm writing this as a
record for myself, almost):

1. It is the reason why lazy_vacuum_heap() must be prepared to set
tuples LP_UNUSED that are not already LP_DEAD. So when
lazy_vacuum_page() says "the first dead tuple for this page", that
doesn't necessarily mean LP_DEAD items! (Though the other cases are
not even tested, I think -- the lack of "tupgone = true" test coverage
also means we don't cover corresponding lazy_vacuum_page() cases.)

2. This is also why we need XLOG_HEAP2_CLEANUP_INFO records (i.e. why
XLOG_HEAP2_CLEAN records are not sufficient for all required recovery
conflicts during VACUUM).

3. And it's also why log_heap_clean() is needed for both
lazy_scan_heap()'s pruning and lazy_vacuum_heap() unused-marking.

Many years ago, Noah Misch tried to clean this up -- that included
renaming lazy_vacuum_heap() to lazy_heap_clear_dead_items(), which
would only deal with LP_DEAD items:

/messages/by-id/20130108024957.GA4751@tornado.leadboat.com

Of course, this effort to eliminate the "tupgone =
true"/XLOG_HEAP2_CLEANUP_INFO special case didn't go anywhere at the
time.

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7cde6b13a9b630e2f04d91e2f17dedc2afee21c6
[2]: https://coverage.postgresql.org/src/backend/access/heap/vacuumlazy.c.gcov.html
--
Peter Geoghegan

#36Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#35)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Mar 1, 2021 at 7:00 PM Peter Geoghegan <pg@bowt.ie> wrote:

I think that you're right. However, in practice it isn't harmful
because has_dead_tuples is only used when "all_visible = true", and
only to detect corruption (which should never happen). I think that it
should be fixed as part of this work, though.

Currently the first callsite that calls the new
lazy_vacuum_table_and_indexes() function in the patch
("skip_index_vacuum.patch") skips index vacuuming in exactly the same
way as the second and final lazy_vacuum_table_and_indexes() call site.
Don't we need to account for maintenance_work_mem in some way?

lazy_vacuum_table_and_indexes() should probably not skip index
vacuuming when we're close to exceeding the space allocated for the
LVDeadTuples array. Maybe we should not skip when
vacrelstats->dead_tuples->num_tuples is greater than 50% of
dead_tuples->max_tuples? Of course, this would only need to be
considered when lazy_vacuum_table_and_indexes() is only called once
for the entire VACUUM operation (otherwise we have far too little
maintenance_work_mem/dead_tuples->max_tuples anyway).
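
For illustration, the combined condition under discussion might look
something like this (a sketch only: SKIP_VACUUM_PAGES_RATIO is the
constant from the quoted patch, the 50% figure is just the suggestion
above, and the helper itself is hypothetical):

#include <stdbool.h>
#include <stdint.h>

#define SKIP_VACUUM_PAGES_RATIO 0.01    /* from skip_index_vacuum.patch */

static bool
can_skip_index_vacuuming(double npages_deadlp, uint32_t rel_pages,
                         long num_dead_tuples, long max_dead_tuples)
{
    /* Too many heap blocks contain LP_DEAD items: do the real thing. */
    if (npages_deadlp > rel_pages * SKIP_VACUUM_PAGES_RATIO)
        return false;

    /* Close to filling the dead-TID array (maintenance_work_mem): don't skip. */
    if (num_dead_tuples > max_dead_tuples / 2)
        return false;

    return true;
}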

--
Peter Geoghegan

#37Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#35)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Mar 2, 2021 at 12:00 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Feb 21, 2021 at 10:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Sorry for the late response.

Me too!

No problem, thank you for your comment!

1. While skipping index vacuum and heap vacuum is a very attractive
improvement, if we skip them by default I wonder if we need a way to
disable it. Vacuum plays a role in cleaning and diagnosing tables in
practice. So in a case where the table is in a bad state and the user
wants to clean all heap pages, it would be good to have a way to
disable this skipping behavior. One solution would be for the
index_cleanup option to have three different behaviors: on, auto (or
smart), and off. We would enable this skipping behavior by default in
'auto' mode, but specifying "INDEX_CLEANUP true" would mean enforcing
index vacuum and therefore disabling it.

Sounds reasonable to me. Maybe users should express the skipping
behavior that they desire in terms of the *proportion* of all heap
blocks with LP_DEAD line pointers that we're willing to have while
still skipping index vacuuming + lazy_vacuum_heap() heap scan. In
other words, it can be a "scale" type GUC/param (though based on heap
blocks *at the end* of the first heap scan, not tuples at the point
the av launcher considers launching AV workers).

A scale type parameter seems good to me, but I wonder how users could
tune that parameter. We already have tuple-based parameters such as
autovacuum_vacuum_scale_factor/threshold, and I think that users
basically don't pay attention to how many blocks their table updates
end up touching.

And I'm concerned that my above idea could confuse users, since what we
want to control is both heap vacuum and index vacuum, but it looks like
it controls only index vacuum.

The third idea is a VACUUM command option like DISABLE_PAGE_SKIPPING
to disable such skipping behavior. I imagine that a
user-controllable option to enforce both heap vacuum and index vacuum
would also be required in the future when we have the vacuum strategy
feature (i.e., incremental vacuum).

@@ -1299,6 +1303,7 @@ lazy_scan_heap(Relation onerel, VacuumParams
*params, LVRelStats *vacrelstats,
{
lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
all_visible = false;
+ has_dead_tuples = true;
continue;
}

I added the above change in the patch to count the number of heap
pages having at least one LP_DEAD line pointer. But it's weird to me
that we have never set has_dead_tuples true when we found an LP_DEAD
line pointer.

I think that you're right. However, in practice it isn't harmful
because has_dead_tuples is only used when "all_visible = true", and
only to detect corruption (which should never happen). I think that it
should be fixed as part of this work, though.

Agreed.

Lots of stuff in this area is kind of weird already. Sometimes this is
directly exposed to users, even. This came up recently, when I was
working on VACUUM VERBOSE stuff. (This touched the precise piece of
code you've patched in the quoted diff snippet, so perhaps you know
some of the story I will tell you now already.)

I recently noticed that VACUUM VERBOSE can report a very low
tups_vacuumed/"removable heap tuples" when run against tables where
most pruning is opportunistic pruning rather than VACUUM pruning
(which is very common), provided there are no HOT updates (which is
common but not very common). This can be very confusing, because
VACUUM VERBOSE will report a "tuples_deleted" for the heap relation
that is far far less than the "tuples_removed" it reports for indexes
on the same table -- even though both fields have values that are
technically accurate (e.g., not very affected by concurrent activity
during VACUUM, nothing like that).

This came to my attention when I was running BenchmarkSQL for the
64-bit XID deleted pages patch. It showed up in one of the BenchmarkSQL
tables (though only one -- the table whose UPDATEs are not HOT safe,
which is unique among the BenchmarkSQL/TPC-C tables). I pushed a commit with comment
changes [1] to make that aspect of VACUUM VERBOSE a little less
confusing. (I was actually running a quick-and-dirty hack that made
log_autovacuum show VACUUM VERBOSE index stuff -- I would probably
have missed the weird difference between heap tups_vacuumed and index
tuples_removed without this custom log_autovacuum hack.)

That's true. I didn't know that.

Just to be clear (I think you agree already): we should base any
triggering logic for skipping index vacuuming/lazy_vacuum_heap() on
logic that does not care *when* heap pages first contained LP_DEAD
line pointers (could be that they were counted in tups_vacuumed due to
being pruned during this VACUUM operation, could be from an earlier
opportunistic pruning, etc).

Agreed. We should base it only on the fact that the page contains LP_DEAD items.

Currently, we set it to true only in the 'tupgone' case, but
it seems to me that we should do that in this case as well since we
use this flag in the following check:

else if (PageIsAllVisible(page) && has_dead_tuples)
{
elog(WARNING, "page containing dead tuples is marked as
all-visible in relation \"%s\" page %u",
vacrelstats->relname, blkno);
PageClearAllVisible(page);
MarkBufferDirty(buf);
visibilitymap_clear(onerel, blkno, vmbuffer,
VISIBILITYMAP_VALID_BITS);
}

The "tupgone = true"/HEAPTUPLE_DEAD-race case is *extremely* weird. It
has zero test coverage according to coverage.postgresql.org [2],
despite being very complicated.

3 points on the "tupgone = true" weirdness (I'm writing this as a
record for myself, almost):

1. It is the reason why lazy_vacuum_heap() must be prepared to set
tuples LP_UNUSED that are not already LP_DEAD. So when
lazy_vacuum_page() says "the first dead tuple for this page", that
doesn't necessarily mean LP_DEAD items! (Though the other cases are
not even tested, I think -- the lack of "tupgone = true" test coverage
also means we don't cover corresponding lazy_vacuum_page() cases.)

2. This is also why we need XLOG_HEAP2_CLEANUP_INFO records (i.e. why
XLOG_HEAP2_CLEAN records are not sufficient for all required recovery
conflicts during VACUUM).

3. And it's also why log_heap_clean() is needed for both
lazy_scan_heap()'s pruning and lazy_vacuum_heap() unused-marking.

Many years ago, Noah Misch tried to clean this up -- that included
renaming lazy_vacuum_heap() to lazy_heap_clear_dead_items(), which
would only deal with LP_DEAD items:

/messages/by-id/20130108024957.GA4751@tornado.leadboat.com

Of course, this effort to eliminate the "tupgone =
true"/XLOG_HEAP2_CLEANUP_INFO special case didn't go anywhere at the
time.

I'll look at that thread.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#38Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#37)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Mar 2, 2021 at 6:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

A scale type parameter seems good to me, but I wonder how users could
tune that parameter. We already have tuple-based parameters such as
autovacuum_vacuum_scale_factor/threshold, and I think that users
basically don't pay attention to how many blocks their table updates
end up touching.

Fair. The scale thing was just a random suggestion, nothing to take
too seriously.

The third idea is a VACUUM command option like DISABLE_PAGE_SKIPPING
to disable such skipping behavior. I imagine that a
user-controllable option to enforce both heap vacuum and index vacuum
would also be required in the future when we have the vacuum strategy
feature (i.e., incremental vacuum).

Yeah, I'm worried about conflicting requirements here -- this patch
and the next patch (that pushes the same ideas further) might have
different requirements.

I think that this patch will mostly be useful in cases where there are
very few LP_DEAD-containing heap pages, but consistently more than
zero. So it's probably not easy to tune.

What we might want is an on/off switch. But why? DISABLE_PAGE_SKIPPING
was added because the freeze map work in 9.6 was considered high risk
at the time, and we needed to have a tool to manage that risk. But
this patch doesn't seem nearly as tricky. No?

Lots of stuff in this area is kind of weird already. Sometimes this is
directly exposed to users, even. This came up recently, when I was
working on VACUUM VERBOSE stuff.

That's true. I didn't know that.

It occurs to me that "tups_vacuumed vs. total LP_DEAD Items in heap
after VACUUM finishes" is similar to "pages_newly_deleted vs.
pages_deleted" for indexes. An easy mistake to make!

/messages/by-id/20130108024957.GA4751@tornado.leadboat.com

Of course, this effort to eliminate the "tupgone =
true"/XLOG_HEAP2_CLEANUP_INFO special case didn't go anywhere at the
time.

I'll look at that thread.

I'm not sure if it's super valuable to look at the thread. But it is
reassuring to see that Noah shared the intuition that the "tupgone =
true" case was kind of bad, even back in 2013. It's one part of my
"mental map" of VACUUM.

--
Peter Geoghegan

#39Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#36)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Mar 2, 2021 at 2:34 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Mar 1, 2021 at 7:00 PM Peter Geoghegan <pg@bowt.ie> wrote:

I think that you're right. However, in practice it isn't harmful
because has_dead_tuples is only used when "all_visible = true", and
only to detect corruption (which should never happen). I think that it
should be fixed as part of this work, though.

Currently the first callsite that calls the new
lazy_vacuum_table_and_indexes() function in the patch
("skip_index_vacuum.patch") skips index vacuuming in exactly the same
way as the second and final lazy_vacuum_table_and_indexes() call site.
Don't we need to account for maintenance_work_mem in some way?

lazy_vacuum_table_and_indexes() should probably not skip index
vacuuming when we're close to exceeding the space allocated for the
LVDeadTuples array. Maybe we should not skip when
vacrelstats->dead_tuples->num_tuples is greater than 50% of
dead_tuples->max_tuples? Of course, this would only need to be
considered when lazy_vacuum_table_and_indexes() is only called once
for the entire VACUUM operation (otherwise we have far too little
maintenance_work_mem/dead_tuples->max_tuples anyway).

Doesn't that actually mean we consider how many dead *tuples* we
collected during a vacuum? I'm not sure how important it is that we're
close to exceeding the maintenance_work_mem space. Suppose
maintenance_work_mem is 64MB; then we will not skip index vacuum and
heap vacuum once the number of dead tuples exceeds 5592404 (we can
collect 11184809 tuples with 64MB of memory). But those tuples could be
concentrated in a small number of blocks, for example in a very large
table. That seems to contradict the current strategy that we want
to skip vacuum if relatively few blocks are modified. No?
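
For reference, the arithmetic behind those figures, assuming the 6-byte
ItemPointerData entries used for the dead-TID array (the exact numbers
above are one off, presumably because of a small amount of header
overhead):

#include <stdio.h>

int
main(void)
{
    long    mwm_bytes = 64L * 1024 * 1024;      /* maintenance_work_mem = 64MB */
    long    tid_size = 6;                       /* sizeof(ItemPointerData) */
    long    max_tuples = mwm_bytes / tid_size;  /* roughly 11184810 TIDs */

    printf("max dead TIDs: %ld, 50%% of that: %ld\n",
           max_tuples, max_tuples / 2);
    return 0;
}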

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#40Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#26)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Feb 1, 2021 at 10:17 PM Peter Geoghegan <pg@bowt.ie> wrote:

* No need to change MaxHeapTuplesPerPage for now, since that only
really makes sense in cases that heavily involve bottom-up deletion,
where we care about the *concentration* of LP_DEAD line pointers in
heap pages (and not just the absolute number in the entire table),
which is qualitative, not quantitative (somewhat like bottom-up
deletion).

The change to MaxHeapTuplesPerPage that Masahiko has proposed does
make sense -- there are good reasons to increase it. Of course there
are also good reasons to not do so. I'm concerned that we won't have
time to think through all the possible consequences.

Yes, I agree that it's good to postpone this to a future release, and
that thinking through the consequences is not so easy. One possible
consequence that I'm concerned about is sequential scan performance.
For an index scan, you just jump to the line pointer you want and then
go get the tuple, but a sequential scan has to loop over all the line
pointers on the page, and skipping a lot of dead ones can't be
completely free. A small increase in MaxHeapTuplesPerPage probably
wouldn't matter, but the proposed increase of almost 10x (291 -> 2042)
is a bit scary. It's also a little hard to believe that letting almost
50% of the total space on the page get chewed up by the line pointer
array is going to be optimal. If that happens to every page while the
amount of data stays the same, the table must almost double in size.
That's got to be bad. The whole thing would be more appealing if there
were some way to exert exponentially increasing back-pressure on the
length of the line pointer array - that is, make it so that the longer
the array is already, the less willing we are to extend it further.
But I don't really see how to do that.
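
For context, the 291 -> 2042 figures mentioned above fall out of simple
arithmetic on a standard 8kB page; this just mirrors the
MaxHeapTuplesPerPage formula with the usual header sizes, written out
here only for illustration:

#include <stdio.h>

int
main(void)
{
    int     blcksz = 8192;          /* BLCKSZ */
    int     page_header = 24;       /* SizeOfPageHeaderData */
    int     tuple_header = 24;      /* MAXALIGN(SizeofHeapTupleHeader) */
    int     line_pointer = 4;       /* sizeof(ItemIdData) */

    /* Current limit: every item is assumed to need a header-sized tuple. */
    printf("current MaxHeapTuplesPerPage: %d\n",
           (blcksz - page_header) / (tuple_header + line_pointer));    /* 291 */

    /* Proposed limit: a page consisting almost entirely of line pointers. */
    printf("proposed limit: %d\n",
           (blcksz - page_header) / line_pointer);                     /* 2042 */

    return 0;
}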

Also, at the risk of going on and on, line pointer array bloat is very
hard to eliminate once it happens. We never even try to shrink the
line pointer array, and if the last TID in the array is still in use,
it wouldn't be possible anyway, assuming the table has at least one
non-BRIN index. Index page splits are likewise irreversible, but
creating a new index and dropping the old one is still less awful than
having to rewrite the table.

Another thing to consider is that MaxHeapTuplesPerPage is used to size
some stack-allocated arrays, especially the stack-allocated
PruneState. I thought for a while about this and I can't really see
why it would be a big problem, even with a large increase in
MaxHeapTuplesPerPage, so I'm just mentioning this in case it makes
somebody else think of something I've missed.

--
Robert Haas
EDB: http://www.enterprisedb.com

#41Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#40)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Mar 8, 2021 at 10:57 AM Robert Haas <robertmhaas@gmail.com> wrote:

Yes, I agree that it's good to postpone this to a future release, and
that thinking through the consequences is not so easy.

The current plan is to commit something like Masahiko's
skip_index_vacuum.patch for Postgres 14. The latest version of that
patch (a reduced-scope version of Masahiko's patch without any changes
to MaxHeapTuplesPerPage) is available from:

/messages/by-id/CAD21AoAtZb4+HJT_8RoOXvu4HM-Zd4HKS3YSMCH6+-W=bDyh-w@mail.gmail.com

The idea is to "unify the vacuum_cleanup_index_scale_factor feature
from Postgres 11 with the INDEX_CLEANUP feature from Postgres 12".
This is the broader plan to make that "unification" happen for
Postgres 14:

/messages/by-id/CAH2-WzkYaDdbWOEwSSmC65FzF_jRLq-cxrYtt-2+ASoA156X=w@mail.gmail.com

So, as I said, any change to MaxHeapTuplesPerPage is now out of scope
for Postgres 14.

One possible
consequence that I'm concerned about is sequential scan performance.
For an index scan, you just jump to the line pointer you want and then
go get the tuple, but a sequential scan has to loop over all the line
pointers on the page, and skipping a lot of dead ones can't be
completely free. A small increase in MaxHeapTuplesPerPage probably
wouldn't matter, but the proposed increase of almost 10x (291 -> 2042)
is a bit scary.

I agree. Maybe the real problem here is that MaxHeapTuplesPerPage is a
generic constant. Perhaps it should be something that can vary by
table, according to practical table-level considerations such as
projected tuple width given the "shape" of tuples for that table, etc.

Certain DB systems that use bitmap indexes extensively allow this to
be configured per-table. If you need to encode a bunch of TIDs as
bitmaps, you first need some trivial mapping from TIDs to integers
(before you even build the bitmap, much less compress it). So even
without VACUUM there is a trade-off to be made. It is *roughly*
comparable to the trade-off you make when deciding on a page size.

What I really want to do for Postgres 14 is to establish the principle
that index vacuuming is theoretically optional -- in all cases. There
will be immediate practical benefits, too. I think it's important to
remove the artificial behavioral differences between cases where there
are 0 dead tuples and cases where there is only 1. My guess is that
99%+ append-only tables are far more common than 100% append-only
tables in practice.

It's also a little hard to believe that letting almost
50% of the total space on the page get chewed up by the line pointer
array is going to be optimal. If that happens to every page while the
amount of data stays the same, the table must almost double in size.
That's got to be bad.

I think that we should be prepared for a large diversity of conditions
within a given table. It follows that we should try to be adaptive.

The reduced-scope patch currently tracks LP_DEAD line pointers at the
heap page level: it counts the heap blocks that have one or more
LP_DEAD line pointers (whether pre-existing or just pruned by this
VACUUM), and compares that count against a threshold at which index
vacuuming is forced. Currently we have a multiplier constant called
SKIP_VACUUM_PAGES_RATIO, which is
0.01 -- 1% of heap blocks. Of course, it's possible that LP_DEAD line
pointers will be very concentrated, in which case we're more
aggressive about skipping index vacuuming (if you think of it in terms
of dead TIDs instead of heap blocks we're aggressive, that is). The
other extreme exists too: LP_DEAD line pointers may instead be spread
diffusively across all heap pages, in which case we are unlikely to
ever skip index vacuuming outside of cases like anti-wraparound vacuum
or insert-driven vacuum to set VM bits.

The next iteration of the high-level "granular vacuum" project (which
will presumably target Postgres 15) should probably involve more
complicated, qualitative judgements about LP_DEAD line pointers in the
heap. Likewise it should care about individual needs of indexes, which
is something that Masahiko experimented with in earlier drafts of the
patch on this thread. The needs of each index can be quite different
with bottom-up index deletion. We may in fact end up adding a new,
moderately complicated cost model -- it may have to be modelled as an
optimization problem.

In short, I think that thinking more about the logical state of the
database during VACUUM is likely to pay off ("heap blocks vs dead
tuples" is one part of that). VACUUM should be a little more
qualitative, and a little less quantitative. The fact that we
currently don't do stuff like that (unless bottom-up index deletion
counts) is not an inherent limitation of the design of VACUUM. I'm not
entirely sure how far it can be pushed, but it seems quite promising.

The whole thing would be more appealing if there
were some way to exert exponentially increasing back-pressure on the
length of the line pointer array - that is, make it so that the longer
the array is already, the less willing we are to extend it further.
But I don't really see how to do that.

There are also related problems in the FSM, which just doesn't care
enough about preserving the original layout of tables over time. See
for example the recent "non-HOT update not looking at FSM for large
tuple update" thread. I believe that the aggregate effect of
inefficiencies like that are a real problem for us. The basic design
of the FSM hasn't been touched in over 10 years. There are non-linear
effects in play, in all likelihood. "Rare" harmful events (e.g. silly
space reuse in the heap, unnecessary page splits from version churn)
will tend to cause irreversible damage to locality of access if
allowed to occur at all. So we need to recognize those heap pages
where it's possible to preserve a kind of pristine state over time.
Heap pages that are subject to constant churn from updates exist --
most apps have some of those. But they're also usually a small
minority of all heap pages, even within the same heap relation.

--
Peter Geoghegan

#42Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#39)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Mar 2, 2021 at 8:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Mar 2, 2021 at 2:34 PM Peter Geoghegan <pg@bowt.ie> wrote:

lazy_vacuum_table_and_indexes() should probably not skip index
vacuuming when we're close to exceeding the space allocated for the
LVDeadTuples array. Maybe we should not skip when
vacrelstats->dead_tuples->num_tuples is greater than 50% of
dead_tuples->max_tuples? Of course, this would only need to be
considered when lazy_vacuum_table_and_indexes() is only called once
for the entire VACUUM operation (otherwise we have far too little
maintenance_work_mem/dead_tuples->max_tuples anyway).

Doesn't it actually mean we consider how many dead *tuples* we
collected during a vacuum? I'm not sure how important it is that we're
close to exceeding the maintenance_work_mem space. Suppose
maintenance_work_mem is 64MB: then we will skip neither index vacuum
nor heap vacuum if the number of dead tuples exceeds 5592404 (we can
collect 11184809 tuples with 64MB of memory). But those tuples could
be concentrated in a small number of blocks, for example in a very
large table. It seems to contradict the current strategy that we want
to skip vacuum if relatively few blocks are modified. No?
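
(For reference, a rough reconstruction of those numbers: 64MB is
67,108,864 bytes; assuming 6-byte ItemPointerData entries plus a small
array header, that comes to 11,184,809 TIDs, and half of that is about
5,592,404.)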

There are competing considerations. I think that we need to be
sensitive to accumulating "debt" here. The cost of index vacuuming
grows in a non-linear fashion as the index grows (or as
maintenance_work_mem is lowered). This is the kind of thing that we
should try to avoid, I think. I suspect that cases where we can skip
index vacuuming and heap vacuuming are likely to involve very few dead
tuples in most cases anyway.

We should not be sensitive to the absolute number of dead tuples when
it doesn't matter (say because they're concentrated in relatively few
heap pages). But when we overrun the maintenance_work_mem space, then
the situation changes; the number of dead tuples clearly matters just
because we run out of space for the TID array. The heap page level
skew is not really important once that happens.

That said, maybe there is a better algorithm. 50% was a pretty arbitrary number.

Have you thought more about how the index vacuuming skipping can be
configured by users? Maybe a new storage param, that works like the
current SKIP_VACUUM_PAGES_RATIO constant?

--
Peter Geoghegan

#43Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#41)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Mar 8, 2021 at 7:34 PM Peter Geoghegan <pg@bowt.ie> wrote:

One possible
consequence that I'm concerned about is sequential scan performance.
For an index scan, you just jump to the line pointer you want and then
go get the tuple, but a sequential scan has to loop over all the line
pointers on the page, and skipping a lot of dead ones can't be
completely free. A small increase in MaxHeapTuplesPerPage probably
wouldn't matter, but the proposed increase of almost 10x (291 -> 2042)
is a bit scary.

I agree. Maybe the real problem here is that MaxHeapTuplesPerPage is a
generic constant. Perhaps it should be something that can vary by
table, according to practical table-level considerations such as
projected tuple width given the "shape" of tuples for that table, etc.

Speaking of line pointer bloat (and "irreversible" bloat), I came
across something relevant today. I believe that this recent patch from
Matthias van de Meent is a relatively easy way to improve the
situation:

/messages/by-id/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com

--
Peter Geoghegan

#44Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#38)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Mar 3, 2021 at 12:40 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Mar 2, 2021 at 6:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

A scale type parameter seems good to me, but I wonder how users can
tune that parameter. We already have tuple-based parameters such as
autovacuum_vacuum_scale_factor/threshold, and I think that users
basically don't pay attention to how many blocks their table updates
end up touching.

Fair. The scale thing was just a random suggestion, nothing to take
too seriously.

The third idea is a VACUUM command option like DISABLE_PAGE_SKIPPING
to disable such skipping behavior. I imagine that a
user-controllable option to enforce both heap vacuum and index vacuum
would also be required in the future when we have the vacuum strategy
feature (i.e., incremental vacuum).

Yeah, I'm worried about conflicting requirements here -- this patch
and the next patch (that pushes the same ideas further) might have
different requirements.

I think that this patch will mostly be useful in cases where there are
very few LP_DEAD-containing heap pages, but consistently more than
zero. So it's probably not easy to tune.

What we might want is an on/off switch. But why? DISABLE_PAGE_SKIPPING
was added because the freeze map work in 9.6 was considered high risk
at the time, and we needed to have a tool to manage that risk. But
this patch doesn't seem nearly as tricky. No?

I think the motivation behind an on/off switch is similar. I was
concerned about a case where there is a bug or something such that we
mistakenly skip vacuuming the heap and indexes. But since this feature
would not be as complicated as the freeze map, and only skips the step
of changing LP_DEAD to LP_UNUSED, I agree it seems not to be essential.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#45Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#43)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Mar 9, 2021 at 3:35 PM Peter Geoghegan <pg@bowt.ie> wrote:

Speaking of line pointer bloat (and "irreversible" bloat), I came
across something relevant today. I believe that this recent patch from
Matthias van de Meent is a relatively easy way to improve the
situation:

/messages/by-id/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com

I agree, but all you need is one long-lived tuple toward the end of
the array and you're stuck never being able to truncate it. It seems
like a worthwhile improvement, but whether it actually helps will be
workload-dependent.

Maybe it'd be OK to allow a much longer array with offsets > some
constant being usable only for HOT. HOT tuples are not indexed, so it
might be easier to rearrange things to allow compaction of the array
if it does happen to get fragmented. But I'm not sure it's OK to
relocate even a HOT tuple to a different TID. Can someone, perhaps
even just the user, still have a reference to the old one and care
about us invalidating it? Maybe. But even if not, I'm not sure this
helps much with the situation you're concerned about, which involves
non-HOT tuples.

--
Robert Haas
EDB: http://www.enterprisedb.com

#46Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Robert Haas (#45)
Re: New IndexAM API controlling index vacuum strategies

On Thu, 11 Mar 2021 at 17:31, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 9, 2021 at 3:35 PM Peter Geoghegan <pg@bowt.ie> wrote:

Speaking of line pointer bloat (and "irreversible" bloat), I came
across something relevant today. I believe that this recent patch from
Matthias van de Meent is a relatively easy way to improve the
situation:

/messages/by-id/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com

I agree, but all you need is one long-lived tuple toward the end of
the array and you're stuck never being able to truncate it. It seems
like a worthwhile improvement, but whether it actually helps will be
workload-dependent.

Maybe it'd be OK to allow a much longer array with offsets > some
constant being usable only for HOT. HOT tuples are not indexed, so it
might be easier to rearrange things to allow compaction of the array
if it does happen to get fragmented. But I'm not sure it's OK to
relocate even a HOT tuple to a different TID.

I'm currently trying to work out how to shuffle HOT tuples around as
an extension on top of my heap->pd_lower patch, and part of that will
be determining when and how HOT tuples are exposed internally. I'm
probably going to need to change how they are referenced to get that
working (current concept: HOT root TID + transaction identifier for
the places that need more than 1 item in HOT chains), but it's a very
bare-bones prototype currently only generating the data record
necessary to shuffle the item pointers.

In that, I've noticed that moving HOT items takes a lot of memory (~ 3
OffsetNumbers per increment of MaxHeapTuplesPerPage, plus some marking
bits) to implement it in O(n), which means it would probably warrant
its own loop in heap_page_prune separate from the current
mark-and-sweep, triggered based on new measurements included in the
current mark-and-sweep of the prune loop.

Another idea I'm considering (no real implementation ideas) to add to
this extension patch is moving HOT tuples to make space for incoming
tuples, to guarantee that non-movable items are placed early on the
page. This increases the chances for PageRepairFragmentation to
eventually reclaim space from the item pointer array.

I have nothing much worth showing yet for these additional patches,
though, and all of it might not be worth the additional CPU cycles
(it's 'only' 4 bytes per line pointer cleared, so it might be
considered too expensive when also taking WAL into account).

Can someone, perhaps
even just the user, still have a reference to the old one and care
about us invalidating it? Maybe. But even if not, I'm not sure this
helps much with the situation you're concerned about, which involves
non-HOT tuples.

Users having references to TIDs of HOT tuples should in my opinion be
considered unknown behaviour. It might currently work, but the only
access to a HOT tuple that is guaranteed to work should be through the
chain's root. Breaking the current guarantee of HOT tuples not moving
might be worth it if we can get enough savings in storage (which is
also becoming more likely if MaxHeapTuplesPerPage is changed to larger
values). As to who actually uses / stores these references, I think
that the only place they are stored with some expectation of
persistence are in sequential heap scans, and that can be changed.

With regards,

Matthias van de Meent

#47Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#45)
Re: New IndexAM API controlling index vacuum strategies

On Thu, Mar 11, 2021 at 8:31 AM Robert Haas <robertmhaas@gmail.com> wrote:

I agree, but all you need is one long-lived tuple toward the end of
the array and you're stuck never being able to truncate it. It seems
like a worthwhile improvement, but whether it actually helps will be
workload-dependent.

When it comes to improving VACUUM I think that most of the really
interesting scenarios are workload dependent in one way or another. In
fact even that concept becomes a little meaningless much of the time.
For example with workloads that really benefit from bottom-up
deletion, the vast majority of individual leaf pages have quite a bit
of spare capacity at any given time. Again, "rare" events can have
outsized importance in the aggregate -- most of the time every leaf
page taken individually is a-okay!

It's certainly not just indexing stuff. We have a tendency to imagine
that HOT updates occur when indexes are not logically modified, except
perhaps in the presence of some kind of stressor, like a long-running
transaction. I guess that I do the same, informally. But let's not
forget that the reality is that very few tables *consistently* get HOT
updates, regardless of the shape of indexes and UPDATE statements. So
in the long run practically all tables in many ways consist of pages
that resemble those from a table that "only gets non-HOT updates" in
the simplest sense.

I suspect that the general preference for using lower-offset LP_UNUSED
items first (inside PageAddItemExtended()) will tend to make this
problem of "one high tuple that isn't dead" not so bad in many cases.
In any case Matthias' patch makes the situation strictly better, and
we can only fix one problem at a time. We have to start by eliminating
individual low-level behaviors that *don't make sense*.

Jan Wieck told me that he had to set heap fill factor to the
ludicrously conservative setting of 50 just to get the
TPC-C/BenchmarkSQL OORDER and ORDER_LINE tables to be stable over time
[1]. These are the biggest tables! It takes hours if not days or even
weeks for
the situation to really get out of hand with a normal FF setting. I am
almost certain that this is due to second order effects (even third
order effects) that start from things like line pointer bloat and FSM
inefficiencies. I suspect that it doesn't matter too much if you make
heap fill factor 70 or 90 with these tables because the effect is
non-linear -- for whatever reason 50 was found to be the magic number,
through trial and error.

"Incremental VACUUM" (the broad concept, not just this one patch) is
likely to rely on our being able to make the performance
characteristics more linear, at least in future iterations. Of course
it's true that we should eliminate line pointer bloat and any kind of
irreversible bloat because the overall effect is non-linear, unstable
behavior, which is highly undesirable on its face. But it's also true
that these improvements leave us with more linear behavior at a
high-level, which is itself much easier to understand and model in a
top-down fashion. It then becomes possible to build a cost model that
makes VACUUM sensitive to the needs of the app, and how to make
on-disk sizes *stable* in a variety of conditions. So in that sense
I'd say that Matthias' patch is totally relevant.

I know that I sound hippy-dippy here. But the fact is that bottom-up
index deletion has *already* made the performance characteristics much
simpler and therefore much easier to model. I hope to do more of that.

[1]: https://github.com/wieck/benchmarksql/blob/29b62435dc5c9eaf178983b43818fcbba82d4286/run/sql.postgres/extraCommandsBeforeLoad.sql#L1
--
Peter Geoghegan

#48Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#42)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Mar 9, 2021 at 2:22 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Mar 2, 2021 at 8:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Mar 2, 2021 at 2:34 PM Peter Geoghegan <pg@bowt.ie> wrote:

lazy_vacuum_table_and_indexes() should probably not skip index
vacuuming when we're close to exceeding the space allocated for the
LVDeadTuples array. Maybe we should not skip when
vacrelstats->dead_tuples->num_tuples is greater than 50% of
dead_tuples->max_tuples? Of course, this would only need to be
considered when lazy_vacuum_table_and_indexes() is only called once
for the entire VACUUM operation (otherwise we have far too little
maintenance_work_mem/dead_tuples->max_tuples anyway).

Doesn't it actually mean we consider how many dead *tuples* we
collected during a vacuum? I'm not sure how important it is that we're
close to exceeding the maintenance_work_mem space. Suppose
maintenance_work_mem is 64MB: then we will skip neither index vacuum
nor heap vacuum if the number of dead tuples exceeds 5592404 (we can
collect 11184809 tuples with 64MB of memory). But those tuples could
be concentrated in a small number of blocks, for example in a very
large table. It seems to contradict the current strategy that we want
to skip vacuum if relatively few blocks are modified. No?

There are competing considerations. I think that we need to be
sensitive to accumulating "debt" here. The cost of index vacuuming
grows in a non-linear fashion as the index grows (or as
maintenance_work_mem is lowered). This is the kind of thing that we
should try to avoid, I think. I suspect that cases where we can skip
index vacuuming and heap vacuuming are likely to involve very few dead
tuples in most cases anyway.

We should not be sensitive to the absolute number of dead tuples when
it doesn't matter (say because they're concentrated in relatively few
heap pages). But when we overrun the maintenance_work_mem space, then
the situation changes; the number of dead tuples clearly matters just
because we run out of space for the TID array. The heap page level
skew is not really important once that happens.

That said, maybe there is a better algorithm. 50% was a pretty arbitrary number.

I agreed that when we're close to overrunning the
maintenance_work_mem space, the situation changes. If we skip it even
in that case, the next vacuum will likely use up
maintenance_work_mem, leading to a second index scan. Which is
bad.

If this threshold is aimed to avoid a second index scan due to
overrunning the maintenance_work_mem, using a ratio of
maintenance_work_mem would be a good idea. On the other hand, if it's
to avoid accumulating debt affecting the cost of index vacuuming,
using a ratio of the total heap tuples seems better.

The situation we need to deal with here is a very large table
that has a lot of dead tuples, but where those fit in few heap pages
(less than 1% of all heap blocks). In this case, it's likely that the
number of dead tuples is also relatively small compared to the total
heap tuples, as you mentioned. If dead tuples fit in few pages but
accounted for most of all heap tuples, it would be a more serious
situation; there would definitely already be other problems.
So considering those conditions, I agreed to use a ratio of
maintenance_work_mem as a threshold. Maybe we can increase the
constant to 70, 80, or so.

Have you thought more about how the index vacuuming skipping can be
configured by users? Maybe a new storage param, that works like the
current SKIP_VACUUM_PAGES_RATIO constant?

Since it’s unclear to me yet that adding a new storage parameter or
GUC parameter for this feature would be useful even for future
improvements in this area, I haven't thought yet about having users
control skipping index vacuuming. I’m okay with a constant value for
the threshold for now.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#49Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#48)
Re: New IndexAM API controlling index vacuum strategies

On Fri, Mar 12, 2021 at 9:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I agreed that when we're close to overrunning the
maintenance_work_mem space, the situation changes. If we skip it even
in that case, the next vacuum will likely use up
maintenance_work_mem, leading to a second index scan. Which is
bad.

If this threshold is aimed to avoid a second index scan due to
overrunning the maintenance_work_mem, using a ratio of
maintenance_work_mem would be a good idea. On the other hand, if it's
to avoid accumulating debt affecting the cost of index vacuuming,
using a ratio of the total heap tuples seems better.

It's both, together. These are two *independent*
considerations/thresholds. At least in the code that decides whether
or not we skip. Either threshold can force a full index scan (index
vacuuming).
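
To make that structure concrete, a minimal sketch of how two independent
triggers like these could be combined (function and variable names here
are illustrative, not the patch's):

    static bool
    must_vacuum_indexes(BlockNumber rel_pages, BlockNumber lpdead_item_pages,
                        int64 num_dead_tuples, int64 max_dead_tuples)
    {
        /* Table-level threshold: enough heap blocks carry LP_DEAD items */
        if (lpdead_item_pages > rel_pages * SKIP_VACUUM_PAGES_RATIO)
            return true;

        /* Memory threshold: the TID array is more than half full */
        if (num_dead_tuples > max_dead_tuples / 2)
            return true;

        return false;
    }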

What I'm really worried about is falling behind (in terms of the
amount of memory available for TIDs to delete in indexes) without any
natural limit. Suppose we just have the SKIP_VACUUM_PAGES_RATIO
threshold (i.e. no maintenance_work_mem threshold thing). With just
SKIP_VACUUM_PAGES_RATIO there will be lots of tables where index
vacuuming is almost always avoided, which is good. But
SKIP_VACUUM_PAGES_RATIO might be a bit *too* effective. If we have to
do 2 or even 3 scans of the index when we finally get to index
vacuuming then that's not great, it's inefficient -- but it's at least
*survivable*. But what if there are 10, 100, even 1000 bulk delete
calls for each index when it finally happens? That's completely
intolerable.

In other words, I am not worried about debt, exactly. Debt is normal
in moderation. Healthy, even. I am worried about bankruptcy, perhaps
following a rare and extreme event. It's okay to be imprecise, but all
of the problems must be survivable. The important thing to me for a
maintenance_work_mem threshold is that there is *some* limit. At the
same time, it may totally be worth accepting 2 or 3 index scans during
some eventual VACUUM operation if there are many more VACUUM
operations that don't even touch the index -- that's a good deal!
Also, it may actually be inherently necessary to accept a small risk
of having a future VACUUM operation that does multiple scans of each
index -- that is probably a necessary part of skipping index vacuuming
each time.

Think about the cost of index vacuuming (the amount of I/O and the
duration of index vacuuming) as less and less memory is available for
TIDs. It's non-linear. The cost explodes once we're past a certain
point. The truly important thing is to "never get killed by the
explosion".

The situation we need to deal with here is a very large table
that has a lot of dead tuples, but where those fit in few heap pages
(less than 1% of all heap blocks). In this case, it's likely that the
number of dead tuples is also relatively small compared to the total
heap tuples, as you mentioned. If dead tuples fit in few pages but
accounted for most of all heap tuples, it would be a more serious
situation; there would definitely already be other problems.
So considering those conditions, I agreed to use a ratio of
maintenance_work_mem as a threshold. Maybe we can increase the
constant to 70, 80, or so.

You mean 70% of maintenance_work_mem? That seems fine to me. See my
"Why does lazy_vacuum_table_and_indexes() not make one decision for
the entire VACUUM on the first call, and then stick to its decision?"
remarks at the end of this email, though -- maybe it should not be an
explicit threshold at all.

High level philosophical point: In general I think that the algorithm
for deciding whether or not to perform index vacuuming should *not* be
clever. It should also not focus on getting the benefit of skipping
index vacuuming. I think that a truly robust design will be one that
always starts with the assumption that index vacuuming will be
skipped, and then "works backwards" by considering thresholds/reasons
to *not* skip. For example, the SKIP_VACUUM_PAGES_RATIO thing. The
risk of "explosions" or "bankruptcy" can be thought of as a cost here,
too.

We should simply focus on the costs directly, without even trying to
understand the relationship between each of the costs, and without
really trying to understand the benefit to the user from skipping
index vacuuming.

Have you thought more about how the index vacuuming skipping can be
configured by users? Maybe a new storage param, that works like the
current SKIP_VACUUM_PAGES_RATIO constant?

Since it’s unclear to me yet that adding a new storage parameter or
GUC parameter for this feature would be useful even for future
improvements in this area, I haven't thought yet about having users
control skipping index vacuuming. I’m okay with a constant value for
the threshold for now.

I agree -- a GUC will become obsolete in only a year or two anyway.
And it will be too hard to tune.

Question about your patch: lazy_vacuum_table_and_indexes() can be
called multiple times (when low on maintenance_work_mem). Each time it
is called we decide what to do for that call and that batch of TIDs.
But...why should it work that way? The whole idea of a
SKIP_VACUUM_PAGES_RATIO style threshold doesn't make sense to me if
the code in lazy_vacuum_table_and_indexes() resets npages_deadlp (sets
it to 0) on each call. I think that npages_deadlp should never be
reset during a single VACUUM operation.

npages_deadlp is supposed to be something that we track for the entire
table. The patch actually compares it to the size of the whole table *
SKIP_VACUUM_PAGES_RATIO inside lazy_vacuum_table_and_indexes():

+   if (*npages_deadlp > RelationGetNumberOfBlocks(onerel) * SKIP_VACUUM_PAGES_RATIO)
+   {

+ }

The code that I have quoted here is actually how I expect
SKIP_VACUUM_PAGES_RATIO to work, but I notice an inconsistency:
lazy_vacuum_table_and_indexes() resets npages_deadlp later on, which
makes either the quoted code or the reset code wrong (at least when
VACUUM needs multiple calls to the lazy_vacuum_table_and_indexes()
function). With multiple calls to lazy_vacuum_table_and_indexes() (due
to low memory), we'll be comparing npages_deadlp to the wrong thing --
because npages_deadlp cannot be treated as a proportion of the blocks
in the *whole table*. Maybe the resetting of npages_deadlp would be
okay if you also used the number of heap blocks that were considered
since the last npages_deadlp reset, and then multiply that by
SKIP_VACUUM_PAGES_RATIO (instead of
RelationGetNumberOfBlocks(onerel)). But I suspect that the real
solution is to not reset npages_deadlp at all (without changing the
quoted code, which seems basically okay).

With tables/workloads that the patch helps a lot, we expect that the
SKIP_VACUUM_PAGES_RATIO threshold will *eventually* be crossed by one
of these VACUUM operations, which *finally* triggers index vacuuming.
So not only do we expect npages_deadlp to be tracked at the level of
the entire VACUUM operation -- we might even imagine it growing slowly
over multiple VACUUM operations, perhaps over many months. At least
conceptually -- it should only grow across VACUUM operations, until
index vacuuming finally takes place. That's my mental model for
npages_deadlp, at least. It tracks an easy to understand cost, which,
as I said, is what the
threshold/algorithm/lazy_vacuum_table_and_indexes() should focus on.

Why does lazy_vacuum_table_and_indexes() not make one decision for the
entire VACUUM on the first call, and then stick to its decision? That
seems much cleaner. Another advantage of that approach is that it
might be enough to handle low maintenance_work_mem risks -- perhaps
those can be covered by simply waiting until the first VACUUM
operation that runs out of memory and so requires multiple
lazy_vacuum_table_and_indexes() calls. If at that point we decide to
do index vacuuming throughout the entire vacuum operation, then we
will not allow the table to accumulate many more TIDs than we can
expect to fit in an entire maintenance_work_mem space.

Under this scheme, the "maintenance_work_mem threshold" can be thought
of as an implicit thing (it's not a constant/multiplier or anything)
-- it is >= 100% of maintenance_work_mem, in effect.
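
A minimal sketch of that "decide once, then stick to it" idea, with the
memory limit left implicit (every name below is invented for
illustration; this is not the patch's code):

    static bool
    decide_index_vacuuming_once(bool tid_space_overrun, BlockNumber rel_pages,
                                BlockNumber lpdead_item_pages)
    {
        /*
         * If the TID array filled up before the heap scan finished, that is
         * the implicit ">= 100% of maintenance_work_mem" trigger: do index
         * vacuuming for the rest of this VACUUM operation.
         */
        if (tid_space_overrun)
            return true;

        /*
         * Otherwise decide once, up front, using the table-level threshold.
         * SKIP_VACUUM_PAGES_RATIO is the 1% constant discussed upthread.
         */
        return lpdead_item_pages > rel_pages * SKIP_VACUUM_PAGES_RATIO;
    }

Here tid_space_overrun would be true whenever the call happens because
the dead-tuple array filled up mid-scan; in the remaining case there is
only one call per VACUUM operation anyway.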

--
Peter Geoghegan

#50Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#49)
Re: New IndexAM API controlling index vacuum strategies

On Sat, Mar 13, 2021 at 7:23 PM Peter Geoghegan <pg@bowt.ie> wrote:

In other words, I am not worried about debt, exactly. Debt is normal
in moderation. Healthy, even. I am worried about bankruptcy, perhaps
following a rare and extreme event. It's okay to be imprecise, but all
of the problems must be survivable. The important thing to me for a
maintenance_work_mem threshold is that there is *some* limit. At the
same time, it may totally be worth accepting 2 or 3 index scans during
some eventual VACUUM operation if there are many more VACUUM
operations that don't even touch the index -- that's a good deal!
Also, it may actually be inherently necessary to accept a small risk
of having a future VACUUM operation that does multiple scans of each
index -- that is probably a necessary part of skipping index vacuuming
each time.

Think about the cost of index vacuuming (the amount of I/O and the
duration of index vacuuming) as less and less memory is available for
TIDs. It's non-linear. The cost explodes once we're past a certain
point. The truly important thing is to "never get killed by the
explosion".

I just remembered this blog post, which gives a nice high level
summary of my mental model for things like this:

https://jessitron.com/2021/01/18/when-costs-are-nonlinear-keep-it-small/

This patch should eliminate inefficient index vacuuming involving very
small "batch sizes" (i.e. a small number of TIDs/index tuples to
delete from indexes). At the same time, it should not allow the batch
size to get too large because that's also inefficient. Perhaps larger
batch sizes are not exactly inefficient -- maybe they're risky. Though
risky is actually kind of the same thing as inefficient, at least to
me.

So IMV what we want to do here is to recognize cases where "batch
size" is so small that index vacuuming couldn't possibly be efficient.
We don't need to truly understand how that might change over time in
each case -- this is relatively easy.

There is some margin for error here, even with this reduced-scope
version that just does the SKIP_VACUUM_PAGES_RATIO thing. The patch
can afford to make suboptimal decisions about the scheduling of index
vacuuming over time (relative to the current approach), provided the
additional cost is at least *tolerable* -- that way we are still very
likely to win in the aggregate, over time. However, the patch cannot
be allowed to create a new risk of significantly worse performance for
any one VACUUM operation.

--
Peter Geoghegan

#51Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#45)
1 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Thu, Mar 11, 2021 at 8:31 AM Robert Haas <robertmhaas@gmail.com> wrote:

But even if not, I'm not sure this
helps much with the situation you're concerned about, which involves
non-HOT tuples.

Attached is a POC-quality revision of Masahiko's
skip_index_vacuum.patch [1]. There is an improved design for skipping
index vacuuming (improved over the INDEX_CLEANUP stuff from Postgres
12). I'm particularly interested in your perspective on this
refactoring stuff, Robert, because you ran into the same issues after
initial commit of the INDEX_CLEANUP reloption feature [2] -- you ran
into issues with the "tupgone = true" special case. This is the case
where VACUUM considers a tuple dead that was not marked LP_DEAD by
pruning, and so needs to be killed in the second heap scan in
lazy_vacuum_heap() instead. You'll recall that these issues were fixed
by your commit dd695979888 from May 2019. I think that we need to go
further than you did in dd695979888 for this -- we ought to get rid of
the special case entirely.

ISTM that any new code that skips index vacuuming really ought to be
structured as a dynamic version of the "VACUUM (INDEX_CLEANUP OFF)"
mechanism. Or vice versa. The important thing is to recognize that
they're essentially the same thing, and to structure the code such
that they become exactly the same mechanism internally. That's not
trivial right now. But removing the awful "tupgone = true" special
case seems to buy us a lot -- it makes unifying everything relatively
straightforward. In particular, it makes it possible to delay the
decision to vacuum indexes until the last moment, which seems
essential to making index vacuuming optional. And so I have removed
the tupgone/XLOG_HEAP2_CLEANUP_INFO crud in the patch -- that's what
all of the changes relate to. This results in a net negative line
count, which is a nice bonus!

I've CC'd Noah, because my additions to this revision (of Masahiko's
patch) are loosely based on an abandoned 2013 patch from Noah [3] --
Noah didn't care for the "tupgone = true" special case either. I think
that it's fair to say that Tom doesn't much care for it either [4], or
at least was distressed by its lack of test coverage as of a couple of
years ago -- which is a problem that still exists today. Honestly, I'm
surprised that somebody else hasn't removed the code in question
already, long ago -- what possible argument can be made for it now?

This patch makes the "VACUUM (INDEX_CLEANUP OFF)" mechanism no longer
get invoked as if it was like the "no indexes on table so do it all in
one heap pass" optimization. This seems a lot clearer -- INDEX_CLEANUP
OFF isn't able to call lazy_vacuum_page() at all (for the obvious
reason), so any similarity between the two cases was always
superficial -- skipping index vacuuming should not be confused with
doing a one-pass VACUUM/having no indexes at all. The original
INDEX_CLEANUP structure (from commits a96c41fe and dd695979) always
seemed confusing to me for this reason, FWIW.

Note that I've merged multiple existing functions in vacuumlazy.c into
one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap()
into a single function named vacuum_indexes_mark_unused() (note also
that lazy_vacuum_page() has been renamed to mark_unused_page() to
reflect the fact that it is now strictly concerned with making LP_DEAD
line pointers LP_UNUSED). The big idea is that there is one choke
point that decides whether index vacuuming is needed at all at one
point in time, dynamically. vacuum_indexes_mark_unused() decides this
for us at the last moment. This can only happen during a VACUUM that
has enough memory to fit all TIDs -- otherwise we won't skip anything
dynamically.

We may in the future add additional criteria for skipping index
vacuuming. That can now just be added to the beginning of this new
vacuum_indexes_mark_unused() function. We may even teach
vacuum_indexes_mark_unused() to skip some indexes but not others in a
future release, a possibility that was already discussed at length
earlier in this thread. This new structure has all the context it
needs to do all of these things.

I wonder if we can add some kind of emergency anti-wraparound vacuum
logic to what I have here, for Postgres 14. Can we come up with logic
that has us skip index vacuuming because XID wraparound is on the
verge of causing an outage? That seems like a strategically important
thing for Postgres, so perhaps we should try to get something like
that in. Practically every post mortem blog post involving Postgres
also involves anti-wraparound vacuum.
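
Purely as a sketch of what such a check could look like (the cutoff and
all of the names below are invented for illustration; nothing like this
is in the patch yet):

    static bool
    wraparound_emergency(TransactionId relfrozenxid, TransactionId next_xid,
                         int32 emergency_age)
    {
        int32 age;

        if (!TransactionIdIsNormal(relfrozenxid))
            return false;

        age = (int32) (next_xid - relfrozenxid);

        /* Skip index vacuuming so relfrozenxid can be advanced ASAP */
        return age > emergency_age;
    }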

One consequence of my approach is that we now call
lazy_cleanup_all_indexes(), even when we've skipped index vacuuming
itself. We should at least "check-in" with the indexes IMV. To an
index AM, this will be indistinguishable from a VACUUM that never had
tuples for it to delete, and so never called ambulkdelete() before
calling amvacuumcleanup(). This seems logical to me: why should there
be any significant behavioral divergence between the case where there
are 0 tuples to delete and the case where there is 1 tuple to delete?
The extra work that we perform in amvacuumcleanup() (if any) should
almost always be a no-op in nbtree following my recent refactoring
work. More generally, if an index AM is doing too much during cleanup,
and this becomes a bottleneck, then IMV that's a problem that needs to
be fixed in the index AM.
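
To spell out what that looks like from the index AM's side:
amvacuumcleanup() simply receives NULL bulk-delete stats, exactly as in
a VACUUM that had nothing to delete. A simplified sketch of the general
shape (not nbtree's actual code):

    IndexBulkDeleteResult *
    example_vacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
    {
        if (info->analyze_only)
            return stats;

        if (stats == NULL)
        {
            /* ambulkdelete() was never called; often nothing to do at all */
            return NULL;
        }

        /* Otherwise finish up whatever the earlier bulk delete started */
        return stats;
    }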

Masahiko: Note that I've also changed the SKIP_VACUUM_PAGES_RATIO
logic to never reset the count of heap blocks with one or more LP_DEAD
line pointers, per remarks in a recent email [5] -- that's now a table
level count of heap blocks. What do you think of that aspect? (BTW, I
pushed your fix for the "not setting has_dead_tuples/has_dead_items
variable" issue today, just to get it out of the way.)

[1]: /messages/by-id/CAD21AoAtZb4+HJT_8RoOXvu4HM-Zd4HKS3YSMCH6+-W=bDyh-w@mail.gmail.com
[2]: /messages/by-id/23885.1555357618@sss.pgh.pa.us
[3]: /messages/by-id/20130108024957.GA4751@tornado.leadboat.com
[4]: /messages/by-id/16814.1555348381@sss.pgh.pa.us
[5]: /messages/by-id/CAH2-Wznpywm4qparkQxf2ns3c7BugC4x7VzKjiB8OCYswwu-=g@mail.gmail.com
--
Peter Geoghegan

Attachments:

v2-0001-Skip-index-vacuuming-dynamically.patch (application/octet-stream)
From 1be226b363b72b2326c1648389502e1f3b12ba64 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 13 Mar 2021 20:37:32 -0800
Subject: [PATCH v2] Skip index vacuuming dynamically.

Based on skip_index_vacuum.patch from Masahiko Sawada:

https://postgr.es/m/CAD21AoAtZb4+HJT_8RoOXvu4HM-Zd4HKS3YSMCH6+-W=bDyh-w@mail.gmail.com

Also remove tupgone special case to decouple index vacuuming from
initial heap scan's pruning.  Unify dynamic index vacuum skipping with
the index_cleanup mechanism added to Postgres 12 by commits a96c41fe and
dd695979.
---
 src/include/access/heapam.h              |   2 +-
 src/include/access/heapam_xlog.h         |   4 +-
 src/backend/access/gist/gistxlog.c       |   8 +-
 src/backend/access/hash/hash_xlog.c      |   8 +-
 src/backend/access/heap/heapam.c         |  51 ---
 src/backend/access/heap/pruneheap.c      |  13 +-
 src/backend/access/heap/vacuumlazy.c     | 448 +++++++++++++----------
 src/backend/access/nbtree/nbtree.c       |   6 +-
 src/backend/access/rmgrdesc/heapdesc.c   |   9 -
 src/backend/commands/vacuum.c            |  15 +-
 src/backend/replication/logical/decode.c |   1 -
 11 files changed, 276 insertions(+), 289 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bc0936bc2d..0bef090420 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -180,7 +180,7 @@ extern int	heap_page_prune(Relation relation, Buffer buffer,
 							struct GlobalVisState *vistest,
 							TransactionId old_snap_xmin,
 							TimestampTz old_snap_ts_ts,
-							bool report_stats, TransactionId *latestRemovedXid,
+							bool report_stats,
 							OffsetNumber *off_loc);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 178d49710a..150c2fe384 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -53,7 +53,7 @@
 #define XLOG_HEAP2_REWRITE		0x00
 #define XLOG_HEAP2_CLEAN		0x10
 #define XLOG_HEAP2_FREEZE_PAGE	0x20
-#define XLOG_HEAP2_CLEANUP_INFO 0x30
+/* 0x30 is reserved */
 #define XLOG_HEAP2_VISIBLE		0x40
 #define XLOG_HEAP2_MULTI_INSERT 0x50
 #define XLOG_HEAP2_LOCK_UPDATED 0x60
@@ -397,8 +397,6 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
 extern void heap_xlog_logical_rewrite(XLogReaderState *r);
 
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
-										TransactionId latestRemovedXid);
 extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
 								 OffsetNumber *redirected, int nredirected,
 								 OffsetNumber *nowdead, int ndead,
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1c80eae044..5da9805073 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -184,10 +184,10 @@ gistRedoDeleteRecord(XLogReaderState *record)
 	 *
 	 * GiST delete records can conflict with standby queries.  You might think
 	 * that vacuum records would conflict as well, but we've handled that
-	 * already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual gist vacuum records on that index.
+	 * already.  XLOG_HEAP2_CLEAN records provide the highest xid cleaned by
+	 * the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual gist vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 02d9e6cdfd..7b8b8c8b74 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -992,10 +992,10 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
 	 * Hash index records that are marked as LP_DEAD and being removed during
 	 * hash index tuple insertion can conflict with standby queries. You might
 	 * think that vacuum records would conflict as well, but we've handled
-	 * that already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual hash index vacuum records on that index.
+	 * that already.  XLOG_HEAP2_CLEAN records provide the highest xid cleaned
+	 * by the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual hash index vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3b435c107d..8b80aa1b10 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7945,29 +7945,6 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
 	return nblocksfavorable;
 }
 
-/*
- * Perform XLogInsert to register a heap cleanup info message. These
- * messages are sent once per VACUUM and are required because
- * of the phasing of removal operations during a lazy VACUUM.
- * see comments for vacuum_log_cleanup_info().
- */
-XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
-{
-	xl_heap_cleanup_info xlrec;
-	XLogRecPtr	recptr;
-
-	xlrec.node = rnode;
-	xlrec.latestRemovedXid = latestRemovedXid;
-
-	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, SizeOfHeapCleanupInfo);
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEANUP_INFO);
-
-	return recptr;
-}
-
 /*
  * Perform XLogInsert for a heap-clean operation.  Caller must already
  * have modified the buffer and marked it dirty.
@@ -8497,27 +8474,6 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed,
 	return key_tuple;
 }
 
-/*
- * Handles CLEANUP_INFO
- */
-static void
-heap_xlog_cleanup_info(XLogReaderState *record)
-{
-	xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
-
-	if (InHotStandby)
-		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
-
-	/*
-	 * Actual operation is a no-op. Record type exists to provide a means for
-	 * conflict processing to occur before we begin index vacuum actions. see
-	 * vacuumlazy.c and also comments in btvacuumpage()
-	 */
-
-	/* Backup blocks are not used in cleanup_info records */
-	Assert(!XLogRecHasAnyBlockRefs(record));
-}
-
 /*
  * Handles XLOG_HEAP2_CLEAN record type
  */
@@ -8536,10 +8492,6 @@ heap_xlog_clean(XLogReaderState *record)
 	/*
 	 * We're about to remove tuples. In Hot Standby mode, ensure that there's
 	 * no queries running for which the removed tuples are still visible.
-	 *
-	 * Not all HEAP2_CLEAN records remove tuples with xids, so we only want to
-	 * conflict on the records that cause MVCC failures for user queries. If
-	 * latestRemovedXid is invalid, skip conflict processing.
 	 */
 	if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
 		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
@@ -9716,9 +9668,6 @@ heap2_redo(XLogReaderState *record)
 		case XLOG_HEAP2_FREEZE_PAGE:
 			heap_xlog_freeze_page(record);
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			heap_xlog_cleanup_info(record);
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			heap_xlog_visible(record);
 			break;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406..ac7e540944 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -182,13 +182,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		 */
 		if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
 		{
-			TransactionId ignore = InvalidTransactionId;	/* return value not
-															 * needed */
-
 			/* OK to prune */
 			(void) heap_page_prune(relation, buffer, vistest,
 								   limited_xmin, limited_ts,
-								   true, &ignore, NULL);
+								   true, NULL);
 		}
 
 		/* And release buffer lock */
@@ -213,8 +210,6 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * send its own new total to pgstats, and we don't want this delta applied
  * on top of that.)
  *
- * Sets latestRemovedXid for caller on return.
- *
  * off_loc is the offset location required by the caller to use in error
  * callback.
  *
@@ -225,7 +220,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 				GlobalVisState *vistest,
 				TransactionId old_snap_xmin,
 				TimestampTz old_snap_ts,
-				bool report_stats, TransactionId *latestRemovedXid,
+				bool report_stats,
 				OffsetNumber *off_loc)
 {
 	int			ndeleted = 0;
@@ -251,7 +246,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	prstate.old_snap_xmin = old_snap_xmin;
 	prstate.old_snap_ts = old_snap_ts;
 	prstate.old_snap_used = false;
-	prstate.latestRemovedXid = *latestRemovedXid;
+	prstate.latestRemovedXid = InvalidTransactionId;
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));
 
@@ -363,8 +358,6 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (report_stats && ndeleted > prstate.ndead)
 		pgstat_update_heap_dead_tuples(relation, ndeleted - prstate.ndead);
 
-	*latestRemovedXid = prstate.latestRemovedXid;
-
 	/*
 	 * XXX Should we update the FSM information of this page ?
 	 *
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 366c122bd1..4b2bf29d51 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -131,6 +131,12 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+/*
+ * The threshold of the percentage of heap blocks having LP_DEAD line pointer
+ * above which index vacuuming goes ahead.
+ */
+#define SKIP_VACUUM_PAGES_RATIO		0.01
+
 /*
  * DSM keys for parallel vacuum.  Unlike other parallel execution code, since
  * we don't need to worry about DSM keys conflicting with plan_node_id we can
@@ -294,8 +300,12 @@ typedef struct LVRelStats
 {
 	char	   *relnamespace;
 	char	   *relname;
-	/* useindex = true means two-pass strategy; false means one-pass */
-	bool		useindex;
+	/* hasindex = true means two-pass strategy; false means one-pass */
+	bool		hasindex;
+	/* mayskipindexes = true means we may decide to skip vacuum indexing */
+	bool		mayskipindexes;
+	/* mustskipindexes = true means we must always skip vacuum indexing */
+	bool		mustskipindexes;
 	/* Overall statistics about rel */
 	BlockNumber old_rel_pages;	/* previous value of pg_class.relpages */
 	BlockNumber rel_pages;		/* total number of pages */
@@ -312,7 +322,6 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	LVDeadTuples *dead_tuples;
 	int			num_index_scans;
-	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
 
 	/* Used for error callback */
@@ -344,20 +353,20 @@ static BufferAccessStrategy vac_strategy;
 static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 						   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
 						   bool aggressive);
-static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
+static void vacuum_indexes_mark_unused(Relation onerel, LVRelStats *vacrelstats,
+									   Relation *Irel, IndexBulkDeleteResult **indstats,
+									   int nindexes, LVParallelState *lps,
+									   BlockNumber has_dead_items_pages);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
-static void lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
-									IndexBulkDeleteResult **stats,
-									LVRelStats *vacrelstats, LVParallelState *lps,
-									int nindexes);
 static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 							  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats);
 static void lazy_cleanup_index(Relation indrel,
 							   IndexBulkDeleteResult **stats,
 							   double reltuples, bool estimated_count, LVRelStats *vacrelstats);
-static int	lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-							 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
+static int mark_unused_page(Relation onerel, BlockNumber blkno, Buffer buffer,
+							int tupindex, LVRelStats *vacrelstats,
+							Buffer *vmbuffer);
 static bool should_attempt_truncation(VacuumParams *params,
 									  LVRelStats *vacrelstats);
 static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
@@ -443,7 +452,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	ErrorContextCallback errcallback;
 
 	Assert(params != NULL);
-	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
 	/* not every AM requires these to be valid, but heap does */
@@ -502,8 +510,25 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 
 	/* Open all indexes of the relation */
 	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
-	vacrelstats->useindex = (nindexes > 0 &&
-							 params->index_cleanup == VACOPT_TERNARY_ENABLED);
+
+	/*
+	 * hasindex determines if we'll use one-pass strategy.  Note that this is
+	 * not the same thing as skipping index vacuuming.
+	 *
+	 * mayskipindexes tracks if we could in principle decide to skip index
+	 * vacuuming.  This can become false later.
+	 *
+	 * mustskipindexes tracks if we are obligated to skip index vacuuming
+	 * because of index_cleanup reloption.
+	 *
+	 * FIXME: This is too duplicative.  Fix this issue when you fix the
+	 * closely related VACOPT_TERNARY_DEFAULT issue in vacuum.c.
+	 */
+	vacrelstats->hasindex = (nindexes > 0);
+	vacrelstats->mayskipindexes =
+			(params->index_cleanup != VACOPT_TERNARY_ENABLED);
+	vacrelstats->mustskipindexes =
+			(params->index_cleanup == VACOPT_TERNARY_DISABLED);
 
 	/*
 	 * Setup error traceback support for ereport().  The idea is to set up an
@@ -689,39 +714,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	}
 }
 
-/*
- * For Hot Standby we need to know the highest transaction id that will
- * be removed by any change. VACUUM proceeds in a number of passes so
- * we need to consider how each pass operates. The first phase runs
- * heap_page_prune(), which can issue XLOG_HEAP2_CLEAN records as it
- * progresses - these will have a latestRemovedXid on each record.
- * In some cases this removes all of the tuples to be removed, though
- * often we have dead tuples with index pointers so we must remember them
- * for removal in phase 3. Index records for those rows are removed
- * in phase 2 and index blocks do not have MVCC information attached.
- * So before we can allow removal of any index tuples we need to issue
- * a WAL record containing the latestRemovedXid of rows that will be
- * removed in phase three. This allows recovery queries to block at the
- * correct place, i.e. before phase two, rather than during phase three
- * which would be after the rows have become inaccessible.
- */
-static void
-vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
-{
-	/*
-	 * Skip this for relations for which no WAL is to be written, or if we're
-	 * not trying to support archive recovery.
-	 */
-	if (!RelationNeedsWAL(rel) || !XLogIsNeeded())
-		return;
-
-	/*
-	 * No need to write the record at all unless it contains a valid value
-	 */
-	if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
-		(void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
-}
-
 /*
  *	lazy_scan_heap() -- scan an open heap relation
  *
@@ -730,9 +722,9 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
  *		page, and set commit status bits (see heap_page_prune).  It also builds
  *		lists of dead tuples and pages with free space, calculates statistics
  *		on the number of live tuples in the heap, and marks pages as
- *		all-visible if appropriate.  When done, or when we run low on space for
- *		dead-tuple TIDs, invoke vacuuming of indexes and call lazy_vacuum_heap
- *		to reclaim dead line pointers.
+ *		all-visible if appropriate.  When done, or when we run low on space
+ *		for dead-tuple TIDs, invoke vacuum_indexes_mark_unused to vacuum
+ *		indexes and mark dead line pointers for reuse via a second heap pass.
  *
  *		If the table has at least two indexes, we execute both index vacuum
  *		and index cleanup with parallel workers unless parallel vacuum is
@@ -762,7 +754,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
 	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
-				vacuumed_pages,
+				reuse_marked_pages,
+				has_dead_items_pages,
 				next_fsm_block_to_vacuum;
 	double		num_tuples,		/* total number of nonremovable tuples */
 				live_tuples,	/* live tuples (reltuples estimate) */
@@ -798,7 +791,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 						vacrelstats->relnamespace,
 						vacrelstats->relname)));
 
-	empty_pages = vacuumed_pages = 0;
+	empty_pages = reuse_marked_pages = has_dead_items_pages = 0;
 	next_fsm_block_to_vacuum = (BlockNumber) 0;
 	num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
 
@@ -810,7 +803,6 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	vacrelstats->scanned_pages = 0;
 	vacrelstats->tupcount_pages = 0;
 	vacrelstats->nonempty_pages = 0;
-	vacrelstats->latestRemovedXid = InvalidTransactionId;
 
 	vistest = GlobalVisTestFor(onerel);
 
@@ -819,8 +811,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * be used for an index, so we invoke parallelism only if there are at
 	 * least two indexes on a table.
 	 */
-	if (params->nworkers >= 0 && vacrelstats->useindex && nindexes > 1)
+	if (params->nworkers >= 0 && nindexes > 1)
 	{
+		Assert(vacrelstats->hasindex);
+
 		/*
 		 * Since parallel workers cannot access data in temporary tables, we
 		 * can't perform parallel vacuum on them.
@@ -937,8 +931,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		Page		page;
 		OffsetNumber offnum,
 					maxoff;
-		bool		tupgone,
-					hastup;
+		bool		hastup;
 		int			prev_dead_count;
 		int			nfrozen;
 		Size		freespace;
@@ -947,6 +940,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		bool		all_frozen = true;	/* provided all_visible is also true */
 		bool		has_dead_items;		/* includes existing LP_DEAD items */
 		TransactionId visibility_cutoff_xid = InvalidTransactionId;
+		bool		tuple_totally_frozen;
 
 		/* see note above about forcing scanning of last page */
 #define FORCE_CHECK_PAGE() \
@@ -1051,23 +1045,22 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				vmbuffer = InvalidBuffer;
 			}
 
-			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(onerel, Irel, indstats,
-									vacrelstats, lps, nindexes);
-
-			/* Remove tuples from heap */
-			lazy_vacuum_heap(onerel, vacrelstats);
-
 			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
+			 * Won't be skipping index vacuuming now, since that is only
+			 * something vacuum_indexes_mark_unused() does when dead tuple
+			 * space hasn't been overrun.
 			 */
-			dead_tuples->num_tuples = 0;
+			vacrelstats->mayskipindexes = false;
+
+			/* Remove the collected garbage tuples from table and indexes */
+			vacuum_indexes_mark_unused(onerel, vacrelstats, Irel, indstats,
+									   nindexes, lps, has_dead_items_pages);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
 			 * upper-level FSM pages.  Note we have not yet processed blkno.
+			 * Even if we skipped heap vacuum, FSM vacuuming could be worthwhile
+			 * since we could have updated the freespace of empty pages.
 			 */
 			FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
 			next_fsm_block_to_vacuum = blkno;
@@ -1240,7 +1233,6 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 */
 		tups_vacuumed += heap_page_prune(onerel, buf, vistest,
 										 InvalidTransactionId, 0, false,
-										 &vacrelstats->latestRemovedXid,
 										 &vacrelstats->offnum);
 
 		/*
@@ -1310,8 +1302,6 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			tuple.t_len = ItemIdGetLength(itemid);
 			tuple.t_tableOid = RelationGetRelid(onerel);
 
-			tupgone = false;
-
 			/*
 			 * The criteria for counting a tuple as live in this block need to
 			 * match what analyze.c's acquire_sample_rows() does, otherwise
@@ -1337,14 +1327,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 					 *
 					 * If the tuple is HOT-updated then it must only be
 					 * removed by a prune operation; so we keep it just as if
-					 * it were RECENTLY_DEAD.  Also, if it's a heap-only
-					 * tuple, we choose to keep it, because it'll be a lot
-					 * cheaper to get rid of it in the next pruning pass than
-					 * to treat it like an indexed tuple. Finally, if index
-					 * cleanup is disabled, the second heap pass will not
-					 * execute, and the tuple will not get removed, so we must
-					 * treat it like any other dead tuple that we choose to
-					 * keep.
+					 * it were RECENTLY_DEAD.  Actually we always keep it
+					 * because it hardly seems worth introducing a special
+					 * case.  This allows us to delay committing to index
+					 * vacuuming until the last moment -- no need to worry
+					 * about making tuple LP_DEAD within mark_unused_page().
 					 *
 					 * If this were to happen for a tuple that actually needed
 					 * to be deleted, we'd be in trouble, because it'd
@@ -1353,12 +1340,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 					 * to detect that case and abort the transaction,
 					 * preventing corruption.
 					 */
-					if (HeapTupleIsHotUpdated(&tuple) ||
-						HeapTupleIsHeapOnly(&tuple) ||
-						params->index_cleanup == VACOPT_TERNARY_DISABLED)
-						nkeep += 1;
-					else
-						tupgone = true; /* we can delete the tuple */
+					nkeep += 1;
 					all_visible = false;
 					break;
 				case HEAPTUPLE_LIVE:
@@ -1443,35 +1425,22 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 					break;
 			}
 
-			if (tupgone)
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-													   &vacrelstats->latestRemovedXid);
-				tups_vacuumed += 1;
-				has_dead_items = true;
-			}
-			else
-			{
-				bool		tuple_totally_frozen;
+			num_tuples += 1;
+			hastup = true;
 
-				num_tuples += 1;
-				hastup = true;
+			/*
+			 * Each non-removable tuple must be checked to see if it needs
+			 * freezing.  Note we already have exclusive buffer lock.
+			 */
+			if (heap_prepare_freeze_tuple(tuple.t_data,
+										  relfrozenxid, relminmxid,
+										  FreezeLimit, MultiXactCutoff,
+										  &frozen[nfrozen],
+										  &tuple_totally_frozen))
+				frozen[nfrozen++].offset = offnum;
 
-				/*
-				 * Each non-removable tuple must be checked to see if it needs
-				 * freezing.  Note we already have exclusive buffer lock.
-				 */
-				if (heap_prepare_freeze_tuple(tuple.t_data,
-											  relfrozenxid, relminmxid,
-											  FreezeLimit, MultiXactCutoff,
-											  &frozen[nfrozen],
-											  &tuple_totally_frozen))
-					frozen[nfrozen++].offset = offnum;
-
-				if (!tuple_totally_frozen)
-					all_frozen = false;
-			}
+			if (!tuple_totally_frozen)
+				all_frozen = false;
 		}						/* scan along page */
 
 		/*
@@ -1518,38 +1487,22 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		/*
 		 * If there are no indexes we can vacuum the page right now instead of
-		 * doing a second scan. Also we don't do that but forget dead tuples
-		 * when index cleanup is disabled.
+		 * doing a second scan
 		 */
-		if (!vacrelstats->useindex && dead_tuples->num_tuples > 0)
+		if (nindexes == 0 && dead_tuples->num_tuples > 0)
 		{
-			if (nindexes == 0)
-			{
-				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
-				vacuumed_pages++;
-				has_dead_items = false;
-			}
-			else
-			{
-				/*
-				 * Here, we have indexes but index cleanup is disabled.
-				 * Instead of vacuuming the dead tuples on the heap, we just
-				 * forget them.
-				 *
-				 * Note that vacrelstats->dead_tuples could have tuples which
-				 * became dead after HOT-pruning but are not marked dead yet.
-				 * We do not process them because it's a very rare condition,
-				 * and the next vacuum will process them anyway.
-				 */
-				Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
-			}
+			Assert(!vacrelstats->hasindex);
 
 			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
+			 * Mark LP_DEAD item pointers for reuse now, in an incremental
+			 * fashion.  This is safe because the table has no indexes (and so
+			 * vacuum_indexes_mark_unused() will never be called).
 			 */
+			mark_unused_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+			reuse_marked_pages++;
+			has_dead_items = false;
+
+			/* Forget the now-vacuumed tuples */
 			dead_tuples->num_tuples = 0;
 
 			/*
@@ -1660,12 +1613,24 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		if (hastup)
 			vacrelstats->nonempty_pages = blkno + 1;
 
+		/*
+		 * Remember the number of pages having at least one LP_DEAD line
+		 * pointer.  This could be from this VACUUM, a previous VACUUM, or
+		 * even opportunistic pruning.
+		 */
+		if (has_dead_items)
+			has_dead_items_pages++;
+
 		/*
 		 * If we remembered any tuples for deletion, then the page will be
-		 * visited again by lazy_vacuum_heap, which will compute and record
-		 * its post-compaction free space.  If not, then we're done with this
-		 * page, so remember its free space as-is.  (This path will always be
-		 * taken if there are no indexes.)
+		 * visited again by vacuum_indexes_mark_unused, which will compute and
+		 * record its post-compaction free space.  If not, then we're done
+		 * with this page, so remember its free space as-is.
+		 *
+		 * This path will always be taken if there are no indexes.  However,
+		 * it might not be taken if INDEX_CLEANUP is off -- that works the
+		 * same as the case where we decide to skip index vacuuming.  See also
+		 * vacuum_indexes_mark_unused(), where that is decided.
 		 */
 		if (dead_tuples->num_tuples == prev_dead_count)
 			RecordPageWithFreeSpace(onerel, blkno, freespace);
@@ -1706,20 +1671,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	}
 
 	/* If any tuples need to be deleted, perform final vacuum cycle */
-	/* XXX put a threshold on min number of tuples here? */
+	Assert(vacrelstats->hasindex || dead_tuples->num_tuples == 0);
 	if (dead_tuples->num_tuples > 0)
-	{
-		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats,
-								lps, nindexes);
-
-		/* Remove tuples from heap */
-		lazy_vacuum_heap(onerel, vacrelstats);
-	}
+		vacuum_indexes_mark_unused(onerel, vacrelstats, Irel, indstats,
+								   nindexes, lps, has_dead_items_pages);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
-	 * not there were indexes.
+	 * not there were indexes, and whether or not we skipped index vacuuming.
 	 */
 	if (blkno > next_fsm_block_to_vacuum)
 		FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
@@ -1727,8 +1686,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/* report all blocks vacuumed */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
-	/* Do post-vacuum cleanup */
-	if (vacrelstats->useindex)
+	/*
+	 * Do post-vacuum cleanup.
+	 *
+	 * Note that this takes place even when vacuum_indexes_mark_unused()
+	 * decided to skip index vacuuming.
+	 */
+	if (vacrelstats->hasindex)
 		lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, nindexes);
 
 	/*
@@ -1738,16 +1702,27 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	if (ParallelVacuumIsActive(lps))
 		end_parallel_vacuum(indstats, lps, nindexes);
 
-	/* Update index statistics */
-	if (vacrelstats->useindex)
+	/*
+	 * Update index statistics.
+	 *
+	 * Note that this takes place even when vacuum_indexes_mark_unused()
+	 * decided to skip index vacuuming.
+	 */
+	if (vacrelstats->hasindex)
 		update_index_statistics(Irel, indstats, nindexes);
 
-	/* If no indexes, make log report that lazy_vacuum_heap would've made */
-	if (vacuumed_pages)
+	/*
+	 * If no indexes, make log report that vacuum_indexes_mark_unused would've
+	 * made when it skipped vacuuming.
+	 *
+	 * Note: We're distinguishing between "freed" (i.e. newly made LP_DEAD
+	 * through pruning) and removed (i.e. mark_unused_page() marked LP_UNUSED).
+	 */
+	if (!vacrelstats->hasindex)
 		ereport(elevel,
 				(errmsg("\"%s\": removed %.0f row versions in %u pages",
 						vacrelstats->relname,
-						tups_vacuumed, vacuumed_pages)));
+						tups_vacuumed, reuse_marked_pages)));
 
 	initStringInfo(&buf);
 	appendStringInfo(&buf,
@@ -1779,21 +1754,96 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 }
 
 /*
- *	lazy_vacuum_all_indexes() -- vacuum all indexes of relation.
+ * Remove the collected garbage tuples from the table and its indexes.
  *
- * We process the indexes serially unless we are doing parallel vacuum.
+ * We may be able to skip index vacuuming (we may even be required to do so by
+ * reloption)
  */
+#define DEBUGELOG LOG
 static void
-lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
-						IndexBulkDeleteResult **stats,
-						LVRelStats *vacrelstats, LVParallelState *lps,
-						int nindexes)
+vacuum_indexes_mark_unused(Relation onerel, LVRelStats *vacrelstats,
+						   Relation *Irel, IndexBulkDeleteResult **indstats,
+						   int nindexes, LVParallelState *lps,
+						   BlockNumber has_dead_items_pages)
 {
-	Assert(!IsParallelWorker());
-	Assert(nindexes > 0);
+	bool		skipping = false;
+	int			tupindex;
+	int			npages;
+	PGRUsage	ru0;
+	Buffer		vmbuffer = InvalidBuffer;
+	LVSavedErrInfo saved_err_info;
 
-	/* Log cleanup info before we touch indexes */
-	vacuum_log_cleanup_info(onerel, vacrelstats);
+	/* Should not end up here with no indexes */
+	Assert(vacrelstats->hasindex);
+	Assert(nindexes > 0);
+	Assert(!IsParallelWorker());
+
+	/* In INDEX_CLEANUP off case we always skip index and heap vacuuming */
+	if (vacrelstats->mustskipindexes)
+	{
+		elog(DEBUGELOG, "must skip index vacuuming of %s", vacrelstats->relname);
+		skipping = true;
+	}
+
+	/*
+	 * Check whether or not to do index vacuum and heap vacuum.
+	 *
+	 * We do both index vacuuming and heap vacuuming if more than
+	 * SKIP_VACUUM_PAGES_RATIO of all heap pages have at least one LP_DEAD
+	 * line pointer.  Otherwise we skip them: that is normally a case where
+	 * the dead tuples are highly concentrated in relatively few heap blocks,
+	 * which is where an index deletion mechanism that is clever about heap
+	 * block dead tuple concentrations (such as btree's bottom-up index
+	 * deletion) works well.  Also, since only a few heap blocks would need
+	 * cleaning, skipping has less of a negative impact on visibility map
+	 * updates.
+	 *
+	 * If we skip vacuum, we just ignore the collected dead tuples.  Note that
+	 * vacrelstats->dead_tuples could have tuples which became dead after
+	 * HOT-pruning but are not marked dead yet.  We do not process them because
+	 * it's a very rare condition, and the next vacuum will process them anyway.
+	 */
+	else if (vacrelstats->mayskipindexes)
+	{
+		BlockNumber rel_pages_threshold;
+
+		rel_pages_threshold =
+				(double) vacrelstats->rel_pages * SKIP_VACUUM_PAGES_RATIO;
+
+		if (has_dead_items_pages < rel_pages_threshold)
+		{
+			elog(DEBUGELOG, "decided to skip index vacuuming of %s with %u LP_DEAD blocks and %u block threshold with %u blocks in total",
+				 vacrelstats->relname, has_dead_items_pages, rel_pages_threshold, vacrelstats->rel_pages);
+			skipping = true;
+		}
+		else
+		{
+			elog(DEBUGELOG, "decided not to skip index vacuuming of %s with %u LP_DEAD blocks and %u block threshold with %u blocks in total",
+				 vacrelstats->relname, has_dead_items_pages, rel_pages_threshold, vacrelstats->rel_pages);
+		}
+	}
+	else
+	{
+		elog(DEBUGELOG, "never had the choice to skip index vacuuming of %s", vacrelstats->relname);
+	}
+
+	if (skipping)
+	{
+		/*
+		 * Note: We're distinguishing between "freed" (i.e. newly made LP_DEAD
+		 * through pruning) and removed (i.e. mark_unused_page() marked
+		 * LP_UNUSED).
+		 */
+		ereport(elevel,
+				(errmsg("\"%s\": freed %d row versions in %u pages",
+						vacrelstats->relname,
+						vacrelstats->dead_tuples->num_tuples,
+						vacrelstats->rel_pages)));
+
+		vacrelstats->dead_tuples->num_tuples = 0;
+		return;
+	}
+
+	/* Okay, we're going to do index vacuuming */
 
 	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
@@ -1813,14 +1863,16 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 		lps->lvshared->reltuples = vacrelstats->old_live_tuples;
 		lps->lvshared->estimated_count = true;
 
-		lazy_parallel_vacuum_indexes(Irel, stats, vacrelstats, lps, nindexes);
+		lazy_parallel_vacuum_indexes(Irel, indstats, vacrelstats, lps,
+									 nindexes);
 	}
 	else
 	{
 		int			idx;
 
 		for (idx = 0; idx < nindexes; idx++)
-			lazy_vacuum_index(Irel[idx], &stats[idx], vacrelstats->dead_tuples,
+			lazy_vacuum_index(Irel[idx], &indstats[idx],
+							  vacrelstats->dead_tuples,
 							  vacrelstats->old_live_tuples, vacrelstats);
 	}
 
@@ -1828,28 +1880,13 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 	vacrelstats->num_index_scans++;
 	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
 								 vacrelstats->num_index_scans);
-}
 
-
-/*
- *	lazy_vacuum_heap() -- second pass over the heap
- *
- *		This routine marks dead tuples as unused and compacts out free
- *		space on their pages.  Pages not having dead tuples recorded from
- *		lazy_scan_heap are not visited at all.
- *
- * Note: the reason for doing this as a second pass is we cannot remove
- * the tuples until we've removed their index entries, and we want to
- * process index entry removal in batches as large as possible.
- */
-static void
-lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
-{
-	int			tupindex;
-	int			npages;
-	PGRUsage	ru0;
-	Buffer		vmbuffer = InvalidBuffer;
-	LVSavedErrInfo saved_err_info;
+	/*
+	 * Now mark LP_DEAD line pointers deleted from indexes as unused, and
+	 * compact out free space on pages -- this is the second heap pass.  Pages
+	 * not having dead tuples recorded from lazy_scan_heap are not visited at
+	 * all.
+	 */
 
 	/* Report that we are now vacuuming the heap */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
@@ -1882,7 +1919,7 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 			++tupindex;
 			continue;
 		}
-		tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
+		tupindex = mark_unused_page(onerel, tblk, buf, tupindex, vacrelstats,
 									&vmbuffer);
 
 		/* Now that we've compacted the page, record its available space */
@@ -1911,20 +1948,31 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 
 	/* Revert to the previous phase information for error traceback */
 	restore_vacuum_error_info(vacrelstats, &saved_err_info);
+
+	/* Forget the now-vacuumed tuples */
+	vacrelstats->dead_tuples->num_tuples = 0;
 }
 
 /*
- *	lazy_vacuum_page() -- free dead tuples on a page
- *					 and repair its fragmentation.
+ * mark_unused_page() -- mark dead line pointers on page for reuse.
  *
  * Caller must hold pin and buffer cleanup lock on the buffer.
  *
  * tupindex is the index in vacrelstats->dead_tuples of the first dead
  * tuple for this page.  We assume the rest follow sequentially.
  * The return value is the first tupindex after the tuples of this page.
+ *
+ * Prior to PostgreSQL 14 there were rare cases where this routine had to set
+ * tuples with storage to unused.  These days it is strictly responsible for
+ * marking LP_DEAD stub line pointers from pruning that took place during
+ * lazy_scan_heap() (or from existing LP_DEAD line pointers encountered
+ * there).  However, we still share infrastructure with heap pruning, and
+ * still require a super-exclusive lock -- this should now be unnecessary.  In
+ * the future we should be able to optimize this -- it can work with only an
+ * exclusive lock.
  */
 static int
-lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
+mark_unused_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 				 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer)
 {
 	LVDeadTuples *dead_tuples = vacrelstats->dead_tuples;
@@ -1954,6 +2002,8 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 			break;				/* past end of tuples for this block */
 		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
 		itemid = PageGetItemId(page, toff);
+
+		Assert(ItemIdIsDead(itemid));
 		ItemIdSetUnused(itemid);
 		unused[uncnt++] = toff;
 	}
@@ -1973,7 +2023,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 		recptr = log_heap_clean(onerel, buffer,
 								NULL, 0, NULL, 0,
 								unused, uncnt,
-								vacrelstats->latestRemovedXid);
+								InvalidTransactionId);
 		PageSetLSN(page, recptr);
 	}
 
@@ -2849,14 +2899,14 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
  * Return the maximum number of dead tuples we can record.
  */
 static long
-compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
+compute_max_dead_tuples(BlockNumber relblocks, bool hasindex)
 {
 	long		maxtuples;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
 	autovacuum_work_mem != -1 ?
 	autovacuum_work_mem : maintenance_work_mem;
 
-	if (useindex)
+	if (hasindex)
 	{
 		maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
 		maxtuples = Min(maxtuples, INT_MAX);
@@ -2886,7 +2936,7 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 	LVDeadTuples *dead_tuples = NULL;
 	long		maxtuples;
 
-	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->useindex);
+	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->hasindex);
 
 	dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
 	dead_tuples->num_tuples = 0;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c02c4e7710..1810a2e6aa 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1204,9 +1204,9 @@ backtrack:
 				 * index tuple refers to pre-cutoff heap tuples that were
 				 * certainly already pruned away during VACUUM's initial heap
 				 * scan by the time we get here. (heapam's XLOG_HEAP2_CLEAN
-				 * and XLOG_HEAP2_CLEANUP_INFO records produce conflicts using
-				 * a latestRemovedXid value for the pointed-to heap tuples, so
-				 * there is no need to produce our own conflict now.)
+				 * records produce conflicts using a latestRemovedXid value
+				 * for the pointed-to heap tuples, so there is no need to
+				 * produce our own conflict now.)
 				 *
 				 * Backends with snapshots acquired after a VACUUM starts but
 				 * before it finishes could have visibility cutoff with a
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index e60e32b935..1018ed78be 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -134,12 +134,6 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "cutoff xid %u ntuples %u",
 						 xlrec->cutoff_xid, xlrec->ntuples);
 	}
-	else if (info == XLOG_HEAP2_CLEANUP_INFO)
-	{
-		xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
-
-		appendStringInfo(buf, "latestRemovedXid %u", xlrec->latestRemovedXid);
-	}
 	else if (info == XLOG_HEAP2_VISIBLE)
 	{
 		xl_heap_visible *xlrec = (xl_heap_visible *) rec;
@@ -235,9 +229,6 @@ heap2_identify(uint8 info)
 		case XLOG_HEAP2_FREEZE_PAGE:
 			id = "FREEZE_PAGE";
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			id = "CLEANUP_INFO";
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			id = "VISIBLE";
 			break;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c064352e23..6ab6d7a431 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1880,11 +1880,18 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	onerelid = onerel->rd_lockInfo.lockRelId;
 	LockRelationIdForSession(&onerelid, lmode);
 
-	/* Set index cleanup option based on reloptions if not yet */
-	if (params->index_cleanup == VACOPT_TERNARY_DEFAULT)
+	/*
+	 * Set the index cleanup option from the reloption, though only when the
+	 * reloption is actually set -- we want VACOPT_TERNARY_DEFAULT to mean
+	 * "decide dynamically in vacuumlazy.c".
+	 *
+	 * FIXME: This doesn't work when some other reloption is set -- what we
+	 * need is a new default, 'auto' or similar.
+	 */
+	if (params->index_cleanup == VACOPT_TERNARY_DEFAULT &&
+		onerel->rd_options != NULL)
 	{
-		if (onerel->rd_options == NULL ||
-			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
+		if (((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
 			params->index_cleanup = VACOPT_TERNARY_ENABLED;
 		else
 			params->index_cleanup = VACOPT_TERNARY_DISABLED;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f596135b1..11fcd861f7 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -481,7 +481,6 @@ DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 */
 		case XLOG_HEAP2_FREEZE_PAGE:
 		case XLOG_HEAP2_CLEAN:
-		case XLOG_HEAP2_CLEANUP_INFO:
 		case XLOG_HEAP2_VISIBLE:
 		case XLOG_HEAP2_LOCK_UPDATED:
 			break;

base-commit: 0ea71c93a06ddc38e0b72e48f1d512e5383a9c1b
-- 
2.27.0

#52Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#51)
Re: New IndexAM API controlling index vacuum strategies

Hi,

On 2021-03-14 19:04:34 -0700, Peter Geoghegan wrote:

Attached is a POC-quality revision of Masahiko's
skip_index_vacuum.patch [1]. There is an improved design for skipping
index vacuuming (improved over the INDEX_CLEANUP stuff from Postgres
12). I'm particularly interested in your perspective on this
refactoring stuff, Robert, because you ran into the same issues after
initial commit of the INDEX_CLEANUP reloption feature [2] -- you ran
into issues with the "tupgone = true" special case. This is the case
where VACUUM considers a tuple dead that was not marked LP_DEAD by
pruning, and so needs to be killed in the second heap scan in
lazy_vacuum_heap() instead.

It's evil sorcery. Fragile sorcery. I think Robert, Tom, and I have all run
afoul of edge cases around it in the last few years.

But removing the awful "tupgone = true" special case seems to buy us a
lot -- it makes unifying everything relatively straightforward. In
particular, it makes it possible to delay the decision to vacuum
indexes until the last moment, which seems essential to making index
vacuuming optional.

You haven't really justified, in the patch or this email, why it's OK to
remove the whole HEAPTUPLE_DEAD part of the logic.

VACUUM can take a long time, and not removing space for all the
transactions that aborted while it wa

Note that I've merged multiple existing functions in vacuumlazy.c into
one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap()
into a single function named vacuum_indexes_mark_unused() (note also
that lazy_vacuum_page() has been renamed to mark_unused_page() to
reflect the fact that it is now strictly concerned with making LP_DEAD
line pointers LP_UNUSED).

It doesn't really seem to be *just* doing that - doing the
PageRepairFragmentation() and all-visible marking is relevant too?

For me the patch does way too many things at once, making it harder than
necessary to review, test (including later bisection). I'd much rather
see the tupgone thing addressed on its own, without further changes, and
then the rest done in separate commits subsequently.

I don't like vacuum_indexes_mark_unused() as a name. That sounds like
the index is marked unused, not index entries pointing to tuples. Don't
really like mark_unused_page() either for similar reasons - but it's not
quite as confusing.

-			if (tupgone)
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-													   &vacrelstats->latestRemovedXid);
-				tups_vacuumed += 1;
-				has_dead_items = true;
-			}
-			else
-			{
-				bool		tuple_totally_frozen;
+			num_tuples += 1;
+			hastup = true;
-				num_tuples += 1;
-				hastup = true;
+			/*
+			 * Each non-removable tuple must be checked to see if it needs
+			 * freezing.  Note we already have exclusive buffer lock.
+			 */
+			if (heap_prepare_freeze_tuple(tuple.t_data,
+										  relfrozenxid, relminmxid,
+										  FreezeLimit, MultiXactCutoff,
+										  &frozen[nfrozen],
+										  &tuple_totally_frozen))
+				frozen[nfrozen++].offset = offnum;

I'm not comfortable with this change without adding more safety
checks. If there's ever a case in which the HEAPTUPLE_DEAD case is hit
and the xid needs to be frozen, we'll either cause errors or
corruption. Yes, that's already the case with params->index_cleanup ==
DISABLED, but that's not that widely used.

See
/messages/by-id/20200724165514.dnu5hr4vvgkssf5p@alap3.anarazel.de
for some discussion around the fragility.

Greetings,

Andres Freund

#53Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#52)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Mar 15, 2021 at 12:21 PM Andres Freund <andres@anarazel.de> wrote:

It's evil sorcery. Fragile sorcery. I think Robert, Tom, and I have all run
afoul of edge cases around it in the last few years.

Right, which is why I thought that I might be missing something; why
put up with that at all for so long?

But removing the awful "tupgone = true" special case seems to buy us a
lot -- it makes unifying everything relatively straightforward. In
particular, it makes it possible to delay the decision to vacuum
indexes until the last moment, which seems essential to making index
vacuuming optional.

You haven't really justified, in the patch or this email, why it's OK to
remove the whole HEAPTUPLE_DEAD part of the logic.

I don't follow.

VACUUM can take a long time, and not removing space for all the
transactions that aborted while it wa

I guess that you trailed off here. My understanding is that removing
the special case results in practically no loss of dead tuples removed
in practice -- so there are no practical performance considerations
here.

Have I missed something?

Note that I've merged multiple existing functions in vacuumlazy.c into
one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap()
into a single function named vacuum_indexes_mark_unused() (note also
that lazy_vacuum_page() has been renamed to mark_unused_page() to
reflect the fact that it is now strictly concerned with making LP_DEAD
line pointers LP_UNUSED).

It doesn't really seem to be *just* doing that - doing the
PageRepairFragmentation() and all-visible marking is relevant too?

I wrote it in a day, just to show what I had in mind. The renaming
stuff is a part of unifying those functions, which can be discussed
after the "tupgone = true" special case is removed. It's not like I'm
set on the details that you see in the patch.

For me the patch does way too many things at once, making it harder than
necessary to review, test (including later bisection). I'd much rather
see the tupgone thing addressed on its own, without further changes, and
then the rest done in separate commits subsequently.

I agree that it should be broken up for review.

I'm not comfortable with this change without adding more safety
checks. If there's ever a case in which the HEAPTUPLE_DEAD case is hit
and the xid needs to be frozen, we'll either cause errors or
corruption. Yes, that's already the case with params->index_cleanup ==
DISABLED, but that's not that widely used.

I noticed that Noah's similar 2013 patch [1] added a defensive
heap_tuple_needs_freeze() + elog(ERROR) to the HEAPTUPLE_DEAD case. I
suppose that that's roughly what you have in mind here?

I suppose that that was pre-9.3-MultiXacts, and so now it's more complicated.
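
With today's code that defensive check would presumably take a shape like
this in lazy_scan_heap()'s HEAPTUPLE_DEAD arm (a sketch only, not something
from the patch; heap_tuple_needs_freeze() does know about MultiXacts these
days):

            case HEAPTUPLE_DEAD:

                /*
                 * Defensive only: a tuple that HTSV reports as DEAD should
                 * never also carry an XID/MultiXactId that needs freezing.
                 * If it does, something is badly wrong, so error out rather
                 * than freeze or silently keep it.
                 */
                if (heap_tuple_needs_freeze(tuple.t_data, FreezeLimit,
                                            MultiXactCutoff, buf))
                    elog(ERROR, "dead tuple needing freeze in \"%s\" page %u offset %u",
                         RelationGetRelationName(onerel), blkno, offnum);

                nkeep += 1;
                all_visible = false;
                break;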

Comments above heap_prepare_freeze_tuple() say something about making
sure that HTSV did not return HEAPTUPLE_DEAD...but that's already
possible today:

* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
* (else we should be removing the tuple, not freezing it).

Does that need work too?

See
/messages/by-id/20200724165514.dnu5hr4vvgkssf5p@alap3.anarazel.de
for some discussion around the fragility.

That's a good reference, thanks.

[1]: /messages/by-id/20130130020456.GE3524@tornado.leadboat.com
--
Peter Geoghegan

#54Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#53)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Mar 15, 2021 at 12:58 PM Peter Geoghegan <pg@bowt.ie> wrote:

I'm not comfortable with this change without adding more safety
checks. If there's ever a case in which the HEAPTUPLE_DEAD case is hit
and the xid needs to be frozen, we'll either cause errors or
corruption. Yes, that's already the case with params->index_cleanup ==
DISABLED, but that's not that widely used.

I noticed that Noah's similar 2013 patch [1] added a defensive
heap_tuple_needs_freeze() + elog(ERROR) to the HEAPTUPLE_DEAD case. I
suppose that that's roughly what you have in mind here?

I'm not sure if you're arguing that there might be (either now or in
the future) a legitimate case (a case not involving data corruption)
where we hit HEAPTUPLE_DEAD, and find we have an XID in the tuple that
needs freezing. You seem to be suggesting that even throwing an error
might not be acceptable, but what better alternative is there? Did you
just mean that we should throw a *better*, more specific error right
there, when we handle HEAPTUPLE_DEAD? (As opposed to relying on
heap_prepare_freeze_tuple() to error out instead, which is what would
happen today.)

That seems like the most reasonable interpretation of your words to
me. That is, I think that you're saying (based in part on remarks on
that other thread [1]) that you believe that fully eliminating the
"tupgone = true" special case is okay in principle, but that more
hardening is needed -- if it ever breaks we want to hear about it.
And, while it would be better to do a much broader refactor to unite
heap_prune_chain() and lazy_scan_heap(), it is not essential (because
the issue is not really new, even without VACUUM (INDEX_CLEANUP
OFF)/"params->index_cleanup == DISABLED").

Which is it?

[1]: /messages/by-id/20200724165514.dnu5hr4vvgkssf5p@alap3.anarazel.de
--
Peter Geoghegan

#55Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#53)
Re: New IndexAM API controlling index vacuum strategies

Hi,

On 2021-03-15 12:58:33 -0700, Peter Geoghegan wrote:

On Mon, Mar 15, 2021 at 12:21 PM Andres Freund <andres@anarazel.de> wrote:

It's evil sorcery. Fragile sorcery. I think Robert, Tom, and I have all run
afoul of edge cases around it in the last few years.

Right, which is why I thought that I might be missing something; why
put up with that at all for so long?

But removing the awful "tupgone = true" special case seems to buy us a
lot -- it makes unifying everything relatively straightforward. In
particular, it makes it possible to delay the decision to vacuum
indexes until the last moment, which seems essential to making index
vacuuming optional.

You haven't really justified, in the patch or this email, why it's OK to
remove the whole HEAPTUPLE_DEAD part of the logic.

I don't follow.

VACUUM can take a long time, and not removing space for all the
transactions that aborted while it wa

I guess that you trailed off here. My understanding is that removing
the special case results in practically no loss of dead tuples removed
in practice -- so there are no practical performance considerations
here.

Have I missed something?

Forget what I said above - I had intended to remove it after dislodging
something stuck in my brain... But apparently didn't :(. Sorry.

I'm not comfortable with this change without adding more safety
checks. If there's ever a case in which the HEAPTUPLE_DEAD case is hit
and the xid needs to be frozen, we'll either cause errors or
corruption. Yes, that's already the case with params->index_cleanup ==
DISABLED, but that's not that widely used.

I noticed that Noah's similar 2013 patch [1] added a defensive
heap_tuple_needs_freeze() + elog(ERROR) to the HEAPTUPLE_DEAD case. I
suppose that that's roughly what you have in mind here?

I'm not sure that's sufficient. If the case is legitimately reachable
(I'm maybe 60% sure it's not, after staring at it for a long time, but ...),
then we can't just error out when we didn't so far.

I kinda wonder whether this case should just be handled by goto'ing back
to the start of the blkno loop and redoing the pruning. The only thing
that makes that a bit more complicated is that we've already incremented
vacrelstats->{scanned_pages,tupcount_pages}.

We really should put the per-page work (i.e. the blkno loop body) of
lazy_scan_heap() into a separate function, same with the
too-many-dead-tuples branch.

Comments above heap_prepare_freeze_tuple() say something about making
sure that HTSV did not return HEAPTUPLE_DEAD...but that's already
possible today:

* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
* (else we should be removing the tuple, not freezing it).

Does that need work too?

I'm pretty scared of the index-cleanup-disabled path, for that reason. I
think the hot path is more likely to be unproblematic, but I'd not bet
my (nonexistent) farm on it.

Greetings,

Andres Freund

#56Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#54)
Re: New IndexAM API controlling index vacuum strategies

Hi,

On 2021-03-15 13:58:02 -0700, Peter Geoghegan wrote:

On Mon, Mar 15, 2021 at 12:58 PM Peter Geoghegan <pg@bowt.ie> wrote:

I'm not comfortable with this change without adding more safety
checks. If there's ever a case in which the HEAPTUPLE_DEAD case is hit
and the xid needs to be frozen, we'll either cause errors or
corruption. Yes, that's already the case with params->index_cleanup ==
DISABLED, but that's not that widely used.

I noticed that Noah's similar 2013 patch [1] added a defensive
heap_tuple_needs_freeze() + elog(ERROR) to the HEAPTUPLE_DEAD case. I
suppose that that's roughly what you have in mind here?

I'm not sure if you're arguing that there might be (either now or in
the future) a legitimate case (a case not involving data corruption)
where we hit HEAPTUPLE_DEAD, and find we have an XID in the tuple that
needs freezing. You seem to be suggesting that even throwing an error
might not be acceptable, but what better alternative is there? Did you
just mean that we should throw a *better*, more specific error right
there, when we handle HEAPTUPLE_DEAD? (As opposed to relying on
heap_prepare_freeze_tuple() to error out instead, which is what would
happen today.)

Right now (outside of the index-cleanup-disabled path), we may very well
just do the deletion successfully and correctly? So there clearly is
another option?

See my email from a few minutes ago for a somewhat crude idea for how to
tackle the issue differently...

Greetings,

Andres Freund

#57Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#55)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Mar 15, 2021 at 4:11 PM Andres Freund <andres@anarazel.de> wrote:

I'm not comfortable with this change without adding more safety
checks. If there's ever a case in which the HEAPTUPLE_DEAD case is hit
and the xid needs to be frozen, we'll either cause errors or
corruption. Yes, that's already the case with params->index_cleanup ==
DISABLED, but that's not that widely used.

I noticed that Noah's similar 2013 patch [1] added a defensive
heap_tuple_needs_freeze() + elog(ERROR) to the HEAPTUPLE_DEAD case. I
suppose that that's roughly what you have in mind here?

I'm not sure that's sufficient. If the case is legitimately reachable
(I'm maybe 60% sure it's not, after staring at it for a long time, but ...),
then we can't just error out when we didn't so far.

If you're only 60% sure that the heap_tuple_needs_freeze() error thing
doesn't break anything that should work by now then it seems unlikely
that you'll ever get past 90% sure. I think that we should make a
conservative assumption that the defensive elog(ERROR) won't be
sufficient, and proceed on that basis.

I kinda wonder whether this case should just be handled by goto'ing back
to the start of the blkno loop and redoing the pruning. The only thing
that makes that a bit more complicated is that we've already incremented
vacrelstats->{scanned_pages,tupcount_pages}.

That seems like a good solution to me -- this is a very seldom hit
path, so we can be a bit inefficient without it mattering.

It might make sense to *also* check some things (maybe using
heap_tuple_needs_freeze()) in passing, just for debugging purposes.

We really should put the per-page work (i.e. the blkno loop body) of
lazy_scan_heap() into a separate function, same with the
too-many-dead-tuples branch.

+1.

BTW I've noticed that the code (and code like it) tends to confuse
things that the current VACUUM performed with things done by *some*
VACUUM (which may or may not be the current one). This refactoring might
be a good opportunity to think about that as well.

* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
* (else we should be removing the tuple, not freezing it).

Does that need work too?

I'm pretty scared of the index-cleanup-disabled path, for that reason. I
think the hot path is more likely to be unproblematic, but I'd not bet
my (nonexistent) farm on it.

Well if we can solve the problem by simply doing pruning once again in
the event of a HEAPTUPLE_DEAD return value from the lazy_scan_heap()
HTSV call, then the comment becomes 100% true (which is not the case
even today).

--
Peter Geoghegan

#58Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#49)
Re: New IndexAM API controlling index vacuum strategies

On Sun, Mar 14, 2021 at 12:23 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Fri, Mar 12, 2021 at 9:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I agreed that when we're close to overrunning the
maintenance_work_mem space, the situation changes. If we skip it even
in that case, the next vacuum will be likely to use up
maintenance_work_mem, leading to a second index scan. Which is
bad.

If this threshold is aimed at avoiding a second index scan due to
overrunning maintenance_work_mem, using a ratio of
maintenance_work_mem would be a good idea. On the other hand, if it's
to avoid accumulating debt affecting the cost of index vacuuming,
using a ratio of the total heap tuples seems better.

It's both, together. These are two *independent*
considerations/thresholds. At least in the code that decides whether
or not we skip. Either threshold can force a full index scan (index
vacuuming).

What I'm really worried about is falling behind (in terms of the
amount of memory available for TIDs to delete in indexes) without any
natural limit. Suppose we just have the SKIP_VACUUM_PAGES_RATIO
threshold (i.e. no maintenance_work_mem threshold thing). With just
SKIP_VACUUM_PAGES_RATIO there will be lots of tables where index
vacuuming is almost always avoided, which is good. But
SKIP_VACUUM_PAGES_RATIO might be a bit *too* effective. If we have to
do 2 or even 3 scans of the index when we finally get to index
vacuuming then that's not great, it's inefficient -- but it's at least
*survivable*. But what if there are 10, 100, even 1000 bulk delete
calls for each index when it finally happens? That's completely
intolerable.

In other words, I am not worried about debt, exactly. Debt is normal
in moderation. Healthy, even. I am worried about bankruptcy, perhaps
following a rare and extreme event. It's okay to be imprecise, but all
of the problems must be survivable. The important thing to me for a
maintenance_work_mem threshold is that there is *some* limit. At the
same time, it may totally be worth accepting 2 or 3 index scans during
some eventual VACUUM operation if there are many more VACUUM
operations that don't even touch the index -- that's a good deal!
Also, it may actually be inherently necessary to accept a small risk
of having a future VACUUM operation that does multiple scans of each
index -- that is probably a necessary part of skipping index vacuuming
each time.

Think about the cost of index vacuuming (the amount of I/O and the
duration of index vacuuming) as less and less memory is available for
TIDs. It's non-linear. The cost explodes once we're past a certain
point. The truly important thing is to "never get killed by the
explosion".

Agreed.

Maybe it's a good idea to add the ratio of dead tuples to the total
heap tuples as a threshold? I think that there are two risks when we
collect many dead tuples: maintenance_work_mem overrun and LP_DEAD
accumulation, even if those are concentrated in less than 1% of heap
pages. The former risk is dealt with by the maintenance_work_mem
threshold as we discussed. But that threshold might not be enough to
deal with the latter risk. For example, a very large table could have
many dead tuples in less than 1% of its heap pages while
maintenance_work_mem is set to a high value. In that case it might be
okay in terms of index vacuuming but not in terms of the heap, so I
think we don't want to skip index vacuuming. It's an extreme case. But
we should also note that the absolute number of tuples corresponding
to 70% of maintenance_work_mem tends to increase if we improve the
memory efficiency of storing TIDs. So I think adding a "dead tuples
must be less than 50% of total heap tuples" threshold for skipping
index vacuuming would be a good safeguard against such an extreme
case.

This threshold is applied only at the last
lazy_vacuum_table_and_indexes() call, so we know the total heap tuples
at that point. If we run out of maintenance_work_mem in the middle of
the heap scan, I think we should do index vacuuming regardless of the
number of dead tuples and the number of pages having at least one
LP_DEAD.

The situation we need to deal with here is a very large table that has
a lot of dead tuples that fit in relatively few heap pages (less than
1% of all heap blocks). In this case, it's likely that the number of
dead tuples is also relatively small compared to the total heap
tuples, as you mentioned. If the dead tuples fit in few pages but
accounted for most of the heap's tuples, it would be a more serious
situation; there would definitely already be other problems. So
considering those conditions, I agreed to use a ratio of
maintenance_work_mem as a threshold. Maybe we can increase the
constant to 70, 80, or so.

You mean 70% of maintenance_work_mem? That seems fine to me.

Yes.

See my
"Why does lazy_vacuum_table_and_indexes() not make one decision for
the entire VACUUM on the first call, and then stick to its decision?"
remarks at the end of this email, though -- maybe it should not be an
explicit threshold at all.

High level philosophical point: In general I think that the algorithm
for deciding whether or not to perform index vacuuming should *not* be
clever. It should also not focus on getting the benefit of skipping
index vacuuming. I think that a truly robust design will be one that
always starts with the assumption that index vacuuming will be
skipped, and then "works backwards" by considering thresholds/reasons
to *not* skip. For example, the SKIP_VACUUM_PAGES_RATIO thing. The
risk of "explosions" or "bankruptcy" can be thought of as a cost here,
too.

We should simply focus on the costs directly, without even trying to
understand the relationship between each of the costs, and without
really trying to understand the benefit to the user from skipping
index vacuuming.

Agreed.

Question about your patch: lazy_vacuum_table_and_indexes() can be
called multiple times (when low on maintenance_work_mem). Each time it
is called we decide what to do for that call and that batch of TIDs.
But...why should it work that way? The whole idea of a
SKIP_VACUUM_PAGES_RATIO style threshold doesn't make sense to me if
the code in lazy_vacuum_table_and_indexes() resets npages_deadlp (sets
it to 0) on each call. I think that npages_deadlp should never be
reset during a single VACUUM operation.

npages_deadlp is supposed to be something that we track for the entire
table. The patch actually compares it to the size of the whole table *
SKIP_VACUUM_PAGES_RATIO inside lazy_vacuum_table_and_indexes():

+   if (*npages_deadlp > RelationGetNumberOfBlocks(onerel) * SKIP_VACUUM_PAGES_RATIO)
+   {

+ }

The code that I have quoted here is actually how I expect
SKIP_VACUUM_PAGES_RATIO to work, but I notice an inconsistency:
lazy_vacuum_table_and_indexes() resets npages_deadlp later on, which
makes either the quoted code or the reset code wrong (at least when
VACUUM needs multiple calls to the lazy_vacuum_table_and_indexes()
function). With multiple calls to lazy_vacuum_table_and_indexes() (due
to low memory), we'll be comparing npages_deadlp to the wrong thing --
because npages_deadlp cannot be treated as a proportion of the blocks
in the *whole table*. Maybe the resetting of npages_deadlp would be
okay if you also used the number of heap blocks that were considered
since the last npages_deadlp reset, and then multiply that by
SKIP_VACUUM_PAGES_RATIO (instead of
RelationGetNumberOfBlocks(onerel)). But I suspect that the real
solution is to not reset npages_deadlp at all (without changing the
quoted code, which seems basically okay).

Agreed. That was my bad.

IIUC we should use npages_deadlp throughout the entire vacuum
operation, whereas the maintenance_work_mem threshold applies to each
index/table vacuum cycle.

With tables/workloads that the patch helps a lot, we expect that the
SKIP_VACUUM_PAGES_RATIO threshold will *eventually* be crossed by one
of these VACUUM operations, which *finally* triggers index vacuuming.
So not only do we expect npages_deadlp to be tracked at the level of
the entire VACUUM operation -- we might even imagine it growing slowly
over multiple VACUUM operations, perhaps over many months. At least
conceptually -- it should only grow across VACUUM operations, until
index vacuuming finally takes place. That's my mental model for
npages_deadlp, at least. It tracks an easy to understand cost, which,
as I said, is what the
threshold/algorithm/lazy_vacuum_table_and_indexes() should focus on.

Why does lazy_vacuum_table_and_indexes() not make one decision for the
entire VACUUM on the first call, and then stick to its decision? That
seems much cleaner. Another advantage of that approach is that it
might be enough to handle low maintenance_work_mem risks -- perhaps
those can be covered by simply waiting until the first VACUUM
operation that runs out of memory and so requires multiple
lazy_vacuum_table_and_indexes() calls. If at that point we decide to
do index vacuuming throughout the entire vacuum operation, then we
will not allow the table to accumulate many more TIDs than we can
expect to fit in an entire maintenance_work_mem space.

I might be missing something, but even if
lazy_vacuum_table_and_indexes() decided to do index vacuuming
throughout the entire vacuum operation, it could still skip it if it
collects only a few dead tuples by the next index vacuuming cycle. Is
that right? Otherwise, it will end up doing index vacuuming on only a
few dead tuples.

I think we can do index vacuuming anyway if maintenance_work_mem runs
out. It's enough reason to do index vacuuming even if the collected
dead tuples are concentrated in less than 1% of total heap pages.
Assuming maintenance_work_mem won't be changed by the next vacuum
operation, it doesn't make sense to skip it at that time. So maybe we
can apply this threshold to index vacuuming called at the end of lazy
vacuum?
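
Expressed as code, the rule I have in mind is roughly this (only a sketch;
do_index_vacuuming() and its parameters are made-up names, while
npages_deadlp and SKIP_VACUUM_PAGES_RATIO come from the patch):

static bool
do_index_vacuuming(bool final_call, BlockNumber npages_deadlp,
                   BlockNumber rel_pages)
{
    /*
     * A non-final call only happens because the dead tuple space filled
     * up (we ran out of maintenance_work_mem), so do index vacuuming
     * regardless of how concentrated the dead tuples are.
     */
    if (!final_call)
        return true;

    /* Final call at the end of the heap scan: apply the skip threshold */
    return npages_deadlp > rel_pages * SKIP_VACUUM_PAGES_RATIO;
}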

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#59Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#51)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Mar 15, 2021 at 11:04 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Mar 11, 2021 at 8:31 AM Robert Haas <robertmhaas@gmail.com> wrote:

But even if not, I'm not sure this
helps much with the situation you're concerned about, which involves
non-HOT tuples.

Attached is a POC-quality revision of Masahiko's
skip_index_vacuum.patch [1]. There is an improved design for skipping
index vacuuming (improved over the INDEX_CLEANUP stuff from Postgres
12). I'm particularly interested in your perspective on this
refactoring stuff, Robert, because you ran into the same issues after
initial commit of the INDEX_CLEANUP reloption feature [2] -- you ran
into issues with the "tupgone = true" special case. This is the case
where VACUUM considers a tuple dead that was not marked LP_DEAD by
pruning, and so needs to be killed in the second heap scan in
lazy_vacuum_heap() instead. You'll recall that these issues were fixed
by your commit dd695979888 from May 2019. I think that we need to go
further than you did in dd695979888 for this -- we ought to get rid of
the special case entirely.

Thank you for the patch!

This patch makes the "VACUUM (INDEX_CLEANUP OFF)" mechanism no longer
get invoked as if it was like the "no indexes on table so do it all in
one heap pass" optimization. This seems a lot clearer -- INDEX_CLEANUP
OFF isn't able to call lazy_vacuum_page() at all (for the obvious
reason), so any similarity between the two cases was always
superficial -- skipping index vacuuming should not be confused with
doing a one-pass VACUUM/having no indexes at all. The original
INDEX_CLEANUP structure (from commits a96c41fe and dd695979) always
seemed confusing to me for this reason, FWIW.

Agreed.

Note that I've merged multiple existing functions in vacuumlazy.c into
one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap()
into a single function named vacuum_indexes_mark_unused() (note also
that lazy_vacuum_page() has been renamed to mark_unused_page() to
reflect the fact that it is now strictly concerned with making LP_DEAD
line pointers LP_UNUSED). The big idea is that there is one choke
point that decides whether index vacuuming is needed at all at one
point in time, dynamically. vacuum_indexes_mark_unused() decides this
for us at the last moment. This can only happen during a VACUUM that
has enough memory to fit all TIDs -- otherwise we won't skip anything
dynamically.

We may in the future add additional criteria for skipping index
vacuuming. That can now just be added to the beginning of this new
vacuum_indexes_mark_unused() function. We may even teach
vacuum_indexes_mark_unused() to skip some indexes but not others in a
future release, a possibility that was already discussed at length
earlier in this thread. This new structure has all the context it
needs to do all of these things.

I agree with creating a function like vacuum_indexes_mark_unused() that
makes a decision and does index and heap vacuuming accordingly. But
what is the point of removing both lazy_vacuum_all_indexes() and
lazy_vacuum_heap()? I think we can simply have
vacuum_indexes_mark_unused() call those functions. I'm concerned that
if we add additional criteria to vacuum_indexes_mark_unused() in the
future, the function will become very large.
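
In other words, something like this (a sketch only;
decide_skip_index_vacuum() is a hypothetical helper, not a function that
exists in the patch):

static void
vacuum_indexes_mark_unused(Relation onerel, LVRelStats *vacrelstats,
                           Relation *Irel, IndexBulkDeleteResult **indstats,
                           int nindexes, LVParallelState *lps,
                           BlockNumber has_dead_items_pages)
{
    if (decide_skip_index_vacuum(vacrelstats, has_dead_items_pages))
    {
        /* Just forget the collected dead tuples */
        vacrelstats->dead_tuples->num_tuples = 0;
        return;
    }

    /* Work on all the indexes, then the heap, as before */
    lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats, lps, nindexes);
    lazy_vacuum_heap(onerel, vacrelstats);
}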

I wonder if we can add some kind of emergency anti-wraparound vacuum
logic to what I have here, for Postgres 14. Can we come up with logic
that has us skip index vacuuming because XID wraparound is on the
verge of causing an outage? That seems like a strategically important
thing for Postgres, so perhaps we should try to get something like
that in. Practically every post mortem blog post involving Postgres
also involves anti-wraparound vacuum.

+1

I think we can set tab->at_params.index_cleanup to
VACOPT_TERNARY_DISABLED in table_recheck_autovac(), or increase the
thresholds used to decide not to skip index vacuuming.
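
For the autovacuum side, perhaps something as simple as this in
table_recheck_autovac() (a sketch; xid_emergency_limit is a made-up cutoff
that would need a proper definition):

    /* Force INDEX_CLEANUP off when we are dangerously close to wraparound */
    if (TransactionIdIsNormal(classForm->relfrozenxid) &&
        TransactionIdPrecedes(classForm->relfrozenxid, xid_emergency_limit))
        tab->at_params.index_cleanup = VACOPT_TERNARY_DISABLED;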

One consequence of my approach is that we now call
lazy_cleanup_all_indexes(), even when we've skipped index vacuuming
itself. We should at least "check-in" with the indexes IMV. To an
index AM, this will be indistinguishable from a VACUUM that never had
tuples for it to delete, and so never called ambulkdelete() before
calling amvacuumcleanup(). This seems logical to me: why should there
be any significant behavioral divergence between the case where there
are 0 tuples to delete and the case where there is 1 tuple to delete?
The extra work that we perform in amvacuumcleanup() (if any) should
almost always be a no-op in nbtree following my recent refactoring
work. More generally, if an index AM is doing too much during cleanup,
and this becomes a bottleneck, then IMV that's a problem that needs to
be fixed in the index AM.

Aside from whether it's good or bad, strictly speaking, it could
change the index AM API contract. The documentation of
amvacuumcleanup() says:

---
stats is whatever the last ambulkdelete call returned, or NULL if
ambulkdelete was not called because no tuples needed to be deleted.
---

With this change, we could pass NULL to amvacuumcleanup even though
the index might have tuples that needed to be deleted.
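
For example, an index AM written against the documented contract might
reasonably look like this (a hypothetical AM, shown only to illustrate the
point):

#include "postgres.h"
#include "access/genam.h"

IndexBulkDeleteResult *
examplevacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
{
    /*
     * Today, stats == NULL is documented to mean that ambulkdelete() was
     * not called because no tuples needed to be deleted.  With the
     * proposed change it can also mean that index vacuuming was skipped
     * even though there were dead tuples, so an AM relying on the old
     * meaning could be surprised here.
     */
    if (stats == NULL)
        return NULL;

    /* ... normal cleanup work using the stats from ambulkdelete() ... */
    return stats;
}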

Masahiko: Note that I've also changed the SKIP_VACUUM_PAGES_RATIO
logic to never reset the count of heap blocks with one or more LP_DEAD
line pointers, per remarks in a recent email [5] -- that's now a table
level count of heap blocks. What do you think of that aspect?

Yeah, I agree with that change.

As I mentioned in a recent reply, I'm concerned about a case where we
ran out of maintenance_work_mem and decided not to skip index vacuuming,
but collected only a few dead tuples in the second index vacuuming
(i.e., the total amount of dead tuples is slightly larger than
maintenance_work_mem). In this case, I think we can skip the second
(i.e., final) index vacuuming, at least in terms of
maintenance_work_mem. Maybe the same is true in terms of LP_DEAD
accumulation.

(BTW, I
pushed your fix for the "not setting has_dead_tuples/has_dead_items
variable" issue today, just to get it out of the way.)

Thanks!

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#60Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#59)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Mar 16, 2021 at 10:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Mar 15, 2021 at 11:04 AM Peter Geoghegan <pg@bowt.ie> wrote:

One consequence of my approach is that we now call
lazy_cleanup_all_indexes(), even when we've skipped index vacuuming
itself. We should at least "check-in" with the indexes IMV. To an
index AM, this will be indistinguishable from a VACUUM that never had
tuples for it to delete, and so never called ambulkdelete() before
calling amvacuumcleanup(). This seems logical to me: why should there
be any significant behavioral divergence between the case where there
are 0 tuples to delete and the case where there is 1 tuple to delete?
The extra work that we perform in amvacuumcleanup() (if any) should
almost always be a no-op in nbtree following my recent refactoring
work. More generally, if an index AM is doing too much during cleanup,
and this becomes a bottleneck, then IMV that's a problem that needs to
be fixed in the index AM.

Aside from whether it's good or bad, strictly speaking, it could
change the index AM API contract. The documentation of
amvacuumcleanup() says:

---
stats is whatever the last ambulkdelete call returned, or NULL if
ambulkdelete was not called because no tuples needed to be deleted.
---

With this change, we could pass NULL to amvacuumcleanup even though
the index might have tuples that need to be deleted.

It seems there is no problem with that change, at least for the
built-in index AMs. So +1 for this change. We would need to update the
docs slightly to reflect it.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#61Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#59)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Mar 16, 2021 at 6:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Note that I've merged multiple existing functions in vacuumlazy.c into
one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap()
into a single function named vacuum_indexes_mark_unused()

I agree to create a function like vacuum_indexes_mark_unused() that
makes a decision and does index and heap vacuuming accordingly. But
what is the point of removing both lazy_vacuum_all_indexes() and
lazy_vacuum_heap()? I think we can simply have
vacuum_indexes_mark_unused() call those functions. I'm concerned that
if we add additional criteria to vacuum_indexes_mark_unused() in the
future the function will become very large.

I agree now. I became overly excited about advertising the fact that
these two functions are logically one thing. This is important, but it
isn't necessary to go as far as actually making everything into one
function. Adding some comments would also make that point clear, but
without making vacuumlazy.c even more spaghetti-like. I'll fix it.

I wonder if we can add some kind of emergency anti-wraparound vacuum
logic to what I have here, for Postgres 14.

+1

I think we can set tab->at_params.index_cleanup to
VACOPT_TERNARY_DISABLED in table_recheck_autovac(), or increase
the thresholds used to decide whether to skip index vacuuming.

I was worried about the "tupgone = true" special case causing problems
when we decide to do some index vacuuming and some heap
vacuuming/LP_UNUSED-marking but then later decide to end the VACUUM.
But I now think that getting rid of "tupgone = true" gives us total
freedom to
choose what to do, including the freedom to start with index vacuuming
and then give up on it later -- even after we do some amount of
LP_UNUSED-marking (during a VACUUM with multiple index passes, perhaps
due to a low maintenance_work_mem setting). That isn't actually
special at all, because everything will be 100% decoupled.

Whether or not it's a good idea to either not start index vacuuming or
to end it early (e.g. due to XID wraparound pressure) is a complicated
question, and the right approach will be debatable in each case/when
thinking about each issue. However, deciding on the best behavior to
address these problems should have nothing to do with implementation
details and everything to do with the costs and benefits to users.
Which now seems possible.

A sophisticated version of the "XID wraparound pressure"
implementation could increase reliability without ever being
conservative, just by evaluating the situation regularly and being
prepared to change course when necessary -- until it is truly a matter
of shutting down new XID allocations/the server. It should be possible
to decide to end VACUUM early and advance relfrozenxid (provided we
have reached the point of finishing all required pruning and freezing,
of course). Highly agile behavior seems quite possible, even if it
takes a while to agree on a good design.
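
Purely illustrative (EMERGENCY_XID_AGE and the stop_index_vacuuming flag
are invented names, and the real policy needs a lot more thought), the
re-evaluation could be as simple as something like this inside the main
heap scan:

	/* Re-check wraparound pressure every so often during the heap scan */
	if (blkno % 4096 == 0)
	{
		TransactionId	nextXid = ReadNextTransactionId();

		if (TransactionIdIsNormal(onerel->rd_rel->relfrozenxid) &&
			nextXid - onerel->rd_rel->relfrozenxid > EMERGENCY_XID_AGE)
			stop_index_vacuuming = true;	/* keep pruning/freezing only */
	}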

Aside from whether it's good or bad, strictly speaking, it could
change the index AM API contract. The documentation of
amvacuumcleanup() says:

---
stats is whatever the last ambulkdelete call returned, or NULL if
ambulkdelete was not called because no tuples needed to be deleted.
---

With this change, we could pass NULL to amvacuumcleanup even though
the index might have tuples that need to be deleted.

I think that we should add a "Note" box to the documentation that
notes the difference here. Though FWIW my interpretation of the words
"no tuples needed to be deleted" was always "no tuples needed to be
deleted because vacuumlazy.c didn't call ambulkdelete()". After all,
VACUUM can add to tups_vacuumed through pruning inside
heap_prune_chain(). It is already possible (though only barely) to not
call ambulkdelete() for indexes (to only call amvacuumcleanup() during
cleanup) despite the fact that heap vacuuming did "delete tuples".

It's not that important, but my point is that the design has always
been top-down -- an index AM "needs to delete" whatever it is told it
needs to delete. It has no direct understanding of any higher-level
issues.

As I mentioned in a recent reply, I'm concerned about a case where we
ran out of maintenance_work_mem and decided not to skip index vacuuming,
but then collected only a few dead tuples for the second index vacuuming
(i.e., the total number of dead tuples is only slightly larger than what
fits in maintenance_work_mem). In this case, I think we can skip the
second (i.e., final) index vacuuming, at least as far as
maintenance_work_mem is concerned. Maybe the same is true in terms of
LP_DEAD accumulation.

I remember that. That now seems very doable, but time grows short...

I have already prototyped Andres' idea, which was to eliminate the
HEAPTUPLE_DEAD case inside lazy_scan_heap() by restarting pruning for
the same page. I've also moved the pruning into its own function
called lazy_scan_heap_page(), because handling the restart requires
that we be careful about not incrementing things until we're sure we
won't need to repeat pruning.

This seems to work well, and the tests all pass. What I have right now
is still too rough to post to the list, though.

Even with a pg_usleep(10000) after the call to heap_page_prune() but
before the second/local HeapTupleSatisfiesVacuum() call, we almost
never actually hit the HEAPTUPLE_DEAD case. So the overhead must be
absolutely negligible. Adding a "goto restart" to the HEAPTUPLE_DEAD
case is ugly, but the "tupgone = true" thing is an abomination, so
that seems okay. This approach definitely seems like the way forward,
because it's obviously safe -- it may even be safer, because
heap_prepare_freeze_tuple() kind of expects this behavior today.
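
In outline, the restart looks something like this (heavily simplified --
freezing and all of the counter handling are omitted, and none of the
naming is final):

static void
lazy_scan_heap_page(Relation onerel, Buffer buf, LVRelStats *vacrelstats,
					GlobalVisState *vistest)
{
	Page		page = BufferGetPage(buf);
	BlockNumber blkno = BufferGetBlockNumber(buf);
	OffsetNumber offnum,
				maxoff;

retry:
	/* Prune first; with the patch this no longer returns latestRemovedXid */
	(void) heap_page_prune(onerel, buf, vistest, InvalidTransactionId, 0,
						   false, &vacrelstats->offnum);

	maxoff = PageGetMaxOffsetNumber(page);
	for (offnum = FirstOffsetNumber; offnum <= maxoff;
		 offnum = OffsetNumberNext(offnum))
	{
		ItemId		itemid = PageGetItemId(page, offnum);
		HeapTupleData tuple;

		if (!ItemIdIsNormal(itemid))
			continue;			/* LP_UNUSED, LP_REDIRECT, or LP_DEAD */

		ItemPointerSet(&(tuple.t_self), blkno, offnum);
		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
		tuple.t_len = ItemIdGetLength(itemid);
		tuple.t_tableOid = RelationGetRelid(onerel);

		if (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf) == HEAPTUPLE_DEAD)
		{
			/*
			 * Rare: a concurrent abort made this tuple DEAD after
			 * heap_page_prune() examined it.  Prune again so that it becomes
			 * an LP_DEAD stub, instead of keeping a "tupgone"-style special
			 * case.  We haven't updated any per-page counters yet, so
			 * restarting is safe.
			 */
			goto retry;
		}

		/* ... freezing and live/recently-dead accounting goes here ... */
	}
}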

--
Peter Geoghegan

#62Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#61)
1 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Mar 17, 2021 at 7:21 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Mar 16, 2021 at 6:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Note that I've merged multiple existing functions in vacuumlazy.c into
one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap()
into a single function named vacuum_indexes_mark_unused()

I agree to create a function like vacuum_indexes_mark_unused() that
makes a decision and does index and heap vacuuming accordingly. But
what is the point of removing both lazy_vacuum_all_indexes() and
lazy_vacuum_heap()? I think we can simply have
vacuum_indexes_mark_unused() call those functions. I'm concerned that
if we add additional criteria to vacuum_indexes_mark_unused() in the
future the function will become very large.

I agree now. I became overly excited about advertising the fact that
these two functions are logically one thing. This is important, but it
isn't necessary to go as far as actually making everything into one
function. Adding some comments would also make that point clear, but
without making vacuumlazy.c even more spaghetti-like. I'll fix it.

I wonder if we can add some kind of emergency anti-wraparound vacuum
logic to what I have here, for Postgres 14.

+1

I think we can set tab->at_params.index_cleanup to
VACOPT_TERNARY_DISABLED in table_recheck_autovac(), or increase
the thresholds used to decide whether to skip index vacuuming.

I was worried about the "tupgone = true" special case causing problems
when we decide to do some index vacuuming and some heap
vacuuming/LP_UNUSED-marking but then later decide to end the VACUUM.
But I now think that getting rid of "tupgone = true" gives us total
freedom to
choose what to do, including the freedom to start with index vacuuming
and then give up on it later -- even after we do some amount of
LP_UNUSED-marking (during a VACUUM with multiple index passes, perhaps
due to a low maintenance_work_mem setting). That isn't actually
special at all, because everything will be 100% decoupled.

Whether or not it's a good idea to either not start index vacuuming or
to end it early (e.g. due to XID wraparound pressure) is a complicated
question, and the right approach will be debatable in each case/when
thinking about each issue. However, deciding on the best behavior to
address these problems should have nothing to do with implementation
details and everything to do with the costs and benefits to users.
Which now seems possible.

A sophisticated version of the "XID wraparound pressure"
implementation could increase reliability without ever being
conservative, just by evaluating the situation regularly and being
prepared to change course when necessary -- until it is truly a matter
of shutting down new XID allocations/the server. It should be possible
to decide to end VACUUM early and advance relfrozenxid (provided we
have reached the point of finishing all required pruning and freezing,
of course). Highly agile behavior seems quite possible, even if it
takes a while to agree on a good design.

Since I was thinking that always skipping index vacuuming during
anti-wraparound autovacuum is legitimate, skipping index vacuuming
only when we're really close to the point of going into read-only mode
seems a bit conservative, but maybe it's a good start. I've attached a
PoC patch that disables index vacuuming if the table's relfrozenxid is
much older than autovacuum_freeze_max_age (more than 1.5x
autovacuum_freeze_max_age XIDs old).

Aside from whether it's good or bad, strictly speaking, it could
change the index AM API contract. The documentation of
amvacuumcleanup() says:

---
stats is whatever the last ambulkdelete call returned, or NULL if
ambulkdelete was not called because no tuples needed to be deleted.
---

With this change, we could pass NULL to amvacuumcleanup even though
the index might have tuples that need to be deleted.

I think that we should add a "Note" box to the documentation that
notes the difference here. Though FWIW my interpretation of the words
"no tuples needed to be deleted" was always "no tuples needed to be
deleted because vacuumlazy.c didn't call ambulkdelete()". After all,
VACUUM can add to tups_vacuumed through pruning inside
heap_prune_chain(). It is already possible (though only barely) to not
call ambulkdelete() for indexes (to only call amvacuumcleanup() during
cleanup) despite the fact that heap vacuuming did "delete tuples".

Agreed.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

poc_skip_index_cleanup_at_anti_wraparound.patch
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c13e..113ddf1f5b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -331,12 +331,14 @@ static autovac_table *table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 static void recheck_relation_needs_vacanalyze(Oid relid, AutoVacOpts *avopts,
 											  Form_pg_class classForm,
 											  int effective_multixact_freeze_max_age,
-											  bool *dovacuum, bool *doanalyze, bool *wraparound);
+											  bool *dovacuum, bool *doanalyze, bool *wraparound,
+											  bool *skip_index_cleanup);
 static void relation_needs_vacanalyze(Oid relid, AutoVacOpts *relopts,
 									  Form_pg_class classForm,
 									  PgStat_StatTabEntry *tabentry,
 									  int effective_multixact_freeze_max_age,
-									  bool *dovacuum, bool *doanalyze, bool *wraparound);
+									  bool *dovacuum, bool *doanalyze, bool *wraparound,
+									  bool *skip_index_cleanup);
 
 static void autovacuum_do_vac_analyze(autovac_table *tab,
 									  BufferAccessStrategy bstrategy);
@@ -2080,6 +2082,7 @@ do_autovacuum(void)
 		bool		dovacuum;
 		bool		doanalyze;
 		bool		wraparound;
+		bool		skip_index_cleanup;
 
 		if (classForm->relkind != RELKIND_RELATION &&
 			classForm->relkind != RELKIND_MATVIEW)
@@ -2120,7 +2123,8 @@ do_autovacuum(void)
 		/* Check if it needs vacuum or analyze */
 		relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
 								  effective_multixact_freeze_max_age,
-								  &dovacuum, &doanalyze, &wraparound);
+								  &dovacuum, &doanalyze, &wraparound,
+								  &skip_index_cleanup);
 
 		/* Relations that need work are added to table_oids */
 		if (dovacuum || doanalyze)
@@ -2173,6 +2177,7 @@ do_autovacuum(void)
 		bool		dovacuum;
 		bool		doanalyze;
 		bool		wraparound;
+		bool		skip_index_cleanup;
 
 		/*
 		 * We cannot safely process other backends' temp tables, so skip 'em.
@@ -2203,7 +2208,8 @@ do_autovacuum(void)
 
 		relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
 								  effective_multixact_freeze_max_age,
-								  &dovacuum, &doanalyze, &wraparound);
+								  &dovacuum, &doanalyze, &wraparound,
+								  &skip_index_cleanup);
 
 		/* ignore analyze for toast tables */
 		if (dovacuum)
@@ -2801,6 +2807,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 	bool		doanalyze;
 	autovac_table *tab = NULL;
 	bool		wraparound;
+	bool		skip_index_cleanup;
 	AutoVacOpts *avopts;
 	static bool reuse_stats = false;
 
@@ -2842,7 +2849,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 	{
 		recheck_relation_needs_vacanalyze(relid, avopts, classForm,
 										  effective_multixact_freeze_max_age,
-										  &dovacuum, &doanalyze, &wraparound);
+										  &dovacuum, &doanalyze, &wraparound,
+										  &skip_index_cleanup);
 
 		/* Quick exit if a relation doesn't need to be vacuumed or analyzed */
 		if (!doanalyze && !dovacuum)
@@ -2857,7 +2865,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 
 	recheck_relation_needs_vacanalyze(relid, avopts, classForm,
 									  effective_multixact_freeze_max_age,
-									  &dovacuum, &doanalyze, &wraparound);
+									  &dovacuum, &doanalyze, &wraparound,
+									  &skip_index_cleanup);
 
 	/* OK, it needs something done */
 	if (doanalyze || dovacuum)
@@ -2923,7 +2932,9 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.options = (dovacuum ? VACOPT_VACUUM : 0) |
 			(doanalyze ? VACOPT_ANALYZE : 0) |
 			(!wraparound ? VACOPT_SKIP_LOCKED : 0);
-		tab->at_params.index_cleanup = VACOPT_TERNARY_DEFAULT;
+		tab->at_params.index_cleanup = (skip_index_cleanup
+										? VACOPT_TERNARY_DISABLED
+										: VACOPT_TERNARY_DEFAULT);
 		tab->at_params.truncate = VACOPT_TERNARY_DEFAULT;
 		/* As of now, we don't support parallel vacuum for autovacuum */
 		tab->at_params.nworkers = -1;
@@ -2982,7 +2993,8 @@ recheck_relation_needs_vacanalyze(Oid relid,
 								  int effective_multixact_freeze_max_age,
 								  bool *dovacuum,
 								  bool *doanalyze,
-								  bool *wraparound)
+								  bool *wraparound,
+								  bool *skip_index_cleanup)
 {
 	PgStat_StatTabEntry *tabentry;
 	PgStat_StatDBEntry *shared = NULL;
@@ -2999,7 +3011,8 @@ recheck_relation_needs_vacanalyze(Oid relid,
 
 	relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
 							  effective_multixact_freeze_max_age,
-							  dovacuum, doanalyze, wraparound);
+							  dovacuum, doanalyze, wraparound,
+							  skip_index_cleanup);
 
 	/* ignore ANALYZE for toast tables */
 	if (classForm->relkind == RELKIND_TOASTVALUE)
@@ -3011,7 +3024,8 @@ recheck_relation_needs_vacanalyze(Oid relid,
  *
  * Check whether a relation needs to be vacuumed or analyzed; return each into
  * "dovacuum" and "doanalyze", respectively.  Also return whether the vacuum is
- * being forced because of Xid or multixact wraparound.
+ * being forced because of Xid or multixact wraparound and whether or not to skip
+ * index vacuuming.
  *
  * relopts is a pointer to the AutoVacOpts options (either for itself in the
  * case of a plain table, or for either itself or its parent table in the case
@@ -3052,7 +3066,8 @@ relation_needs_vacanalyze(Oid relid,
  /* output params below */
 						  bool *dovacuum,
 						  bool *doanalyze,
-						  bool *wraparound)
+						  bool *wraparound,
+						  bool *skip_index_cleanup)
 {
 	bool		force_vacuum;
 	bool		av_enabled;
@@ -3207,6 +3222,40 @@ relation_needs_vacanalyze(Oid relid,
 	/* ANALYZE refuses to work with pg_statistic */
 	if (relid == StatisticRelationId)
 		*doanalyze = false;
+
+	/*
+	 * If a table is at risk of wraparound, we further check whether the table's
+	 * relfrozenxid is much older than autovacuum_freeze_max_age (more than
+	 * autovacuum_freeze_max_age * 1.5 XIDs old).  If so, we skip index vacuuming
+	 * so that the vacuum completes quickly and advances relfrozenxid.
+	 */
+	if (force_vacuum)
+	{
+		TransactionId	xidSkipIndCleanupLimit;
+		MultiXactId		multiSkipIndCleanupLimit;
+
+		freeze_max_age = Min(freeze_max_age * 1.5, MAX_AUTOVACUUM_FREEZE_MAX_AGE);
+		xidSkipIndCleanupLimit = recentXid - freeze_max_age;
+		if (xidSkipIndCleanupLimit < FirstNormalTransactionId)
+			xidSkipIndCleanupLimit -= FirstNormalTransactionId;
+
+		*skip_index_cleanup = (TransactionIdIsNormal(classForm->relfrozenxid) &&
+							  TransactionIdPrecedes(classForm->relfrozenxid,
+													xidSkipIndCleanupLimit));
+
+		if (!(*skip_index_cleanup))
+		{
+			multixact_freeze_max_age = Min(multixact_freeze_max_age * 1.5,
+										   MAX_AUTOVACUUM_FREEZE_MAX_AGE);
+			multiSkipIndCleanupLimit = recentMulti - multixact_freeze_max_age;
+			if (multiSkipIndCleanupLimit < FirstMultiXactId)
+				multiSkipIndCleanupLimit -= FirstMultiXactId;
+
+			*skip_index_cleanup = (MultiXactIdIsValid(classForm->relminmxid) &&
+								   MultiXactIdPrecedes(classForm->relminmxid,
+													   multiSkipIndCleanupLimit));
+		}
+	}
 }
 
 /*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 855076b1fd..26ba3fdbbe 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3197,7 +3197,7 @@ static struct config_int ConfigureNamesInt[] =
 		},
 		&autovacuum_freeze_max_age,
 		/* see pg_resetwal if you change the upper-limit value */
-		200000000, 100000, 2000000000,
+		200000000, 100000, MAX_AUTOVACUUM_FREEZE_MAX_AGE,
 		NULL, NULL, NULL
 	},
 	{
@@ -3207,7 +3207,7 @@ static struct config_int ConfigureNamesInt[] =
 			NULL
 		},
 		&autovacuum_multixact_freeze_max_age,
-		400000000, 10000, 2000000000,
+		400000000, 10000, MAX_AUTOVACUUM_FREEZE_MAX_AGE,
 		NULL, NULL, NULL
 	},
 	{
diff --git a/src/include/postmaster/autovacuum.h b/src/include/postmaster/autovacuum.h
index aacdd0f575..e56e0d73ad 100644
--- a/src/include/postmaster/autovacuum.h
+++ b/src/include/postmaster/autovacuum.h
@@ -16,6 +16,12 @@
 
 #include "storage/block.h"
 
+/*
+ * Maximum value of autovacuum_freeze_max_age and
+ * autovacuum_multixact_freeze_max_age parameters.
+ */
+#define MAX_AUTOVACUUM_FREEZE_MAX_AGE	2000000000
+
 /*
  * Other processes can request specific work from autovacuum, identified by
  * AutoVacuumWorkItem elements.

#63Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#55)
2 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Mar 15, 2021 at 4:11 PM Andres Freund <andres@anarazel.de> wrote:

I kinda wonder whether this case should just be handled by just gotoing
back to the start of the blkno loop, and redoing the pruning. The only
thing that makes that a bit more complicated is that we've already
incremented vacrelstats->{scanned_pages,vacrelstats->tupcount_pages}.

We really should put the per-page work (i.e. the blkno loop body) of
lazy_scan_heap() into a separate function, same with the
too-many-dead-tuples branch.

Attached patch series splits everything up. There is now a large patch
that removes the tupgone special case, and a second patch that
actually adds code that dynamically decides to not do index vacuuming
in cases where (for whatever reason) it doesn't seem useful.

Here are some key points about the first patch:

* Eliminates the "tupgone = true" special case by putting pruning, the
HTSV() call, and tuple freezing into a new, dedicated function
-- the function is prepared to restart pruning in those rare cases
where the vacuumlazy.c HTSV() call indicates that a tuple is dead.
Restarting prunes the page a second time, rendering the DEAD tuple
with storage an LP_DEAD line pointer stub.

The restart thing is based on Andres' suggestion.

This patch enables incremental VACUUM (the second patch, and likely
other variations) by allowing us to make a uniform assumption that it
is never strictly necessary to reach lazy_vacuum_all_indexes() or
lazy_vacuum_heap(). It is now possible to "end VACUUM early" while
still advancing relfrozenxid. Provided we've finished the first scan
of the heap, that should be safe.

* In principle we could visit and revisit the question of whether or
not vacuuming should continue or end early many times, as new
information comes to light. For example, maybe Masahiko's patch from
today could be taught to monitor how old relfrozenxid is again and
again, before finally giving up early when the risk of wraparound
becomes very severe -- but not before then.

* I've added three structs that replace the blizzard of local variables
we used in lazy_scan_heap() with just three (one variable for each of
the three structs). I've also moved several chunks of logic to other
new functions (in addition to the one that does pruning and freezing).

I think that I have the data structures roughly right here -- but I
would appreciate some feedback on that. Does this seem like the right
direction?

--
Peter Geoghegan

Attachments:

v3-0002-Skip-index-vacuuming-dynamically.patch
From 66eaa24f8d3ebb3ca7e4311c24cc7240b1de3fd3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 17 Mar 2021 18:27:36 -0700
Subject: [PATCH v3 2/2] Skip index vacuuming dynamically.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAD21AoAtZb4+HJT_8RoOXvu4HM-Zd4HKS3YSMCH6+-W=bDyh-w@mail.gmail.com
---
 src/include/commands/vacuum.h          |   3 +-
 src/include/utils/rel.h                |  10 +-
 src/backend/access/common/reloptions.c |  39 ++++++--
 src/backend/access/heap/vacuumlazy.c   | 127 ++++++++++++++++++++-----
 src/backend/commands/vacuum.c          |  33 ++++---
 5 files changed, 167 insertions(+), 45 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d029da5ac0..2c7c18829d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -21,6 +21,7 @@
 #include "parser/parse_node.h"
 #include "storage/buf.h"
 #include "storage/lock.h"
+#include "utils/rel.h"
 #include "utils/relcache.h"
 
 /*
@@ -216,7 +217,7 @@ typedef struct VacuumParams
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which  verbose logs are activated, -1
 									 * to use default */
-	VacOptTernaryValue index_cleanup;	/* Do index vacuum and cleanup,
+	VacOptIndexCleanupValue index_cleanup;	/* Do index vacuum and cleanup,
 										 * default value depends on reloptions */
 	VacOptTernaryValue truncate;	/* Truncate empty pages at the end,
 									 * default value depends on reloptions */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 10b63982c0..9cd1922941 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -295,6 +295,13 @@ typedef struct AutoVacOpts
 	float8		analyze_scale_factor;
 } AutoVacOpts;
 
+typedef enum VacOptIndexCleanupValue
+{
+	VACOPT_CLEANUP_AUTO = 0,
+	VACOPT_CLEANUP_DISABLED,
+	VACOPT_CLEANUP_ENABLED
+} VacOptIndexCleanupValue;
+
 typedef struct StdRdOptions
 {
 	int32		vl_len_;		/* varlena header (do not touch directly!) */
@@ -304,7 +311,8 @@ typedef struct StdRdOptions
 	AutoVacOpts autovacuum;		/* autovacuum-related options */
 	bool		user_catalog_table; /* use as an additional catalog relation */
 	int			parallel_workers;	/* max number of parallel workers */
-	bool		vacuum_index_cleanup;	/* enables index vacuuming and cleanup */
+	VacOptIndexCleanupValue	vacuum_index_cleanup;	/* enables index vacuuming and
+												 * cleanup */
 	bool		vacuum_truncate;	/* enables vacuum to truncate a relation */
 } StdRdOptions;
 
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d897bbec2b..9e328a5523 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -140,15 +140,6 @@ static relopt_bool boolRelOpts[] =
 		},
 		false
 	},
-	{
-		{
-			"vacuum_index_cleanup",
-			"Enables index vacuuming and index cleanup",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
-			ShareUpdateExclusiveLock
-		},
-		true
-	},
 	{
 		{
 			"vacuum_truncate",
@@ -492,6 +483,23 @@ relopt_enum_elt_def viewCheckOptValues[] =
 	{(const char *) NULL}		/* list terminator */
 };
 
+/*
+ * Values of VacOptIndexCleanupValue for the index_cleanup reloption.
+ * Accepting boolean spellings other than "on" and "off" is for
+ * backward compatibility, as the option used to be a plain
+ * boolean.
+ */
+relopt_enum_elt_def vacOptTernaryOptValues[] =
+{
+	{"auto", VACOPT_CLEANUP_AUTO},
+	{"true", VACOPT_CLEANUP_ENABLED},
+	{"false", VACOPT_CLEANUP_DISABLED},
+	{"on", VACOPT_CLEANUP_ENABLED},
+	{"off", VACOPT_CLEANUP_DISABLED},
+	{"1", VACOPT_CLEANUP_ENABLED},
+	{"0", VACOPT_CLEANUP_DISABLED}
+};
+
 static relopt_enum enumRelOpts[] =
 {
 	{
@@ -516,6 +524,17 @@ static relopt_enum enumRelOpts[] =
 		VIEW_OPTION_CHECK_OPTION_NOT_SET,
 		gettext_noop("Valid values are \"local\" and \"cascaded\".")
 	},
+	{
+		{
+			"vacuum_index_cleanup",
+			"Enables index vacuuming and index cleanup",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		vacOptTernaryOptValues,
+		VACOPT_CLEANUP_AUTO,
+		gettext_noop("Valid values are \"on\", \"off\", and \"auto\".")
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1856,7 +1875,7 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, user_catalog_table)},
 		{"parallel_workers", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, parallel_workers)},
-		{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
+		{"vacuum_index_cleanup", RELOPT_TYPE_ENUM,
 		offsetof(StdRdOptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
 		offsetof(StdRdOptions, vacuum_truncate)}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 960d34b627..0bed78bd17 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -131,6 +131,12 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+/*
+ * The threshold of the percentage of heap blocks having LP_DEAD line pointer
+ * above which index vacuuming goes ahead.
+ */
+#define SKIP_VACUUM_PAGES_RATIO		0.01
+
 /*
  * DSM keys for parallel vacuum.  Unlike other parallel execution code, since
  * we don't need to worry about DSM keys conflicting with plan_node_id we can
@@ -382,7 +388,8 @@ static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 static void two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
 							  Relation *Irel, IndexBulkDeleteResult **indstats,
 							  int nindexes, LVParallelState *lps,
-							  VacOptTernaryValue index_cleanup);
+							  VacOptIndexCleanupValue index_cleanup,
+							  BlockNumber has_dead_items_pages, bool onecall);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
@@ -485,7 +492,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	PgStat_Counter startwritetime = 0;
 
 	Assert(params != NULL);
-	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
 	/* measure elapsed time iff autovacuum logging requires it */
@@ -1320,11 +1326,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	};
 	int64		initprog_val[3];
 	GlobalVisState *vistest;
+	bool			calledtwopass = false;
 	LVTempCounters c;
 
 	/* Counters of # blocks in onerel: */
 	BlockNumber empty_pages,
-				vacuumed_pages;
+				vacuumed_pages,
+				has_dead_items_pages;
 
 	pg_rusage_init(&ru0);
 
@@ -1339,7 +1347,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 						vacrelstats->relnamespace,
 						vacrelstats->relname)));
 
-	empty_pages = vacuumed_pages = 0;
+	empty_pages = vacuumed_pages = has_dead_items_pages = 0;
 
 	/* Initialize counters */
 	c.num_tuples = 0;
@@ -1602,9 +1610,17 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				vmbuffer = InvalidBuffer;
 			}
 
+			/*
+			 * Won't be skipping index vacuuming now, since that is only
+			 * something two_pass_strategy() does when dead tuple space hasn't
+			 * been overrun.
+			 */
+			calledtwopass = true;
+
 			/* Remove the collected garbage tuples from table and indexes */
 			two_pass_strategy(onerel, vacrelstats, Irel, indstats, nindexes,
-							  lps, params->index_cleanup);
+							  lps, params->index_cleanup,
+							  has_dead_items_pages, false);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
@@ -1740,6 +1756,17 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		scan_prune_page(onerel, buf, vacrelstats, vistest, frozen,
 						&c, &ps, &vms);
 
+		/*
+		 * Remember the number of pages having at least one LP_DEAD line
+		 * pointer.  This could be from this VACUUM, a previous VACUUM, or
+		 * even opportunistic pruning.  Note that this is exactly the same
+		 * thing as having items that are stored in dead_tuples space, because
+		 * scan_prune_page() doesn't count anything other than LP_DEAD items
+		 * as dead (as of PostgreSQL 14).
+		 */
+		if (ps.has_dead_items)
+			has_dead_items_pages++;
+
 		/*
 		 * Step 7 for block: Set up details for saving free space in FSM at
 		 * end of loop.  (Also performs extra single pass strategy steps in
@@ -1754,9 +1781,18 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		savefreespace = false;
 		freespace = 0;
 		if (nindexes > 0 && ps.has_dead_items &&
-			params->index_cleanup != VACOPT_TERNARY_DISABLED)
+			params->index_cleanup != VACOPT_CLEANUP_DISABLED)
 		{
-			/* Wait until lazy_vacuum_heap() to save free space */
+			/*
+			 * Wait until lazy_vacuum_heap() to save free space.
+			 *
+			 * Note: It's not in fact 100% certain that we really will call
+			 * lazy_vacuum_heap() in INDEX_CLEANUP = AUTO case (which is the
+			 * common case) -- two_pass_strategy() might opt to skip index
+			 * vacuuming (and so must skip heap vacuuming).  This is deemed
+			 * okay, because there can't be very much free space when this
+			 * happens.
+			 */
 		}
 		else
 		{
@@ -1868,7 +1904,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	Assert(nindexes > 0 || dead_tuples->num_tuples == 0);
 	if (dead_tuples->num_tuples > 0)
 		two_pass_strategy(onerel, vacrelstats, Irel, indstats, nindexes,
-						  lps, params->index_cleanup);
+						  lps, params->index_cleanup,
+						  has_dead_items_pages, !calledtwopass);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
@@ -1883,10 +1920,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/*
 	 * Do post-vacuum cleanup.
 	 *
-	 * Note that post-vacuum cleanup does not take place with
+	 * Note that post-vacuum cleanup is supposed to take place when
+	 * two_pass_strategy() decided to skip index vacuuming, but not with
 	 * INDEX_CLEANUP=OFF.
 	 */
-	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
+	if (nindexes > 0 && params->index_cleanup != VACOPT_CLEANUP_DISABLED)
 		lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, nindexes);
 
 	/*
@@ -1899,10 +1937,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/*
 	 * Update index statistics.
 	 *
-	 * Note that updating the statistics does not take place with
-	 * INDEX_CLEANUP=OFF.
+	 * Note that updating the statistics takes places when two_pass_strategy()
+	 * decided to skip index vacuuming, but not with INDEX_CLEANUP=OFF.
+	 *
+	 * (In practice most index AMs won't have accurate statistics from
+	 * cleanup, but the index AM API allows them to, so we must check.)
 	 */
-	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
+	if (nindexes > 0 && params->index_cleanup != VACOPT_CLEANUP_DISABLED)
 		update_index_statistics(Irel, indstats, nindexes);
 
 	/* If no indexes, make log report that two_pass_strategy() would've made */
@@ -1945,12 +1986,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 /*
  * Remove the collected garbage tuples from the table and its indexes.
  *
- * We may be required to skip index vacuuming by INDEX_CLEANUP reloption.
+ * We may be able to skip index vacuuming (we may even be required to do so by
+ * reloption)
  */
 static void
 two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
 				  Relation *Irel, IndexBulkDeleteResult **indstats, int nindexes,
-				  LVParallelState *lps, VacOptTernaryValue index_cleanup)
+				  LVParallelState *lps, VacOptIndexCleanupValue index_cleanup,
+				  BlockNumber has_dead_items_pages, bool onecall)
 {
 	bool		skipping;
 
@@ -1958,11 +2001,43 @@ two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
 	Assert(nindexes > 0);
 	Assert(!IsParallelWorker());
 
-	/* Check whether or not to do index vacuum and heap vacuum */
-	if (index_cleanup == VACOPT_TERNARY_DISABLED)
+	/*
+	 * Check whether or not to do index vacuum and heap vacuum.
+	 *
+	 * We do both index vacuum and heap vacuum if more than
+	 * SKIP_VACUUM_PAGES_RATIO of all heap pages have at least one LP_DEAD
+	 * line pointer.  This is normally a case where dead tuples on the heap
+	 * are highly concentrated in relatively few heap blocks, where the
+	 * index's enhanced deletion mechanism that is clever about heap block
+	 * dead tuple concentrations including btree's bottom-up index deletion
+	 * works well.  Also, since we can clean only a few heap blocks, it would
+	 * be a less negative impact in terms of visibility map update.
+	 *
+	 * If we skip vacuum, we just ignore the collected dead tuples.  Note that
+	 * vacrelstats->dead_tuples could have tuples which became dead after
+	 * HOT-pruning but are not marked dead yet.  We do not process them because
+	 * it's a very rare condition, and the next vacuum will process them anyway.
+	 */
+	if (index_cleanup == VACOPT_CLEANUP_DISABLED)
 		skipping = true;
-	else
+	else if (index_cleanup == VACOPT_CLEANUP_ENABLED)
 		skipping = false;
+	else if (!onecall)
+		skipping = false;
+	else
+	{
+		BlockNumber rel_pages_threshold;
+
+		Assert(onecall && index_cleanup == VACOPT_CLEANUP_AUTO);
+
+		rel_pages_threshold =
+				(double) vacrelstats->rel_pages * SKIP_VACUUM_PAGES_RATIO;
+
+		if (has_dead_items_pages < rel_pages_threshold)
+			skipping = true;
+		else
+			skipping = false;
+	}
 
 	if (!skipping)
 	{
@@ -1987,10 +2062,18 @@ two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
 		 * one or more LP_DEAD items (could be from us or from another
 		 * VACUUM), not # blocks scanned.
 		 */
-		ereport(elevel,
-				(errmsg("\"%s\": INDEX_CLEANUP off forced VACUUM to not totally remove %d pruned items",
-						vacrelstats->relname,
-						vacrelstats->dead_tuples->num_tuples)));
+		if (index_cleanup == VACOPT_CLEANUP_AUTO)
+			ereport(elevel,
+					(errmsg("\"%s\": opted to not totally remove %d pruned items in %u pages",
+							vacrelstats->relname,
+							vacrelstats->dead_tuples->num_tuples,
+							has_dead_items_pages)));
+		else
+			ereport(elevel,
+					(errmsg("\"%s\": INDEX_CLEANUP off forced VACUUM to not totally remove %d pruned items in %u pages",
+							vacrelstats->relname,
+							vacrelstats->dead_tuples->num_tuples,
+							has_dead_items_pages)));
 	}
 
 	/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c064352e23..0d3aece45b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -108,7 +108,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 	ListCell   *lc;
 
 	/* Set default value */
-	params.index_cleanup = VACOPT_TERNARY_DEFAULT;
+	params.index_cleanup = VACOPT_CLEANUP_AUTO;
 	params.truncate = VACOPT_TERNARY_DEFAULT;
 
 	/* By default parallel vacuum is enabled */
@@ -140,7 +140,14 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		else if (strcmp(opt->defname, "disable_page_skipping") == 0)
 			disable_page_skipping = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "index_cleanup") == 0)
-			params.index_cleanup = get_vacopt_ternary_value(opt);
+		{
+			if (opt->arg == NULL || strcmp(defGetString(opt), "auto") == 0)
+				params.index_cleanup = VACOPT_CLEANUP_AUTO;
+			else if (defGetBoolean(opt))
+				params.index_cleanup = VACOPT_CLEANUP_ENABLED;
+			else
+				params.index_cleanup = VACOPT_CLEANUP_DISABLED;
+		}
 		else if (strcmp(opt->defname, "process_toast") == 0)
 			process_toast = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "truncate") == 0)
@@ -1880,15 +1887,19 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	onerelid = onerel->rd_lockInfo.lockRelId;
 	LockRelationIdForSession(&onerelid, lmode);
 
-	/* Set index cleanup option based on reloptions if not yet */
-	if (params->index_cleanup == VACOPT_TERNARY_DEFAULT)
-	{
-		if (onerel->rd_options == NULL ||
-			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
-			params->index_cleanup = VACOPT_TERNARY_ENABLED;
-		else
-			params->index_cleanup = VACOPT_TERNARY_DISABLED;
-	}
+	/*
+	 * Set index cleanup option based on reloptions if not set to either ON or
+	 * OFF.  Note that a VACUUM(INDEX_CLEANUP=AUTO) command is interpreted as
+	 * "prefer the reloption, but if it's not set, determine dynamically in
+	 * vacuumlazy.c whether index vacuuming and cleanup take place".  Note also that the
+	 * reloption might be explicitly set to AUTO.
+	 *
+	 * XXX: Do we really want that?
+	 */
+	if (params->index_cleanup == VACOPT_CLEANUP_AUTO &&
+		onerel->rd_options != NULL)
+		params->index_cleanup =
+			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup;
 
 	/* Set truncate option based on reloptions if not yet */
 	if (params->truncate == VACOPT_TERNARY_DEFAULT)
-- 
2.27.0

v3-0001-Remove-tupgone-special-case-from-vacuumlazy.c.patch
From 2b6efbaacd1861740da6038a9d5ae1172643d130 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 13 Mar 2021 20:37:32 -0800
Subject: [PATCH v3 1/2] Remove tupgone special case from vacuumlazy.c.

Decouple index vacuuming from initial heap scan's pruning.  Unify
dynamic index vacuum skipping with the index_cleanup mechanism added to
Postgres 12 by commits a96c41fe and dd695979.
---
 src/include/access/heapam.h              |    2 +-
 src/include/access/heapam_xlog.h         |    4 +-
 src/backend/access/gist/gistxlog.c       |    8 +-
 src/backend/access/hash/hash_xlog.c      |    8 +-
 src/backend/access/heap/heapam.c         |   51 -
 src/backend/access/heap/pruneheap.c      |   13 +-
 src/backend/access/heap/vacuumlazy.c     | 1386 +++++++++++++---------
 src/backend/access/nbtree/nbtree.c       |    6 +-
 src/backend/access/rmgrdesc/heapdesc.c   |    9 -
 src/backend/replication/logical/decode.c |    1 -
 10 files changed, 814 insertions(+), 674 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bc0936bc2d..0bef090420 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -180,7 +180,7 @@ extern int	heap_page_prune(Relation relation, Buffer buffer,
 							struct GlobalVisState *vistest,
 							TransactionId old_snap_xmin,
 							TimestampTz old_snap_ts_ts,
-							bool report_stats, TransactionId *latestRemovedXid,
+							bool report_stats,
 							OffsetNumber *off_loc);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 178d49710a..150c2fe384 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -53,7 +53,7 @@
 #define XLOG_HEAP2_REWRITE		0x00
 #define XLOG_HEAP2_CLEAN		0x10
 #define XLOG_HEAP2_FREEZE_PAGE	0x20
-#define XLOG_HEAP2_CLEANUP_INFO 0x30
+/* 0x30 is reserved */
 #define XLOG_HEAP2_VISIBLE		0x40
 #define XLOG_HEAP2_MULTI_INSERT 0x50
 #define XLOG_HEAP2_LOCK_UPDATED 0x60
@@ -397,8 +397,6 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
 extern void heap_xlog_logical_rewrite(XLogReaderState *r);
 
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
-										TransactionId latestRemovedXid);
 extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
 								 OffsetNumber *redirected, int nredirected,
 								 OffsetNumber *nowdead, int ndead,
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1c80eae044..5da9805073 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -184,10 +184,10 @@ gistRedoDeleteRecord(XLogReaderState *record)
 	 *
 	 * GiST delete records can conflict with standby queries.  You might think
 	 * that vacuum records would conflict as well, but we've handled that
-	 * already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual gist vacuum records on that index.
+	 * already.  XLOG_HEAP2_CLEAN records provide the highest xid cleaned by
+	 * the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual gist vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 02d9e6cdfd..7b8b8c8b74 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -992,10 +992,10 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
 	 * Hash index records that are marked as LP_DEAD and being removed during
 	 * hash index tuple insertion can conflict with standby queries. You might
 	 * think that vacuum records would conflict as well, but we've handled
-	 * that already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual hash index vacuum records on that index.
+	 * that already.  XLOG_HEAP2_CLEAN records provide the highest xid cleaned
+	 * by the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual hash index vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7cb87f4a3b..c2cf5cb00a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7947,29 +7947,6 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
 	return nblocksfavorable;
 }
 
-/*
- * Perform XLogInsert to register a heap cleanup info message. These
- * messages are sent once per VACUUM and are required because
- * of the phasing of removal operations during a lazy VACUUM.
- * see comments for vacuum_log_cleanup_info().
- */
-XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
-{
-	xl_heap_cleanup_info xlrec;
-	XLogRecPtr	recptr;
-
-	xlrec.node = rnode;
-	xlrec.latestRemovedXid = latestRemovedXid;
-
-	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, SizeOfHeapCleanupInfo);
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEANUP_INFO);
-
-	return recptr;
-}
-
 /*
  * Perform XLogInsert for a heap-clean operation.  Caller must already
  * have modified the buffer and marked it dirty.
@@ -8499,27 +8476,6 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed,
 	return key_tuple;
 }
 
-/*
- * Handles CLEANUP_INFO
- */
-static void
-heap_xlog_cleanup_info(XLogReaderState *record)
-{
-	xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
-
-	if (InHotStandby)
-		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
-
-	/*
-	 * Actual operation is a no-op. Record type exists to provide a means for
-	 * conflict processing to occur before we begin index vacuum actions. see
-	 * vacuumlazy.c and also comments in btvacuumpage()
-	 */
-
-	/* Backup blocks are not used in cleanup_info records */
-	Assert(!XLogRecHasAnyBlockRefs(record));
-}
-
 /*
  * Handles XLOG_HEAP2_CLEAN record type
  */
@@ -8538,10 +8494,6 @@ heap_xlog_clean(XLogReaderState *record)
 	/*
 	 * We're about to remove tuples. In Hot Standby mode, ensure that there's
 	 * no queries running for which the removed tuples are still visible.
-	 *
-	 * Not all HEAP2_CLEAN records remove tuples with xids, so we only want to
-	 * conflict on the records that cause MVCC failures for user queries. If
-	 * latestRemovedXid is invalid, skip conflict processing.
 	 */
 	if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
 		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
@@ -9718,9 +9670,6 @@ heap2_redo(XLogReaderState *record)
 		case XLOG_HEAP2_FREEZE_PAGE:
 			heap_xlog_freeze_page(record);
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			heap_xlog_cleanup_info(record);
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			heap_xlog_visible(record);
 			break;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406..ac7e540944 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -182,13 +182,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		 */
 		if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
 		{
-			TransactionId ignore = InvalidTransactionId;	/* return value not
-															 * needed */
-
 			/* OK to prune */
 			(void) heap_page_prune(relation, buffer, vistest,
 								   limited_xmin, limited_ts,
-								   true, &ignore, NULL);
+								   true, NULL);
 		}
 
 		/* And release buffer lock */
@@ -213,8 +210,6 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * send its own new total to pgstats, and we don't want this delta applied
  * on top of that.)
  *
- * Sets latestRemovedXid for caller on return.
- *
  * off_loc is the offset location required by the caller to use in error
  * callback.
  *
@@ -225,7 +220,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 				GlobalVisState *vistest,
 				TransactionId old_snap_xmin,
 				TimestampTz old_snap_ts,
-				bool report_stats, TransactionId *latestRemovedXid,
+				bool report_stats,
 				OffsetNumber *off_loc)
 {
 	int			ndeleted = 0;
@@ -251,7 +246,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	prstate.old_snap_xmin = old_snap_xmin;
 	prstate.old_snap_ts = old_snap_ts;
 	prstate.old_snap_used = false;
-	prstate.latestRemovedXid = *latestRemovedXid;
+	prstate.latestRemovedXid = InvalidTransactionId;
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));
 
@@ -363,8 +358,6 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (report_stats && ndeleted > prstate.ndead)
 		pgstat_update_heap_dead_tuples(relation, ndeleted - prstate.ndead);
 
-	*latestRemovedXid = prstate.latestRemovedXid;
-
 	/*
 	 * XXX Should we update the FSM information of this page ?
 	 *
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8341879d89..960d34b627 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -294,8 +294,6 @@ typedef struct LVRelStats
 {
 	char	   *relnamespace;
 	char	   *relname;
-	/* useindex = true means two-pass strategy; false means one-pass */
-	bool		useindex;
 	/* Overall statistics about rel */
 	BlockNumber old_rel_pages;	/* previous value of pg_class.relpages */
 	BlockNumber rel_pages;		/* total number of pages */
@@ -312,7 +310,6 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	LVDeadTuples *dead_tuples;
 	int			num_index_scans;
-	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
 
 	/* Used for error callback */
@@ -330,9 +327,47 @@ typedef struct LVSavedErrInfo
 	VacErrPhase phase;
 } LVSavedErrInfo;
 
+/*
+ * Counters maintained by lazy_scan_heap() (and scan_prune_page()):
+ */
+typedef struct LVTempCounters
+{
+	double	num_tuples;		/* total number of nonremovable tuples */
+	double	live_tuples;	/* live tuples (reltuples estimate) */
+	double	tups_vacuumed;	/* tuples cleaned up by current vacuum */
+	double	nkeep;			/* dead-but-not-removable tuples */
+	double	nunused;		/* # existing unused line pointers */
+} LVTempCounters;
+
+/*
+ * State output by scan_prune_page():
+ */
+typedef struct LVPrunePageState
+{
+	bool		  hastup;			/* Page is truncatable? */
+	bool		  has_dead_items;	/* includes existing LP_DEAD items */
+	bool		  all_visible;		/* Every item visible to all? */
+	bool		  all_frozen;		/* provided all_visible is also true */
+} LVPrunePageState;
+
+/*
+ * State set up and maintained in lazy_scan_heap() (also maintained in
+ * scan_prune_page()) that represents VM bit status.
+ *
+ * Used by scan_setvmbit_page() when we're done pruning.
+ */
+typedef struct LVVisMapPageState
+{
+	bool		  all_visible_according_to_vm;
+	TransactionId visibility_cutoff_xid;
+} LVVisMapPageState;
+
 /* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
 
+static TransactionId RelFrozenXid;
+static MultiXactId RelMinMxid;
+
 static TransactionId OldestXmin;
 static TransactionId FreezeLimit;
 static MultiXactId MultiXactCutoff;
@@ -344,6 +379,10 @@ static BufferAccessStrategy vac_strategy;
 static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 						   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
 						   bool aggressive);
+static void two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
+							  Relation *Irel, IndexBulkDeleteResult **indstats,
+							  int nindexes, LVParallelState *lps,
+							  VacOptTernaryValue index_cleanup);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
@@ -363,7 +402,8 @@ static bool should_attempt_truncation(VacuumParams *params,
 static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
 static BlockNumber count_nondeletable_pages(Relation onerel,
 											LVRelStats *vacrelstats);
-static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
+static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks,
+							 bool hasindex);
 static void lazy_record_dead_tuple(LVDeadTuples *dead_tuples,
 								   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
@@ -448,10 +488,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
-	/* not every AM requires these to be valid, but heap does */
-	Assert(TransactionIdIsNormal(onerel->rd_rel->relfrozenxid));
-	Assert(MultiXactIdIsValid(onerel->rd_rel->relminmxid));
-
 	/* measure elapsed time iff autovacuum logging requires it */
 	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
 	{
@@ -474,6 +510,13 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 
 	vac_strategy = bstrategy;
 
+	RelFrozenXid = onerel->rd_rel->relfrozenxid;
+	RelMinMxid = onerel->rd_rel->relminmxid;
+
+	/* not every AM requires these to be valid, but heap does */
+	Assert(TransactionIdIsNormal(RelFrozenXid));
+	Assert(MultiXactIdIsValid(RelMinMxid));
+
 	vacuum_set_xid_limits(onerel,
 						  params->freeze_min_age,
 						  params->freeze_table_age,
@@ -509,8 +552,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 
 	/* Open all indexes of the relation */
 	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
-	vacrelstats->useindex = (nindexes > 0 &&
-							 params->index_cleanup == VACOPT_TERNARY_ENABLED);
 
 	/*
 	 * Setup error traceback support for ereport().  The idea is to set up an
@@ -708,36 +749,524 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 }
 
 /*
- * For Hot Standby we need to know the highest transaction id that will
- * be removed by any change. VACUUM proceeds in a number of passes so
- * we need to consider how each pass operates. The first phase runs
- * heap_page_prune(), which can issue XLOG_HEAP2_CLEAN records as it
- * progresses - these will have a latestRemovedXid on each record.
- * In some cases this removes all of the tuples to be removed, though
- * often we have dead tuples with index pointers so we must remember them
- * for removal in phase 3. Index records for those rows are removed
- * in phase 2 and index blocks do not have MVCC information attached.
- * So before we can allow removal of any index tuples we need to issue
- * a WAL record containing the latestRemovedXid of rows that will be
- * removed in phase three. This allows recovery queries to block at the
- * correct place, i.e. before phase two, rather than during phase three
- * which would be after the rows have become inaccessible.
+ * Handle new page during lazy_scan_heap().
+ *
+ * Caller must hold pin and buffer cleanup lock on buf.
+ *
+ * All-zeroes pages can be left over if either a backend extends the relation
+ * by a single page, but crashes before the newly initialized page has been
+ * written out, or when bulk-extending the relation (which creates a number of
+ * empty pages at the tail end of the relation, but enters them into the FSM).
+ *
+ * Note we do not enter the page into the visibilitymap. That has the downside
+ * that we repeatedly visit this page in subsequent vacuums, but otherwise
+ * we'll never not discover the space on a promoted standby. The harm of
+ * repeated checking ought to normally not be too bad - the space usually
+ * should be used at some point, otherwise there wouldn't be any regular
+ * vacuums.
+ *
+ * Make sure these pages are in the FSM, to ensure they can be reused. Do that
+ * by testing if there's any space recorded for the page. If not, enter it. We
+ * do so after releasing the lock on the heap page, the FSM is approximate,
+ * after all.
  */
 static void
-vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
+scan_new_page(Relation onerel, Buffer buf)
 {
-	/*
-	 * Skip this for relations for which no WAL is to be written, or if we're
-	 * not trying to support archive recovery.
-	 */
-	if (!RelationNeedsWAL(rel) || !XLogIsNeeded())
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+
+	if (GetRecordedFreeSpace(onerel, blkno) == 0)
+	{
+		Size freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
+
+		UnlockReleaseBuffer(buf);
+		RecordPageWithFreeSpace(onerel, blkno, freespace);
 		return;
+	}
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Handle empty page during lazy_scan_heap().
+ *
+ * Caller must hold pin and buffer cleanup lock on buf, as well as a pin (but
+ * not a lock) on vmbuffer.
+ */
+static void
+scan_empty_page(Relation onerel, Buffer buf, Buffer vmbuffer,
+				LVRelStats *vacrelstats)
+{
+	Page	page = BufferGetPage(buf);
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+	Size freespace = PageGetHeapFreeSpace(page);
 
 	/*
-	 * No need to write the record at all unless it contains a valid value
+	 * Empty pages are always all-visible and all-frozen (note that the same
+	 * is currently not true for new pages, see scan_new_page()).
 	 */
-	if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
-		(void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+	if (!PageIsAllVisible(page))
+	{
+		START_CRIT_SECTION();
+
+		/* mark buffer dirty before writing a WAL record */
+		MarkBufferDirty(buf);
+
+		/*
+		 * It's possible that another backend has extended the heap,
+		 * initialized the page, and then failed to WAL-log the page due to an
+		 * ERROR.  Since heap extension is not WAL-logged, recovery might try
+		 * to replay our record setting the page all-visible and find that the
+		 * page isn't initialized, which will cause a PANIC.  To prevent that,
+		 * check whether the page has been previously WAL-logged, and if not,
+		 * do that now.
+		 */
+		if (RelationNeedsWAL(onerel) &&
+			PageGetLSN(page) == InvalidXLogRecPtr)
+			log_newpage_buffer(buf, true);
+
+		PageSetAllVisible(page);
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, InvalidTransactionId,
+						  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
+		END_CRIT_SECTION();
+	}
+
+	UnlockReleaseBuffer(buf);
+	RecordPageWithFreeSpace(onerel, blkno, freespace);
+}
+
+/*
+ *	scan_prune_page() -- lazy_scan_heap() pruning and freezing.
+ *
+ * Caller must hold pin and buffer cleanup lock on the buffer.
+ *
+ * Prior to PostgreSQL 14 there were very rare cases where lazy_scan_heap()
+ * treated tuples that still had storage after pruning as DEAD.  That happened
+ * when heap_page_prune() could not prune tuples that were nevertheless deemed
+ * DEAD by its own HeapTupleSatisfiesVacuum() call.  This created rare,
+ * hard-to-test cases.  It meant there was no sharp distinction between DEAD
+ * tuples and tuples that are to be kept and be considered for freezing inside
+ * heap_prepare_freeze_tuple().  It also meant that lazy_vacuum_page() had to
+ * be prepared to remove items with storage (tuples with tuple headers) that
+ * didn't get pruned, which created a special case to handle recovery
+ * conflicts.
+ *
+ * The approach we take here now (to eliminate all of this complexity) is to
+ * simply restart pruning in these very rare cases -- cases where a concurrent
+ * abort of an xact makes our HeapTupleSatisfiesVacuum() call disagree with
+ * what heap_page_prune() thought about the tuple only microseconds earlier.
+ *
+ * Since we might have to prune a second time here, the code is structured to
+ * use a local per-page copy of the counters that caller accumulates.  We add
+ * our per-page counters to the per-VACUUM totals from caller last of all, to
+ * avoid double counting.
+ */
+static void
+scan_prune_page(Relation onerel, Buffer buf,
+				LVRelStats *vacrelstats,
+				GlobalVisState *vistest, xl_heap_freeze_tuple *frozen,
+				LVTempCounters *c, LVPrunePageState *ps,
+				LVVisMapPageState *vms)
+{
+	BlockNumber blkno;
+	Page		page;
+	OffsetNumber offnum,
+				maxoff;
+	HTSV_Result tuplestate;
+	int			  nfrozen,
+				  ndead;
+	LVTempCounters pc;
+	OffsetNumber deaditems[MaxHeapTuplesPerPage];
+
+	blkno = BufferGetBlockNumber(buf);
+	page = BufferGetPage(buf);
+
+retry:
+
+	/* Initialize (or reset) page-level counters */
+	pc.num_tuples = 0;
+	pc.live_tuples = 0;
+	pc.tups_vacuumed = 0;
+	pc.nkeep = 0;
+	pc.nunused = 0;
+
+	/*
+	 * Prune all HOT-update chains in this page.
+	 *
+	 * We count tuples removed by the pruning step as removed by VACUUM
+	 * (existing LP_DEAD line pointers don't count).
+	 */
+	pc.tups_vacuumed = heap_page_prune(onerel, buf, vistest,
+									   InvalidTransactionId, 0, false,
+									   &vacrelstats->offnum);
+	/*
+	 * Now scan the page to collect vacuumable items and check for tuples
+	 * requiring freezing.
+	 *
+	 * Note: If we retry having set vms.visibility_cutoff_xid it doesn't
+	 * matter -- the newest XMIN on page can't be missed this way.
+	 */
+	ps->hastup = false;
+	ps->has_dead_items = false;
+	ps->all_visible = true;
+	ps->all_frozen = true;
+	nfrozen = 0;
+	ndead = 0;
+	maxoff = PageGetMaxOffsetNumber(page);
+
+#ifdef DEBUG
+	/*
+	 * Enable this to debug the retry logic -- it's actually quite hard to hit
+	 * even with this artificial delay
+	 */
+	pg_usleep(10000);
+#endif
+
+	/*
+	 * Note: If you change anything in the loop below, also look at
+	 * heap_page_is_all_visible to see if that needs to be changed.
+	 */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+		HeapTupleData tuple;
+		bool		tuple_totally_frozen;
+
+		/*
+		 * Set the offset number so that we can display it along with any
+		 * error that occurred while processing this tuple.
+		 */
+		vacrelstats->offnum = offnum;
+		itemid = PageGetItemId(page, offnum);
+
+		/* Unused items require no processing, but we count 'em */
+		if (!ItemIdIsUsed(itemid))
+		{
+			pc.nunused += 1;
+			continue;
+		}
+
+		/* Redirect items mustn't be touched */
+		if (ItemIdIsRedirected(itemid))
+		{
+			ps->hastup = true;	/* this page won't be truncatable */
+			continue;
+		}
+
+		/*
+		 * LP_DEAD line pointers are to be vacuumed normally; but we don't
+		 * count them in tups_vacuumed, else we'd be double-counting (at least
+		 * in the common case where heap_page_prune() just freed up a non-HOT
+		 * tuple).
+		 *
+		 * Note also that the final tups_vacuumed value might be very low for
+		 * tables where opportunistic page pruning happens to occur very
+		 * frequently (via heap_page_prune_opt() calls that free up non-HOT
+		 * tuples).
+		 */
+		if (ItemIdIsDead(itemid))
+		{
+			deaditems[ndead++] = offnum;
+			ps->all_visible = false;
+			ps->has_dead_items = true;
+			continue;
+		}
+
+		Assert(ItemIdIsNormal(itemid));
+
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationGetRelid(onerel);
+
+		/*
+		 * DEAD tuples are almost always pruned into LP_DEAD line pointers by
+		 * heap_page_prune(), but it's possible that the tuple state changed
+		 * since heap_page_prune() looked.  Handle that here by restarting.
+		 * (See comments at the top of function for a full explanation.)
+		 */
+		tuplestate = HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf);
+
+		if (unlikely(tuplestate == HEAPTUPLE_DEAD))
+			goto retry;
+
+		/*
+		 * The criteria for counting a tuple as live in this block need to
+		 * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
+		 * and ANALYZE may produce wildly different reltuples values, e.g.
+		 * when there are many recently-dead tuples.
+		 *
+		 * The logic here is a bit simpler than acquire_sample_rows(), as
+		 * VACUUM can't run inside a transaction block, which makes some cases
+		 * impossible (e.g. in-progress insert from the same transaction).
+		 */
+		switch (tuplestate)
+		{
+			case HEAPTUPLE_LIVE:
+
+				/*
+				 * Count it as live.  Not only is this natural, but it's
+				 * also what acquire_sample_rows() does.
+				 */
+				pc.live_tuples += 1;
+
+				/*
+				 * Is the tuple definitely visible to all transactions?
+				 *
+				 * NB: Like with per-tuple hint bits, we can't set the
+				 * PD_ALL_VISIBLE flag if the inserter committed
+				 * asynchronously. See SetHintBits for more info. Check
+				 * that the tuple is hinted xmin-committed because of
+				 * that.
+				 */
+				if (ps->all_visible)
+				{
+					TransactionId xmin;
+
+					if (!HeapTupleHeaderXminCommitted(tuple.t_data))
+					{
+						ps->all_visible = false;
+						break;
+					}
+
+					/*
+					 * The inserter definitely committed. But is it old
+					 * enough that everyone sees it as committed?
+					 */
+					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
+					if (!TransactionIdPrecedes(xmin, OldestXmin))
+					{
+						ps->all_visible = false;
+						break;
+					}
+
+					/* Track newest xmin on page. */
+					if (TransactionIdFollows(xmin, vms->visibility_cutoff_xid))
+						vms->visibility_cutoff_xid = xmin;
+				}
+				break;
+			case HEAPTUPLE_RECENTLY_DEAD:
+
+				/*
+				 * If tuple is recently deleted then we must not remove it
+				 * from relation.
+				 */
+				pc.nkeep += 1;
+				ps->all_visible = false;
+				break;
+			case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+				/*
+				 * This is an expected case during concurrent vacuum.
+				 *
+				 * We do not count these rows as live, because we expect
+				 * the inserting transaction to update the counters at
+				 * commit, and we assume that will happen only after we
+				 * report our results.  This assumption is a bit shaky,
+				 * but it is what acquire_sample_rows() does, so be
+				 * consistent.
+				 */
+				ps->all_visible = false;
+				break;
+			case HEAPTUPLE_DELETE_IN_PROGRESS:
+				/* This is an expected case during concurrent vacuum */
+				ps->all_visible = false;
+
+				/*
+				 * Count such rows as live.  As above, we assume the
+				 * deleting transaction will commit and update the
+				 * counters after we report.
+				 */
+				pc.live_tuples += 1;
+				break;
+			default:
+				elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+				break;
+		}
+
+		pc.num_tuples += 1;
+		ps->hastup = true;
+
+		/*
+		 * Each non-removable tuple must be checked to see if it needs
+		 * freezing
+		 */
+		if (heap_prepare_freeze_tuple(tuple.t_data,
+									  RelFrozenXid, RelMinMxid,
+									  FreezeLimit, MultiXactCutoff,
+									  &frozen[nfrozen],
+									  &tuple_totally_frozen))
+			frozen[nfrozen++].offset = offnum;
+
+		if (!tuple_totally_frozen)
+			ps->all_frozen = false;
+	}
+
+	/*
+	 * Success -- we're done pruning, and have determined which tuples are to
+	 * be recorded as dead in local array.  We've also prepared the details of
+	 * which remaining tuples are to be frozen.
+	 *
+	 * First clear the offset information once we have processed all the
+	 * tuples on the page.
+	 */
+	vacrelstats->offnum = InvalidOffsetNumber;
+
+	/*
+	 * Next add page level counters to caller's counts
+	 */
+	c->num_tuples += pc.num_tuples;
+	c->live_tuples += pc.live_tuples;
+	c->tups_vacuumed += pc.tups_vacuumed;
+	c->nkeep += pc.nkeep;
+	c->nunused += pc.nunused;
+
+	/*
+	 * Now save the local dead items array to VACUUM's dead_tuples array.
+	 */
+	for (int i = 0; i < ndead; i++)
+	{
+		ItemPointerData itemptr;
+
+		ItemPointerSet(&itemptr, blkno, deaditems[i]);
+		lazy_record_dead_tuple(vacrelstats->dead_tuples, &itemptr);
+	}
+
+	/*
+	 * Finally, execute tuple freezing as planned.
+	 *
+	 * If we need to freeze any tuples we'll mark the buffer dirty, and write
+	 * a WAL record recording the changes.  We must log the changes to be
+	 * crash-safe against future truncation of CLOG.
+	 */
+	if (nfrozen > 0)
+	{
+		START_CRIT_SECTION();
+
+		MarkBufferDirty(buf);
+
+		/* execute collected freezes */
+		for (int i = 0; i < nfrozen; i++)
+		{
+			ItemId		itemid;
+			HeapTupleHeader htup;
+
+			itemid = PageGetItemId(page, frozen[i].offset);
+			htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+			heap_execute_freeze_tuple(htup, &frozen[i]);
+		}
+
+		/* Now WAL-log freezing if necessary */
+		if (RelationNeedsWAL(onerel))
+		{
+			XLogRecPtr	recptr;
+
+			recptr = log_heap_freeze(onerel, buf, FreezeLimit,
+									 frozen, nfrozen);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+	}
+}
+
+/*
+ * Handle setting VM bit inside lazy_scan_heap(), after pruning and freezing.
+ */
+static void
+scan_setvmbit_page(Relation onerel, Buffer buf, Buffer vmbuffer,
+				   LVPrunePageState *ps, LVVisMapPageState *vms)
+{
+	Page		page = BufferGetPage(buf);
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+
+	/* mark page all-visible, if appropriate */
+	if (ps->all_visible && !vms->all_visible_according_to_vm)
+	{
+		uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
+
+		if (ps->all_frozen)
+			flags |= VISIBILITYMAP_ALL_FROZEN;
+
+		/*
+		 * It should never be the case that the visibility map page is set
+		 * while the page-level bit is clear, but the reverse is allowed (if
+		 * checksums are not enabled).  Regardless, set both bits so that we
+		 * get back in sync.
+		 *
+		 * NB: If the heap page is all-visible but the VM bit is not set, we
+		 * don't need to dirty the heap page.  However, if checksums are
+		 * enabled, we do need to make sure that the heap page is dirtied
+		 * before passing it to visibilitymap_set(), because it may be logged.
+		 * Given that this situation should only happen in rare cases after a
+		 * crash, it is not worth optimizing.
+		 */
+		PageSetAllVisible(page);
+		MarkBufferDirty(buf);
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, vms->visibility_cutoff_xid, flags);
+	}
+
+	/*
+	 * The visibility map bit should never be set if the page-level bit is
+	 * clear.  However, it's possible that the bit got cleared after we
+	 * checked it and before we took the buffer content lock, so we must
+	 * recheck before jumping to the conclusion that something bad has
+	 * happened.
+	 */
+	else if (vms->all_visible_according_to_vm && !PageIsAllVisible(page) &&
+			 VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
+	{
+		elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+			 RelationGetRelationName(onerel), blkno);
+		visibilitymap_clear(onerel, blkno, vmbuffer,
+							VISIBILITYMAP_VALID_BITS);
+	}
+
+	/*
+	 * It's possible for the value returned by
+	 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
+	 * wrong for us to see tuples that appear to not be visible to everyone
+	 * yet, while PD_ALL_VISIBLE is already set. The real safe xmin value
+	 * never moves backwards, but GetOldestNonRemovableTransactionId() is
+	 * conservative and sometimes returns a value that's unnecessarily small,
+	 * so if we see that contradiction it just means that the tuples that we
+	 * think are not visible to everyone yet actually are, and the
+	 * PD_ALL_VISIBLE flag is correct.
+	 *
+	 * There should never be dead tuples on a page with PD_ALL_VISIBLE set,
+	 * however.
+	 */
+	else if (PageIsAllVisible(page) && ps->has_dead_items)
+	{
+		elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+			 RelationGetRelationName(onerel), blkno);
+		PageClearAllVisible(page);
+		MarkBufferDirty(buf);
+		visibilitymap_clear(onerel, blkno, vmbuffer,
+							VISIBILITYMAP_VALID_BITS);
+	}
+
+	/*
+	 * If the all-visible page is all-frozen but not marked as such yet, mark
+	 * it as all-frozen.  Note that all_frozen is only valid if all_visible is
+	 * true, so we must check both.
+	 */
+	else if (vms->all_visible_according_to_vm && ps->all_visible &&
+			 ps->all_frozen && !VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+	{
+		/*
+		 * We can pass InvalidTransactionId as the cutoff XID here, because
+		 * setting the all-frozen bit doesn't cause recovery conflicts.
+		 */
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, InvalidTransactionId,
+						  VISIBILITYMAP_ALL_FROZEN);
+	}
 }
 
 /*
@@ -748,9 +1277,9 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
  *		page, and set commit status bits (see heap_page_prune).  It also builds
  *		lists of dead tuples and pages with free space, calculates statistics
  *		on the number of live tuples in the heap, and marks pages as
- *		all-visible if appropriate.  When done, or when we run low on space for
- *		dead-tuple TIDs, invoke vacuuming of indexes and call lazy_vacuum_heap
- *		to reclaim dead line pointers.
+ *		all-visible if appropriate.  When done, or when we run low on space
+ *		for dead-tuple TIDs, invoke two_pass_strategy to vacuum indexes and
+ *		mark dead line pointers for reuse via a second heap pass.
  *
  *		If the table has at least two indexes, we execute both index vacuum
  *		and index cleanup with parallel workers unless parallel vacuum is
@@ -775,23 +1304,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	LVParallelState *lps = NULL;
 	LVDeadTuples *dead_tuples;
 	BlockNumber nblocks,
-				blkno;
-	HeapTupleData tuple;
-	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
-	TransactionId relminmxid = onerel->rd_rel->relminmxid;
-	BlockNumber empty_pages,
-				vacuumed_pages,
+				blkno,
+				next_unskippable_block,
 				next_fsm_block_to_vacuum;
-	double		num_tuples,		/* total number of nonremovable tuples */
-				live_tuples,	/* live tuples (reltuples estimate) */
-				tups_vacuumed,	/* tuples cleaned up by current vacuum */
-				nkeep,			/* dead-but-not-removable tuples */
-				nunused;		/* # existing unused line pointers */
 	IndexBulkDeleteResult **indstats;
-	int			i;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
-	BlockNumber next_unskippable_block;
 	bool		skipping_blocks;
 	xl_heap_freeze_tuple *frozen;
 	StringInfoData buf;
@@ -802,6 +1320,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	};
 	int64		initprog_val[3];
 	GlobalVisState *vistest;
+	LVTempCounters c;
+
+	/* Counters of # blocks in onerel: */
+	BlockNumber empty_pages,
+				vacuumed_pages;
 
 	pg_rusage_init(&ru0);
 
@@ -817,18 +1340,24 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 						vacrelstats->relname)));
 
 	empty_pages = vacuumed_pages = 0;
-	next_fsm_block_to_vacuum = (BlockNumber) 0;
-	num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
+
+	/* Initialize counters */
+	c.num_tuples = 0;
+	c.live_tuples = 0;
+	c.tups_vacuumed = 0;
+	c.nkeep = 0;
+	c.nunused = 0;
 
 	indstats = (IndexBulkDeleteResult **)
 		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
 
 	nblocks = RelationGetNumberOfBlocks(onerel);
+	next_unskippable_block = 0;
+	next_fsm_block_to_vacuum = 0;
 	vacrelstats->rel_pages = nblocks;
 	vacrelstats->scanned_pages = 0;
 	vacrelstats->tupcount_pages = 0;
 	vacrelstats->nonempty_pages = 0;
-	vacrelstats->latestRemovedXid = InvalidTransactionId;
 
 	vistest = GlobalVisTestFor(onerel);
 
@@ -837,7 +1366,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * be used for an index, so we invoke parallelism only if there are at
 	 * least two indexes on a table.
 	 */
-	if (params->nworkers >= 0 && vacrelstats->useindex && nindexes > 1)
+	if (params->nworkers >= 0 && nindexes > 1)
 	{
 		/*
 		 * Since parallel workers cannot access data in temporary tables, we
@@ -865,7 +1394,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * initialized.
 	 */
 	if (!ParallelVacuumIsActive(lps))
-		lazy_space_alloc(vacrelstats, nblocks);
+		lazy_space_alloc(vacrelstats, nblocks, nindexes > 0);
 
 	dead_tuples = vacrelstats->dead_tuples;
 	frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
@@ -920,7 +1449,6 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * the last page.  This is worth avoiding mainly because such a lock must
 	 * be replayed on any hot standby, where it can be disruptive.
 	 */
-	next_unskippable_block = 0;
 	if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
 	{
 		while (next_unskippable_block < nblocks)
@@ -953,20 +1481,22 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	{
 		Buffer		buf;
 		Page		page;
-		OffsetNumber offnum,
-					maxoff;
-		bool		tupgone,
-					hastup;
-		int			prev_dead_count;
-		int			nfrozen;
+		LVVisMapPageState vms;
+		LVPrunePageState ps;
+		bool		savefreespace;
 		Size		freespace;
-		bool		all_visible_according_to_vm = false;
-		bool		all_visible;
-		bool		all_frozen = true;	/* provided all_visible is also true */
-		bool		has_dead_items;		/* includes existing LP_DEAD items */
-		TransactionId visibility_cutoff_xid = InvalidTransactionId;
 
-		/* see note above about forcing scanning of last page */
+		/* Initialize vm state for block: */
+		vms.all_visible_according_to_vm = false;
+		vms.visibility_cutoff_xid = InvalidTransactionId;
+
+		/* Note: Can't touch ps until we reach scan_prune_page() */
+
+		/*
+		 * Step 1 for block: Consider need to skip blocks.
+		 *
+		 * See note above about forcing scanning of last page.
+		 */
 #define FORCE_CHECK_PAGE() \
 		(blkno == nblocks - 1 && should_attempt_truncation(params, vacrelstats))
 
@@ -1018,7 +1548,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * that it's not all-frozen, so it might still be all-visible.
 			 */
 			if (aggressive && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
-				all_visible_according_to_vm = true;
+				vms.all_visible_according_to_vm = true;
 		}
 		else
 		{
@@ -1045,12 +1575,15 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 					vacrelstats->frozenskipped_pages++;
 				continue;
 			}
-			all_visible_according_to_vm = true;
+			vms.all_visible_according_to_vm = true;
 		}
 
 		vacuum_delay_point();
 
 		/*
+		 * Step 2 for block: Consider if we definitely have enough space to
+		 * process TIDs on page already.
+		 *
 		 * If we are close to overrunning the available space for dead-tuple
 		 * TIDs, pause and do a cycle of vacuuming before we tackle this page.
 		 */
@@ -1069,23 +1602,15 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				vmbuffer = InvalidBuffer;
 			}
 
-			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(onerel, Irel, indstats,
-									vacrelstats, lps, nindexes);
-
-			/* Remove tuples from heap */
-			lazy_vacuum_heap(onerel, vacrelstats);
-
-			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
-			 */
-			dead_tuples->num_tuples = 0;
+			/* Remove the collected garbage tuples from table and indexes */
+			two_pass_strategy(onerel, vacrelstats, Irel, indstats, nindexes,
+							  lps, params->index_cleanup);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
 			 * upper-level FSM pages.  Note we have not yet processed blkno.
+			 * Even if we skipped heap vacuum, FSM vacuuming could be worthwhile
+			 * since we could have updated the freespace of empty pages.
 			 */
 			FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
 			next_fsm_block_to_vacuum = blkno;
@@ -1096,22 +1621,29 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		}
 
 		/*
+		 * Step 3 for block: Set up visibility map page as needed.
+		 *
 		 * Pin the visibility map page in case we need to mark the page
 		 * all-visible.  In most cases this will be very cheap, because we'll
 		 * already have the correct page pinned anyway.  However, it's
 		 * possible that (a) next_unskippable_block is covered by a different
 		 * VM page than the current block or (b) we released our pin and did a
 		 * cycle of index vacuuming.
-		 *
 		 */
 		visibilitymap_pin(onerel, blkno, &vmbuffer);
 
 		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
 								 RBM_NORMAL, vac_strategy);
 
-		/* We need buffer cleanup lock so that we can prune HOT chains. */
+		/*
+		 * Step 4 for block: Acquire super-exclusive lock for pruning.
+		 *
+		 * We need buffer cleanup lock so that we can prune HOT chains.
+		 */
 		if (!ConditionalLockBufferForCleanup(buf))
 		{
+			bool		  hastup;
+
 			/*
 			 * If we're not performing an aggressive scan to guard against XID
 			 * wraparound, and we don't want to forcibly check the page, then
@@ -1168,6 +1700,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			/* drop through to normal processing */
 		}
 
+		/*
+		 * Step 5 for block: Handle empty/new pages.
+		 *
+		 * By here we have a super-exclusive lock, and it's clear that this
+		 * page is one that we consider scanned
+		 */
 		vacrelstats->scanned_pages++;
 		vacrelstats->tupcount_pages++;
 
@@ -1175,399 +1713,84 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		if (PageIsNew(page))
 		{
-			/*
-			 * All-zeroes pages can be left over if either a backend extends
-			 * the relation by a single page, but crashes before the newly
-			 * initialized page has been written out, or when bulk-extending
-			 * the relation (which creates a number of empty pages at the tail
-			 * end of the relation, but enters them into the FSM).
-			 *
-			 * Note we do not enter the page into the visibilitymap. That has
-			 * the downside that we repeatedly visit this page in subsequent
-			 * vacuums, but otherwise we'll never not discover the space on a
-			 * promoted standby. The harm of repeated checking ought to
-			 * normally not be too bad - the space usually should be used at
-			 * some point, otherwise there wouldn't be any regular vacuums.
-			 *
-			 * Make sure these pages are in the FSM, to ensure they can be
-			 * reused. Do that by testing if there's any space recorded for
-			 * the page. If not, enter it. We do so after releasing the lock
-			 * on the heap page, the FSM is approximate, after all.
-			 */
-			UnlockReleaseBuffer(buf);
-
 			empty_pages++;
-
-			if (GetRecordedFreeSpace(onerel, blkno) == 0)
-			{
-				Size		freespace;
-
-				freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
-				RecordPageWithFreeSpace(onerel, blkno, freespace);
-			}
+			/* Releases lock on buf for us: */
+			scan_new_page(onerel, buf);
 			continue;
 		}
-
-		if (PageIsEmpty(page))
+		else if (PageIsEmpty(page))
 		{
 			empty_pages++;
-			freespace = PageGetHeapFreeSpace(page);
-
-			/*
-			 * Empty pages are always all-visible and all-frozen (note that
-			 * the same is currently not true for new pages, see above).
-			 */
-			if (!PageIsAllVisible(page))
-			{
-				START_CRIT_SECTION();
-
-				/* mark buffer dirty before writing a WAL record */
-				MarkBufferDirty(buf);
-
-				/*
-				 * It's possible that another backend has extended the heap,
-				 * initialized the page, and then failed to WAL-log the page
-				 * due to an ERROR.  Since heap extension is not WAL-logged,
-				 * recovery might try to replay our record setting the page
-				 * all-visible and find that the page isn't initialized, which
-				 * will cause a PANIC.  To prevent that, check whether the
-				 * page has been previously WAL-logged, and if not, do that
-				 * now.
-				 */
-				if (RelationNeedsWAL(onerel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
-					log_newpage_buffer(buf, true);
-
-				PageSetAllVisible(page);
-				visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
-								  vmbuffer, InvalidTransactionId,
-								  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
-				END_CRIT_SECTION();
-			}
-
-			UnlockReleaseBuffer(buf);
-			RecordPageWithFreeSpace(onerel, blkno, freespace);
+			/* Releases lock on buf for us (though keeps vmbuffer pin): */
+			scan_empty_page(onerel, buf, vmbuffer, vacrelstats);
 			continue;
 		}
 
 		/*
-		 * Prune all HOT-update chains in this page.
+		 * Step 6 for block: Do pruning.
 		 *
-		 * We count tuples removed by the pruning step as removed by VACUUM
-		 * (existing LP_DEAD line pointers don't count).
+		 * Also accumulates details of remaining LP_DEAD line pointers on page
+		 * in dead tuple list.  This includes LP_DEAD line pointers that we
+		 * ourselves just pruned, as well as existing LP_DEAD line pointers
+		 * pruned earlier.
+		 *
+		 * Also handles tuple freezing -- considers freezing XIDs from all
+		 * tuple headers left behind following pruning.
 		 */
-		tups_vacuumed += heap_page_prune(onerel, buf, vistest,
-										 InvalidTransactionId, 0, false,
-										 &vacrelstats->latestRemovedXid,
-										 &vacrelstats->offnum);
+		scan_prune_page(onerel, buf, vacrelstats, vistest, frozen,
+						&c, &ps, &vms);
 
 		/*
-		 * Now scan the page to collect vacuumable items and check for tuples
-		 * requiring freezing.
+		 * Step 7 for block: Set up details for saving free space in FSM at
+		 * end of loop.  (Also performs extra single pass strategy steps in
+		 * "nindexes == 0" case.)
+		 *
+		 * If we have any LP_DEAD items on this page (i.e. any new dead_tuples
+		 * entries compared to just before scan_prune_page()) then the page
+		 * will be visited again by lazy_vacuum_heap(), which will compute and
+		 * record its post-compaction free space.  If not, then we're done
+		 * with this page, so remember its free space as-is.
 		 */
-		all_visible = true;
-		has_dead_items = false;
-		nfrozen = 0;
-		hastup = false;
-		prev_dead_count = dead_tuples->num_tuples;
-		maxoff = PageGetMaxOffsetNumber(page);
-
-		/*
-		 * Note: If you change anything in the loop below, also look at
-		 * heap_page_is_all_visible to see if that needs to be changed.
-		 */
-		for (offnum = FirstOffsetNumber;
-			 offnum <= maxoff;
-			 offnum = OffsetNumberNext(offnum))
+		savefreespace = false;
+		freespace = 0;
+		if (nindexes > 0 && ps.has_dead_items &&
+			params->index_cleanup != VACOPT_TERNARY_DISABLED)
 		{
-			ItemId		itemid;
-
-			/*
-			 * Set the offset number so that we can display it along with any
-			 * error that occurred while processing this tuple.
-			 */
-			vacrelstats->offnum = offnum;
-			itemid = PageGetItemId(page, offnum);
-
-			/* Unused items require no processing, but we count 'em */
-			if (!ItemIdIsUsed(itemid))
-			{
-				nunused += 1;
-				continue;
-			}
-
-			/* Redirect items mustn't be touched */
-			if (ItemIdIsRedirected(itemid))
-			{
-				hastup = true;	/* this page won't be truncatable */
-				continue;
-			}
-
-			ItemPointerSet(&(tuple.t_self), blkno, offnum);
-
-			/*
-			 * LP_DEAD line pointers are to be vacuumed normally; but we don't
-			 * count them in tups_vacuumed, else we'd be double-counting (at
-			 * least in the common case where heap_page_prune() just freed up
-			 * a non-HOT tuple).  Note also that the final tups_vacuumed value
-			 * might be very low for tables where opportunistic page pruning
-			 * happens to occur very frequently (via heap_page_prune_opt()
-			 * calls that free up non-HOT tuples).
-			 */
-			if (ItemIdIsDead(itemid))
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				all_visible = false;
-				has_dead_items = true;
-				continue;
-			}
-
-			Assert(ItemIdIsNormal(itemid));
-
-			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
-			tuple.t_len = ItemIdGetLength(itemid);
-			tuple.t_tableOid = RelationGetRelid(onerel);
-
-			tupgone = false;
-
-			/*
-			 * The criteria for counting a tuple as live in this block need to
-			 * match what analyze.c's acquire_sample_rows() does, otherwise
-			 * VACUUM and ANALYZE may produce wildly different reltuples
-			 * values, e.g. when there are many recently-dead tuples.
-			 *
-			 * The logic here is a bit simpler than acquire_sample_rows(), as
-			 * VACUUM can't run inside a transaction block, which makes some
-			 * cases impossible (e.g. in-progress insert from the same
-			 * transaction).
-			 */
-			switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
-			{
-				case HEAPTUPLE_DEAD:
-
-					/*
-					 * Ordinarily, DEAD tuples would have been removed by
-					 * heap_page_prune(), but it's possible that the tuple
-					 * state changed since heap_page_prune() looked.  In
-					 * particular an INSERT_IN_PROGRESS tuple could have
-					 * changed to DEAD if the inserter aborted.  So this
-					 * cannot be considered an error condition.
-					 *
-					 * If the tuple is HOT-updated then it must only be
-					 * removed by a prune operation; so we keep it just as if
-					 * it were RECENTLY_DEAD.  Also, if it's a heap-only
-					 * tuple, we choose to keep it, because it'll be a lot
-					 * cheaper to get rid of it in the next pruning pass than
-					 * to treat it like an indexed tuple. Finally, if index
-					 * cleanup is disabled, the second heap pass will not
-					 * execute, and the tuple will not get removed, so we must
-					 * treat it like any other dead tuple that we choose to
-					 * keep.
-					 *
-					 * If this were to happen for a tuple that actually needed
-					 * to be deleted, we'd be in trouble, because it'd
-					 * possibly leave a tuple below the relation's xmin
-					 * horizon alive.  heap_prepare_freeze_tuple() is prepared
-					 * to detect that case and abort the transaction,
-					 * preventing corruption.
-					 */
-					if (HeapTupleIsHotUpdated(&tuple) ||
-						HeapTupleIsHeapOnly(&tuple) ||
-						params->index_cleanup == VACOPT_TERNARY_DISABLED)
-						nkeep += 1;
-					else
-						tupgone = true; /* we can delete the tuple */
-					all_visible = false;
-					break;
-				case HEAPTUPLE_LIVE:
-
-					/*
-					 * Count it as live.  Not only is this natural, but it's
-					 * also what acquire_sample_rows() does.
-					 */
-					live_tuples += 1;
-
-					/*
-					 * Is the tuple definitely visible to all transactions?
-					 *
-					 * NB: Like with per-tuple hint bits, we can't set the
-					 * PD_ALL_VISIBLE flag if the inserter committed
-					 * asynchronously. See SetHintBits for more info. Check
-					 * that the tuple is hinted xmin-committed because of
-					 * that.
-					 */
-					if (all_visible)
-					{
-						TransactionId xmin;
-
-						if (!HeapTupleHeaderXminCommitted(tuple.t_data))
-						{
-							all_visible = false;
-							break;
-						}
-
-						/*
-						 * The inserter definitely committed. But is it old
-						 * enough that everyone sees it as committed?
-						 */
-						xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-						if (!TransactionIdPrecedes(xmin, OldestXmin))
-						{
-							all_visible = false;
-							break;
-						}
-
-						/* Track newest xmin on page. */
-						if (TransactionIdFollows(xmin, visibility_cutoff_xid))
-							visibility_cutoff_xid = xmin;
-					}
-					break;
-				case HEAPTUPLE_RECENTLY_DEAD:
-
-					/*
-					 * If tuple is recently deleted then we must not remove it
-					 * from relation.
-					 */
-					nkeep += 1;
-					all_visible = false;
-					break;
-				case HEAPTUPLE_INSERT_IN_PROGRESS:
-
-					/*
-					 * This is an expected case during concurrent vacuum.
-					 *
-					 * We do not count these rows as live, because we expect
-					 * the inserting transaction to update the counters at
-					 * commit, and we assume that will happen only after we
-					 * report our results.  This assumption is a bit shaky,
-					 * but it is what acquire_sample_rows() does, so be
-					 * consistent.
-					 */
-					all_visible = false;
-					break;
-				case HEAPTUPLE_DELETE_IN_PROGRESS:
-					/* This is an expected case during concurrent vacuum */
-					all_visible = false;
-
-					/*
-					 * Count such rows as live.  As above, we assume the
-					 * deleting transaction will commit and update the
-					 * counters after we report.
-					 */
-					live_tuples += 1;
-					break;
-				default:
-					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
-					break;
-			}
-
-			if (tupgone)
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-													   &vacrelstats->latestRemovedXid);
-				tups_vacuumed += 1;
-				has_dead_items = true;
-			}
-			else
-			{
-				bool		tuple_totally_frozen;
-
-				num_tuples += 1;
-				hastup = true;
-
-				/*
-				 * Each non-removable tuple must be checked to see if it needs
-				 * freezing.  Note we already have exclusive buffer lock.
-				 */
-				if (heap_prepare_freeze_tuple(tuple.t_data,
-											  relfrozenxid, relminmxid,
-											  FreezeLimit, MultiXactCutoff,
-											  &frozen[nfrozen],
-											  &tuple_totally_frozen))
-					frozen[nfrozen++].offset = offnum;
-
-				if (!tuple_totally_frozen)
-					all_frozen = false;
-			}
-		}						/* scan along page */
-
-		/*
-		 * Clear the offset information once we have processed all the tuples
-		 * on the page.
-		 */
-		vacrelstats->offnum = InvalidOffsetNumber;
-
-		/*
-		 * If we froze any tuples, mark the buffer dirty, and write a WAL
-		 * record recording the changes.  We must log the changes to be
-		 * crash-safe against future truncation of CLOG.
-		 */
-		if (nfrozen > 0)
+			/* Wait until lazy_vacuum_heap() to save free space */
+		}
+		else
 		{
-			START_CRIT_SECTION();
-
-			MarkBufferDirty(buf);
-
-			/* execute collected freezes */
-			for (i = 0; i < nfrozen; i++)
-			{
-				ItemId		itemid;
-				HeapTupleHeader htup;
-
-				itemid = PageGetItemId(page, frozen[i].offset);
-				htup = (HeapTupleHeader) PageGetItem(page, itemid);
-
-				heap_execute_freeze_tuple(htup, &frozen[i]);
-			}
-
-			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
-			{
-				XLogRecPtr	recptr;
-
-				recptr = log_heap_freeze(onerel, buf, FreezeLimit,
-										 frozen, nfrozen);
-				PageSetLSN(page, recptr);
-			}
-
-			END_CRIT_SECTION();
+			/*
+			 * Will never reach lazy_vacuum_heap() (or will, but won't reach
+			 * this specific page)
+			 */
+			savefreespace = true;
+			freespace = PageGetHeapFreeSpace(page);
 		}
 
-		/*
-		 * If there are no indexes we can vacuum the page right now instead of
-		 * doing a second scan. Also we don't do that but forget dead tuples
-		 * when index cleanup is disabled.
-		 */
-		if (!vacrelstats->useindex && dead_tuples->num_tuples > 0)
+		if (nindexes == 0 && ps.has_dead_items)
 		{
-			if (nindexes == 0)
-			{
-				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
-				vacuumed_pages++;
-				has_dead_items = false;
-			}
-			else
-			{
-				/*
-				 * Here, we have indexes but index cleanup is disabled.
-				 * Instead of vacuuming the dead tuples on the heap, we just
-				 * forget them.
-				 *
-				 * Note that vacrelstats->dead_tuples could have tuples which
-				 * became dead after HOT-pruning but are not marked dead yet.
-				 * We do not process them because it's a very rare condition,
-				 * and the next vacuum will process them anyway.
-				 */
-				Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
-			}
+			Assert(dead_tuples->num_tuples > 0);
 
 			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
+			 * One pass strategy (no indexes) case.
+			 *
+			 * Mark LP_DEAD item pointers for LP_UNUSED now, since there won't
+			 * be a second pass in lazy_vacuum_heap().
 			 */
+			lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+			vacuumed_pages++;
+
+			/* This won't have changed: */
+			Assert(savefreespace && freespace == PageGetHeapFreeSpace(page));
+
+			/*
+			 * Make sure scan_setvmbit_page() won't stop setting VM due to
+			 * now-vacuumed LP_DEAD items:
+			 */
+			ps.has_dead_items = false;
+
+			/* Forget the now-vacuumed tuples */
 			dead_tuples->num_tuples = 0;
 
 			/*
@@ -1584,109 +1807,27 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 		}
 
-		freespace = PageGetHeapFreeSpace(page);
-
-		/* mark page all-visible, if appropriate */
-		if (all_visible && !all_visible_according_to_vm)
-		{
-			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
-
-			if (all_frozen)
-				flags |= VISIBILITYMAP_ALL_FROZEN;
-
-			/*
-			 * It should never be the case that the visibility map page is set
-			 * while the page-level bit is clear, but the reverse is allowed
-			 * (if checksums are not enabled).  Regardless, set both bits so
-			 * that we get back in sync.
-			 *
-			 * NB: If the heap page is all-visible but the VM bit is not set,
-			 * we don't need to dirty the heap page.  However, if checksums
-			 * are enabled, we do need to make sure that the heap page is
-			 * dirtied before passing it to visibilitymap_set(), because it
-			 * may be logged.  Given that this situation should only happen in
-			 * rare cases after a crash, it is not worth optimizing.
-			 */
-			PageSetAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, visibility_cutoff_xid, flags);
-		}
+		/* One pass strategy had better have no dead tuples by now: */
+		Assert(nindexes > 0 || dead_tuples->num_tuples == 0);
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after we checked it and before we took the buffer
-		 * content lock, so we must recheck before jumping to the conclusion
-		 * that something bad has happened.
+		 * Step 8 for block: Handle setting visibility map bit as appropriate
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
-		{
-			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
-				 vacrelstats->relname, blkno);
-			visibilitymap_clear(onerel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
+		scan_setvmbit_page(onerel, buf, vmbuffer, &ps, &vms);
 
 		/*
-		 * It's possible for the value returned by
-		 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
-		 * wrong for us to see tuples that appear to not be visible to
-		 * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
-		 * xmin value never moves backwards, but
-		 * GetOldestNonRemovableTransactionId() is conservative and sometimes
-		 * returns a value that's unnecessarily small, so if we see that
-		 * contradiction it just means that the tuples that we think are not
-		 * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
-		 * is correct.
-		 *
-		 * There should never be dead tuples on a page with PD_ALL_VISIBLE
-		 * set, however.
+		 * Step 9 for block: drop super-exclusive lock, finalize page by
+		 * recording its free space in the FSM as appropriate
 		 */
-		else if (PageIsAllVisible(page) && has_dead_items)
-		{
-			elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
-				 vacrelstats->relname, blkno);
-			PageClearAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_clear(onerel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
-
-		/*
-		 * If the all-visible page is all-frozen but not marked as such yet,
-		 * mark it as all-frozen.  Note that all_frozen is only valid if
-		 * all_visible is true, so we must check both.
-		 */
-		else if (all_visible_according_to_vm && all_visible && all_frozen &&
-				 !VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
-		{
-			/*
-			 * We can pass InvalidTransactionId as the cutoff XID here,
-			 * because setting the all-frozen bit doesn't cause recovery
-			 * conflicts.
-			 */
-			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, InvalidTransactionId,
-							  VISIBILITYMAP_ALL_FROZEN);
-		}
 
 		UnlockReleaseBuffer(buf);
-
 		/* Remember the location of the last page with nonremovable tuples */
-		if (hastup)
+		if (ps.hastup)
 			vacrelstats->nonempty_pages = blkno + 1;
-
-		/*
-		 * If we remembered any tuples for deletion, then the page will be
-		 * visited again by lazy_vacuum_heap, which will compute and record
-		 * its post-compaction free space.  If not, then we're done with this
-		 * page, so remember its free space as-is.  (This path will always be
-		 * taken if there are no indexes.)
-		 */
-		if (dead_tuples->num_tuples == prev_dead_count)
+		if (savefreespace)
 			RecordPageWithFreeSpace(onerel, blkno, freespace);
+
+		/* Finished all steps for block by here (at the latest) */
 	}
 
 	/* report that everything is scanned and vacuumed */
@@ -1698,14 +1839,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	pfree(frozen);
 
 	/* save stats for use later */
-	vacrelstats->tuples_deleted = tups_vacuumed;
-	vacrelstats->new_dead_tuples = nkeep;
+	vacrelstats->tuples_deleted = c.tups_vacuumed;
+	vacrelstats->new_dead_tuples = c.nkeep;
 
 	/* now we can compute the new value for pg_class.reltuples */
 	vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
 														  nblocks,
 														  vacrelstats->tupcount_pages,
-														  live_tuples);
+														  c.live_tuples);
 
 	/*
 	 * Also compute the total number of surviving heap entries.  In the
@@ -1724,20 +1865,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	}
 
 	/* If any tuples need to be deleted, perform final vacuum cycle */
-	/* XXX put a threshold on min number of tuples here? */
+	Assert(nindexes > 0 || dead_tuples->num_tuples == 0);
 	if (dead_tuples->num_tuples > 0)
-	{
-		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats,
-								lps, nindexes);
-
-		/* Remove tuples from heap */
-		lazy_vacuum_heap(onerel, vacrelstats);
-	}
+		two_pass_strategy(onerel, vacrelstats, Irel, indstats, nindexes,
+						  lps, params->index_cleanup);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
-	 * not there were indexes.
+	 * not there were indexes, and whether or not we skipped index vacuuming.
 	 */
 	if (blkno > next_fsm_block_to_vacuum)
 		FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
@@ -1745,8 +1880,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/* report all blocks vacuumed */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
-	/* Do post-vacuum cleanup */
-	if (vacrelstats->useindex)
+	/*
+	 * Do post-vacuum cleanup.
+	 *
+	 * Note that post-vacuum cleanup does not take place with
+	 * INDEX_CLEANUP=OFF.
+	 */
+	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
 		lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, nindexes);
 
 	/*
@@ -1756,23 +1896,29 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	if (ParallelVacuumIsActive(lps))
 		end_parallel_vacuum(indstats, lps, nindexes);
 
-	/* Update index statistics */
-	if (vacrelstats->useindex)
+	/*
+	 * Update index statistics.
+	 *
+	 * Note that updating the statistics does not take place with
+	 * INDEX_CLEANUP=OFF.
+	 */
+	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
 		update_index_statistics(Irel, indstats, nindexes);
 
-	/* If no indexes, make log report that lazy_vacuum_heap would've made */
-	if (vacuumed_pages)
+	/* If no indexes, make log report that two_pass_strategy() would've made */
+	Assert(nindexes == 0 || vacuumed_pages == 0);
+	if (nindexes == 0)
 		ereport(elevel,
 				(errmsg("\"%s\": removed %.0f row versions in %u pages",
 						vacrelstats->relname,
-						tups_vacuumed, vacuumed_pages)));
+						vacrelstats->tuples_deleted, vacuumed_pages)));
 
 	initStringInfo(&buf);
 	appendStringInfo(&buf,
 					 _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"),
-					 nkeep, OldestXmin);
+					 c.nkeep, OldestXmin);
 	appendStringInfo(&buf, _("There were %.0f unused item identifiers.\n"),
-					 nunused);
+					 c.nunused);
 	appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
 									"Skipped %u pages due to buffer pins, ",
 									vacrelstats->pinskipped_pages),
@@ -1788,18 +1934,77 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
 
 	ereport(elevel,
-			(errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
+			(errmsg("\"%s\": newly pruned %.0f items, found %.0f nonremovable items in %u out of %u pages",
 					vacrelstats->relname,
-					tups_vacuumed, num_tuples,
+					c.tups_vacuumed, c.num_tuples,
 					vacrelstats->scanned_pages, nblocks),
 			 errdetail_internal("%s", buf.data)));
 	pfree(buf.data);
 }
 
 /*
- *	lazy_vacuum_all_indexes() -- vacuum all indexes of relation.
+ * Remove the collected garbage tuples from the table and its indexes.
  *
- * We process the indexes serially unless we are doing parallel vacuum.
+ * We may be required to skip index vacuuming by the INDEX_CLEANUP reloption.
+ */
+static void
+two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
+				  Relation *Irel, IndexBulkDeleteResult **indstats, int nindexes,
+				  LVParallelState *lps, VacOptTernaryValue index_cleanup)
+{
+	bool		skipping;
+
+	/* Should not end up here with no indexes */
+	Assert(nindexes > 0);
+	Assert(!IsParallelWorker());
+
+	/* Check whether or not to do index vacuum and heap vacuum */
+	if (index_cleanup == VACOPT_TERNARY_DISABLED)
+		skipping = true;
+	else
+		skipping = false;
+
+	if (!skipping)
+	{
+		/* Okay, we're going to do index vacuuming */
+		lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats, lps,
+								nindexes);
+
+		/* Remove tuples from heap */
+		lazy_vacuum_heap(onerel, vacrelstats);
+	}
+	else
+	{
+		/*
+		 * We skipped index vacuuming.  Make the log report that
+		 * lazy_vacuum_heap would've made.
+		 *
+		 * Don't report tups_vacuumed here because it will be zero in the
+		 * common case where there are no newly pruned LP_DEAD items for this
+		 * VACUUM.  This is roughly consistent with lazy_vacuum_heap(), and
+		 * with the similar "nindexes == 0" ereport() at the end of lazy_scan_heap().
+		 * Note, however, that has_dead_items_pages is # of heap pages with
+		 * one or more LP_DEAD items (could be from us or from another
+		 * VACUUM), not # blocks scanned.
+		 */
+		ereport(elevel,
+				(errmsg("\"%s\": INDEX_CLEANUP off forced VACUUM to not totally remove %d pruned items",
+						vacrelstats->relname,
+						vacrelstats->dead_tuples->num_tuples)));
+	}
+
+	/*
+	 * Forget the now-vacuumed tuples and press on.  (We no longer track
+	 * latestRemovedXid here, so there is no other state that needs to be
+	 * preserved across calls.)
+	 */
+	vacrelstats->dead_tuples->num_tuples = 0;
+}
+
+/*
+ *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
+ *
+ * Should only be called through two_pass_strategy()
  */
 static void
 lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
@@ -1810,9 +2015,6 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 	Assert(!IsParallelWorker());
 	Assert(nindexes > 0);
 
-	/* Log cleanup info before we touch indexes */
-	vacuum_log_cleanup_info(onerel, vacrelstats);
-
 	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
@@ -1848,17 +2050,14 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 								 vacrelstats->num_index_scans);
 }
 
-
 /*
- *	lazy_vacuum_heap() -- second pass over the heap
+ *	lazy_vacuum_heap() -- second pass over the heap for two pass strategy
  *
  *		This routine marks dead tuples as unused and compacts out free
  *		space on their pages.  Pages not having dead tuples recorded from
  *		lazy_scan_heap are not visited at all.
  *
- * Note: the reason for doing this as a second pass is we cannot remove
- * the tuples until we've removed their index entries, and we want to
- * process index entry removal in batches as large as possible.
+ * Should only be called through two_pass_strategy()
  */
 static void
 lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
@@ -1932,7 +2131,7 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 }
 
 /*
- *	lazy_vacuum_page() -- free dead tuples on a page
+ *	lazy_vacuum_page() -- free LP_DEAD items on a page,
  *					 and repair its fragmentation.
  *
  * Caller must hold pin and buffer cleanup lock on the buffer.
@@ -1940,6 +2139,15 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
  * tupindex is the index in vacrelstats->dead_tuples of the first dead
  * tuple for this page.  We assume the rest follow sequentially.
  * The return value is the first tupindex after the tuples of this page.
+ *
+ * Prior to PostgreSQL 14 there were rare cases where this routine had to set
+ * tuples with storage to unused.  These days it is strictly responsible for
+ * marking LP_DEAD stub line pointers from pruning that took place during
+ * lazy_scan_heap() (or from existing LP_DEAD line pointers encountered
+ * there).  However, we still share infrastructure with heap pruning, and so
+ * we still require a super-exclusive lock, even though that should no longer
+ * be strictly necessary.  In the future we should be able to optimize this --
+ * it can work with only an exclusive lock.
  */
 static int
 lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
@@ -1972,6 +2180,8 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 			break;				/* past end of tuples for this block */
 		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
 		itemid = PageGetItemId(page, toff);
+
+		Assert(ItemIdIsDead(itemid));
 		ItemIdSetUnused(itemid);
 		unused[uncnt++] = toff;
 	}
@@ -1991,7 +2201,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 		recptr = log_heap_clean(onerel, buffer,
 								NULL, 0, NULL, 0,
 								unused, uncnt,
-								vacrelstats->latestRemovedXid);
+								InvalidTransactionId);
 		PageSetLSN(page, recptr);
 	}
 
@@ -2004,7 +2214,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	END_CRIT_SECTION();
 
 	/*
-	 * Now that we have removed the dead tuples from the page, once again
+	 * Now that we have removed the LP_DEAD items from the page, once again
 	 * check if the page has become all-visible.  The page is already marked
 	 * dirty, exclusively locked, and, if needed, a full page image has been
 	 * emitted in the log_heap_clean() above.
@@ -2867,14 +3077,14 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
  * Return the maximum number of dead tuples we can record.
  */
 static long
-compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
+compute_max_dead_tuples(BlockNumber relblocks, bool hasindex)
 {
 	long		maxtuples;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
 	autovacuum_work_mem != -1 ?
 	autovacuum_work_mem : maintenance_work_mem;
 
-	if (useindex)
+	if (hasindex)
 	{
 		maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
 		maxtuples = Min(maxtuples, INT_MAX);
@@ -2899,12 +3109,12 @@ compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
  * See the comments at the head of this file for rationale.
  */
 static void
-lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
+lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks, bool hasindex)
 {
 	LVDeadTuples *dead_tuples = NULL;
 	long		maxtuples;
 
-	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->useindex);
+	maxtuples = compute_max_dead_tuples(relblocks, hasindex);
 
 	dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
 	dead_tuples->num_tuples = 0;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c02c4e7710..1810a2e6aa 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1204,9 +1204,9 @@ backtrack:
 				 * index tuple refers to pre-cutoff heap tuples that were
 				 * certainly already pruned away during VACUUM's initial heap
 				 * scan by the time we get here. (heapam's XLOG_HEAP2_CLEAN
-				 * and XLOG_HEAP2_CLEANUP_INFO records produce conflicts using
-				 * a latestRemovedXid value for the pointed-to heap tuples, so
-				 * there is no need to produce our own conflict now.)
+				 * records produce conflicts using a latestRemovedXid value
+				 * for the pointed-to heap tuples, so there is no need to
+				 * produce our own conflict now.)
 				 *
 				 * Backends with snapshots acquired after a VACUUM starts but
 				 * before it finishes could have visibility cutoff with a
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index e60e32b935..1018ed78be 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -134,12 +134,6 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "cutoff xid %u ntuples %u",
 						 xlrec->cutoff_xid, xlrec->ntuples);
 	}
-	else if (info == XLOG_HEAP2_CLEANUP_INFO)
-	{
-		xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
-
-		appendStringInfo(buf, "latestRemovedXid %u", xlrec->latestRemovedXid);
-	}
 	else if (info == XLOG_HEAP2_VISIBLE)
 	{
 		xl_heap_visible *xlrec = (xl_heap_visible *) rec;
@@ -235,9 +229,6 @@ heap2_identify(uint8 info)
 		case XLOG_HEAP2_FREEZE_PAGE:
 			id = "FREEZE_PAGE";
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			id = "CLEANUP_INFO";
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			id = "VISIBLE";
 			break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f596135b1..11fcd861f7 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -481,7 +481,6 @@ DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 */
 		case XLOG_HEAP2_FREEZE_PAGE:
 		case XLOG_HEAP2_CLEAN:
-		case XLOG_HEAP2_CLEANUP_INFO:
 		case XLOG_HEAP2_VISIBLE:
 		case XLOG_HEAP2_LOCK_UPDATED:
 			break;
-- 
2.27.0

#64Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#62)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Mar 17, 2021 at 7:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Since I was thinking that always skipping index vacuuming on
anti-wraparound autovacuum is legitimate, skipping index vacuuming
only when we're really close to the point of going into read-only mode
seems a bit conservative, but maybe a good start. I've attached a PoC
patch to disable index vacuuming if the table's relfrozenxid is much
older than autovacuum_freeze_max_age (more than 1.5x
autovacuum_freeze_max_age).

Most anti-wraparound VACUUMs are really not emergencies, though. So
treating them as special simply because they're anti-wraparound
vacuums doesn't seem like the right thing to do. I think that we
should dynamically decide to do this when (antiwraparound) VACUUM has
already been running for some time. We need to delay the decision
until it is almost certainly true that we really have an emergency.

Can you take what you have here, and make the decision dynamic? Delay
it until we're done with the first heap scan? This will require
rebasing on top of the patch I posted. And then adding a third patch,
a little like the second patch -- but not too much like it.

In the second/SKIP_VACUUM_PAGES_RATIO patch I posted today, the
function two_pass_strategy() (my new name for the main entry point for
calling lazy_vacuum_all_indexes() and lazy_vacuum_heap()) is only
willing to perform the "skip index vacuuming" optimization when the
call to two_pass_strategy() is the first call and the last call for
that entire VACUUM (plus we test the number of heap blocks with
LP_DEAD items using SKIP_VACUUM_PAGES_RATIO, of course). It works this
way purely because I don't think that we should be aggressive when
we've already run out of maintenance_work_mem. That's a bad time to
apply a performance optimization.
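
Roughly speaking, the shape of the check in that patch is something
like this (a simplified sketch, not the exact code):

    /*
     * Simplified sketch: only apply the optimization when this is both
     * the first and the last call for the entire VACUUM, and when
     * relatively few heap pages have LP_DEAD items.
     */
    if (onecall && index_cleanup == VACOPT_CLEANUP_AUTO &&
        has_dead_items_pages <
        (BlockNumber) (vacrelstats->rel_pages * SKIP_VACUUM_PAGES_RATIO))
        skipping = true;
    else
        skipping = false;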

But what you're talking about now isn't a performance optimization
(the mechanism is similar or the same, but the underlying reasons are
totally different) -- it's a safety/availability thing. I don't think
that you need to be concerned about running out of
maintenance_work_mem in two_pass_strategy() when applying logic that
is concerned about keeping the database online by avoiding XID
wraparound. You just need to have high confidence that it is a true
emergency. I think that we can be ~99% sure that we're in a real
emergency by using dynamic information about how old relfrozenxid is
*now*, and by rechecking a few times during VACUUM. Probably by
rechecking every time we call two_pass_strategy().

I now believe that there is no fundamental correctness issue with
teaching two_pass_strategy() to skip index vacuuming when we're low on
memory -- it is 100% a matter of costs and benefits. The core
skip-index-vacuuming mechanism is very flexible. If we can be sure
that it's a real emergency, I think that we can justify behaving very
aggressively (letting indexes get bloated is after all very
aggressive). We just need to be 99%+ sure that continuing with
vacuuming will be worse than ending vacuuming. Which seems possible by
making the decision dynamic (and revisiting it at least a few times
during a very long VACUUM, in case things change).

--
Peter Geoghegan

#65Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#64)
1 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Thu, Mar 18, 2021 at 12:23 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Mar 17, 2021 at 7:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Since I was thinking that always skipping index vacuuming on
anti-wraparound autovacuum is legitimate, skipping index vacuuming
only when we're really close to the point of going into read-only mode
seems a bit conservative, but maybe a good start. I've attached a PoC
patch to disable index vacuuming if the table's relfrozenxid is much
older than autovacuum_freeze_max_age (more than 1.5x
autovacuum_freeze_max_age).

Most anti-wraparound VACUUMs are really not emergencies, though. So
treating them as special simply because they're anti-wraparound
vacuums doesn't seem like the right thing to do. I think that we
should dynamically decide to do this when (antiwraparound) VACUUM has
already been running for some time. We need to delay the decision
until it is almost certainly true that we really have an emergency.

That's a good idea to delay the decision until two_pass_strategy().

Can you take what you have here, and make the decision dynamic? Delay
it until we're done with the first heap scan? This will require
rebasing on top of the patch I posted. And then adding a third patch,
a little like the second patch -- but not too much like it.

Attached the updated patch that can be applied on top of your v3 patches.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

v3-0003-Skip-index-vacuuming-when-there-is-a-risk-of-wrap.patch (application/octet-stream)
From 6d227af53823866b56cbb7932a7d8e4f21f764d0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 18 Mar 2021 14:30:10 +0900
Subject: [PATCH v3 3/3] Skip index vacuuming when there is a risk of
 wraparound.

If a table's relfrozenxid/relminmxid is much older than the freeze max
age threshold (e.g., more than autovacuum_freeze_max_age * 1.5 XIDs old),
we skip index vacuuming to complete lazy vacuum quickly and advance
relfrozenxid/relminmxid.
---
 src/backend/access/heap/vacuumlazy.c | 92 ++++++++++++++++++++++++----
 src/backend/utils/misc/guc.c         |  4 +-
 src/include/postmaster/autovacuum.h  |  6 ++
 3 files changed, 89 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 0bed78bd17..435a2df763 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -388,8 +388,9 @@ static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 static void two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
 							  Relation *Irel, IndexBulkDeleteResult **indstats,
 							  int nindexes, LVParallelState *lps,
-							  VacOptIndexCleanupValue index_cleanup,
-							  BlockNumber has_dead_items_pages, bool onecall);
+							  VacuumParams *params, BlockNumber has_dead_items_pages,
+							  bool onecall);
+static bool check_index_cleanup_xid_limit(Relation onerel);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
@@ -1619,8 +1620,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 			/* Remove the collected garbage tuples from table and indexes */
 			two_pass_strategy(onerel, vacrelstats, Irel, indstats, nindexes,
-							  lps, params->index_cleanup,
-							  has_dead_items_pages, false);
+							  lps, params, has_dead_items_pages, false);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
@@ -1904,8 +1904,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	Assert(nindexes > 0 || dead_tuples->num_tuples == 0);
 	if (dead_tuples->num_tuples > 0)
 		two_pass_strategy(onerel, vacrelstats, Irel, indstats, nindexes,
-						  lps, params->index_cleanup,
-						  has_dead_items_pages, !calledtwopass);
+						  lps, params, has_dead_items_pages, !calledtwopass);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
@@ -1992,7 +1991,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 static void
 two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
 				  Relation *Irel, IndexBulkDeleteResult **indstats, int nindexes,
-				  LVParallelState *lps, VacOptIndexCleanupValue index_cleanup,
+				  LVParallelState *lps, VacuumParams *params,
 				  BlockNumber has_dead_items_pages, bool onecall)
 {
 	bool		skipping;
@@ -2018,17 +2017,30 @@ two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
 	 * HOT-pruning but are not marked dead yet.  We do not process them because
 	 * it's a very rare condition, and the next vacuum will process them anyway.
 	 */
-	if (index_cleanup == VACOPT_CLEANUP_DISABLED)
+	if (params->index_cleanup == VACOPT_CLEANUP_DISABLED)
 		skipping = true;
-	else if (index_cleanup == VACOPT_CLEANUP_ENABLED)
+	else if (params->index_cleanup == VACOPT_CLEANUP_ENABLED)
 		skipping = false;
 	else if (!onecall)
 		skipping = false;
+
+	/*
+	 * If a table is at risk of wraparound, we further check if the table's
+	 * relfrozenxid/relminmxid is much older than the freeze maximum age (e.g.,
+	 * more than autovacuum_freeze_max_age * 1.5 XIDs old).  If so, we disable
+	 * index vacuuming to complete the vacuum operation quickly and advance
+	 * relfrozenxid/relminmxid.  Note that this applies only to autovacuum workers.
+	 */
+	else if (params->is_wraparound && check_index_cleanup_xid_limit(onerel))
+	{
+		Assert(IsAutoVacuumWorkerProcess());
+		skipping = true;
+	}
 	else
 	{
 		BlockNumber rel_pages_threshold;
 
-		Assert(onecall && index_cleanup == VACOPT_CLEANUP_AUTO);
+		Assert(onecall && params->index_cleanup == VACOPT_CLEANUP_AUTO);
 
 		rel_pages_threshold =
 				(double) vacrelstats->rel_pages * SKIP_VACUUM_PAGES_RATIO;
@@ -2062,7 +2074,7 @@ two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
 		 * one or more LP_DEAD items (could be from us or from another
 		 * VACUUM), not # blocks scanned.
 		 */
-		if (index_cleanup == VACOPT_CLEANUP_AUTO)
+		if (params->index_cleanup == VACOPT_CLEANUP_AUTO)
 			ereport(elevel,
 					(errmsg("\"%s\": opted to not totally remove %d pruned items in %u pages",
 							vacrelstats->relname,
@@ -2084,6 +2096,64 @@ two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
 	vacrelstats->dead_tuples->num_tuples = 0;
 }
 
+/*
+ * Return true if the table's relfrozenxid/relminmxid is much older than
+ * the freeze max age.
+ */
+static bool
+check_index_cleanup_xid_limit(Relation onerel)
+{
+	StdRdOptions *relopts = (StdRdOptions *) onerel->rd_options;
+	TransactionId xid_skip_limit;
+	MultiXactId multi_skip_limit;
+	int freeze_max_age;
+	int multixact_freeze_max_age;
+	int effective_multixact_freeze_max_age;
+
+	/*
+	 * Check if the table's relfrozenxid is much older than autovacuum_freeze_max_age
+	 * (more than autovacuum_freeze_max_age * 1.5 XIDs old).
+	 */
+	freeze_max_age = (relopts && relopts->autovacuum.freeze_max_age >= 0)
+		? Min(relopts->autovacuum.freeze_max_age, autovacuum_freeze_max_age)
+		: autovacuum_freeze_max_age;
+	freeze_max_age = Min(freeze_max_age * 1.5, MAX_AUTOVACUUM_FREEZE_MAX_AGE);
+
+	xid_skip_limit = ReadNextTransactionId() - freeze_max_age;
+	if (!TransactionIdIsNormal(xid_skip_limit))
+		xid_skip_limit = FirstNormalTransactionId;
+
+	if (TransactionIdIsNormal(onerel->rd_rel->relfrozenxid) &&
+		TransactionIdPrecedes(onerel->rd_rel->relfrozenxid,
+							  xid_skip_limit))
+		return true;
+
+	/*
+	 * Similar to above, check multixact age.  This is normally
+	 * autovacuum_multixact_freeze_max_age, but may be less if we are short of
+	 * multixact member space.
+	 */
+	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+
+	multixact_freeze_max_age = (relopts && relopts->autovacuum.multixact_freeze_max_age >= 0)
+		? Min(relopts->autovacuum.multixact_freeze_max_age,
+			  effective_multixact_freeze_max_age)
+		: effective_multixact_freeze_max_age;
+	multixact_freeze_max_age = Min(multixact_freeze_max_age * 1.5,
+								   MAX_AUTOVACUUM_FREEZE_MAX_AGE);
+
+	multi_skip_limit = ReadNextMultiXactId() - multixact_freeze_max_age;
+	if (multi_skip_limit < FirstMultiXactId)
+		multi_skip_limit = FirstMultiXactId;
+
+	if (MultiXactIdIsValid(onerel->rd_rel->relminmxid) &&
+		MultiXactIdPrecedes(onerel->rd_rel->relminmxid,
+							multi_skip_limit))
+		return true;
+
+	return false;
+}
+
 /*
  *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
  *
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b263e3493b..53aa444e13 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3207,7 +3207,7 @@ static struct config_int ConfigureNamesInt[] =
 		},
 		&autovacuum_freeze_max_age,
 		/* see pg_resetwal if you change the upper-limit value */
-		200000000, 100000, 2000000000,
+		200000000, 100000, MAX_AUTOVACUUM_FREEZE_MAX_AGE,
 		NULL, NULL, NULL
 	},
 	{
@@ -3217,7 +3217,7 @@ static struct config_int ConfigureNamesInt[] =
 			NULL
 		},
 		&autovacuum_multixact_freeze_max_age,
-		400000000, 10000, 2000000000,
+		400000000, 10000, MAX_AUTOVACUUM_FREEZE_MAX_AGE,
 		NULL, NULL, NULL
 	},
 	{
diff --git a/src/include/postmaster/autovacuum.h b/src/include/postmaster/autovacuum.h
index aacdd0f575..e56e0d73ad 100644
--- a/src/include/postmaster/autovacuum.h
+++ b/src/include/postmaster/autovacuum.h
@@ -16,6 +16,12 @@
 
 #include "storage/block.h"
 
+/*
+ * Maximum value of autovacuum_freeze_max_age and
+ * autovacuum_multixact_freeze_max_age parameters.
+ */
+#define MAX_AUTOVACUUM_FREEZE_MAX_AGE	2000000000
+
 /*
  * Other processes can request specific work from autovacuum, identified by
  * AutoVacuumWorkItem elements.
-- 
2.24.3 (Apple Git-128)

#66Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#65)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Mar 17, 2021 at 11:23 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Attached the updated patch that can be applied on top of your v3 patches.

Some feedback on this:

* I think that we can afford to be very aggressive here, because we're
checking dynamically. And we're concerned about extremes only. So an
age as high as 1 billion transactions seems like a better approach.
What do you think?

* I think that you need to remember that we have decided not to do any
more index vacuuming, rather than calling
check_index_cleanup_xid_limit() each time -- maybe store that
information in a state variable.

This seems like a good idea because we should try to avoid changing
back to index vacuuming once we have decided to skip it. Also, we need
to refer to this in lazy_scan_heap(), so that we also avoid index
cleanup when we have avoided index vacuuming. This is like the INDEX_CLEANUP =
off case, which is also only for emergencies. It is not like the
SKIP_VACUUM_PAGES_RATIO case, which is just an optimization.
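
To sketch what I mean (the field name here is made up, just for
illustration -- it isn't from your patch):

    /* Hypothetical new field in LVRelStats */
    bool        skipped_index_vacuum;   /* wraparound failsafe triggered? */

    /* In two_pass_strategy(): only check the limit until it first trips */
    if (!vacrelstats->skipped_index_vacuum &&
        check_index_cleanup_xid_limit(onerel))
        vacrelstats->skipped_index_vacuum = true;

    if (vacrelstats->skipped_index_vacuum)
        skipping = true;        /* never switch back within this VACUUM */

lazy_scan_heap() can then test the same flag later on, to skip index
cleanup as well.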

Thanks
--
Peter Geoghegan

#67Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#66)
Re: New IndexAM API controlling index vacuum strategies

On Thu, Mar 18, 2021 at 3:41 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Mar 17, 2021 at 11:23 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Attached the updated patch that can be applied on top of your v3 patches.

Some feedback on this:

* I think that we can afford to be very aggressive here, because we're
checking dynamically. And we're concerned about extremes only. So an
age as high as 1 billion transactions seems like a better approach.
What do you think?

With a constant threshold of 1 billion transactions, a vacuum
operation might not be an anti-wraparound vacuum, or even an
aggressive vacuum, depending on the autovacuum_freeze_max_age value.
Given the purpose of skipping index vacuuming in this case, I don't
think it makes sense to have a non-aggressive vacuum skip index
vacuuming, since it might not be able to advance relfrozenxid. If we
use a constant threshold, then 2 billion transactions (the maximum
value of autovacuum_freeze_max_age) seems to work.

* I think that you need to remember that we have decided not to do any
more index vacuuming, rather than calling
check_index_cleanup_xid_limit() each time -- maybe store that
information in a state variable.

This seems like a good idea because we should try to avoid changing
back to index vacuuming once we have decided to skip it.

Once we have decided to skip index vacuuming because relfrozenxid is
too old, the decision never needs to change within the same vacuum
operation, right? Because relfrozenxid is advanced at the end of vacuum.

Also, we need
to refer to this in lazy_scan_heap(), so that we also avoid index
cleanup when we have avoided index vacuuming. This is like the INDEX_CLEANUP =
off case, which is also only for emergencies. It is not like the
SKIP_VACUUM_PAGES_RATIO case, which is just an optimization.

Agreed with this point. I'll fix it in the next version patch.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#68Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#67)
Re: New IndexAM API controlling index vacuum strategies

On Thu, Mar 18, 2021 at 3:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

With a constant threshold of 1 billion transactions, a vacuum
operation might not be an anti-wraparound vacuum, or even an
aggressive vacuum, depending on the autovacuum_freeze_max_age value.
Given the purpose of skipping index vacuuming in this case, I don't
think it makes sense to have a non-aggressive vacuum skip index
vacuuming, since it might not be able to advance relfrozenxid. If we
use a constant threshold, then 2 billion transactions (the maximum
value of autovacuum_freeze_max_age) seems to work.

I like the idea of not making the behavior a special thing that only
happens with a certain variety of VACUUM operation (non-aggressive or
anti-wraparound VACUUMs). Just having a very high threshold should be
enough.

Even if we're not going to be able to advance relfrozenxid, we'll
still finish much earlier and let a new anti-wraparound vacuum take
place that will do that -- and will be able to reuse much of the work
of the original VACUUM. Of course this anti-wraparound vacuum will
also skip index vacuuming from the start (whereas the first VACUUM may
well have done some index vacuuming before deciding to end index
vacuuming to hurry with finishing).

There is a risk in having the limit be too high, though. We need to
give VACUUM time to reach two_pass_strategy() to notice the problem
and act (maybe each call to lazy_vacuum_all_indexes() takes a long
time). Also, while it's possible (and perhaps even likely) that cases
that use this emergency mechanism will be able to end the VACUUM
immediately (because there is enough maintenance_work_mem to make
the first call to two_pass_strategy() also the last call), that won't
always be how it works. Even deciding to stop index vacuuming (and
heap vacuuming) may not be enough to avert disaster if left too late
-- because we may still have to do a lot of table pruning. In cases
where there is not nearly enough maintenance_work_mem we will get
through the table a lot faster once we decide to skip indexes, but
there is some risk that even this will not be fast enough.

How about 1.8 billion XIDs? That's the maximum value of
autovacuum_freeze_max_age (2 billion) minus the default value (200
million). That is high enough that it seems almost impossible for this
emergency mechanism to hurt rather than help. At the same time it is
not so high that there isn't some remaining time to finish off work
which is truly required.

This seems like a good idea because we should try to avoid changing
back to index vacuuming once we have decided to skip it.

Once we have decided to skip index vacuuming because relfrozenxid is
too old, the decision never needs to change within the same vacuum
operation, right? Because relfrozenxid is advanced at the end of vacuum.

I see no reason why it would be fundamentally incorrect to teach
two_pass_strategy() to make new and independent decisions about doing
index vacuuming on each call. I just don't think it makes any
sense to do so, practically speaking. Why would we even *want* to
decide to not do index vacuuming, and then change our mind about it
again (resume index vacuuming again, for later heap blocks)? That
sounds a bit too much like me!

There is another reason to never go back to index vacuuming: we should
have an ereport() at the point that we decide to not do index
vacuuming (or not do additional index vacuuming) inside
two_pass_strategy(). This should deliver an unpleasant message to the
DBA. The message is (in my own informal language): An emergency
failsafe mechanism made VACUUM skip index vacuuming, just to avoid
likely XID wraparound failure. This is not supposed to happen.
Consider tuning autovacuum settings, especially if you see this
message regularly.
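
Concretely, I'm imagining something along these lines (the exact
wording and the errmsg/errhint split are only a sketch, of course):

    ereport(WARNING,
            (errmsg("abandoning index vacuuming of table \"%s.%s\" as a failsafe against wraparound failure",
                    vacrelstats->relnamespace, vacrelstats->relname),
             errhint("Consider tuning autovacuum settings, especially if you see this message regularly.")));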

Obviously the reason to delay the decision is that we cannot easily
predict how long any given VACUUM will take (or even just how long it
will take to reach two_pass_strategy()). Nor can we really hope to understand how many
XIDs will be consumed in that time. So rather than trying to
understand all that, we can instead just wait until we have reliable
information. It is true that the risk of waiting until it's too late
to avert disaster exists (which is why 1.8 billion XIDs seems like a
good threshold to me), but there is only so much we can do about that.
We don't need it to be perfect, just much better.

In my experience, anti-wraparound VACUUM scenarios all have an
"accident chain", which is a concept from the world of aviation and
safety-critical systems:

https://en.wikipedia.org/wiki/Chain_of_events_(accident_analysis)

They usually involve some *combination* of Postgres problems,
application code problems, and DBA error. Not any one thing. I've seen
problems with application code that runs DDL at scheduled intervals,
which interacts badly with vacuum -- but only really on the rare
occasions when freezing is required! I've also seen a buggy
hand-written upsert function that artificially burned through XIDs at
a furious pace. So we really want this mechanism to not rely on the
system being in its typical state, if at all possible. When index
vacuuming is skipped due to concerns about XID wraparound, it should
really be a rare emergency, and an unpleasant surprise to
the DBA. Nobody should rely on this mechanism consistently.

--
Peter Geoghegan

#69Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#64)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Mar 17, 2021 at 11:23 PM Peter Geoghegan <pg@bowt.ie> wrote:

Most anti-wraparound VACUUMs are really not emergencies, though.

That's true, but it's equally true that most of the time it's not
necessary to wear a seatbelt to avoid personal injury. The difficulty
is that it's hard to predict on which occasions it is necessary, and
therefore it is advisable to do it all the time. autovacuum decides
whether an emergency exists, in the first instance, by comparing
age(relfrozenxid) to autovacuum_freeze_max_age, but that's problematic
for at least two reasons. First, what matters is not when the vacuum
starts, but when the vacuum finishes. A user who has no tables larger
than 100MB can set autovacuum_freeze_max_age a lot closer to the high
limit without risk of hitting it than a user who has a 10TB table. The
time to run vacuum is dependent on both the size of the table and the
applicable cost delay settings, none of which autovacuum knows
anything about. It also knows nothing about the XID consumption rate.
It's relying on the user to set autovacuum_freeze_max_age low enough
that all the anti-wraparound vacuums will finish before the system
crashes into a wall. Second, what happens to one table affects what
happens to other tables. Even if you have perfect knowledge of your
XID consumption rate and the speed at which vacuum will complete, you
can't just configure autovacuum_freeze_max_age to allow exactly enough
time for the vacuum to complete once it hits the threshold, unless you
have one autovacuum worker per table so that the work for that table
never has to wait for work on any other tables. And even then, as you
mention, you have to worry about the possibility that a vacuum was
already in progress on that table itself. Here again, we rely on the
user to know empirically how high they can set
autovacuum_freeze_max_age without cutting it too close.

Now, that's not actually a good thing, because most users aren't smart
enough to do that, and will either leave a gigantic safety margin that
they don't need, or will leave an inadequate safety margin and take
the system down. However, it means we need to be very, very careful
about hard-coded thresholds like 90% of the available XID space. I do
think that there is a case for triggering emergency extra safety
measures when things are looking scary. One that I think would help a
tremendous amount is to start ignoring the vacuum cost delay when
wraparound danger (and maybe even bloat danger) starts to loom.
Perhaps skipping index vacuuming is another such measure, though I
suspect it would help fewer people, because in most of the cases I
see, the system is throttled to use a tiny percentage of its actual
hardware capability. If you're running at 1/5 of the speed of which
the hardware is capable, you can only do better by skipping index
cleanup if that skips more than 80% of page accesses, which could be
true but probably isn't. In reality, I think we probably want both
mechanisms, because they complement each other. If one can save 3X and
the other 4X, the combination is a 12X improvement, which is a big
deal. We might want other things, too.

But ... should the thresholds for triggering these kinds of mechanisms
really be hard-coded with no possibility of being configured in the
field? What if we find out after the release is shipped that the
mechanism works better if you make it kick in sooner, or later, or if
it depends on other things about the system, which I think it almost
certainly does? Thresholds that can't be changed without a recompile
are bad news. That's why we have GUCs.
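
For example, something like this in guc.c would do (the name, group,
and numbers here are placeholders, just to make the point that the
threshold ought to be user-settable):

    {
        {"vacuum_skip_index_age", PGC_USERSET, AUTOVACUUM,
            gettext_noop("Age at which VACUUM should skip index vacuuming "
                         "to avoid transaction ID wraparound."),
            NULL
        },
        &vacuum_skip_index_age,         /* placeholder C variable */
        1800000000, 0, 2000000000,
        NULL, NULL, NULL
    },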

On another note, I cannot say enough bad things about the function
name two_pass_strategy(). I sincerely hope that you're not planning to
create a function which is a major point of control for VACUUM whose
name gives no hint that it has anything to do with vacuum.

--
Robert Haas
EDB: http://www.enterprisedb.com

In reply to: Robert Haas (#69)
Re: New IndexAM API controlling index vacuum strategies

On Thu, Mar 18, 2021 at 2:05 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 17, 2021 at 11:23 PM Peter Geoghegan <pg@bowt.ie> wrote:

Most anti-wraparound VACUUMs are really not emergencies, though.

That's true, but it's equally true that most of the time it's not
necessary to wear a seatbelt to avoid personal injury. The difficulty
is that it's hard to predict on which occasions it is necessary, and
therefore it is advisable to do it all the time.

Just to be clear: This was pretty much the point I was making here --
although I guess you're making the broader point about autovacuum and
freezing in general.

The fact that we can *continually* reevaluate if an ongoing VACUUM is
at risk of taking too long is entirely the point here. We can in
principle end index vacuuming dynamically, whenever we feel like it
and for whatever reasons occur to us (hopefully these are good reasons
-- the point is that we get to pick and choose). We can afford to be
pretty aggressive about not giving up, while still having the benefit
of doing that when it *proves* necessary. Because: what are the
chances of the emergency mechanism ending index vacuuming being the
wrong thing to do if we only do that when the system clearly and
measurably has no more than about 10% of the possible XID space to go
before the system becomes unavailable for writes?

What could possibly matter more than that?

By making the decision dynamic, the chances of our
threshold/heuristics causing the wrong behavior become negligible --
even though we're making the decision based on a tiny amount of
(current, authoritative) information. The only novel risk I can think
about is that somebody comes to rely on the mechanism saving the day,
over and over again, rather than fixing a fixable problem.

autovacuum decides
whether an emergency exists, in the first instance, by comparing
age(relfrozenxid) to autovacuum_freeze_max_age, but that's problematic
for at least two reasons. First, what matters is not when the vacuum
starts, but when the vacuum finishes.

To be fair the vacuum_set_xid_limits() mechanism that you refer to
makes perfect sense. It's just totally insufficient for the reasons
you say.

A user who has no tables larger
than 100MB can set autovacuum_freeze_max_age a lot closer to the high
limit without risk of hitting it than a user who has a 10TB table. The
time to run vacuum is dependent on both the size of the table and the
applicable cost delay settings, none of which autovacuum knows
anything about. It also knows nothing about the XID consumption rate.
It's relying on the user to set autovacuum_freeze_max_age low enough
that all the anti-wraparound vacuums will finish before the system
crashes into a wall.

Literally nobody on earth knows what their XID burn rate is when it
really matters. It might be totally out of control on that one day of
your life when it truly matters (e.g., due to a recent buggy code
deployment, which I've seen up close). That's how emergencies work.

A dynamic approach is not merely preferable. It seems essential. No
top-down plan is going to be smart enough to predict that it'll take a
really long time to get that one super-exclusive lock on relatively
few pages.

Second, what happens to one table affects what
happens to other tables. Even if you have perfect knowledge of your
XID consumption rate and the speed at which vacuum will complete, you
can't just configure autovacuum_freeze_max_age to allow exactly enough
time for the vacuum to complete once it hits the threshold, unless you
have one autovacuum worker per table so that the work for that table
never has to wait for work on any other tables. And even then, as you
mention, you have to worry about the possibility that a vacuum was
already in progress on that table itself. Here again, we rely on the
user to know empirically how high they can set
autovacuum_freeze_max_age without cutting it too close.

But the VM is a lot more useful when you effectively eliminate index
vacuuming from the picture. And VACUUM has a pretty good understanding
of how that works. Index vacuuming remains the achilles' heel, and I
think that avoiding it in some cases has tremendous value. It has
outsized importance now because we've significantly ameliorated the
problems in the heap, by having the visibility map. What other factor
can make VACUUM take 10x longer than usual on occasion?

Autovacuum scheduling is essentially a top-down model of the needs of
the system -- and one with a lot of flaws. IMV we can make the model's
simplistic view of reality better by making the reality better (i.e.
simpler, more tolerant of stressors) instead of making the model
better.

Now, that's not actually a good thing, because most users aren't smart
enough to do that, and will either leave a gigantic safety margin that
they don't need, or will leave an inadequate safety margin and take
the system down. However, it means we need to be very, very careful
about hard-coded thresholds like 90% of the available XID space. I do
think that there is a case for triggering emergency extra safety
measures when things are looking scary. One that I think would help a
tremendous amount is to start ignoring the vacuum cost delay when
wraparound danger (and maybe even bloat danger) starts to loom.

We've done a lot to ameliorate that problem in recent releases, simply
by updating the defaults.

Perhaps skipping index vacuuming is another such measure, though I
suspect it would help fewer people, because in most of the cases I
see, the system is throttled to use a tiny percentage of its actual
hardware capability. If you're running at 1/5 of the speed of which
the hardware is capable, you can only do better by skipping index
cleanup if that skips more than 80% of page accesses, which could be
true but probably isn't.

The proper thing for VACUUM to be throttled on these days is dirtying
pages. Skipping index vacuuming and skipping the second pass over the
heap will both make an enormous difference in many cases, precisely
because they'll avoid dirtying nearly so many pages. Especially in the
really bad cases, which are precisely where we see problems. Think
about how many pages you'll dirty with a UUID-based index with regular
churn from updates. Plus indexes don't have a visibility map. Whereas
an append-mostly pattern is common with the largest tables.

Perhaps it doesn't matter, but FWIW I think that you're drastically
underestimating the extent to which index vacuuming is now the
problem, in a certain important sense. I think that skipping index
vacuuming and heap vacuuming (i.e. just doing the bare minimum,
pruning) will in fact reduce the number of page accesses by 80% in
many many cases. But I suspect it makes an even bigger difference in
the cases where users are at most risk of wraparound related outages
to begin with. ISTM that you're focussing too much on the everyday
cases, the majority, which are not the cases where everything truly
falls apart. The extremes really matter.

Index vacuuming gets really slow when we're low on
maintenance_work_mem -- horribly slow. Whereas that doesn't matter at
all if you skip indexes. What do you think are the chances that that
was a major factor in those sites that actually had an outage in the
end? My intuition is that eliminating worst-case variability is the
really important thing here. Heap vacuuming just doesn't have that
multiplicative quality. Its costs tend to be proportionate to the
workload, and stable over time.

But ... should the thresholds for triggering these kinds of mechanisms
really be hard-coded with no possibility of being configured in the
field? What if we find out after the release is shipped that the
mechanism works better if you make it kick in sooner, or later, or if
it depends on other things about the system, which I think it almost
certainly does? Thresholds that can't be changed without a recompile
are bad news. That's why we have GUCs.

I'm fine with a GUC, though only for the emergency mechanism. The
default really matters, though -- it shouldn't be necessary to tune
(since we're trying to address a problem that many people don't know
they have until it's too late). I still like 1.8 billion XIDs as the
value -- I propose that that be made the default.

On another note, I cannot say enough bad things about the function
name two_pass_strategy(). I sincerely hope that you're not planning to
create a function which is a major point of control for VACUUM whose
name gives no hint that it has anything to do with vacuum.

You always hate my names for things. But that's fine by me -- I'm
usually not very attached to them. I'm happy to change it to whatever
you prefer.

FWIW, that name was intended to highlight that VACUUMs with indexes
will now always use the two-pass strategy. This is not to be confused
with the one-pass strategy, which is now strictly used on tables with
no indexes -- this even includes the INDEX_CLEANUP=off case with the
patch.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#63)
3 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Mar 17, 2021 at 7:55 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached patch series splits everything up. There is now a large patch
that removes the tupgone special case, and a second patch that
actually adds code that dynamically decides to not do index vacuuming
in cases where (for whatever reason) it doesn't seem useful.

Attached is v4. This revision of the patch series is split up into
smaller pieces for easier review. There are now 3 patches in the
series:

1. A refactoring patch that takes code from lazy_scan_heap() and
breaks it into several new functions.

Not too many changes compared to the last revision here (mostly took
things out and put them in the second patch).

2. A patch to remove the tupgone case.

Several new and interesting changes here -- see below.

3. The patch to optimize VACUUM by teaching it to skip index and heap
vacuuming in certain cases where we only expect a very small benefit.

No changes at all in the third patch.

We now go further with removing unnecessary stuff in WAL records in
the second patch. We also go further with simplifying heap page
vacuuming more generally.

I have invented a new record that is only used by heap page vacuuming.
This means that heap page pruning and heap page vacuuming no longer
share the same xl_heap_clean/XLOG_HEAP2_CLEAN WAL record (which is
what they do today, on master). Rather, there are two records:

* XLOG_HEAP2_PRUNE/xl_heap_prune -- actually just the new name for
xl_heap_clean, renamed to reflect the fact that only pruning uses it.

* XLOG_HEAP2_VACUUM/xl_heap_vacuum -- this one is truly new, though
it's actually just a very primitive version of xl_heap_prune -- since
of course heap page vacuuming is now so much simpler.
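
To give a sense of how primitive the new record is, it needn't carry
much more than an array of offset numbers (a sketch of the general
shape; see the 0002 patch for the actual definition):

    typedef struct xl_heap_vacuum
    {
        uint16      nunused;
        /* OFFSET NUMBERS of now-unused items FOLLOW AT END OF STRUCT */
    } xl_heap_vacuum;

    #define SizeOfHeapVacuum (offsetof(xl_heap_vacuum, nunused) + sizeof(uint16))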

I have also taught heap page vacuuming (not pruning) that it only
needs a regular exclusive buffer lock -- there is no longer any need
for a super-exclusive buffer lock. And, heap vacuuming/xl_heap_vacuum
records don't deal with recovery conflicts. These two changes to heap
vacuuming (not pruning) are not additional performance optimizations,
at least to me. I did things this way in v4 because it just made
sense. We don't require index vacuuming to use super-exclusive locks
[1], or to
generate its own recovery conflicts (pruning is assumed to take care
of all that in every index AM, bar none). So why would we continue to
require heap vacuuming to do either of these things now?
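
In other words, the second heap pass can now be as simple as this in
outline (a sketch of the shape of the change, not the exact patch
code):

    /* An ordinary exclusive lock is enough -- no LockBufferForCleanup() */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    /* ... set the page's LP_DEAD items to LP_UNUSED and emit a single
     * XLOG_HEAP2_VACUUM record; no recovery conflict is generated ... */
    UnlockReleaseBuffer(buf);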

This patch is intended to make index vacuuming and heap vacuuming very
similar. Not just because it facilitates work like the work in the
third patch -- it also happens to make perfect sense.

[1] It's true that sometimes index vacuuming uses super-exclusive
locks, but that isn't essential and is probably bad and unnecessary in
the case of nbtree. Note that GiST is fine with just an exclusive lock
today, to give one example, even though gistvacuumpage() is based
closely on nbtree's btvacuumpage() function.

--
Peter Geoghegan

Attachments:

v4-0001-Refactor-vacuumlazy.c.patch (application/octet-stream)
From 8b1bc24566a2e732177fbecae849570554a797e1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 13 Mar 2021 20:37:32 -0800
Subject: [PATCH v4 1/3] Refactor vacuumlazy.c.

Break up lazy_scan_heap() into functions.

Aside from being useful cleanup work in its own right, this is also
preparation for an upcoming patch that removes the "tupgone" special
case from vacuumlazy.c.
---
 src/backend/access/heap/vacuumlazy.c  | 1384 +++++++++++++++----------
 contrib/pg_visibility/pg_visibility.c |    8 +-
 contrib/pgstattuple/pgstatapprox.c    |    8 +-
 3 files changed, 832 insertions(+), 568 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8341879d89..6382393516 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -294,8 +294,6 @@ typedef struct LVRelStats
 {
 	char	   *relnamespace;
 	char	   *relname;
-	/* useindex = true means two-pass strategy; false means one-pass */
-	bool		useindex;
 	/* Overall statistics about rel */
 	BlockNumber old_rel_pages;	/* previous value of pg_class.relpages */
 	BlockNumber rel_pages;		/* total number of pages */
@@ -330,9 +328,47 @@ typedef struct LVSavedErrInfo
 	VacErrPhase phase;
 } LVSavedErrInfo;
 
+/*
+ * Counters maintained by lazy_scan_heap() (and scan_prune_page()):
+ */
+typedef struct LVTempCounters
+{
+	double		num_tuples;		/* total number of nonremovable tuples */
+	double		live_tuples;	/* live tuples (reltuples estimate) */
+	double		tups_vacuumed;	/* tuples cleaned up by current vacuum */
+	double		nkeep;			/* dead-but-not-removable tuples */
+	double		nunused;		/* # existing unused line pointers */
+} LVTempCounters;
+
+/*
+ * State output by scan_prune_page():
+ */
+typedef struct LVPrunePageState
+{
+	bool		hastup;			/* Page prevents rel truncation? */
+	bool		has_dead_items; /* includes existing LP_DEAD items */
+	bool		all_visible;	/* Every item visible to all? */
+	bool		all_frozen;		/* provided all_visible is also true */
+} LVPrunePageState;
+
+/*
+ * State set up and maintained in lazy_scan_heap() (also maintained in
+ * scan_prune_page()) that represents VM bit status.
+ *
+ * Used by scan_setvmbit_page() when we're done pruning.
+ */
+typedef struct LVVisMapPageState
+{
+	bool		all_visible_according_to_vm;
+	TransactionId visibility_cutoff_xid;
+} LVVisMapPageState;
+
 /* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
 
+static TransactionId RelFrozenXid;
+static MultiXactId RelMinMxid;
+
 static TransactionId OldestXmin;
 static TransactionId FreezeLimit;
 static MultiXactId MultiXactCutoff;
@@ -344,6 +380,10 @@ static BufferAccessStrategy vac_strategy;
 static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 						   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
 						   bool aggressive);
+static void two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
+							  Relation *Irel, IndexBulkDeleteResult **indstats,
+							  int nindexes, LVParallelState *lps,
+							  VacOptTernaryValue index_cleanup);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
@@ -363,7 +403,8 @@ static bool should_attempt_truncation(VacuumParams *params,
 static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
 static BlockNumber count_nondeletable_pages(Relation onerel,
 											LVRelStats *vacrelstats);
-static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
+static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks,
+							 bool hasindex);
 static void lazy_record_dead_tuple(LVDeadTuples *dead_tuples,
 								   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
@@ -448,10 +489,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
-	/* not every AM requires these to be valid, but heap does */
-	Assert(TransactionIdIsNormal(onerel->rd_rel->relfrozenxid));
-	Assert(MultiXactIdIsValid(onerel->rd_rel->relminmxid));
-
 	/* measure elapsed time iff autovacuum logging requires it */
 	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
 	{
@@ -474,6 +511,13 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 
 	vac_strategy = bstrategy;
 
+	RelFrozenXid = onerel->rd_rel->relfrozenxid;
+	RelMinMxid = onerel->rd_rel->relminmxid;
+
+	/* not every AM requires these to be valid, but heap does */
+	Assert(TransactionIdIsNormal(RelFrozenXid));
+	Assert(MultiXactIdIsValid(RelMinMxid));
+
 	vacuum_set_xid_limits(onerel,
 						  params->freeze_min_age,
 						  params->freeze_table_age,
@@ -509,8 +553,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 
 	/* Open all indexes of the relation */
 	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
-	vacrelstats->useindex = (nindexes > 0 &&
-							 params->index_cleanup == VACOPT_TERNARY_ENABLED);
 
 	/*
 	 * Setup error traceback support for ereport().  The idea is to set up an
@@ -740,6 +782,555 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
 		(void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
 }
 
+/*
+ * Handle new page during lazy_scan_heap().
+ *
+ * Caller must hold pin and buffer cleanup lock on buf.
+ *
+ * All-zeroes pages can be left over if either a backend extends the relation
+ * by a single page, but crashes before the newly initialized page has been
+ * written out, or when bulk-extending the relation (which creates a number of
+ * empty pages at the tail end of the relation, but enters them into the FSM).
+ *
+ * Note we do not enter the page into the visibilitymap. That has the downside
+ * that we repeatedly visit this page in subsequent vacuums, but otherwise
+ * we'll never not discover the space on a promoted standby. The harm of
+ * repeated checking ought to normally not be too bad - the space usually
+ * should be used at some point, otherwise there wouldn't be any regular
+ * vacuums.
+ *
+ * Make sure these pages are in the FSM, to ensure they can be reused. Do that
+ * by testing if there's any space recorded for the page. If not, enter it. We
+ * do so after releasing the lock on the heap page, the FSM is approximate,
+ * after all.
+ */
+static void
+scan_new_page(Relation onerel, Buffer buf)
+{
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+
+	if (GetRecordedFreeSpace(onerel, blkno) == 0)
+	{
+		Size		freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
+
+		UnlockReleaseBuffer(buf);
+		RecordPageWithFreeSpace(onerel, blkno, freespace);
+		return;
+	}
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Handle empty page during lazy_scan_heap().
+ *
+ * Caller must hold pin and buffer cleanup lock on buf, as well as a pin (but
+ * not a lock) on vmbuffer.
+ */
+static void
+scan_empty_page(Relation onerel, Buffer buf, Buffer vmbuffer,
+				LVRelStats *vacrelstats)
+{
+	Page		page = BufferGetPage(buf);
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+	Size		freespace = PageGetHeapFreeSpace(page);
+
+	/*
+	 * Empty pages are always all-visible and all-frozen (note that the same
+	 * is currently not true for new pages, see scan_new_page()).
+	 */
+	if (!PageIsAllVisible(page))
+	{
+		START_CRIT_SECTION();
+
+		/* mark buffer dirty before writing a WAL record */
+		MarkBufferDirty(buf);
+
+		/*
+		 * It's possible that another backend has extended the heap,
+		 * initialized the page, and then failed to WAL-log the page due to an
+		 * ERROR.  Since heap extension is not WAL-logged, recovery might try
+		 * to replay our record setting the page all-visible and find that the
+		 * page isn't initialized, which will cause a PANIC.  To prevent that,
+		 * check whether the page has been previously WAL-logged, and if not,
+		 * do that now.
+		 */
+		if (RelationNeedsWAL(onerel) &&
+			PageGetLSN(page) == InvalidXLogRecPtr)
+			log_newpage_buffer(buf, true);
+
+		PageSetAllVisible(page);
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, InvalidTransactionId,
+						  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
+		END_CRIT_SECTION();
+	}
+
+	UnlockReleaseBuffer(buf);
+	RecordPageWithFreeSpace(onerel, blkno, freespace);
+}
+
+/*
+ *	scan_prune_page() -- lazy_scan_heap() pruning and freezing.
+ *
+ * Caller must hold pin and buffer cleanup lock on the buffer.
+ *
+ * Prior to PostgreSQL 14 there were very rare cases where lazy_scan_heap()
+ * treated tuples that still had storage after pruning as DEAD.  That happened
+ * when heap_page_prune() could not prune tuples that were nevertheless deemed
+ * DEAD by its own HeapTupleSatisfiesVacuum() call.  This created rare hard to
+ * test cases.  It meant that there was no very sharp distinction between DEAD
+ * tuples and tuples that are to be kept and be considered for freezing inside
+ * heap_prepare_freeze_tuple().  It also meant that lazy_vacuum_page() had to
+ * be prepared to remove items with storage (tuples with tuple headers) that
+ * didn't get pruned, which created a special case to handle recovery
+ * conflicts.
+ *
+ * The approach we take here now (to eliminate all of this complexity) is to
+ * simply restart pruning in these very rare cases -- cases where a concurrent
+ * abort of an xact makes our HeapTupleSatisfiesVacuum() call disagrees with
+ * what heap_page_prune() thought about the tuple only microseconds earlier.
+ *
+ * Since we might have to prune a second time here, the code is structured to
+ * use a local per-page copy of the counters that caller accumulates.  We add
+ * our per-page counters to the per-VACUUM totals from caller last of all, to
+ * avoid double counting.
+ */
+static void
+scan_prune_page(Relation onerel, Buffer buf,
+				LVRelStats *vacrelstats,
+				GlobalVisState *vistest, xl_heap_freeze_tuple *frozen,
+				LVTempCounters *c, LVPrunePageState *ps,
+				LVVisMapPageState *vms,
+				VacOptTernaryValue index_cleanup)
+{
+	BlockNumber blkno;
+	Page		page;
+	OffsetNumber offnum,
+				maxoff;
+	int			nfrozen,
+				ndead;
+	LVTempCounters pc;
+	OffsetNumber deaditems[MaxHeapTuplesPerPage];
+	bool		tupgone;
+
+	blkno = BufferGetBlockNumber(buf);
+	page = BufferGetPage(buf);
+
+	/* Initialize (or reset) page-level counters */
+	pc.num_tuples = 0;
+	pc.live_tuples = 0;
+	pc.tups_vacuumed = 0;
+	pc.nkeep = 0;
+	pc.nunused = 0;
+
+	/*
+	 * Prune all HOT-update chains in this page.
+	 *
+	 * We count tuples removed by the pruning step as removed by VACUUM
+	 * (existing LP_DEAD line pointers don't count).
+	 */
+	pc.tups_vacuumed = heap_page_prune(onerel, buf, vistest,
+									   InvalidTransactionId, 0, false,
+									   &vacrelstats->latestRemovedXid,
+									   &vacrelstats->offnum);
+
+	/*
+	 * Now scan the page to collect vacuumable items and check for tuples
+	 * requiring freezing.
+	 */
+	ps->hastup = false;
+	ps->has_dead_items = false;
+	ps->all_visible = true;
+	ps->all_frozen = true;
+	nfrozen = 0;
+	ndead = 0;
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	tupgone = false;
+
+	/*
+	 * Note: If you change anything in the loop below, also look at
+	 * heap_page_is_all_visible to see if that needs to be changed.
+	 */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+		HeapTupleData tuple;
+
+		/*
+		 * Set the offset number so that we can display it along with any
+		 * error that occurred while processing this tuple.
+		 */
+		vacrelstats->offnum = offnum;
+		itemid = PageGetItemId(page, offnum);
+
+		/* Unused items require no processing, but we count 'em */
+		if (!ItemIdIsUsed(itemid))
+		{
+			pc.nunused += 1;
+			continue;
+		}
+
+		/* Redirect items mustn't be touched */
+		if (ItemIdIsRedirected(itemid))
+		{
+			ps->hastup = true;	/* this page won't be truncatable */
+			continue;
+		}
+
+		/*
+		 * LP_DEAD line pointers are to be vacuumed normally; but we don't
+		 * count them in tups_vacuumed, else we'd be double-counting (at least
+		 * in the common case where heap_page_prune() just freed up a non-HOT
+		 * tuple).
+		 *
+		 * Note also that the final tups_vacuumed value might be very low for
+		 * tables where opportunistic page pruning happens to occur very
+		 * frequently (via heap_page_prune_opt() calls that free up non-HOT
+		 * tuples).
+		 */
+		if (ItemIdIsDead(itemid))
+		{
+			deaditems[ndead++] = offnum;
+			ps->all_visible = false;
+			ps->has_dead_items = true;
+			continue;
+		}
+
+		Assert(ItemIdIsNormal(itemid));
+
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationGetRelid(onerel);
+
+		/* Reset for this item */
+		tupgone = false;
+
+		/*
+		 * The criteria for counting a tuple as live in this block need to
+		 * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
+		 * and ANALYZE may produce wildly different reltuples values, e.g.
+		 * when there are many recently-dead tuples.
+		 *
+		 * The logic here is a bit simpler than acquire_sample_rows(), as
+		 * VACUUM can't run inside a transaction block, which makes some cases
+		 * impossible (e.g. in-progress insert from the same transaction).
+		 */
+		switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+		{
+			case HEAPTUPLE_DEAD:
+
+				/*
+				 * Ordinarily, DEAD tuples would have been removed by
+				 * heap_page_prune(), but it's possible that the tuple state
+				 * changed since heap_page_prune() looked.  In particular an
+				 * INSERT_IN_PROGRESS tuple could have changed to DEAD if the
+				 * inserter aborted.  So this cannot be considered an error
+				 * condition.
+				 *
+				 * If the tuple is HOT-updated then it must only be removed by
+				 * a prune operation; so we keep it just as if it were
+				 * RECENTLY_DEAD.  Also, if it's a heap-only tuple, we choose
+				 * to keep it, because it'll be a lot cheaper to get rid of it
+				 * in the next pruning pass than to treat it like an indexed
+				 * tuple. Finally, if index cleanup is disabled, the second
+				 * heap pass will not execute, and the tuple will not get
+				 * removed, so we must treat it like any other dead tuple that
+				 * we choose to keep.
+				 *
+				 * If this were to happen for a tuple that actually needed to
+				 * be deleted, we'd be in trouble, because it'd possibly leave
+				 * a tuple below the relation's xmin horizon alive.
+				 * heap_prepare_freeze_tuple() is prepared to detect that case
+				 * and abort the transaction, preventing corruption.
+				 */
+				if (HeapTupleIsHotUpdated(&tuple) ||
+					HeapTupleIsHeapOnly(&tuple) ||
+					index_cleanup == VACOPT_TERNARY_DISABLED)
+					pc.nkeep += 1;
+				else
+					tupgone = true; /* we can delete the tuple */
+				ps->all_visible = false;
+				break;
+			case HEAPTUPLE_LIVE:
+
+				/*
+				 * Count it as live.  Not only is this natural, but it's also
+				 * what acquire_sample_rows() does.
+				 */
+				pc.live_tuples += 1;
+
+				/*
+				 * Is the tuple definitely visible to all transactions?
+				 *
+				 * NB: Like with per-tuple hint bits, we can't set the
+				 * PD_ALL_VISIBLE flag if the inserter committed
+				 * asynchronously. See SetHintBits for more info. Check that
+				 * the tuple is hinted xmin-committed because of that.
+				 */
+				if (ps->all_visible)
+				{
+					TransactionId xmin;
+
+					if (!HeapTupleHeaderXminCommitted(tuple.t_data))
+					{
+						ps->all_visible = false;
+						break;
+					}
+
+					/*
+					 * The inserter definitely committed. But is it old enough
+					 * that everyone sees it as committed?
+					 */
+					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
+					if (!TransactionIdPrecedes(xmin, OldestXmin))
+					{
+						ps->all_visible = false;
+						break;
+					}
+
+					/* Track newest xmin on page. */
+					if (TransactionIdFollows(xmin, vms->visibility_cutoff_xid))
+						vms->visibility_cutoff_xid = xmin;
+				}
+				break;
+			case HEAPTUPLE_RECENTLY_DEAD:
+
+				/*
+				 * If tuple is recently deleted then we must not remove it
+				 * from relation.
+				 */
+				pc.nkeep += 1;
+				ps->all_visible = false;
+				break;
+			case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+				/*
+				 * This is an expected case during concurrent vacuum.
+				 *
+				 * We do not count these rows as live, because we expect the
+				 * inserting transaction to update the counters at commit, and
+				 * we assume that will happen only after we report our
+				 * results.  This assumption is a bit shaky, but it is what
+				 * acquire_sample_rows() does, so be consistent.
+				 */
+				ps->all_visible = false;
+				break;
+			case HEAPTUPLE_DELETE_IN_PROGRESS:
+				/* This is an expected case during concurrent vacuum */
+				ps->all_visible = false;
+
+				/*
+				 * Count such rows as live.  As above, we assume the deleting
+				 * transaction will commit and update the counters after we
+				 * report.
+				 */
+				pc.live_tuples += 1;
+				break;
+			default:
+				elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+				break;
+		}
+
+		if (tupgone)
+		{
+			deaditems[ndead++] = offnum;
+			HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
+												   &vacrelstats->latestRemovedXid);
+			pc.tups_vacuumed += 1;
+			ps->has_dead_items = true;
+		}
+		else
+		{
+			bool		tuple_totally_frozen;
+
+			pc.num_tuples += 1;
+			ps->hastup = true;
+
+			/*
+			 * Each non-removable tuple must be checked to see if it needs
+			 * freezing.
+			 */
+			if (heap_prepare_freeze_tuple(tuple.t_data,
+										  RelFrozenXid, RelMinMxid,
+										  FreezeLimit, MultiXactCutoff,
+										  &frozen[nfrozen],
+										  &tuple_totally_frozen))
+				frozen[nfrozen++].offset = offnum;
+
+			if (!tuple_totally_frozen)
+				ps->all_frozen = false;
+		}
+	}
+
+	/*
+	 * Success -- we're done pruning, and have determined which tuples are to
+	 * be recorded as dead in local array.  We've also prepared the details of
+	 * which remaining tuples are to be frozen.
+	 *
+	 * First clear the offset information once we have processed all the
+	 * tuples on the page.
+	 */
+	vacrelstats->offnum = InvalidOffsetNumber;
+
+	/*
+	 * Next add page level counters to caller's counts
+	 */
+	c->num_tuples += pc.num_tuples;
+	c->live_tuples += pc.live_tuples;
+	c->tups_vacuumed += pc.tups_vacuumed;
+	c->nkeep += pc.nkeep;
+	c->nunused += pc.nunused;
+
+	/*
+	 * Now save the local dead items array to VACUUM's dead_tuples array.
+	 */
+	for (int i = 0; i < ndead; i++)
+	{
+		ItemPointerData itemptr;
+
+		ItemPointerSet(&itemptr, blkno, deaditems[i]);
+		lazy_record_dead_tuple(vacrelstats->dead_tuples, &itemptr);
+	}
+
+	/*
+	 * Finally, execute tuple freezing as planned.
+	 *
+	 * If we need to freeze any tuples we'll mark the buffer dirty, and write
+	 * a WAL record recording the changes.  We must log the changes to be
+	 * crash-safe against future truncation of CLOG.
+	 */
+	if (nfrozen > 0)
+	{
+		START_CRIT_SECTION();
+
+		MarkBufferDirty(buf);
+
+		/* execute collected freezes */
+		for (int i = 0; i < nfrozen; i++)
+		{
+			ItemId		itemid;
+			HeapTupleHeader htup;
+
+			itemid = PageGetItemId(page, frozen[i].offset);
+			htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+			heap_execute_freeze_tuple(htup, &frozen[i]);
+		}
+
+		/* Now WAL-log freezing if necessary */
+		if (RelationNeedsWAL(onerel))
+		{
+			XLogRecPtr	recptr;
+
+			recptr = log_heap_freeze(onerel, buf, FreezeLimit,
+									 frozen, nfrozen);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+	}
+}
+
+/*
+ * Handle setting VM bit inside lazy_scan_heap(), after pruning and freezing.
+ */
+static void
+scan_setvmbit_page(Relation onerel, Buffer buf, Buffer vmbuffer,
+				   LVPrunePageState *ps, LVVisMapPageState *vms)
+{
+	Page		page = BufferGetPage(buf);
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+
+	/* mark page all-visible, if appropriate */
+	if (ps->all_visible && !vms->all_visible_according_to_vm)
+	{
+		uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
+
+		if (ps->all_frozen)
+			flags |= VISIBILITYMAP_ALL_FROZEN;
+
+		/*
+		 * It should never be the case that the visibility map page is set
+		 * while the page-level bit is clear, but the reverse is allowed (if
+		 * checksums are not enabled).  Regardless, set both bits so that we
+		 * get back in sync.
+		 *
+		 * NB: If the heap page is all-visible but the VM bit is not set, we
+		 * don't need to dirty the heap page.  However, if checksums are
+		 * enabled, we do need to make sure that the heap page is dirtied
+		 * before passing it to visibilitymap_set(), because it may be logged.
+		 * Given that this situation should only happen in rare cases after a
+		 * crash, it is not worth optimizing.
+		 */
+		PageSetAllVisible(page);
+		MarkBufferDirty(buf);
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, vms->visibility_cutoff_xid, flags);
+	}
+
+	/*
+	 * The visibility map bit should never be set if the page-level bit is
+	 * clear.  However, it's possible that the bit got cleared after we
+	 * checked it and before we took the buffer content lock, so we must
+	 * recheck before jumping to the conclusion that something bad has
+	 * happened.
+	 */
+	else if (vms->all_visible_according_to_vm && !PageIsAllVisible(page) &&
+			 VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
+	{
+		elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+			 RelationGetRelationName(onerel), blkno);
+		visibilitymap_clear(onerel, blkno, vmbuffer,
+							VISIBILITYMAP_VALID_BITS);
+	}
+
+	/*
+	 * It's possible for the value returned by
+	 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
+	 * wrong for us to see tuples that appear to not be visible to everyone
+	 * yet, while PD_ALL_VISIBLE is already set. The real safe xmin value
+	 * never moves backwards, but GetOldestNonRemovableTransactionId() is
+	 * conservative and sometimes returns a value that's unnecessarily small,
+	 * so if we see that contradiction it just means that the tuples that we
+	 * think are not visible to everyone yet actually are, and the
+	 * PD_ALL_VISIBLE flag is correct.
+	 *
+	 * There should never be dead tuples on a page with PD_ALL_VISIBLE set,
+	 * however.
+	 */
+	else if (PageIsAllVisible(page) && ps->has_dead_items)
+	{
+		elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+			 RelationGetRelationName(onerel), blkno);
+		PageClearAllVisible(page);
+		MarkBufferDirty(buf);
+		visibilitymap_clear(onerel, blkno, vmbuffer,
+							VISIBILITYMAP_VALID_BITS);
+	}
+
+	/*
+	 * If the all-visible page is all-frozen but not marked as such yet, mark
+	 * it as all-frozen.  Note that all_frozen is only valid if all_visible is
+	 * true, so we must check both.
+	 */
+	else if (vms->all_visible_according_to_vm && ps->all_visible &&
+			 ps->all_frozen && !VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+	{
+		/*
+		 * We can pass InvalidTransactionId as the cutoff XID here, because
+		 * setting the all-frozen bit doesn't cause recovery conflicts.
+		 */
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, InvalidTransactionId,
+						  VISIBILITYMAP_ALL_FROZEN);
+	}
+}
+
 /*
  *	lazy_scan_heap() -- scan an open heap relation
  *
@@ -748,9 +1339,9 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
  *		page, and set commit status bits (see heap_page_prune).  It also builds
  *		lists of dead tuples and pages with free space, calculates statistics
  *		on the number of live tuples in the heap, and marks pages as
- *		all-visible if appropriate.  When done, or when we run low on space for
- *		dead-tuple TIDs, invoke vacuuming of indexes and call lazy_vacuum_heap
- *		to reclaim dead line pointers.
+ *		all-visible if appropriate.  When done, or when we run low on space
+ *		for dead-tuple TIDs, invoke two_pass_strategy to vacuum indexes and
+ *		mark dead line pointers for reuse via a second heap pass.
  *
  *		If the table has at least two indexes, we execute both index vacuum
  *		and index cleanup with parallel workers unless parallel vacuum is
@@ -775,23 +1366,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	LVParallelState *lps = NULL;
 	LVDeadTuples *dead_tuples;
 	BlockNumber nblocks,
-				blkno;
-	HeapTupleData tuple;
-	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
-	TransactionId relminmxid = onerel->rd_rel->relminmxid;
-	BlockNumber empty_pages,
-				vacuumed_pages,
+				blkno,
+				next_unskippable_block,
 				next_fsm_block_to_vacuum;
-	double		num_tuples,		/* total number of nonremovable tuples */
-				live_tuples,	/* live tuples (reltuples estimate) */
-				tups_vacuumed,	/* tuples cleaned up by current vacuum */
-				nkeep,			/* dead-but-not-removable tuples */
-				nunused;		/* # existing unused line pointers */
 	IndexBulkDeleteResult **indstats;
-	int			i;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
-	BlockNumber next_unskippable_block;
 	bool		skipping_blocks;
 	xl_heap_freeze_tuple *frozen;
 	StringInfoData buf;
@@ -802,6 +1382,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	};
 	int64		initprog_val[3];
 	GlobalVisState *vistest;
+	LVTempCounters c;
+
+	/* Counters of # blocks in onerel: */
+	BlockNumber empty_pages,
+				vacuumed_pages;
 
 	pg_rusage_init(&ru0);
 
@@ -817,18 +1402,24 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 						vacrelstats->relname)));
 
 	empty_pages = vacuumed_pages = 0;
-	next_fsm_block_to_vacuum = (BlockNumber) 0;
-	num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
+
+	/* Initialize counters */
+	c.num_tuples = 0;
+	c.live_tuples = 0;
+	c.tups_vacuumed = 0;
+	c.nkeep = 0;
+	c.nunused = 0;
 
 	indstats = (IndexBulkDeleteResult **)
 		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
 
 	nblocks = RelationGetNumberOfBlocks(onerel);
+	next_unskippable_block = 0;
+	next_fsm_block_to_vacuum = 0;
 	vacrelstats->rel_pages = nblocks;
 	vacrelstats->scanned_pages = 0;
 	vacrelstats->tupcount_pages = 0;
 	vacrelstats->nonempty_pages = 0;
-	vacrelstats->latestRemovedXid = InvalidTransactionId;
 
 	vistest = GlobalVisTestFor(onerel);
 
@@ -837,7 +1428,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * be used for an index, so we invoke parallelism only if there are at
 	 * least two indexes on a table.
 	 */
-	if (params->nworkers >= 0 && vacrelstats->useindex && nindexes > 1)
+	if (params->nworkers >= 0 && nindexes > 1)
 	{
 		/*
 		 * Since parallel workers cannot access data in temporary tables, we
@@ -865,7 +1456,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * initialized.
 	 */
 	if (!ParallelVacuumIsActive(lps))
-		lazy_space_alloc(vacrelstats, nblocks);
+		lazy_space_alloc(vacrelstats, nblocks, nindexes > 0);
 
 	dead_tuples = vacrelstats->dead_tuples;
 	frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
@@ -920,7 +1511,6 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * the last page.  This is worth avoiding mainly because such a lock must
 	 * be replayed on any hot standby, where it can be disruptive.
 	 */
-	next_unskippable_block = 0;
 	if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
 	{
 		while (next_unskippable_block < nblocks)
@@ -953,20 +1543,22 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	{
 		Buffer		buf;
 		Page		page;
-		OffsetNumber offnum,
-					maxoff;
-		bool		tupgone,
-					hastup;
-		int			prev_dead_count;
-		int			nfrozen;
+		LVVisMapPageState vms;
+		LVPrunePageState ps;
+		bool		savefreespace;
 		Size		freespace;
-		bool		all_visible_according_to_vm = false;
-		bool		all_visible;
-		bool		all_frozen = true;	/* provided all_visible is also true */
-		bool		has_dead_items;		/* includes existing LP_DEAD items */
-		TransactionId visibility_cutoff_xid = InvalidTransactionId;
 
-		/* see note above about forcing scanning of last page */
+		/* Initialize vm state for block: */
+		vms.all_visible_according_to_vm = false;
+		vms.visibility_cutoff_xid = InvalidTransactionId;
+
+		/* Note: Can't touch ps until we reach scan_prune_page() */
+
+		/*
+		 * Step 1 for block: Consider need to skip blocks.
+		 *
+		 * See note above about forcing scanning of last page.
+		 */
 #define FORCE_CHECK_PAGE() \
 		(blkno == nblocks - 1 && should_attempt_truncation(params, vacrelstats))
 
@@ -1018,7 +1610,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * that it's not all-frozen, so it might still be all-visible.
 			 */
 			if (aggressive && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
-				all_visible_according_to_vm = true;
+				vms.all_visible_according_to_vm = true;
 		}
 		else
 		{
@@ -1045,12 +1637,15 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 					vacrelstats->frozenskipped_pages++;
 				continue;
 			}
-			all_visible_according_to_vm = true;
+			vms.all_visible_according_to_vm = true;
 		}
 
 		vacuum_delay_point();
 
 		/*
+		 * Step 2 for block: Consider whether we still have enough space to
+		 * store this page's dead-tuple TIDs.
+		 *
 		 * If we are close to overrunning the available space for dead-tuple
 		 * TIDs, pause and do a cycle of vacuuming before we tackle this page.
 		 */
@@ -1069,23 +1664,16 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				vmbuffer = InvalidBuffer;
 			}
 
-			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(onerel, Irel, indstats,
-									vacrelstats, lps, nindexes);
-
-			/* Remove tuples from heap */
-			lazy_vacuum_heap(onerel, vacrelstats);
-
-			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
-			 */
-			dead_tuples->num_tuples = 0;
+			/* Remove the collected garbage tuples from table and indexes */
+			two_pass_strategy(onerel, vacrelstats, Irel, indstats, nindexes,
+							  lps, params->index_cleanup);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
 			 * upper-level FSM pages.  Note we have not yet processed blkno.
+			 * Even if we skipped heap vacuuming, FSM vacuuming could still be
+			 * worthwhile, since we could have updated the free space of empty
+			 * pages.
 			 */
 			FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
 			next_fsm_block_to_vacuum = blkno;
@@ -1096,22 +1684,29 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		}
 
 		/*
+		 * Step 3 for block: Set up visibility map page as needed.
+		 *
 		 * Pin the visibility map page in case we need to mark the page
 		 * all-visible.  In most cases this will be very cheap, because we'll
 		 * already have the correct page pinned anyway.  However, it's
 		 * possible that (a) next_unskippable_block is covered by a different
 		 * VM page than the current block or (b) we released our pin and did a
 		 * cycle of index vacuuming.
-		 *
 		 */
 		visibilitymap_pin(onerel, blkno, &vmbuffer);
 
 		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
 								 RBM_NORMAL, vac_strategy);
 
-		/* We need buffer cleanup lock so that we can prune HOT chains. */
+		/*
+		 * Step 4 for block: Acquire super-exclusive lock for pruning.
+		 *
+		 * We need buffer cleanup lock so that we can prune HOT chains.
+		 */
 		if (!ConditionalLockBufferForCleanup(buf))
 		{
+			bool		hastup;
+
 			/*
 			 * If we're not performing an aggressive scan to guard against XID
 			 * wraparound, and we don't want to forcibly check the page, then
@@ -1168,6 +1763,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			/* drop through to normal processing */
 		}
 
+		/*
+		 * Step 5 for block: Handle empty/new pages.
+		 *
+		 * By here we have a super-exclusive lock, and it's clear that this
+		 * page is one that we consider scanned.
+		 */
 		vacrelstats->scanned_pages++;
 		vacrelstats->tupcount_pages++;
 
@@ -1175,399 +1776,84 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		if (PageIsNew(page))
 		{
-			/*
-			 * All-zeroes pages can be left over if either a backend extends
-			 * the relation by a single page, but crashes before the newly
-			 * initialized page has been written out, or when bulk-extending
-			 * the relation (which creates a number of empty pages at the tail
-			 * end of the relation, but enters them into the FSM).
-			 *
-			 * Note we do not enter the page into the visibilitymap. That has
-			 * the downside that we repeatedly visit this page in subsequent
-			 * vacuums, but otherwise we'll never not discover the space on a
-			 * promoted standby. The harm of repeated checking ought to
-			 * normally not be too bad - the space usually should be used at
-			 * some point, otherwise there wouldn't be any regular vacuums.
-			 *
-			 * Make sure these pages are in the FSM, to ensure they can be
-			 * reused. Do that by testing if there's any space recorded for
-			 * the page. If not, enter it. We do so after releasing the lock
-			 * on the heap page, the FSM is approximate, after all.
-			 */
-			UnlockReleaseBuffer(buf);
-
 			empty_pages++;
-
-			if (GetRecordedFreeSpace(onerel, blkno) == 0)
-			{
-				Size		freespace;
-
-				freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
-				RecordPageWithFreeSpace(onerel, blkno, freespace);
-			}
+			/* Releases lock on buf for us: */
+			scan_new_page(onerel, buf);
 			continue;
 		}
-
-		if (PageIsEmpty(page))
+		else if (PageIsEmpty(page))
 		{
 			empty_pages++;
-			freespace = PageGetHeapFreeSpace(page);
-
-			/*
-			 * Empty pages are always all-visible and all-frozen (note that
-			 * the same is currently not true for new pages, see above).
-			 */
-			if (!PageIsAllVisible(page))
-			{
-				START_CRIT_SECTION();
-
-				/* mark buffer dirty before writing a WAL record */
-				MarkBufferDirty(buf);
-
-				/*
-				 * It's possible that another backend has extended the heap,
-				 * initialized the page, and then failed to WAL-log the page
-				 * due to an ERROR.  Since heap extension is not WAL-logged,
-				 * recovery might try to replay our record setting the page
-				 * all-visible and find that the page isn't initialized, which
-				 * will cause a PANIC.  To prevent that, check whether the
-				 * page has been previously WAL-logged, and if not, do that
-				 * now.
-				 */
-				if (RelationNeedsWAL(onerel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
-					log_newpage_buffer(buf, true);
-
-				PageSetAllVisible(page);
-				visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
-								  vmbuffer, InvalidTransactionId,
-								  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
-				END_CRIT_SECTION();
-			}
-
-			UnlockReleaseBuffer(buf);
-			RecordPageWithFreeSpace(onerel, blkno, freespace);
+			/* Releases lock on buf for us (though keeps vmbuffer pin): */
+			scan_empty_page(onerel, buf, vmbuffer, vacrelstats);
 			continue;
 		}
 
 		/*
-		 * Prune all HOT-update chains in this page.
+		 * Step 6 for block: Do pruning.
 		 *
-		 * We count tuples removed by the pruning step as removed by VACUUM
-		 * (existing LP_DEAD line pointers don't count).
+		 * Also accumulates details of remaining LP_DEAD line pointers on page
+		 * in dead tuple list.  This includes LP_DEAD line pointers that we
+		 * ourselves just pruned, as well as existing LP_DEAD line pointers
+		 * pruned earlier.
+		 *
+		 * Also handles tuple freezing -- considers freezing XIDs from all
+		 * tuple headers left behind following pruning.
 		 */
-		tups_vacuumed += heap_page_prune(onerel, buf, vistest,
-										 InvalidTransactionId, 0, false,
-										 &vacrelstats->latestRemovedXid,
-										 &vacrelstats->offnum);
+		scan_prune_page(onerel, buf, vacrelstats, vistest, frozen,
+						&c, &ps, &vms, params->index_cleanup);
 
 		/*
-		 * Now scan the page to collect vacuumable items and check for tuples
-		 * requiring freezing.
+		 * Step 7 for block: Set up details for saving free space in FSM at
+		 * end of loop.  (Also performs extra single pass strategy steps in
+		 * "nindexes == 0" case.)
+		 *
+		 * If we have any LP_DEAD items on this page (i.e. any new dead_tuples
+		 * entries compared to just before scan_prune_page()) then the page
+		 * will be visited again by lazy_vacuum_heap(), which will compute and
+		 * record its post-compaction free space.  If not, then we're done
+		 * with this page, so remember its free space as-is.
 		 */
-		all_visible = true;
-		has_dead_items = false;
-		nfrozen = 0;
-		hastup = false;
-		prev_dead_count = dead_tuples->num_tuples;
-		maxoff = PageGetMaxOffsetNumber(page);
-
-		/*
-		 * Note: If you change anything in the loop below, also look at
-		 * heap_page_is_all_visible to see if that needs to be changed.
-		 */
-		for (offnum = FirstOffsetNumber;
-			 offnum <= maxoff;
-			 offnum = OffsetNumberNext(offnum))
+		savefreespace = false;
+		freespace = 0;
+		if (nindexes > 0 && ps.has_dead_items &&
+			params->index_cleanup != VACOPT_TERNARY_DISABLED)
 		{
-			ItemId		itemid;
-
-			/*
-			 * Set the offset number so that we can display it along with any
-			 * error that occurred while processing this tuple.
-			 */
-			vacrelstats->offnum = offnum;
-			itemid = PageGetItemId(page, offnum);
-
-			/* Unused items require no processing, but we count 'em */
-			if (!ItemIdIsUsed(itemid))
-			{
-				nunused += 1;
-				continue;
-			}
-
-			/* Redirect items mustn't be touched */
-			if (ItemIdIsRedirected(itemid))
-			{
-				hastup = true;	/* this page won't be truncatable */
-				continue;
-			}
-
-			ItemPointerSet(&(tuple.t_self), blkno, offnum);
-
-			/*
-			 * LP_DEAD line pointers are to be vacuumed normally; but we don't
-			 * count them in tups_vacuumed, else we'd be double-counting (at
-			 * least in the common case where heap_page_prune() just freed up
-			 * a non-HOT tuple).  Note also that the final tups_vacuumed value
-			 * might be very low for tables where opportunistic page pruning
-			 * happens to occur very frequently (via heap_page_prune_opt()
-			 * calls that free up non-HOT tuples).
-			 */
-			if (ItemIdIsDead(itemid))
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				all_visible = false;
-				has_dead_items = true;
-				continue;
-			}
-
-			Assert(ItemIdIsNormal(itemid));
-
-			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
-			tuple.t_len = ItemIdGetLength(itemid);
-			tuple.t_tableOid = RelationGetRelid(onerel);
-
-			tupgone = false;
-
-			/*
-			 * The criteria for counting a tuple as live in this block need to
-			 * match what analyze.c's acquire_sample_rows() does, otherwise
-			 * VACUUM and ANALYZE may produce wildly different reltuples
-			 * values, e.g. when there are many recently-dead tuples.
-			 *
-			 * The logic here is a bit simpler than acquire_sample_rows(), as
-			 * VACUUM can't run inside a transaction block, which makes some
-			 * cases impossible (e.g. in-progress insert from the same
-			 * transaction).
-			 */
-			switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
-			{
-				case HEAPTUPLE_DEAD:
-
-					/*
-					 * Ordinarily, DEAD tuples would have been removed by
-					 * heap_page_prune(), but it's possible that the tuple
-					 * state changed since heap_page_prune() looked.  In
-					 * particular an INSERT_IN_PROGRESS tuple could have
-					 * changed to DEAD if the inserter aborted.  So this
-					 * cannot be considered an error condition.
-					 *
-					 * If the tuple is HOT-updated then it must only be
-					 * removed by a prune operation; so we keep it just as if
-					 * it were RECENTLY_DEAD.  Also, if it's a heap-only
-					 * tuple, we choose to keep it, because it'll be a lot
-					 * cheaper to get rid of it in the next pruning pass than
-					 * to treat it like an indexed tuple. Finally, if index
-					 * cleanup is disabled, the second heap pass will not
-					 * execute, and the tuple will not get removed, so we must
-					 * treat it like any other dead tuple that we choose to
-					 * keep.
-					 *
-					 * If this were to happen for a tuple that actually needed
-					 * to be deleted, we'd be in trouble, because it'd
-					 * possibly leave a tuple below the relation's xmin
-					 * horizon alive.  heap_prepare_freeze_tuple() is prepared
-					 * to detect that case and abort the transaction,
-					 * preventing corruption.
-					 */
-					if (HeapTupleIsHotUpdated(&tuple) ||
-						HeapTupleIsHeapOnly(&tuple) ||
-						params->index_cleanup == VACOPT_TERNARY_DISABLED)
-						nkeep += 1;
-					else
-						tupgone = true; /* we can delete the tuple */
-					all_visible = false;
-					break;
-				case HEAPTUPLE_LIVE:
-
-					/*
-					 * Count it as live.  Not only is this natural, but it's
-					 * also what acquire_sample_rows() does.
-					 */
-					live_tuples += 1;
-
-					/*
-					 * Is the tuple definitely visible to all transactions?
-					 *
-					 * NB: Like with per-tuple hint bits, we can't set the
-					 * PD_ALL_VISIBLE flag if the inserter committed
-					 * asynchronously. See SetHintBits for more info. Check
-					 * that the tuple is hinted xmin-committed because of
-					 * that.
-					 */
-					if (all_visible)
-					{
-						TransactionId xmin;
-
-						if (!HeapTupleHeaderXminCommitted(tuple.t_data))
-						{
-							all_visible = false;
-							break;
-						}
-
-						/*
-						 * The inserter definitely committed. But is it old
-						 * enough that everyone sees it as committed?
-						 */
-						xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-						if (!TransactionIdPrecedes(xmin, OldestXmin))
-						{
-							all_visible = false;
-							break;
-						}
-
-						/* Track newest xmin on page. */
-						if (TransactionIdFollows(xmin, visibility_cutoff_xid))
-							visibility_cutoff_xid = xmin;
-					}
-					break;
-				case HEAPTUPLE_RECENTLY_DEAD:
-
-					/*
-					 * If tuple is recently deleted then we must not remove it
-					 * from relation.
-					 */
-					nkeep += 1;
-					all_visible = false;
-					break;
-				case HEAPTUPLE_INSERT_IN_PROGRESS:
-
-					/*
-					 * This is an expected case during concurrent vacuum.
-					 *
-					 * We do not count these rows as live, because we expect
-					 * the inserting transaction to update the counters at
-					 * commit, and we assume that will happen only after we
-					 * report our results.  This assumption is a bit shaky,
-					 * but it is what acquire_sample_rows() does, so be
-					 * consistent.
-					 */
-					all_visible = false;
-					break;
-				case HEAPTUPLE_DELETE_IN_PROGRESS:
-					/* This is an expected case during concurrent vacuum */
-					all_visible = false;
-
-					/*
-					 * Count such rows as live.  As above, we assume the
-					 * deleting transaction will commit and update the
-					 * counters after we report.
-					 */
-					live_tuples += 1;
-					break;
-				default:
-					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
-					break;
-			}
-
-			if (tupgone)
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-													   &vacrelstats->latestRemovedXid);
-				tups_vacuumed += 1;
-				has_dead_items = true;
-			}
-			else
-			{
-				bool		tuple_totally_frozen;
-
-				num_tuples += 1;
-				hastup = true;
-
-				/*
-				 * Each non-removable tuple must be checked to see if it needs
-				 * freezing.  Note we already have exclusive buffer lock.
-				 */
-				if (heap_prepare_freeze_tuple(tuple.t_data,
-											  relfrozenxid, relminmxid,
-											  FreezeLimit, MultiXactCutoff,
-											  &frozen[nfrozen],
-											  &tuple_totally_frozen))
-					frozen[nfrozen++].offset = offnum;
-
-				if (!tuple_totally_frozen)
-					all_frozen = false;
-			}
-		}						/* scan along page */
-
-		/*
-		 * Clear the offset information once we have processed all the tuples
-		 * on the page.
-		 */
-		vacrelstats->offnum = InvalidOffsetNumber;
-
-		/*
-		 * If we froze any tuples, mark the buffer dirty, and write a WAL
-		 * record recording the changes.  We must log the changes to be
-		 * crash-safe against future truncation of CLOG.
-		 */
-		if (nfrozen > 0)
+			/* Wait until lazy_vacuum_heap() to save free space */
+		}
+		else
 		{
-			START_CRIT_SECTION();
-
-			MarkBufferDirty(buf);
-
-			/* execute collected freezes */
-			for (i = 0; i < nfrozen; i++)
-			{
-				ItemId		itemid;
-				HeapTupleHeader htup;
-
-				itemid = PageGetItemId(page, frozen[i].offset);
-				htup = (HeapTupleHeader) PageGetItem(page, itemid);
-
-				heap_execute_freeze_tuple(htup, &frozen[i]);
-			}
-
-			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
-			{
-				XLogRecPtr	recptr;
-
-				recptr = log_heap_freeze(onerel, buf, FreezeLimit,
-										 frozen, nfrozen);
-				PageSetLSN(page, recptr);
-			}
-
-			END_CRIT_SECTION();
+			/*
+			 * Will never reach lazy_vacuum_heap() (or will, but won't reach
+			 * this specific page)
+			 */
+			savefreespace = true;
+			freespace = PageGetHeapFreeSpace(page);
 		}
 
-		/*
-		 * If there are no indexes we can vacuum the page right now instead of
-		 * doing a second scan. Also we don't do that but forget dead tuples
-		 * when index cleanup is disabled.
-		 */
-		if (!vacrelstats->useindex && dead_tuples->num_tuples > 0)
+		if (nindexes == 0 && ps.has_dead_items)
 		{
-			if (nindexes == 0)
-			{
-				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
-				vacuumed_pages++;
-				has_dead_items = false;
-			}
-			else
-			{
-				/*
-				 * Here, we have indexes but index cleanup is disabled.
-				 * Instead of vacuuming the dead tuples on the heap, we just
-				 * forget them.
-				 *
-				 * Note that vacrelstats->dead_tuples could have tuples which
-				 * became dead after HOT-pruning but are not marked dead yet.
-				 * We do not process them because it's a very rare condition,
-				 * and the next vacuum will process them anyway.
-				 */
-				Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
-			}
+			Assert(dead_tuples->num_tuples > 0);
 
 			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
+			 * One pass strategy (no indexes) case.
+			 *
+			 * Mark LP_DEAD item pointers for LP_UNUSED now, since there won't
+			 * be a second pass in lazy_vacuum_heap().
 			 */
+			lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+			vacuumed_pages++;
+
+			/* This won't have changed: */
+			Assert(savefreespace && freespace == PageGetHeapFreeSpace(page));
+
+			/*
+			 * Make sure scan_setvmbit_page() won't stop setting VM due to
+			 * now-vacuumed LP_DEAD items:
+			 */
+			ps.has_dead_items = false;
+
+			/* Forget the now-vacuumed tuples */
 			dead_tuples->num_tuples = 0;
 
 			/*
@@ -1584,109 +1870,27 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 		}
 
-		freespace = PageGetHeapFreeSpace(page);
-
-		/* mark page all-visible, if appropriate */
-		if (all_visible && !all_visible_according_to_vm)
-		{
-			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
-
-			if (all_frozen)
-				flags |= VISIBILITYMAP_ALL_FROZEN;
-
-			/*
-			 * It should never be the case that the visibility map page is set
-			 * while the page-level bit is clear, but the reverse is allowed
-			 * (if checksums are not enabled).  Regardless, set both bits so
-			 * that we get back in sync.
-			 *
-			 * NB: If the heap page is all-visible but the VM bit is not set,
-			 * we don't need to dirty the heap page.  However, if checksums
-			 * are enabled, we do need to make sure that the heap page is
-			 * dirtied before passing it to visibilitymap_set(), because it
-			 * may be logged.  Given that this situation should only happen in
-			 * rare cases after a crash, it is not worth optimizing.
-			 */
-			PageSetAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, visibility_cutoff_xid, flags);
-		}
+		/* One pass strategy had better have no dead tuples by now: */
+		Assert(nindexes > 0 || dead_tuples->num_tuples == 0);
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after we checked it and before we took the buffer
-		 * content lock, so we must recheck before jumping to the conclusion
-		 * that something bad has happened.
+		 * Step 8 for block: Handle setting visibility map bit as appropriate
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
-		{
-			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
-				 vacrelstats->relname, blkno);
-			visibilitymap_clear(onerel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
+		scan_setvmbit_page(onerel, buf, vmbuffer, &ps, &vms);
 
 		/*
-		 * It's possible for the value returned by
-		 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
-		 * wrong for us to see tuples that appear to not be visible to
-		 * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
-		 * xmin value never moves backwards, but
-		 * GetOldestNonRemovableTransactionId() is conservative and sometimes
-		 * returns a value that's unnecessarily small, so if we see that
-		 * contradiction it just means that the tuples that we think are not
-		 * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
-		 * is correct.
-		 *
-		 * There should never be dead tuples on a page with PD_ALL_VISIBLE
-		 * set, however.
+		 * Step 9 for block: Drop super-exclusive lock, and finalize the page
+		 * by recording its free space in the FSM as appropriate.
 		 */
-		else if (PageIsAllVisible(page) && has_dead_items)
-		{
-			elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
-				 vacrelstats->relname, blkno);
-			PageClearAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_clear(onerel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
-
-		/*
-		 * If the all-visible page is all-frozen but not marked as such yet,
-		 * mark it as all-frozen.  Note that all_frozen is only valid if
-		 * all_visible is true, so we must check both.
-		 */
-		else if (all_visible_according_to_vm && all_visible && all_frozen &&
-				 !VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
-		{
-			/*
-			 * We can pass InvalidTransactionId as the cutoff XID here,
-			 * because setting the all-frozen bit doesn't cause recovery
-			 * conflicts.
-			 */
-			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, InvalidTransactionId,
-							  VISIBILITYMAP_ALL_FROZEN);
-		}
 
 		UnlockReleaseBuffer(buf);
-
 		/* Remember the location of the last page with nonremovable tuples */
-		if (hastup)
+		if (ps.hastup)
 			vacrelstats->nonempty_pages = blkno + 1;
-
-		/*
-		 * If we remembered any tuples for deletion, then the page will be
-		 * visited again by lazy_vacuum_heap, which will compute and record
-		 * its post-compaction free space.  If not, then we're done with this
-		 * page, so remember its free space as-is.  (This path will always be
-		 * taken if there are no indexes.)
-		 */
-		if (dead_tuples->num_tuples == prev_dead_count)
+		if (savefreespace)
 			RecordPageWithFreeSpace(onerel, blkno, freespace);
+
+		/* Finished all steps for block by here (at the latest) */
 	}
 
 	/* report that everything is scanned and vacuumed */
@@ -1698,14 +1902,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	pfree(frozen);
 
 	/* save stats for use later */
-	vacrelstats->tuples_deleted = tups_vacuumed;
-	vacrelstats->new_dead_tuples = nkeep;
+	vacrelstats->tuples_deleted = c.tups_vacuumed;
+	vacrelstats->new_dead_tuples = c.nkeep;
 
 	/* now we can compute the new value for pg_class.reltuples */
 	vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
 														  nblocks,
 														  vacrelstats->tupcount_pages,
-														  live_tuples);
+														  c.live_tuples);
 
 	/*
 	 * Also compute the total number of surviving heap entries.  In the
@@ -1724,20 +1928,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	}
 
 	/* If any tuples need to be deleted, perform final vacuum cycle */
-	/* XXX put a threshold on min number of tuples here? */
+	Assert(nindexes > 0 || dead_tuples->num_tuples == 0);
 	if (dead_tuples->num_tuples > 0)
-	{
-		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats,
-								lps, nindexes);
-
-		/* Remove tuples from heap */
-		lazy_vacuum_heap(onerel, vacrelstats);
-	}
+		two_pass_strategy(onerel, vacrelstats, Irel, indstats, nindexes,
+						  lps, params->index_cleanup);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
-	 * not there were indexes.
+	 * not there were indexes, and whether or not we skipped index vacuuming.
 	 */
 	if (blkno > next_fsm_block_to_vacuum)
 		FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
@@ -1745,8 +1943,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/* report all blocks vacuumed */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
-	/* Do post-vacuum cleanup */
-	if (vacrelstats->useindex)
+	/*
+	 * Do post-vacuum cleanup.
+	 *
+	 * Note that post-vacuum cleanup does not take place with
+	 * INDEX_CLEANUP=OFF.
+	 */
+	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
 		lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, nindexes);
 
 	/*
@@ -1756,23 +1959,29 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	if (ParallelVacuumIsActive(lps))
 		end_parallel_vacuum(indstats, lps, nindexes);
 
-	/* Update index statistics */
-	if (vacrelstats->useindex)
+	/*
+	 * Update index statistics.
+	 *
+	 * Note that updating the statistics does not take place with
+	 * INDEX_CLEANUP=OFF.
+	 */
+	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
 		update_index_statistics(Irel, indstats, nindexes);
 
-	/* If no indexes, make log report that lazy_vacuum_heap would've made */
-	if (vacuumed_pages)
+	/* If no indexes, make log report that two_pass_strategy() would've made */
+	Assert(nindexes == 0 || vacuumed_pages == 0);
+	if (nindexes == 0)
 		ereport(elevel,
 				(errmsg("\"%s\": removed %.0f row versions in %u pages",
 						vacrelstats->relname,
-						tups_vacuumed, vacuumed_pages)));
+						vacrelstats->tuples_deleted, vacuumed_pages)));
 
 	initStringInfo(&buf);
 	appendStringInfo(&buf,
 					 _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"),
-					 nkeep, OldestXmin);
+					 c.nkeep, OldestXmin);
 	appendStringInfo(&buf, _("There were %.0f unused item identifiers.\n"),
-					 nunused);
+					 c.nunused);
 	appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
 									"Skipped %u pages due to buffer pins, ",
 									vacrelstats->pinskipped_pages),
@@ -1788,18 +1997,76 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
 
 	ereport(elevel,
-			(errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
+			(errmsg("\"%s\": newly pruned %.0f items, found %.0f nonremovable items in %u out of %u pages",
 					vacrelstats->relname,
-					tups_vacuumed, num_tuples,
+					c.tups_vacuumed, c.num_tuples,
 					vacrelstats->scanned_pages, nblocks),
 			 errdetail_internal("%s", buf.data)));
 	pfree(buf.data);
 }
 
 /*
- *	lazy_vacuum_all_indexes() -- vacuum all indexes of relation.
+ * Remove the collected garbage tuples from the table and its indexes.
  *
- * We process the indexes serially unless we are doing parallel vacuum.
+ * We may be required to skip index vacuuming by the INDEX_CLEANUP reloption.
+ */
+static void
+two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
+				  Relation *Irel, IndexBulkDeleteResult **indstats, int nindexes,
+				  LVParallelState *lps, VacOptTernaryValue index_cleanup)
+{
+	bool		skipping;
+
+	/* Should not end up here with no indexes */
+	Assert(nindexes > 0);
+	Assert(!IsParallelWorker());
+
+	/* Check whether or not to do index vacuum and heap vacuum */
+	if (index_cleanup == VACOPT_TERNARY_DISABLED)
+		skipping = true;
+	else
+		skipping = false;
+
+	if (!skipping)
+	{
+		/* Okay, we're going to do index vacuuming */
+		lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats, lps,
+								nindexes);
+
+		/* Remove tuples from heap */
+		lazy_vacuum_heap(onerel, vacrelstats);
+	}
+	else
+	{
+		/*
+		 * skipped index vacuuming.  Make log report that lazy_vacuum_heap
+		 * We skipped index vacuuming.  Make the log report that
+		 * lazy_vacuum_heap() would've made.
+		 *
+		 * Don't report tups_vacuumed here because it will be zero in the
+		 * common case where there are no newly pruned LP_DEAD items for this
+		 * VACUUM.  This is roughly consistent with lazy_vacuum_heap(), and
+		 * with the similar !useindex ereport() at the end of
+		 * lazy_scan_heap().  Note, however, that the count reported below is
+		 * the number of LP_DEAD items accumulated so far (whether pruned by
+		 * us or by an earlier pruning operation), not the number of blocks
+		 * scanned.
+		ereport(elevel,
+				(errmsg("\"%s\": INDEX_CLEANUP off forced VACUUM to not totally remove %d pruned items",
+						vacrelstats->relname,
+						vacrelstats->dead_tuples->num_tuples)));
+	}
+
+	/*
+	 * Forget the now-vacuumed tuples, and press on, but be careful not to
+	 * reset latestRemovedXid since we want that value to be valid.
+	 */
+	vacrelstats->dead_tuples->num_tuples = 0;
+}
+
+/*
+ *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
+ *
+ * Should only be called through two_pass_strategy()
  */
 static void
 lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
@@ -1848,17 +2115,14 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 								 vacrelstats->num_index_scans);
 }
 
-
 /*
- *	lazy_vacuum_heap() -- second pass over the heap
+ *	lazy_vacuum_heap() -- second pass over the heap for two pass strategy
  *
  *		This routine marks dead tuples as unused and compacts out free
  *		space on their pages.  Pages not having dead tuples recorded from
  *		lazy_scan_heap are not visited at all.
  *
- * Note: the reason for doing this as a second pass is we cannot remove
- * the tuples until we've removed their index entries, and we want to
- * process index entry removal in batches as large as possible.
+ * Should only be called through two_pass_strategy()
  */
 static void
 lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
@@ -2867,14 +3131,14 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
  * Return the maximum number of dead tuples we can record.
  */
 static long
-compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
+compute_max_dead_tuples(BlockNumber relblocks, bool hasindex)
 {
 	long		maxtuples;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
 	autovacuum_work_mem != -1 ?
 	autovacuum_work_mem : maintenance_work_mem;
 
-	if (useindex)
+	if (hasindex)
 	{
 		maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
 		maxtuples = Min(maxtuples, INT_MAX);
@@ -2899,12 +3163,12 @@ compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
  * See the comments at the head of this file for rationale.
  */
 static void
-lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
+lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks, bool hasindex)
 {
 	LVDeadTuples *dead_tuples = NULL;
 	long		maxtuples;
 
-	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->useindex);
+	maxtuples = compute_max_dead_tuples(relblocks, hasindex);
 
 	dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
 	dead_tuples->num_tuples = 0;
@@ -3024,7 +3288,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 
 	/*
 	 * This is a stripped down version of the line pointer scan in
-	 * lazy_scan_heap(). So if you change anything here, also check that code.
+	 * scan_prune_page(). So if you change anything here, also check that code.
 	 */
 	maxoff = PageGetMaxOffsetNumber(page);
 	for (offnum = FirstOffsetNumber;
@@ -3070,7 +3334,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 				{
 					TransactionId xmin;
 
-					/* Check comments in lazy_scan_heap. */
+					/* Check comments in scan_prune_page(). */
 					if (!HeapTupleHeaderXminCommitted(tuple.t_data))
 					{
 						all_visible = false;
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index dd0c124e62..3ac8df7d07 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -756,10 +756,10 @@ tuple_all_visible(HeapTuple tup, TransactionId OldestXmin, Buffer buffer)
 		return false;			/* all-visible implies live */
 
 	/*
-	 * Neither lazy_scan_heap nor heap_page_is_all_visible will mark a page
-	 * all-visible unless every tuple is hinted committed. However, those hint
-	 * bits could be lost after a crash, so we can't be certain that they'll
-	 * be set here.  So just check the xmin.
+	 * Neither lazy_scan_heap/scan_prune_page nor heap_page_is_all_visible will
+	 * mark a page all-visible unless every tuple is hinted committed.
+	 * However, those hint bits could be lost after a crash, so we can't be
+	 * certain that they'll be set here.  So just check the xmin.
 	 */
 
 	xmin = HeapTupleHeaderGetXmin(tup->t_data);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 1fe193bb25..34670c6264 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -58,8 +58,8 @@ typedef struct output_type
  * and approximate tuple_len on that basis. For the others, we count
  * the exact number of dead tuples etc.
  *
- * This scan is loosely based on vacuumlazy.c:lazy_scan_heap(), but
- * we do not try to avoid skipping single pages.
+ * This scan is loosely based on vacuumlazy.c:lazy_scan_heap/scan_prune_page,
+ * but we do not try to avoid skipping single pages.
  */
 static void
 statapprox_heap(Relation rel, output_type *stat)
@@ -126,8 +126,8 @@ statapprox_heap(Relation rel, output_type *stat)
 
 		/*
 		 * Look at each tuple on the page and decide whether it's live or
-		 * dead, then count it and its size. Unlike lazy_scan_heap, we can
-		 * afford to ignore problems and special cases.
+		 * dead, then count it and its size. Unlike lazy_scan_heap and
+		 * scan_prune_page, we can afford to ignore problems and special cases.
 		 */
 		maxoff = PageGetMaxOffsetNumber(page);
 
-- 
2.27.0
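
To summarize the control flow that the refactoring above aims for, here is a
minimal, self-contained toy sketch (plain C, not PostgreSQL code and not part
of either patch; every name in it is invented for illustration).  With no
indexes, a page's LP_DEAD items are reclaimed on the spot; otherwise their
TIDs are accumulated and a two_pass_strategy()-style flush runs whenever the
dead-TID array fills up, plus once more at the end of the scan.  With
INDEX_CLEANUP off, the flush degrades to simply forgetting the collected TIDs.

#include <stdbool.h>
#include <stdio.h>

#define MAX_DEAD_TIDS 4			/* tiny on purpose, to force mid-scan flushes */

static int	dead_tids[MAX_DEAD_TIDS];
static int	ndead_tids = 0;

/* Toy stand-in for lazy_vacuum_all_indexes() followed by lazy_vacuum_heap() */
static void
two_pass_flush(int nindexes, bool index_cleanup_off)
{
	if (!index_cleanup_off)
		printf("vacuum %d indexes, then set %d remembered TIDs LP_UNUSED\n",
			   nindexes, ndead_tids);
	else
		printf("INDEX_CLEANUP off: leave %d LP_DEAD items in place\n",
			   ndead_tids);

	/* Either way, forget the TIDs that were just handled */
	ndead_tids = 0;
}

int
main(void)
{
	int			nindexes = 2;
	bool		index_cleanup_off = false;
	int			nblocks = 10;

	for (int blkno = 0; blkno < nblocks; blkno++)
	{
		int			ndead_on_page = blkno % 3;	/* pretend pruning left these */

		if (nindexes == 0)
		{
			/* One-pass case: reclaim this page's LP_DEAD items right away */
			if (ndead_on_page > 0)
				printf("block %d: set %d LP_DEAD items LP_UNUSED now\n",
					   blkno, ndead_on_page);
			continue;
		}

		/* Two-pass case: remember TIDs, flushing whenever the array is full */
		for (int i = 0; i < ndead_on_page; i++)
		{
			if (ndead_tids == MAX_DEAD_TIDS)
				two_pass_flush(nindexes, index_cleanup_off);
			dead_tids[ndead_tids++] = blkno;	/* toy "TID" */
		}
	}

	/* Final flush for whatever is left over at the end of the heap scan */
	if (ndead_tids > 0)
		two_pass_flush(nindexes, index_cleanup_off);

	return 0;
}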

Attachment: v4-0002-Remove-tupgone-special-case-from-vacuumlazy.c.patch (application/octet-stream)
From 8c1b837cd9a2afbba123aaa22aedefe9775b838c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 19 Mar 2021 14:46:21 -0700
Subject: [PATCH v4 2/3] Remove tupgone special case from vacuumlazy.c.

Retry the call to heap_page_prune() for the buffer being pruned and
vacuumed in rare cases where there is disagreement between the first
heap_page_prune() call and VACUUM's HeapTupleSatisfiesVacuum() call.
This was possible when a concurrently aborting transaction rendered a
live tuple dead in the tiny window between the two checks.  As a result,
VACUUM's definition of dead tuples (tuples that are to be deleted from
indexes during VACUUM) is simplified: it is always LP_DEAD stub line
pointers from the first scan of the heap.  Note that in general VACUUM
may not have actually done all the pruning that rendered tuples LP_DEAD.

This has the effect of decoupling index vacuuming (and heap page
vacuuming) from pruning during VACUUM's first heap pass.  The index
vacuum skipping performed by the INDEX_CLEANUP mechanism added by commit
a96c41f introduced one case where index vacuuming could be skipped, but
there are reasons to doubt that its approach was 100% robust.  Whereas
simply retrying pruning (and eliminating the tupgone steps entirely)
makes everything far simpler for heap vacuuming, and so far simpler in
general.

Heap vacuuming can now be thought of as conceptually similar to index
vacuuming and conceptually dissimilar to heap pruning.  Heap pruning now
has sole responsibility for anything involving the logical contents of
the database (e.g., managing transaction status information, considering
what to do with chains of tuples caused by UPDATEs).  Whereas index
vacuuming and heap vacuuming are now strictly concerned with removing
garbage tuples from a physical data structure that backs the logical
database.

This work enables INDEX_CLEANUP-style skipping of index vacuuming to be
pushed a lot further -- the decision can now be made dynamically (since
there is no question about leaving behind a dead tuple with storage due
to skipping the second heap pass/heap vacuuming).  An upcoming patch
from Masahiko Sawada will teach VACUUM to skip index vacuuming
dynamically, based on criteria involving the number of dead tuples.  The
only truly essential steps for VACUUM now all take place during the
first heap pass.  These are heap pruning and tuple freezing.  Everything
else is now an optional adjunct, at least in principle.

VACUUM can even change its mind about indexes (it can decide to give up
on deleting tuples from indexes).  There is no fundamental difference
between a VACUUM that decides to skip index vacuuming before it even
began, and a VACUUM that skips index vacuuming having already done a
certain amount of it.

Also remove XLOG_HEAP2_CLEANUP_INFO records.  These are no longer
necessary because we now rely entirely on heap pruning to take care of
recovery conflicts during VACUUM -- there is no longer any need to have
extra recovery conflicts due to the tupgone case, which allowed tuples
that still have storage (i.e. are not LP_DEAD) to nevertheless be
considered dead tuples by VACUUM.  Note that heap vacuuming now uses exactly the
same strategy for recovery conflicts as index vacuuming.  Both
mechanisms now completely rely on heap pruning to generate all the
recovery conflicts that they require.

Also stop acquiring a super-exclusive lock for heap pages when they're
vacuumed during VACUUM's second heap pass.  A regular exclusive lock is
enough.  This is correct because heap page vacuuming is now strictly a
matter of setting the LP_DEAD line pointers to LP_UNUSED.  No other
backend can have a pointer to a tuple located in a pinned buffer that
can be invalidated by a concurrent heap page vacuum operation.  Note
that the page is no longer defragmented during heap page vacuuming,
because that is unsafe without a super-exclusive lock.

Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record
changes.

Credit for the idea of retrying pruning a page to avoid the tupgone case
goes to Andres Freund.
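
As a quick aside (not part of the patch itself), the retry described in the
first paragraph can be pictured with a small, self-contained model in plain C,
where all names are invented for illustration: "pruning" collapses
dead-with-storage items into LP_DEAD stubs, a concurrent abort can create a
new dead-with-storage item between pruning and the later per-item visibility
check, and instead of a tupgone special case the whole page is simply pruned
again.

#include <stdio.h>

typedef enum
{
	TOY_LIVE,					/* ordinary live tuple */
	TOY_DEAD_WITH_STORAGE,		/* dead, but still has tuple storage */
	TOY_LP_DEAD_STUB			/* stub line pointer left behind by pruning */
} ToyItemState;

#define TOY_NITEMS 5

/* Toy "prune": turn every dead-with-storage item into an LP_DEAD stub */
static void
toy_prune(ToyItemState *items)
{
	for (int i = 0; i < TOY_NITEMS; i++)
	{
		if (items[i] == TOY_DEAD_WITH_STORAGE)
			items[i] = TOY_LP_DEAD_STUB;
	}
}

int
main(void)
{
	ToyItemState items[TOY_NITEMS] =
	{TOY_LIVE, TOY_LIVE, TOY_LIVE, TOY_LIVE, TOY_LIVE};
	int			attempts = 0;

retry:
	attempts++;
	toy_prune(items);

	/* Simulate an inserter aborting between pruning and the per-item check */
	if (attempts == 1)
		items[2] = TOY_DEAD_WITH_STORAGE;

	/* The per-item check: is there a dead item that still has storage? */
	for (int i = 0; i < TOY_NITEMS; i++)
	{
		if (items[i] == TOY_DEAD_WITH_STORAGE)
		{
			/* Old approach: "tupgone" special case.  New approach: re-prune. */
			goto retry;
		}
	}

	printf("page settled after %d pruning attempt(s); only LP_DEAD stubs "
		   "remain for index/heap vacuuming\n", attempts);
	return 0;
}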
---
 src/include/access/heapam.h              |   2 +-
 src/include/access/heapam_xlog.h         |  41 ++---
 src/backend/access/gist/gistxlog.c       |   8 +-
 src/backend/access/hash/hash_xlog.c      |   8 +-
 src/backend/access/heap/heapam.c         | 205 +++++++++-------------
 src/backend/access/heap/pruneheap.c      |  60 ++++---
 src/backend/access/heap/vacuumlazy.c     | 214 ++++++++++-------------
 src/backend/access/nbtree/nbtree.c       |   8 +-
 src/backend/access/rmgrdesc/heapdesc.c   |  26 +--
 src/backend/replication/logical/decode.c |   4 +-
 src/backend/storage/page/bufpage.c       |  20 ++-
 src/tools/pgindent/typedefs.list         |   4 +-
 12 files changed, 275 insertions(+), 325 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bc0936bc2d..0bef090420 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -180,7 +180,7 @@ extern int	heap_page_prune(Relation relation, Buffer buffer,
 							struct GlobalVisState *vistest,
 							TransactionId old_snap_xmin,
 							TimestampTz old_snap_ts_ts,
-							bool report_stats, TransactionId *latestRemovedXid,
+							bool report_stats,
 							OffsetNumber *off_loc);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 178d49710a..e6055d1ecd 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -51,9 +51,9 @@
  * these, too.
  */
 #define XLOG_HEAP2_REWRITE		0x00
-#define XLOG_HEAP2_CLEAN		0x10
-#define XLOG_HEAP2_FREEZE_PAGE	0x20
-#define XLOG_HEAP2_CLEANUP_INFO 0x30
+#define XLOG_HEAP2_PRUNE		0x10
+#define XLOG_HEAP2_VACUUM		0x20
+#define XLOG_HEAP2_FREEZE_PAGE	0x30
 #define XLOG_HEAP2_VISIBLE		0x40
 #define XLOG_HEAP2_MULTI_INSERT 0x50
 #define XLOG_HEAP2_LOCK_UPDATED 0x60
@@ -227,7 +227,8 @@ typedef struct xl_heap_update
 #define SizeOfHeapUpdate	(offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
 
 /*
- * This is what we need to know about vacuum page cleanup/redirect
+ * This is what we need to know about page pruning (both during VACUUM and
+ * during opportunistic pruning)
  *
  * The array of OffsetNumbers following the fixed part of the record contains:
  *	* for each redirected item: the item offset, then the offset redirected to
@@ -236,29 +237,32 @@ typedef struct xl_heap_update
  * The total number of OffsetNumbers is therefore 2*nredirected+ndead+nunused.
  * Note that nunused is not explicitly stored, but may be found by reference
  * to the total record length.
+ *
+ * Requires a super-exclusive lock.
  */
-typedef struct xl_heap_clean
+typedef struct xl_heap_prune
 {
 	TransactionId latestRemovedXid;
 	uint16		nredirected;
 	uint16		ndead;
 	/* OFFSET NUMBERS are in the block reference 0 */
-} xl_heap_clean;
+} xl_heap_prune;
 
-#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+#define SizeOfHeapPrune (offsetof(xl_heap_prune, ndead) + sizeof(uint16))
 
 /*
- * Cleanup_info is required in some cases during a lazy VACUUM.
- * Used for reporting the results of HeapTupleHeaderAdvanceLatestRemovedXid()
- * see vacuumlazy.c for full explanation
+ * The vacuum page record is similar to the prune record, but can only mark
+ * already dead items as unused.
+ *
+ * Used by heap vacuuming only.  Does not require a super-exclusive lock.
  */
-typedef struct xl_heap_cleanup_info
+typedef struct xl_heap_vacuum
 {
-	RelFileNode node;
-	TransactionId latestRemovedXid;
-} xl_heap_cleanup_info;
+	uint16		nunused;
+	/* OFFSET NUMBERS are in the block reference 0 */
+} xl_heap_vacuum;
 
-#define SizeOfHeapCleanupInfo (sizeof(xl_heap_cleanup_info))
+#define SizeOfHeapVacuum (offsetof(xl_heap_vacuum, nunused) + sizeof(uint16))
 
 /* flags for infobits_set */
 #define XLHL_XMAX_IS_MULTI		0x01
@@ -397,13 +401,6 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
 extern void heap_xlog_logical_rewrite(XLogReaderState *r);
 
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
-										TransactionId latestRemovedXid);
-extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
-								 OffsetNumber *redirected, int nredirected,
-								 OffsetNumber *nowdead, int ndead,
-								 OffsetNumber *nowunused, int nunused,
-								 TransactionId latestRemovedXid);
 extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
 								  TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
 								  int ntuples);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1c80eae044..6464cb9281 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -184,10 +184,10 @@ gistRedoDeleteRecord(XLogReaderState *record)
 	 *
 	 * GiST delete records can conflict with standby queries.  You might think
 	 * that vacuum records would conflict as well, but we've handled that
-	 * already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual gist vacuum records on that index.
+	 * already.  XLOG_HEAP2_PRUNE records provide the highest xid cleaned by
+	 * the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual gist vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 02d9e6cdfd..af35a991fc 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -992,10 +992,10 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
 	 * Hash index records that are marked as LP_DEAD and being removed during
 	 * hash index tuple insertion can conflict with standby queries. You might
 	 * think that vacuum records would conflict as well, but we've handled
-	 * that already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual hash index vacuum records on that index.
+	 * that already.  XLOG_HEAP2_PRUNE records provide the highest xid cleaned
+	 * by the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual hash index vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7cb87f4a3b..1d30a92420 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7528,7 +7528,7 @@ heap_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
 			 * must have considered the original tuple header as part of
 			 * generating its own latestRemovedXid value.
 			 *
-			 * Relying on XLOG_HEAP2_CLEAN records like this is the same
+			 * Relying on XLOG_HEAP2_PRUNE records like this is the same
 			 * strategy that index vacuuming uses in all cases.  Index VACUUM
 			 * WAL records don't even have a latestRemovedXid field of their
 			 * own for this reason.
@@ -7947,88 +7947,6 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
 	return nblocksfavorable;
 }
 
-/*
- * Perform XLogInsert to register a heap cleanup info message. These
- * messages are sent once per VACUUM and are required because
- * of the phasing of removal operations during a lazy VACUUM.
- * see comments for vacuum_log_cleanup_info().
- */
-XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
-{
-	xl_heap_cleanup_info xlrec;
-	XLogRecPtr	recptr;
-
-	xlrec.node = rnode;
-	xlrec.latestRemovedXid = latestRemovedXid;
-
-	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, SizeOfHeapCleanupInfo);
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEANUP_INFO);
-
-	return recptr;
-}
-
-/*
- * Perform XLogInsert for a heap-clean operation.  Caller must already
- * have modified the buffer and marked it dirty.
- *
- * Note: prior to Postgres 8.3, the entries in the nowunused[] array were
- * zero-based tuple indexes.  Now they are one-based like other uses
- * of OffsetNumber.
- *
- * We also include latestRemovedXid, which is the greatest XID present in
- * the removed tuples. That allows recovery processing to cancel or wait
- * for long standby queries that can still see these tuples.
- */
-XLogRecPtr
-log_heap_clean(Relation reln, Buffer buffer,
-			   OffsetNumber *redirected, int nredirected,
-			   OffsetNumber *nowdead, int ndead,
-			   OffsetNumber *nowunused, int nunused,
-			   TransactionId latestRemovedXid)
-{
-	xl_heap_clean xlrec;
-	XLogRecPtr	recptr;
-
-	/* Caller should not call me on a non-WAL-logged relation */
-	Assert(RelationNeedsWAL(reln));
-
-	xlrec.latestRemovedXid = latestRemovedXid;
-	xlrec.nredirected = nredirected;
-	xlrec.ndead = ndead;
-
-	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, SizeOfHeapClean);
-
-	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-
-	/*
-	 * The OffsetNumber arrays are not actually in the buffer, but we pretend
-	 * that they are.  When XLogInsert stores the whole buffer, the offset
-	 * arrays need not be stored too.  Note that even if all three arrays are
-	 * empty, we want to expose the buffer as a candidate for whole-page
-	 * storage, since this record type implies a defragmentation operation
-	 * even if no line pointers changed state.
-	 */
-	if (nredirected > 0)
-		XLogRegisterBufData(0, (char *) redirected,
-							nredirected * sizeof(OffsetNumber) * 2);
-
-	if (ndead > 0)
-		XLogRegisterBufData(0, (char *) nowdead,
-							ndead * sizeof(OffsetNumber));
-
-	if (nunused > 0)
-		XLogRegisterBufData(0, (char *) nowunused,
-							nunused * sizeof(OffsetNumber));
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEAN);
-
-	return recptr;
-}
-
 /*
  * Perform XLogInsert for a heap-freeze operation.  Caller must have already
  * modified the buffer and marked it dirty.
@@ -8500,34 +8418,15 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed,
 }
 
 /*
- * Handles CLEANUP_INFO
+ * Handles XLOG_HEAP2_PRUNE record type.
+ *
+ * Acquires a super-exclusive lock.
  */
 static void
-heap_xlog_cleanup_info(XLogReaderState *record)
-{
-	xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
-
-	if (InHotStandby)
-		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
-
-	/*
-	 * Actual operation is a no-op. Record type exists to provide a means for
-	 * conflict processing to occur before we begin index vacuum actions. see
-	 * vacuumlazy.c and also comments in btvacuumpage()
-	 */
-
-	/* Backup blocks are not used in cleanup_info records */
-	Assert(!XLogRecHasAnyBlockRefs(record));
-}
-
-/*
- * Handles XLOG_HEAP2_CLEAN record type
- */
-static void
-heap_xlog_clean(XLogReaderState *record)
+heap_xlog_prune(XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
-	xl_heap_clean *xlrec = (xl_heap_clean *) XLogRecGetData(record);
+	xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record);
 	Buffer		buffer;
 	RelFileNode rnode;
 	BlockNumber blkno;
@@ -8538,12 +8437,8 @@ heap_xlog_clean(XLogReaderState *record)
 	/*
 	 * We're about to remove tuples. In Hot Standby mode, ensure that there's
 	 * no queries running for which the removed tuples are still visible.
-	 *
-	 * Not all HEAP2_CLEAN records remove tuples with xids, so we only want to
-	 * conflict on the records that cause MVCC failures for user queries. If
-	 * latestRemovedXid is invalid, skip conflict processing.
 	 */
-	if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
+	if (InHotStandby)
 		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
 
 	/*
@@ -8596,7 +8491,7 @@ heap_xlog_clean(XLogReaderState *record)
 		UnlockReleaseBuffer(buffer);
 
 		/*
-		 * After cleaning records from a page, it's useful to update the FSM
+		 * After pruning records from a page, it's useful to update the FSM
 		 * about it, as it may cause the page become target for insertions
 		 * later even if vacuum decides not to visit it (which is possible if
 		 * gets marked all-visible.)
@@ -8608,6 +8503,80 @@ heap_xlog_clean(XLogReaderState *record)
 	}
 }
 
+/*
+ * Handles XLOG_HEAP2_VACUUM record type.
+ *
+ * Acquires an exclusive lock only.
+ */
+static void
+heap_xlog_vacuum(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_heap_vacuum *xlrec = (xl_heap_vacuum *) XLogRecGetData(record);
+	Buffer		buffer;
+	BlockNumber blkno;
+	XLogRedoAction action;
+
+	/*
+	 * If we have a full-page image, restore it	(without using a cleanup lock)
+	 * and we're done.
+	 */
+	action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, false,
+										   &buffer);
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		page = (Page) BufferGetPage(buffer);
+		OffsetNumber *nowunused;
+		Size		datalen;
+		OffsetNumber *offnum;
+
+		nowunused = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
+
+		/* Shouldn't be a record unless there's something to do */
+		Assert(xlrec->nunused > 0);
+
+		/* Update all now-unused line pointers */
+		offnum = nowunused;
+		for (int i = 0; i < xlrec->nunused; i++)
+		{
+			OffsetNumber off = *offnum++;
+			ItemId		lp = PageGetItemId(page, off);
+
+			Assert(ItemIdIsDead(lp));
+			ItemIdSetUnused(lp);
+		}
+
+		/*
+		 * Update the page's hint bit about whether it has free pointers
+		 */
+		PageSetHasFreeLinePointers(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+
+	if (BufferIsValid(buffer))
+	{
+		Size		freespace = PageGetHeapFreeSpace(BufferGetPage(buffer));
+		RelFileNode rnode;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * After vacuuming LP_DEAD items from a page, it's useful to update
+		 * the FSM about it, as it may cause the page to become a target for
+		 * insertions later even if vacuum decides not to visit it (which is
+		 * possible if it gets marked all-visible.)
+		 *
+		 * Do this regardless of a full-page image being applied, since the
+		 * FSM data is not in the page anyway.
+		 */
+		XLogRecordPageWithFreeSpace(rnode, blkno, freespace);
+	}
+}
+
 /*
  * Replay XLOG_HEAP2_VISIBLE record.
  *
@@ -9712,15 +9681,15 @@ heap2_redo(XLogReaderState *record)
 
 	switch (info & XLOG_HEAP_OPMASK)
 	{
-		case XLOG_HEAP2_CLEAN:
-			heap_xlog_clean(record);
+		case XLOG_HEAP2_PRUNE:
+			heap_xlog_prune(record);
+			break;
+		case XLOG_HEAP2_VACUUM:
+			heap_xlog_vacuum(record);
 			break;
 		case XLOG_HEAP2_FREEZE_PAGE:
 			heap_xlog_freeze_page(record);
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			heap_xlog_cleanup_info(record);
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			heap_xlog_visible(record);
 			break;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406..f75502ca2c 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -182,13 +182,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		 */
 		if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
 		{
-			TransactionId ignore = InvalidTransactionId;	/* return value not
-															 * needed */
-
 			/* OK to prune */
 			(void) heap_page_prune(relation, buffer, vistest,
 								   limited_xmin, limited_ts,
-								   true, &ignore, NULL);
+								   true, NULL);
 		}
 
 		/* And release buffer lock */
@@ -213,8 +210,6 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * send its own new total to pgstats, and we don't want this delta applied
  * on top of that.)
  *
- * Sets latestRemovedXid for caller on return.
- *
  * off_loc is the offset location required by the caller to use in error
  * callback.
  *
@@ -225,7 +220,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 				GlobalVisState *vistest,
 				TransactionId old_snap_xmin,
 				TimestampTz old_snap_ts,
-				bool report_stats, TransactionId *latestRemovedXid,
+				bool report_stats,
 				OffsetNumber *off_loc)
 {
 	int			ndeleted = 0;
@@ -251,7 +246,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	prstate.old_snap_xmin = old_snap_xmin;
 	prstate.old_snap_ts = old_snap_ts;
 	prstate.old_snap_used = false;
-	prstate.latestRemovedXid = *latestRemovedXid;
+	prstate.latestRemovedXid = InvalidTransactionId;
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));
 
@@ -318,17 +313,41 @@ heap_page_prune(Relation relation, Buffer buffer,
 		MarkBufferDirty(buffer);
 
 		/*
-		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
+		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
 		if (RelationNeedsWAL(relation))
 		{
+			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
 
-			recptr = log_heap_clean(relation, buffer,
-									prstate.redirected, prstate.nredirected,
-									prstate.nowdead, prstate.ndead,
-									prstate.nowunused, prstate.nunused,
-									prstate.latestRemovedXid);
+			xlrec.latestRemovedXid = prstate.latestRemovedXid;
+			xlrec.nredirected = prstate.nredirected;
+			xlrec.ndead = prstate.ndead;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, SizeOfHeapPrune);
+
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+			/*
+			 * The OffsetNumber arrays are not actually in the buffer, but we
+			 * pretend that they are.  When XLogInsert stores the whole
+			 * buffer, the offset arrays need not be stored too.
+			 */
+			if (prstate.nredirected > 0)
+				XLogRegisterBufData(0, (char *) prstate.redirected,
+									prstate.nredirected *
+									sizeof(OffsetNumber) * 2);
+
+			if (prstate.ndead > 0)
+				XLogRegisterBufData(0, (char *) prstate.nowdead,
+									prstate.ndead * sizeof(OffsetNumber));
+
+			if (prstate.nunused > 0)
+				XLogRegisterBufData(0, (char *) prstate.nowunused,
+									prstate.nunused * sizeof(OffsetNumber));
+
+			recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE);
 
 			PageSetLSN(BufferGetPage(buffer), recptr);
 		}
@@ -363,8 +382,6 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (report_stats && ndeleted > prstate.ndead)
 		pgstat_update_heap_dead_tuples(relation, ndeleted - prstate.ndead);
 
-	*latestRemovedXid = prstate.latestRemovedXid;
-
 	/*
 	 * XXX Should we update the FSM information of this page ?
 	 *
@@ -809,12 +826,8 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
 
 /*
  * Perform the actual page changes needed by heap_page_prune.
- * It is expected that the caller has suitable pin and lock on the
- * buffer, and is inside a critical section.
- *
- * This is split out because it is also used by heap_xlog_clean()
- * to replay the WAL record when needed after a crash.  Note that the
- * arguments are identical to those of log_heap_clean().
+ * It is expected that the caller has a super-exclusive lock on the
+ * buffer.
  */
 void
 heap_page_prune_execute(Buffer buffer,
@@ -826,6 +839,9 @@ heap_page_prune_execute(Buffer buffer,
 	OffsetNumber *offnum;
 	int			i;
 
+	/* Shouldn't be called unless there's something to do */
+	Assert(nredirected > 0 || ndead > 0 || nunused > 0);
+
 	/* Update all redirected line pointers */
 	offnum = redirected;
 	for (i = 0; i < nredirected; i++)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 6382393516..cd040e1e99 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -310,7 +310,6 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	LVDeadTuples *dead_tuples;
 	int			num_index_scans;
-	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
 
 	/* Used for error callback */
@@ -749,39 +748,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	}
 }
 
-/*
- * For Hot Standby we need to know the highest transaction id that will
- * be removed by any change. VACUUM proceeds in a number of passes so
- * we need to consider how each pass operates. The first phase runs
- * heap_page_prune(), which can issue XLOG_HEAP2_CLEAN records as it
- * progresses - these will have a latestRemovedXid on each record.
- * In some cases this removes all of the tuples to be removed, though
- * often we have dead tuples with index pointers so we must remember them
- * for removal in phase 3. Index records for those rows are removed
- * in phase 2 and index blocks do not have MVCC information attached.
- * So before we can allow removal of any index tuples we need to issue
- * a WAL record containing the latestRemovedXid of rows that will be
- * removed in phase three. This allows recovery queries to block at the
- * correct place, i.e. before phase two, rather than during phase three
- * which would be after the rows have become inaccessible.
- */
-static void
-vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
-{
-	/*
-	 * Skip this for relations for which no WAL is to be written, or if we're
-	 * not trying to support archive recovery.
-	 */
-	if (!RelationNeedsWAL(rel) || !XLogIsNeeded())
-		return;
-
-	/*
-	 * No need to write the record at all unless it contains a valid value
-	 */
-	if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
-		(void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
-}
-
 /*
  * Handle new page during lazy_scan_heap().
  *
@@ -901,22 +867,23 @@ scan_prune_page(Relation onerel, Buffer buf,
 				LVRelStats *vacrelstats,
 				GlobalVisState *vistest, xl_heap_freeze_tuple *frozen,
 				LVTempCounters *c, LVPrunePageState *ps,
-				LVVisMapPageState *vms,
-				VacOptTernaryValue index_cleanup)
+				LVVisMapPageState *vms)
 {
 	BlockNumber blkno;
 	Page		page;
 	OffsetNumber offnum,
 				maxoff;
+	HTSV_Result tuplestate;
 	int			nfrozen,
 				ndead;
 	LVTempCounters pc;
 	OffsetNumber deaditems[MaxHeapTuplesPerPage];
-	bool		tupgone;
 
 	blkno = BufferGetBlockNumber(buf);
 	page = BufferGetPage(buf);
 
+retry:
+
 	/* Initialize (or reset) page-level counters */
 	pc.num_tuples = 0;
 	pc.live_tuples = 0;
@@ -932,12 +899,14 @@ scan_prune_page(Relation onerel, Buffer buf,
 	 */
 	pc.tups_vacuumed = heap_page_prune(onerel, buf, vistest,
 									   InvalidTransactionId, 0, false,
-									   &vacrelstats->latestRemovedXid,
 									   &vacrelstats->offnum);
 
 	/*
 	 * Now scan the page to collect vacuumable items and check for tuples
 	 * requiring freezing.
+	 *
+	 * Note: If we retry having set vms.visibility_cutoff_xid it doesn't
+	 * matter -- the newest XMIN on page can't be missed this way.
 	 */
 	ps->hastup = false;
 	ps->has_dead_items = false;
@@ -947,7 +916,14 @@ scan_prune_page(Relation onerel, Buffer buf,
 	ndead = 0;
 	maxoff = PageGetMaxOffsetNumber(page);
 
-	tupgone = false;
+#ifdef DEBUG
+
+	/*
+	 * Enable this to debug the retry logic -- it's actually quite hard to hit
+	 * even with this artificial delay
+	 */
+	pg_usleep(10000);
+#endif
 
 	/*
 	 * Note: If you change anything in the loop below, also look at
@@ -959,6 +935,7 @@ scan_prune_page(Relation onerel, Buffer buf,
 	{
 		ItemId		itemid;
 		HeapTupleData tuple;
+		bool		tuple_totally_frozen;
 
 		/*
 		 * Set the offset number so that we can display it along with any
@@ -1007,6 +984,17 @@ scan_prune_page(Relation onerel, Buffer buf,
 		tuple.t_len = ItemIdGetLength(itemid);
 		tuple.t_tableOid = RelationGetRelid(onerel);
 
+		/*
+		 * DEAD tuples are almost always pruned into LP_DEAD line pointers by
+		 * heap_page_prune(), but it's possible that the tuple state changed
+		 * since heap_page_prune() looked.  Handle that here by restarting.
+		 * (See comments at the top of function for a full explanation.)
+		 */
+		tuplestate = HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf);
+
+		if (unlikely(tuplestate == HEAPTUPLE_DEAD))
+			goto retry;
+
 		/*
 		 * The criteria for counting a tuple as live in this block need to
 		 * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
@@ -1017,42 +1005,8 @@ scan_prune_page(Relation onerel, Buffer buf,
 		 * VACUUM can't run inside a transaction block, which makes some cases
 		 * impossible (e.g. in-progress insert from the same transaction).
 		 */
-		switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+		switch (tuplestate)
 		{
-			case HEAPTUPLE_DEAD:
-
-				/*
-				 * Ordinarily, DEAD tuples would have been removed by
-				 * heap_page_prune(), but it's possible that the tuple state
-				 * changed since heap_page_prune() looked.  In particular an
-				 * INSERT_IN_PROGRESS tuple could have changed to DEAD if the
-				 * inserter aborted.  So this cannot be considered an error
-				 * condition.
-				 *
-				 * If the tuple is HOT-updated then it must only be removed by
-				 * a prune operation; so we keep it just as if it were
-				 * RECENTLY_DEAD.  Also, if it's a heap-only tuple, we choose
-				 * to keep it, because it'll be a lot cheaper to get rid of it
-				 * in the next pruning pass than to treat it like an indexed
-				 * tuple. Finally, if index cleanup is disabled, the second
-				 * heap pass will not execute, and the tuple will not get
-				 * removed, so we must treat it like any other dead tuple that
-				 * we choose to keep.
-				 *
-				 * If this were to happen for a tuple that actually needed to
-				 * be deleted, we'd be in trouble, because it'd possibly leave
-				 * a tuple below the relation's xmin horizon alive.
-				 * heap_prepare_freeze_tuple() is prepared to detect that case
-				 * and abort the transaction, preventing corruption.
-				 */
-				if (HeapTupleIsHotUpdated(&tuple) ||
-					HeapTupleIsHeapOnly(&tuple) ||
-					index_cleanup == VACOPT_TERNARY_DISABLED)
-					pc.nkeep += 1;
-				else
-					tupgone = true; /* we can delete the tuple */
-				ps->all_visible = false;
-				break;
 			case HEAPTUPLE_LIVE:
 
 				/*
@@ -1133,38 +1087,22 @@ scan_prune_page(Relation onerel, Buffer buf,
 				break;
 		}
 
-		if (tupgone)
-		{
-			deaditems[ndead++] = offnum;
-			HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-												   &vacrelstats->latestRemovedXid);
-			pc.tups_vacuumed += 1;
-			ps->has_dead_items = true;
-		}
-		else
-		{
-			bool		tuple_totally_frozen;
+		pc.num_tuples += 1;
+		ps->hastup = true;
 
-			pc.num_tuples += 1;
-			ps->hastup = true;
+		/*
+		 * Each non-removable tuple must be checked to see if it needs
+		 * freezing
+		 */
+		if (heap_prepare_freeze_tuple(tuple.t_data,
+									  RelFrozenXid, RelMinMxid,
+									  FreezeLimit, MultiXactCutoff,
+									  &frozen[nfrozen],
+									  &tuple_totally_frozen))
+			frozen[nfrozen++].offset = offnum;
 
-			/*
-			 * Each non-removable tuple must be checked to see if it needs
-			 * freezing
-			 */
-			if (heap_prepare_freeze_tuple(tuple.t_data,
-										  RelFrozenXid, RelMinMxid,
-										  FreezeLimit, MultiXactCutoff,
-										  &frozen[nfrozen],
-										  &tuple_totally_frozen))
-				frozen[nfrozen++].offset = offnum;
-
-			pc.num_tuples += 1;
-			ps->hastup = true;
-
-			if (!tuple_totally_frozen)
-				ps->all_frozen = false;
-		}
+		if (!tuple_totally_frozen)
+			ps->all_frozen = false;
 	}
 
 	/*
@@ -1801,7 +1739,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * tuple headers left behind following pruning.
 		 */
 		scan_prune_page(onerel, buf, vacrelstats, vistest, frozen,
-						&c, &ps, &vms, params->index_cleanup);
+						&c, &ps, &vms);
 
 		/*
 		 * Step 7 for block: Set up details for saving free space in FSM at
@@ -2066,7 +2004,12 @@ two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
 /*
  *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
  *
- * Should only be called through two_pass_strategy()
+ * Should only be called through two_pass_strategy().
+ *
+ * We don't need a latestRemovedXid value for recovery conflicts here -- we
+ * rely on conflicts from heap pruning instead (i.e. a heap_page_prune() call
+ * that took place earlier, usually though not always during the ongoing
+ * VACUUM operation).
  */
 static void
 lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
@@ -2077,9 +2020,6 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 	Assert(!IsParallelWorker());
 	Assert(nindexes > 0);
 
-	/* Log cleanup info before we touch indexes */
-	vacuum_log_cleanup_info(onerel, vacrelstats);
-
 	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
@@ -2122,7 +2062,12 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
  *		space on their pages.  Pages not having dead tuples recorded from
  *		lazy_scan_heap are not visited at all.
  *
- * Should only be called through two_pass_strategy()
+ * Should only be called through two_pass_strategy().
+ *
+ * We don't need a latestRemovedXid value for recovery conflicts here -- we
+ * rely on conflicts from heap pruning instead (i.e. a heap_page_prune() call
+ * that took place earlier, usually though not always during the ongoing
+ * VACUUM operation).
  */
 static void
 lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
@@ -2158,12 +2103,7 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 		vacrelstats->blkno = tblk;
 		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
 								 vac_strategy);
-		if (!ConditionalLockBufferForCleanup(buf))
-		{
-			ReleaseBuffer(buf);
-			++tupindex;
-			continue;
-		}
+		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 		tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
 									&vmbuffer);
 
@@ -2196,14 +2136,25 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 }
 
 /*
- *	lazy_vacuum_page() -- free dead tuples on a page
- *					 and repair its fragmentation.
+ *	lazy_vacuum_page() -- free page's LP_DEAD items listed in the
+ *					 vacrelstats->dead_tuples array.
  *
- * Caller must hold pin and buffer cleanup lock on the buffer.
+ * Caller must have an exclusive buffer lock on the buffer (though a
+ * super-exclusive lock is also acceptable).
  *
  * tupindex is the index in vacrelstats->dead_tuples of the first dead
  * tuple for this page.  We assume the rest follow sequentially.
  * The return value is the first tupindex after the tuples of this page.
+ *
+ * Prior to PostgreSQL 14 there were rare cases where this routine had to set
+ * tuples with storage to unused.  These days it is strictly responsible for
+ * marking LP_DEAD stub line pointers as unused.  This only happens for those
+ * LP_DEAD items on the page that were determined to be LP_DEAD items back
+ * when the same heap page was visited by scan_prune_page() (i.e. those whose
+ * TID was recorded in the dead_tuples array).
+ *
+ * We cannot defragment the page here because that isn't safe while only
+ * holding an exclusive lock.
  */
 static int
 lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
@@ -2236,11 +2187,15 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 			break;				/* past end of tuples for this block */
 		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
 		itemid = PageGetItemId(page, toff);
+
+		Assert(ItemIdIsDead(itemid));
 		ItemIdSetUnused(itemid);
 		unused[uncnt++] = toff;
 	}
 
-	PageRepairFragmentation(page);
+	Assert(uncnt > 0);
+
+	PageSetHasFreeLinePointers(page);
 
 	/*
 	 * Mark buffer dirty before we write WAL.
@@ -2250,12 +2205,19 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	/* XLOG stuff */
 	if (RelationNeedsWAL(onerel))
 	{
+		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
 
-		recptr = log_heap_clean(onerel, buffer,
-								NULL, 0, NULL, 0,
-								unused, uncnt,
-								vacrelstats->latestRemovedXid);
+		xlrec.nunused = uncnt;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHeapVacuum);
+
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) unused, uncnt * sizeof(OffsetNumber));
+
+		recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VACUUM);
+
 		PageSetLSN(page, recptr);
 	}
 
@@ -2268,10 +2230,10 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	END_CRIT_SECTION();
 
 	/*
-	 * Now that we have removed the dead tuples from the page, once again
+	 * Now that we have removed the LP_DEAD items from the page, once again
 	 * check if the page has become all-visible.  The page is already marked
 	 * dirty, exclusively locked, and, if needed, a full page image has been
-	 * emitted in the log_heap_clean() above.
+	 * emitted.
 	 */
 	if (heap_page_is_all_visible(onerel, buffer, vacrelstats,
 								 &visibility_cutoff_xid,
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c02c4e7710..c43ce01f4b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1203,10 +1203,10 @@ backtrack:
 				 * as long as the callback function only considers whether the
 				 * index tuple refers to pre-cutoff heap tuples that were
 				 * certainly already pruned away during VACUUM's initial heap
-				 * scan by the time we get here. (heapam's XLOG_HEAP2_CLEAN
-				 * and XLOG_HEAP2_CLEANUP_INFO records produce conflicts using
-				 * a latestRemovedXid value for the pointed-to heap tuples, so
-				 * there is no need to produce our own conflict now.)
+				 * scan by the time we get here. (heapam's XLOG_HEAP2_PRUNE
+				 * records produce conflicts using a latestRemovedXid value
+				 * for the pointed-to heap tuples, so there is no need to
+				 * produce our own conflict now.)
 				 *
 				 * Backends with snapshots acquired after a VACUUM starts but
 				 * before it finishes could have visibility cutoff with a
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index e60e32b935..a5c0931394 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -121,12 +121,18 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
 	info &= XLOG_HEAP_OPMASK;
-	if (info == XLOG_HEAP2_CLEAN)
+	if (info == XLOG_HEAP2_PRUNE)
 	{
-		xl_heap_clean *xlrec = (xl_heap_clean *) rec;
+		xl_heap_prune *xlrec = (xl_heap_prune *) rec;
 
 		appendStringInfo(buf, "latestRemovedXid %u", xlrec->latestRemovedXid);
 	}
+	else if (info == XLOG_HEAP2_VACUUM)
+	{
+		xl_heap_vacuum *xlrec = (xl_heap_vacuum *) rec;
+
+		appendStringInfo(buf, "nunused %u", xlrec->nunused);
+	}
 	else if (info == XLOG_HEAP2_FREEZE_PAGE)
 	{
 		xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) rec;
@@ -134,12 +140,6 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "cutoff xid %u ntuples %u",
 						 xlrec->cutoff_xid, xlrec->ntuples);
 	}
-	else if (info == XLOG_HEAP2_CLEANUP_INFO)
-	{
-		xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
-
-		appendStringInfo(buf, "latestRemovedXid %u", xlrec->latestRemovedXid);
-	}
 	else if (info == XLOG_HEAP2_VISIBLE)
 	{
 		xl_heap_visible *xlrec = (xl_heap_visible *) rec;
@@ -229,15 +229,15 @@ heap2_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_HEAP2_CLEAN:
-			id = "CLEAN";
+		case XLOG_HEAP2_PRUNE:
+			id = "PRUNE";
+			break;
+		case XLOG_HEAP2_VACUUM:
+			id = "VACUUM";
 			break;
 		case XLOG_HEAP2_FREEZE_PAGE:
 			id = "FREEZE_PAGE";
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			id = "CLEANUP_INFO";
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			id = "VISIBLE";
 			break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f596135b1..391caf7396 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -480,8 +480,8 @@ DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 * interested in.
 			 */
 		case XLOG_HEAP2_FREEZE_PAGE:
-		case XLOG_HEAP2_CLEAN:
-		case XLOG_HEAP2_CLEANUP_INFO:
+		case XLOG_HEAP2_PRUNE:
+		case XLOG_HEAP2_VACUUM:
 		case XLOG_HEAP2_VISIBLE:
 		case XLOG_HEAP2_LOCK_UPDATED:
 			break;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 9ac556b4ae..0c4c07503a 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -250,14 +250,18 @@ PageAddItemExtended(Page page,
 		/* if no free slot, we'll put it at limit (1st open slot) */
 		if (PageHasFreeLinePointers(phdr))
 		{
-			/*
-			 * Look for "recyclable" (unused) ItemId.  We check for no storage
-			 * as well, just to be paranoid --- unused items should never have
-			 * storage.
-			 */
+			/* Look for "recyclable" (unused) ItemId */
 			for (offsetNumber = 1; offsetNumber < limit; offsetNumber++)
 			{
 				itemId = PageGetItemId(phdr, offsetNumber);
+
+				/*
+				 * We check for no storage as well, just to be paranoid;
+				 * unused items should never have storage.  Assert() that the
+				 * invariant is respected too.
+				 */
+				Assert(ItemIdIsUsed(itemId) || !ItemIdHasStorage(itemId));
+
 				if (!ItemIdIsUsed(itemId) && !ItemIdHasStorage(itemId))
 					break;
 			}
@@ -676,7 +680,9 @@ compactify_tuples(itemIdCompact itemidbase, int nitems, Page page, bool presorte
  *
  * This routine is usable for heap pages only, but see PageIndexMultiDelete.
  *
- * As a side effect, the page's PD_HAS_FREE_LINES hint bit is updated.
+ * Caller had better have a super-exclusive lock on page's buffer.  As a side
+ * effect, the page's PD_HAS_FREE_LINES hint bit is updated in cases where our
+ * caller (the heap prune code) sets one or more line pointers unused.
  */
 void
 PageRepairFragmentation(Page page)
@@ -771,7 +777,7 @@ PageRepairFragmentation(Page page)
 		compactify_tuples(itemidbase, nstorage, page, presorted);
 	}
 
-	/* Set hint bit for PageAddItem */
+	/* Set hint bit for PageAddItemExtended */
 	if (nunused > 0)
 		PageSetHasFreeLinePointers(page);
 	else
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1d1d5d2f0e..adf7c42a03 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3555,8 +3555,6 @@ xl_hash_split_complete
 xl_hash_squeeze_page
 xl_hash_update_meta_page
 xl_hash_vacuum_one_page
-xl_heap_clean
-xl_heap_cleanup_info
 xl_heap_confirm
 xl_heap_delete
 xl_heap_freeze_page
@@ -3568,9 +3566,11 @@ xl_heap_lock
 xl_heap_lock_updated
 xl_heap_multi_insert
 xl_heap_new_cid
+xl_heap_prune
 xl_heap_rewrite_mapping
 xl_heap_truncate
 xl_heap_update
+xl_heap_vacuum
 xl_heap_visible
 xl_invalid_page
 xl_invalid_page_key
-- 
2.27.0

Attachment: v4-0003-Skip-index-vacuuming-dynamically.patch (application/octet-stream)
From aea093a6902e4d5d953606f58e5549975ae9464a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 19 Mar 2021 14:51:44 -0700
Subject: [PATCH v4 3/3] Skip index vacuuming dynamically.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAD21AoAtZb4+HJT_8RoOXvu4HM-Zd4HKS3YSMCH6+-W=bDyh-w@mail.gmail.com
---
 src/include/commands/vacuum.h          |   6 +-
 src/include/utils/rel.h                |  10 +-
 src/backend/access/common/reloptions.c |  39 ++++++--
 src/backend/access/heap/vacuumlazy.c   | 128 ++++++++++++++++++++-----
 src/backend/commands/vacuum.c          |  33 ++++---
 5 files changed, 170 insertions(+), 46 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d029da5ac0..4885bbb44c 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -21,6 +21,7 @@
 #include "parser/parse_node.h"
 #include "storage/buf.h"
 #include "storage/lock.h"
+#include "utils/rel.h"
 #include "utils/relcache.h"
 
 /*
@@ -216,8 +217,9 @@ typedef struct VacuumParams
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which  verbose logs are activated, -1
 									 * to use default */
-	VacOptTernaryValue index_cleanup;	/* Do index vacuum and cleanup,
-										 * default value depends on reloptions */
+	VacOptIndexCleanupValue index_cleanup;	/* Do index vacuum and cleanup,
+											 * default value depends on
+											 * reloptions */
 	VacOptTernaryValue truncate;	/* Truncate empty pages at the end,
 									 * default value depends on reloptions */
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 5375a37dd1..1c4f5f34d9 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -295,6 +295,13 @@ typedef struct AutoVacOpts
 	float8		analyze_scale_factor;
 } AutoVacOpts;
 
+typedef enum VacOptIndexCleanupValue
+{
+	VACOPT_CLEANUP_AUTO = 0,
+	VACOPT_CLEANUP_DISABLED,
+	VACOPT_CLEANUP_ENABLED
+} VacOptIndexCleanupValue;
+
 typedef struct StdRdOptions
 {
 	int32		vl_len_;		/* varlena header (do not touch directly!) */
@@ -304,7 +311,8 @@ typedef struct StdRdOptions
 	AutoVacOpts autovacuum;		/* autovacuum-related options */
 	bool		user_catalog_table; /* use as an additional catalog relation */
 	int			parallel_workers;	/* max number of parallel workers */
-	bool		vacuum_index_cleanup;	/* enables index vacuuming and cleanup */
+	VacOptIndexCleanupValue vacuum_index_cleanup;	/* enables index vacuuming
+													 * and cleanup */
 	bool		vacuum_truncate;	/* enables vacuum to truncate a relation */
 	bool		parallel_insert_enabled;	/* enables planner's use of
 											 * parallel insert */
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 5a0ae99750..282978a310 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -140,15 +140,6 @@ static relopt_bool boolRelOpts[] =
 		},
 		false
 	},
-	{
-		{
-			"vacuum_index_cleanup",
-			"Enables index vacuuming and index cleanup",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
-			ShareUpdateExclusiveLock
-		},
-		true
-	},
 	{
 		{
 			"vacuum_truncate",
@@ -501,6 +492,23 @@ relopt_enum_elt_def viewCheckOptValues[] =
 	{(const char *) NULL}		/* list terminator */
 };
 
+/*
+ * Values for the index_cleanup reloption (VacOptIndexCleanupValue).
+ * Boolean spellings other than "on" and "off" are accepted for backward
+ * compatibility, since the option used to be a boolean.
+ */
+relopt_enum_elt_def vacOptTernaryOptValues[] =
+{
+	{"auto", VACOPT_CLEANUP_AUTO},
+	{"true", VACOPT_CLEANUP_ENABLED},
+	{"false", VACOPT_CLEANUP_DISABLED},
+	{"on", VACOPT_CLEANUP_ENABLED},
+	{"off", VACOPT_CLEANUP_DISABLED},
+	{"1", VACOPT_CLEANUP_ENABLED},
+	{"0", VACOPT_CLEANUP_DISABLED}
+};
+
 static relopt_enum enumRelOpts[] =
 {
 	{
@@ -525,6 +533,17 @@ static relopt_enum enumRelOpts[] =
 		VIEW_OPTION_CHECK_OPTION_NOT_SET,
 		gettext_noop("Valid values are \"local\" and \"cascaded\".")
 	},
+	{
+		{
+			"vacuum_index_cleanup",
+			"Enables index vacuuming and index cleanup",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		vacOptTernaryOptValues,
+		VACOPT_CLEANUP_AUTO,
+		gettext_noop("Valid values are \"on\", \"off\", and \"auto\".")
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1865,7 +1884,7 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, user_catalog_table)},
 		{"parallel_workers", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, parallel_workers)},
-		{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
+		{"vacuum_index_cleanup", RELOPT_TYPE_ENUM,
 		offsetof(StdRdOptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
 		offsetof(StdRdOptions, vacuum_truncate)},
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index cd040e1e99..4ea942c2bf 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -131,6 +131,12 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+/*
+ * The threshold fraction of heap blocks having at least one LP_DEAD line
+ * pointer, above which index vacuuming goes ahead.
+ */
+#define SKIP_VACUUM_PAGES_RATIO		0.01
+
 /*
  * DSM keys for parallel vacuum.  Unlike other parallel execution code, since
  * we don't need to worry about DSM keys conflicting with plan_node_id we can
@@ -382,7 +388,8 @@ static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 static void two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
 							  Relation *Irel, IndexBulkDeleteResult **indstats,
 							  int nindexes, LVParallelState *lps,
-							  VacOptTernaryValue index_cleanup);
+							  VacOptIndexCleanupValue index_cleanup,
+							  BlockNumber has_dead_items_pages, bool onecall);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
@@ -485,7 +492,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	PgStat_Counter startwritetime = 0;
 
 	Assert(params != NULL);
-	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
 	/* measure elapsed time iff autovacuum logging requires it */
@@ -1320,11 +1326,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	};
 	int64		initprog_val[3];
 	GlobalVisState *vistest;
+	bool		calledtwopass = false;
 	LVTempCounters c;
 
 	/* Counters of # blocks in onerel: */
 	BlockNumber empty_pages,
-				vacuumed_pages;
+				vacuumed_pages,
+				has_dead_items_pages;
 
 	pg_rusage_init(&ru0);
 
@@ -1339,7 +1347,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 						vacrelstats->relnamespace,
 						vacrelstats->relname)));
 
-	empty_pages = vacuumed_pages = 0;
+	empty_pages = vacuumed_pages = has_dead_items_pages = 0;
 
 	/* Initialize counters */
 	c.num_tuples = 0;
@@ -1602,9 +1610,17 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				vmbuffer = InvalidBuffer;
 			}
 
+			/*
+			 * Won't be skipping index vacuuming now, since that is only
+			 * something two_pass_strategy() does when dead tuple space hasn't
+			 * been overrun.
+			 */
+			calledtwopass = true;
+
 			/* Remove the collected garbage tuples from table and indexes */
 			two_pass_strategy(onerel, vacrelstats, Irel, indstats, nindexes,
-							  lps, params->index_cleanup);
+							  lps, params->index_cleanup,
+							  has_dead_items_pages, false);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
@@ -1741,6 +1757,17 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		scan_prune_page(onerel, buf, vacrelstats, vistest, frozen,
 						&c, &ps, &vms);
 
+		/*
+		 * Remember the number of pages having at least one LP_DEAD line
+		 * pointer.  This could be from this VACUUM, a previous VACUUM, or
+		 * even opportunistic pruning.  Note that this is exactly the same
+		 * thing as having items that are stored in dead_tuples space, because
+		 * scan_prune_page() doesn't count anything other than LP_DEAD items
+		 * as dead (as of PostgreSQL 14).
+		 */
+		if (ps.has_dead_items)
+			has_dead_items_pages++;
+
 		/*
 		 * Step 7 for block: Set up details for saving free space in FSM at
 		 * end of loop.  (Also performs extra single pass strategy steps in
@@ -1755,9 +1782,18 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		savefreespace = false;
 		freespace = 0;
 		if (nindexes > 0 && ps.has_dead_items &&
-			params->index_cleanup != VACOPT_TERNARY_DISABLED)
+			params->index_cleanup != VACOPT_CLEANUP_DISABLED)
 		{
-			/* Wait until lazy_vacuum_heap() to save free space */
+			/*
+			 * Wait until lazy_vacuum_heap() to save free space.
+			 *
+			 * Note: It's not in fact 100% certain that we really will call
+			 * lazy_vacuum_heap() in the INDEX_CLEANUP = AUTO case (which is the
+			 * common case) -- two_pass_strategy() might opt to skip index
+			 * vacuuming (and so must skip heap vacuuming).  This is deemed
+			 * okay, because there can't be very much free space when this
+			 * happens.
+			 */
 		}
 		else
 		{
@@ -1869,7 +1905,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	Assert(nindexes > 0 || dead_tuples->num_tuples == 0);
 	if (dead_tuples->num_tuples > 0)
 		two_pass_strategy(onerel, vacrelstats, Irel, indstats, nindexes,
-						  lps, params->index_cleanup);
+						  lps, params->index_cleanup,
+						  has_dead_items_pages, !calledtwopass);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
@@ -1884,10 +1921,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/*
 	 * Do post-vacuum cleanup.
 	 *
-	 * Note that post-vacuum cleanup does not take place with
+	 * Note that post-vacuum cleanup still takes place even when
+	 * two_pass_strategy() decided to skip index vacuuming, but not with
 	 * INDEX_CLEANUP=OFF.
 	 */
-	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
+	if (nindexes > 0 && params->index_cleanup != VACOPT_CLEANUP_DISABLED)
 		lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, nindexes);
 
 	/*
@@ -1900,10 +1938,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/*
 	 * Update index statistics.
 	 *
-	 * Note that updating the statistics does not take place with
-	 * INDEX_CLEANUP=OFF.
+	 * Note that updating the statistics still takes place even when
+	 * two_pass_strategy() decided to skip index vacuuming, but not with
+	 * INDEX_CLEANUP=OFF.
+	 *
+	 * (In practice most index AMs won't have accurate statistics from
+	 * cleanup, but the index AM API allows them to, so we must check.)
 	 */
-	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
+	if (nindexes > 0 && params->index_cleanup != VACOPT_CLEANUP_DISABLED)
 		update_index_statistics(Irel, indstats, nindexes);
 
 	/* If no indexes, make log report that two_pass_strategy() would've made */
@@ -1946,12 +1987,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 /*
  * Remove the collected garbage tuples from the table and its indexes.
  *
- * We may be required to skip index vacuuming by INDEX_CLEANUP reloption.
+ * We may be able to skip index vacuuming (we may even be required to do so
+ * by the INDEX_CLEANUP reloption).
  */
 static void
 two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
 				  Relation *Irel, IndexBulkDeleteResult **indstats, int nindexes,
-				  LVParallelState *lps, VacOptTernaryValue index_cleanup)
+				  LVParallelState *lps, VacOptIndexCleanupValue index_cleanup,
+				  BlockNumber has_dead_items_pages, bool onecall)
 {
 	bool		skipping;
 
@@ -1959,11 +2002,44 @@ two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
 	Assert(nindexes > 0);
 	Assert(!IsParallelWorker());
 
-	/* Check whether or not to do index vacuum and heap vacuum */
-	if (index_cleanup == VACOPT_TERNARY_DISABLED)
+	/*
+	 * Check whether or not to do index vacuum and heap vacuum.
+	 *
+	 * We do both index vacuum and heap vacuum if more than
+	 * SKIP_VACUUM_PAGES_RATIO of all heap pages have at least one LP_DEAD
+	 * line pointer.  Otherwise we skip them; that is normally a case where
+	 * dead tuples on the heap are highly concentrated in relatively few heap
+	 * blocks, where index deletion mechanisms that are clever about heap
+	 * block dead tuple concentrations (such as btree's bottom-up index
+	 * deletion) work well.  Also, since only a few heap blocks could be
+	 * cleaned anyway, skipping has less of a negative impact on visibility
+	 * map updates.
+	 *
+	 * If we skip vacuum, we just ignore the collected dead tuples.  Note that
+	 * vacrelstats->dead_tuples could have tuples which became dead after
+	 * HOT-pruning but are not marked dead yet.  We do not process them
+	 * because it's a very rare condition, and the next vacuum will process
+	 * them anyway.
+	 */
+	if (index_cleanup == VACOPT_CLEANUP_DISABLED)
 		skipping = true;
-	else
+	else if (index_cleanup == VACOPT_CLEANUP_ENABLED)
 		skipping = false;
+	else if (!onecall)
+		skipping = false;
+	else
+	{
+		BlockNumber rel_pages_threshold;
+
+		Assert(onecall && index_cleanup == VACOPT_CLEANUP_AUTO);
+
+		rel_pages_threshold =
+			(double) vacrelstats->rel_pages * SKIP_VACUUM_PAGES_RATIO;
+
+		if (has_dead_items_pages < rel_pages_threshold)
+			skipping = true;
+		else
+			skipping = false;
+	}
 
 	if (!skipping)
 	{
@@ -1988,10 +2064,18 @@ two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
 		 * one or more LP_DEAD items (could be from us or from another
 		 * VACUUM), not # blocks scanned.
 		 */
-		ereport(elevel,
-				(errmsg("\"%s\": INDEX_CLEANUP off forced VACUUM to not totally remove %d pruned items",
-						vacrelstats->relname,
-						vacrelstats->dead_tuples->num_tuples)));
+		if (index_cleanup == VACOPT_CLEANUP_AUTO)
+			ereport(elevel,
+					(errmsg("\"%s\": opted to not totally remove %d pruned items in %u pages",
+							vacrelstats->relname,
+							vacrelstats->dead_tuples->num_tuples,
+							has_dead_items_pages)));
+		else
+			ereport(elevel,
+					(errmsg("\"%s\": INDEX_CLEANUP off forced VACUUM to not totally remove %d pruned items in %u pages",
+							vacrelstats->relname,
+							vacrelstats->dead_tuples->num_tuples,
+							has_dead_items_pages)));
 	}
 
 	/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c064352e23..0d3aece45b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -108,7 +108,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 	ListCell   *lc;
 
 	/* Set default value */
-	params.index_cleanup = VACOPT_TERNARY_DEFAULT;
+	params.index_cleanup = VACOPT_CLEANUP_AUTO;
 	params.truncate = VACOPT_TERNARY_DEFAULT;
 
 	/* By default parallel vacuum is enabled */
@@ -140,7 +140,14 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		else if (strcmp(opt->defname, "disable_page_skipping") == 0)
 			disable_page_skipping = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "index_cleanup") == 0)
-			params.index_cleanup = get_vacopt_ternary_value(opt);
+		{
+			if (opt->arg == NULL || strcmp(defGetString(opt), "auto") == 0)
+				params.index_cleanup = VACOPT_CLEANUP_AUTO;
+			else if (defGetBoolean(opt))
+				params.index_cleanup = VACOPT_CLEANUP_ENABLED;
+			else
+				params.index_cleanup = VACOPT_CLEANUP_DISABLED;
+		}
 		else if (strcmp(opt->defname, "process_toast") == 0)
 			process_toast = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "truncate") == 0)
@@ -1880,15 +1887,19 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	onerelid = onerel->rd_lockInfo.lockRelId;
 	LockRelationIdForSession(&onerelid, lmode);
 
-	/* Set index cleanup option based on reloptions if not yet */
-	if (params->index_cleanup == VACOPT_TERNARY_DEFAULT)
-	{
-		if (onerel->rd_options == NULL ||
-			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
-			params->index_cleanup = VACOPT_TERNARY_ENABLED;
-		else
-			params->index_cleanup = VACOPT_TERNARY_DISABLED;
-	}
+	/*
+	 * Set index cleanup option based on reloptions if not set to either ON or
+	 * OFF.  Note that a VACUUM(INDEX_CLEANUP=AUTO) command is interpreted as
+	 * "prefer the reloption, but if that's not set either, let vacuumlazy.c
+	 * dynamically determine whether index vacuuming and cleanup take place".
+	 * Note also that the reloption might be explicitly set to AUTO.
+	 *
+	 * XXX: Do we really want that?
+	 */
+	if (params->index_cleanup == VACOPT_CLEANUP_AUTO &&
+		onerel->rd_options != NULL)
+		params->index_cleanup =
+			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup;
 
 	/* Set truncate option based on reloptions if not yet */
 	if (params->truncate == VACOPT_TERNARY_DEFAULT)
-- 
2.27.0

#72Greg Stark
stark@mit.edu
In reply to: Peter Geoghegan (#68)
Re: New IndexAM API controlling index vacuum strategies

On Thu, 18 Mar 2021 at 14:37, Peter Geoghegan <pg@bowt.ie> wrote:

They usually involve some *combination* of Postgres problems,
application code problems, and DBA error. Not any one thing. I've seen
problems with application code that runs DDL at scheduled intervals,
which interacts badly with vacuum -- but only really on the rare
occasions when freezing is required!

What I've seen is an application that regularly ran ANALYZE on a
table. This worked fine as long as vacuums took less than the interval
between analyzes (in this case 1h) but once vacuum started taking
longer than that interval autovacuum would cancel it every time due to
the conflicting lock.

That would have just continued until the wraparound vacuum which
wouldn't self-cancel except that there was also a demon running which
would look for sessions stuck on a lock and kill the blocker -- which
included killing the wraparound vacuum.

And yes, this demon is obviously a terrible idea but of course it was
meant for killing buggy user queries. It wasn't expecting to find
autovacuum jobs blocking things. The real surprise for that user was
that VACUUM could be blocked by things that someone would reasonably
want to run regularly like ANALYZE.

--
greg

#73Peter Geoghegan
pg@bowt.ie
In reply to: Greg Stark (#72)
Re: New IndexAM API controlling index vacuum strategies

On Sun, Mar 21, 2021 at 1:24 AM Greg Stark <stark@mit.edu> wrote:

What I've seen is an application that regularly ran ANALYZE on a
table. This worked fine as long as vacuums took less than the interval
between analyzes (in this case 1h) but once vacuum started taking
longer than that interval autovacuum would cancel it every time due to
the conflicting lock.

That would have just continued until the wraparound vacuum which
wouldn't self-cancel except that there was also a demon running which
would look for sessions stuck on a lock and kill the blocker -- which
included killing the wraparound vacuum.

That's a new one! Though clearly it's an example of what I described.
I do agree that sometimes the primary cause is the special rules for
cancellations with anti-wraparound autovacuums.

And yes, this demon is obviously a terrible idea but of course it was
meant for killing buggy user queries. It wasn't expecting to find
autovacuum jobs blocking things. The real surprise for that user was
that VACUUM could be blocked by things that someone would reasonably
want to run regularly like ANALYZE.

The infrastructure from my patch to eliminate the tupgone special case
(the patch that fully decouples index and heap vacuuming from pruning
and freezing) ought to enable smarter autovacuum cancellations. It
should be possible to make "canceling" an autovacuum worker actually
instruct the worker to consider the possibility of finishing off the
VACUUM operation very quickly, by simply ending index vacuuming (and
heap vacuuming). It should only be necessary to cancel when that
strategy won't work out, because we haven't finished all required
pruning and freezing yet -- which are the only truly essential tasks
of any "successful" VACUUM operation.

Maybe it would only be appropriate to do something like that for
anti-wraparound VACUUMs, which, as you say, don't get cancelled when
they block the acquisition of a lock (which is a sensible design,
though only because of the specific risk of not managing to advance
relfrozenxid). There wouldn't be a question of canceling an
anti-wraparound VACUUM in the conventional sense with this mechanism.
It would simply instruct the anti-wraparound VACUUM to finish as
quickly as possible by skipping the indexes. Naturally the
implementation wouldn't really need to consider whether that meant the
anti-wraparound VACUUM could end almost immediately, or some time
later -- the point is that it completes ASAP.
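
To make that a bit more concrete, here is a rough sketch of the shape I
have in mind (the flag/function name below is made up purely for
illustration; the VACOPT_CLEANUP_* values and the skipping variable are
from the v4-0003 patch).  The skipping logic in two_pass_strategy() could
additionally consult a "finish ASAP" flag that the would-be canceller sets
instead of interrupting the worker:

    /* Inside two_pass_strategy(), deciding whether to skip index vacuuming */
    if (index_cleanup == VACOPT_CLEANUP_DISABLED)
        skipping = true;
    else if (VacuumFinishAsapRequested())   /* hypothetical flag check */
        skipping = true;    /* end index (and so heap) vacuuming, finish up */
    else if (index_cleanup == VACOPT_CLEANUP_ENABLED)
        skipping = false;
    else
        skipping = false;   /* existing AUTO-mode heuristics go here */

The pruning and freezing already performed remains valid either way; only
the index and heap vacuuming passes get cut short.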

--
Peter Geoghegan

#74Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#71)
Re: New IndexAM API controlling index vacuum strategies

On Sat, Mar 20, 2021 at 11:05 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Mar 17, 2021 at 7:55 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached patch series splits everything up. There is now a large patch
that removes the tupgone special case, and a second patch that
actually adds code that dynamically decides to not do index vacuuming
in cases where (for whatever reason) it doesn't seem useful.

Attached is v4. This revision of the patch series is split up into
smaller pieces for easier review. There are now 3 patches in the
series:

Thank you for the patches!

1. A refactoring patch that takes code from lazy_scan_heap() and
breaks it into several new functions.

Not too many changes compared to the last revision here (mostly took
things out and put them in the second patch).

I've looked at this 0001 patch and here are some review comments:

+/*
+ *     scan_prune_page() -- lazy_scan_heap() pruning and freezing.
+ *
+ * Caller must hold pin and buffer cleanup lock on the buffer.
+ *
+ * Prior to PostgreSQL 14 there were very rare cases where lazy_scan_heap()
+ * treated tuples that still had storage after pruning as DEAD.  That happened
+ * when heap_page_prune() could not prune tuples that were nevertheless deemed
+ * DEAD by its own HeapTupleSatisfiesVacuum() call.  This created rare hard to
+ * test cases.  It meant that there was no very sharp distinction between DEAD
+ * tuples and tuples that are to be kept and be considered for freezing inside
+ * heap_prepare_freeze_tuple().  It also meant that lazy_vacuum_page() had to
+ * be prepared to remove items with storage (tuples with tuple headers) that
+ * didn't get pruned, which created a special case to handle recovery
+ * conflicts.
+ *
+ * The approach we take here now (to eliminate all of this complexity) is to
+ * simply restart pruning in these very rare cases -- cases where a concurrent
+ * abort of an xact makes our HeapTupleSatisfiesVacuum() call disagree with
+ * what heap_page_prune() thought about the tuple only microseconds earlier.
+ *
+ * Since we might have to prune a second time here, the code is structured to
+ * use a local per-page copy of the counters that caller accumulates.  We add
+ * our per-page counters to the per-VACUUM totals from caller last of all, to
+ * avoid double counting.

Those comments should be a part of 0002 patch?

---
+                       pc.num_tuples += 1;
+                       ps->hastup = true;
+
+                       /*
+                        * Each non-removable tuple must be checked to see if it needs
+                        * freezing
+                        */
+                       if (heap_prepare_freeze_tuple(tuple.t_data,
+                                                     RelFrozenXid, RelMinMxid,
+                                                     FreezeLimit, MultiXactCutoff,
+                                                     &frozen[nfrozen],
+                                                     &tuple_totally_frozen))
+                               frozen[nfrozen++].offset = offnum;
+
+                       pc.num_tuples += 1;
+                       ps->hastup = true;

pc.num_tuples is incremented twice. ps->hastup = true is also duplicated.
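
In other words, presumably only a single

+                       pc.num_tuples += 1;
+                       ps->hastup = true;

pair should remain, just before the heap_prepare_freeze_tuple() check.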

---
In step 7, with the patch, we save the freespace of the page and do
lazy_vacuum_page(). But should it be done in reverse?
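
That is, roughly the following ordering (a hand-written sketch, not the
patch's exact code), so that the recorded free space reflects the LP_DEAD
items that were just set unused:

    tupindex = lazy_vacuum_page(onerel, blkno, buf, tupindex, vacrelstats,
                                &vmbuffer);
    freespace = PageGetHeapFreeSpace(BufferGetPage(buf));
    UnlockReleaseBuffer(buf);
    RecordPageWithFreeSpace(onerel, blkno, freespace);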

---
+static void
+two_pass_strategy(Relation onerel, LVRelStats *vacrelstats,
+                                 Relation *Irel, IndexBulkDeleteResult **indstats, int nindexes,
+                                 LVParallelState *lps, VacOptTernaryValue index_cleanup)

How about renaming it to vacuum_two_pass_strategy() or something, to
make it clear that this function is used by vacuum?

---
+               /*
+                * skipped index vacuuming.  Make log report that lazy_vacuum_heap
+                * would've made.
+                *
+                * Don't report tups_vacuumed here because it will be zero here in
+                * common case where there are no newly pruned LP_DEAD items for this
+                * VACUUM.  This is roughly consistent with lazy_vacuum_heap(), and
+                * the similar !useindex ereport() at the end of lazy_scan_heap().
+                * Note, however, that has_dead_items_pages is # of heap pages with
+                * one or more LP_DEAD items (could be from us or from another
+                * VACUUM), not # blocks scanned.
+                */
+               ereport(elevel,
+                               (errmsg("\"%s\": INDEX_CLEANUP off forced VACUUM to not totally remove %d pruned items",
+                                               vacrelstats->relname,
+                                               vacrelstats->dead_tuples->num_tuples)));

It seems that the comment needs to be updated.

2. A patch to remove the tupgone case.

Several new and interesting changes here -- see below.

3. The patch to optimize VACUUM by teaching it to skip index and heap
vacuuming in certain cases where we only expect a very small benefit.

I’ll review the other two patches tomorrow.

We now go further with removing unnecessary stuff in WAL records in
the second patch. We also go further with simplifying heap page
vacuuming more generally.

I have invented a new record that is only used by heap page vacuuming.
This means that heap page pruning and heap page vacuuming no longer
share the same xl_heap_clean/XLOG_HEAP2_CLEAN WAL record (which is
what they do today, on master). Rather, there are two records:

* XLOG_HEAP2_PRUNE/xl_heap_prune -- actually just the new name for
xl_heap_clean, renamed to reflect the fact that only pruning uses it.

* XLOG_HEAP2_VACUUM/xl_heap_vacuum -- this one is truly new, though
it's actually just a very primitive version of xl_heap_prune -- since
of course heap page vacuuming is now so much simpler.

I didn't look at the 0002 patch in depth, but is the main difference
between those two WAL records that XLOG_HEAP2_PRUNE has the offset
numbers of unused, redirected, and dead items, whereas XLOG_HEAP2_VACUUM
has only the offset numbers of unused items?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#75Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#70)
Re: New IndexAM API controlling index vacuum strategies

On Thu, Mar 18, 2021 at 9:42 PM Peter Geoghegan <pg@bowt.ie> wrote:

The fact that we can *continually* reevaluate if an ongoing VACUUM is
at risk of taking too long is entirely the point here. We can in
principle end index vacuuming dynamically, whenever we feel like it
and for whatever reasons occur to us (hopefully these are good reasons
-- the point is that we get to pick and choose). We can afford to be
pretty aggressive about not giving up, while still having the benefit
of doing that when it *proves* necessary. Because: what are the
chances of the emergency mechanism ending index vacuuming being the
wrong thing to do if we only do that when the system clearly and
measurably has no more than about 10% of the possible XID space to go
before the system becomes unavailable for writes?

I agree. I was having trouble before understanding exactly what you
are proposing, but this makes sense to me and I agree it's a good
idea.

But ... should the thresholds for triggering these kinds of mechanisms
really be hard-coded with no possibility of being configured in the
field? What if we find out after the release is shipped that the
mechanism works better if you make it kick in sooner, or later, or if
it depends on other things about the system, which I think it almost
certainly does? Thresholds that can't be changed without a recompile
are bad news. That's why we have GUCs.

I'm fine with a GUC, though only for the emergency mechanism. The
default really matters, though -- it shouldn't be necessary to tune
(since we're trying to address a problem that many people don't know
they have until it's too late). I still like 1.8 billion XIDs as the
value -- I propose that that be made the default.

I'm not 100% sure whether we need a new GUC for this or not. I think
that if by default this triggers at the 90% of the hard-shutdown
limit, it would be unlikely, and perhaps unreasonable, for users to
want to raise the limit. However, I wonder whether some users will
want to lower the limit. Would it be reasonable for someone to want to
trigger this at 50% or 70% of XID exhaustion rather than waiting until
things get really bad?

Also, one thing that I dislike about the current system is that, from
a user perspective, when something goes wrong, nothing happens for a
while and then the whole system goes bananas. It seems desirable to me
to find ways of gradually ratcheting up the pressure, like cranking up
the effective cost limit if we can somehow figure out that we're not
keeping up. If, with your mechanism, there's an abrupt point when we
switch from never doing this for any table to always doing this for
every table, that might not be as good as something which does this
"sometimes" and then, if that isn't enough to avoid disaster, does it
"more," and eventually ramps up to doing it always, if trouble
continues. I don't know whether that's possible here, or what it would
look like, or even whether it's appropriate at all in this particular
case, so I just offer it as food for thought.

On another note, I cannot say enough bad things about the function
name two_pass_strategy(). I sincerely hope that you're not planning to
create a function which is a major point of control for VACUUM whose
name gives no hint that it has anything to do with vacuum.

You always hate my names for things. But that's fine by me -- I'm
usually not very attached to them. I'm happy to change it to whatever
you prefer.

My taste in names may be debatable, but including the subsystem name
in the function name seems like a pretty bare-minimum requirement,
especially when the other words in the function name could refer to
just about anything.

--
Robert Haas
EDB: http://www.enterprisedb.com

#76Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#75)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Mar 22, 2021 at 7:05 AM Robert Haas <robertmhaas@gmail.com> wrote:

I agree. I was having trouble before understanding exactly what you
are proposing, but this makes sense to me and I agree it's a good
idea.

Our ambition is for this to be one big multi-release umbrella project,
with several individual enhancements that each deliver a user-visible
benefit on their own. The fact that we're talking about a few things
at once is confusing, but I think that you need a "grand bargain" kind
of discussion for this. I believe that it actually makes sense to do
it that way, difficult though it may be.

Sometimes the goal is to improve performance, other times the goal is
to improve robustness. Although the distinction gets blurry at the
margins. If VACUUM were infinitely fast (say because of sorcery), then
performance would be *unbeatable* -- plus we'd never have to worry
about anti-wraparound vacuums not completing in time!

I'm not 100% sure whether we need a new GUC for this or not. I think
that if by default this triggers at the 90% of the hard-shutdown
limit, it would be unlikely, and perhaps unreasonable, for users to
want to raise the limit. However, I wonder whether some users will
want to lower the limit. Would it be reasonable for someone to want to
trigger this at 50% or 70% of XID exhaustion rather than waiting until
things get really bad?

I wanted to avoid inventing a GUC for the mechanism in the third patch
(not the emergency mechanism we're discussing right now, which was
posted separately by Masahiko). I think that a GUC to control skipping
index vacuuming purely because there are very few index tuples to
delete in indexes will become a burden before long. In particular, we
should eventually be able to vacuum some indexes but not others (on
the same table) based on the local needs of each index.

As I keep pointing out, bottom-up index deletion has created a
situation where there can be dramatically different needs among
indexes on the same table -- it can literally prevent 100% of all page
splits from version churn in those indexes that are never subject to
logical changes from non-HOT updates, whereas it does nothing for any
index that is logically updated, for the obvious reason. So there is
now frequently a sharp qualitative difference among indexes that
vacuumlazy.c currently imagines have basically the same needs.
Vacuuming these remaining indexes is a cost
that users will actually understand and accept, too.

But that has nothing to do with the emergency mechanism we're talking
about right now. I actually like your idea of making the emergency
mechanism a GUC. It's equivalent to index_cleanup, except that it is
continuous and dynamic (not discrete and static). The fact that this
GUC expresses what VACUUM should do in terms of the age of the target
table's current relfrozenxid (and nothing else) seems like exactly
the right thing. As I said before: What else could possibly matter? So
I don't see any risk of such a GUC becoming a burden. I also think
that it's a useful knob to be able to tune. It's also going to help a
lot with testing the feature. So let's have a GUC for that.
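
To make the shape of that check concrete, here is a minimal sketch of
the test such a GUC implies -- the GUC name (vacuum_failsafe_age) and
the helper are assumptions for illustration, not code from any posted
patch; only existing transam.h primitives are used:

extern int	vacuum_failsafe_age;	/* assumed GUC, ~1.8 billion default */

/*
 * Sketch only: should this VACUUM stop doing index vacuuming, based
 * purely on how far the table's relfrozenxid has fallen behind?
 */
static bool
failsafe_triggered(Relation rel)
{
	TransactionId relfrozenxid = rel->rd_rel->relfrozenxid;
	TransactionId xid_limit;

	/* the XID that is vacuum_failsafe_age transactions behind the next XID */
	xid_limit = ReadNextTransactionId() - vacuum_failsafe_age;
	if (!TransactionIdIsNormal(xid_limit))
		xid_limit = FirstNormalTransactionId;

	/* trigger once relfrozenxid has fallen behind that limit */
	return TransactionIdPrecedes(relfrozenxid, xid_limit);
}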

Also, one thing that I dislike about the current system is that, from
a user perspective, when something goes wrong, nothing happens for a
while and then the whole system goes bananas. It seems desirable to me
to find ways of gradually ratcheting up the pressure, like cranking up
the effective cost limit if we can somehow figure out that we're not
keeping up. If, with your mechanism, there's an abrupt point when we
switch from never doing this for any table to always doing this for
every table, that might not be as good as something which does this
"sometimes" and then, if that isn't enough to avoid disaster, does it
"more," and eventually ramps up to doing it always, if trouble
continues. I don't know whether that's possible here, or what it would
look like, or even whether it's appropriate at all in this particular
case, so I just offer it as food for thought.

That is exactly the kind of thing that some future highly evolved
version of the broader incremental/dynamic VACUUM design might do.
Your thoughts about the effective delay/throttling lessening as
conditions change are in line with the stuff that we're thinking of
doing. Though I don't believe Masahiko and I have talked about the
delay stuff specifically in any of our private discussions about it.

I am a big believer in the idea that we should have a variety of
strategies that are applied incrementally and dynamically, in response
to an immediate local need (say at the index level). VACUUM should be
able to organically figure out the best strategy (or combination of
strategies) itself, over time. Sometimes it will be very important to
recognize that most indexes have been able to use techniques like
bottom-up index deletion, and so really don't need to be vacuumed at
all. Other times the cost delay stuff will matter much more. Maybe
it's both together, even. The system ought to discover the best
approach dynamically. There will be tremendous variation across tables
and over time -- much too much for anybody to predict and understand
as a practical matter. The intellectually respectable term for what
I'm describing is a complex system.

My work on B-Tree index bloat led me to the idea that sometimes a
variety of strategies can be the real strategy. Take the example of
the benchmark that Victor Yegorov performed, which consisted of a
queue-based workload with deletes, inserts, and updates, plus
constantly holding snapshots for multiple minutes:

/messages/by-id/CAGnEbogATZS1mWMVX8FzZHMXzuDEcb10AnVwwhCtXtiBpg3XLQ@mail.gmail.com

Bottom-up index deletion appeared to practically eliminate index bloat
here. When we only had deduplication (without bottom-up deletion) the
indexes still ballooned in size. But I don't believe that that's a
100% accurate account. I think that it's more accurate to characterize
what we saw there as a case where deduplication and bottom-up deletion
complemented each other to great effect. If deduplication can buy you
time until the next page split (by reducing the space required for
recently dead but not totally dead index tuples caused by version
churn), and if bottom-up index deletion can avoid page splits (by
deleting now-totally-dead index tuples), then we shouldn't be too
surprised to see complementary effects. Though I have to admit that I
was quite surprised at how true this was in the case of Victor's
benchmark -- it worked very well with the workload, without any
designer predicting or understanding anything specific.

--
Peter Geoghegan

#77Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#68)
Re: New IndexAM API controlling index vacuum strategies

On Fri, Mar 19, 2021 at 3:36 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Mar 18, 2021 at 3:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

If we have the constant threshold of 1 billion transactions, a vacuum
operation might not be an anti-wraparound vacuum and even not be an
aggressive vacuum, depending on autovacuum_freeze_max_age value. Given
the purpose of skipping index vacuuming in this case, I think it
doesn't make sense to have non-aggressive vacuum skip index vacuuming
since it might not be able to advance relfrozenxid. If we have a
constant threshold, 2 billion transactions, maximum value of
autovacuum_freeze_max_age, seems to work.

I like the idea of not making the behavior a special thing that only
happens with a certain variety of VACUUM operation (non-aggressive or
anti-wraparound VACUUMs). Just having a very high threshold should be
enough.

Even if we're not going to be able to advance relfrozenxid, we'll
still finish much earlier and let a new anti-wraparound vacuum take
place that will do that -- and will be able to reuse much of the work
of the original VACUUM. Of course this anti-wraparound vacuum will
also skip index vacuuming from the start (whereas the first VACUUM may
well have done some index vacuuming before deciding to end index
vacuuming to hurry with finishing).

But we're not sure when the next anti-wraparound vacuum will take
place. Since the table has already been vacuumed by a non-aggressive
vacuum with index cleanup disabled, an autovacuum will process the
table when the table gets modified enough or the table's relfrozenxid
gets older than autovacuum_vacuum_max_age. If the new threshold,
probably a new GUC, is much lower than autovacuum_vacuum_max_age and
vacuum_freeze_table_age, the table is continuously vacuumed without
advancing relfrozenxid, leading to unnecessary index bloat. Given that
the new threshold is for emergency purposes (i.e., advancing
relfrozenxid faster), I think it might be better to use
vacuum_freeze_table_age as the lower bound of the new threshold. What
do you think?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#78Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#77)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Mar 22, 2021 at 6:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

But we're not sure when the next anti-wraparound vacuum will take
place. Since the table has already been vacuumed by a non-aggressive
vacuum with index cleanup disabled, an autovacuum will process the
table when the table gets modified enough or the table's relfrozenxid
gets older than autovacuum_vacuum_max_age. If the new threshold,
probably a new GUC, is much lower than autovacuum_vacuum_max_age and
vacuum_freeze_table_age, the table is continuously vacuumed without
advancing relfrozenxid, leading to unnecessary index bloat. Given that
the new threshold is for emergency purposes (i.e., advancing
relfrozenxid faster), I think it might be better to use
vacuum_freeze_table_age as the lower bound of the new threshold. What
do you think?

As you know, when the user sets vacuum_freeze_table_age to a value
that is greater than the value of autovacuum_vacuum_max_age, the two
GUCs have values that are contradictory. This contradiction is
resolved inside vacuum_set_xid_limits(), which knows that it should
"interpret" the value of vacuum_freeze_table_age as
(autovacuum_vacuum_max_age * 0.95) to paper-over the user's error.
This 0.95 behavior is documented in the user docs, though it happens
silently.

You seem to be concerned about a similar contradiction. In fact it's a
*very* similar contradiction, because this new GUC is naturally a
"sibling GUC" of both vacuum_freeze_table_age and
autovacuum_vacuum_max_age (the "units" are the same, though the
behavior that each GUC triggers is different -- but
vacuum_freeze_table_age and autovacuum_vacuum_max_age are both already
*similar and different* in the same way). So perhaps the solution
should be similar -- silently interpret the setting of the new GUC to
resolve the contradiction.

(Maybe I should say "these two new GUCs"? MultiXact variant might be needed...)

This approach has the following advantages:

* It follows precedent.

* It establishes that the new GUC is a logical extension of the
existing vacuum_freeze_table_age and autovacuum_vacuum_max_age GUCs.

* The default value for the new GUC will be so much higher (say 1.8
billion XIDs) than even the default of autovacuum_vacuum_max_age that
it won't disrupt anybody's existing postgresql.conf setup.

* For the same reason (the big space between autovacuum_vacuum_max_age
and the new GUC with default settings), you can almost set the new GUC
without needing to know about autovacuum_vacuum_max_age.

* The overall behavior isn't actually restrictive/paternalistic. That
is, if you know what you're doing (say you're testing the feature) you
can reduce all 3 sibling GUCs to 0 and get the testing behavior that
you desire.

What do you think?

--
Peter Geoghegan

#79Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#78)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Mar 22, 2021 at 8:28 PM Peter Geoghegan <pg@bowt.ie> wrote:

You seem to be concerned about a similar contradiction. In fact it's a
*very* similar contradiction, because this new GUC is naturally a
"sibling GUC" of both vacuum_freeze_table_age and
autovacuum_vacuum_max_age (the "units" are the same, though the
behavior that each GUC triggers is different -- but
vacuum_freeze_table_age and autovacuum_vacuum_max_age are both already
*similar and different* in the same way). So perhaps the solution
should be similar -- silently interpret the setting of the new GUC to
resolve the contradiction.

More concretely, maybe the new GUC is forced to be 1.05 of
vacuum_freeze_table_age. Of course that scheme is a bit arbitrary --
but so is the existing 0.95 scheme.

There may be some value in picking a scheme that "advertises" that all
three GUCs are symmetrical, or at least related -- all three divide up
the table's XID space.

--
Peter Geoghegan

#80Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#79)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Mar 22, 2021 at 8:33 PM Peter Geoghegan <pg@bowt.ie> wrote:

More concretely, maybe the new GUC is forced to be 1.05 of
vacuum_freeze_table_age. Of course that scheme is a bit arbitrary --
but so is the existing 0.95 scheme.

I meant to write 1.05 of autovacuum_vacuum_max_age. So just as
vacuum_freeze_table_age cannot really be greater than 0.95 *
autovacuum_vacuum_max_age, your new GUC cannot really be less than
1.05 * autovacuum_vacuum_max_age.
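
In code, that would amount to a clamp right next to the existing 0.95
logic in vacuum_set_xid_limits(). A sketch only (failsafe_age is a
stand-in name for the new GUC; autovacuum_freeze_max_age is the global
behind the setting written as autovacuum_vacuum_max_age above):

	/* assumed variable for the new GUC */
	int		failsafe_age = 1800000000;

	/*
	 * Mirror the existing 0.95 rule in the other direction: never let
	 * the new GUC undercut the anti-wraparound launch point.
	 */
	if (failsafe_age < autovacuum_freeze_max_age * 1.05)
		failsafe_age = autovacuum_freeze_max_age * 1.05;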

--
Peter Geoghegan

#81Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#78)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Mar 23, 2021 at 12:28 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Mar 22, 2021 at 6:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

But we're not sure when the next anti-wraparound vacuum will take
place. Since the table has already been vacuumed by a non-aggressive
vacuum with index cleanup disabled, an autovacuum will process the
table when the table gets modified enough or the table's relfrozenxid
gets older than autovacuum_vacuum_max_age. If the new threshold,
probably a new GUC, is much lower than autovacuum_vacuum_max_age and
vacuum_freeze_table_age, the table is continuously vacuumed without
advancing relfrozenxid, leading to unnecessary index bloat. Given that
the new threshold is for emergency purposes (i.e., advancing
relfrozenxid faster), I think it might be better to use
vacuum_freeze_table_age as the lower bound of the new threshold. What
do you think?

As you know, when the user sets vacuum_freeze_table_age to a value
that is greater than the value of autovacuum_vacuum_max_age, the two
GUCs have values that are contradictory. This contradiction is
resolved inside vacuum_set_xid_limits(), which knows that it should
"interpret" the value of vacuum_freeze_table_age as
(autovacuum_vacuum_max_age * 0.95) to paper-over the user's error.
This 0.95 behavior is documented in the user docs, though it happens
silently.

You seem to be concerned about a similar contradiction. In fact it's a
*very* similar contradiction, because this new GUC is naturally a
"sibling GUC" of both vacuum_freeze_table_age and
autovacuum_vacuum_max_age (the "units" are the same, though the
behavior that each GUC triggers is different -- but
vacuum_freeze_table_age and autovacuum_vacuum_max_age are both already
*similar and different* in the same way). So perhaps the solution
should be similar -- silently interpret the setting of the new GUC to
resolve the contradiction.

Yeah, that's exactly what I also thought.

(Maybe I should say "these two new GUCs"? MultiXact variant might be needed...)

Yes, I think we should also have one for MultiXact.

This approach has the following advantages:

* It follows precedent.

* It establishes that the new GUC is a logical extension of the
existing vacuum_freeze_table_age and autovacuum_vacuum_max_age GUCs.

* The default value for the new GUC will be so much higher (say 1.8
billion XIDs) than even the default of autovacuum_vacuum_max_age that
it won't disrupt anybody's existing postgresql.conf setup.

* For the same reason (the big space between autovacuum_vacuum_max_age
and the new GUC with default settings), you can almost set the new GUC
without needing to know about autovacuum_vacuum_max_age.

* The overall behavior isn't actually restrictive/paternalistic. That
is, if you know what you're doing (say you're testing the feature) you
can reduce all 3 sibling GUCs to 0 and get the testing behavior that
you desire.

What do you think?

Totally agreed.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#82Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#80)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Mar 23, 2021 at 12:41 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Mar 22, 2021 at 8:33 PM Peter Geoghegan <pg@bowt.ie> wrote:

More concretely, maybe the new GUC is forced to be 1.05 of
vacuum_freeze_table_age. Of course that scheme is a bit arbitrary --
but so is the existing 0.95 scheme.

I meant to write 1.05 of autovacuum_vacuum_max_age. So just as
vacuum_freeze_table_age cannot really be greater than 0.95 *
autovacuum_vacuum_max_age, your new GUC cannot really be less than
1.05 * autovacuum_vacuum_max_age.

That makes sense to me.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#83Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#74)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Mar 22, 2021 at 10:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sat, Mar 20, 2021 at 11:05 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Mar 17, 2021 at 7:55 PM Peter Geoghegan <pg@bowt.ie> wrote:

2. A patch to remove the tupgone case.

Several new and interesting changes here -- see below.

3. The patch to optimize VACUUM by teaching it to skip index and heap
vacuuming in certain cases where we only expect a very small benefit.

I’ll review the other two patches tomorrow.

Here are review comments on 0003 patch:

+   /*
+    * Check whether or not to do index vacuum and heap vacuum.
+    *
+    * We do both index vacuum and heap vacuum if more than
+    * SKIP_VACUUM_PAGES_RATIO of all heap pages have at least one LP_DEAD
+    * line pointer.  This is normally a case where dead tuples on the heap
+    * are highly concentrated in relatively few heap blocks, where the
+    * index's enhanced deletion mechanism that is clever about heap block
+    * dead tuple concentrations including btree's bottom-up index deletion
+    * works well.  Also, since we can clean only a few heap blocks, it would
+    * be a less negative impact in terms of visibility map update.
+    *
+    * If we skip vacuum, we just ignore the collected dead tuples.  Note that
+    * vacrelstats->dead_tuples could have tuples which became dead after
+    * HOT-pruning but are not marked dead yet.  We do not process them
+    * because it's a very rare condition, and the next vacuum will process
+    * them anyway.
+    */

The second paragraph is no longer true after removing the 'tupgone' case.
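
For reference, the decision that comment describes boils down to a
simple page-ratio test, roughly as below. The constant's value here is
an assumption; the real definition comes from the patch itself:

#define SKIP_VACUUM_PAGES_RATIO	0.01	/* assumed value */

	bool	do_index_and_heap_vacuum;

	/* do the two-pass work only if enough heap pages carry LP_DEAD items */
	do_index_and_heap_vacuum =
		has_dead_items_pages >
		(BlockNumber) (vacrelstats->rel_pages * SKIP_VACUUM_PAGES_RATIO);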

---
    if (dead_tuples->num_tuples > 0)
        two_pass_strategy(onerel, vacrelstats, Irel, indstats, nindexes,
-                         lps, params->index_cleanup);
+                         lps, params->index_cleanup,
+                         has_dead_items_pages, !calledtwopass);

Maybe we can use vacrelstats->num_index_scans instead of
calledtwopass? When calling to two_pass_strategy() at the end of
lazy_scan_heap(), if vacrelstats->num_index_scans is 0 it means this
is the first time call, which is equivalent to calledtwopass = false.

---
-           params.index_cleanup = get_vacopt_ternary_value(opt);
+       {
+           if (opt->arg == NULL || strcmp(defGetString(opt), "auto") == 0)
+               params.index_cleanup = VACOPT_CLEANUP_AUTO;
+           else if (defGetBoolean(opt))
+               params.index_cleanup = VACOPT_CLEANUP_ENABLED;
+           else
+               params.index_cleanup = VACOPT_CLEANUP_DISABLED;
+       }
+   /*
+    * Set index cleanup option based on reloptions if not set to either ON or
+    * OFF.  Note that an VACUUM(INDEX_CLEANUP=AUTO) command is interpreted as
+    * "prefer reloption, but if it's not set dynamically determine if index
+    * vacuuming and cleanup" takes place in vacuumlazy.c.  Note also that the
+    * reloption might be explicitly set to AUTO.
+    *
+    * XXX: Do we really want that?
+    */
+   if (params->index_cleanup == VACOPT_CLEANUP_AUTO &&
+       onerel->rd_options != NULL)
+       params->index_cleanup =
+           ((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup;

Perhaps we can make the INDEX_CLEANUP option a four-value option: on,
off, auto, and default? A problem with the above change would be that
if the user wants "auto" mode, they might need to reset the
vacuum_index_cleanup reloption before executing the VACUUM command. In
other words, there is no way in the VACUUM command to force "auto"
mode. So I think we can add an "auto" value to the INDEX_CLEANUP option
and ignore the vacuum_index_cleanup reloption if that value is specified.
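
Concretely, the parsing could distinguish an explicit "auto" from the
option being absent along these lines. This is only a sketch reusing
the enum values from the quoted hunk, with VACOPT_CLEANUP_DEFAULT as an
assumed fourth value that params.index_cleanup is initialized to before
option parsing:

		if (strcmp(opt->defname, "index_cleanup") == 0)
		{
			/* an explicit "auto" overrides the reloption */
			if (opt->arg != NULL &&
				strcmp(defGetString(opt), "auto") == 0)
				params.index_cleanup = VACOPT_CLEANUP_AUTO;
			else if (defGetBoolean(opt))
				params.index_cleanup = VACOPT_CLEANUP_ENABLED;
			else
				params.index_cleanup = VACOPT_CLEANUP_DISABLED;
		}

If INDEX_CLEANUP is never specified, params.index_cleanup keeps
VACOPT_CLEANUP_DEFAULT and later falls back to the vacuum_index_cleanup
reloption.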

Are you also updating the 0003 patch? If you're focusing on the 0001
and 0002 patches, I'll update the 0003 patch along with the fourth
patch (skipping index vacuuming in emergency cases).

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#84Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#74)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Mar 22, 2021 at 6:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've looked at this 0001 patch and here are some review comments:

+ * Since we might have to prune a second time here, the code is structured to
+ * use a local per-page copy of the counters that caller accumulates.  We add
+ * our per-page counters to the per-VACUUM totals from caller last of all, to
+ * avoid double counting.

Should those comments be part of the 0002 patch?

Right -- will fix.

pc.num_tuples is incremented twice. ps->hastup = true is also duplicated.

Must have been a mistake when splitting the patch up -- will fix.

---
In step 7, with the patch, we save the freespace of the page and do
lazy_vacuum_page(). But should it be done in reverse?

How about renaming it to vacuum_two_pass_strategy() or something, to
make it clear that this function is used by vacuum?

Okay. I will rename it to lazy_vacuum_pruned_items().

vacrelstats->dead_tuples->num_tuples)));

It seems that the comment needs to be updated.

Will fix.

I’ll review the other two patches tomorrow.

And I'll respond to your remarks on those (which are already posted
now) separately.

I didn't look at the 0002 patch in depth, but is the main difference
between those two WAL records that XLOG_HEAP2_PRUNE has the offset
numbers of unused, redirected, and dead items, whereas XLOG_HEAP2_VACUUM
has only the offset numbers of unused items?

That's one difference. Another difference is that there is no
latestRemovedXid field. And there is a third difference: we no longer
need a super-exclusive lock for heap page vacuuming (not pruning) with
this design -- which also means that we cannot defragment the page
during heap vacuuming (that's unsafe with only an exclusive lock
because it physically relocates tuples with storage that somebody else
may have a C pointer to and expect to stay sane). These
differences during original execution of heap page vacuum necessitate
inventing a new REDO routine that does things in exactly the same way.
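
In terms of record layout, that split comes down to roughly the
following -- a sketch reconstructed from this description, not copied
from the patch:

/* sketch based on the description above, not taken from the patch */
typedef struct xl_heap_prune
{
	TransactionId latestRemovedXid;
	uint16		nredirected;
	uint16		ndead;
	/* OFFSET NUMBERS of redirected, dead, then unused items follow */
} xl_heap_prune;

typedef struct xl_heap_vacuum
{
	uint16		nunused;

	/*
	 * OFFSET NUMBERS of the now-LP_UNUSED items follow.  There is no
	 * latestRemovedXid, and REDO needs only an exclusive lock, since no
	 * page defragmentation happens here.
	 */
} xl_heap_vacuum;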

To put it another way, heap vacuuming is now very similar to index
vacuuming (both are dissimilar to heap pruning). They're very simple,
and 100% a matter of freeing space in physical data structures.
Clearly that's always something that we can put off if it makes sense
to do so. That high level simplicity seems important to me. I always
disliked the way the WAL records for vacuumlazy.c worked. Especially
the XLOG_HEAP2_CLEANUP_INFO record -- that one is terrible.

--
Peter Geoghegan

#85Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#83)
3 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Mar 23, 2021 at 4:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Here are review comments on 0003 patch:

Attached is a new revision, v5. It fixes bit rot caused by recent
changes (your index autovacuum logging stuff). It has also been
cleaned up in response to your recent review comments -- both from
this email, and the other review email that I responded to separately
today.

+    * If we skip vacuum, we just ignore the collected dead tuples.  Note that
+    * vacrelstats->dead_tuples could have tuples which became dead after
+    * HOT-pruning but are not marked dead yet.  We do not process them
+    * because it's a very rare condition, and the next vacuum will process
+    * them anyway.
+    */

The second paragraph is no longer true after removing the 'tupgone' case.

Fixed.

Maybe we can use vacrelstats->num_index_scans instead of
calledtwopass? When calling to two_pass_strategy() at the end of
lazy_scan_heap(), if vacrelstats->num_index_scans is 0 it means this
is the first time call, which is equivalent to calledtwopass = false.

It's true that when "vacrelstats->num_index_scans > 0" it definitely
can't have been the first call. But how can we distinguish between 1.)
the case where we're being called for the first time, and 2.) the case
where it's the second call, but the first call actually skipped index
vacuuming? When we skip index vacuuming we won't increment
num_index_scans (which seems appropriate to me).

For now I have added an assertion that "vacrelstats->num_index_scans ==
0" at the point where we apply skipping indexes as an optimization
(i.e. the point where the patch 0003- mechanism is applied).

Perhaps we can make the INDEX_CLEANUP option a four-value option: on,
off, auto, and default? A problem with the above change would be that
if the user wants "auto" mode, they might need to reset the
vacuum_index_cleanup reloption before executing the VACUUM command. In
other words, there is no way in the VACUUM command to force "auto"
mode. So I think we can add an "auto" value to the INDEX_CLEANUP option
and ignore the vacuum_index_cleanup reloption if that value is specified.

I agree that this aspect definitely needs more work. I'll leave it to you to
do this in a separate revision of this new 0003 patch (so no changes here
from me for v5).

Are you also updating the 0003 patch? If you're focusing on the 0001
and 0002 patches, I'll update the 0003 patch along with the fourth
patch (skipping index vacuuming in emergency cases).

I suggest that you start integrating it with the wraparound emergency
mechanism, which can become patch 0004- of the patch series. You can
manage 0003- and 0004- now. You can post revisions of each of those
two independently of my revisions. What do you think? I have included
0003- for now because you had review comments on it that I worked
through, but you should own that, I think.

I suppose that you should include the versions of 0001- and 0002- you
worked off of, just for the convenience of others/to keep the CF
tester happy. I don't think that I'm going to make many changes that
will break your patch, except for obvious bit rot that can be fixed
through fairly mechanical rebasing.

Thanks
--
Peter Geoghegan

Attachments:

v5-0001-Refactor-vacuumlazy.c.patch (application/x-patch)
From f385805e434a67251d546a765b75e4582666f8c6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 13 Mar 2021 20:37:32 -0800
Subject: [PATCH v5 1/3] Refactor vacuumlazy.c.

Break up lazy_scan_heap() into functions.

Aside from being useful cleanup work in its own right, this is also
preparation for an upcoming patch that removes the "tupgone" special
case from vacuumlazy.c.
---
 src/backend/access/heap/vacuumlazy.c  | 1358 +++++++++++++++----------
 contrib/pg_visibility/pg_visibility.c |    8 +-
 contrib/pgstattuple/pgstatapprox.c    |    8 +-
 3 files changed, 808 insertions(+), 566 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index efe8761702..9bebb94968 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -294,8 +294,6 @@ typedef struct LVRelStats
 {
 	char	   *relnamespace;
 	char	   *relname;
-	/* useindex = true means two-pass strategy; false means one-pass */
-	bool		useindex;
 	/* Overall statistics about rel */
 	BlockNumber old_rel_pages;	/* previous value of pg_class.relpages */
 	BlockNumber rel_pages;		/* total number of pages */
@@ -334,9 +332,47 @@ typedef struct LVSavedErrInfo
 	VacErrPhase phase;
 } LVSavedErrInfo;
 
+/*
+ * Counters maintained by lazy_scan_heap() (and scan_prune_page()):
+ */
+typedef struct LVTempCounters
+{
+	double		num_tuples;		/* total number of nonremovable tuples */
+	double		live_tuples;	/* live tuples (reltuples estimate) */
+	double		tups_vacuumed;	/* tuples cleaned up by current vacuum */
+	double		nkeep;			/* dead-but-not-removable tuples */
+	double		nunused;		/* # existing unused line pointers */
+} LVTempCounters;
+
+/*
+ * State output by scan_prune_page():
+ */
+typedef struct LVPrunePageState
+{
+	bool		hastup;			/* Page is truncatable? */
+	bool		has_dead_items; /* includes existing LP_DEAD items */
+	bool		all_visible;	/* Every item visible to all? */
+	bool		all_frozen;		/* provided all_visible is also true */
+} LVPrunePageState;
+
+/*
+ * State set up and maintained in lazy_scan_heap() (also maintained in
+ * scan_prune_page()) that represents VM bit status.
+ *
+ * Used by scan_setvmbit_page() when we're done pruning.
+ */
+typedef struct LVVisMapPageState
+{
+	bool		all_visible_according_to_vm;
+	TransactionId visibility_cutoff_xid;
+} LVVisMapPageState;
+
 /* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
 
+static TransactionId RelFrozenXid;
+static MultiXactId RelMinMxid;
+
 static TransactionId OldestXmin;
 static TransactionId FreezeLimit;
 static MultiXactId MultiXactCutoff;
@@ -348,6 +384,10 @@ static BufferAccessStrategy vac_strategy;
 static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 						   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
 						   bool aggressive);
+static void lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
+									 Relation *Irel, int nindexes,
+									 LVParallelState* lps,
+									 VacOptTernaryValue index_cleanup);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
@@ -366,7 +406,8 @@ static bool should_attempt_truncation(VacuumParams *params,
 static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
 static BlockNumber count_nondeletable_pages(Relation onerel,
 											LVRelStats *vacrelstats);
-static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
+static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks,
+							 bool hasindex);
 static void lazy_record_dead_tuple(LVDeadTuples *dead_tuples,
 								   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
@@ -449,10 +490,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
-	/* not every AM requires these to be valid, but heap does */
-	Assert(TransactionIdIsNormal(onerel->rd_rel->relfrozenxid));
-	Assert(MultiXactIdIsValid(onerel->rd_rel->relminmxid));
-
 	/* measure elapsed time iff autovacuum logging requires it */
 	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
 	{
@@ -475,6 +512,13 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 
 	vac_strategy = bstrategy;
 
+	RelFrozenXid = onerel->rd_rel->relfrozenxid;
+	RelMinMxid = onerel->rd_rel->relminmxid;
+
+	/* not every AM requires these to be valid, but heap does */
+	Assert(TransactionIdIsNormal(RelFrozenXid));
+	Assert(MultiXactIdIsValid(RelMinMxid));
+
 	vacuum_set_xid_limits(onerel,
 						  params->freeze_min_age,
 						  params->freeze_table_age,
@@ -510,8 +554,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 
 	/* Open all indexes of the relation */
 	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
-	vacrelstats->useindex = (nindexes > 0 &&
-							 params->index_cleanup == VACOPT_TERNARY_ENABLED);
 
 	vacrelstats->indstats = (IndexBulkDeleteResult **)
 		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
@@ -780,6 +822,531 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
 		(void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
 }
 
+/*
+ * Handle new page during lazy_scan_heap().
+ *
+ * Caller must hold pin and buffer cleanup lock on buf.
+ *
+ * All-zeroes pages can be left over if either a backend extends the relation
+ * by a single page, but crashes before the newly initialized page has been
+ * written out, or when bulk-extending the relation (which creates a number of
+ * empty pages at the tail end of the relation, but enters them into the FSM).
+ *
+ * Note we do not enter the page into the visibilitymap. That has the downside
+ * that we repeatedly visit this page in subsequent vacuums, but otherwise
+ * we'll never not discover the space on a promoted standby. The harm of
+ * repeated checking ought to normally not be too bad - the space usually
+ * should be used at some point, otherwise there wouldn't be any regular
+ * vacuums.
+ *
+ * Make sure these pages are in the FSM, to ensure they can be reused. Do that
+ * by testing if there's any space recorded for the page. If not, enter it. We
+ * do so after releasing the lock on the heap page, the FSM is approximate,
+ * after all.
+ */
+static void
+scan_new_page(Relation onerel, Buffer buf)
+{
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+
+	if (GetRecordedFreeSpace(onerel, blkno) == 0)
+	{
+		Size		freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
+
+		UnlockReleaseBuffer(buf);
+		RecordPageWithFreeSpace(onerel, blkno, freespace);
+		return;
+	}
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Handle empty page during lazy_scan_heap().
+ *
+ * Caller must hold pin and buffer cleanup lock on buf, as well as a pin (but
+ * not a lock) on vmbuffer.
+ */
+static void
+scan_empty_page(Relation onerel, Buffer buf, Buffer vmbuffer,
+				LVRelStats *vacrelstats)
+{
+	Page		page = BufferGetPage(buf);
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+	Size		freespace = PageGetHeapFreeSpace(page);
+
+	/*
+	 * Empty pages are always all-visible and all-frozen (note that the same
+	 * is currently not true for new pages, see scan_new_page()).
+	 */
+	if (!PageIsAllVisible(page))
+	{
+		START_CRIT_SECTION();
+
+		/* mark buffer dirty before writing a WAL record */
+		MarkBufferDirty(buf);
+
+		/*
+		 * It's possible that another backend has extended the heap,
+		 * initialized the page, and then failed to WAL-log the page due to an
+		 * ERROR.  Since heap extension is not WAL-logged, recovery might try
+		 * to replay our record setting the page all-visible and find that the
+		 * page isn't initialized, which will cause a PANIC.  To prevent that,
+		 * check whether the page has been previously WAL-logged, and if not,
+		 * do that now.
+		 */
+		if (RelationNeedsWAL(onerel) &&
+			PageGetLSN(page) == InvalidXLogRecPtr)
+			log_newpage_buffer(buf, true);
+
+		PageSetAllVisible(page);
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, InvalidTransactionId,
+						  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
+		END_CRIT_SECTION();
+	}
+
+	UnlockReleaseBuffer(buf);
+	RecordPageWithFreeSpace(onerel, blkno, freespace);
+}
+
+/*
+ *	scan_prune_page() -- lazy_scan_heap() pruning and freezing.
+ *
+ * Caller must hold pin and buffer cleanup lock on the buffer.
+ */
+static void
+scan_prune_page(Relation onerel, Buffer buf,
+				LVRelStats *vacrelstats,
+				GlobalVisState *vistest, xl_heap_freeze_tuple *frozen,
+				LVTempCounters *c, LVPrunePageState *ps,
+				LVVisMapPageState *vms,
+				VacOptTernaryValue index_cleanup)
+{
+	BlockNumber blkno;
+	Page		page;
+	OffsetNumber offnum,
+				maxoff;
+	int			nfrozen,
+				ndead;
+	LVTempCounters pc;
+	OffsetNumber deaditems[MaxHeapTuplesPerPage];
+	bool		tupgone;
+
+	blkno = BufferGetBlockNumber(buf);
+	page = BufferGetPage(buf);
+
+	/* Initialize (or reset) page-level counters */
+	pc.num_tuples = 0;
+	pc.live_tuples = 0;
+	pc.tups_vacuumed = 0;
+	pc.nkeep = 0;
+	pc.nunused = 0;
+
+	/*
+	 * Prune all HOT-update chains in this page.
+	 *
+	 * We count tuples removed by the pruning step as removed by VACUUM
+	 * (existing LP_DEAD line pointers don't count).
+	 */
+	pc.tups_vacuumed = heap_page_prune(onerel, buf, vistest,
+									   InvalidTransactionId, 0, false,
+									   &vacrelstats->latestRemovedXid,
+									   &vacrelstats->offnum);
+
+	/*
+	 * Now scan the page to collect vacuumable items and check for tuples
+	 * requiring freezing.
+	 */
+	ps->hastup = false;
+	ps->has_dead_items = false;
+	ps->all_visible = true;
+	ps->all_frozen = true;
+	nfrozen = 0;
+	ndead = 0;
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	tupgone = false;
+
+	/*
+	 * Note: If you change anything in the loop below, also look at
+	 * heap_page_is_all_visible to see if that needs to be changed.
+	 */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+		HeapTupleData tuple;
+
+		/*
+		 * Set the offset number so that we can display it along with any
+		 * error that occurred while processing this tuple.
+		 */
+		vacrelstats->offnum = offnum;
+		itemid = PageGetItemId(page, offnum);
+
+		/* Unused items require no processing, but we count 'em */
+		if (!ItemIdIsUsed(itemid))
+		{
+			pc.nunused += 1;
+			continue;
+		}
+
+		/* Redirect items mustn't be touched */
+		if (ItemIdIsRedirected(itemid))
+		{
+			ps->hastup = true;	/* this page won't be truncatable */
+			continue;
+		}
+
+		/*
+		 * LP_DEAD line pointers are to be vacuumed normally; but we don't
+		 * count them in tups_vacuumed, else we'd be double-counting (at least
+		 * in the common case where heap_page_prune() just freed up a non-HOT
+		 * tuple).
+		 *
+		 * Note also that the final tups_vacuumed value might be very low for
+		 * tables where opportunistic page pruning happens to occur very
+		 * frequently (via heap_page_prune_opt() calls that free up non-HOT
+		 * tuples).
+		 */
+		if (ItemIdIsDead(itemid))
+		{
+			deaditems[ndead++] = offnum;
+			ps->all_visible = false;
+			ps->has_dead_items = true;
+			continue;
+		}
+
+		Assert(ItemIdIsNormal(itemid));
+
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationGetRelid(onerel);
+
+		/*
+		 * The criteria for counting a tuple as live in this block need to
+		 * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
+		 * and ANALYZE may produce wildly different reltuples values, e.g.
+		 * when there are many recently-dead tuples.
+		 *
+		 * The logic here is a bit simpler than acquire_sample_rows(), as
+		 * VACUUM can't run inside a transaction block, which makes some cases
+		 * impossible (e.g. in-progress insert from the same transaction).
+		 */
+		switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+		{
+			case HEAPTUPLE_DEAD:
+
+				/*
+				 * Ordinarily, DEAD tuples would have been removed by
+				 * heap_page_prune(), but it's possible that the tuple state
+				 * changed since heap_page_prune() looked.  In particular an
+				 * INSERT_IN_PROGRESS tuple could have changed to DEAD if the
+				 * inserter aborted.  So this cannot be considered an error
+				 * condition.
+				 *
+				 * If the tuple is HOT-updated then it must only be removed by
+				 * a prune operation; so we keep it just as if it were
+				 * RECENTLY_DEAD.  Also, if it's a heap-only tuple, we choose
+				 * to keep it, because it'll be a lot cheaper to get rid of it
+				 * in the next pruning pass than to treat it like an indexed
+				 * tuple. Finally, if index cleanup is disabled, the second
+				 * heap pass will not execute, and the tuple will not get
+				 * removed, so we must treat it like any other dead tuple that
+				 * we choose to keep.
+				 *
+				 * If this were to happen for a tuple that actually needed to
+				 * be deleted, we'd be in trouble, because it'd possibly leave
+				 * a tuple below the relation's xmin horizon alive.
+				 * heap_prepare_freeze_tuple() is prepared to detect that case
+				 * and abort the transaction, preventing corruption.
+				 */
+				if (HeapTupleIsHotUpdated(&tuple) ||
+					HeapTupleIsHeapOnly(&tuple) ||
+					index_cleanup == VACOPT_TERNARY_DISABLED)
+					pc.nkeep += 1;
+				else
+					tupgone = true; /* we can delete the tuple */
+				ps->all_visible = false;
+				break;
+			case HEAPTUPLE_LIVE:
+
+				/*
+				 * Count it as live.  Not only is this natural, but it's also
+				 * what acquire_sample_rows() does.
+				 */
+				pc.live_tuples += 1;
+
+				/*
+				 * Is the tuple definitely visible to all transactions?
+				 *
+				 * NB: Like with per-tuple hint bits, we can't set the
+				 * PD_ALL_VISIBLE flag if the inserter committed
+				 * asynchronously. See SetHintBits for more info. Check that
+				 * the tuple is hinted xmin-committed because of that.
+				 */
+				if (ps->all_visible)
+				{
+					TransactionId xmin;
+
+					if (!HeapTupleHeaderXminCommitted(tuple.t_data))
+					{
+						ps->all_visible = false;
+						break;
+					}
+
+					/*
+					 * The inserter definitely committed. But is it old enough
+					 * that everyone sees it as committed?
+					 */
+					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
+					if (!TransactionIdPrecedes(xmin, OldestXmin))
+					{
+						ps->all_visible = false;
+						break;
+					}
+
+					/* Track newest xmin on page. */
+					if (TransactionIdFollows(xmin, vms->visibility_cutoff_xid))
+						vms->visibility_cutoff_xid = xmin;
+				}
+				break;
+			case HEAPTUPLE_RECENTLY_DEAD:
+
+				/*
+				 * If tuple is recently deleted then we must not remove it
+				 * from relation.
+				 */
+				pc.nkeep += 1;
+				ps->all_visible = false;
+				break;
+			case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+				/*
+				 * This is an expected case during concurrent vacuum.
+				 *
+				 * We do not count these rows as live, because we expect the
+				 * inserting transaction to update the counters at commit, and
+				 * we assume that will happen only after we report our
+				 * results.  This assumption is a bit shaky, but it is what
+				 * acquire_sample_rows() does, so be consistent.
+				 */
+				ps->all_visible = false;
+				break;
+			case HEAPTUPLE_DELETE_IN_PROGRESS:
+				/* This is an expected case during concurrent vacuum */
+				ps->all_visible = false;
+
+				/*
+				 * Count such rows as live.  As above, we assume the deleting
+				 * transaction will commit and update the counters after we
+				 * report.
+				 */
+				pc.live_tuples += 1;
+				break;
+			default:
+				elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+				break;
+		}
+
+		if (tupgone)
+		{
+			deaditems[ndead++] = offnum;
+			HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
+												   &vacrelstats->latestRemovedXid);
+			pc.tups_vacuumed += 1;
+			ps->has_dead_items = true;
+		}
+		else
+		{
+			bool		tuple_totally_frozen;
+
+			/*
+			 * Each non-removable tuple must be checked to see if it needs
+			 * freezing
+			 */
+			if (heap_prepare_freeze_tuple(tuple.t_data,
+										  RelFrozenXid, RelMinMxid,
+										  FreezeLimit, MultiXactCutoff,
+										  &frozen[nfrozen],
+										  &tuple_totally_frozen))
+				frozen[nfrozen++].offset = offnum;
+
+			pc.num_tuples += 1;
+			ps->hastup = true;
+
+			if (!tuple_totally_frozen)
+				ps->all_frozen = false;
+		}
+	}
+
+	/*
+	 * Success -- we're done pruning, and have determined which tuples are to
+	 * be recorded as dead in local array.  We've also prepared the details of
+	 * which remaining tuples are to be frozen.
+	 *
+	 * First clear the offset information once we have processed all the
+	 * tuples on the page.
+	 */
+	vacrelstats->offnum = InvalidOffsetNumber;
+
+	/*
+	 * Next add page level counters to caller's counts
+	 */
+	c->num_tuples += pc.num_tuples;
+	c->live_tuples += pc.live_tuples;
+	c->tups_vacuumed += pc.tups_vacuumed;
+	c->nkeep += pc.nkeep;
+	c->nunused += pc.nunused;
+
+	/*
+	 * Now save the local dead items array to VACUUM's dead_tuples array.
+	 */
+	for (int i = 0; i < ndead; i++)
+	{
+		ItemPointerData itemptr;
+
+		ItemPointerSet(&itemptr, blkno, deaditems[i]);
+		lazy_record_dead_tuple(vacrelstats->dead_tuples, &itemptr);
+	}
+
+	/*
+	 * Finally, execute tuple freezing as planned.
+	 *
+	 * If we need to freeze any tuples we'll mark the buffer dirty, and write
+	 * a WAL record recording the changes.  We must log the changes to be
+	 * crash-safe against future truncation of CLOG.
+	 */
+	if (nfrozen > 0)
+	{
+		START_CRIT_SECTION();
+
+		MarkBufferDirty(buf);
+
+		/* execute collected freezes */
+		for (int i = 0; i < nfrozen; i++)
+		{
+			ItemId		itemid;
+			HeapTupleHeader htup;
+
+			itemid = PageGetItemId(page, frozen[i].offset);
+			htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+			heap_execute_freeze_tuple(htup, &frozen[i]);
+		}
+
+		/* Now WAL-log freezing if necessary */
+		if (RelationNeedsWAL(onerel))
+		{
+			XLogRecPtr	recptr;
+
+			recptr = log_heap_freeze(onerel, buf, FreezeLimit,
+									 frozen, nfrozen);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+	}
+}
+
+/*
+ * Handle setting VM bit inside lazy_scan_heap(), after pruning and freezing.
+ */
+static void
+scan_setvmbit_page(Relation onerel, Buffer buf, Buffer vmbuffer,
+				   LVPrunePageState *ps, LVVisMapPageState *vms)
+{
+	Page		page = BufferGetPage(buf);
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+
+	/* mark page all-visible, if appropriate */
+	if (ps->all_visible && !vms->all_visible_according_to_vm)
+	{
+		uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
+
+		if (ps->all_frozen)
+			flags |= VISIBILITYMAP_ALL_FROZEN;
+
+		/*
+		 * It should never be the case that the visibility map page is set
+		 * while the page-level bit is clear, but the reverse is allowed (if
+		 * checksums are not enabled).  Regardless, set both bits so that we
+		 * get back in sync.
+		 *
+		 * NB: If the heap page is all-visible but the VM bit is not set, we
+		 * don't need to dirty the heap page.  However, if checksums are
+		 * enabled, we do need to make sure that the heap page is dirtied
+		 * before passing it to visibilitymap_set(), because it may be logged.
+		 * Given that this situation should only happen in rare cases after a
+		 * crash, it is not worth optimizing.
+		 */
+		PageSetAllVisible(page);
+		MarkBufferDirty(buf);
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, vms->visibility_cutoff_xid, flags);
+	}
+
+	/*
+	 * The visibility map bit should never be set if the page-level bit is
+	 * clear.  However, it's possible that the bit got cleared after we
+	 * checked it and before we took the buffer content lock, so we must
+	 * recheck before jumping to the conclusion that something bad has
+	 * happened.
+	 */
+	else if (vms->all_visible_according_to_vm && !PageIsAllVisible(page) &&
+			 VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
+	{
+		elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+			 RelationGetRelationName(onerel), blkno);
+		visibilitymap_clear(onerel, blkno, vmbuffer,
+							VISIBILITYMAP_VALID_BITS);
+	}
+
+	/*
+	 * It's possible for the value returned by
+	 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
+	 * wrong for us to see tuples that appear to not be visible to everyone
+	 * yet, while PD_ALL_VISIBLE is already set. The real safe xmin value
+	 * never moves backwards, but GetOldestNonRemovableTransactionId() is
+	 * conservative and sometimes returns a value that's unnecessarily small,
+	 * so if we see that contradiction it just means that the tuples that we
+	 * think are not visible to everyone yet actually are, and the
+	 * PD_ALL_VISIBLE flag is correct.
+	 *
+	 * There should never be dead tuples on a page with PD_ALL_VISIBLE set,
+	 * however.
+	 */
+	else if (PageIsAllVisible(page) && ps->has_dead_items)
+	{
+		elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+			 RelationGetRelationName(onerel), blkno);
+		PageClearAllVisible(page);
+		MarkBufferDirty(buf);
+		visibilitymap_clear(onerel, blkno, vmbuffer,
+							VISIBILITYMAP_VALID_BITS);
+	}
+
+	/*
+	 * If the all-visible page is all-frozen but not marked as such yet, mark
+	 * it as all-frozen.  Note that all_frozen is only valid if all_visible is
+	 * true, so we must check both.
+	 */
+	else if (vms->all_visible_according_to_vm && ps->all_visible &&
+			 ps->all_frozen && !VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+	{
+		/*
+		 * We can pass InvalidTransactionId as the cutoff XID here, because
+		 * setting the all-frozen bit doesn't cause recovery conflicts.
+		 */
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, InvalidTransactionId,
+						  VISIBILITYMAP_ALL_FROZEN);
+	}
+}
+
 /*
  *	lazy_scan_heap() -- scan an open heap relation
  *
@@ -788,9 +1355,9 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
  *		page, and set commit status bits (see heap_page_prune).  It also builds
  *		lists of dead tuples and pages with free space, calculates statistics
  *		on the number of live tuples in the heap, and marks pages as
- *		all-visible if appropriate.  When done, or when we run low on space for
- *		dead-tuple TIDs, invoke vacuuming of indexes and call lazy_vacuum_heap
- *		to reclaim dead line pointers.
+ *		all-visible if appropriate.  When done, or when we run low on space
+ *		for dead-tuple TIDs, invoke lazy_vacuum_pruned_items to vacuum indexes
+ *		and mark dead line pointers for reuse via a second heap pass.
  *
  *		If the table has at least two indexes, we execute both index vacuum
  *		and index cleanup with parallel workers unless parallel vacuum is
@@ -815,22 +1382,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	LVParallelState *lps = NULL;
 	LVDeadTuples *dead_tuples;
 	BlockNumber nblocks,
-				blkno;
-	HeapTupleData tuple;
-	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
-	TransactionId relminmxid = onerel->rd_rel->relminmxid;
-	BlockNumber empty_pages,
-				vacuumed_pages,
+				blkno,
+				next_unskippable_block,
 				next_fsm_block_to_vacuum;
-	double		num_tuples,		/* total number of nonremovable tuples */
-				live_tuples,	/* live tuples (reltuples estimate) */
-				tups_vacuumed,	/* tuples cleaned up by current vacuum */
-				nkeep,			/* dead-but-not-removable tuples */
-				nunused;		/* # existing unused line pointers */
-	int			i;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
-	BlockNumber next_unskippable_block;
 	bool		skipping_blocks;
 	xl_heap_freeze_tuple *frozen;
 	StringInfoData buf;
@@ -841,6 +1397,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	};
 	int64		initprog_val[3];
 	GlobalVisState *vistest;
+	LVTempCounters c;
+
+	/* Counters of # blocks in onerel: */
+	BlockNumber empty_pages,
+				vacuumed_pages;
 
 	pg_rusage_init(&ru0);
 
@@ -856,15 +1417,21 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 						vacrelstats->relname)));
 
 	empty_pages = vacuumed_pages = 0;
-	next_fsm_block_to_vacuum = (BlockNumber) 0;
-	num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
+
+	/* Initialize counters */
+	c.num_tuples = 0;
+	c.live_tuples = 0;
+	c.tups_vacuumed = 0;
+	c.nkeep = 0;
+	c.nunused = 0;
 
 	nblocks = RelationGetNumberOfBlocks(onerel);
+	next_unskippable_block = 0;
+	next_fsm_block_to_vacuum = 0;
 	vacrelstats->rel_pages = nblocks;
 	vacrelstats->scanned_pages = 0;
 	vacrelstats->tupcount_pages = 0;
 	vacrelstats->nonempty_pages = 0;
-	vacrelstats->latestRemovedXid = InvalidTransactionId;
 
 	vistest = GlobalVisTestFor(onerel);
 
@@ -873,7 +1440,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * be used for an index, so we invoke parallelism only if there are at
 	 * least two indexes on a table.
 	 */
-	if (params->nworkers >= 0 && vacrelstats->useindex && nindexes > 1)
+	if (params->nworkers >= 0 && nindexes > 1)
 	{
 		/*
 		 * Since parallel workers cannot access data in temporary tables, we
@@ -901,7 +1468,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * initialized.
 	 */
 	if (!ParallelVacuumIsActive(lps))
-		lazy_space_alloc(vacrelstats, nblocks);
+		lazy_space_alloc(vacrelstats, nblocks, nindexes > 0);
 
 	dead_tuples = vacrelstats->dead_tuples;
 	frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
@@ -956,7 +1523,6 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * the last page.  This is worth avoiding mainly because such a lock must
 	 * be replayed on any hot standby, where it can be disruptive.
 	 */
-	next_unskippable_block = 0;
 	if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
 	{
 		while (next_unskippable_block < nblocks)
@@ -989,20 +1555,22 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	{
 		Buffer		buf;
 		Page		page;
-		OffsetNumber offnum,
-					maxoff;
-		bool		tupgone,
-					hastup;
-		int			prev_dead_count;
-		int			nfrozen;
+		LVVisMapPageState vms;
+		LVPrunePageState ps;
+		bool		savefreespace;
 		Size		freespace;
-		bool		all_visible_according_to_vm = false;
-		bool		all_visible;
-		bool		all_frozen = true;	/* provided all_visible is also true */
-		bool		has_dead_items;		/* includes existing LP_DEAD items */
-		TransactionId visibility_cutoff_xid = InvalidTransactionId;
 
-		/* see note above about forcing scanning of last page */
+		/* Initialize vm state for block: */
+		vms.all_visible_according_to_vm = false;
+		vms.visibility_cutoff_xid = InvalidTransactionId;
+
+		/* Note: Can't touch ps until we reach scan_prune_page() */
+
+		/*
+		 * Step 1 for block: Consider need to skip blocks.
+		 *
+		 * See note above about forcing scanning of last page.
+		 */
 #define FORCE_CHECK_PAGE() \
 		(blkno == nblocks - 1 && should_attempt_truncation(params, vacrelstats))
 
@@ -1054,7 +1622,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * that it's not all-frozen, so it might still be all-visible.
 			 */
 			if (aggressive && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
-				all_visible_according_to_vm = true;
+				vms.all_visible_according_to_vm = true;
 		}
 		else
 		{
@@ -1081,12 +1649,15 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 					vacrelstats->frozenskipped_pages++;
 				continue;
 			}
-			all_visible_according_to_vm = true;
+			vms.all_visible_according_to_vm = true;
 		}
 
 		vacuum_delay_point();
 
 		/*
+		 * Step 2 for block: Consider if we definitely have enough space to
+		 * process TIDs on page already.
+		 *
 		 * If we are close to overrunning the available space for dead-tuple
 		 * TIDs, pause and do a cycle of vacuuming before we tackle this page.
 		 */
@@ -1105,22 +1676,16 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				vmbuffer = InvalidBuffer;
 			}
 
-			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(onerel, Irel, vacrelstats, lps, nindexes);
-
-			/* Remove tuples from heap */
-			lazy_vacuum_heap(onerel, vacrelstats);
-
-			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
-			 */
-			dead_tuples->num_tuples = 0;
+			/* Remove the collected garbage tuples from table and indexes */
+			lazy_vacuum_pruned_items(onerel, vacrelstats, Irel, nindexes, lps,
+									 params->index_cleanup);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
 			 * upper-level FSM pages.  Note we have not yet processed blkno.
+			 * Even if we skipped heap vacuum, FSM vacuuming could be
+			 * worthwhile since we could have updated the free space of empty
+			 * pages.
 			 */
 			FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
 			next_fsm_block_to_vacuum = blkno;
@@ -1131,22 +1696,29 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		}
 
 		/*
+		 * Step 3 for block: Set up visibility map page as needed.
+		 *
 		 * Pin the visibility map page in case we need to mark the page
 		 * all-visible.  In most cases this will be very cheap, because we'll
 		 * already have the correct page pinned anyway.  However, it's
 		 * possible that (a) next_unskippable_block is covered by a different
 		 * VM page than the current block or (b) we released our pin and did a
 		 * cycle of index vacuuming.
-		 *
 		 */
 		visibilitymap_pin(onerel, blkno, &vmbuffer);
 
 		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
 								 RBM_NORMAL, vac_strategy);
 
-		/* We need buffer cleanup lock so that we can prune HOT chains. */
+		/*
+		 * Step 4 for block: Acquire super-exclusive lock for pruning.
+		 *
+		 * We need buffer cleanup lock so that we can prune HOT chains.
+		 */
 		if (!ConditionalLockBufferForCleanup(buf))
 		{
+			bool		hastup;
+
 			/*
 			 * If we're not performing an aggressive scan to guard against XID
 			 * wraparound, and we don't want to forcibly check the page, then
@@ -1203,6 +1775,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			/* drop through to normal processing */
 		}
 
+		/*
+		 * Step 5 for block: Handle empty/new pages.
+		 *
+		 * By here we have a super-exclusive lock, and it's clear that this
+		 * page is one that we consider scanned.
+		 */
 		vacrelstats->scanned_pages++;
 		vacrelstats->tupcount_pages++;
 
@@ -1210,399 +1788,84 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		if (PageIsNew(page))
 		{
-			/*
-			 * All-zeroes pages can be left over if either a backend extends
-			 * the relation by a single page, but crashes before the newly
-			 * initialized page has been written out, or when bulk-extending
-			 * the relation (which creates a number of empty pages at the tail
-			 * end of the relation, but enters them into the FSM).
-			 *
-			 * Note we do not enter the page into the visibilitymap. That has
-			 * the downside that we repeatedly visit this page in subsequent
-			 * vacuums, but otherwise we'll never not discover the space on a
-			 * promoted standby. The harm of repeated checking ought to
-			 * normally not be too bad - the space usually should be used at
-			 * some point, otherwise there wouldn't be any regular vacuums.
-			 *
-			 * Make sure these pages are in the FSM, to ensure they can be
-			 * reused. Do that by testing if there's any space recorded for
-			 * the page. If not, enter it. We do so after releasing the lock
-			 * on the heap page, the FSM is approximate, after all.
-			 */
-			UnlockReleaseBuffer(buf);
-
 			empty_pages++;
-
-			if (GetRecordedFreeSpace(onerel, blkno) == 0)
-			{
-				Size		freespace;
-
-				freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
-				RecordPageWithFreeSpace(onerel, blkno, freespace);
-			}
+			/* Releases lock on buf for us: */
+			scan_new_page(onerel, buf);
 			continue;
 		}
-
-		if (PageIsEmpty(page))
+		else if (PageIsEmpty(page))
 		{
 			empty_pages++;
-			freespace = PageGetHeapFreeSpace(page);
-
-			/*
-			 * Empty pages are always all-visible and all-frozen (note that
-			 * the same is currently not true for new pages, see above).
-			 */
-			if (!PageIsAllVisible(page))
-			{
-				START_CRIT_SECTION();
-
-				/* mark buffer dirty before writing a WAL record */
-				MarkBufferDirty(buf);
-
-				/*
-				 * It's possible that another backend has extended the heap,
-				 * initialized the page, and then failed to WAL-log the page
-				 * due to an ERROR.  Since heap extension is not WAL-logged,
-				 * recovery might try to replay our record setting the page
-				 * all-visible and find that the page isn't initialized, which
-				 * will cause a PANIC.  To prevent that, check whether the
-				 * page has been previously WAL-logged, and if not, do that
-				 * now.
-				 */
-				if (RelationNeedsWAL(onerel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
-					log_newpage_buffer(buf, true);
-
-				PageSetAllVisible(page);
-				visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
-								  vmbuffer, InvalidTransactionId,
-								  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
-				END_CRIT_SECTION();
-			}
-
-			UnlockReleaseBuffer(buf);
-			RecordPageWithFreeSpace(onerel, blkno, freespace);
+			/* Releases lock on buf for us (though keeps vmbuffer pin): */
+			scan_empty_page(onerel, buf, vmbuffer, vacrelstats);
 			continue;
 		}
 
 		/*
-		 * Prune all HOT-update chains in this page.
+		 * Step 6 for block: Do pruning.
 		 *
-		 * We count tuples removed by the pruning step as removed by VACUUM
-		 * (existing LP_DEAD line pointers don't count).
+		 * Also accumulates details of remaining LP_DEAD line pointers on page
+		 * in dead tuple list.  This includes LP_DEAD line pointers that we
+		 * ourselves just pruned, as well as existing LP_DEAD line pointers
+		 * pruned earlier.
+		 *
+		 * Also handles tuple freezing -- considers freezing XIDs from all
+		 * tuple headers left behind following pruning.
 		 */
-		tups_vacuumed += heap_page_prune(onerel, buf, vistest,
-										 InvalidTransactionId, 0, false,
-										 &vacrelstats->latestRemovedXid,
-										 &vacrelstats->offnum);
+		scan_prune_page(onerel, buf, vacrelstats, vistest, frozen,
+						&c, &ps, &vms, params->index_cleanup);
 
 		/*
-		 * Now scan the page to collect vacuumable items and check for tuples
-		 * requiring freezing.
+		 * Step 7 for block: Set up details for saving free space in FSM at
+		 * end of loop.  (Also performs extra single pass strategy steps in
+		 * "nindexes == 0" case.)
+		 *
+		 * If we have any LP_DEAD items on this page (i.e. any new dead_tuples
+		 * entries compared to just before scan_prune_page()) then the page
+		 * will be visited again by lazy_vacuum_heap(), which will compute and
+		 * record its post-compaction free space.  If not, then we're done
+		 * with this page, so remember its free space as-is.
 		 */
-		all_visible = true;
-		has_dead_items = false;
-		nfrozen = 0;
-		hastup = false;
-		prev_dead_count = dead_tuples->num_tuples;
-		maxoff = PageGetMaxOffsetNumber(page);
-
-		/*
-		 * Note: If you change anything in the loop below, also look at
-		 * heap_page_is_all_visible to see if that needs to be changed.
-		 */
-		for (offnum = FirstOffsetNumber;
-			 offnum <= maxoff;
-			 offnum = OffsetNumberNext(offnum))
+		savefreespace = false;
+		freespace = 0;
+		if (nindexes > 0 && ps.has_dead_items &&
+			params->index_cleanup != VACOPT_TERNARY_DISABLED)
 		{
-			ItemId		itemid;
-
-			/*
-			 * Set the offset number so that we can display it along with any
-			 * error that occurred while processing this tuple.
-			 */
-			vacrelstats->offnum = offnum;
-			itemid = PageGetItemId(page, offnum);
-
-			/* Unused items require no processing, but we count 'em */
-			if (!ItemIdIsUsed(itemid))
-			{
-				nunused += 1;
-				continue;
-			}
-
-			/* Redirect items mustn't be touched */
-			if (ItemIdIsRedirected(itemid))
-			{
-				hastup = true;	/* this page won't be truncatable */
-				continue;
-			}
-
-			ItemPointerSet(&(tuple.t_self), blkno, offnum);
-
-			/*
-			 * LP_DEAD line pointers are to be vacuumed normally; but we don't
-			 * count them in tups_vacuumed, else we'd be double-counting (at
-			 * least in the common case where heap_page_prune() just freed up
-			 * a non-HOT tuple).  Note also that the final tups_vacuumed value
-			 * might be very low for tables where opportunistic page pruning
-			 * happens to occur very frequently (via heap_page_prune_opt()
-			 * calls that free up non-HOT tuples).
-			 */
-			if (ItemIdIsDead(itemid))
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				all_visible = false;
-				has_dead_items = true;
-				continue;
-			}
-
-			Assert(ItemIdIsNormal(itemid));
-
-			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
-			tuple.t_len = ItemIdGetLength(itemid);
-			tuple.t_tableOid = RelationGetRelid(onerel);
-
-			tupgone = false;
-
-			/*
-			 * The criteria for counting a tuple as live in this block need to
-			 * match what analyze.c's acquire_sample_rows() does, otherwise
-			 * VACUUM and ANALYZE may produce wildly different reltuples
-			 * values, e.g. when there are many recently-dead tuples.
-			 *
-			 * The logic here is a bit simpler than acquire_sample_rows(), as
-			 * VACUUM can't run inside a transaction block, which makes some
-			 * cases impossible (e.g. in-progress insert from the same
-			 * transaction).
-			 */
-			switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
-			{
-				case HEAPTUPLE_DEAD:
-
-					/*
-					 * Ordinarily, DEAD tuples would have been removed by
-					 * heap_page_prune(), but it's possible that the tuple
-					 * state changed since heap_page_prune() looked.  In
-					 * particular an INSERT_IN_PROGRESS tuple could have
-					 * changed to DEAD if the inserter aborted.  So this
-					 * cannot be considered an error condition.
-					 *
-					 * If the tuple is HOT-updated then it must only be
-					 * removed by a prune operation; so we keep it just as if
-					 * it were RECENTLY_DEAD.  Also, if it's a heap-only
-					 * tuple, we choose to keep it, because it'll be a lot
-					 * cheaper to get rid of it in the next pruning pass than
-					 * to treat it like an indexed tuple. Finally, if index
-					 * cleanup is disabled, the second heap pass will not
-					 * execute, and the tuple will not get removed, so we must
-					 * treat it like any other dead tuple that we choose to
-					 * keep.
-					 *
-					 * If this were to happen for a tuple that actually needed
-					 * to be deleted, we'd be in trouble, because it'd
-					 * possibly leave a tuple below the relation's xmin
-					 * horizon alive.  heap_prepare_freeze_tuple() is prepared
-					 * to detect that case and abort the transaction,
-					 * preventing corruption.
-					 */
-					if (HeapTupleIsHotUpdated(&tuple) ||
-						HeapTupleIsHeapOnly(&tuple) ||
-						params->index_cleanup == VACOPT_TERNARY_DISABLED)
-						nkeep += 1;
-					else
-						tupgone = true; /* we can delete the tuple */
-					all_visible = false;
-					break;
-				case HEAPTUPLE_LIVE:
-
-					/*
-					 * Count it as live.  Not only is this natural, but it's
-					 * also what acquire_sample_rows() does.
-					 */
-					live_tuples += 1;
-
-					/*
-					 * Is the tuple definitely visible to all transactions?
-					 *
-					 * NB: Like with per-tuple hint bits, we can't set the
-					 * PD_ALL_VISIBLE flag if the inserter committed
-					 * asynchronously. See SetHintBits for more info. Check
-					 * that the tuple is hinted xmin-committed because of
-					 * that.
-					 */
-					if (all_visible)
-					{
-						TransactionId xmin;
-
-						if (!HeapTupleHeaderXminCommitted(tuple.t_data))
-						{
-							all_visible = false;
-							break;
-						}
-
-						/*
-						 * The inserter definitely committed. But is it old
-						 * enough that everyone sees it as committed?
-						 */
-						xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-						if (!TransactionIdPrecedes(xmin, OldestXmin))
-						{
-							all_visible = false;
-							break;
-						}
-
-						/* Track newest xmin on page. */
-						if (TransactionIdFollows(xmin, visibility_cutoff_xid))
-							visibility_cutoff_xid = xmin;
-					}
-					break;
-				case HEAPTUPLE_RECENTLY_DEAD:
-
-					/*
-					 * If tuple is recently deleted then we must not remove it
-					 * from relation.
-					 */
-					nkeep += 1;
-					all_visible = false;
-					break;
-				case HEAPTUPLE_INSERT_IN_PROGRESS:
-
-					/*
-					 * This is an expected case during concurrent vacuum.
-					 *
-					 * We do not count these rows as live, because we expect
-					 * the inserting transaction to update the counters at
-					 * commit, and we assume that will happen only after we
-					 * report our results.  This assumption is a bit shaky,
-					 * but it is what acquire_sample_rows() does, so be
-					 * consistent.
-					 */
-					all_visible = false;
-					break;
-				case HEAPTUPLE_DELETE_IN_PROGRESS:
-					/* This is an expected case during concurrent vacuum */
-					all_visible = false;
-
-					/*
-					 * Count such rows as live.  As above, we assume the
-					 * deleting transaction will commit and update the
-					 * counters after we report.
-					 */
-					live_tuples += 1;
-					break;
-				default:
-					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
-					break;
-			}
-
-			if (tupgone)
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-													   &vacrelstats->latestRemovedXid);
-				tups_vacuumed += 1;
-				has_dead_items = true;
-			}
-			else
-			{
-				bool		tuple_totally_frozen;
-
-				num_tuples += 1;
-				hastup = true;
-
-				/*
-				 * Each non-removable tuple must be checked to see if it needs
-				 * freezing.  Note we already have exclusive buffer lock.
-				 */
-				if (heap_prepare_freeze_tuple(tuple.t_data,
-											  relfrozenxid, relminmxid,
-											  FreezeLimit, MultiXactCutoff,
-											  &frozen[nfrozen],
-											  &tuple_totally_frozen))
-					frozen[nfrozen++].offset = offnum;
-
-				if (!tuple_totally_frozen)
-					all_frozen = false;
-			}
-		}						/* scan along page */
-
-		/*
-		 * Clear the offset information once we have processed all the tuples
-		 * on the page.
-		 */
-		vacrelstats->offnum = InvalidOffsetNumber;
-
-		/*
-		 * If we froze any tuples, mark the buffer dirty, and write a WAL
-		 * record recording the changes.  We must log the changes to be
-		 * crash-safe against future truncation of CLOG.
-		 */
-		if (nfrozen > 0)
+			/* Wait until lazy_vacuum_heap() to save free space */
+		}
+		else
 		{
-			START_CRIT_SECTION();
-
-			MarkBufferDirty(buf);
-
-			/* execute collected freezes */
-			for (i = 0; i < nfrozen; i++)
-			{
-				ItemId		itemid;
-				HeapTupleHeader htup;
-
-				itemid = PageGetItemId(page, frozen[i].offset);
-				htup = (HeapTupleHeader) PageGetItem(page, itemid);
-
-				heap_execute_freeze_tuple(htup, &frozen[i]);
-			}
-
-			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
-			{
-				XLogRecPtr	recptr;
-
-				recptr = log_heap_freeze(onerel, buf, FreezeLimit,
-										 frozen, nfrozen);
-				PageSetLSN(page, recptr);
-			}
-
-			END_CRIT_SECTION();
+			/*
+			 * Will never reach lazy_vacuum_heap() (or will, but won't reach
+			 * this specific page)
+			 */
+			savefreespace = true;
+			freespace = PageGetHeapFreeSpace(page);
 		}
 
-		/*
-		 * If there are no indexes we can vacuum the page right now instead of
-		 * doing a second scan. Also we don't do that but forget dead tuples
-		 * when index cleanup is disabled.
-		 */
-		if (!vacrelstats->useindex && dead_tuples->num_tuples > 0)
+		if (nindexes == 0 && ps.has_dead_items)
 		{
-			if (nindexes == 0)
-			{
-				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
-				vacuumed_pages++;
-				has_dead_items = false;
-			}
-			else
-			{
-				/*
-				 * Here, we have indexes but index cleanup is disabled.
-				 * Instead of vacuuming the dead tuples on the heap, we just
-				 * forget them.
-				 *
-				 * Note that vacrelstats->dead_tuples could have tuples which
-				 * became dead after HOT-pruning but are not marked dead yet.
-				 * We do not process them because it's a very rare condition,
-				 * and the next vacuum will process them anyway.
-				 */
-				Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
-			}
+			Assert(dead_tuples->num_tuples > 0);
 
 			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
+			 * One pass strategy (no indexes) case.
+			 *
+			 * Mark LP_DEAD item pointers as LP_UNUSED now, since there won't
+			 * be a second pass in lazy_vacuum_heap().
 			 */
+			lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+			vacuumed_pages++;
+
+			/* This won't have changed: */
+			Assert(savefreespace && freespace == PageGetHeapFreeSpace(page));
+
+			/*
+			 * Make sure scan_setvmbit_page() won't refuse to set the VM bit
+			 * because of the now-vacuumed LP_DEAD items:
+			 */
+			ps.has_dead_items = false;
+
+			/* Forget the now-vacuumed tuples */
 			dead_tuples->num_tuples = 0;
 
 			/*
@@ -1619,109 +1882,27 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 		}
 
-		freespace = PageGetHeapFreeSpace(page);
-
-		/* mark page all-visible, if appropriate */
-		if (all_visible && !all_visible_according_to_vm)
-		{
-			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
-
-			if (all_frozen)
-				flags |= VISIBILITYMAP_ALL_FROZEN;
-
-			/*
-			 * It should never be the case that the visibility map page is set
-			 * while the page-level bit is clear, but the reverse is allowed
-			 * (if checksums are not enabled).  Regardless, set both bits so
-			 * that we get back in sync.
-			 *
-			 * NB: If the heap page is all-visible but the VM bit is not set,
-			 * we don't need to dirty the heap page.  However, if checksums
-			 * are enabled, we do need to make sure that the heap page is
-			 * dirtied before passing it to visibilitymap_set(), because it
-			 * may be logged.  Given that this situation should only happen in
-			 * rare cases after a crash, it is not worth optimizing.
-			 */
-			PageSetAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, visibility_cutoff_xid, flags);
-		}
+		/* One pass strategy had better have no dead tuples by now: */
+		Assert(nindexes > 0 || dead_tuples->num_tuples == 0);
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after we checked it and before we took the buffer
-		 * content lock, so we must recheck before jumping to the conclusion
-		 * that something bad has happened.
+		 * Step 8 for block: Handle setting visibility map bit as appropriate
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
-		{
-			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
-				 vacrelstats->relname, blkno);
-			visibilitymap_clear(onerel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
+		scan_setvmbit_page(onerel, buf, vmbuffer, &ps, &vms);
 
 		/*
-		 * It's possible for the value returned by
-		 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
-		 * wrong for us to see tuples that appear to not be visible to
-		 * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
-		 * xmin value never moves backwards, but
-		 * GetOldestNonRemovableTransactionId() is conservative and sometimes
-		 * returns a value that's unnecessarily small, so if we see that
-		 * contradiction it just means that the tuples that we think are not
-		 * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
-		 * is correct.
-		 *
-		 * There should never be dead tuples on a page with PD_ALL_VISIBLE
-		 * set, however.
+		 * Step 9 for block: drop super-exclusive lock, finalize page by
+		 * recording its free space in the FSM as appropriate
 		 */
-		else if (PageIsAllVisible(page) && has_dead_items)
-		{
-			elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
-				 vacrelstats->relname, blkno);
-			PageClearAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_clear(onerel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
-
-		/*
-		 * If the all-visible page is all-frozen but not marked as such yet,
-		 * mark it as all-frozen.  Note that all_frozen is only valid if
-		 * all_visible is true, so we must check both.
-		 */
-		else if (all_visible_according_to_vm && all_visible && all_frozen &&
-				 !VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
-		{
-			/*
-			 * We can pass InvalidTransactionId as the cutoff XID here,
-			 * because setting the all-frozen bit doesn't cause recovery
-			 * conflicts.
-			 */
-			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, InvalidTransactionId,
-							  VISIBILITYMAP_ALL_FROZEN);
-		}
 
 		UnlockReleaseBuffer(buf);
-
 		/* Remember the location of the last page with nonremovable tuples */
-		if (hastup)
+		if (ps.hastup)
 			vacrelstats->nonempty_pages = blkno + 1;
-
-		/*
-		 * If we remembered any tuples for deletion, then the page will be
-		 * visited again by lazy_vacuum_heap, which will compute and record
-		 * its post-compaction free space.  If not, then we're done with this
-		 * page, so remember its free space as-is.  (This path will always be
-		 * taken if there are no indexes.)
-		 */
-		if (dead_tuples->num_tuples == prev_dead_count)
+		if (savefreespace)
 			RecordPageWithFreeSpace(onerel, blkno, freespace);
+
+		/* Finished all steps for block by here (at the latest) */
 	}
 
 	/* report that everything is scanned and vacuumed */
@@ -1733,14 +1914,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	pfree(frozen);
 
 	/* save stats for use later */
-	vacrelstats->tuples_deleted = tups_vacuumed;
-	vacrelstats->new_dead_tuples = nkeep;
+	vacrelstats->tuples_deleted = c.tups_vacuumed;
+	vacrelstats->new_dead_tuples = c.nkeep;
 
 	/* now we can compute the new value for pg_class.reltuples */
 	vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
 														  nblocks,
 														  vacrelstats->tupcount_pages,
-														  live_tuples);
+														  c.live_tuples);
 
 	/*
 	 * Also compute the total number of surviving heap entries.  In the
@@ -1759,19 +1940,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	}
 
 	/* If any tuples need to be deleted, perform final vacuum cycle */
-	/* XXX put a threshold on min number of tuples here? */
+	Assert(nindexes > 0 || dead_tuples->num_tuples == 0);
 	if (dead_tuples->num_tuples > 0)
-	{
-		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(onerel, Irel, vacrelstats, lps, nindexes);
-
-		/* Remove tuples from heap */
-		lazy_vacuum_heap(onerel, vacrelstats);
-	}
+		lazy_vacuum_pruned_items(onerel, vacrelstats, Irel, nindexes, lps,
+								 params->index_cleanup);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
-	 * not there were indexes.
+	 * not there were indexes, and whether or not we skipped index vacuuming.
 	 */
 	if (blkno > next_fsm_block_to_vacuum)
 		FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
@@ -1779,8 +1955,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/* report all blocks vacuumed */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
-	/* Do post-vacuum cleanup */
-	if (vacrelstats->useindex)
+	/*
+	 * Do post-vacuum cleanup.
+	 *
+	 * Note that post-vacuum cleanup does not take place with
+	 * INDEX_CLEANUP=OFF.
+	 */
+	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
 		lazy_cleanup_all_indexes(Irel, vacrelstats, lps, nindexes);
 
 	/*
@@ -1790,23 +1971,32 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	if (ParallelVacuumIsActive(lps))
 		end_parallel_vacuum(vacrelstats->indstats, lps, nindexes);
 
-	/* Update index statistics */
-	if (vacrelstats->useindex)
+	/*
+	 * Update index statistics.
+	 *
+	 * Note that updating the statistics does not take place with
+	 * INDEX_CLEANUP=OFF.
+	 */
+	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
 		update_index_statistics(Irel, vacrelstats->indstats, nindexes);
 
-	/* If no indexes, make log report that lazy_vacuum_heap would've made */
-	if (vacuumed_pages)
+	/*
+	 * If there are no indexes, make the log report that
+	 * lazy_vacuum_pruned_items() would've made
+	 */
+	Assert(nindexes == 0 || vacuumed_pages == 0);
+	if (nindexes == 0)
 		ereport(elevel,
 				(errmsg("\"%s\": removed %.0f row versions in %u pages",
 						vacrelstats->relname,
-						tups_vacuumed, vacuumed_pages)));
+						vacrelstats->tuples_deleted, vacuumed_pages)));
 
 	initStringInfo(&buf);
 	appendStringInfo(&buf,
 					 _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"),
-					 nkeep, OldestXmin);
+					 c.nkeep, OldestXmin);
 	appendStringInfo(&buf, _("There were %.0f unused item identifiers.\n"),
-					 nunused);
+					 c.nunused);
 	appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
 									"Skipped %u pages due to buffer pins, ",
 									vacrelstats->pinskipped_pages),
@@ -1822,18 +2012,73 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
 
 	ereport(elevel,
-			(errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
+			(errmsg("\"%s\": newly pruned %.0f items, found %.0f nonremovable items in %u out of %u pages",
 					vacrelstats->relname,
-					tups_vacuumed, num_tuples,
+					c.tups_vacuumed, c.num_tuples,
 					vacrelstats->scanned_pages, nblocks),
 			 errdetail_internal("%s", buf.data)));
 	pfree(buf.data);
 }
 
 /*
- *	lazy_vacuum_all_indexes() -- vacuum all indexes of relation.
+ * Remove the collected garbage tuples from the table and its indexes.
  *
- * We process the indexes serially unless we are doing parallel vacuum.
+ * We may be required to skip index vacuuming by the INDEX_CLEANUP reloption.
+ */
+static void
+lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
+						 Relation *Irel, int nindexes, LVParallelState *lps,
+						 VacOptTernaryValue index_cleanup)
+{
+	bool		skipping;
+
+	/* Should not end up here with no indexes */
+	Assert(nindexes > 0);
+	Assert(!IsParallelWorker());
+
+	/* Check whether or not to do index vacuum and heap vacuum */
+	if (index_cleanup == VACOPT_TERNARY_DISABLED)
+		skipping = true;
+	else
+		skipping = false;
+
+	if (!skipping)
+	{
+		/* Okay, we're going to do index vacuuming */
+		lazy_vacuum_all_indexes(onerel, Irel, vacrelstats, lps, nindexes);
+
+		/* Remove tuples from heap */
+		lazy_vacuum_heap(onerel, vacrelstats);
+	}
+	else
+	{
+		/*
+		 * Skipped index vacuuming.  Make the log report that
+		 * lazy_vacuum_heap() would've made.
+		 *
+		 * Don't report tups_vacuumed here because it will be zero in the
+		 * common case where there are no newly pruned LP_DEAD items for this
+		 * VACUUM.  This is roughly consistent with lazy_vacuum_heap(), and
+		 * the similar "nindexes == 0" specific ereport() at the end of
+		 * lazy_scan_heap().
+		 */
+		ereport(elevel,
+				(errmsg("\"%s\": INDEX_CLEANUP off forced VACUUM to not totally remove %d pruned items",
+						vacrelstats->relname,
+						vacrelstats->dead_tuples->num_tuples)));
+	}
+
+	/*
+	 * Forget the now-vacuumed tuples, and press on, but be careful not to
+	 * reset latestRemovedXid since we want that value to be valid.
+	 */
+	vacrelstats->dead_tuples->num_tuples = 0;
+}
+
+/*
+ *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
+ *
+ * Should only be called through lazy_vacuum_pruned_items().
  */
 static void
 lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
@@ -1882,17 +2127,14 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 								 vacrelstats->num_index_scans);
 }
 
-
 /*
- *	lazy_vacuum_heap() -- second pass over the heap
+ *	lazy_vacuum_heap() -- second pass over the heap for two pass strategy
  *
  *		This routine marks dead tuples as unused and compacts out free
  *		space on their pages.  Pages not having dead tuples recorded from
  *		lazy_scan_heap are not visited at all.
  *
- * Note: the reason for doing this as a second pass is we cannot remove
- * the tuples until we've removed their index entries, and we want to
- * process index entry removal in batches as large as possible.
+ * Should only be called through lazy_vacuum_pruned_items().
  */
 static void
 lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
@@ -2898,14 +3140,14 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
  * Return the maximum number of dead tuples we can record.
  */
 static long
-compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
+compute_max_dead_tuples(BlockNumber relblocks, bool hasindex)
 {
 	long		maxtuples;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
 	autovacuum_work_mem != -1 ?
 	autovacuum_work_mem : maintenance_work_mem;
 
-	if (useindex)
+	if (hasindex)
 	{
 		maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
 		maxtuples = Min(maxtuples, INT_MAX);
@@ -2930,12 +3172,12 @@ compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
  * See the comments at the head of this file for rationale.
  */
 static void
-lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
+lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks, bool hasindex)
 {
 	LVDeadTuples *dead_tuples = NULL;
 	long		maxtuples;
 
-	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->useindex);
+	maxtuples = compute_max_dead_tuples(relblocks, hasindex);
 
 	dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
 	dead_tuples->num_tuples = 0;
@@ -3055,7 +3297,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 
 	/*
 	 * This is a stripped down version of the line pointer scan in
-	 * lazy_scan_heap(). So if you change anything here, also check that code.
+	 * scan_prune_page(). So if you change anything here, also check that code.
 	 */
 	maxoff = PageGetMaxOffsetNumber(page);
 	for (offnum = FirstOffsetNumber;
@@ -3101,7 +3343,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 				{
 					TransactionId xmin;
 
-					/* Check comments in lazy_scan_heap. */
+					/* Check comments in scan_prune_page(). */
 					if (!HeapTupleHeaderXminCommitted(tuple.t_data))
 					{
 						all_visible = false;
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index dd0c124e62..3ac8df7d07 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -756,10 +756,10 @@ tuple_all_visible(HeapTuple tup, TransactionId OldestXmin, Buffer buffer)
 		return false;			/* all-visible implies live */
 
 	/*
-	 * Neither lazy_scan_heap nor heap_page_is_all_visible will mark a page
-	 * all-visible unless every tuple is hinted committed. However, those hint
-	 * bits could be lost after a crash, so we can't be certain that they'll
-	 * be set here.  So just check the xmin.
+	 * Neither lazy_scan_heap/scan_prune_page nor heap_page_is_all_visible will
+	 * mark a page all-visible unless every tuple is hinted committed.
+	 * However, those hint bits could be lost after a crash, so we can't be
+	 * certain that they'll be set here.  So just check the xmin.
 	 */
 
 	xmin = HeapTupleHeaderGetXmin(tup->t_data);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 1fe193bb25..34670c6264 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -58,8 +58,8 @@ typedef struct output_type
  * and approximate tuple_len on that basis. For the others, we count
  * the exact number of dead tuples etc.
  *
- * This scan is loosely based on vacuumlazy.c:lazy_scan_heap(), but
- * we do not try to avoid skipping single pages.
+ * This scan is loosely based on vacuumlazy.c:lazy_scan_heap/scan_prune_page,
+ * but we do not try to avoid skipping single pages.
  */
 static void
 statapprox_heap(Relation rel, output_type *stat)
@@ -126,8 +126,8 @@ statapprox_heap(Relation rel, output_type *stat)
 
 		/*
 		 * Look at each tuple on the page and decide whether it's live or
-		 * dead, then count it and its size. Unlike lazy_scan_heap, we can
-		 * afford to ignore problems and special cases.
+		 * dead, then count it and its size. Unlike lazy_scan_heap and
+		 * scan_prune_page, we can afford to ignore problems and special cases.
 		 */
 		maxoff = PageGetMaxOffsetNumber(page);
 
-- 
2.27.0

Attachment: v5-0002-Remove-tupgone-special-case-from-vacuumlazy.c.patch (application/x-patch)
From bd09faeec061b370c8ca361b5566d3cbaafbbd39 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 19 Mar 2021 14:46:21 -0700
Subject: [PATCH v5 2/3] Remove tupgone special case from vacuumlazy.c.

Retry the call to heap_page_prune() for the buffer being pruned and
vacuumed in rare cases where there is disagreement between the first
heap_page_prune() call and VACUUM's HeapTupleSatisfiesVacuum() call.
This was possible when a concurrently aborting transaction rendered a
live tuple dead in the tiny window between each check.  As a result,
VACUUM's definition of dead tuples (tuples that are to be deleted from
indexes during VACUUM) is simplified: it is always LP_DEAD stub line
pointers from the first scan of the heap.  Note that in general VACUUM
may not have actually done all the pruning that rendered tuples LP_DEAD.
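
Roughly, the idea looks like this (a simplified sketch only, not the
exact code from the patch; after the earlier refactoring this logic
lives in scan_prune_page(), and the "goto retry" plus elided setup are
just shorthand):

    retry:
        /* Prune, then inspect what is left behind on the page */
        heap_page_prune(onerel, buf, vistest, InvalidTransactionId, 0,
                        false, &vacrelstats->offnum);

        for (offnum = FirstOffsetNumber; offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            /* ... set up "tuple" from the line pointer at offnum ... */

            switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
            {
                case HEAPTUPLE_DEAD:

                    /*
                     * A concurrent abort made this tuple DEAD after the
                     * prune call above looked at it.  Rather than keeping
                     * the old tupgone special case, just prune again; the
                     * window is tiny, so retries should be very rare.
                     */
                    goto retry;

                    /* ... remaining cases are handled as before ... */
            }
        }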

This has the effect of decoupling index vacuuming (and heap page
vacuuming) from pruning during VACUUM's first heap pass.  The index
vacuum skipping performed by the INDEX_CLEANUP mechanism added by commit
a96c41f introduced one case where index vacuuming could be skipped, but
there are reasons to doubt that its approach was 100% robust.  Whereas
simply retrying pruning (and eliminating the tupgone steps entirely)
makes everything far simpler for heap vacuuming, and therefore far
simpler in general.

Heap vacuuming can now be thought of as conceptually similar to index
vacuuming and conceptually dissimilar to heap pruning.  Heap pruning now
has sole responsibility for anything involving the logical contents of
the database (e.g., managing transaction status information, recovery
conflicts, considering what to do with chains of tuples caused by
UPDATEs).  Whereas index vacuuming and heap vacuuming are now strictly
concerned with removing garbage tuples from a physical data structure
that backs the logical database.

This work enables INDEX_CLEANUP-style skipping of index vacuuming to be
pushed a lot further -- the decision can now be made dynamically (since
there is no question about leaving behind a dead tuple with storage due
to skipping the second heap pass/heap vacuuming).  An upcoming patch
from Masahiko Sawada will teach VACUUM to skip index vacuuming
dynamically, based on criteria involving the number of dead tuples.  The
only truly essential steps for VACUUM now all take place during the
first heap pass.  These are heap pruning and tuple freezing.  Everything
else is now an optional adjunct, at least in principle.

VACUUM can even change its mind about indexes (it can decide to give up
on deleting tuples from indexes).  There is no fundamental difference
between a VACUUM that decides to skip index vacuuming before it even
began, and a VACUUM that skips index vacuuming having already done a
certain amount of it.

Also remove XLOG_HEAP2_CLEANUP_INFO records.  These are no longer
necessary because we now rely entirely on heap pruning to take care of
recovery conflicts during VACUUM -- there is no longer any need to have
extra recovery conflicts due to the tupgone case, which allowed tuples
that still have storage (i.e. are not LP_DEAD) to nevertheless be
considered dead tuples by VACUUM.  Note that heap vacuuming now uses
exactly the same strategy for recovery conflicts as index vacuuming.
Both mechanisms now completely rely on heap pruning to generate all the
recovery conflicts that they require.
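
In redo terms, the conflicts that used to come from the
XLOG_HEAP2_CLEANUP_INFO and XLOG_HEAP2_CLEAN records now come only from
the prune record's latestRemovedXid, roughly as in the new
heap_xlog_prune() routine below (sketch):

    if (InHotStandby)
        ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);

The new xl_heap_vacuum record carries no XID at all, so heap_xlog_vacuum()
has no equivalent conflict step.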

Also stop acquiring a super-exclusive lock for heap pages when they're
vacuumed during VACUUM's second heap pass.  A regular exclusive lock is
enough.  This is correct because heap page vacuuming is now strictly a
matter of setting the LP_DEAD line pointers to LP_UNUSED.  No other
backend can have a pointer to a tuple located in a pinned buffer that
can be invalidated by a concurrent heap page vacuum operation.  Note
that the page is no longer defragmented during heap page vacuuming,
because that is unsafe without a super-exclusive lock.
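
Heap page vacuuming therefore boils down to something like the following
(a sketch only -- "unused"/"nunused" stand for the LP_DEAD offsets
collected for the page during the first heap pass, and WAL logging is
elided):

    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);    /* not a cleanup lock */
    START_CRIT_SECTION();

    for (int i = 0; i < nunused; i++)
    {
        ItemId      itemid = PageGetItemId(page, unused[i]);

        Assert(ItemIdIsDead(itemid));
        ItemIdSetUnused(itemid);
    }

    PageSetHasFreeLinePointers(page);
    MarkBufferDirty(buf);
    /* ... emit an XLOG_HEAP2_VACUUM record for the nunused offsets ... */

    END_CRIT_SECTION();
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);

Note the absence of a PageRepairFragmentation() call -- defragmentation
is left until the next time the page is pruned, when a cleanup lock is
held.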

Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record
changes.

Credit for the idea of retrying pruning a page to avoid the tupgone case
goes to Andres Freund.
---
 src/include/access/heapam.h              |   2 +-
 src/include/access/heapam_xlog.h         |  41 ++--
 src/backend/access/gist/gistxlog.c       |   8 +-
 src/backend/access/hash/hash_xlog.c      |   8 +-
 src/backend/access/heap/heapam.c         | 205 +++++++++-----------
 src/backend/access/heap/pruneheap.c      |  60 +++---
 src/backend/access/heap/vacuumlazy.c     | 228 +++++++++++------------
 src/backend/access/nbtree/nbtree.c       |   8 +-
 src/backend/access/rmgrdesc/heapdesc.c   |  32 ++--
 src/backend/replication/logical/decode.c |   4 +-
 src/backend/storage/page/bufpage.c       |  20 +-
 src/tools/pgindent/typedefs.list         |   4 +-
 12 files changed, 299 insertions(+), 321 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bc0936bc2d..0bef090420 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -180,7 +180,7 @@ extern int	heap_page_prune(Relation relation, Buffer buffer,
 							struct GlobalVisState *vistest,
 							TransactionId old_snap_xmin,
 							TimestampTz old_snap_ts_ts,
-							bool report_stats, TransactionId *latestRemovedXid,
+							bool report_stats,
 							OffsetNumber *off_loc);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 178d49710a..e6055d1ecd 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -51,9 +51,9 @@
  * these, too.
  */
 #define XLOG_HEAP2_REWRITE		0x00
-#define XLOG_HEAP2_CLEAN		0x10
-#define XLOG_HEAP2_FREEZE_PAGE	0x20
-#define XLOG_HEAP2_CLEANUP_INFO 0x30
+#define XLOG_HEAP2_PRUNE		0x10
+#define XLOG_HEAP2_VACUUM		0x20
+#define XLOG_HEAP2_FREEZE_PAGE	0x30
 #define XLOG_HEAP2_VISIBLE		0x40
 #define XLOG_HEAP2_MULTI_INSERT 0x50
 #define XLOG_HEAP2_LOCK_UPDATED 0x60
@@ -227,7 +227,8 @@ typedef struct xl_heap_update
 #define SizeOfHeapUpdate	(offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
 
 /*
- * This is what we need to know about vacuum page cleanup/redirect
+ * This is what we need to know about page pruning (both during VACUUM and
+ * during opportunistic pruning)
  *
  * The array of OffsetNumbers following the fixed part of the record contains:
  *	* for each redirected item: the item offset, then the offset redirected to
@@ -236,29 +237,32 @@ typedef struct xl_heap_update
  * The total number of OffsetNumbers is therefore 2*nredirected+ndead+nunused.
  * Note that nunused is not explicitly stored, but may be found by reference
  * to the total record length.
+ *
+ * Requires a super-exclusive lock.
  */
-typedef struct xl_heap_clean
+typedef struct xl_heap_prune
 {
 	TransactionId latestRemovedXid;
 	uint16		nredirected;
 	uint16		ndead;
 	/* OFFSET NUMBERS are in the block reference 0 */
-} xl_heap_clean;
+} xl_heap_prune;
 
-#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+#define SizeOfHeapPrune (offsetof(xl_heap_prune, ndead) + sizeof(uint16))
 
 /*
- * Cleanup_info is required in some cases during a lazy VACUUM.
- * Used for reporting the results of HeapTupleHeaderAdvanceLatestRemovedXid()
- * see vacuumlazy.c for full explanation
+ * The vacuum page record is similar to the prune record, but can only mark
+ * already dead items as unused.
+ *
+ * Used by heap vacuuming only.  Does not require a super-exclusive lock.
  */
-typedef struct xl_heap_cleanup_info
+typedef struct xl_heap_vacuum
 {
-	RelFileNode node;
-	TransactionId latestRemovedXid;
-} xl_heap_cleanup_info;
+	uint16		nunused;
+	/* OFFSET NUMBERS are in the block reference 0 */
+} xl_heap_vacuum;
 
-#define SizeOfHeapCleanupInfo (sizeof(xl_heap_cleanup_info))
+#define SizeOfHeapVacuum (offsetof(xl_heap_vacuum, nunused) + sizeof(uint16))
 
 /* flags for infobits_set */
 #define XLHL_XMAX_IS_MULTI		0x01
@@ -397,13 +401,6 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
 extern void heap_xlog_logical_rewrite(XLogReaderState *r);
 
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
-										TransactionId latestRemovedXid);
-extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
-								 OffsetNumber *redirected, int nredirected,
-								 OffsetNumber *nowdead, int ndead,
-								 OffsetNumber *nowunused, int nunused,
-								 TransactionId latestRemovedXid);
 extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
 								  TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
 								  int ntuples);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1c80eae044..6464cb9281 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -184,10 +184,10 @@ gistRedoDeleteRecord(XLogReaderState *record)
 	 *
 	 * GiST delete records can conflict with standby queries.  You might think
 	 * that vacuum records would conflict as well, but we've handled that
-	 * already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual gist vacuum records on that index.
+	 * already.  XLOG_HEAP2_PRUNE records provide the highest xid cleaned by
+	 * the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual gist vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 02d9e6cdfd..af35a991fc 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -992,10 +992,10 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
 	 * Hash index records that are marked as LP_DEAD and being removed during
 	 * hash index tuple insertion can conflict with standby queries. You might
 	 * think that vacuum records would conflict as well, but we've handled
-	 * that already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual hash index vacuum records on that index.
+	 * that already.  XLOG_HEAP2_PRUNE records provide the highest xid cleaned
+	 * by the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual hash index vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7cb87f4a3b..1d30a92420 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7528,7 +7528,7 @@ heap_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
 			 * must have considered the original tuple header as part of
 			 * generating its own latestRemovedXid value.
 			 *
-			 * Relying on XLOG_HEAP2_CLEAN records like this is the same
+			 * Relying on XLOG_HEAP2_PRUNE records like this is the same
 			 * strategy that index vacuuming uses in all cases.  Index VACUUM
 			 * WAL records don't even have a latestRemovedXid field of their
 			 * own for this reason.
@@ -7947,88 +7947,6 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
 	return nblocksfavorable;
 }
 
-/*
- * Perform XLogInsert to register a heap cleanup info message. These
- * messages are sent once per VACUUM and are required because
- * of the phasing of removal operations during a lazy VACUUM.
- * see comments for vacuum_log_cleanup_info().
- */
-XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
-{
-	xl_heap_cleanup_info xlrec;
-	XLogRecPtr	recptr;
-
-	xlrec.node = rnode;
-	xlrec.latestRemovedXid = latestRemovedXid;
-
-	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, SizeOfHeapCleanupInfo);
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEANUP_INFO);
-
-	return recptr;
-}
-
-/*
- * Perform XLogInsert for a heap-clean operation.  Caller must already
- * have modified the buffer and marked it dirty.
- *
- * Note: prior to Postgres 8.3, the entries in the nowunused[] array were
- * zero-based tuple indexes.  Now they are one-based like other uses
- * of OffsetNumber.
- *
- * We also include latestRemovedXid, which is the greatest XID present in
- * the removed tuples. That allows recovery processing to cancel or wait
- * for long standby queries that can still see these tuples.
- */
-XLogRecPtr
-log_heap_clean(Relation reln, Buffer buffer,
-			   OffsetNumber *redirected, int nredirected,
-			   OffsetNumber *nowdead, int ndead,
-			   OffsetNumber *nowunused, int nunused,
-			   TransactionId latestRemovedXid)
-{
-	xl_heap_clean xlrec;
-	XLogRecPtr	recptr;
-
-	/* Caller should not call me on a non-WAL-logged relation */
-	Assert(RelationNeedsWAL(reln));
-
-	xlrec.latestRemovedXid = latestRemovedXid;
-	xlrec.nredirected = nredirected;
-	xlrec.ndead = ndead;
-
-	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, SizeOfHeapClean);
-
-	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-
-	/*
-	 * The OffsetNumber arrays are not actually in the buffer, but we pretend
-	 * that they are.  When XLogInsert stores the whole buffer, the offset
-	 * arrays need not be stored too.  Note that even if all three arrays are
-	 * empty, we want to expose the buffer as a candidate for whole-page
-	 * storage, since this record type implies a defragmentation operation
-	 * even if no line pointers changed state.
-	 */
-	if (nredirected > 0)
-		XLogRegisterBufData(0, (char *) redirected,
-							nredirected * sizeof(OffsetNumber) * 2);
-
-	if (ndead > 0)
-		XLogRegisterBufData(0, (char *) nowdead,
-							ndead * sizeof(OffsetNumber));
-
-	if (nunused > 0)
-		XLogRegisterBufData(0, (char *) nowunused,
-							nunused * sizeof(OffsetNumber));
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEAN);
-
-	return recptr;
-}
-
 /*
  * Perform XLogInsert for a heap-freeze operation.  Caller must have already
  * modified the buffer and marked it dirty.
@@ -8500,34 +8418,15 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed,
 }
 
 /*
- * Handles CLEANUP_INFO
+ * Handles XLOG_HEAP2_PRUNE record type.
+ *
+ * Acquires a super-exclusive lock.
  */
 static void
-heap_xlog_cleanup_info(XLogReaderState *record)
-{
-	xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
-
-	if (InHotStandby)
-		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
-
-	/*
-	 * Actual operation is a no-op. Record type exists to provide a means for
-	 * conflict processing to occur before we begin index vacuum actions. see
-	 * vacuumlazy.c and also comments in btvacuumpage()
-	 */
-
-	/* Backup blocks are not used in cleanup_info records */
-	Assert(!XLogRecHasAnyBlockRefs(record));
-}
-
-/*
- * Handles XLOG_HEAP2_CLEAN record type
- */
-static void
-heap_xlog_clean(XLogReaderState *record)
+heap_xlog_prune(XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
-	xl_heap_clean *xlrec = (xl_heap_clean *) XLogRecGetData(record);
+	xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record);
 	Buffer		buffer;
 	RelFileNode rnode;
 	BlockNumber blkno;
@@ -8538,12 +8437,8 @@ heap_xlog_clean(XLogReaderState *record)
 	/*
 	 * We're about to remove tuples. In Hot Standby mode, ensure that there's
 	 * no queries running for which the removed tuples are still visible.
-	 *
-	 * Not all HEAP2_CLEAN records remove tuples with xids, so we only want to
-	 * conflict on the records that cause MVCC failures for user queries. If
-	 * latestRemovedXid is invalid, skip conflict processing.
 	 */
-	if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
+	if (InHotStandby)
 		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
 
 	/*
@@ -8596,7 +8491,7 @@ heap_xlog_clean(XLogReaderState *record)
 		UnlockReleaseBuffer(buffer);
 
 		/*
-		 * After cleaning records from a page, it's useful to update the FSM
+		 * After pruning records from a page, it's useful to update the FSM
 		 * about it, as it may cause the page become target for insertions
 		 * later even if vacuum decides not to visit it (which is possible if
 		 * gets marked all-visible.)
@@ -8608,6 +8503,80 @@ heap_xlog_clean(XLogReaderState *record)
 	}
 }
 
+/*
+ * Handles XLOG_HEAP2_VACUUM record type.
+ *
+ * Acquires an exclusive lock only.
+ */
+static void
+heap_xlog_vacuum(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_heap_vacuum *xlrec = (xl_heap_vacuum *) XLogRecGetData(record);
+	Buffer		buffer;
+	BlockNumber blkno;
+	XLogRedoAction action;
+
+	/*
+	 * If we have a full-page image, restore it (without using a cleanup lock)
+	 * and we're done.
+	 */
+	action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, false,
+										   &buffer);
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		page = (Page) BufferGetPage(buffer);
+		OffsetNumber *nowunused;
+		Size		datalen;
+		OffsetNumber *offnum;
+
+		nowunused = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
+
+		/* Shouldn't be a record unless there's something to do */
+		Assert(xlrec->nunused > 0);
+
+		/* Update all now-unused line pointers */
+		offnum = nowunused;
+		for (int i = 0; i < xlrec->nunused; i++)
+		{
+			OffsetNumber off = *offnum++;
+			ItemId		lp = PageGetItemId(page, off);
+
+			Assert(ItemIdIsDead(lp));
+			ItemIdSetUnused(lp);
+		}
+
+		/*
+		 * Update the page's hint bit about whether it has free pointers
+		 */
+		PageSetHasFreeLinePointers(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+
+	if (BufferIsValid(buffer))
+	{
+		Size		freespace = PageGetHeapFreeSpace(BufferGetPage(buffer));
+		RelFileNode rnode;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * After vacuuming LP_DEAD items from a page, it's useful to update
+		 * the FSM about it, as it may cause the page become target for
+		 * insertions later even if vacuum decides not to visit it (which is
+		 * possible if gets marked all-visible.)
+		 *
+		 * Do this regardless of a full-page image being applied, since the
+		 * FSM data is not in the page anyway.
+		 */
+		XLogRecordPageWithFreeSpace(rnode, blkno, freespace);
+	}
+}
+
 /*
  * Replay XLOG_HEAP2_VISIBLE record.
  *
@@ -9712,15 +9681,15 @@ heap2_redo(XLogReaderState *record)
 
 	switch (info & XLOG_HEAP_OPMASK)
 	{
-		case XLOG_HEAP2_CLEAN:
-			heap_xlog_clean(record);
+		case XLOG_HEAP2_PRUNE:
+			heap_xlog_prune(record);
+			break;
+		case XLOG_HEAP2_VACUUM:
+			heap_xlog_vacuum(record);
 			break;
 		case XLOG_HEAP2_FREEZE_PAGE:
 			heap_xlog_freeze_page(record);
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			heap_xlog_cleanup_info(record);
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			heap_xlog_visible(record);
 			break;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406..f75502ca2c 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -182,13 +182,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		 */
 		if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
 		{
-			TransactionId ignore = InvalidTransactionId;	/* return value not
-															 * needed */
-
 			/* OK to prune */
 			(void) heap_page_prune(relation, buffer, vistest,
 								   limited_xmin, limited_ts,
-								   true, &ignore, NULL);
+								   true, NULL);
 		}
 
 		/* And release buffer lock */
@@ -213,8 +210,6 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * send its own new total to pgstats, and we don't want this delta applied
  * on top of that.)
  *
- * Sets latestRemovedXid for caller on return.
- *
  * off_loc is the offset location required by the caller to use in error
  * callback.
  *
@@ -225,7 +220,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 				GlobalVisState *vistest,
 				TransactionId old_snap_xmin,
 				TimestampTz old_snap_ts,
-				bool report_stats, TransactionId *latestRemovedXid,
+				bool report_stats,
 				OffsetNumber *off_loc)
 {
 	int			ndeleted = 0;
@@ -251,7 +246,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	prstate.old_snap_xmin = old_snap_xmin;
 	prstate.old_snap_ts = old_snap_ts;
 	prstate.old_snap_used = false;
-	prstate.latestRemovedXid = *latestRemovedXid;
+	prstate.latestRemovedXid = InvalidTransactionId;
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));
 
@@ -318,17 +313,41 @@ heap_page_prune(Relation relation, Buffer buffer,
 		MarkBufferDirty(buffer);
 
 		/*
-		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
+		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
 		if (RelationNeedsWAL(relation))
 		{
+			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
 
-			recptr = log_heap_clean(relation, buffer,
-									prstate.redirected, prstate.nredirected,
-									prstate.nowdead, prstate.ndead,
-									prstate.nowunused, prstate.nunused,
-									prstate.latestRemovedXid);
+			xlrec.latestRemovedXid = prstate.latestRemovedXid;
+			xlrec.nredirected = prstate.nredirected;
+			xlrec.ndead = prstate.ndead;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, SizeOfHeapPrune);
+
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+			/*
+			 * The OffsetNumber arrays are not actually in the buffer, but we
+			 * pretend that they are.  When XLogInsert stores the whole
+			 * buffer, the offset arrays need not be stored too.
+			 */
+			if (prstate.nredirected > 0)
+				XLogRegisterBufData(0, (char *) prstate.redirected,
+									prstate.nredirected *
+									sizeof(OffsetNumber) * 2);
+
+			if (prstate.ndead > 0)
+				XLogRegisterBufData(0, (char *) prstate.nowdead,
+									prstate.ndead * sizeof(OffsetNumber));
+
+			if (prstate.nunused > 0)
+				XLogRegisterBufData(0, (char *) prstate.nowunused,
+									prstate.nunused * sizeof(OffsetNumber));
+
+			recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE);
 
 			PageSetLSN(BufferGetPage(buffer), recptr);
 		}
@@ -363,8 +382,6 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (report_stats && ndeleted > prstate.ndead)
 		pgstat_update_heap_dead_tuples(relation, ndeleted - prstate.ndead);
 
-	*latestRemovedXid = prstate.latestRemovedXid;
-
 	/*
 	 * XXX Should we update the FSM information of this page ?
 	 *
@@ -809,12 +826,8 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
 
 /*
  * Perform the actual page changes needed by heap_page_prune.
- * It is expected that the caller has suitable pin and lock on the
- * buffer, and is inside a critical section.
- *
- * This is split out because it is also used by heap_xlog_clean()
- * to replay the WAL record when needed after a crash.  Note that the
- * arguments are identical to those of log_heap_clean().
+ * It is expected that the caller has a super-exclusive lock on the
+ * buffer.
  */
 void
 heap_page_prune_execute(Buffer buffer,
@@ -826,6 +839,9 @@ heap_page_prune_execute(Buffer buffer,
 	OffsetNumber *offnum;
 	int			i;
 
+	/* Shouldn't be called unless there's something to do */
+	Assert(nredirected > 0 || ndead > 0 || nunused > 0);
+
 	/* Update all redirected line pointers */
 	offnum = redirected;
 	for (i = 0; i < nredirected; i++)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9bebb94968..132cfcba16 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -310,7 +310,6 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	LVDeadTuples *dead_tuples;
 	int			num_index_scans;
-	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
 
 	/* Statistics about indexes */
@@ -789,39 +788,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	}
 }
 
-/*
- * For Hot Standby we need to know the highest transaction id that will
- * be removed by any change. VACUUM proceeds in a number of passes so
- * we need to consider how each pass operates. The first phase runs
- * heap_page_prune(), which can issue XLOG_HEAP2_CLEAN records as it
- * progresses - these will have a latestRemovedXid on each record.
- * In some cases this removes all of the tuples to be removed, though
- * often we have dead tuples with index pointers so we must remember them
- * for removal in phase 3. Index records for those rows are removed
- * in phase 2 and index blocks do not have MVCC information attached.
- * So before we can allow removal of any index tuples we need to issue
- * a WAL record containing the latestRemovedXid of rows that will be
- * removed in phase three. This allows recovery queries to block at the
- * correct place, i.e. before phase two, rather than during phase three
- * which would be after the rows have become inaccessible.
- */
-static void
-vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
-{
-	/*
-	 * Skip this for relations for which no WAL is to be written, or if we're
-	 * not trying to support archive recovery.
-	 */
-	if (!RelationNeedsWAL(rel) || !XLogIsNeeded())
-		return;
-
-	/*
-	 * No need to write the record at all unless it contains a valid value
-	 */
-	if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
-		(void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
-}
-
 /*
  * Handle new page during lazy_scan_heap().
  *
@@ -914,28 +880,50 @@ scan_empty_page(Relation onerel, Buffer buf, Buffer vmbuffer,
  *	scan_prune_page() -- lazy_scan_heap() pruning and freezing.
  *
  * Caller must hold pin and buffer cleanup lock on the buffer.
+ *
+ * Prior to PostgreSQL 14 there were very rare cases where lazy_scan_heap()
+ * treated tuples that still had storage after pruning as DEAD.  That happened
+ * when heap_page_prune() could not prune tuples that were nevertheless deemed
+ * DEAD by its own HeapTupleSatisfiesVacuum() call.  This created rare hard to
+ * test cases.  It meant that there was no very sharp distinction between DEAD
+ * tuples and tuples that are to be kept and be considered for freezing inside
+ * heap_prepare_freeze_tuple().  It also meant that lazy_vacuum_page() had to
+ * be prepared to remove items with storage (tuples with tuple headers) that
+ * didn't get pruned, which created a special case to handle recovery
+ * conflicts.
+ *
+ * The approach we take here now (to eliminate all of this complexity) is to
+ * simply restart pruning in these very rare cases -- cases where a concurrent
+ * abort of an xact makes our HeapTupleSatisfiesVacuum() call disagree with
+ * what heap_page_prune() thought about the tuple only microseconds earlier.
+ *
+ * Since we might have to prune a second time here, the code is structured to
+ * use a local per-page copy of the counters that caller accumulates.  We add
+ * our per-page counters to the per-VACUUM totals from caller last of all, to
+ * avoid double counting.
  */
 static void
 scan_prune_page(Relation onerel, Buffer buf,
 				LVRelStats *vacrelstats,
 				GlobalVisState *vistest, xl_heap_freeze_tuple *frozen,
 				LVTempCounters *c, LVPrunePageState *ps,
-				LVVisMapPageState *vms,
-				VacOptTernaryValue index_cleanup)
+				LVVisMapPageState *vms)
 {
 	BlockNumber blkno;
 	Page		page;
 	OffsetNumber offnum,
 				maxoff;
+	HTSV_Result tuplestate;
 	int			nfrozen,
 				ndead;
 	LVTempCounters pc;
 	OffsetNumber deaditems[MaxHeapTuplesPerPage];
-	bool		tupgone;
 
 	blkno = BufferGetBlockNumber(buf);
 	page = BufferGetPage(buf);
 
+retry:
+
 	/* Initialize (or reset) page-level counters */
 	pc.num_tuples = 0;
 	pc.live_tuples = 0;
@@ -951,12 +939,14 @@ scan_prune_page(Relation onerel, Buffer buf,
 	 */
 	pc.tups_vacuumed = heap_page_prune(onerel, buf, vistest,
 									   InvalidTransactionId, 0, false,
-									   &vacrelstats->latestRemovedXid,
 									   &vacrelstats->offnum);
 
 	/*
 	 * Now scan the page to collect vacuumable items and check for tuples
 	 * requiring freezing.
+	 *
+	 * Note: If we retry having set vms.visibility_cutoff_xid it doesn't
+	 * matter -- the newest XMIN on page can't be missed this way.
 	 */
 	ps->hastup = false;
 	ps->has_dead_items = false;
@@ -966,7 +956,14 @@ scan_prune_page(Relation onerel, Buffer buf,
 	ndead = 0;
 	maxoff = PageGetMaxOffsetNumber(page);
 
-	tupgone = false;
+#ifdef DEBUG
+
+	/*
+	 * Enable this to debug the retry logic -- it's actually quite hard to hit
+	 * even with this artificial delay
+	 */
+	pg_usleep(10000);
+#endif
 
 	/*
 	 * Note: If you change anything in the loop below, also look at
@@ -978,6 +975,7 @@ scan_prune_page(Relation onerel, Buffer buf,
 	{
 		ItemId		itemid;
 		HeapTupleData tuple;
+		bool		tuple_totally_frozen;
 
 		/*
 		 * Set the offset number so that we can display it along with any
@@ -1026,6 +1024,17 @@ scan_prune_page(Relation onerel, Buffer buf,
 		tuple.t_len = ItemIdGetLength(itemid);
 		tuple.t_tableOid = RelationGetRelid(onerel);
 
+		/*
+		 * DEAD tuples are almost always pruned into LP_DEAD line pointers by
+		 * heap_page_prune(), but it's possible that the tuple state changed
+		 * since heap_page_prune() looked.  Handle that here by restarting.
+		 * (See comments at the top of function for a full explanation.)
+		 */
+		tuplestate = HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf);
+
+		if (unlikely(tuplestate == HEAPTUPLE_DEAD))
+			goto retry;
+
 		/*
 		 * The criteria for counting a tuple as live in this block need to
 		 * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
@@ -1036,42 +1045,8 @@ scan_prune_page(Relation onerel, Buffer buf,
 		 * VACUUM can't run inside a transaction block, which makes some cases
 		 * impossible (e.g. in-progress insert from the same transaction).
 		 */
-		switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+		switch (tuplestate)
 		{
-			case HEAPTUPLE_DEAD:
-
-				/*
-				 * Ordinarily, DEAD tuples would have been removed by
-				 * heap_page_prune(), but it's possible that the tuple state
-				 * changed since heap_page_prune() looked.  In particular an
-				 * INSERT_IN_PROGRESS tuple could have changed to DEAD if the
-				 * inserter aborted.  So this cannot be considered an error
-				 * condition.
-				 *
-				 * If the tuple is HOT-updated then it must only be removed by
-				 * a prune operation; so we keep it just as if it were
-				 * RECENTLY_DEAD.  Also, if it's a heap-only tuple, we choose
-				 * to keep it, because it'll be a lot cheaper to get rid of it
-				 * in the next pruning pass than to treat it like an indexed
-				 * tuple. Finally, if index cleanup is disabled, the second
-				 * heap pass will not execute, and the tuple will not get
-				 * removed, so we must treat it like any other dead tuple that
-				 * we choose to keep.
-				 *
-				 * If this were to happen for a tuple that actually needed to
-				 * be deleted, we'd be in trouble, because it'd possibly leave
-				 * a tuple below the relation's xmin horizon alive.
-				 * heap_prepare_freeze_tuple() is prepared to detect that case
-				 * and abort the transaction, preventing corruption.
-				 */
-				if (HeapTupleIsHotUpdated(&tuple) ||
-					HeapTupleIsHeapOnly(&tuple) ||
-					index_cleanup == VACOPT_TERNARY_DISABLED)
-					pc.nkeep += 1;
-				else
-					tupgone = true; /* we can delete the tuple */
-				ps->all_visible = false;
-				break;
 			case HEAPTUPLE_LIVE:
 
 				/*
@@ -1152,35 +1127,22 @@ scan_prune_page(Relation onerel, Buffer buf,
 				break;
 		}
 
-		if (tupgone)
-		{
-			deaditems[ndead++] = offnum;
-			HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-												   &vacrelstats->latestRemovedXid);
-			pc.tups_vacuumed += 1;
-			ps->has_dead_items = true;
-		}
-		else
-		{
-			bool		tuple_totally_frozen;
+		/*
+		 * Each non-removable tuple must be checked to see if it needs
+		 * freezing
+		 */
+		if (heap_prepare_freeze_tuple(tuple.t_data,
+									  RelFrozenXid, RelMinMxid,
+									  FreezeLimit, MultiXactCutoff,
+									  &frozen[nfrozen],
+									  &tuple_totally_frozen))
+			frozen[nfrozen++].offset = offnum;
 
-			/*
-			 * Each non-removable tuple must be checked to see if it needs
-			 * freezing
-			 */
-			if (heap_prepare_freeze_tuple(tuple.t_data,
-										  RelFrozenXid, RelMinMxid,
-										  FreezeLimit, MultiXactCutoff,
-										  &frozen[nfrozen],
-										  &tuple_totally_frozen))
-				frozen[nfrozen++].offset = offnum;
+		pc.num_tuples += 1;
+		ps->hastup = true;
 
-			pc.num_tuples += 1;
-			ps->hastup = true;
-
-			if (!tuple_totally_frozen)
-				ps->all_frozen = false;
-		}
+		if (!tuple_totally_frozen)
+			ps->all_frozen = false;
 	}
 
 	/*
@@ -1813,7 +1775,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * tuple headers left behind following pruning.
 		 */
 		scan_prune_page(onerel, buf, vacrelstats, vistest, frozen,
-						&c, &ps, &vms, params->index_cleanup);
+						&c, &ps, &vms);
 
 		/*
 		 * Step 7 for block: Set up details for saving free space in FSM at
@@ -2079,6 +2041,11 @@ lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
  *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
  *
  * Should only be called through lazy_vacuum_pruned_items().
+ *
+ * We don't need a latestRemovedXid value for recovery conflicts here -- we
+ * rely on conflicts from heap pruning instead (i.e. a heap_page_prune() call
+ * that took place earlier, usually though not always during the ongoing
+ * VACUUM operation).
  */
 static void
 lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
@@ -2088,9 +2055,6 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 	Assert(!IsParallelWorker());
 	Assert(nindexes > 0);
 
-	/* Log cleanup info before we touch indexes */
-	vacuum_log_cleanup_info(onerel, vacrelstats);
-
 	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
@@ -2135,6 +2099,11 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
  *		lazy_scan_heap are not visited at all.
  *
  * Should only be called through lazy_vacuum_pruned_items().
+ *
+ * We don't need a latestRemovedXid value for recovery conflicts here -- we
+ * rely on conflicts from heap pruning instead (i.e. a heap_page_prune() call
+ * that took place earlier, usually though not always during the ongoing
+ * VACUUM operation).
  */
 static void
 lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
@@ -2170,12 +2139,7 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 		vacrelstats->blkno = tblk;
 		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
 								 vac_strategy);
-		if (!ConditionalLockBufferForCleanup(buf))
-		{
-			ReleaseBuffer(buf);
-			++tupindex;
-			continue;
-		}
+		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 		tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
 									&vmbuffer);
 
@@ -2208,14 +2172,25 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 }
 
 /*
- *	lazy_vacuum_page() -- free dead tuples on a page
- *					 and repair its fragmentation.
+ *	lazy_vacuum_page() -- free page's LP_DEAD items listed in the
+ *					 vacrelstats->dead_tuples array.
  *
- * Caller must hold pin and buffer cleanup lock on the buffer.
+ * Caller must have an exclusive buffer lock on the buffer (though a
+ * super-exclusive lock is also acceptable).
  *
  * tupindex is the index in vacrelstats->dead_tuples of the first dead
  * tuple for this page.  We assume the rest follow sequentially.
  * The return value is the first tupindex after the tuples of this page.
+ *
+ * Prior to PostgreSQL 14 there were rare cases where this routine had to set
+ * tuples with storage to unused.  These days it is strictly responsible for
+ * marking LP_DEAD stub line pointers as unused.  This only happens for those
+ * LP_DEAD items on the page that were determined to be LP_DEAD items back
+ * when the same heap page was visited by scan_prune_page() (i.e. those whose
+ * TID was recorded in the dead_tuples array).
+ *
+ * We cannot defragment the page here because that isn't safe while only
+ * holding an exclusive lock.
  */
 static int
 lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
@@ -2248,11 +2223,15 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 			break;				/* past end of tuples for this block */
 		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
 		itemid = PageGetItemId(page, toff);
+
+		Assert(ItemIdIsDead(itemid));
 		ItemIdSetUnused(itemid);
 		unused[uncnt++] = toff;
 	}
 
-	PageRepairFragmentation(page);
+	Assert(uncnt > 0);
+
+	PageSetHasFreeLinePointers(page);
 
 	/*
 	 * Mark buffer dirty before we write WAL.
@@ -2262,12 +2241,19 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	/* XLOG stuff */
 	if (RelationNeedsWAL(onerel))
 	{
+		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
 
-		recptr = log_heap_clean(onerel, buffer,
-								NULL, 0, NULL, 0,
-								unused, uncnt,
-								vacrelstats->latestRemovedXid);
+		xlrec.nunused = uncnt;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHeapVacuum);
+
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) unused, uncnt * sizeof(OffsetNumber));
+
+		recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VACUUM);
+
 		PageSetLSN(page, recptr);
 	}
 
@@ -2280,10 +2266,10 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	END_CRIT_SECTION();
 
 	/*
-	 * Now that we have removed the dead tuples from the page, once again
+	 * Now that we have removed the LP_DEAD items from the page, once again
 	 * check if the page has become all-visible.  The page is already marked
 	 * dirty, exclusively locked, and, if needed, a full page image has been
-	 * emitted in the log_heap_clean() above.
+	 * emitted.
 	 */
 	if (heap_page_is_all_visible(onerel, buffer, vacrelstats,
 								 &visibility_cutoff_xid,
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 9282c9ea22..1360ab80c1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1213,10 +1213,10 @@ backtrack:
 				 * as long as the callback function only considers whether the
 				 * index tuple refers to pre-cutoff heap tuples that were
 				 * certainly already pruned away during VACUUM's initial heap
-				 * scan by the time we get here. (heapam's XLOG_HEAP2_CLEAN
-				 * and XLOG_HEAP2_CLEANUP_INFO records produce conflicts using
-				 * a latestRemovedXid value for the pointed-to heap tuples, so
-				 * there is no need to produce our own conflict now.)
+				 * scan by the time we get here. (heapam's XLOG_HEAP2_PRUNE
+				 * records produce conflicts using a latestRemovedXid value
+				 * for the pointed-to heap tuples, so there is no need to
+				 * produce our own conflict now.)
 				 *
 				 * Backends with snapshots acquired after a VACUUM starts but
 				 * before it finishes could have visibility cutoff with a
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index e60e32b935..f8b4fb901b 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -121,11 +121,21 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
 	info &= XLOG_HEAP_OPMASK;
-	if (info == XLOG_HEAP2_CLEAN)
+	if (info == XLOG_HEAP2_PRUNE)
 	{
-		xl_heap_clean *xlrec = (xl_heap_clean *) rec;
+		xl_heap_prune *xlrec = (xl_heap_prune *) rec;
 
-		appendStringInfo(buf, "latestRemovedXid %u", xlrec->latestRemovedXid);
+		/* XXX Should display implicit 'nunused' field, too */
+		appendStringInfo(buf, "latestRemovedXid %u nredirected %u ndead %u",
+						 xlrec->latestRemovedXid,
+						 xlrec->nredirected,
+						 xlrec->ndead);
+	}
+	else if (info == XLOG_HEAP2_VACUUM)
+	{
+		xl_heap_vacuum *xlrec = (xl_heap_vacuum *) rec;
+
+		appendStringInfo(buf, "nunused %u", xlrec->nunused);
 	}
 	else if (info == XLOG_HEAP2_FREEZE_PAGE)
 	{
@@ -134,12 +144,6 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "cutoff xid %u ntuples %u",
 						 xlrec->cutoff_xid, xlrec->ntuples);
 	}
-	else if (info == XLOG_HEAP2_CLEANUP_INFO)
-	{
-		xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
-
-		appendStringInfo(buf, "latestRemovedXid %u", xlrec->latestRemovedXid);
-	}
 	else if (info == XLOG_HEAP2_VISIBLE)
 	{
 		xl_heap_visible *xlrec = (xl_heap_visible *) rec;
@@ -229,15 +233,15 @@ heap2_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_HEAP2_CLEAN:
-			id = "CLEAN";
+		case XLOG_HEAP2_PRUNE:
+			id = "PRUNE";
+			break;
+		case XLOG_HEAP2_VACUUM:
+			id = "VACUUM";
 			break;
 		case XLOG_HEAP2_FREEZE_PAGE:
 			id = "FREEZE_PAGE";
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			id = "CLEANUP_INFO";
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			id = "VISIBLE";
 			break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f596135b1..391caf7396 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -480,8 +480,8 @@ DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 * interested in.
 			 */
 		case XLOG_HEAP2_FREEZE_PAGE:
-		case XLOG_HEAP2_CLEAN:
-		case XLOG_HEAP2_CLEANUP_INFO:
+		case XLOG_HEAP2_PRUNE:
+		case XLOG_HEAP2_VACUUM:
 		case XLOG_HEAP2_VISIBLE:
 		case XLOG_HEAP2_LOCK_UPDATED:
 			break;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 9ac556b4ae..0c4c07503a 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -250,14 +250,18 @@ PageAddItemExtended(Page page,
 		/* if no free slot, we'll put it at limit (1st open slot) */
 		if (PageHasFreeLinePointers(phdr))
 		{
-			/*
-			 * Look for "recyclable" (unused) ItemId.  We check for no storage
-			 * as well, just to be paranoid --- unused items should never have
-			 * storage.
-			 */
+			/* Look for "recyclable" (unused) ItemId */
 			for (offsetNumber = 1; offsetNumber < limit; offsetNumber++)
 			{
 				itemId = PageGetItemId(phdr, offsetNumber);
+
+				/*
+				 * We check for no storage as well, just to be paranoid;
+				 * unused items should never have storage.  Assert() that the
+				 * invariant is respected too.
+				 */
+				Assert(ItemIdIsUsed(itemId) || !ItemIdHasStorage(itemId));
+
 				if (!ItemIdIsUsed(itemId) && !ItemIdHasStorage(itemId))
 					break;
 			}
@@ -676,7 +680,9 @@ compactify_tuples(itemIdCompact itemidbase, int nitems, Page page, bool presorte
  *
  * This routine is usable for heap pages only, but see PageIndexMultiDelete.
  *
- * As a side effect, the page's PD_HAS_FREE_LINES hint bit is updated.
+ * Caller had better have a super-exclusive lock on page's buffer.  As a side
+ * effect, the page's PD_HAS_FREE_LINES hint bit is updated in cases where our
+ * caller (the heap prune code) sets one or more line pointers unused.
  */
 void
 PageRepairFragmentation(Page page)
@@ -771,7 +777,7 @@ PageRepairFragmentation(Page page)
 		compactify_tuples(itemidbase, nstorage, page, presorted);
 	}
 
-	/* Set hint bit for PageAddItem */
+	/* Set hint bit for PageAddItemExtended */
 	if (nunused > 0)
 		PageSetHasFreeLinePointers(page);
 	else
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1d1d5d2f0e..adf7c42a03 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3555,8 +3555,6 @@ xl_hash_split_complete
 xl_hash_squeeze_page
 xl_hash_update_meta_page
 xl_hash_vacuum_one_page
-xl_heap_clean
-xl_heap_cleanup_info
 xl_heap_confirm
 xl_heap_delete
 xl_heap_freeze_page
@@ -3568,9 +3566,11 @@ xl_heap_lock
 xl_heap_lock_updated
 xl_heap_multi_insert
 xl_heap_new_cid
+xl_heap_prune
 xl_heap_rewrite_mapping
 xl_heap_truncate
 xl_heap_update
+xl_heap_vacuum
 xl_heap_visible
 xl_invalid_page
 xl_invalid_page_key
-- 
2.27.0

v5-0003-Skip-index-vacuuming-dynamically.patchapplication/x-patch; name=v5-0003-Skip-index-vacuuming-dynamically.patchDownload
From 979b6081f4595c605c75beb36ec7f789dd0bad0e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 19 Mar 2021 14:51:44 -0700
Subject: [PATCH v5 3/3] Skip index vacuuming dynamically.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAD21AoAtZb4+HJT_8RoOXvu4HM-Zd4HKS3YSMCH6+-W=bDyh-w@mail.gmail.com
---
 src/include/commands/vacuum.h          |   6 +-
 src/include/utils/rel.h                |  10 +-
 src/backend/access/common/reloptions.c |  39 ++++++--
 src/backend/access/heap/vacuumlazy.c   | 133 ++++++++++++++++++++-----
 src/backend/commands/vacuum.c          |  33 ++++--
 5 files changed, 172 insertions(+), 49 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d029da5ac0..4885bbb44c 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -21,6 +21,7 @@
 #include "parser/parse_node.h"
 #include "storage/buf.h"
 #include "storage/lock.h"
+#include "utils/rel.h"
 #include "utils/relcache.h"
 
 /*
@@ -216,8 +217,9 @@ typedef struct VacuumParams
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which  verbose logs are activated, -1
 									 * to use default */
-	VacOptTernaryValue index_cleanup;	/* Do index vacuum and cleanup,
-										 * default value depends on reloptions */
+	VacOptIndexCleanupValue index_cleanup;	/* Do index vacuum and cleanup,
+											 * default value depends on
+											 * reloptions */
 	VacOptTernaryValue truncate;	/* Truncate empty pages at the end,
 									 * default value depends on reloptions */
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8eee1c1a83..8040bf76db 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -295,6 +295,13 @@ typedef struct AutoVacOpts
 	float8		analyze_scale_factor;
 } AutoVacOpts;
 
+typedef enum VacOptIndexCleanupValue
+{
+	VACOPT_CLEANUP_AUTO = 0,
+	VACOPT_CLEANUP_DISABLED,
+	VACOPT_CLEANUP_ENABLED
+} VacOptIndexCleanupValue;
+
 typedef struct StdRdOptions
 {
 	int32		vl_len_;		/* varlena header (do not touch directly!) */
@@ -304,7 +311,8 @@ typedef struct StdRdOptions
 	AutoVacOpts autovacuum;		/* autovacuum-related options */
 	bool		user_catalog_table; /* use as an additional catalog relation */
 	int			parallel_workers;	/* max number of parallel workers */
-	bool		vacuum_index_cleanup;	/* enables index vacuuming and cleanup */
+	VacOptIndexCleanupValue vacuum_index_cleanup;	/* enables index vacuuming
+													 * and cleanup */
 	bool		vacuum_truncate;	/* enables vacuum to truncate a relation */
 	bool		parallel_insert_enabled;	/* enables planner's use of
 											 * parallel insert */
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 5a0ae99750..282978a310 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -140,15 +140,6 @@ static relopt_bool boolRelOpts[] =
 		},
 		false
 	},
-	{
-		{
-			"vacuum_index_cleanup",
-			"Enables index vacuuming and index cleanup",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
-			ShareUpdateExclusiveLock
-		},
-		true
-	},
 	{
 		{
 			"vacuum_truncate",
@@ -501,6 +492,23 @@ relopt_enum_elt_def viewCheckOptValues[] =
 	{(const char *) NULL}		/* list terminator */
 };
 
+/*
+ * Values of VacOptIndexCleanupValue, for the vacuum_index_cleanup
+ * reloption.  Accepting boolean values other than "on" and "off" is
+ * for backward compatibility, as the option used to be a plain
+ * boolean.
+ */
+relopt_enum_elt_def vacOptTernaryOptValues[] =
+{
+	{"auto", VACOPT_CLEANUP_AUTO},
+	{"true", VACOPT_CLEANUP_ENABLED},
+	{"false", VACOPT_CLEANUP_DISABLED},
+	{"on", VACOPT_CLEANUP_ENABLED},
+	{"off", VACOPT_CLEANUP_DISABLED},
+	{"1", VACOPT_CLEANUP_ENABLED},
+	{"0", VACOPT_CLEANUP_DISABLED}
+};
+
 static relopt_enum enumRelOpts[] =
 {
 	{
@@ -525,6 +533,17 @@ static relopt_enum enumRelOpts[] =
 		VIEW_OPTION_CHECK_OPTION_NOT_SET,
 		gettext_noop("Valid values are \"local\" and \"cascaded\".")
 	},
+	{
+		{
+			"vacuum_index_cleanup",
+			"Enables index vacuuming and index cleanup",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		vacOptTernaryOptValues,
+		VACOPT_CLEANUP_AUTO,
+		gettext_noop("Valid values are \"on\", \"off\", and \"auto\".")
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1865,7 +1884,7 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, user_catalog_table)},
 		{"parallel_workers", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, parallel_workers)},
-		{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
+		{"vacuum_index_cleanup", RELOPT_TYPE_ENUM,
 		offsetof(StdRdOptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
 		offsetof(StdRdOptions, vacuum_truncate)},
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 132cfcba16..27a1e4c74f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -131,6 +131,12 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+/*
+ * Threshold (as a fraction of all heap blocks) of heap blocks that have at
+ * least one LP_DEAD line pointer, above which index vacuuming goes ahead.
+ */
+#define SKIP_VACUUM_PAGES_RATIO		0.01
+
 /*
  * DSM keys for parallel vacuum.  Unlike other parallel execution code, since
  * we don't need to worry about DSM keys conflicting with plan_node_id we can
@@ -385,8 +391,10 @@ static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 						   bool aggressive);
 static void lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
 									 Relation *Irel, int nindexes,
-									 LVParallelState* lps,
-									 VacOptTernaryValue index_cleanup);
+									 LVParallelState *lps,
+									 VacOptIndexCleanupValue index_cleanup,
+									 BlockNumber has_dead_items_pages,
+									 bool onecall);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
@@ -486,7 +494,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	PgStat_Counter startwritetime = 0;
 
 	Assert(params != NULL);
-	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
 	/* measure elapsed time iff autovacuum logging requires it */
@@ -1349,7 +1356,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				next_fsm_block_to_vacuum;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		skipping_blocks;
+	bool		skipping_blocks,
+				have_vacuumed_indexes = false;
 	xl_heap_freeze_tuple *frozen;
 	StringInfoData buf;
 	const int	initprog_index[] = {
@@ -1363,7 +1371,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 	/* Counters of # blocks in onerel: */
 	BlockNumber empty_pages,
-				vacuumed_pages;
+				vacuumed_pages,
+				has_dead_items_pages;
 
 	pg_rusage_init(&ru0);
 
@@ -1378,7 +1387,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 						vacrelstats->relnamespace,
 						vacrelstats->relname)));
 
-	empty_pages = vacuumed_pages = 0;
+	empty_pages = vacuumed_pages = has_dead_items_pages = 0;
 
 	/* Initialize counters */
 	c.num_tuples = 0;
@@ -1638,9 +1647,18 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				vmbuffer = InvalidBuffer;
 			}
 
+			/*
+			 * Definitely won't be skipping index vacuuming due to finding
+			 * very few dead items during this VACUUM operation -- that's only
+			 * something that lazy_vacuum_pruned_items() is willing to do when
+			 * it is only called once during the entire VACUUM operation.
+			 */
+			have_vacuumed_indexes = true;
+
 			/* Remove the collected garbage tuples from table and indexes */
 			lazy_vacuum_pruned_items(onerel, vacrelstats, Irel, nindexes, lps,
-									 params->index_cleanup);
+									 params->index_cleanup,
+									 has_dead_items_pages, false);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
@@ -1777,6 +1795,17 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		scan_prune_page(onerel, buf, vacrelstats, vistest, frozen,
 						&c, &ps, &vms);
 
+		/*
+		 * Remember the number of pages having at least one LP_DEAD line
+		 * pointer.  This could be from this VACUUM, a previous VACUUM, or
+		 * even opportunistic pruning.  Note that this is exactly the same
+		 * thing as having items that are stored in dead_tuples space, because
+		 * scan_prune_page() doesn't count anything other than LP_DEAD items
+		 * as dead (as of PostgreSQL 14).
+		 */
+		if (ps.has_dead_items)
+			has_dead_items_pages++;
+
 		/*
 		 * Step 7 for block: Set up details for saving free space in FSM at
 		 * end of loop.  (Also performs extra single pass strategy steps in
@@ -1791,9 +1820,18 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		savefreespace = false;
 		freespace = 0;
 		if (nindexes > 0 && ps.has_dead_items &&
-			params->index_cleanup != VACOPT_TERNARY_DISABLED)
+			params->index_cleanup != VACOPT_CLEANUP_DISABLED)
 		{
-			/* Wait until lazy_vacuum_heap() to save free space */
+			/*
+			 * Wait until lazy_vacuum_heap() to save free space.
+			 *
+			 * Note: It's not in fact 100% certain that we really will call
+			 * lazy_vacuum_heap() in INDEX_CLEANUP = AUTO case (which is the
+			 * common case) -- lazy_vacuum_pruned_items() might opt to skip
+			 * index vacuuming (and so must skip heap vacuuming).  This is
+			 * deemed okay, because there can't be very much free space when
+			 * this happens.
+			 */
 		}
 		else
 		{
@@ -1905,7 +1943,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	Assert(nindexes > 0 || dead_tuples->num_tuples == 0);
 	if (dead_tuples->num_tuples > 0)
 		lazy_vacuum_pruned_items(onerel, vacrelstats, Irel, nindexes, lps,
-								 params->index_cleanup);
+								 params->index_cleanup, has_dead_items_pages,
+								 !have_vacuumed_indexes);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
@@ -1920,10 +1959,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/*
 	 * Do post-vacuum cleanup.
 	 *
-	 * Note that post-vacuum cleanup does not take place with
-	 * INDEX_CLEANUP=OFF.
+	 * Note that post-vacuum cleanup is supposed to take place even when
+	 * lazy_vacuum_pruned_items() decided to skip index vacuuming, but not
+	 * with INDEX_CLEANUP=OFF.
 	 */
-	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
+	if (nindexes > 0 && params->index_cleanup != VACOPT_CLEANUP_DISABLED)
 		lazy_cleanup_all_indexes(Irel, vacrelstats, lps, nindexes);
 
 	/*
@@ -1936,10 +1976,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/*
 	 * Update index statistics.
 	 *
-	 * Note that updating the statistics does not take place with
-	 * INDEX_CLEANUP=OFF.
+	 * Note that updating the statistics takes place even when
+	 * lazy_vacuum_pruned_items() decided to skip index vacuuming, but not
+	 * with INDEX_CLEANUP=OFF.
+	 *
+	 * (In practice most index AMs won't have accurate statistics from
+	 * cleanup, but the index AM API allows them to, so we must check.)
 	 */
-	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
+	if (nindexes > 0 && params->index_cleanup != VACOPT_CLEANUP_DISABLED)
 		update_index_statistics(Irel, vacrelstats->indstats, nindexes);
 
 	/*
@@ -1985,12 +2029,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 /*
  * Remove the collected garbage tuples from the table and its indexes.
  *
- * We may be required to skip index vacuuming by INDEX_CLEANUP reloption.
+ * We may be able to skip index vacuuming (we may even be required to do so by
+ * reloption)
  */
 static void
 lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
 						 Relation *Irel, int nindexes, LVParallelState *lps,
-						 VacOptTernaryValue index_cleanup)
+						 VacOptIndexCleanupValue index_cleanup,
+						 BlockNumber has_dead_items_pages, bool onecall)
 {
 	bool		skipping;
 
@@ -1998,11 +2044,40 @@ lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
 	Assert(nindexes > 0);
 	Assert(!IsParallelWorker());
 
-	/* Check whether or not to do index vacuum and heap vacuum */
-	if (index_cleanup == VACOPT_TERNARY_DISABLED)
+	/*
+	 * Check whether or not to do index vacuum and heap vacuum.
+	 *
+	 * We do both index vacuum and heap vacuum if more than
+	 * SKIP_VACUUM_PAGES_RATIO of all heap pages have at least one LP_DEAD
+	 * line pointer.  Otherwise we skip both; that is normally a case where
+	 * dead tuples on the heap are highly concentrated in relatively few
+	 * heap blocks, where index deletion mechanisms that are clever about
+	 * heap block dead tuple concentrations (such as btree's bottom-up
+	 * deletion) work well.  Also, since only a few heap blocks could be
+	 * cleaned anyway, skipping costs little in terms of visibility map updates.
+	 */
+	if (index_cleanup == VACOPT_CLEANUP_DISABLED)
 		skipping = true;
-	else
+	else if (index_cleanup == VACOPT_CLEANUP_ENABLED)
 		skipping = false;
+	else if (!onecall)
+		skipping = false;
+	else
+	{
+		BlockNumber rel_pages_threshold;
+
+		Assert(onecall);
+		Assert(vacrelstats->num_index_scans == 0);
+		Assert(index_cleanup == VACOPT_CLEANUP_AUTO);
+
+		rel_pages_threshold =
+			(double) vacrelstats->rel_pages * SKIP_VACUUM_PAGES_RATIO;
+
+		if (has_dead_items_pages < rel_pages_threshold)
+			skipping = true;
+		else
+			skipping = false;
+	}
 
 	if (!skipping)
 	{
@@ -2024,10 +2099,18 @@ lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
 		 * the similar "nindexes == 0" specific ereport() at the end of
 		 * lazy_scan_heap().
 		 */
-		ereport(elevel,
-				(errmsg("\"%s\": INDEX_CLEANUP off forced VACUUM to not totally remove %d pruned items",
-						vacrelstats->relname,
-						vacrelstats->dead_tuples->num_tuples)));
+		if (index_cleanup == VACOPT_CLEANUP_AUTO)
+			ereport(elevel,
+					(errmsg("\"%s\": opted to not totally remove %d pruned items in %u pages",
+							vacrelstats->relname,
+							vacrelstats->dead_tuples->num_tuples,
+							has_dead_items_pages)));
+		else
+			ereport(elevel,
+					(errmsg("\"%s\": INDEX_CLEANUP off forced VACUUM to not totally remove %d pruned items in %u pages",
+							vacrelstats->relname,
+							vacrelstats->dead_tuples->num_tuples,
+							has_dead_items_pages)));
 	}
 
 	/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c064352e23..0d3aece45b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -108,7 +108,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 	ListCell   *lc;
 
 	/* Set default value */
-	params.index_cleanup = VACOPT_TERNARY_DEFAULT;
+	params.index_cleanup = VACOPT_CLEANUP_AUTO;
 	params.truncate = VACOPT_TERNARY_DEFAULT;
 
 	/* By default parallel vacuum is enabled */
@@ -140,7 +140,14 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		else if (strcmp(opt->defname, "disable_page_skipping") == 0)
 			disable_page_skipping = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "index_cleanup") == 0)
-			params.index_cleanup = get_vacopt_ternary_value(opt);
+		{
+			if (opt->arg == NULL || strcmp(defGetString(opt), "auto") == 0)
+				params.index_cleanup = VACOPT_CLEANUP_AUTO;
+			else if (defGetBoolean(opt))
+				params.index_cleanup = VACOPT_CLEANUP_ENABLED;
+			else
+				params.index_cleanup = VACOPT_CLEANUP_DISABLED;
+		}
 		else if (strcmp(opt->defname, "process_toast") == 0)
 			process_toast = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "truncate") == 0)
@@ -1880,15 +1887,19 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	onerelid = onerel->rd_lockInfo.lockRelId;
 	LockRelationIdForSession(&onerelid, lmode);
 
-	/* Set index cleanup option based on reloptions if not yet */
-	if (params->index_cleanup == VACOPT_TERNARY_DEFAULT)
-	{
-		if (onerel->rd_options == NULL ||
-			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
-			params->index_cleanup = VACOPT_TERNARY_ENABLED;
-		else
-			params->index_cleanup = VACOPT_TERNARY_DISABLED;
-	}
+	/*
+	 * Set index cleanup option based on reloptions if not set to either ON or
+	 * OFF.  Note that a VACUUM(INDEX_CLEANUP=AUTO) command is interpreted
+	 * as "prefer the reloption, but if it is not set, dynamically determine
+	 * whether index vacuuming and cleanup take place" in vacuumlazy.c.  Note
+	 * also that the reloption might be explicitly set to AUTO.
+	 *
+	 * XXX: Do we really want that?
+	 */
+	if (params->index_cleanup == VACOPT_CLEANUP_AUTO &&
+		onerel->rd_options != NULL)
+		params->index_cleanup =
+			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup;
 
 	/* Set truncate option based on reloptions if not yet */
 	if (params->truncate == VACOPT_TERNARY_DEFAULT)
-- 
2.27.0

#86Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#85)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Mar 24, 2021 at 11:44 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Mar 23, 2021 at 4:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Here are review comments on 0003 patch:

Attached is a new revision, v5. It fixes bit rot caused by recent
changes (your index autovacuum logging stuff). It has also been
cleaned up in response to your recent review comments -- both from
this email, and the other review email that I responded to separately
today.

+    * If we skip vacuum, we just ignore the collected dead tuples.  Note that
+    * vacrelstats->dead_tuples could have tuples which became dead after
+    * HOT-pruning but are not marked dead yet.  We do not process them
+    * because it's a very rare condition, and the next vacuum will process
+    * them anyway.
+    */

The second paragraph is no longer true after removing the 'tupgone' case.

Fixed.

Maybe we can use vacrelstats->num_index_scans instead of
calledtwopass? When calling to two_pass_strategy() at the end of
lazy_scan_heap(), if vacrelstats->num_index_scans is 0 it means this
is the first time call, which is equivalent to calledtwopass = false.

It's true that when "vacrelstats->num_index_scans > 0" it definitely
can't have been the first call. But how can we distinguish between 1.)
the case where we're being called for the first time, and 2.) the case
where it's the second call, but the first call actually skipped index
vacuuming? When we skip index vacuuming we won't increment
num_index_scans (which seems appropriate to me).

In case (2), I think we skipped index vacuuming in the first call
because index_cleanup was disabled (if index_cleanup was not disabled,
we didn't skip it, because two_pass_strategy() is called with onecall =
false). So in the second call, we skip index vacuuming for the same
reason. Even with the 0004 patch (skipping index vacuuming in
emergency cases), the XID wraparound emergency check should be done
before the !onecall check in two_pass_strategy(), since we should skip
index vacuuming in an emergency case even when maintenance_work_mem
runs out. Therefore, similarly, we will also skip index vacuuming in
the second call.
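
To make the intended ordering concrete, a rough sketch of the decision
logic (an illustration only, not code from the posted patches; the
helper name and the placement of the emergency check relative to the
INDEX_CLEANUP checks are assumptions):

/* Sketch only -- names other than the VACOPT_CLEANUP_* values are made up */
static bool
decide_skip_index_vacuum(VacOptIndexCleanupValue index_cleanup,
                         bool emergency, bool onecall,
                         BlockNumber has_dead_items_pages,
                         BlockNumber rel_pages)
{
    if (index_cleanup == VACOPT_CLEANUP_DISABLED)
        return true;            /* INDEX_CLEANUP = OFF always skips */
    if (emergency)
        return true;            /* checked before the !onecall check */
    if (index_cleanup == VACOPT_CLEANUP_ENABLED)
        return false;           /* INDEX_CLEANUP = ON does index vacuuming */
    if (!onecall)
        return false;           /* dead_tuples filled up; must vacuum indexes */

    /* Otherwise, skip only when few pages carry LP_DEAD items */
    return has_dead_items_pages <
        (BlockNumber) (rel_pages * SKIP_VACUUM_PAGES_RATIO);
}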

That being said, I agree that using ‘calledtwopass’ is more readable.
So I’ll keep it as is.

For now I have added an assertion that "vacrelstats->num_index_scan ==
0" at the point where we apply skipping indexes as an optimization
(i.e. the point where the patch 0003- mechanism is applied).

Perhaps we can make INDEX_CLEANUP option a four-value option: on, off,
auto, and default? A problem with the above change would be that if
the user wants to do "auto" mode, they might need to reset
vacuum_index_cleanup reloption before executing VACUUM command. In
other words, there is no way in VACUUM command to force "auto" mode.
So I think we can add "auto" value to INDEX_CLEANUP option and ignore
the vacuum_index_cleanup reloption if that value is specified.

I agree that this aspect definitely needs more work. I'll leave it to you to
do this in a separate revision of this new 0003 patch (so no changes here
from me for v5).

Are you updating also the 0003 patch? if you're focusing on 0001 and
0002 patch, I'll update the 0003 patch along with the fourth patch
(skipping index vacuum in emergency cases).

I suggest that you start integrating it with the wraparound emergency
mechanism, which can become patch 0004- of the patch series. You can
manage 0003- and 0004- now. You can post revisions of each of those
two independently of my revisions. What do you think? I have included
0003- for now because you had review comments on it that I worked
through, but you should own that, I think.

I suppose that you should include the versions of 0001- and 0002- you
worked off of, just for the convenience of others/to keep the CF
tester happy. I don't think that I'm going to make many changes that
will break your patch, except for obvious bit rot that can be fixed
through fairly mechanical rebasing.

Agreed.

I was just about to post my 0004 patch based on v4 patch series. I'll
update 0003 and 0004 patches based on v5 patch series you just posted,
and post them including 0001 and 0002 patches.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#87Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#86)
4 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Mar 24, 2021 at 12:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Mar 24, 2021 at 11:44 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Mar 23, 2021 at 4:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Here are review comments on 0003 patch:

Attached is a new revision, v5. It fixes bit rot caused by recent
changes (your index autovacuum logging stuff). It has also been
cleaned up in response to your recent review comments -- both from
this email, and the other review email that I responded to separately
today.

+    * If we skip vacuum, we just ignore the collected dead tuples.  Note that
+    * vacrelstats->dead_tuples could have tuples which became dead after
+    * HOT-pruning but are not marked dead yet.  We do not process them
+    * because it's a very rare condition, and the next vacuum will process
+    * them anyway.
+    */

The second paragraph is no longer true after removing the 'tupgone' case.

Fixed.

Maybe we can use vacrelstats->num_index_scans instead of
calledtwopass? When calling to two_pass_strategy() at the end of
lazy_scan_heap(), if vacrelstats->num_index_scans is 0 it means this
is the first time call, which is equivalent to calledtwopass = false.

It's true that when "vacrelstats->num_index_scans > 0" it definitely
can't have been the first call. But how can we distinguish between 1.)
the case where we're being called for the first time, and 2.) the case
where it's the second call, but the first call actually skipped index
vacuuming? When we skip index vacuuming we won't increment
num_index_scans (which seems appropriate to me).

In case (2), I think we skipped index vacuuming in the first call
because index_cleanup was disabled (if index_cleanup was not disabled,
we didn't skip it, because two_pass_strategy() is called with onecall =
false). So in the second call, we skip index vacuuming for the same
reason. Even with the 0004 patch (skipping index vacuuming in
emergency cases), the XID wraparound emergency check should be done
before the !onecall check in two_pass_strategy(), since we should skip
index vacuuming in an emergency case even when maintenance_work_mem
runs out. Therefore, similarly, we will also skip index vacuuming in
the second call.

That being said, I agree that using ‘calledtwopass’ is more readable.
So I’ll keep it as is.

For now I have added an assertion that "vacrelstats->num_index_scan ==
0" at the point where we apply skipping indexes as an optimization
(i.e. the point where the patch 0003- mechanism is applied).

Perhaps we can make INDEX_CLEANUP option a four-value option: on, off,
auto, and default? A problem with the above change would be that if
the user wants to do "auto" mode, they might need to reset
vacuum_index_cleanup reloption before executing VACUUM command. In
other words, there is no way in VACUUM command to force "auto" mode.
So I think we can add "auto" value to INDEX_CLEANUP option and ignore
the vacuum_index_cleanup reloption if that value is specified.

I agree that this aspect definitely needs more work. I'll leave it to you to
do this in a separate revision of this new 0003 patch (so no changes here
from me for v5).

Are you updating also the 0003 patch? if you're focusing on 0001 and
0002 patch, I'll update the 0003 patch along with the fourth patch
(skipping index vacuum in emergency cases).

I suggest that you start integrating it with the wraparound emergency
mechanism, which can become patch 0004- of the patch series. You can
manage 0003- and 0004- now. You can post revisions of each of those
two independently of my revisions. What do you think? I have included
0003- for now because you had review comments on it that I worked
through, but you should own that, I think.

I suppose that you should include the versions of 0001- and 0002- you
worked off of, just for the convenience of others/to keep the CF
tester happy. I don't think that I'm going to make many changes that
will break your patch, except for obvious bit rot that can be fixed
through fairly mechanical rebasing.

Agreed.

I was just about to post my 0004 patch based on v4 patch series. I'll
update 0003 and 0004 patches based on v5 patch series you just posted,
and post them including 0001 and 0002 patches.

I've attached the updated patch set (nothing changed in the 0001 and 0002 patches).

Regarding the "auto" option, I think it would be a good start to
enable the index vacuum skipping behavior by default instead of
adding an “auto” mode. That is, we could skip index vacuuming even
with INDEX_CLEANUP ON. With the 0003 and 0004 patches, there are two
cases where we skip index vacuuming: when the garbage on the heap is
highly concentrated, and when the table is at risk of XID wraparound.
It seems to make sense to have both behaviors by default. If we want
a way to force index vacuuming, we can add a “force” option instead
of adding an “auto” option and having “on” mode force index vacuuming.
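
Just to illustrate (a sketch only -- whether this reuses the existing
enum, and the exact "force" naming, are not from any posted patch),
the values could end up looking something like:

/* Hypothetical sketch: no "auto" value, and a new "force" value */
typedef enum VacOptIndexCleanupValue
{
    VACOPT_CLEANUP_DISABLED,    /* INDEX_CLEANUP = OFF: never index vacuum */
    VACOPT_CLEANUP_ENABLED,     /* INDEX_CLEANUP = ON (default): may still
                                 * skip index vacuuming dynamically */
    VACOPT_CLEANUP_FORCE        /* hypothetical: always do index vacuuming */
} VacOptIndexCleanupValue;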

Also, regarding the new GUC parameters, vacuum_skip_index_age and
vacuum_multixact_skip_index_age, those are not autovacuum-dedicated
parameters. The VACUUM command also uses them to skip index vacuuming
dynamically. In such an emergency case, it seems appropriate to me to
skip index vacuuming even in a manual VACUUM command. And I didn’t
add any reloption for those two parameters. Since they are unlikely
to be changed from the default value, I don’t think we necessarily
need to provide a way for per-table configuration.
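
For reference, the check driven by those GUCs could look roughly like
the following sketch (the function name and the exact cutoff
arithmetic here are illustrative, not lifted from the 0004 patch):

/*
 * Sketch: is the table at enough wraparound risk that index vacuuming
 * should be skipped?  vacuum_skip_index_age and
 * vacuum_multixact_skip_index_age are the proposed (integer) GUCs.
 */
static bool
wraparound_emergency(Relation onerel)
{
    TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
    MultiXactId relminmxid = onerel->rd_rel->relminmxid;
    TransactionId xid_skip_limit;
    MultiXactId multi_skip_limit;

    xid_skip_limit = ReadNextTransactionId() - vacuum_skip_index_age;
    if (!TransactionIdIsNormal(xid_skip_limit))
        xid_skip_limit = FirstNormalTransactionId;

    multi_skip_limit = ReadNextMultiXactId() - vacuum_multixact_skip_index_age;
    if (multi_skip_limit < FirstMultiXactId)
        multi_skip_limit = FirstMultiXactId;

    if (TransactionIdIsNormal(relfrozenxid) &&
        TransactionIdPrecedes(relfrozenxid, xid_skip_limit))
        return true;

    if (MultiXactIdIsValid(relminmxid) &&
        MultiXactIdPrecedes(relminmxid, multi_skip_limit))
        return true;

    return false;
}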

In the 0001 patch, we have the following chunk:

+   bool        skipping;
+
+   /* Should not end up here with no indexes */
+   Assert(nindexes > 0);
+   Assert(!IsParallelWorker());
+
+   /* Check whether or not to do index vacuum and heap vacuum */
+   if (index_cleanup == VACOPT_TERNARY_DISABLED)
+       skipping = true;
+   else
+       skipping = false;

Can we flip the boolean? I mean, use a positive form such as
"do_vacuum". It seems more readable, especially for the changes made
in the 0003 and 0004 patches.
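
In other words, the chunk would read something like this (just a
sketch of the suggested flip, with nothing else changed):

    bool        do_vacuum;

    /* Should not end up here with no indexes */
    Assert(nindexes > 0);
    Assert(!IsParallelWorker());

    /* Check whether or not to do index vacuum and heap vacuum */
    if (index_cleanup == VACOPT_TERNARY_DISABLED)
        do_vacuum = false;
    else
        do_vacuum = true;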

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

v6-0003-Skip-index-vacuuming-dynamically.patchapplication/octet-stream; name=v6-0003-Skip-index-vacuuming-dynamically.patchDownload
From 88ca75aa6dbac1cdf193f33c7ffbc28da269742c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 19 Mar 2021 14:51:44 -0700
Subject: [PATCH v6 3/4] Skip index vacuuming dynamically.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAD21AoAtZb4+HJT_8RoOXvu4HM-Zd4HKS3YSMCH6+-W=bDyh-w@mail.gmail.com
---
 src/backend/access/heap/vacuumlazy.c | 127 ++++++++++++++++++++++-----
 1 file changed, 107 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 132cfcba16..ac250d0fab 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -131,6 +131,12 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+/*
+ * Threshold (as a fraction of all heap blocks) of heap blocks that have at
+ * least one LP_DEAD line pointer, above which index vacuuming goes ahead.
+ */
+#define SKIP_VACUUM_PAGES_RATIO		0.01
+
 /*
  * DSM keys for parallel vacuum.  Unlike other parallel execution code, since
  * we don't need to worry about DSM keys conflicting with plan_node_id we can
@@ -385,8 +391,10 @@ static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 						   bool aggressive);
 static void lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
 									 Relation *Irel, int nindexes,
-									 LVParallelState* lps,
-									 VacOptTernaryValue index_cleanup);
+									 LVParallelState *lps,
+									 VacOptTernaryValue index_cleanup,
+									 BlockNumber has_dead_items_pages,
+									 bool onecall);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
@@ -1349,7 +1357,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				next_fsm_block_to_vacuum;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		skipping_blocks;
+	bool		skipping_blocks,
+				have_vacuumed_indexes = false;
 	xl_heap_freeze_tuple *frozen;
 	StringInfoData buf;
 	const int	initprog_index[] = {
@@ -1363,7 +1372,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 	/* Counters of # blocks in onerel: */
 	BlockNumber empty_pages,
-				vacuumed_pages;
+				vacuumed_pages,
+				has_dead_items_pages;
 
 	pg_rusage_init(&ru0);
 
@@ -1378,7 +1388,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 						vacrelstats->relnamespace,
 						vacrelstats->relname)));
 
-	empty_pages = vacuumed_pages = 0;
+	empty_pages = vacuumed_pages = has_dead_items_pages = 0;
 
 	/* Initialize counters */
 	c.num_tuples = 0;
@@ -1638,9 +1648,18 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				vmbuffer = InvalidBuffer;
 			}
 
+			/*
+			 * Definitely won't be skipping index vacuuming due to finding
+			 * very few dead items during this VACUUM operation -- that's only
+			 * something that lazy_vacuum_pruned_items() is willing to do when
+			 * it is only called once during the entire VACUUM operation.
+			 */
+			have_vacuumed_indexes = true;
+
 			/* Remove the collected garbage tuples from table and indexes */
 			lazy_vacuum_pruned_items(onerel, vacrelstats, Irel, nindexes, lps,
-									 params->index_cleanup);
+									 params->index_cleanup,
+									 has_dead_items_pages, false);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
@@ -1777,6 +1796,17 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		scan_prune_page(onerel, buf, vacrelstats, vistest, frozen,
 						&c, &ps, &vms);
 
+		/*
+		 * Remember the number of pages having at least one LP_DEAD line
+		 * pointer.  This could be from this VACUUM, a previous VACUUM, or
+		 * even opportunistic pruning.  Note that this is exactly the same
+		 * thing as having items that are stored in dead_tuples space, because
+		 * scan_prune_page() doesn't count anything other than LP_DEAD items
+		 * as dead (as of PostgreSQL 14).
+		 */
+		if (ps.has_dead_items)
+			has_dead_items_pages++;
+
 		/*
 		 * Step 7 for block: Set up details for saving free space in FSM at
 		 * end of loop.  (Also performs extra single pass strategy steps in
@@ -1793,7 +1823,16 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		if (nindexes > 0 && ps.has_dead_items &&
 			params->index_cleanup != VACOPT_TERNARY_DISABLED)
 		{
-			/* Wait until lazy_vacuum_heap() to save free space */
+			/*
+			 * Wait until lazy_vacuum_heap() to save free space.
+			 *
+			 * Note: It's not in fact 100% certain that we really will call
+			 * lazy_vacuum_heap() in INDEX_CLEANUP = ON case (which is the
+			 * common case) -- lazy_vacuum_pruned_items() might opt to skip
+			 * index vacuuming (and so must skip heap vacuuming).  This is
+			 * deemed okay, because there can't be very much free space when
+			 * this happens.
+			 */
 		}
 		else
 		{
@@ -1905,7 +1944,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	Assert(nindexes > 0 || dead_tuples->num_tuples == 0);
 	if (dead_tuples->num_tuples > 0)
 		lazy_vacuum_pruned_items(onerel, vacrelstats, Irel, nindexes, lps,
-								 params->index_cleanup);
+								 params->index_cleanup, has_dead_items_pages,
+								 !have_vacuumed_indexes);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
@@ -1920,8 +1960,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/*
 	 * Do post-vacuum cleanup.
 	 *
-	 * Note that post-vacuum cleanup does not take place with
-	 * INDEX_CLEANUP=OFF.
+	 * Note that post-vacuum cleanup still takes place even when
+	 * lazy_vacuum_pruned_items() decided to skip index vacuuming, but not
+	 * with INDEX_CLEANUP=OFF.
 	 */
 	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
 		lazy_cleanup_all_indexes(Irel, vacrelstats, lps, nindexes);
@@ -1936,8 +1977,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/*
 	 * Update index statistics.
 	 *
-	 * Note that updating the statistics does not take place with
-	 * INDEX_CLEANUP=OFF.
+	 * Note that updating the statistics takes place even when
+	 * lazy_vacuum_pruned_items() decided to skip index vacuuming, but not
+	 * with INDEX_CLEANUP=OFF.
+	 *
+	 * (In practice most index AMs won't have accurate statistics from
+	 * cleanup, but the index AM API allows them to, so we must check.)
 	 */
 	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
 		update_index_statistics(Irel, vacrelstats->indstats, nindexes);
@@ -1985,12 +2030,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 /*
  * Remove the collected garbage tuples from the table and its indexes.
  *
- * We may be required to skip index vacuuming by INDEX_CLEANUP reloption.
+ * We may be able to skip index vacuuming (we may even be required to do so by
+ * the INDEX_CLEANUP reloption).
  */
 static void
 lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
 						 Relation *Irel, int nindexes, LVParallelState *lps,
-						 VacOptTernaryValue index_cleanup)
+						 VacOptTernaryValue index_cleanup,
+						 BlockNumber has_dead_items_pages, bool onecall)
 {
 	bool		skipping;
 
@@ -1998,12 +2045,44 @@ lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
 	Assert(nindexes > 0);
 	Assert(!IsParallelWorker());
 
-	/* Check whether or not to do index vacuum and heap vacuum */
+	/* Skip index and heap vacuuming if INDEX_CLEANUP=OFF */
 	if (index_cleanup == VACOPT_TERNARY_DISABLED)
 		skipping = true;
-	else
+
+	/*
+	 * Don't skip index and heap vacuuming when this is not the only such call
+	 * during the entire VACUUM operation.
+	 */
+	else if (!onecall)
 		skipping = false;
 
+	/*
+	 * Do both index and heap vacuuming if more than SKIP_VACUUM_PAGES_RATIO
+	 * of all heap pages have at least one LP_DEAD line pointer.  Otherwise
+	 * skip both: in that case the dead tuples are highly concentrated in
+	 * relatively few heap blocks, which is where index deletion mechanisms
+	 * that exploit such concentrations (notably btree's bottom-up index
+	 * deletion) work well.  Also, since only a few heap blocks could be
+	 * cleaned anyway, skipping heap vacuuming has little negative impact on
+	 * visibility map updates.
+	 */
+	else
+	{
+		BlockNumber rel_pages_threshold;
+
+		Assert(onecall);
+		Assert(vacrelstats->num_index_scans == 0);
+		Assert(index_cleanup == VACOPT_TERNARY_ENABLED);
+
+		rel_pages_threshold =
+			(double) vacrelstats->rel_pages * SKIP_VACUUM_PAGES_RATIO;
+
+		if (has_dead_items_pages < rel_pages_threshold)
+			skipping = true;
+		else
+			skipping = false;
+	}
+
 	if (!skipping)
 	{
 		/* Okay, we're going to do index vacuuming */
@@ -2024,10 +2103,18 @@ lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
 		 * the similar "nindexes == 0" specific ereport() at the end of
 		 * lazy_scan_heap().
 		 */
-		ereport(elevel,
-				(errmsg("\"%s\": INDEX_CLEANUP off forced VACUUM to not totally remove %d pruned items",
-						vacrelstats->relname,
-						vacrelstats->dead_tuples->num_tuples)));
+		if (index_cleanup == VACOPT_TERNARY_DISABLED)
+			ereport(elevel,
+					(errmsg("\"%s\": INDEX_CLEANUP off forced VACUUM to not totally remove %d pruned items in %u pages",
+							vacrelstats->relname,
+							vacrelstats->dead_tuples->num_tuples,
+							has_dead_items_pages)));
+		else
+			ereport(elevel,
+					(errmsg("\"%s\": opted to not totally remove %d pruned items in %u pages",
+							vacrelstats->relname,
+							vacrelstats->dead_tuples->num_tuples,
+							has_dead_items_pages)));
 	}
 
 	/*
-- 
2.27.0
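
To summarize the new logic in 0003: when INDEX_CLEANUP is not OFF, index
and heap vacuuming are skipped only if lazy_vacuum_pruned_items() is
called exactly once for the whole VACUUM and fewer than
SKIP_VACUUM_PAGES_RATIO (1%) of all heap pages contain an LP_DEAD line
pointer.  A minimal standalone sketch of that decision (the helper name
and its isolation from the surrounding vacuum code are illustrative only,
not part of the patch):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;	/* stand-in for PostgreSQL's BlockNumber */

#define SKIP_VACUUM_PAGES_RATIO		0.01

/*
 * Illustrative helper (not in the patch): should a one-and-only call to
 * lazy_vacuum_pruned_items() skip index and heap vacuuming?
 */
static bool
skip_index_vacuum(BlockNumber rel_pages, BlockNumber has_dead_items_pages,
				  bool onecall)
{
	BlockNumber rel_pages_threshold;

	if (!onecall)
		return false;			/* dead-tuple space filled up: must vacuum */

	rel_pages_threshold = (double) rel_pages * SKIP_VACUUM_PAGES_RATIO;

	return has_dead_items_pages < rel_pages_threshold;
}

For example, for a 1GB table (131072 8kB pages) the threshold works out
to 1310 pages, so index vacuuming is skipped only when fewer than 1310
heap pages carry an LP_DEAD item.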

v6-0001-Refactor-vacuumlazy.c.patch (application/octet-stream)
From 1fbf84ac2dc947a5fe82150c22aebfef580af76f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 13 Mar 2021 20:37:32 -0800
Subject: [PATCH v6 1/4] Refactor vacuumlazy.c.

Break up lazy_scan_heap() into functions.

Aside from being useful cleanup work in its own right, this is also
preparation for an upcoming patch that removes the "tupgone" special
case from vacuumlazy.c.
---
 contrib/pg_visibility/pg_visibility.c |    8 +-
 contrib/pgstattuple/pgstatapprox.c    |    8 +-
 src/backend/access/heap/vacuumlazy.c  | 1350 +++++++++++++++----------
 3 files changed, 804 insertions(+), 562 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index dd0c124e62..3ac8df7d07 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -756,10 +756,10 @@ tuple_all_visible(HeapTuple tup, TransactionId OldestXmin, Buffer buffer)
 		return false;			/* all-visible implies live */
 
 	/*
-	 * Neither lazy_scan_heap nor heap_page_is_all_visible will mark a page
-	 * all-visible unless every tuple is hinted committed. However, those hint
-	 * bits could be lost after a crash, so we can't be certain that they'll
-	 * be set here.  So just check the xmin.
+	 * Neither lazy_scan_heap/scan_new_page nor heap_page_is_all_visible will
+	 * mark a page all-visible unless every tuple is hinted committed.
+	 * However, those hint bits could be lost after a crash, so we can't be
+	 * certain that they'll be set here.  So just check the xmin.
 	 */
 
 	xmin = HeapTupleHeaderGetXmin(tup->t_data);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 1fe193bb25..34670c6264 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -58,8 +58,8 @@ typedef struct output_type
  * and approximate tuple_len on that basis. For the others, we count
  * the exact number of dead tuples etc.
  *
- * This scan is loosely based on vacuumlazy.c:lazy_scan_heap(), but
- * we do not try to avoid skipping single pages.
+ * This scan is loosely based on vacuumlazy.c:lazy_scan_heap/scan_new_page,
+ * but we do not try to avoid skipping single pages.
  */
 static void
 statapprox_heap(Relation rel, output_type *stat)
@@ -126,8 +126,8 @@ statapprox_heap(Relation rel, output_type *stat)
 
 		/*
 		 * Look at each tuple on the page and decide whether it's live or
-		 * dead, then count it and its size. Unlike lazy_scan_heap, we can
-		 * afford to ignore problems and special cases.
+		 * dead, then count it and its size. Unlike lazy_scan_heap and
+		 * scan_new_page, we can afford to ignore problems and special cases.
 		 */
 		maxoff = PageGetMaxOffsetNumber(page);
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index efe8761702..9bebb94968 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -294,8 +294,6 @@ typedef struct LVRelStats
 {
 	char	   *relnamespace;
 	char	   *relname;
-	/* useindex = true means two-pass strategy; false means one-pass */
-	bool		useindex;
 	/* Overall statistics about rel */
 	BlockNumber old_rel_pages;	/* previous value of pg_class.relpages */
 	BlockNumber rel_pages;		/* total number of pages */
@@ -334,9 +332,47 @@ typedef struct LVSavedErrInfo
 	VacErrPhase phase;
 } LVSavedErrInfo;
 
+/*
+ * Counters maintained by lazy_scan_heap() (and scan_prune_page()):
+ */
+typedef struct LVTempCounters
+{
+	double		num_tuples;		/* total number of nonremovable tuples */
+	double		live_tuples;	/* live tuples (reltuples estimate) */
+	double		tups_vacuumed;	/* tuples cleaned up by current vacuum */
+	double		nkeep;			/* dead-but-not-removable tuples */
+	double		nunused;		/* # existing unused line pointers */
+} LVTempCounters;
+
+/*
+ * State output by scan_prune_page():
+ */
+typedef struct LVPrunePageState
+{
+	bool		hastup;			/* Page is truncatable? */
+	bool		has_dead_items; /* includes existing LP_DEAD items */
+	bool		all_visible;	/* Every item visible to all? */
+	bool		all_frozen;		/* provided all_visible is also true */
+} LVPrunePageState;
+
+/*
+ * State set up and maintained in lazy_scan_heap() (also maintained in
+ * scan_prune_page()) that represents VM bit status.
+ *
+ * Used by scan_setvmbit_page() when we're done pruning.
+ */
+typedef struct LVVisMapPageState
+{
+	bool		all_visible_according_to_vm;
+	TransactionId visibility_cutoff_xid;
+} LVVisMapPageState;
+
 /* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
 
+static TransactionId RelFrozenXid;
+static MultiXactId RelMinMxid;
+
 static TransactionId OldestXmin;
 static TransactionId FreezeLimit;
 static MultiXactId MultiXactCutoff;
@@ -348,6 +384,10 @@ static BufferAccessStrategy vac_strategy;
 static void lazy_scan_heap(Relation onerel, VacuumParams *params,
 						   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
 						   bool aggressive);
+static void lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
+									 Relation *Irel, int nindexes,
+									 LVParallelState* lps,
+									 VacOptTernaryValue index_cleanup);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
@@ -366,7 +406,8 @@ static bool should_attempt_truncation(VacuumParams *params,
 static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
 static BlockNumber count_nondeletable_pages(Relation onerel,
 											LVRelStats *vacrelstats);
-static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
+static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks,
+							 bool hasindex);
 static void lazy_record_dead_tuple(LVDeadTuples *dead_tuples,
 								   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
@@ -449,10 +490,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
-	/* not every AM requires these to be valid, but heap does */
-	Assert(TransactionIdIsNormal(onerel->rd_rel->relfrozenxid));
-	Assert(MultiXactIdIsValid(onerel->rd_rel->relminmxid));
-
 	/* measure elapsed time iff autovacuum logging requires it */
 	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
 	{
@@ -475,6 +512,13 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 
 	vac_strategy = bstrategy;
 
+	RelFrozenXid = onerel->rd_rel->relfrozenxid;
+	RelMinMxid = onerel->rd_rel->relminmxid;
+
+	/* not every AM requires these to be valid, but heap does */
+	Assert(TransactionIdIsNormal(RelFrozenXid));
+	Assert(MultiXactIdIsValid(RelMinMxid));
+
 	vacuum_set_xid_limits(onerel,
 						  params->freeze_min_age,
 						  params->freeze_table_age,
@@ -510,8 +554,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 
 	/* Open all indexes of the relation */
 	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
-	vacrelstats->useindex = (nindexes > 0 &&
-							 params->index_cleanup == VACOPT_TERNARY_ENABLED);
 
 	vacrelstats->indstats = (IndexBulkDeleteResult **)
 		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
@@ -780,6 +822,531 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
 		(void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
 }
 
+/*
+ * Handle new page during lazy_scan_heap().
+ *
+ * Caller must hold pin and buffer cleanup lock on buf.
+ *
+ * All-zeroes pages can be left over if either a backend extends the relation
+ * by a single page, but crashes before the newly initialized page has been
+ * written out, or when bulk-extending the relation (which creates a number of
+ * empty pages at the tail end of the relation, but enters them into the FSM).
+ *
+ * Note we do not enter the page into the visibilitymap. That has the downside
+ * that we repeatedly visit this page in subsequent vacuums, but otherwise
+ * we'll never not discover the space on a promoted standby. The harm of
+ * repeated checking ought to normally not be too bad - the space usually
+ * should be used at some point, otherwise there wouldn't be any regular
+ * vacuums.
+ *
+ * Make sure these pages are in the FSM, to ensure they can be reused. Do that
+ * by testing if there's any space recorded for the page. If not, enter it. We
+ * do so after releasing the lock on the heap page, the FSM is approximate,
+ * after all.
+ */
+static void
+scan_new_page(Relation onerel, Buffer buf)
+{
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+
+	if (GetRecordedFreeSpace(onerel, blkno) == 0)
+	{
+		Size		freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
+
+		UnlockReleaseBuffer(buf);
+		RecordPageWithFreeSpace(onerel, blkno, freespace);
+		return;
+	}
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Handle empty page during lazy_scan_heap().
+ *
+ * Caller must hold pin and buffer cleanup lock on buf, as well as a pin (but
+ * not a lock) on vmbuffer.
+ */
+static void
+scan_empty_page(Relation onerel, Buffer buf, Buffer vmbuffer,
+				LVRelStats *vacrelstats)
+{
+	Page		page = BufferGetPage(buf);
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+	Size		freespace = PageGetHeapFreeSpace(page);
+
+	/*
+	 * Empty pages are always all-visible and all-frozen (note that the same
+	 * is currently not true for new pages, see scan_new_page()).
+	 */
+	if (!PageIsAllVisible(page))
+	{
+		START_CRIT_SECTION();
+
+		/* mark buffer dirty before writing a WAL record */
+		MarkBufferDirty(buf);
+
+		/*
+		 * It's possible that another backend has extended the heap,
+		 * initialized the page, and then failed to WAL-log the page due to an
+		 * ERROR.  Since heap extension is not WAL-logged, recovery might try
+		 * to replay our record setting the page all-visible and find that the
+		 * page isn't initialized, which will cause a PANIC.  To prevent that,
+		 * check whether the page has been previously WAL-logged, and if not,
+		 * do that now.
+		 */
+		if (RelationNeedsWAL(onerel) &&
+			PageGetLSN(page) == InvalidXLogRecPtr)
+			log_newpage_buffer(buf, true);
+
+		PageSetAllVisible(page);
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, InvalidTransactionId,
+						  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
+		END_CRIT_SECTION();
+	}
+
+	UnlockReleaseBuffer(buf);
+	RecordPageWithFreeSpace(onerel, blkno, freespace);
+}
+
+/*
+ *	scan_prune_page() -- lazy_scan_heap() pruning and freezing.
+ *
+ * Caller must hold pin and buffer cleanup lock on the buffer.
+ */
+static void
+scan_prune_page(Relation onerel, Buffer buf,
+				LVRelStats *vacrelstats,
+				GlobalVisState *vistest, xl_heap_freeze_tuple *frozen,
+				LVTempCounters *c, LVPrunePageState *ps,
+				LVVisMapPageState *vms,
+				VacOptTernaryValue index_cleanup)
+{
+	BlockNumber blkno;
+	Page		page;
+	OffsetNumber offnum,
+				maxoff;
+	int			nfrozen,
+				ndead;
+	LVTempCounters pc;
+	OffsetNumber deaditems[MaxHeapTuplesPerPage];
+	bool		tupgone;
+
+	blkno = BufferGetBlockNumber(buf);
+	page = BufferGetPage(buf);
+
+	/* Initialize (or reset) page-level counters */
+	pc.num_tuples = 0;
+	pc.live_tuples = 0;
+	pc.tups_vacuumed = 0;
+	pc.nkeep = 0;
+	pc.nunused = 0;
+
+	/*
+	 * Prune all HOT-update chains in this page.
+	 *
+	 * We count tuples removed by the pruning step as removed by VACUUM
+	 * (existing LP_DEAD line pointers don't count).
+	 */
+	pc.tups_vacuumed = heap_page_prune(onerel, buf, vistest,
+									   InvalidTransactionId, 0, false,
+									   &vacrelstats->latestRemovedXid,
+									   &vacrelstats->offnum);
+
+	/*
+	 * Now scan the page to collect vacuumable items and check for tuples
+	 * requiring freezing.
+	 */
+	ps->hastup = false;
+	ps->has_dead_items = false;
+	ps->all_visible = true;
+	ps->all_frozen = true;
+	nfrozen = 0;
+	ndead = 0;
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	tupgone = false;
+
+	/*
+	 * Note: If you change anything in the loop below, also look at
+	 * heap_page_is_all_visible to see if that needs to be changed.
+	 */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+		HeapTupleData tuple;
+
+		/*
+		 * Set the offset number so that we can display it along with any
+		 * error that occurred while processing this tuple.
+		 */
+		vacrelstats->offnum = offnum;
+		itemid = PageGetItemId(page, offnum);
+
+		/* Unused items require no processing, but we count 'em */
+		if (!ItemIdIsUsed(itemid))
+		{
+			pc.nunused += 1;
+			continue;
+		}
+
+		/* Redirect items mustn't be touched */
+		if (ItemIdIsRedirected(itemid))
+		{
+			ps->hastup = true;	/* this page won't be truncatable */
+			continue;
+		}
+
+		/*
+		 * LP_DEAD line pointers are to be vacuumed normally; but we don't
+		 * count them in tups_vacuumed, else we'd be double-counting (at least
+		 * in the common case where heap_page_prune() just freed up a non-HOT
+		 * tuple).
+		 *
+		 * Note also that the final tups_vacuumed value might be very low for
+		 * tables where opportunistic page pruning happens to occur very
+		 * frequently (via heap_page_prune_opt() calls that free up non-HOT
+		 * tuples).
+		 */
+		if (ItemIdIsDead(itemid))
+		{
+			deaditems[ndead++] = offnum;
+			ps->all_visible = false;
+			ps->has_dead_items = true;
+			continue;
+		}
+
+		Assert(ItemIdIsNormal(itemid));
+
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationGetRelid(onerel);
+
+		/*
+		 * The criteria for counting a tuple as live in this block need to
+		 * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
+		 * and ANALYZE may produce wildly different reltuples values, e.g.
+		 * when there are many recently-dead tuples.
+		 *
+		 * The logic here is a bit simpler than acquire_sample_rows(), as
+		 * VACUUM can't run inside a transaction block, which makes some cases
+		 * impossible (e.g. in-progress insert from the same transaction).
+		 */
+		switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+		{
+			case HEAPTUPLE_DEAD:
+
+				/*
+				 * Ordinarily, DEAD tuples would have been removed by
+				 * heap_page_prune(), but it's possible that the tuple state
+				 * changed since heap_page_prune() looked.  In particular an
+				 * INSERT_IN_PROGRESS tuple could have changed to DEAD if the
+				 * inserter aborted.  So this cannot be considered an error
+				 * condition.
+				 *
+				 * If the tuple is HOT-updated then it must only be removed by
+				 * a prune operation; so we keep it just as if it were
+				 * RECENTLY_DEAD.  Also, if it's a heap-only tuple, we choose
+				 * to keep it, because it'll be a lot cheaper to get rid of it
+				 * in the next pruning pass than to treat it like an indexed
+				 * tuple. Finally, if index cleanup is disabled, the second
+				 * heap pass will not execute, and the tuple will not get
+				 * removed, so we must treat it like any other dead tuple that
+				 * we choose to keep.
+				 *
+				 * If this were to happen for a tuple that actually needed to
+				 * be deleted, we'd be in trouble, because it'd possibly leave
+				 * a tuple below the relation's xmin horizon alive.
+				 * heap_prepare_freeze_tuple() is prepared to detect that case
+				 * and abort the transaction, preventing corruption.
+				 */
+				if (HeapTupleIsHotUpdated(&tuple) ||
+					HeapTupleIsHeapOnly(&tuple) ||
+					index_cleanup == VACOPT_TERNARY_DISABLED)
+					pc.nkeep += 1;
+				else
+					tupgone = true; /* we can delete the tuple */
+				ps->all_visible = false;
+				break;
+			case HEAPTUPLE_LIVE:
+
+				/*
+				 * Count it as live.  Not only is this natural, but it's also
+				 * what acquire_sample_rows() does.
+				 */
+				pc.live_tuples += 1;
+
+				/*
+				 * Is the tuple definitely visible to all transactions?
+				 *
+				 * NB: Like with per-tuple hint bits, we can't set the
+				 * PD_ALL_VISIBLE flag if the inserter committed
+				 * asynchronously. See SetHintBits for more info. Check that
+				 * the tuple is hinted xmin-committed because of that.
+				 */
+				if (ps->all_visible)
+				{
+					TransactionId xmin;
+
+					if (!HeapTupleHeaderXminCommitted(tuple.t_data))
+					{
+						ps->all_visible = false;
+						break;
+					}
+
+					/*
+					 * The inserter definitely committed. But is it old enough
+					 * that everyone sees it as committed?
+					 */
+					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
+					if (!TransactionIdPrecedes(xmin, OldestXmin))
+					{
+						ps->all_visible = false;
+						break;
+					}
+
+					/* Track newest xmin on page. */
+					if (TransactionIdFollows(xmin, vms->visibility_cutoff_xid))
+						vms->visibility_cutoff_xid = xmin;
+				}
+				break;
+			case HEAPTUPLE_RECENTLY_DEAD:
+
+				/*
+				 * If tuple is recently deleted then we must not remove it
+				 * from relation.
+				 */
+				pc.nkeep += 1;
+				ps->all_visible = false;
+				break;
+			case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+				/*
+				 * This is an expected case during concurrent vacuum.
+				 *
+				 * We do not count these rows as live, because we expect the
+				 * inserting transaction to update the counters at commit, and
+				 * we assume that will happen only after we report our
+				 * results.  This assumption is a bit shaky, but it is what
+				 * acquire_sample_rows() does, so be consistent.
+				 */
+				ps->all_visible = false;
+				break;
+			case HEAPTUPLE_DELETE_IN_PROGRESS:
+				/* This is an expected case during concurrent vacuum */
+				ps->all_visible = false;
+
+				/*
+				 * Count such rows as live.  As above, we assume the deleting
+				 * transaction will commit and update the counters after we
+				 * report.
+				 */
+				pc.live_tuples += 1;
+				break;
+			default:
+				elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+				break;
+		}
+
+		if (tupgone)
+		{
+			deaditems[ndead++] = offnum;
+			HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
+												   &vacrelstats->latestRemovedXid);
+			pc.tups_vacuumed += 1;
+			ps->has_dead_items = true;
+		}
+		else
+		{
+			bool		tuple_totally_frozen;
+
+			/*
+			 * Each non-removable tuple must be checked to see if it needs
+			 * freezing
+			 */
+			if (heap_prepare_freeze_tuple(tuple.t_data,
+										  RelFrozenXid, RelMinMxid,
+										  FreezeLimit, MultiXactCutoff,
+										  &frozen[nfrozen],
+										  &tuple_totally_frozen))
+				frozen[nfrozen++].offset = offnum;
+
+			pc.num_tuples += 1;
+			ps->hastup = true;
+
+			if (!tuple_totally_frozen)
+				ps->all_frozen = false;
+		}
+	}
+
+	/*
+	 * Success -- we're done pruning, and have determined which tuples are to
+	 * be recorded as dead in local array.  We've also prepared the details of
+	 * which remaining tuples are to be frozen.
+	 *
+	 * First clear the offset information once we have processed all the
+	 * tuples on the page.
+	 */
+	vacrelstats->offnum = InvalidOffsetNumber;
+
+	/*
+	 * Next add page level counters to caller's counts
+	 */
+	c->num_tuples += pc.num_tuples;
+	c->live_tuples += pc.live_tuples;
+	c->tups_vacuumed += pc.tups_vacuumed;
+	c->nkeep += pc.nkeep;
+	c->nunused += pc.nunused;
+
+	/*
+	 * Now save the local dead items array to VACUUM's dead_tuples array.
+	 */
+	for (int i = 0; i < ndead; i++)
+	{
+		ItemPointerData itemptr;
+
+		ItemPointerSet(&itemptr, blkno, deaditems[i]);
+		lazy_record_dead_tuple(vacrelstats->dead_tuples, &itemptr);
+	}
+
+	/*
+	 * Finally, execute tuple freezing as planned.
+	 *
+	 * If we need to freeze any tuples we'll mark the buffer dirty, and write
+	 * a WAL record recording the changes.  We must log the changes to be
+	 * crash-safe against future truncation of CLOG.
+	 */
+	if (nfrozen > 0)
+	{
+		START_CRIT_SECTION();
+
+		MarkBufferDirty(buf);
+
+		/* execute collected freezes */
+		for (int i = 0; i < nfrozen; i++)
+		{
+			ItemId		itemid;
+			HeapTupleHeader htup;
+
+			itemid = PageGetItemId(page, frozen[i].offset);
+			htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+			heap_execute_freeze_tuple(htup, &frozen[i]);
+		}
+
+		/* Now WAL-log freezing if necessary */
+		if (RelationNeedsWAL(onerel))
+		{
+			XLogRecPtr	recptr;
+
+			recptr = log_heap_freeze(onerel, buf, FreezeLimit,
+									 frozen, nfrozen);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+	}
+}
+
+/*
+ * Handle setting VM bit inside lazy_scan_heap(), after pruning and freezing.
+ */
+static void
+scan_setvmbit_page(Relation onerel, Buffer buf, Buffer vmbuffer,
+				   LVPrunePageState *ps, LVVisMapPageState *vms)
+{
+	Page		page = BufferGetPage(buf);
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+
+	/* mark page all-visible, if appropriate */
+	if (ps->all_visible && !vms->all_visible_according_to_vm)
+	{
+		uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
+
+		if (ps->all_frozen)
+			flags |= VISIBILITYMAP_ALL_FROZEN;
+
+		/*
+		 * It should never be the case that the visibility map page is set
+		 * while the page-level bit is clear, but the reverse is allowed (if
+		 * checksums are not enabled).  Regardless, set both bits so that we
+		 * get back in sync.
+		 *
+		 * NB: If the heap page is all-visible but the VM bit is not set, we
+		 * don't need to dirty the heap page.  However, if checksums are
+		 * enabled, we do need to make sure that the heap page is dirtied
+		 * before passing it to visibilitymap_set(), because it may be logged.
+		 * Given that this situation should only happen in rare cases after a
+		 * crash, it is not worth optimizing.
+		 */
+		PageSetAllVisible(page);
+		MarkBufferDirty(buf);
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, vms->visibility_cutoff_xid, flags);
+	}
+
+	/*
+	 * The visibility map bit should never be set if the page-level bit is
+	 * clear.  However, it's possible that the bit got cleared after we
+	 * checked it and before we took the buffer content lock, so we must
+	 * recheck before jumping to the conclusion that something bad has
+	 * happened.
+	 */
+	else if (vms->all_visible_according_to_vm && !PageIsAllVisible(page) &&
+			 VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
+	{
+		elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+			 RelationGetRelationName(onerel), blkno);
+		visibilitymap_clear(onerel, blkno, vmbuffer,
+							VISIBILITYMAP_VALID_BITS);
+	}
+
+	/*
+	 * It's possible for the value returned by
+	 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
+	 * wrong for us to see tuples that appear to not be visible to everyone
+	 * yet, while PD_ALL_VISIBLE is already set. The real safe xmin value
+	 * never moves backwards, but GetOldestNonRemovableTransactionId() is
+	 * conservative and sometimes returns a value that's unnecessarily small,
+	 * so if we see that contradiction it just means that the tuples that we
+	 * think are not visible to everyone yet actually are, and the
+	 * PD_ALL_VISIBLE flag is correct.
+	 *
+	 * There should never be dead tuples on a page with PD_ALL_VISIBLE set,
+	 * however.
+	 */
+	else if (PageIsAllVisible(page) && ps->has_dead_items)
+	{
+		elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+			 RelationGetRelationName(onerel), blkno);
+		PageClearAllVisible(page);
+		MarkBufferDirty(buf);
+		visibilitymap_clear(onerel, blkno, vmbuffer,
+							VISIBILITYMAP_VALID_BITS);
+	}
+
+	/*
+	 * If the all-visible page is all-frozen but not marked as such yet, mark
+	 * it as all-frozen.  Note that all_frozen is only valid if all_visible is
+	 * true, so we must check both.
+	 */
+	else if (vms->all_visible_according_to_vm && ps->all_visible &&
+			 ps->all_frozen && !VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+	{
+		/*
+		 * We can pass InvalidTransactionId as the cutoff XID here, because
+		 * setting the all-frozen bit doesn't cause recovery conflicts.
+		 */
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, InvalidTransactionId,
+						  VISIBILITYMAP_ALL_FROZEN);
+	}
+}
+
 /*
  *	lazy_scan_heap() -- scan an open heap relation
  *
@@ -788,9 +1355,9 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
  *		page, and set commit status bits (see heap_page_prune).  It also builds
  *		lists of dead tuples and pages with free space, calculates statistics
  *		on the number of live tuples in the heap, and marks pages as
- *		all-visible if appropriate.  When done, or when we run low on space for
- *		dead-tuple TIDs, invoke vacuuming of indexes and call lazy_vacuum_heap
- *		to reclaim dead line pointers.
+ *		all-visible if appropriate.  When done, or when we run low on space
+ *		for dead-tuple TIDs, invoke lazy_vacuum_pruned_items to vacuum indexes
+ *		and mark dead line pointers for reuse via a second heap pass.
  *
  *		If the table has at least two indexes, we execute both index vacuum
  *		and index cleanup with parallel workers unless parallel vacuum is
@@ -815,22 +1382,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	LVParallelState *lps = NULL;
 	LVDeadTuples *dead_tuples;
 	BlockNumber nblocks,
-				blkno;
-	HeapTupleData tuple;
-	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
-	TransactionId relminmxid = onerel->rd_rel->relminmxid;
-	BlockNumber empty_pages,
-				vacuumed_pages,
+				blkno,
+				next_unskippable_block,
 				next_fsm_block_to_vacuum;
-	double		num_tuples,		/* total number of nonremovable tuples */
-				live_tuples,	/* live tuples (reltuples estimate) */
-				tups_vacuumed,	/* tuples cleaned up by current vacuum */
-				nkeep,			/* dead-but-not-removable tuples */
-				nunused;		/* # existing unused line pointers */
-	int			i;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
-	BlockNumber next_unskippable_block;
 	bool		skipping_blocks;
 	xl_heap_freeze_tuple *frozen;
 	StringInfoData buf;
@@ -841,6 +1397,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	};
 	int64		initprog_val[3];
 	GlobalVisState *vistest;
+	LVTempCounters c;
+
+	/* Counters of # blocks in onerel: */
+	BlockNumber empty_pages,
+				vacuumed_pages;
 
 	pg_rusage_init(&ru0);
 
@@ -856,15 +1417,21 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 						vacrelstats->relname)));
 
 	empty_pages = vacuumed_pages = 0;
-	next_fsm_block_to_vacuum = (BlockNumber) 0;
-	num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
+
+	/* Initialize counters */
+	c.num_tuples = 0;
+	c.live_tuples = 0;
+	c.tups_vacuumed = 0;
+	c.nkeep = 0;
+	c.nunused = 0;
 
 	nblocks = RelationGetNumberOfBlocks(onerel);
+	next_unskippable_block = 0;
+	next_fsm_block_to_vacuum = 0;
 	vacrelstats->rel_pages = nblocks;
 	vacrelstats->scanned_pages = 0;
 	vacrelstats->tupcount_pages = 0;
 	vacrelstats->nonempty_pages = 0;
-	vacrelstats->latestRemovedXid = InvalidTransactionId;
 
 	vistest = GlobalVisTestFor(onerel);
 
@@ -873,7 +1440,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * be used for an index, so we invoke parallelism only if there are at
 	 * least two indexes on a table.
 	 */
-	if (params->nworkers >= 0 && vacrelstats->useindex && nindexes > 1)
+	if (params->nworkers >= 0 && nindexes > 1)
 	{
 		/*
 		 * Since parallel workers cannot access data in temporary tables, we
@@ -901,7 +1468,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * initialized.
 	 */
 	if (!ParallelVacuumIsActive(lps))
-		lazy_space_alloc(vacrelstats, nblocks);
+		lazy_space_alloc(vacrelstats, nblocks, nindexes > 0);
 
 	dead_tuples = vacrelstats->dead_tuples;
 	frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
@@ -956,7 +1523,6 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * the last page.  This is worth avoiding mainly because such a lock must
 	 * be replayed on any hot standby, where it can be disruptive.
 	 */
-	next_unskippable_block = 0;
 	if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
 	{
 		while (next_unskippable_block < nblocks)
@@ -989,20 +1555,22 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	{
 		Buffer		buf;
 		Page		page;
-		OffsetNumber offnum,
-					maxoff;
-		bool		tupgone,
-					hastup;
-		int			prev_dead_count;
-		int			nfrozen;
+		LVVisMapPageState vms;
+		LVPrunePageState ps;
+		bool		savefreespace;
 		Size		freespace;
-		bool		all_visible_according_to_vm = false;
-		bool		all_visible;
-		bool		all_frozen = true;	/* provided all_visible is also true */
-		bool		has_dead_items;		/* includes existing LP_DEAD items */
-		TransactionId visibility_cutoff_xid = InvalidTransactionId;
 
-		/* see note above about forcing scanning of last page */
+		/* Initialize vm state for block: */
+		vms.all_visible_according_to_vm = false;
+		vms.visibility_cutoff_xid = InvalidTransactionId;
+
+		/* Note: Can't touch ps until we reach scan_prune_page() */
+
+		/*
+		 * Step 1 for block: Consider need to skip blocks.
+		 *
+		 * See note above about forcing scanning of last page.
+		 */
 #define FORCE_CHECK_PAGE() \
 		(blkno == nblocks - 1 && should_attempt_truncation(params, vacrelstats))
 
@@ -1054,7 +1622,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * that it's not all-frozen, so it might still be all-visible.
 			 */
 			if (aggressive && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
-				all_visible_according_to_vm = true;
+				vms.all_visible_according_to_vm = true;
 		}
 		else
 		{
@@ -1081,12 +1649,15 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 					vacrelstats->frozenskipped_pages++;
 				continue;
 			}
-			all_visible_according_to_vm = true;
+			vms.all_visible_according_to_vm = true;
 		}
 
 		vacuum_delay_point();
 
 		/*
+		 * Step 2 for block: Consider if we definitely have enough space to
+		 * process TIDs on page already.
+		 *
 		 * If we are close to overrunning the available space for dead-tuple
 		 * TIDs, pause and do a cycle of vacuuming before we tackle this page.
 		 */
@@ -1105,22 +1676,16 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				vmbuffer = InvalidBuffer;
 			}
 
-			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(onerel, Irel, vacrelstats, lps, nindexes);
-
-			/* Remove tuples from heap */
-			lazy_vacuum_heap(onerel, vacrelstats);
-
-			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
-			 */
-			dead_tuples->num_tuples = 0;
+			/* Remove the collected garbage tuples from table and indexes */
+			lazy_vacuum_pruned_items(onerel, vacrelstats, Irel, nindexes, lps,
+									 params->index_cleanup);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
 			 * upper-level FSM pages.  Note we have not yet processed blkno.
+			 * Even if we skipped heap vacuuming, FSM vacuuming could still be
+			 * worthwhile, since we could have updated the free space of empty
+			 * pages.
 			 */
 			FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
 			next_fsm_block_to_vacuum = blkno;
@@ -1131,22 +1696,29 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		}
 
 		/*
+		 * Step 3 for block: Set up visibility map page as needed.
+		 *
 		 * Pin the visibility map page in case we need to mark the page
 		 * all-visible.  In most cases this will be very cheap, because we'll
 		 * already have the correct page pinned anyway.  However, it's
 		 * possible that (a) next_unskippable_block is covered by a different
 		 * VM page than the current block or (b) we released our pin and did a
 		 * cycle of index vacuuming.
-		 *
 		 */
 		visibilitymap_pin(onerel, blkno, &vmbuffer);
 
 		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
 								 RBM_NORMAL, vac_strategy);
 
-		/* We need buffer cleanup lock so that we can prune HOT chains. */
+		/*
+		 * Step 4 for block: Acquire super-exclusive lock for pruning.
+		 *
+		 * We need buffer cleanup lock so that we can prune HOT chains.
+		 */
 		if (!ConditionalLockBufferForCleanup(buf))
 		{
+			bool		hastup;
+
 			/*
 			 * If we're not performing an aggressive scan to guard against XID
 			 * wraparound, and we don't want to forcibly check the page, then
@@ -1203,6 +1775,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			/* drop through to normal processing */
 		}
 
+		/*
+		 * Step 5 for block: Handle empty/new pages.
+		 *
+		 * By here we have a super-exclusive lock, and it's clear that this
+		 * page is one that we consider scanned.
+		 */
 		vacrelstats->scanned_pages++;
 		vacrelstats->tupcount_pages++;
 
@@ -1210,399 +1788,84 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		if (PageIsNew(page))
 		{
-			/*
-			 * All-zeroes pages can be left over if either a backend extends
-			 * the relation by a single page, but crashes before the newly
-			 * initialized page has been written out, or when bulk-extending
-			 * the relation (which creates a number of empty pages at the tail
-			 * end of the relation, but enters them into the FSM).
-			 *
-			 * Note we do not enter the page into the visibilitymap. That has
-			 * the downside that we repeatedly visit this page in subsequent
-			 * vacuums, but otherwise we'll never not discover the space on a
-			 * promoted standby. The harm of repeated checking ought to
-			 * normally not be too bad - the space usually should be used at
-			 * some point, otherwise there wouldn't be any regular vacuums.
-			 *
-			 * Make sure these pages are in the FSM, to ensure they can be
-			 * reused. Do that by testing if there's any space recorded for
-			 * the page. If not, enter it. We do so after releasing the lock
-			 * on the heap page, the FSM is approximate, after all.
-			 */
-			UnlockReleaseBuffer(buf);
-
 			empty_pages++;
-
-			if (GetRecordedFreeSpace(onerel, blkno) == 0)
-			{
-				Size		freespace;
-
-				freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
-				RecordPageWithFreeSpace(onerel, blkno, freespace);
-			}
+			/* Releases lock on buf for us: */
+			scan_new_page(onerel, buf);
 			continue;
 		}
-
-		if (PageIsEmpty(page))
+		else if (PageIsEmpty(page))
 		{
 			empty_pages++;
-			freespace = PageGetHeapFreeSpace(page);
-
-			/*
-			 * Empty pages are always all-visible and all-frozen (note that
-			 * the same is currently not true for new pages, see above).
-			 */
-			if (!PageIsAllVisible(page))
-			{
-				START_CRIT_SECTION();
-
-				/* mark buffer dirty before writing a WAL record */
-				MarkBufferDirty(buf);
-
-				/*
-				 * It's possible that another backend has extended the heap,
-				 * initialized the page, and then failed to WAL-log the page
-				 * due to an ERROR.  Since heap extension is not WAL-logged,
-				 * recovery might try to replay our record setting the page
-				 * all-visible and find that the page isn't initialized, which
-				 * will cause a PANIC.  To prevent that, check whether the
-				 * page has been previously WAL-logged, and if not, do that
-				 * now.
-				 */
-				if (RelationNeedsWAL(onerel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
-					log_newpage_buffer(buf, true);
-
-				PageSetAllVisible(page);
-				visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
-								  vmbuffer, InvalidTransactionId,
-								  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
-				END_CRIT_SECTION();
-			}
-
-			UnlockReleaseBuffer(buf);
-			RecordPageWithFreeSpace(onerel, blkno, freespace);
+			/* Releases lock on buf for us (though keeps vmbuffer pin): */
+			scan_empty_page(onerel, buf, vmbuffer, vacrelstats);
 			continue;
 		}
 
 		/*
-		 * Prune all HOT-update chains in this page.
+		 * Step 6 for block: Do pruning.
 		 *
-		 * We count tuples removed by the pruning step as removed by VACUUM
-		 * (existing LP_DEAD line pointers don't count).
-		 */
-		tups_vacuumed += heap_page_prune(onerel, buf, vistest,
-										 InvalidTransactionId, 0, false,
-										 &vacrelstats->latestRemovedXid,
-										 &vacrelstats->offnum);
-
-		/*
-		 * Now scan the page to collect vacuumable items and check for tuples
-		 * requiring freezing.
+		 * Also accumulates details of remaining LP_DEAD line pointers on page
+		 * in dead tuple list.  This includes LP_DEAD line pointers that we
+		 * ourselves just pruned, as well as existing LP_DEAD line pointers
+		 * pruned earlier.
+		 *
+		 * Also handles tuple freezing -- considers freezing XIDs from all
+		 * tuple headers left behind following pruning.
 		 */
-		all_visible = true;
-		has_dead_items = false;
-		nfrozen = 0;
-		hastup = false;
-		prev_dead_count = dead_tuples->num_tuples;
-		maxoff = PageGetMaxOffsetNumber(page);
+		scan_prune_page(onerel, buf, vacrelstats, vistest, frozen,
+						&c, &ps, &vms, params->index_cleanup);
 
 		/*
-		 * Note: If you change anything in the loop below, also look at
-		 * heap_page_is_all_visible to see if that needs to be changed.
+		 * Step 7 for block: Set up details for saving free space in FSM at
+		 * end of loop.  (Also performs extra single pass strategy steps in
+		 * "nindexes == 0" case.)
+		 *
+		 * If we have any LP_DEAD items on this page (i.e. any new dead_tuples
+		 * entries compared to just before scan_prune_page()) then the page
+		 * will be visited again by lazy_vacuum_heap(), which will compute and
+		 * record its post-compaction free space.  If not, then we're done
+		 * with this page, so remember its free space as-is.
 		 */
-		for (offnum = FirstOffsetNumber;
-			 offnum <= maxoff;
-			 offnum = OffsetNumberNext(offnum))
+		savefreespace = false;
+		freespace = 0;
+		if (nindexes > 0 && ps.has_dead_items &&
+			params->index_cleanup != VACOPT_TERNARY_DISABLED)
+		{
+			/* Wait until lazy_vacuum_heap() to save free space */
+		}
+		else
 		{
-			ItemId		itemid;
-
-			/*
-			 * Set the offset number so that we can display it along with any
-			 * error that occurred while processing this tuple.
-			 */
-			vacrelstats->offnum = offnum;
-			itemid = PageGetItemId(page, offnum);
-
-			/* Unused items require no processing, but we count 'em */
-			if (!ItemIdIsUsed(itemid))
-			{
-				nunused += 1;
-				continue;
-			}
-
-			/* Redirect items mustn't be touched */
-			if (ItemIdIsRedirected(itemid))
-			{
-				hastup = true;	/* this page won't be truncatable */
-				continue;
-			}
-
-			ItemPointerSet(&(tuple.t_self), blkno, offnum);
-
 			/*
-			 * LP_DEAD line pointers are to be vacuumed normally; but we don't
-			 * count them in tups_vacuumed, else we'd be double-counting (at
-			 * least in the common case where heap_page_prune() just freed up
-			 * a non-HOT tuple).  Note also that the final tups_vacuumed value
-			 * might be very low for tables where opportunistic page pruning
-			 * happens to occur very frequently (via heap_page_prune_opt()
-			 * calls that free up non-HOT tuples).
+			 * Will never reach lazy_vacuum_heap() (or will, but won't reach
+			 * this specific page)
 			 */
-			if (ItemIdIsDead(itemid))
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				all_visible = false;
-				has_dead_items = true;
-				continue;
-			}
-
-			Assert(ItemIdIsNormal(itemid));
-
-			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
-			tuple.t_len = ItemIdGetLength(itemid);
-			tuple.t_tableOid = RelationGetRelid(onerel);
+			savefreespace = true;
+			freespace = PageGetHeapFreeSpace(page);
+		}
 
-			tupgone = false;
+		if (nindexes == 0 && ps.has_dead_items)
+		{
+			Assert(dead_tuples->num_tuples > 0);
 
 			/*
-			 * The criteria for counting a tuple as live in this block need to
-			 * match what analyze.c's acquire_sample_rows() does, otherwise
-			 * VACUUM and ANALYZE may produce wildly different reltuples
-			 * values, e.g. when there are many recently-dead tuples.
+			 * One pass strategy (no indexes) case.
 			 *
-			 * The logic here is a bit simpler than acquire_sample_rows(), as
-			 * VACUUM can't run inside a transaction block, which makes some
-			 * cases impossible (e.g. in-progress insert from the same
-			 * transaction).
+			 * Mark LP_DEAD item pointers as LP_UNUSED now, since there won't
+			 * be a second pass in lazy_vacuum_heap().
 			 */
-			switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
-			{
-				case HEAPTUPLE_DEAD:
-
-					/*
-					 * Ordinarily, DEAD tuples would have been removed by
-					 * heap_page_prune(), but it's possible that the tuple
-					 * state changed since heap_page_prune() looked.  In
-					 * particular an INSERT_IN_PROGRESS tuple could have
-					 * changed to DEAD if the inserter aborted.  So this
-					 * cannot be considered an error condition.
-					 *
-					 * If the tuple is HOT-updated then it must only be
-					 * removed by a prune operation; so we keep it just as if
-					 * it were RECENTLY_DEAD.  Also, if it's a heap-only
-					 * tuple, we choose to keep it, because it'll be a lot
-					 * cheaper to get rid of it in the next pruning pass than
-					 * to treat it like an indexed tuple. Finally, if index
-					 * cleanup is disabled, the second heap pass will not
-					 * execute, and the tuple will not get removed, so we must
-					 * treat it like any other dead tuple that we choose to
-					 * keep.
-					 *
-					 * If this were to happen for a tuple that actually needed
-					 * to be deleted, we'd be in trouble, because it'd
-					 * possibly leave a tuple below the relation's xmin
-					 * horizon alive.  heap_prepare_freeze_tuple() is prepared
-					 * to detect that case and abort the transaction,
-					 * preventing corruption.
-					 */
-					if (HeapTupleIsHotUpdated(&tuple) ||
-						HeapTupleIsHeapOnly(&tuple) ||
-						params->index_cleanup == VACOPT_TERNARY_DISABLED)
-						nkeep += 1;
-					else
-						tupgone = true; /* we can delete the tuple */
-					all_visible = false;
-					break;
-				case HEAPTUPLE_LIVE:
-
-					/*
-					 * Count it as live.  Not only is this natural, but it's
-					 * also what acquire_sample_rows() does.
-					 */
-					live_tuples += 1;
+			lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+			vacuumed_pages++;
 
-					/*
-					 * Is the tuple definitely visible to all transactions?
-					 *
-					 * NB: Like with per-tuple hint bits, we can't set the
-					 * PD_ALL_VISIBLE flag if the inserter committed
-					 * asynchronously. See SetHintBits for more info. Check
-					 * that the tuple is hinted xmin-committed because of
-					 * that.
-					 */
-					if (all_visible)
-					{
-						TransactionId xmin;
-
-						if (!HeapTupleHeaderXminCommitted(tuple.t_data))
-						{
-							all_visible = false;
-							break;
-						}
-
-						/*
-						 * The inserter definitely committed. But is it old
-						 * enough that everyone sees it as committed?
-						 */
-						xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-						if (!TransactionIdPrecedes(xmin, OldestXmin))
-						{
-							all_visible = false;
-							break;
-						}
-
-						/* Track newest xmin on page. */
-						if (TransactionIdFollows(xmin, visibility_cutoff_xid))
-							visibility_cutoff_xid = xmin;
-					}
-					break;
-				case HEAPTUPLE_RECENTLY_DEAD:
-
-					/*
-					 * If tuple is recently deleted then we must not remove it
-					 * from relation.
-					 */
-					nkeep += 1;
-					all_visible = false;
-					break;
-				case HEAPTUPLE_INSERT_IN_PROGRESS:
-
-					/*
-					 * This is an expected case during concurrent vacuum.
-					 *
-					 * We do not count these rows as live, because we expect
-					 * the inserting transaction to update the counters at
-					 * commit, and we assume that will happen only after we
-					 * report our results.  This assumption is a bit shaky,
-					 * but it is what acquire_sample_rows() does, so be
-					 * consistent.
-					 */
-					all_visible = false;
-					break;
-				case HEAPTUPLE_DELETE_IN_PROGRESS:
-					/* This is an expected case during concurrent vacuum */
-					all_visible = false;
-
-					/*
-					 * Count such rows as live.  As above, we assume the
-					 * deleting transaction will commit and update the
-					 * counters after we report.
-					 */
-					live_tuples += 1;
-					break;
-				default:
-					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
-					break;
-			}
-
-			if (tupgone)
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-													   &vacrelstats->latestRemovedXid);
-				tups_vacuumed += 1;
-				has_dead_items = true;
-			}
-			else
-			{
-				bool		tuple_totally_frozen;
-
-				num_tuples += 1;
-				hastup = true;
-
-				/*
-				 * Each non-removable tuple must be checked to see if it needs
-				 * freezing.  Note we already have exclusive buffer lock.
-				 */
-				if (heap_prepare_freeze_tuple(tuple.t_data,
-											  relfrozenxid, relminmxid,
-											  FreezeLimit, MultiXactCutoff,
-											  &frozen[nfrozen],
-											  &tuple_totally_frozen))
-					frozen[nfrozen++].offset = offnum;
-
-				if (!tuple_totally_frozen)
-					all_frozen = false;
-			}
-		}						/* scan along page */
-
-		/*
-		 * Clear the offset information once we have processed all the tuples
-		 * on the page.
-		 */
-		vacrelstats->offnum = InvalidOffsetNumber;
-
-		/*
-		 * If we froze any tuples, mark the buffer dirty, and write a WAL
-		 * record recording the changes.  We must log the changes to be
-		 * crash-safe against future truncation of CLOG.
-		 */
-		if (nfrozen > 0)
-		{
-			START_CRIT_SECTION();
-
-			MarkBufferDirty(buf);
-
-			/* execute collected freezes */
-			for (i = 0; i < nfrozen; i++)
-			{
-				ItemId		itemid;
-				HeapTupleHeader htup;
-
-				itemid = PageGetItemId(page, frozen[i].offset);
-				htup = (HeapTupleHeader) PageGetItem(page, itemid);
-
-				heap_execute_freeze_tuple(htup, &frozen[i]);
-			}
-
-			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
-			{
-				XLogRecPtr	recptr;
-
-				recptr = log_heap_freeze(onerel, buf, FreezeLimit,
-										 frozen, nfrozen);
-				PageSetLSN(page, recptr);
-			}
-
-			END_CRIT_SECTION();
-		}
-
-		/*
-		 * If there are no indexes we can vacuum the page right now instead of
-		 * doing a second scan. Also we don't do that but forget dead tuples
-		 * when index cleanup is disabled.
-		 */
-		if (!vacrelstats->useindex && dead_tuples->num_tuples > 0)
-		{
-			if (nindexes == 0)
-			{
-				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
-				vacuumed_pages++;
-				has_dead_items = false;
-			}
-			else
-			{
-				/*
-				 * Here, we have indexes but index cleanup is disabled.
-				 * Instead of vacuuming the dead tuples on the heap, we just
-				 * forget them.
-				 *
-				 * Note that vacrelstats->dead_tuples could have tuples which
-				 * became dead after HOT-pruning but are not marked dead yet.
-				 * We do not process them because it's a very rare condition,
-				 * and the next vacuum will process them anyway.
-				 */
-				Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
-			}
+			/* This won't have changed: */
+			Assert(savefreespace && freespace == PageGetHeapFreeSpace(page));
 
 			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
+			 * Make sure scan_setvmbit_page() won't stop setting VM due to
+			 * now-vacuumed LP_DEAD items:
 			 */
+			ps.has_dead_items = false;
+
+			/* Forget the now-vacuumed tuples */
 			dead_tuples->num_tuples = 0;
 
 			/*
@@ -1619,109 +1882,27 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 		}
 
-		freespace = PageGetHeapFreeSpace(page);
-
-		/* mark page all-visible, if appropriate */
-		if (all_visible && !all_visible_according_to_vm)
-		{
-			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
-
-			if (all_frozen)
-				flags |= VISIBILITYMAP_ALL_FROZEN;
-
-			/*
-			 * It should never be the case that the visibility map page is set
-			 * while the page-level bit is clear, but the reverse is allowed
-			 * (if checksums are not enabled).  Regardless, set both bits so
-			 * that we get back in sync.
-			 *
-			 * NB: If the heap page is all-visible but the VM bit is not set,
-			 * we don't need to dirty the heap page.  However, if checksums
-			 * are enabled, we do need to make sure that the heap page is
-			 * dirtied before passing it to visibilitymap_set(), because it
-			 * may be logged.  Given that this situation should only happen in
-			 * rare cases after a crash, it is not worth optimizing.
-			 */
-			PageSetAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, visibility_cutoff_xid, flags);
-		}
+		/* One pass strategy had better have no dead tuples by now: */
+		Assert(nindexes > 0 || dead_tuples->num_tuples == 0);
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after we checked it and before we took the buffer
-		 * content lock, so we must recheck before jumping to the conclusion
-		 * that something bad has happened.
+		 * Step 8 for block: Handle setting visibility map bit as appropriate
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
-		{
-			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
-				 vacrelstats->relname, blkno);
-			visibilitymap_clear(onerel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
+		scan_setvmbit_page(onerel, buf, vmbuffer, &ps, &vms);
 
 		/*
-		 * It's possible for the value returned by
-		 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
-		 * wrong for us to see tuples that appear to not be visible to
-		 * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
-		 * xmin value never moves backwards, but
-		 * GetOldestNonRemovableTransactionId() is conservative and sometimes
-		 * returns a value that's unnecessarily small, so if we see that
-		 * contradiction it just means that the tuples that we think are not
-		 * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
-		 * is correct.
-		 *
-		 * There should never be dead tuples on a page with PD_ALL_VISIBLE
-		 * set, however.
-		 */
-		else if (PageIsAllVisible(page) && has_dead_items)
-		{
-			elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
-				 vacrelstats->relname, blkno);
-			PageClearAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_clear(onerel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
-
-		/*
-		 * If the all-visible page is all-frozen but not marked as such yet,
-		 * mark it as all-frozen.  Note that all_frozen is only valid if
-		 * all_visible is true, so we must check both.
+		 * Step 9 for block: drop super-exclusive lock, finalize page by
+		 * recording its free space in the FSM as appropriate
 		 */
-		else if (all_visible_according_to_vm && all_visible && all_frozen &&
-				 !VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
-		{
-			/*
-			 * We can pass InvalidTransactionId as the cutoff XID here,
-			 * because setting the all-frozen bit doesn't cause recovery
-			 * conflicts.
-			 */
-			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, InvalidTransactionId,
-							  VISIBILITYMAP_ALL_FROZEN);
-		}
 
 		UnlockReleaseBuffer(buf);
-
 		/* Remember the location of the last page with nonremovable tuples */
-		if (hastup)
+		if (ps.hastup)
 			vacrelstats->nonempty_pages = blkno + 1;
-
-		/*
-		 * If we remembered any tuples for deletion, then the page will be
-		 * visited again by lazy_vacuum_heap, which will compute and record
-		 * its post-compaction free space.  If not, then we're done with this
-		 * page, so remember its free space as-is.  (This path will always be
-		 * taken if there are no indexes.)
-		 */
-		if (dead_tuples->num_tuples == prev_dead_count)
+		if (savefreespace)
 			RecordPageWithFreeSpace(onerel, blkno, freespace);
+
+		/* Finished all steps for block by here (at the latest) */
 	}
 
 	/* report that everything is scanned and vacuumed */
@@ -1733,14 +1914,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	pfree(frozen);
 
 	/* save stats for use later */
-	vacrelstats->tuples_deleted = tups_vacuumed;
-	vacrelstats->new_dead_tuples = nkeep;
+	vacrelstats->tuples_deleted = c.tups_vacuumed;
+	vacrelstats->new_dead_tuples = c.nkeep;
 
 	/* now we can compute the new value for pg_class.reltuples */
 	vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
 														  nblocks,
 														  vacrelstats->tupcount_pages,
-														  live_tuples);
+														  c.live_tuples);
 
 	/*
 	 * Also compute the total number of surviving heap entries.  In the
@@ -1759,19 +1940,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	}
 
 	/* If any tuples need to be deleted, perform final vacuum cycle */
-	/* XXX put a threshold on min number of tuples here? */
+	Assert(nindexes > 0 || dead_tuples->num_tuples == 0);
 	if (dead_tuples->num_tuples > 0)
-	{
-		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(onerel, Irel, vacrelstats, lps, nindexes);
-
-		/* Remove tuples from heap */
-		lazy_vacuum_heap(onerel, vacrelstats);
-	}
+		lazy_vacuum_pruned_items(onerel, vacrelstats, Irel, nindexes, lps,
+								 params->index_cleanup);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
-	 * not there were indexes.
+	 * not there were indexes, and whether or not we skipped index vacuuming.
 	 */
 	if (blkno > next_fsm_block_to_vacuum)
 		FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
@@ -1779,8 +1955,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/* report all blocks vacuumed */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
-	/* Do post-vacuum cleanup */
-	if (vacrelstats->useindex)
+	/*
+	 * Do post-vacuum cleanup.
+	 *
+	 * Note that post-vacuum cleanup does not take place with
+	 * INDEX_CLEANUP=OFF.
+	 */
+	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
 		lazy_cleanup_all_indexes(Irel, vacrelstats, lps, nindexes);
 
 	/*
@@ -1790,23 +1971,32 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	if (ParallelVacuumIsActive(lps))
 		end_parallel_vacuum(vacrelstats->indstats, lps, nindexes);
 
-	/* Update index statistics */
-	if (vacrelstats->useindex)
+	/*
+	 * Update index statistics.
+	 *
+	 * Note that updating the statistics does not take place with
+	 * INDEX_CLEANUP=OFF.
+	 */
+	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
 		update_index_statistics(Irel, vacrelstats->indstats, nindexes);
 
-	/* If no indexes, make log report that lazy_vacuum_heap would've made */
-	if (vacuumed_pages)
+	/*
+	 * If no indexes, make log report that lazy_vacuum_pruned_items() would've
+	 * made
+	 */
+	Assert(nindexes == 0 || vacuumed_pages == 0);
+	if (nindexes == 0)
 		ereport(elevel,
 				(errmsg("\"%s\": removed %.0f row versions in %u pages",
 						vacrelstats->relname,
-						tups_vacuumed, vacuumed_pages)));
+						vacrelstats->tuples_deleted, vacuumed_pages)));
 
 	initStringInfo(&buf);
 	appendStringInfo(&buf,
 					 _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"),
-					 nkeep, OldestXmin);
+					 c.nkeep, OldestXmin);
 	appendStringInfo(&buf, _("There were %.0f unused item identifiers.\n"),
-					 nunused);
+					 c.nunused);
 	appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
 									"Skipped %u pages due to buffer pins, ",
 									vacrelstats->pinskipped_pages),
@@ -1822,18 +2012,73 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
 
 	ereport(elevel,
-			(errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
+			(errmsg("\"%s\": newly pruned %.0f items, found %.0f nonremovable items in %u out of %u pages",
 					vacrelstats->relname,
-					tups_vacuumed, num_tuples,
+					c.tups_vacuumed, c.num_tuples,
 					vacrelstats->scanned_pages, nblocks),
 			 errdetail_internal("%s", buf.data)));
 	pfree(buf.data);
 }
 
 /*
- *	lazy_vacuum_all_indexes() -- vacuum all indexes of relation.
+ * Remove the collected garbage tuples from the table and its indexes.
  *
- * We process the indexes serially unless we are doing parallel vacuum.
+ * We may be required to skip index vacuuming by INDEX_CLEANUP reloption.
+ */
+static void
+lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
+						 Relation *Irel, int nindexes, LVParallelState *lps,
+						 VacOptTernaryValue index_cleanup)
+{
+	bool		skipping;
+
+	/* Should not end up here with no indexes */
+	Assert(nindexes > 0);
+	Assert(!IsParallelWorker());
+
+	/* Check whether or not to do index vacuum and heap vacuum */
+	if (index_cleanup == VACOPT_TERNARY_DISABLED)
+		skipping = true;
+	else
+		skipping = false;
+
+	if (!skipping)
+	{
+		/* Okay, we're going to do index vacuuming */
+		lazy_vacuum_all_indexes(onerel, Irel, vacrelstats, lps, nindexes);
+
+		/* Remove tuples from heap */
+		lazy_vacuum_heap(onerel, vacrelstats);
+	}
+	else
+	{
+		/*
+		 * We skipped index vacuuming.  Make the log report that
+		 * lazy_vacuum_heap() would've made.
+		 *
+		 * Don't report tups_vacuumed here because it will be zero here in
+		 * Don't report tups_vacuumed here because it will be zero in the
+		 * VACUUM.  This is roughly consistent with lazy_vacuum_heap(), and
+		 * the similar "nindexes == 0" specific ereport() at the end of
+		 * lazy_scan_heap().
+		 */
+		ereport(elevel,
+				(errmsg("\"%s\": INDEX_CLEANUP off forced VACUUM to not totally remove %d pruned items",
+						vacrelstats->relname,
+						vacrelstats->dead_tuples->num_tuples)));
+	}
+
+	/*
+	 * Forget the now-vacuumed tuples, and press on, but be careful not to
+	 * reset latestRemovedXid since we want that value to be valid.
+	 */
+	vacrelstats->dead_tuples->num_tuples = 0;
+}
+
+/*
+ *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
+ *
+ * Should only be called through lazy_vacuum_pruned_items().
  */
 static void
 lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
@@ -1882,17 +2127,14 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 								 vacrelstats->num_index_scans);
 }
 
-
 /*
- *	lazy_vacuum_heap() -- second pass over the heap
+ *	lazy_vacuum_heap() -- second pass over the heap for two pass strategy
  *
  *		This routine marks dead tuples as unused and compacts out free
  *		space on their pages.  Pages not having dead tuples recorded from
  *		lazy_scan_heap are not visited at all.
  *
- * Note: the reason for doing this as a second pass is we cannot remove
- * the tuples until we've removed their index entries, and we want to
- * process index entry removal in batches as large as possible.
+ * Should only be called through lazy_vacuum_pruned_items().
  */
 static void
 lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
@@ -2898,14 +3140,14 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
  * Return the maximum number of dead tuples we can record.
  */
 static long
-compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
+compute_max_dead_tuples(BlockNumber relblocks, bool hasindex)
 {
 	long		maxtuples;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
 	autovacuum_work_mem != -1 ?
 	autovacuum_work_mem : maintenance_work_mem;
 
-	if (useindex)
+	if (hasindex)
 	{
 		maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
 		maxtuples = Min(maxtuples, INT_MAX);
@@ -2930,12 +3172,12 @@ compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
  * See the comments at the head of this file for rationale.
  */
 static void
-lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
+lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks, bool hasindex)
 {
 	LVDeadTuples *dead_tuples = NULL;
 	long		maxtuples;
 
-	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->useindex);
+	maxtuples = compute_max_dead_tuples(relblocks, hasindex);
 
 	dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
 	dead_tuples->num_tuples = 0;
@@ -3055,7 +3297,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 
 	/*
 	 * This is a stripped down version of the line pointer scan in
-	 * lazy_scan_heap(). So if you change anything here, also check that code.
+	 * scan_new_page. So if you change anything here, also check that code.
 	 */
 	maxoff = PageGetMaxOffsetNumber(page);
 	for (offnum = FirstOffsetNumber;
@@ -3101,7 +3343,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 				{
 					TransactionId xmin;
 
-					/* Check comments in lazy_scan_heap. */
+					/* Check comments in scan_new_page. */
 					if (!HeapTupleHeaderXminCommitted(tuple.t_data))
 					{
 						all_visible = false;
-- 
2.27.0

v6-0002-Remove-tupgone-special-case-from-vacuumlazy.c.patchapplication/octet-stream; name=v6-0002-Remove-tupgone-special-case-from-vacuumlazy.c.patchDownload
From ad1cbb84bce822371fd13efe8939932cef5b9f17 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 19 Mar 2021 14:46:21 -0700
Subject: [PATCH v6 2/4] Remove tupgone special case from vacuumlazy.c.

Retry the call to heap_page_prune() for the buffer being pruned and
vacuumed in rare cases where there is disagreement between the first
heap_page_prune() call and VACUUM's HeapTupleSatisfiesVacuum() call.
This was possible when a concurrently aborting transaction rendered a
live tuple dead in the tiny window between each check.  As a result,
VACUUM's definition of dead tuples (tuples that are to be deleted from
indexes during VACUUM) is simplified: it is always LP_DEAD stub line
pointers from the first scan of the heap.  Note that in general VACUUM
may not have actually done all the pruning that rendered tuples LP_DEAD.

This has the effect of decoupling index vacuuming (and heap page
vacuuming) from pruning during VACUUM's first heap pass.  The index
vacuum skipping performed by the INDEX_CLEANUP mechanism added by commit
a96c41f introduced one case where index vacuuming could be skipped, but
there are reasons to doubt that its approach was 100% robust.  Whereas
simply retrying pruning (and eliminating the tupgone steps entirely)
makes everything far simpler for heap vacuuming, and so far simpler in
general.

Heap vacuuming can now be thought of as conceptually similar to index
vacuuming and conceptually dissimilar to heap pruning.  Heap pruning now
has sole responsibility for anything involving the logical contents of
the database (e.g., managing transaction status information, recovery
conflicts, considering what to do with chains of tuples caused by
UPDATEs).  Whereas index vacuuming and heap vacuuming are now strictly
concerned with removing garbage tuples from a physical data structure
that backs the logical database.

This work enables INDEX_CLEANUP-style skipping of index vacuuming to be
pushed a lot further -- the decision can now be made dynamically (since
there is no question about leaving behind a dead tuple with storage due
to skipping the second heap pass/heap vacuuming).  An upcoming patch
from Masahiko Sawada will teach VACUUM to skip index vacuuming
dynamically, based on criteria involving the number of dead tuples.  The
only truly essential steps for VACUUM now all take place during the
first heap pass.  These are heap pruning and tuple freezing.  Everything
else is now an optional adjunct, at least in principle.

VACUUM can even change its mind about indexes (it can decide to give up
on deleting tuples from indexes).  There is no fundamental difference
between a VACUUM that decides to skip index vacuuming before it even
began, and a VACUUM that skips index vacuuming having already done a
certain amount of it.

Also remove XLOG_HEAP2_CLEANUP_INFO records.  These are no longer
necessary because we now rely entirely on heap pruning to take care of
recovery conflicts during VACUUM -- there is no longer any need to have
extra recovery conflicts due to the tupgone case allowing tuples that
still have storage (i.e. are not LP_DEAD) nevertheless being considered
dead tuples by VACUUM.  Note that heap vacuuming now uses exactly the
same strategy for recovery conflicts as index vacuuming.  Both
mechanisms now completely rely on heap pruning to generate all the
recovery conflicts that they require.

Also stop acquiring a super-exclusive lock for heap pages when they're
vacuumed during VACUUM's second heap pass.  A regular exclusive lock is
enough.  This is correct because heap page vacuuming is now strictly a
matter of setting the LP_DEAD line pointers to LP_UNUSED.  No other
backend can have a pointer to a tuple located in a pinned buffer that
can be invalidated by a concurrent heap page vacuum operation.  Note
that the page is no longer defragmented during heap page vacuuming,
because that is unsafe without a super-exclusive lock.

Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record
changes.

Credit for the idea of retrying pruning a page to avoid the tupgone case
goes to Andres Freund.
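
To make the new control flow easier to follow, here is a condensed
sketch of the retry loop as it appears in scan_prune_page() in the diff
below.  The wrapper function name is invented for illustration only, and
the freezing and counter bookkeeping are omitted:

    /*
     * Illustrative only: a stripped-down version of the retry logic in
     * scan_prune_page() (this wrapper does not exist in the patch).
     */
    static void
    prune_page_with_retry(Relation rel, Buffer buf, GlobalVisState *vistest,
                          TransactionId OldestXmin)
    {
        Page        page = BufferGetPage(buf);
        OffsetNumber offnum,
                    maxoff;

    retry:
        /* Prune first; DEAD tuples with storage become LP_DEAD stubs */
        (void) heap_page_prune(rel, buf, vistest,
                               InvalidTransactionId, 0, false, NULL);

        maxoff = PageGetMaxOffsetNumber(page);
        for (offnum = FirstOffsetNumber; offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            ItemId      itemid = PageGetItemId(page, offnum);
            HeapTupleData tuple;

            if (!ItemIdIsNormal(itemid))
                continue;       /* LP_UNUSED, LP_REDIRECT, LP_DEAD */

            ItemPointerSet(&tuple.t_self, BufferGetBlockNumber(buf), offnum);
            tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
            tuple.t_len = ItemIdGetLength(itemid);
            tuple.t_tableOid = RelationGetRelid(rel);

            /*
             * A transaction that aborted after heap_page_prune() examined
             * the tuple can make it DEAD now.  Instead of the old tupgone
             * handling, just prune the page again.
             */
            if (unlikely(HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf) ==
                         HEAPTUPLE_DEAD))
                goto retry;

            /* ... freezing and counting of surviving tuples happens here ... */
        }
    }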
---
 src/backend/access/gist/gistxlog.c       |   8 +-
 src/backend/access/hash/hash_xlog.c      |   8 +-
 src/backend/access/heap/heapam.c         | 205 +++++++++-----------
 src/backend/access/heap/pruneheap.c      |  60 +++---
 src/backend/access/heap/vacuumlazy.c     | 232 +++++++++++------------
 src/backend/access/nbtree/nbtree.c       |   8 +-
 src/backend/access/rmgrdesc/heapdesc.c   |  32 ++--
 src/backend/replication/logical/decode.c |   4 +-
 src/backend/storage/page/bufpage.c       |  20 +-
 src/include/access/heapam.h              |   2 +-
 src/include/access/heapam_xlog.h         |  41 ++--
 src/tools/pgindent/typedefs.list         |   4 +-
 12 files changed, 301 insertions(+), 323 deletions(-)

diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1c80eae044..6464cb9281 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -184,10 +184,10 @@ gistRedoDeleteRecord(XLogReaderState *record)
 	 *
 	 * GiST delete records can conflict with standby queries.  You might think
 	 * that vacuum records would conflict as well, but we've handled that
-	 * already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual gist vacuum records on that index.
+	 * already.  XLOG_HEAP2_PRUNE records provide the highest xid cleaned by
+	 * the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual gist vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 02d9e6cdfd..af35a991fc 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -992,10 +992,10 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
 	 * Hash index records that are marked as LP_DEAD and being removed during
 	 * hash index tuple insertion can conflict with standby queries. You might
 	 * think that vacuum records would conflict as well, but we've handled
-	 * that already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual hash index vacuum records on that index.
+	 * that already.  XLOG_HEAP2_PRUNE records provide the highest xid cleaned
+	 * by the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual hash index vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7cb87f4a3b..1d30a92420 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7528,7 +7528,7 @@ heap_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
 			 * must have considered the original tuple header as part of
 			 * generating its own latestRemovedXid value.
 			 *
-			 * Relying on XLOG_HEAP2_CLEAN records like this is the same
+			 * Relying on XLOG_HEAP2_PRUNE records like this is the same
 			 * strategy that index vacuuming uses in all cases.  Index VACUUM
 			 * WAL records don't even have a latestRemovedXid field of their
 			 * own for this reason.
@@ -7947,88 +7947,6 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
 	return nblocksfavorable;
 }
 
-/*
- * Perform XLogInsert to register a heap cleanup info message. These
- * messages are sent once per VACUUM and are required because
- * of the phasing of removal operations during a lazy VACUUM.
- * see comments for vacuum_log_cleanup_info().
- */
-XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
-{
-	xl_heap_cleanup_info xlrec;
-	XLogRecPtr	recptr;
-
-	xlrec.node = rnode;
-	xlrec.latestRemovedXid = latestRemovedXid;
-
-	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, SizeOfHeapCleanupInfo);
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEANUP_INFO);
-
-	return recptr;
-}
-
-/*
- * Perform XLogInsert for a heap-clean operation.  Caller must already
- * have modified the buffer and marked it dirty.
- *
- * Note: prior to Postgres 8.3, the entries in the nowunused[] array were
- * zero-based tuple indexes.  Now they are one-based like other uses
- * of OffsetNumber.
- *
- * We also include latestRemovedXid, which is the greatest XID present in
- * the removed tuples. That allows recovery processing to cancel or wait
- * for long standby queries that can still see these tuples.
- */
-XLogRecPtr
-log_heap_clean(Relation reln, Buffer buffer,
-			   OffsetNumber *redirected, int nredirected,
-			   OffsetNumber *nowdead, int ndead,
-			   OffsetNumber *nowunused, int nunused,
-			   TransactionId latestRemovedXid)
-{
-	xl_heap_clean xlrec;
-	XLogRecPtr	recptr;
-
-	/* Caller should not call me on a non-WAL-logged relation */
-	Assert(RelationNeedsWAL(reln));
-
-	xlrec.latestRemovedXid = latestRemovedXid;
-	xlrec.nredirected = nredirected;
-	xlrec.ndead = ndead;
-
-	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, SizeOfHeapClean);
-
-	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-
-	/*
-	 * The OffsetNumber arrays are not actually in the buffer, but we pretend
-	 * that they are.  When XLogInsert stores the whole buffer, the offset
-	 * arrays need not be stored too.  Note that even if all three arrays are
-	 * empty, we want to expose the buffer as a candidate for whole-page
-	 * storage, since this record type implies a defragmentation operation
-	 * even if no line pointers changed state.
-	 */
-	if (nredirected > 0)
-		XLogRegisterBufData(0, (char *) redirected,
-							nredirected * sizeof(OffsetNumber) * 2);
-
-	if (ndead > 0)
-		XLogRegisterBufData(0, (char *) nowdead,
-							ndead * sizeof(OffsetNumber));
-
-	if (nunused > 0)
-		XLogRegisterBufData(0, (char *) nowunused,
-							nunused * sizeof(OffsetNumber));
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEAN);
-
-	return recptr;
-}
-
 /*
  * Perform XLogInsert for a heap-freeze operation.  Caller must have already
  * modified the buffer and marked it dirty.
@@ -8500,34 +8418,15 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed,
 }
 
 /*
- * Handles CLEANUP_INFO
- */
-static void
-heap_xlog_cleanup_info(XLogReaderState *record)
-{
-	xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
-
-	if (InHotStandby)
-		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
-
-	/*
-	 * Actual operation is a no-op. Record type exists to provide a means for
-	 * conflict processing to occur before we begin index vacuum actions. see
-	 * vacuumlazy.c and also comments in btvacuumpage()
-	 */
-
-	/* Backup blocks are not used in cleanup_info records */
-	Assert(!XLogRecHasAnyBlockRefs(record));
-}
-
-/*
- * Handles XLOG_HEAP2_CLEAN record type
+ * Handles XLOG_HEAP2_PRUNE record type.
+ *
+ * Acquires a super-exclusive lock.
  */
 static void
-heap_xlog_clean(XLogReaderState *record)
+heap_xlog_prune(XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
-	xl_heap_clean *xlrec = (xl_heap_clean *) XLogRecGetData(record);
+	xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record);
 	Buffer		buffer;
 	RelFileNode rnode;
 	BlockNumber blkno;
@@ -8538,12 +8437,8 @@ heap_xlog_clean(XLogReaderState *record)
 	/*
 	 * We're about to remove tuples. In Hot Standby mode, ensure that there's
 	 * no queries running for which the removed tuples are still visible.
-	 *
-	 * Not all HEAP2_CLEAN records remove tuples with xids, so we only want to
-	 * conflict on the records that cause MVCC failures for user queries. If
-	 * latestRemovedXid is invalid, skip conflict processing.
 	 */
-	if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
+	if (InHotStandby)
 		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
 
 	/*
@@ -8596,7 +8491,7 @@ heap_xlog_clean(XLogReaderState *record)
 		UnlockReleaseBuffer(buffer);
 
 		/*
-		 * After cleaning records from a page, it's useful to update the FSM
+		 * After pruning records from a page, it's useful to update the FSM
 		 * about it, as it may cause the page become target for insertions
 		 * later even if vacuum decides not to visit it (which is possible if
 		 * gets marked all-visible.)
@@ -8608,6 +8503,80 @@ heap_xlog_clean(XLogReaderState *record)
 	}
 }
 
+/*
+ * Handles XLOG_HEAP2_VACUUM record type.
+ *
+ * Acquires an exclusive lock only.
+ */
+static void
+heap_xlog_vacuum(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_heap_vacuum *xlrec = (xl_heap_vacuum *) XLogRecGetData(record);
+	Buffer		buffer;
+	BlockNumber blkno;
+	XLogRedoAction action;
+
+	/*
+	 * If we have a full-page image, restore it (without using a cleanup lock)
+	 * and we're done.
+	 */
+	action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, false,
+										   &buffer);
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		page = (Page) BufferGetPage(buffer);
+		OffsetNumber *nowunused;
+		Size		datalen;
+		OffsetNumber *offnum;
+
+		nowunused = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
+
+		/* Shouldn't be a record unless there's something to do */
+		Assert(xlrec->nunused > 0);
+
+		/* Update all now-unused line pointers */
+		offnum = nowunused;
+		for (int i = 0; i < xlrec->nunused; i++)
+		{
+			OffsetNumber off = *offnum++;
+			ItemId		lp = PageGetItemId(page, off);
+
+			Assert(ItemIdIsDead(lp));
+			ItemIdSetUnused(lp);
+		}
+
+		/*
+		 * Update the page's hint bit about whether it has free pointers
+		 */
+		PageSetHasFreeLinePointers(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+
+	if (BufferIsValid(buffer))
+	{
+		Size		freespace = PageGetHeapFreeSpace(BufferGetPage(buffer));
+		RelFileNode rnode;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * After vacuuming LP_DEAD items from a page, it's useful to update
+		 * the FSM about it, as that may make the page a target for
+		 * insertions later even if vacuum decides not to visit it (which is
+		 * possible if it gets marked all-visible).
+		 *
+		 * Do this regardless of a full-page image being applied, since the
+		 * FSM data is not in the page anyway.
+		 */
+		XLogRecordPageWithFreeSpace(rnode, blkno, freespace);
+	}
+}
+
 /*
  * Replay XLOG_HEAP2_VISIBLE record.
  *
@@ -9712,15 +9681,15 @@ heap2_redo(XLogReaderState *record)
 
 	switch (info & XLOG_HEAP_OPMASK)
 	{
-		case XLOG_HEAP2_CLEAN:
-			heap_xlog_clean(record);
+		case XLOG_HEAP2_PRUNE:
+			heap_xlog_prune(record);
+			break;
+		case XLOG_HEAP2_VACUUM:
+			heap_xlog_vacuum(record);
 			break;
 		case XLOG_HEAP2_FREEZE_PAGE:
 			heap_xlog_freeze_page(record);
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			heap_xlog_cleanup_info(record);
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			heap_xlog_visible(record);
 			break;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406..f75502ca2c 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -182,13 +182,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		 */
 		if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
 		{
-			TransactionId ignore = InvalidTransactionId;	/* return value not
-															 * needed */
-
 			/* OK to prune */
 			(void) heap_page_prune(relation, buffer, vistest,
 								   limited_xmin, limited_ts,
-								   true, &ignore, NULL);
+								   true, NULL);
 		}
 
 		/* And release buffer lock */
@@ -213,8 +210,6 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * send its own new total to pgstats, and we don't want this delta applied
  * on top of that.)
  *
- * Sets latestRemovedXid for caller on return.
- *
  * off_loc is the offset location required by the caller to use in error
  * callback.
  *
@@ -225,7 +220,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 				GlobalVisState *vistest,
 				TransactionId old_snap_xmin,
 				TimestampTz old_snap_ts,
-				bool report_stats, TransactionId *latestRemovedXid,
+				bool report_stats,
 				OffsetNumber *off_loc)
 {
 	int			ndeleted = 0;
@@ -251,7 +246,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	prstate.old_snap_xmin = old_snap_xmin;
 	prstate.old_snap_ts = old_snap_ts;
 	prstate.old_snap_used = false;
-	prstate.latestRemovedXid = *latestRemovedXid;
+	prstate.latestRemovedXid = InvalidTransactionId;
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));
 
@@ -318,17 +313,41 @@ heap_page_prune(Relation relation, Buffer buffer,
 		MarkBufferDirty(buffer);
 
 		/*
-		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
+		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
 		if (RelationNeedsWAL(relation))
 		{
+			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
 
-			recptr = log_heap_clean(relation, buffer,
-									prstate.redirected, prstate.nredirected,
-									prstate.nowdead, prstate.ndead,
-									prstate.nowunused, prstate.nunused,
-									prstate.latestRemovedXid);
+			xlrec.latestRemovedXid = prstate.latestRemovedXid;
+			xlrec.nredirected = prstate.nredirected;
+			xlrec.ndead = prstate.ndead;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, SizeOfHeapPrune);
+
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+			/*
+			 * The OffsetNumber arrays are not actually in the buffer, but we
+			 * pretend that they are.  When XLogInsert stores the whole
+			 * buffer, the offset arrays need not be stored too.
+			 */
+			if (prstate.nredirected > 0)
+				XLogRegisterBufData(0, (char *) prstate.redirected,
+									prstate.nredirected *
+									sizeof(OffsetNumber) * 2);
+
+			if (prstate.ndead > 0)
+				XLogRegisterBufData(0, (char *) prstate.nowdead,
+									prstate.ndead * sizeof(OffsetNumber));
+
+			if (prstate.nunused > 0)
+				XLogRegisterBufData(0, (char *) prstate.nowunused,
+									prstate.nunused * sizeof(OffsetNumber));
+
+			recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE);
 
 			PageSetLSN(BufferGetPage(buffer), recptr);
 		}
@@ -363,8 +382,6 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (report_stats && ndeleted > prstate.ndead)
 		pgstat_update_heap_dead_tuples(relation, ndeleted - prstate.ndead);
 
-	*latestRemovedXid = prstate.latestRemovedXid;
-
 	/*
 	 * XXX Should we update the FSM information of this page ?
 	 *
@@ -809,12 +826,8 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
 
 /*
  * Perform the actual page changes needed by heap_page_prune.
- * It is expected that the caller has suitable pin and lock on the
- * buffer, and is inside a critical section.
- *
- * This is split out because it is also used by heap_xlog_clean()
- * to replay the WAL record when needed after a crash.  Note that the
- * arguments are identical to those of log_heap_clean().
+ * It is expected that the caller has a super-exclusive lock on the
+ * buffer.
  */
 void
 heap_page_prune_execute(Buffer buffer,
@@ -826,6 +839,9 @@ heap_page_prune_execute(Buffer buffer,
 	OffsetNumber *offnum;
 	int			i;
 
+	/* Shouldn't be called unless there's something to do */
+	Assert(nredirected > 0 || ndead > 0 || nunused > 0);
+
 	/* Update all redirected line pointers */
 	offnum = redirected;
 	for (i = 0; i < nredirected; i++)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9bebb94968..132cfcba16 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -310,7 +310,6 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	LVDeadTuples *dead_tuples;
 	int			num_index_scans;
-	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
 
 	/* Statistics about indexes */
@@ -789,39 +788,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	}
 }
 
-/*
- * For Hot Standby we need to know the highest transaction id that will
- * be removed by any change. VACUUM proceeds in a number of passes so
- * we need to consider how each pass operates. The first phase runs
- * heap_page_prune(), which can issue XLOG_HEAP2_CLEAN records as it
- * progresses - these will have a latestRemovedXid on each record.
- * In some cases this removes all of the tuples to be removed, though
- * often we have dead tuples with index pointers so we must remember them
- * for removal in phase 3. Index records for those rows are removed
- * in phase 2 and index blocks do not have MVCC information attached.
- * So before we can allow removal of any index tuples we need to issue
- * a WAL record containing the latestRemovedXid of rows that will be
- * removed in phase three. This allows recovery queries to block at the
- * correct place, i.e. before phase two, rather than during phase three
- * which would be after the rows have become inaccessible.
- */
-static void
-vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
-{
-	/*
-	 * Skip this for relations for which no WAL is to be written, or if we're
-	 * not trying to support archive recovery.
-	 */
-	if (!RelationNeedsWAL(rel) || !XLogIsNeeded())
-		return;
-
-	/*
-	 * No need to write the record at all unless it contains a valid value
-	 */
-	if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
-		(void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
-}
-
 /*
  * Handle new page during lazy_scan_heap().
  *
@@ -914,28 +880,50 @@ scan_empty_page(Relation onerel, Buffer buf, Buffer vmbuffer,
  *	scan_prune_page() -- lazy_scan_heap() pruning and freezing.
  *
  * Caller must hold pin and buffer cleanup lock on the buffer.
+ *
+ * Prior to PostgreSQL 14 there were very rare cases where lazy_scan_heap()
+ * treated tuples that still had storage after pruning as DEAD.  That happened
+ * when heap_page_prune() could not prune tuples that were nevertheless deemed
+ * DEAD by its own HeapTupleSatisfiesVacuum() call.  This created rare,
+ * hard-to-test cases.  It meant there was no sharp distinction between DEAD
+ * tuples and tuples that are to be kept and be considered for freezing inside
+ * heap_prepare_freeze_tuple().  It also meant that lazy_vacuum_page() had to
+ * be prepared to remove items with storage (tuples with tuple headers) that
+ * didn't get pruned, which created a special case to handle recovery
+ * conflicts.
+ *
+ * The approach we take here now (to eliminate all of this complexity) is to
+ * simply restart pruning in these very rare cases -- cases where a concurrent
+ * abort of an xact makes our HeapTupleSatisfiesVacuum() call disagree with
+ * what heap_page_prune() thought about the tuple only microseconds earlier.
+ *
+ * Since we might have to prune a second time here, the code is structured to
+ * use a local per-page copy of the counters that the caller accumulates.  We
+ * add our per-page counters to the caller's per-VACUUM totals last of all, to
+ * avoid double counting.
  */
 static void
 scan_prune_page(Relation onerel, Buffer buf,
 				LVRelStats *vacrelstats,
 				GlobalVisState *vistest, xl_heap_freeze_tuple *frozen,
 				LVTempCounters *c, LVPrunePageState *ps,
-				LVVisMapPageState *vms,
-				VacOptTernaryValue index_cleanup)
+				LVVisMapPageState *vms)
 {
 	BlockNumber blkno;
 	Page		page;
 	OffsetNumber offnum,
 				maxoff;
+	HTSV_Result tuplestate;
 	int			nfrozen,
 				ndead;
 	LVTempCounters pc;
 	OffsetNumber deaditems[MaxHeapTuplesPerPage];
-	bool		tupgone;
 
 	blkno = BufferGetBlockNumber(buf);
 	page = BufferGetPage(buf);
 
+retry:
+
 	/* Initialize (or reset) page-level counters */
 	pc.num_tuples = 0;
 	pc.live_tuples = 0;
@@ -951,12 +939,14 @@ scan_prune_page(Relation onerel, Buffer buf,
 	 */
 	pc.tups_vacuumed = heap_page_prune(onerel, buf, vistest,
 									   InvalidTransactionId, 0, false,
-									   &vacrelstats->latestRemovedXid,
 									   &vacrelstats->offnum);
 
 	/*
 	 * Now scan the page to collect vacuumable items and check for tuples
 	 * requiring freezing.
+	 *
+	 * Note: If we retry having set vms.visibility_cutoff_xid it doesn't
+	 * matter -- the newest XMIN on page can't be missed this way.
 	 */
 	ps->hastup = false;
 	ps->has_dead_items = false;
@@ -966,7 +956,14 @@ scan_prune_page(Relation onerel, Buffer buf,
 	ndead = 0;
 	maxoff = PageGetMaxOffsetNumber(page);
 
-	tupgone = false;
+#ifdef DEBUG
+
+	/*
+	 * Enable this to debug the retry logic -- it's actually quite hard to hit
+	 * even with this artificial delay
+	 */
+	pg_usleep(10000);
+#endif
 
 	/*
 	 * Note: If you change anything in the loop below, also look at
@@ -978,6 +975,7 @@ scan_prune_page(Relation onerel, Buffer buf,
 	{
 		ItemId		itemid;
 		HeapTupleData tuple;
+		bool		tuple_totally_frozen;
 
 		/*
 		 * Set the offset number so that we can display it along with any
@@ -1026,6 +1024,17 @@ scan_prune_page(Relation onerel, Buffer buf,
 		tuple.t_len = ItemIdGetLength(itemid);
 		tuple.t_tableOid = RelationGetRelid(onerel);
 
+		/*
+		 * DEAD tuples are almost always pruned into LP_DEAD line pointers by
+		 * heap_page_prune(), but it's possible that the tuple state changed
+		 * since heap_page_prune() looked.  Handle that here by restarting.
+		 * (See comments at the top of function for a full explanation.)
+		 */
+		tuplestate = HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf);
+
+		if (unlikely(tuplestate == HEAPTUPLE_DEAD))
+			goto retry;
+
 		/*
 		 * The criteria for counting a tuple as live in this block need to
 		 * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
@@ -1036,42 +1045,8 @@ scan_prune_page(Relation onerel, Buffer buf,
 		 * VACUUM can't run inside a transaction block, which makes some cases
 		 * impossible (e.g. in-progress insert from the same transaction).
 		 */
-		switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+		switch (tuplestate)
 		{
-			case HEAPTUPLE_DEAD:
-
-				/*
-				 * Ordinarily, DEAD tuples would have been removed by
-				 * heap_page_prune(), but it's possible that the tuple state
-				 * changed since heap_page_prune() looked.  In particular an
-				 * INSERT_IN_PROGRESS tuple could have changed to DEAD if the
-				 * inserter aborted.  So this cannot be considered an error
-				 * condition.
-				 *
-				 * If the tuple is HOT-updated then it must only be removed by
-				 * a prune operation; so we keep it just as if it were
-				 * RECENTLY_DEAD.  Also, if it's a heap-only tuple, we choose
-				 * to keep it, because it'll be a lot cheaper to get rid of it
-				 * in the next pruning pass than to treat it like an indexed
-				 * tuple. Finally, if index cleanup is disabled, the second
-				 * heap pass will not execute, and the tuple will not get
-				 * removed, so we must treat it like any other dead tuple that
-				 * we choose to keep.
-				 *
-				 * If this were to happen for a tuple that actually needed to
-				 * be deleted, we'd be in trouble, because it'd possibly leave
-				 * a tuple below the relation's xmin horizon alive.
-				 * heap_prepare_freeze_tuple() is prepared to detect that case
-				 * and abort the transaction, preventing corruption.
-				 */
-				if (HeapTupleIsHotUpdated(&tuple) ||
-					HeapTupleIsHeapOnly(&tuple) ||
-					index_cleanup == VACOPT_TERNARY_DISABLED)
-					pc.nkeep += 1;
-				else
-					tupgone = true; /* we can delete the tuple */
-				ps->all_visible = false;
-				break;
 			case HEAPTUPLE_LIVE:
 
 				/*
@@ -1152,35 +1127,22 @@ scan_prune_page(Relation onerel, Buffer buf,
 				break;
 		}
 
-		if (tupgone)
-		{
-			deaditems[ndead++] = offnum;
-			HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-												   &vacrelstats->latestRemovedXid);
-			pc.tups_vacuumed += 1;
-			ps->has_dead_items = true;
-		}
-		else
-		{
-			bool		tuple_totally_frozen;
-
-			/*
-			 * Each non-removable tuple must be checked to see if it needs
-			 * freezing
-			 */
-			if (heap_prepare_freeze_tuple(tuple.t_data,
-										  RelFrozenXid, RelMinMxid,
-										  FreezeLimit, MultiXactCutoff,
-										  &frozen[nfrozen],
-										  &tuple_totally_frozen))
-				frozen[nfrozen++].offset = offnum;
-
-			pc.num_tuples += 1;
-			ps->hastup = true;
-
-			if (!tuple_totally_frozen)
-				ps->all_frozen = false;
-		}
+		/*
+		 * Each non-removable tuple must be checked to see if it needs
+		 * freezing
+		 */
+		if (heap_prepare_freeze_tuple(tuple.t_data,
+									  RelFrozenXid, RelMinMxid,
+									  FreezeLimit, MultiXactCutoff,
+									  &frozen[nfrozen],
+									  &tuple_totally_frozen))
+			frozen[nfrozen++].offset = offnum;
+
+		pc.num_tuples += 1;
+		ps->hastup = true;
+
+		if (!tuple_totally_frozen)
+			ps->all_frozen = false;
 	}
 
 	/*
@@ -1813,7 +1775,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * tuple headers left behind following pruning.
 		 */
 		scan_prune_page(onerel, buf, vacrelstats, vistest, frozen,
-						&c, &ps, &vms, params->index_cleanup);
+						&c, &ps, &vms);
 
 		/*
 		 * Step 7 for block: Set up details for saving free space in FSM at
@@ -2079,6 +2041,11 @@ lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
  *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
  *
  * Should only be called through lazy_vacuum_pruned_items().
+ *
+ * We don't need a latestRemovedXid value for recovery conflicts here -- we
+ * rely on conflicts from heap pruning instead (i.e. a heap_page_prune() call
+ * that took place earlier, usually though not always during the ongoing
+ * VACUUM operation).
  */
 static void
 lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
@@ -2088,9 +2055,6 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 	Assert(!IsParallelWorker());
 	Assert(nindexes > 0);
 
-	/* Log cleanup info before we touch indexes */
-	vacuum_log_cleanup_info(onerel, vacrelstats);
-
 	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
@@ -2135,6 +2099,11 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
  *		lazy_scan_heap are not visited at all.
  *
  * Should only be called through lazy_vacuum_pruned_items().
+ *
+ * We don't need a latestRemovedXid value for recovery conflicts here -- we
+ * rely on conflicts from heap pruning instead (i.e. a heap_page_prune() call
+ * that took place earlier, usually though not always during the ongoing
+ * VACUUM operation).
  */
 static void
 lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
@@ -2170,12 +2139,7 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 		vacrelstats->blkno = tblk;
 		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
 								 vac_strategy);
-		if (!ConditionalLockBufferForCleanup(buf))
-		{
-			ReleaseBuffer(buf);
-			++tupindex;
-			continue;
-		}
+		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 		tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
 									&vmbuffer);
 
@@ -2208,14 +2172,25 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 }
 
 /*
- *	lazy_vacuum_page() -- free dead tuples on a page
- *					 and repair its fragmentation.
+ *	lazy_vacuum_page() -- free page's LP_DEAD items listed in the
+ *					 vacrelstats->dead_tuples array.
  *
- * Caller must hold pin and buffer cleanup lock on the buffer.
+ * Caller must have an exclusive buffer lock on the buffer (though a
+ * super-exclusive lock is also acceptable).
  *
  * tupindex is the index in vacrelstats->dead_tuples of the first dead
  * tuple for this page.  We assume the rest follow sequentially.
  * The return value is the first tupindex after the tuples of this page.
+ *
+ * Prior to PostgreSQL 14 there were rare cases where this routine had to set
+ * tuples with storage to unused.  These days it is strictly responsible for
+ * marking LP_DEAD stub line pointers as unused.  This only happens for those
+ * LP_DEAD items on the page that were determined to be LP_DEAD items back
+ * when the same heap page was visited by scan_prune_page() (i.e. those whose
+ * TID was recorded in the dead_tuples array).
+ *
+ * We cannot defragment the page here because that isn't safe while only
+ * holding an exclusive lock.
  */
 static int
 lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
@@ -2248,11 +2223,15 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 			break;				/* past end of tuples for this block */
 		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
 		itemid = PageGetItemId(page, toff);
+
+		Assert(ItemIdIsDead(itemid));
 		ItemIdSetUnused(itemid);
 		unused[uncnt++] = toff;
 	}
 
-	PageRepairFragmentation(page);
+	Assert(uncnt > 0);
+
+	PageSetHasFreeLinePointers(page);
 
 	/*
 	 * Mark buffer dirty before we write WAL.
@@ -2262,12 +2241,19 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	/* XLOG stuff */
 	if (RelationNeedsWAL(onerel))
 	{
+		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
 
-		recptr = log_heap_clean(onerel, buffer,
-								NULL, 0, NULL, 0,
-								unused, uncnt,
-								vacrelstats->latestRemovedXid);
+		xlrec.nunused = uncnt;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHeapVacuum);
+
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) unused, uncnt * sizeof(OffsetNumber));
+
+		recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VACUUM);
+
 		PageSetLSN(page, recptr);
 	}
 
@@ -2280,10 +2266,10 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	END_CRIT_SECTION();
 
 	/*
-	 * Now that we have removed the dead tuples from the page, once again
+	 * Now that we have removed the LP_DEAD items from the page, once again
 	 * check if the page has become all-visible.  The page is already marked
 	 * dirty, exclusively locked, and, if needed, a full page image has been
-	 * emitted in the log_heap_clean() above.
+	 * emitted.
 	 */
 	if (heap_page_is_all_visible(onerel, buffer, vacrelstats,
 								 &visibility_cutoff_xid,
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 9282c9ea22..1360ab80c1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1213,10 +1213,10 @@ backtrack:
 				 * as long as the callback function only considers whether the
 				 * index tuple refers to pre-cutoff heap tuples that were
 				 * certainly already pruned away during VACUUM's initial heap
-				 * scan by the time we get here. (heapam's XLOG_HEAP2_CLEAN
-				 * and XLOG_HEAP2_CLEANUP_INFO records produce conflicts using
-				 * a latestRemovedXid value for the pointed-to heap tuples, so
-				 * there is no need to produce our own conflict now.)
+				 * scan by the time we get here. (heapam's XLOG_HEAP2_PRUNE
+				 * records produce conflicts using a latestRemovedXid value
+				 * for the pointed-to heap tuples, so there is no need to
+				 * produce our own conflict now.)
 				 *
 				 * Backends with snapshots acquired after a VACUUM starts but
 				 * before it finishes could have visibility cutoff with a
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index e60e32b935..f8b4fb901b 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -121,11 +121,21 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
 	info &= XLOG_HEAP_OPMASK;
-	if (info == XLOG_HEAP2_CLEAN)
+	if (info == XLOG_HEAP2_PRUNE)
 	{
-		xl_heap_clean *xlrec = (xl_heap_clean *) rec;
+		xl_heap_prune *xlrec = (xl_heap_prune *) rec;
 
-		appendStringInfo(buf, "latestRemovedXid %u", xlrec->latestRemovedXid);
+		/* XXX Should display implicit 'nunused' field, too */
+		appendStringInfo(buf, "latestRemovedXid %u nredirected %u ndead %u",
+						 xlrec->latestRemovedXid,
+						 xlrec->nredirected,
+						 xlrec->ndead);
+	}
+	else if (info == XLOG_HEAP2_VACUUM)
+	{
+		xl_heap_vacuum *xlrec = (xl_heap_vacuum *) rec;
+
+		appendStringInfo(buf, "nunused %u", xlrec->nunused);
 	}
 	else if (info == XLOG_HEAP2_FREEZE_PAGE)
 	{
@@ -134,12 +144,6 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "cutoff xid %u ntuples %u",
 						 xlrec->cutoff_xid, xlrec->ntuples);
 	}
-	else if (info == XLOG_HEAP2_CLEANUP_INFO)
-	{
-		xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
-
-		appendStringInfo(buf, "latestRemovedXid %u", xlrec->latestRemovedXid);
-	}
 	else if (info == XLOG_HEAP2_VISIBLE)
 	{
 		xl_heap_visible *xlrec = (xl_heap_visible *) rec;
@@ -229,15 +233,15 @@ heap2_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_HEAP2_CLEAN:
-			id = "CLEAN";
+		case XLOG_HEAP2_PRUNE:
+			id = "PRUNE";
+			break;
+		case XLOG_HEAP2_VACUUM:
+			id = "VACUUM";
 			break;
 		case XLOG_HEAP2_FREEZE_PAGE:
 			id = "FREEZE_PAGE";
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			id = "CLEANUP_INFO";
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			id = "VISIBLE";
 			break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f596135b1..391caf7396 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -480,8 +480,8 @@ DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 * interested in.
 			 */
 		case XLOG_HEAP2_FREEZE_PAGE:
-		case XLOG_HEAP2_CLEAN:
-		case XLOG_HEAP2_CLEANUP_INFO:
+		case XLOG_HEAP2_PRUNE:
+		case XLOG_HEAP2_VACUUM:
 		case XLOG_HEAP2_VISIBLE:
 		case XLOG_HEAP2_LOCK_UPDATED:
 			break;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 9ac556b4ae..0c4c07503a 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -250,14 +250,18 @@ PageAddItemExtended(Page page,
 		/* if no free slot, we'll put it at limit (1st open slot) */
 		if (PageHasFreeLinePointers(phdr))
 		{
-			/*
-			 * Look for "recyclable" (unused) ItemId.  We check for no storage
-			 * as well, just to be paranoid --- unused items should never have
-			 * storage.
-			 */
+			/* Look for "recyclable" (unused) ItemId */
 			for (offsetNumber = 1; offsetNumber < limit; offsetNumber++)
 			{
 				itemId = PageGetItemId(phdr, offsetNumber);
+
+				/*
+				 * We check for no storage as well, just to be paranoid;
+				 * unused items should never have storage.  Assert() that the
+				 * invariant is respected too.
+				 */
+				Assert(ItemIdIsUsed(itemId) || !ItemIdHasStorage(itemId));
+
 				if (!ItemIdIsUsed(itemId) && !ItemIdHasStorage(itemId))
 					break;
 			}
@@ -676,7 +680,9 @@ compactify_tuples(itemIdCompact itemidbase, int nitems, Page page, bool presorte
  *
  * This routine is usable for heap pages only, but see PageIndexMultiDelete.
  *
- * As a side effect, the page's PD_HAS_FREE_LINES hint bit is updated.
+ * Caller had better have a super-exclusive lock on page's buffer.  As a side
+ * effect, the page's PD_HAS_FREE_LINES hint bit is updated in cases where our
+ * caller (the heap prune code) sets one or more line pointers unused.
  */
 void
 PageRepairFragmentation(Page page)
@@ -771,7 +777,7 @@ PageRepairFragmentation(Page page)
 		compactify_tuples(itemidbase, nstorage, page, presorted);
 	}
 
-	/* Set hint bit for PageAddItem */
+	/* Set hint bit for PageAddItemExtended */
 	if (nunused > 0)
 		PageSetHasFreeLinePointers(page);
 	else
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bc0936bc2d..0bef090420 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -180,7 +180,7 @@ extern int	heap_page_prune(Relation relation, Buffer buffer,
 							struct GlobalVisState *vistest,
 							TransactionId old_snap_xmin,
 							TimestampTz old_snap_ts_ts,
-							bool report_stats, TransactionId *latestRemovedXid,
+							bool report_stats,
 							OffsetNumber *off_loc);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 178d49710a..e6055d1ecd 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -51,9 +51,9 @@
  * these, too.
  */
 #define XLOG_HEAP2_REWRITE		0x00
-#define XLOG_HEAP2_CLEAN		0x10
-#define XLOG_HEAP2_FREEZE_PAGE	0x20
-#define XLOG_HEAP2_CLEANUP_INFO 0x30
+#define XLOG_HEAP2_PRUNE		0x10
+#define XLOG_HEAP2_VACUUM		0x20
+#define XLOG_HEAP2_FREEZE_PAGE	0x30
 #define XLOG_HEAP2_VISIBLE		0x40
 #define XLOG_HEAP2_MULTI_INSERT 0x50
 #define XLOG_HEAP2_LOCK_UPDATED 0x60
@@ -227,7 +227,8 @@ typedef struct xl_heap_update
 #define SizeOfHeapUpdate	(offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
 
 /*
- * This is what we need to know about vacuum page cleanup/redirect
+ * This is what we need to know about page pruning (both during VACUUM and
+ * during opportunistic pruning)
  *
  * The array of OffsetNumbers following the fixed part of the record contains:
  *	* for each redirected item: the item offset, then the offset redirected to
@@ -236,29 +237,32 @@ typedef struct xl_heap_update
  * The total number of OffsetNumbers is therefore 2*nredirected+ndead+nunused.
  * Note that nunused is not explicitly stored, but may be found by reference
  * to the total record length.
+ *
+ * Requires a super-exclusive lock.
  */
-typedef struct xl_heap_clean
+typedef struct xl_heap_prune
 {
 	TransactionId latestRemovedXid;
 	uint16		nredirected;
 	uint16		ndead;
 	/* OFFSET NUMBERS are in the block reference 0 */
-} xl_heap_clean;
+} xl_heap_prune;
 
-#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+#define SizeOfHeapPrune (offsetof(xl_heap_prune, ndead) + sizeof(uint16))
 
 /*
- * Cleanup_info is required in some cases during a lazy VACUUM.
- * Used for reporting the results of HeapTupleHeaderAdvanceLatestRemovedXid()
- * see vacuumlazy.c for full explanation
+ * The vacuum page record is similar to the prune record, but can only mark
+ * already dead items as unused.
+ *
+ * Used by heap vacuuming only.  Does not require a super-exclusive lock.
  */
-typedef struct xl_heap_cleanup_info
+typedef struct xl_heap_vacuum
 {
-	RelFileNode node;
-	TransactionId latestRemovedXid;
-} xl_heap_cleanup_info;
+	uint16		nunused;
+	/* OFFSET NUMBERS are in the block reference 0 */
+} xl_heap_vacuum;
 
-#define SizeOfHeapCleanupInfo (sizeof(xl_heap_cleanup_info))
+#define SizeOfHeapVacuum (offsetof(xl_heap_vacuum, nunused) + sizeof(uint16))
 
 /* flags for infobits_set */
 #define XLHL_XMAX_IS_MULTI		0x01
@@ -397,13 +401,6 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
 extern void heap_xlog_logical_rewrite(XLogReaderState *r);
 
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
-										TransactionId latestRemovedXid);
-extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
-								 OffsetNumber *redirected, int nredirected,
-								 OffsetNumber *nowdead, int ndead,
-								 OffsetNumber *nowunused, int nunused,
-								 TransactionId latestRemovedXid);
 extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
 								  TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
 								  int ntuples);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1d1d5d2f0e..adf7c42a03 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3555,8 +3555,6 @@ xl_hash_split_complete
 xl_hash_squeeze_page
 xl_hash_update_meta_page
 xl_hash_vacuum_one_page
-xl_heap_clean
-xl_heap_cleanup_info
 xl_heap_confirm
 xl_heap_delete
 xl_heap_freeze_page
@@ -3568,9 +3566,11 @@ xl_heap_lock
 xl_heap_lock_updated
 xl_heap_multi_insert
 xl_heap_new_cid
+xl_heap_prune
 xl_heap_rewrite_mapping
 xl_heap_truncate
 xl_heap_update
+xl_heap_vacuum
 xl_heap_visible
 xl_invalid_page
 xl_invalid_page_key
-- 
2.27.0

v6-0004-Skip-index-vacuum-if-the-table-is-at-risk-XID-wra.patchapplication/octet-stream; name=v6-0004-Skip-index-vacuum-if-the-table-is-at-risk-XID-wra.patchDownload
From e19efb562dcfbd2ca8ebbb71b331da09ea934f3a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 24 Mar 2021 11:27:05 +0900
Subject: [PATCH v6 4/4] Skip index vacuum if the table is at risk XID
 wraparound.

This commit adds new GUC parameters vacuum_skip_index_age and
vacuum_multixact_skip_index_age, which specify the age at which VACUUM
should skip index cleanup so that it can finish quickly and
advance relfrozenxid/relminmxid.

After vacuuming each index (in the non-parallel vacuum case), we check
whether the table's relfrozenxid/relminmxid are too old compared with
these new GUC parameters. If so, we skip further index vacuuming within
the vacuum operation.

This behavior is intended to deal with the risk of XID wraparound, so
the default values are very high: 1.8 billion.

Although users can set these parameters, VACUUM will silently
adjust the effective value to at least 105% of
autovacuum_freeze_max_age/autovacuum_multixact_freeze_max_age, so that
only anti-wraparound autovacuums and aggressive scans have a chance to
skip index vacuuming.
---
 doc/src/sgml/config.sgml                      |  51 ++++++
 doc/src/sgml/maintenance.sgml                 |  10 +-
 src/backend/access/heap/vacuumlazy.c          | 146 +++++++++++++++---
 src/backend/commands/vacuum.c                 |   2 +
 src/backend/utils/misc/guc.c                  |  25 ++-
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/commands/vacuum.h                 |   2 +
 7 files changed, 217 insertions(+), 21 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5679b40dd5..71483d8598 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8545,6 +8545,31 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-skip-index-age" xreflabel="vacuum_skip_index_age">
+      <term><varname>vacuum_skip_index_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_skip_index_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        <command>VACUUM</command> skips index cleanup if the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
+        the age specified by this setting.  A <command>VACUUM</command> that skips
+        index cleanup finishes more quickly, allowing
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        to be advanced as soon as possible.  This behavior is equivalent to setting
+        the <literal>INDEX_CLEANUP</literal> option to <literal>OFF</literal>, except
+        that this parameter skips index cleanup even in the middle of a vacuum operation.
+        The default is 1.8 billion transactions. Although users can set this value
+        anywhere from zero to 2.1 billion, <command>VACUUM</command> will silently
+        adjust the effective value to at least 105% of
+        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that only anti-wraparound
+        autovacuums and aggressive scans have a chance to skip index cleanup.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-multixact-freeze-table-age" xreflabel="vacuum_multixact_freeze_table_age">
       <term><varname>vacuum_multixact_freeze_table_age</varname> (<type>integer</type>)
       <indexterm>
@@ -8591,6 +8616,32 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-multixact-vacuum-skip-index-age" xreflabel="vacuum_multixact_skip_index_age">
+      <term><varname>vacuum_multixact_skip_index_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_multixact_skip_index_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        <command>VACUUM</command> skips index cleanup if the table's
+        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
+        the age specified by this setting.  A <command>VACUUM</command> that skips
+        index cleanup finishes more quickly, allowing
+        <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+        to be advanced as soon as possible.  This behavior is equivalent to setting
+        the <literal>INDEX_CLEANUP</literal> option to <literal>OFF</literal>, except
+        that this parameter skips index cleanup even in the middle of a vacuum operation.
+        The default is 1.8 billion multixacts. Although users can set this value
+        anywhere from zero to 2.1 billion, <command>VACUUM</command> will silently
+        adjust the effective value to at least 105% of
+        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that only
+        anti-wraparound autovacuums and aggressive scans have a chance to skip
+        index cleanup.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 4d8ad754f8..4d3674c1b4 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -607,8 +607,14 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
 
    <para>
     If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
+    system will begin to skip index cleanup so that the vacuum operation
+    finishes more quickly. <xref linkend="guc-vacuum-skip-index-age"/> controls
+    when <command>VACUUM</command> and autovacuum do that.
+   </para>
+
+    <para>
+     The system emits warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
 
 <programlisting>
 WARNING:  database "mydb" must be vacuumed within 39985967 transactions
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index ac250d0fab..0885dc4b08 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -317,6 +317,7 @@ typedef struct LVRelStats
 	LVDeadTuples *dead_tuples;
 	int			num_index_scans;
 	bool		lock_waiter_detected;
+	bool		skip_index_vacuum; /* skip further index vacuuming/cleanup ? */
 
 	/* Statistics about indexes */
 	IndexBulkDeleteResult **indstats;
@@ -398,9 +399,10 @@ static void lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelStats *vacrelstats);
-static void lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
+static bool check_index_vacuum_xid_limit(Relation onerel);
+static bool lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 									LVRelStats *vacrelstats, LVParallelState *lps,
-									int nindexes);
+									int nindexes, VacOptTernaryValue index_cleanup);
 static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 							  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats);
 static void lazy_cleanup_index(Relation indrel,
@@ -558,6 +560,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	vacrelstats->num_index_scans = 0;
 	vacrelstats->pages_removed = 0;
 	vacrelstats->lock_waiter_detected = false;
+	vacrelstats->skip_index_vacuum = false;
 
 	/* Open all indexes of the relation */
 	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
@@ -1964,7 +1967,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * lazy_vacuum_pruned_items() decided to skip index vacuuming, but not
 	 * with INDEX_CLEANUP=OFF.
 	 */
-	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED)
+	if (nindexes > 0 && params->index_cleanup != VACOPT_TERNARY_DISABLED &&
+		!vacrelstats->skip_index_vacuum)
 		lazy_cleanup_all_indexes(Irel, vacrelstats, lps, nindexes);
 
 	/*
@@ -2049,6 +2053,17 @@ lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
 	if (index_cleanup == VACOPT_TERNARY_DISABLED)
 		skipping = true;
 
+	/*
+	 * Skip index vacuuming if the table's relfrozenxid/relminmxid is so old
+	 * that the table is at risk of XID wraparound.  Once we decide to skip
+	 * index vacuuming, we never go back to doing it within this vacuum
+	 * operation.  This saves extra check_index_vacuum_xid_limit() calls and
+	 * is less confusing for users, since we have already ereport'ed that we
+	 * decided not to do index vacuuming.
+	 */
+	else if (vacrelstats->skip_index_vacuum)
+		skipping = true;
+
 	/*
 	 * Don't skip index and heap vacuuming if it's not only called once during
 	 * the entire vacuum operation.
@@ -2083,12 +2098,10 @@ lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
 			skipping = false;
 	}
 
-	if (!skipping)
+	if (!skipping && lazy_vacuum_all_indexes(onerel, Irel, vacrelstats, lps, nindexes,
+											 index_cleanup))
 	{
-		/* Okay, we're going to do index vacuuming */
-		lazy_vacuum_all_indexes(onerel, Irel, vacrelstats, lps, nindexes);
-
-		/* Remove tuples from heap */
+		/* All dead tuples in indexes are removed, so remove tuples from heap as well */
 		lazy_vacuum_heap(onerel, vacrelstats);
 	}
 	else
@@ -2133,14 +2146,29 @@ lazy_vacuum_pruned_items(Relation onerel, LVRelStats *vacrelstats,
  * rely on conflicts from heap pruning instead (i.e. a heap_page_prune() call
  * that took place earlier, usually though not always during the ongoing
  * VACUUM operation).
+ *
+ * Returns true if we vacuumed all indexes.
  */
-static void
+static bool
 lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 						LVRelStats *vacrelstats, LVParallelState *lps,
-						int nindexes)
+						int nindexes, VacOptTernaryValue index_cleanup)
 {
+	int i;
+
 	Assert(!IsParallelWorker());
 	Assert(nindexes > 0);
+	Assert(index_cleanup == VACOPT_TERNARY_ENABLED);
+
+	/* Check if the table is at risk of XID wraparound */
+	if (check_index_vacuum_xid_limit(onerel))
+	{
+		vacrelstats->skip_index_vacuum = true;
+		return false;
+	}
+
+	/* Increase and report the number of index scans */
+	vacrelstats->num_index_scans++;
 
 	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
@@ -2161,21 +2189,103 @@ lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 		lps->lvshared->estimated_count = true;
 
 		lazy_parallel_vacuum_indexes(Irel, vacrelstats, lps, nindexes);
+
+		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
+									 vacrelstats->num_index_scans);
+
+		/*
+		 * In parallel vacuum, since we hand the indexes over to parallel vacuum
+		 * workers, always return true.
+		 */
+		return true;
 	}
-	else
+
+	/*
+	 * Vacuum the indexes one by one.  Since the index_cleanup option is on,
+	 * we check whether the table's relfrozenxid/relminmxid is too old after
+	 * vacuuming each index.  If so, we stop index vacuuming and return false,
+	 * telling the caller not to delete LP_DEAD items.
+	 */
+	for (i = 0; i < nindexes; i++)
 	{
-		int			idx;
+		lazy_vacuum_index(Irel[i], &(vacrelstats->indstats[i]),
+						  vacrelstats->dead_tuples,
+						  vacrelstats->old_live_tuples, vacrelstats);
 
-		for (idx = 0; idx < nindexes; idx++)
-			lazy_vacuum_index(Irel[idx], &(vacrelstats->indstats[idx]),
-							  vacrelstats->dead_tuples,
-							  vacrelstats->old_live_tuples, vacrelstats);
+		if (check_index_vacuum_xid_limit(onerel))
+		{
+			/* Stop index vacuuming */
+			vacrelstats->skip_index_vacuum = true;
+			break;
+		}
 	}
 
-	/* Increase and report the number of index scans */
-	vacrelstats->num_index_scans++;
 	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
 								 vacrelstats->num_index_scans);
+
+	/* Vacuumed all indexes? */
+	return (i >= nindexes);
+}
+
+/*
+ * Return true if the table's relfrozenxid/relminmxid is older than the skip
+ * index vacuum age.
+ */
+static bool
+check_index_vacuum_xid_limit(Relation onerel)
+{
+	TransactionId	xid_skip_limit;
+	MultiXactId		multi_skip_limit;
+	int	skip_index_vacuum;
+	int effective_multixact_freeze_max_age;
+
+	/*
+	 * Determine the index skipping age to use.  In any case not less than
+	 * autovacuum_freeze_max_age * 1.05, so that any VACUUM that ends up
+	 * skipping index vacuuming is already doing an aggressive scan.
+	 */
+	skip_index_vacuum = Max(vacuum_skip_index_age, autovacuum_freeze_max_age * 1.05);
+
+	xid_skip_limit = ReadNextTransactionId() - skip_index_vacuum;
+	if (!TransactionIdIsNormal(xid_skip_limit))
+		xid_skip_limit = FirstNormalTransactionId;
+
+	if (TransactionIdIsNormal(onerel->rd_rel->relfrozenxid) &&
+		TransactionIdPrecedes(onerel->rd_rel->relfrozenxid,
+							  xid_skip_limit))
+	{
+		/* The table's relfrozenxid is too old */
+		return true;
+	}
+
+	/*
+	 * Similar to above, determine the index skipping age to use for multixact.
+	 * In any case not less than autovacuum_multixact_freeze_max_age * 1.05.
+	 */
+	skip_index_vacuum = Max(vacuum_multixact_skip_index_age,
+							 autovacuum_multixact_freeze_max_age * 1.05);
+
+
+	/*
+	 * Compute the multixact age for which freezing is urgent.  This is
+	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
+	 * short of multixact member space.
+	 */
+	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+
+	multi_skip_limit = ReadNextMultiXactId() - skip_index_vacuum;
+	if (multi_skip_limit < FirstMultiXactId)
+		multi_skip_limit = FirstMultiXactId;
+
+	if (MultiXactIdIsValid(onerel->rd_rel->relminmxid) &&
+		MultiXactIdPrecedes(onerel->rd_rel->relminmxid,
+							multi_skip_limit))
+	{
+		/* The table's relminmxid is too old */
+		return true;
+	}
+
+	return false;
 }
 
 /*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c064352e23..f6256a65c8 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -62,6 +62,8 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_skip_index_age;
+int			vacuum_multixact_skip_index_age;
 
 
 /* A few variables that don't seem worth passing around as parameters */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3b36a31a47..7dc7e6f44b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2624,6 +2624,26 @@ static struct config_int ConfigureNamesInt[] =
 		0, 0, 1000000,		/* see ComputeXidHorizons */
 		NULL, NULL, NULL
 	},
+	{
+		{"vacuum_skip_index_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Age at which VACUUM should skip index vacuuming."),
+			NULL
+		},
+		&vacuum_skip_index_age,
+		/* This upper-limit can be 1.05 of autovacuum_freeze_max_age */
+		1800000000, 0, 2100000000,
+		NULL, NULL, NULL
+	},
+	{
+		{"vacuum_multixact_skip_index_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Multixact age at which VACUUM should skip index vacuuming."),
+			NULL
+		},
+		&vacuum_multixact_skip_index_age,
+		/* This upper-limit can be 1.05 of autovacuum_multixact_freeze_max_age */
+		1800000000, 0, 2100000000,
+		NULL, NULL, NULL
+	},
 
 	/*
 	 * See also CheckRequiredParameterValues() if this parameter changes
@@ -3224,7 +3244,10 @@ static struct config_int ConfigureNamesInt[] =
 			NULL
 		},
 		&autovacuum_freeze_max_age,
-		/* see pg_resetwal if you change the upper-limit value */
+		/*
+		 * see pg_resetwal and vacuum_skip_index_age if you change the
+		 * upper-limit value.
+		 */
 		200000000, 100000, 2000000000,
 		NULL, NULL, NULL
 	},
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 86425965d0..30bc4d5f45 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -674,6 +674,8 @@
 #vacuum_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_freeze_table_age = 150000000
+#vacuum_skip_index_age = 1800000000
+#vacuum_multixact_skip_index_age = 1800000000
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d029da5ac0..741437cdaf 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -235,6 +235,8 @@ extern int	vacuum_freeze_min_age;
 extern int	vacuum_freeze_table_age;
 extern int	vacuum_multixact_freeze_min_age;
 extern int	vacuum_multixact_freeze_table_age;
+extern int	vacuum_skip_index_age;
+extern int	vacuum_multixact_skip_index_age;
 
 /* Variables for cost-based parallel vacuum */
 extern pg_atomic_uint32 *VacuumSharedCostBalance;
-- 
2.27.0

In reply to: Masahiko Sawada (#87)
4 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Mar 24, 2021 at 6:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached the updated patch set (nothing changed in 0001 and 0002 patch).

Attached is v7, which takes the last two patches from your v6 and
rebases them on top of my recent work. This includes handling index
cleanup consistently in the event of an emergency. We want to do index
cleanup when the optimization case works out. It would be arbitrary to
not do index cleanup because there was 1 dead tuple instead of 0. Plus
index cleanup is actually useful in some index AMs. At the same time,
it seems like a bad idea to do cleanup in an emergency case. Note that
this includes the case where the new wraparound mechanism kicks in, as
well as the case where INDEX_CLEANUP = off. In general INDEX_CLEANUP =
off should be 100% equivalent to the emergency mechanism, except that
the decision is made dynamically instead of statically.

The two patches that you have been working on are combined into one
patch in v7 -- the last patch in the series. Maintaining two separate
patches there doesn't seem that useful.

The main change I've made in v7 is structural. There is a new patch in
the series, which is now the first. It adds more variables to the
top-level state variable used by VACUUM. We shouldn't have to pass the
same "onerel" variable and other similar variables to so many similar
functions. Plus we shouldn't rely on global state so much. That makes
the code a lot easier to understand. Another change that appears in
the first patch concerns parallel VACUUM, and how it is structured. It
is hard to know which functions concern parallel VACUUM and which are
broader than that right now. That makes the code seriously hard to follow
at times. So I have consolidated those functions, and given them less
generic, more descriptive names. (In general it should be possible to
read most of the code in vacuumlazy.c without thinking about parallel
VACUUM in particular.)

I had many problems with existing function arguments that look like this:

IndexBulkDeleteResult **stats // this is a pointer to a pointer to an
IndexBulkDeleteResult.

Sometimes this exact spelling indicates: 1. "This is one particular
index's stats -- this function will have the index AM set the
statistics during ambulkdelete() and/or amvacuumcleanup()".

But at other times/with other function arguments, it indicates: 2.
"Array of stats, once for each of the heap relation's indexes".

I found the fact that both 1 and 2 appear together side by side very
confusing. It is much clearer with 0001-*, though. It establishes a
consistent singular vs plural variable naming convention. It also no
longer uses IndexBulkDeleteResult ** args for case 1 -- even the C
type system ambiguity is avoided. Any thoughts on my approach to this?
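
To make that convention concrete, here is a rough sketch (not the actual
patch code -- the struct and function names are simplified stand-ins) of
how the singular and plural cases are kept apart:

/* Minimal sketch; assumes the usual PostgreSQL headers are available */
#include "postgres.h"
#include "access/genam.h"		/* IndexBulkDeleteResult */
#include "utils/relcache.h"		/* Relation */

typedef struct LVSketchState
{
	int			nindexes;	/* number of indexes on the heap rel */
	Relation   *indrels;	/* plural: one entry per index */
	IndexBulkDeleteResult **indstats;	/* plural: one stats pointer per index */
} LVSketchState;

/* Case 1 (singular): one particular index's stats, set by the index AM */
static IndexBulkDeleteResult *
sketch_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat)
{
	/* ambulkdelete()/amvacuumcleanup() would allocate or update istat here */
	return istat;
}

/* Case 2 (plural): walk the per-relation array of per-index stats */
static void
sketch_vacuum_all_indexes(LVSketchState *vacrel)
{
	for (int idx = 0; idx < vacrel->nindexes; idx++)
		vacrel->indstats[idx] =
			sketch_vacuum_one_index(vacrel->indrels[idx],
									vacrel->indstats[idx]);
}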

Another change in v7: We now stop applying any cost-based delay that
may be in force if and when we abandon index vacuuming to finish off
the VACUUM operation. Robert thought that that was important, and I
agree. I think that it's 100% justified, because this is a true
emergency. When the emergency mechanism (though not INDEX_CLEANUP=off)
actually kicks in, we now also have a scary WARNING. Since this
behavior only occurs when the system is definitely at very real risk
of becoming unavailable, we can justify practically any intervention
that makes it less likely that the system will become 100% unavailable
(except for anything that creates additional risk of data loss).

BTW, spotted this compiler warning in v6:

/code/postgresql/patch/build/../source/src/backend/access/heap/vacuumlazy.c:
In function ‘check_index_vacuum_xid_limit’:
/code/postgresql/patch/build/../source/src/backend/access/heap/vacuumlazy.c:2314:6:
warning: variable ‘effective_multixact_freeze_max_age’ set but not
used [-Wunused-but-set-variable]
2314 | int effective_multixact_freeze_max_age;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I think that it's just a leftover chunk of code. The variable in
question ('effective_multixact_freeze_max_age') does not appear in v7,
in any case. BTW I moved this function into vacuum.c, next to
vacuum_set_xid_limits() -- that seemed like a better place for it. But
please check this yourself.

Regarding "auto" option, I think it would be a good start to enable
the index vacuum skipping behavior by default instead of adding “auto”
mode. That is, we could skip index vacuuming if INDEX_CLEANUP ON. With
0003 and 0004 patch, there are two cases where we skip index
vacuuming: the garbage on heap is very concentrated and the table is
at risk of XID wraparound. It seems to make sense to have both
behaviors by default.

I agree. Adding a new "auto" option now seems to me to be unnecessary
complexity. Besides, switching a boolean reloption to a bool-like enum
reloption may have subtle problems.

If we want to have a way to force index
vacuuming, we can add a “force” option instead of adding an “auto” option
and having the “on” mode force index vacuuming.

It's hard to imagine anybody using the "force option". Let's not have
one. Let's not change the fact that "INDEX_CLEANUP = on" means
"default index vacuuming behavior". Let's just change the index
vacuuming behavior. If we get the details wrong, a simple reloption
will make little difference. We're already being fairly conservative
in terms of the skipping behavior, including with the
SKIP_VACUUM_PAGES_RATIO.

Also, regarding the new GUC parameters vacuum_skip_index_age and
vacuum_multixact_skip_index_age, those are not autovacuum-dedicated
parameters. The VACUUM command also uses those parameters to skip index
vacuuming dynamically. In such an emergency case, it seems appropriate
to me to skip index vacuuming even in the VACUUM command. And I didn't add
any reloption for those two parameters. Since those parameters are
unlikely to be changed from their default values, I don't think we
necessarily need to provide a way for per-table configuration.

+1 for all that. We already have a reloption for this behavior, more
or less -- it's called INDEX_CLEANUP.

The existing autovacuum_freeze_max_age GUC (which is highly related to
your new GUCs) is both an autovacuum GUC, and somehow also not an
autovacuum GUC at the same time. The apparent contradiction only seems
to resolve itself when you consider the perspective of DBAs and the
perspective of Postgres hackers separately.

*Every* VACUUM GUC is an autovacuum GUC when you know for sure that
the relfrozenxid is 1 billion+ XIDs in the past.

+ bool skipping;

Can we flip the boolean? I mean, use a positive form such as
"do_vacuum". It seems to be more readable, especially for the changes
made in the 0003 and 0004 patches.

I agree that it's clearer that way around.

The code is structured this way in v7. Specifically, there are now
both do_index_vacuuming and do_index_cleanup in the per-VACUUM state
struct in patch 0002-*. These are a direct replacement for useindex.
(Though we only start caring about the do_index_cleanup field in the
final patch, 0004-*)
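
For reference, here is a minimal sketch of how the two flags relate (the
struct and function names here are illustrative only, not the exact v7
code):

/* Minimal sketch; only the two flags discussed above are shown */
#include <stdbool.h>

typedef struct LVSketchFlags
{
	bool		do_index_vacuuming; /* delete index entries for dead heap TIDs? */
	bool		do_index_cleanup;	/* still call amvacuumcleanup() at the end? */
} LVSketchFlags;

/* INDEX_CLEANUP = off and the wraparound emergency disable both steps */
static void
sketch_disable_all_index_work(LVSketchFlags *vacrel)
{
	vacrel->do_index_vacuuming = false;
	vacrel->do_index_cleanup = false;
}

/* The few-dead-items optimization skips index vacuuming but keeps cleanup */
static void
sketch_skip_index_vacuuming_only(LVSketchFlags *vacrel)
{
	vacrel->do_index_vacuuming = false;
	/* do_index_cleanup stays true, so amvacuumcleanup() still runs */
}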

Thanks
--
Peter Geoghegan

Attachments:

v7-0004-Skip-index-vacuuming-in-some-cases.patch (application/octet-stream)
From 1e6d198796e5aac482eac1f9b16a5b12f550bfcd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 25 Mar 2021 13:54:32 -0700
Subject: [PATCH v7 4/4] Skip index vacuuming in some cases.

Skip index vacuuming in two cases: The case where there are so few dead
tuples that index vacuuming seems unnecessary, and the case where the
relfrozenxid of the table being vacuumed is dangerously far in the past.

This commit adds new GUC parameters vacuum_skip_index_age and
vacuum_multixact_skip_index_age, which specify the age at which VACUUM
should skip index cleanup so that it can finish quickly and
advance relfrozenxid/relminmxid.

After vacuuming each index (in the non-parallel vacuum case), we check
whether the table's relfrozenxid/relminmxid are too old compared with
these new GUC parameters. If so, we skip further index vacuuming within
the vacuum operation.

This behavior is intended to deal with the risk of XID wraparound, so
the default values are very high: 1.8 billion.

Although users can set these parameters, VACUUM will silently
adjust the effective value to at least 105% of
autovacuum_freeze_max_age/autovacuum_multixact_freeze_max_age, so that
only anti-wraparound autovacuums and aggressive scans have a chance to
skip index vacuuming.
---
 src/include/commands/vacuum.h                 |   4 +
 src/backend/access/heap/vacuumlazy.c          | 244 ++++++++++++++++--
 src/backend/commands/vacuum.c                 |  61 +++++
 src/backend/utils/misc/guc.c                  |  25 +-
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 doc/src/sgml/config.sgml                      |  51 ++++
 doc/src/sgml/maintenance.sgml                 |  10 +-
 7 files changed, 377 insertions(+), 20 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d029da5ac0..d3d44d9bac 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -235,6 +235,8 @@ extern int	vacuum_freeze_min_age;
 extern int	vacuum_freeze_table_age;
 extern int	vacuum_multixact_freeze_min_age;
 extern int	vacuum_multixact_freeze_table_age;
+extern int	vacuum_skip_index_age;
+extern int	vacuum_multixact_skip_index_age;
 
 /* Variables for cost-based parallel vacuum */
 extern pg_atomic_uint32 *VacuumSharedCostBalance;
@@ -270,6 +272,8 @@ extern void vacuum_set_xid_limits(Relation rel,
 								  TransactionId *xidFullScanLimit,
 								  MultiXactId *multiXactCutoff,
 								  MultiXactId *mxactFullScanLimit);
+extern bool vacuum_xid_limit_emergency(TransactionId relfrozenxid,
+									   MultiXactId   relminmxid);
 extern void vac_update_datfrozenxid(void);
 extern void vacuum_delay_point(void);
 extern bool vacuum_is_relation_owner(Oid relid, Form_pg_class reltuple,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 74ec751466..eb99238297 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -131,6 +131,12 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+/*
+ * The threshold fraction of heap blocks having at least one LP_DEAD line
+ * pointer at or above which index vacuuming goes ahead.
+ */
+#define SKIP_VACUUM_PAGES_RATIO		0.01
+
 /*
  * DSM keys for parallel vacuum.  Unlike other parallel execution code, since
  * we don't need to worry about DSM keys conflicting with plan_node_id we can
@@ -403,9 +409,11 @@ static void lazy_prune_page_items(LVRelState *vacrel, Buffer buf,
 								  LVTempCounters *scancounts,
 								  LVPagePruneState *pageprunestate,
 								  LVPageVisMapState *pagevmstate);
-static void lazy_vacuum_all_pruned_items(LVRelState *vacrel);
+static void lazy_vacuum_all_pruned_items(LVRelState *vacrel,
+										 BlockNumber has_dead_items_pages,
+										 bool onecall);
 static void lazy_vacuum_heap(LVRelState *vacrel);
-static void lazy_vacuum_all_indexes(LVRelState *vacrel);
+static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
 													IndexBulkDeleteResult *istat,
 													double reltuples,
@@ -860,7 +868,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				next_fsm_block_to_vacuum;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		skipping_blocks;
+	bool		skipping_blocks,
+				have_vacuumed_indexes = false;
 	xl_heap_freeze_tuple *frozen;
 	StringInfoData buf;
 	const int	initprog_index[] = {
@@ -874,7 +883,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 
 	/* Counters of # blocks in onerel: */
 	BlockNumber empty_pages,
-				vacuumed_pages;
+				vacuumed_pages,
+				has_dead_items_pages;
 
 	pg_rusage_init(&ru0);
 
@@ -889,7 +899,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 						vacrel->relnamespace,
 						vacrel->relname)));
 
-	empty_pages = vacuumed_pages = 0;
+	empty_pages = vacuumed_pages = has_dead_items_pages = 0;
 
 	/* Initialize counters */
 	scancounts.num_tuples = 0;
@@ -1126,8 +1136,16 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				vmbuffer = InvalidBuffer;
 			}
 
+			/*
+			 * Definitely won't be skipping index vacuuming due to finding
+			 * very few dead items during this VACUUM operation -- that's only
+			 * something that lazy_vacuum_all_pruned_items() is willing to do
+			 * when it is only called once during the entire VACUUM operation.
+			 */
+			have_vacuumed_indexes = true;
+
 			/* Remove the collected garbage tuples from table and indexes */
-			lazy_vacuum_all_pruned_items(vacrel);
+			lazy_vacuum_all_pruned_items(vacrel, has_dead_items_pages, false);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
@@ -1265,6 +1283,17 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		lazy_prune_page_items(vacrel, buf, vistest, frozen, &scancounts,
 							  &pageprunestate, &pagevmstate);
 
+		/*
+		 * Remember the number of pages having at least one LP_DEAD line
+		 * pointer.  This could be from this VACUUM, a previous VACUUM, or
+		 * even opportunistic pruning.  Note that this is exactly the same
+		 * thing as having items that are stored in dead_tuples space, because
+		 * lazy_prune_page_items() doesn't count anything other than LP_DEAD
+		 * items as dead (as of PostgreSQL 14).
+		 */
+		if (pageprunestate.has_dead_items)
+			has_dead_items_pages++;
+
 		/*
 		 * Step 7 for block: Set up details for saving free space in FSM at
 		 * end of loop.  (Also performs extra single pass strategy steps in
@@ -1287,7 +1316,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 * Note: It's not in fact 100% certain that we really will call
 			 * lazy_vacuum_heap() -- lazy_vacuum_all_pruned_items() might opt
 			 * to skip index vacuuming (and so must skip heap vacuuming).
-			 * This is deemed okay because it only happens in emergencies.
+			 * This is deemed okay because it only happens in emergencies, or
+			 * when there is very little free space anyway.
 			 */
 		}
 		else
@@ -1399,7 +1429,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	/* If any tuples need to be deleted, perform final vacuum cycle */
 	Assert(vacrel->nindexes > 0 || dead_tuples->num_tuples == 0);
 	if (dead_tuples->num_tuples > 0)
-		lazy_vacuum_all_pruned_items(vacrel);
+		lazy_vacuum_all_pruned_items(vacrel, has_dead_items_pages,
+									 !have_vacuumed_indexes);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
@@ -2059,10 +2090,16 @@ retry:
 
 /*
  * Remove the collected garbage tuples from the table and its indexes.
+ *
+ * We may be able to skip index vacuuming (we may even be required to do so by
+ * reloption)
  */
 static void
-lazy_vacuum_all_pruned_items(LVRelState *vacrel)
+lazy_vacuum_all_pruned_items(LVRelState *vacrel,
+							 BlockNumber has_dead_items_pages, bool onecall)
 {
+	bool		applyskipoptimization;
+
 	/* Should not end up here with no indexes */
 	Assert(vacrel->nindexes > 0);
 	Assert(!IsParallelWorker());
@@ -2070,19 +2107,159 @@ lazy_vacuum_all_pruned_items(LVRelState *vacrel)
 	if (!vacrel->do_index_vacuuming)
 	{
 		/*
-		 * Just ignore second or subsequent calls in when INDEX_CLEANUP off
-		 * was specified
+		 * Just ignore second or subsequent calls in emergency cases.  This
+		 * includes when INDEX_CLEANUP off was specified.
 		 */
 		Assert(!vacrel->do_index_cleanup);
 		vacrel->dead_tuples->num_tuples = 0;
 		return;
 	}
 
-	/* Okay, we're going to do index vacuuming */
-	lazy_vacuum_all_indexes(vacrel);
+	/*
+	 * Consider applying the optimization where we skip index vacuuming to
+	 * save work in indexes that is likely to have little upside.  This is
+	 * expected to help in the extreme (though still common) case where
+	 * autovacuum generally only triggers VACUUMs against the table because of
+	 * the need to freeze tuples and/or the need to set visibility map bits.
+	 * The overall effect is that cases where the table is slightly less than
+	 * 100% append-only (where there are some dead tuples, but very few) tend
+	 * to behave almost as if they really were 100% append-only.
+	 *
+	 * Our approach is to skip index vacuuming when there are very few heap
+	 * pages with dead items.  Even then, it must be the first and last call
+	 * here for the VACUUM (we never apply the optimization when we're low on
+	 * space for TIDs).  This threshold allows us to not give too much weight
+	 * to items that are concentrated in relatively few heap pages.  These are
+	 * usually due to correlated non-HOT UPDATEs.
+	 *
+	 * It's important that we avoid putting off a VACUUM that eventually
+	 * dirties index pages more often than would happen if we didn't skip.
+	 * It's also important to avoid allowing relatively many heap pages that
+	 * can never have their visibility map bit set to stay that way
+	 * indefinitely.
+	 *
+	 * In general the criteria that we apply here must not create distinct new
+	 * problems for the logic that schedules autovacuum workers.  For example,
+	 * we cannot allow autovacuum_vacuum_insert_scale_factor-driven autovacuum
+	 * workers to do little or no useful work due to misapplication of this
+	 * optimization.  While the optimization is expressly designed to avoid
+	 * work that has non-zero value to the system, the value of that work
+	 * should be close to zero.  There should be a natural asymmetry between
+	 * the costs and the benefits of skipping.
+	 */
+	applyskipoptimization = false;
+	if (onecall)
+	{
+		BlockNumber threshold;
 
-	/* Remove tuples from heap */
-	lazy_vacuum_heap(vacrel);
+		Assert(vacrel->num_index_scans == 0);
+		Assert(vacrel->do_index_vacuuming);
+		Assert(vacrel->do_index_cleanup);
+
+		threshold = (double) vacrel->rel_pages * SKIP_VACUUM_PAGES_RATIO;
+
+		applyskipoptimization = (has_dead_items_pages < threshold);
+	}
+
+	if (applyskipoptimization)
+	{
+		/*
+		 * We skipped index vacuuming due to the optimization.  Make the log
+		 * report that lazy_vacuum_heap() would've made.
+		 *
+		 * Don't report tups_vacuumed here because it will be zero in the
+		 * common case where there are no newly pruned LP_DEAD items for this
+		 * VACUUM.  This is roughly consistent with lazy_vacuum_heap(), and
+		 * the similar "nindexes == 0" specific ereport() at the end of
+		 * lazy_scan_heap().
+		 */
+		ereport(elevel,
+				(errmsg("\"%s\": opted to not totally remove %d pruned items in %u pages",
+						vacrel->relname, vacrel->dead_tuples->num_tuples,
+						has_dead_items_pages)));
+
+		/*
+		 * Skip index vacuuming, but don't skip index cleanup.
+		 *
+		 * It wouldn't make sense to not do cleanup just because this
+		 * optimization was applied.  (As a general rule, the case where there
+		 * are _almost_ zero dead items when vacuuming a large table should
+		 * not behave very differently from the case where there are precisely
+		 * zero dead items.)
+		 */
+		vacrel->do_index_vacuuming = false;
+	}
+	else if (lazy_vacuum_all_indexes(vacrel))
+	{
+		/*
+		 * We successfully completed a round of index vacuuming.  Do related
+		 * heap vacuuming now.
+		 *
+		 * There will be no calls to vacuum_xid_limit_emergency() to check
+		 * for issues with the age of the table's relfrozenxid unless and
+		 * until there is another call here -- heap vacuuming doesn't do that.
+		 * This should be okay, because the cost of a round of heap vacuuming
+		 * is much more linear.  Also, it has costs that are unaffected by the
+		 * number of indexes total.
+		 */
+		lazy_vacuum_heap(vacrel);
+	}
+	else
+	{
+		/*
+		 * Emergency case:  We attempted index vacuuming, didn't finish
+		 * another round of index vacuuming (or one that reliably deleted
+		 * tuples from all of the table's indexes, at least).  This happens
+		 * when the table's relfrozenxid is too far in the past.
+		 *
+		 * From this point on the VACUUM operation will do no further index
+		 * vacuuming or heap vacuuming.  It will do any remaining pruning that
+		 * is required, plus other heap-related and relation-level maintenance
+		 * tasks.  But that's it.  We also disable a cost delay when a delay
+		 * is in effect.
+		 *
+		 * Note that we deliberately don't vary our behavior based on factors
+		 * like whether or not the ongoing VACUUM is aggressive.  If it's not
+		 * aggressive we probably won't be able to advance relfrozenxid during
+		 * this VACUUM.  If we can't, then an anti-wraparound VACUUM should
+		 * take place immediately after we finish up.  We should be able to
+		 * skip all index vacuuming for the later anti-wraparound VACUUM.
+		 *
+		 * This is very much like the "INDEX_CLEANUP = off" case, except we
+		 * determine that index vacuuming will be skipped dynamically.
+		 * Another difference is that we don't warn the user in the
+		 * INDEX_CLEANUP off case, and we don't presume to stop applying a
+		 * cost delay.
+		 */
+		Assert(vacrel->do_index_vacuuming);
+		Assert(vacrel->do_index_cleanup);
+
+		vacrel->do_index_vacuuming = false;
+		vacrel->do_index_cleanup = false;
+		ereport(WARNING,
+				(errmsg("abandoned index vacuuming of table \"%s.%s.%s\" as a fail safe after %d index scans",
+						get_database_name(MyDatabaseId),
+						vacrel->relname,
+						vacrel->relnamespace,
+						vacrel->num_index_scans),
+				 errdetail("table's relfrozenxid or relminmxid is too far in the past"),
+				 errhint("Consider increasing configuration parameter \"maintenance_work_mem\" or \"autovacuum_work_mem\".\n"
+						 "You might also need to consider other ways for VACUUM to keep up with the allocation of transaction IDs.")));
+
+		/* Stop applying cost limits from this point on */
+		VacuumCostActive = false;
+		VacuumCostBalance = 0;
+		VacuumPageHit = 0;
+		VacuumPageMiss = 0;
+		VacuumPageDirty = 0;
+
+		/*
+		 * TODO:
+		 *
+		 * Call lazy_space_free() and arrange to stop even recording TIDs
+		 * (i.e. make lazy_record_dead_tuple() into a no-op)
+		 */
+	}
 
 	/*
 	 * Forget the now-vacuumed tuples -- just press on
@@ -2099,14 +2276,30 @@ lazy_vacuum_all_pruned_items(LVRelState *vacrel)
  * rely on conflicts from heap pruning instead (i.e. a heap_page_prune() call
  * that took place earlier, usually though not always during the ongoing
  * VACUUM operation).
+ *
+ * Returns true in the common case when all indexes were successfully
+ * vacuumed.  Returns false in rare cases where we determined that the ongoing
+ * VACUUM operation is at risk of taking too long to finish, leading to
+ * wraparound failure.
  */
-static void
+static bool
 lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
+	bool	allindexes = true;
+
 	Assert(vacrel->nindexes > 0);
+	Assert(vacrel->do_index_vacuuming);
+	Assert(vacrel->do_index_cleanup);
 	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
 	Assert(MultiXactIdIsValid(vacrel->relminmxid));
 
+	/* Precheck for XID wraparound emergencies */
+	if (vacuum_xid_limit_emergency(vacrel->relfrozenxid, vacrel->relminmxid))
+	{
+		/* Wraparound emergency -- don't even start an index scan */
+		return false;
+	}
+
 	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
@@ -2121,18 +2314,35 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 			vacrel->indstats[idx] =
 				lazy_vacuum_one_index(indrel, istat, vacrel->old_live_tuples,
 									  vacrel);
+
+			if (vacuum_xid_limit_emergency(vacrel->relfrozenxid,
+										   vacrel->relminmxid))
+			{
+				/* Wraparound emergency -- end current index scan */
+				allindexes = false;
+				break;
+			}
 		}
 	}
 	else
 	{
+		/* Note: parallel VACUUM only gets the precheck */
+		allindexes = true;
+
 		/* Outsource everything to parallel variant */
 		do_parallel_lazy_vacuum_all_indexes(vacrel);
 	}
 
-	/* Increase and report the number of index scans */
+	/*
+	 * Increase and report the number of index scans.  Note that we include
+	 * the case where we started a round of index scanning that we weren't
+	 * able to finish.
+	 */
 	vacrel->num_index_scans++;
 	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
 								 vacrel->num_index_scans);
+
+	return allindexes;
 }
 
 /*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c064352e23..063113cd38 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -62,6 +62,8 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_skip_index_age;
+int			vacuum_multixact_skip_index_age;
 
 
 /* A few variables that don't seem worth passing around as parameters */
@@ -1134,6 +1136,65 @@ vacuum_set_xid_limits(Relation rel,
 	}
 }
 
+/*
+ * vacuum_xid_limit_emergency() -- Handle wraparound emergencies
+ *
+ * Input parameters are the target relation's relfrozenxid and relminmxid.
+ */
+bool
+vacuum_xid_limit_emergency(TransactionId relfrozenxid, MultiXactId relminmxid)
+{
+	TransactionId xid_skip_limit;
+	MultiXactId	  multi_skip_limit;
+	int			  skip_index_vacuum;
+
+	Assert(TransactionIdIsNormal(relfrozenxid));
+	Assert(MultiXactIdIsValid(relminmxid));
+
+	/*
+	 * Determine the index skipping age to use.  In any case not less than
+	 * autovacuum_freeze_max_age * 1.05, so that any VACUUM that ends up
+	 * skipping index vacuuming is already doing an aggressive scan.
+	 */
+	skip_index_vacuum = Max(vacuum_skip_index_age, autovacuum_freeze_max_age * 1.05);
+
+	xid_skip_limit = ReadNextTransactionId() - skip_index_vacuum;
+	if (!TransactionIdIsNormal(xid_skip_limit))
+		xid_skip_limit = FirstNormalTransactionId;
+
+	if (TransactionIdIsNormal(relfrozenxid) &&
+		TransactionIdPrecedes(relfrozenxid, xid_skip_limit))
+	{
+		/* The table's relfrozenxid is too old */
+		return true;
+	}
+
+	/*
+	 * Similar to above, determine the index skipping age to use for multixact.
+	 * In any case not less than autovacuum_multixact_freeze_max_age * 1.05.
+	 */
+	skip_index_vacuum = Max(vacuum_multixact_skip_index_age,
+							autovacuum_multixact_freeze_max_age * 1.05);
+
+	/*
+	 * Compute the multixact limit below which the table's relminmxid is
+	 * considered dangerously old.  Clamp to FirstMultiXactId if the
+	 * subtraction wrapped around.
+	 */
+	multi_skip_limit = ReadNextMultiXactId() - skip_index_vacuum;
+	if (multi_skip_limit < FirstMultiXactId)
+		multi_skip_limit = FirstMultiXactId;
+
+	if (MultiXactIdIsValid(relminmxid) &&
+		MultiXactIdPrecedes(relminmxid, multi_skip_limit))
+	{
+		/* The table's relminmxid is too old */
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * vac_estimate_reltuples() -- estimate the new value for pg_class.reltuples
  *
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0c5dc4d3e8..24fb736a72 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2622,6 +2622,26 @@ static struct config_int ConfigureNamesInt[] =
 		0, 0, 1000000,		/* see ComputeXidHorizons */
 		NULL, NULL, NULL
 	},
+	{
+		{"vacuum_skip_index_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Age at which VACUUM should skip index vacuuming."),
+			NULL
+		},
+		&vacuum_skip_index_age,
+		/* This upper-limit can be 1.05 of autovacuum_freeze_max_age */
+		1800000000, 0, 2100000000,
+		NULL, NULL, NULL
+	},
+	{
+		{"vacuum_multixact_skip_index_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Multixact age at which VACUUM should skip index vacuuming."),
+			NULL
+		},
+		&vacuum_multixact_skip_index_age,
+		/* This upper-limit can be 1.05 of autovacuum_multixact_freeze_max_age */
+		1800000000, 0, 2100000000,
+		NULL, NULL, NULL
+	},
 
 	/*
 	 * See also CheckRequiredParameterValues() if this parameter changes
@@ -3222,7 +3242,10 @@ static struct config_int ConfigureNamesInt[] =
 			NULL
 		},
 		&autovacuum_freeze_max_age,
-		/* see pg_resetwal if you change the upper-limit value */
+		/*
+		 * see pg_resetwal and vacuum_skip_index_age if you change the
+		 * upper-limit value.
+		 */
 		200000000, 100000, 2000000000,
 		NULL, NULL, NULL
 	},
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b234a6bfe6..7d6564e17f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -673,6 +673,8 @@
 #vacuum_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_freeze_table_age = 150000000
+#vacuum_skip_index_age = 1800000000
+#vacuum_multixact_skip_index_age = 1800000000
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ddc6d789d8..9a21e4a402 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8528,6 +8528,31 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-skip-index-age" xreflabel="vacuum_skip_index_age">
+      <term><varname>vacuum_skip_index_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_skip_index_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        <command>VACUUM</command> skips index cleanup if the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
+        the age specified by this setting.  A <command>VACUUM</command> that skips
+        index cleanup finishes more quickly, allowing
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        to be advanced as soon as possible.  This behavior is equivalent to setting
+        the <literal>INDEX_CLEANUP</literal> option to <literal>OFF</literal>, except
+        that this parameter skips index cleanup even in the middle of a vacuum operation.
+        The default is 1.8 billion transactions. Although users can set this value
+        anywhere from zero to 2.1 billion, <command>VACUUM</command> will silently
+        adjust the effective value to at least 105% of
+        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that only anti-wraparound
+        autovacuums and aggressive scans have a chance to skip index cleanup.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-multixact-freeze-table-age" xreflabel="vacuum_multixact_freeze_table_age">
       <term><varname>vacuum_multixact_freeze_table_age</varname> (<type>integer</type>)
       <indexterm>
@@ -8574,6 +8599,32 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-multixact-vacuum-skip-index-age" xreflabel="vacuum_multixact_skip_index_age">
+      <term><varname>vacuum_multixact_skip_index_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_multixact_skip_index_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        <command>VACUUM</command> skips index cleanup if the table's
+        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
+        the age specified by this setting.  A <command>VACUUM</command> that skips
+        index cleanup finishes more quickly, allowing
+        <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+        to be advanced as soon as possible.  This behavior is equivalent to setting
+        the <literal>INDEX_CLEANUP</literal> option to <literal>OFF</literal>, except
+        that this parameter skips index cleanup even in the middle of a vacuum operation.
+        The default is 1.8 billion multixacts. Although users can set this value
+        anywhere from zero to 2.1 billion, <command>VACUUM</command> will silently
+        adjust the effective value to at least 105% of
+        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that only
+        anti-wraparound autovacuums and aggressive scans have a chance to skip
+        index cleanup.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 4d8ad754f8..4d3674c1b4 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -607,8 +607,14 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
 
    <para>
     If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
+    system will begin to skip index cleanup so that the vacuum operation
+    finishes more quickly. <xref linkend="guc-vacuum-skip-index-age"/> controls
+    when <command>VACUUM</command> and autovacuum do that.
+   </para>
+
+    <para>
+     The system emits warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
 
 <programlisting>
 WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-- 
2.27.0

v7-0003-Remove-tupgone-special-case-from-vacuumlazy.c.patch (application/octet-stream)
From 95bdbe132d7b2c50c5c20ac2cc55ba31af1e8db5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 24 Mar 2021 21:32:13 -0700
Subject: [PATCH v7 3/4] Remove tupgone special case from vacuumlazy.c.

Retry the call to heap_page_prune() for the buffer being pruned and
vacuumed in rare cases where there is disagreement between the first
heap_page_prune() call and VACUUM's HeapTupleSatisfiesVacuum() call.
This was possible when a concurrently aborting transaction rendered a
live tuple dead in the tiny window between each check.  As a result,
VACUUM's definition of dead tuples (tuples that are to be deleted from
indexes during VACUUM) is simplified: it is always LP_DEAD stub line
pointers from the first scan of the heap.  Note that in general VACUUM
may not have actually done all the pruning that rendered tuples LP_DEAD.

This has the effect of decoupling index vacuuming (and heap page
vacuuming) from pruning during VACUUM's first heap pass.  The index
vacuum skipping performed by the INDEX_CLEANUP mechanism added by commit
a96c41f introduced one case where index vacuuming could be skipped, but
there are reasons to doubt that its approach was 100% robust.  Whereas
simply retrying pruning (and eliminating the tupgone steps entirely)
makes everything far simpler for heap vacuuming, and so far simpler in
general.

Heap vacuuming can now be thought of as conceptually similar to index
vacuuming and conceptually dissimilar to heap pruning.  Heap pruning now
has sole responsibility for anything involving the logical contents of
the database (e.g., managing transaction status information, recovery
conflicts, considering what to do with chains of tuples caused by
UPDATEs).  Whereas index vacuuming and heap vacuuming are now strictly
concerned with removing garbage tuples from a physical data structure
that backs the logical database.

This work enables INDEX_CLEANUP-style skipping of index vacuuming to be
pushed a lot further -- the decision can now be made dynamically (since
there is no question about leaving behind a dead tuple with storage due
to skipping the second heap pass/heap vacuuming).  An upcoming patch
from Masahiko Sawada will teach VACUUM to skip index vacuuming
dynamically, based on criteria involving the number of dead tuples.  The
only truly essential steps for VACUUM now all take place during the
first heap pass.  These are heap pruning and tuple freezing.  Everything
else is now an optional adjunct, at least in principle.

VACUUM can even change its mind about indexes (it can decide to give up
on deleting tuples from indexes).  There is no fundamental difference
between a VACUUM that decides to skip index vacuuming before it even
began, and a VACUUM that skips index vacuuming having already done a
certain amount of it.

Also remove XLOG_HEAP2_CLEANUP_INFO records.  These are no longer
necessary because we now rely entirely on heap pruning to take care of
recovery conflicts during VACUUM -- there is no longer any need to have
extra recovery conflicts due to the tupgone case allowing tuples that
still have storage (i.e. are not LP_DEAD) nevertheless being considered
dead tuples by VACUUM.  Note that heap vacuuming now uses exactly the
same strategy for recovery conflicts as index vacuuming.  Both
mechanisms now completely rely on heap pruning to generate all the
recovery conflicts that they require.

Also stop acquiring a super-exclusive lock for heap pages when they're
vacuumed during VACUUM's second heap pass.  A regular exclusive lock is
enough.  This is correct because heap page vacuuming is now strictly a
matter of setting the LP_DEAD line pointers to LP_UNUSED.  No other
backend can have a pointer to a tuple located in a pinned buffer that
can be invalidated by a concurrent heap page vacuum operation.  Note
that the page is no longer defragmented during heap page vacuuming,
because that is unsafe without a super-exclusive lock.

Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record
changes.

Credit for the idea of retrying pruning a page to avoid the tupgone case
goes to Andres Freund.
---
 src/include/access/heapam.h              |   2 +-
 src/include/access/heapam_xlog.h         |  41 ++---
 src/backend/access/gist/gistxlog.c       |   8 +-
 src/backend/access/hash/hash_xlog.c      |   8 +-
 src/backend/access/heap/heapam.c         | 205 +++++++++------------
 src/backend/access/heap/pruneheap.c      |  60 +++---
 src/backend/access/heap/vacuumlazy.c     | 224 ++++++++++-------------
 src/backend/access/nbtree/nbtree.c       |   8 +-
 src/backend/access/rmgrdesc/heapdesc.c   |  32 ++--
 src/backend/replication/logical/decode.c |   4 +-
 src/backend/storage/page/bufpage.c       |  20 +-
 src/tools/pgindent/typedefs.list         |   4 +-
 12 files changed, 293 insertions(+), 323 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bc0936bc2d..0bef090420 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -180,7 +180,7 @@ extern int	heap_page_prune(Relation relation, Buffer buffer,
 							struct GlobalVisState *vistest,
 							TransactionId old_snap_xmin,
 							TimestampTz old_snap_ts_ts,
-							bool report_stats, TransactionId *latestRemovedXid,
+							bool report_stats,
 							OffsetNumber *off_loc);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 178d49710a..d5df7c20df 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -51,9 +51,9 @@
  * these, too.
  */
 #define XLOG_HEAP2_REWRITE		0x00
-#define XLOG_HEAP2_CLEAN		0x10
-#define XLOG_HEAP2_FREEZE_PAGE	0x20
-#define XLOG_HEAP2_CLEANUP_INFO 0x30
+#define XLOG_HEAP2_PRUNE		0x10
+#define XLOG_HEAP2_VACUUM		0x20
+#define XLOG_HEAP2_FREEZE_PAGE	0x30
 #define XLOG_HEAP2_VISIBLE		0x40
 #define XLOG_HEAP2_MULTI_INSERT 0x50
 #define XLOG_HEAP2_LOCK_UPDATED 0x60
@@ -227,7 +227,8 @@ typedef struct xl_heap_update
 #define SizeOfHeapUpdate	(offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
 
 /*
- * This is what we need to know about vacuum page cleanup/redirect
+ * This is what we need to know about page pruning (both during VACUUM and
+ * during opportunistic pruning)
  *
  * The array of OffsetNumbers following the fixed part of the record contains:
  *	* for each redirected item: the item offset, then the offset redirected to
@@ -236,29 +237,32 @@ typedef struct xl_heap_update
  * The total number of OffsetNumbers is therefore 2*nredirected+ndead+nunused.
  * Note that nunused is not explicitly stored, but may be found by reference
  * to the total record length.
+ *
+ * Requires a super-exclusive lock.
  */
-typedef struct xl_heap_clean
+typedef struct xl_heap_prune
 {
 	TransactionId latestRemovedXid;
 	uint16		nredirected;
 	uint16		ndead;
 	/* OFFSET NUMBERS are in the block reference 0 */
-} xl_heap_clean;
+} xl_heap_prune;
 
-#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+#define SizeOfHeapPrune (offsetof(xl_heap_prune, ndead) + sizeof(uint16))
 
 /*
- * Cleanup_info is required in some cases during a lazy VACUUM.
- * Used for reporting the results of HeapTupleHeaderAdvanceLatestRemovedXid()
- * see vacuumlazy.c for full explanation
+ * The vacuum page record is similar to the prune record, but can only mark
+ * already dead items as unused
+ *
+ * Used by heap vacuuming only.  Does not require a super-exclusive lock.
  */
-typedef struct xl_heap_cleanup_info
+typedef struct xl_heap_vacuum
 {
-	RelFileNode node;
-	TransactionId latestRemovedXid;
-} xl_heap_cleanup_info;
+	uint16		nunused;
+	/* OFFSET NUMBERS are in the block reference 0 */
+} xl_heap_vacuum;
 
-#define SizeOfHeapCleanupInfo (sizeof(xl_heap_cleanup_info))
+#define SizeOfHeapVacuum (offsetof(xl_heap_vacuum, nunused) + sizeof(uint16))
 
 /* flags for infobits_set */
 #define XLHL_XMAX_IS_MULTI		0x01
@@ -397,13 +401,6 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
 extern void heap_xlog_logical_rewrite(XLogReaderState *r);
 
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
-										TransactionId latestRemovedXid);
-extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
-								 OffsetNumber *redirected, int nredirected,
-								 OffsetNumber *nowdead, int ndead,
-								 OffsetNumber *nowunused, int nunused,
-								 TransactionId latestRemovedXid);
 extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
 								  TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
 								  int ntuples);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1c80eae044..6464cb9281 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -184,10 +184,10 @@ gistRedoDeleteRecord(XLogReaderState *record)
 	 *
 	 * GiST delete records can conflict with standby queries.  You might think
 	 * that vacuum records would conflict as well, but we've handled that
-	 * already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual gist vacuum records on that index.
+	 * already.  XLOG_HEAP2_PRUNE records provide the highest xid cleaned by
+	 * the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual gist vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 02d9e6cdfd..af35a991fc 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -992,10 +992,10 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
 	 * Hash index records that are marked as LP_DEAD and being removed during
 	 * hash index tuple insertion can conflict with standby queries. You might
 	 * think that vacuum records would conflict as well, but we've handled
-	 * that already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual hash index vacuum records on that index.
+	 * that already.  XLOG_HEAP2_PRUNE records provide the highest xid cleaned
+	 * by the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual hash index vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 90711b2fcd..93bd57118e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7528,7 +7528,7 @@ heap_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
 			 * must have considered the original tuple header as part of
 			 * generating its own latestRemovedXid value.
 			 *
-			 * Relying on XLOG_HEAP2_CLEAN records like this is the same
+			 * Relying on XLOG_HEAP2_PRUNE records like this is the same
 			 * strategy that index vacuuming uses in all cases.  Index VACUUM
 			 * WAL records don't even have a latestRemovedXid field of their
 			 * own for this reason.
@@ -7947,88 +7947,6 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
 	return nblocksfavorable;
 }
 
-/*
- * Perform XLogInsert to register a heap cleanup info message. These
- * messages are sent once per VACUUM and are required because
- * of the phasing of removal operations during a lazy VACUUM.
- * see comments for vacuum_log_cleanup_info().
- */
-XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
-{
-	xl_heap_cleanup_info xlrec;
-	XLogRecPtr	recptr;
-
-	xlrec.node = rnode;
-	xlrec.latestRemovedXid = latestRemovedXid;
-
-	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, SizeOfHeapCleanupInfo);
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEANUP_INFO);
-
-	return recptr;
-}
-
-/*
- * Perform XLogInsert for a heap-clean operation.  Caller must already
- * have modified the buffer and marked it dirty.
- *
- * Note: prior to Postgres 8.3, the entries in the nowunused[] array were
- * zero-based tuple indexes.  Now they are one-based like other uses
- * of OffsetNumber.
- *
- * We also include latestRemovedXid, which is the greatest XID present in
- * the removed tuples. That allows recovery processing to cancel or wait
- * for long standby queries that can still see these tuples.
- */
-XLogRecPtr
-log_heap_clean(Relation reln, Buffer buffer,
-			   OffsetNumber *redirected, int nredirected,
-			   OffsetNumber *nowdead, int ndead,
-			   OffsetNumber *nowunused, int nunused,
-			   TransactionId latestRemovedXid)
-{
-	xl_heap_clean xlrec;
-	XLogRecPtr	recptr;
-
-	/* Caller should not call me on a non-WAL-logged relation */
-	Assert(RelationNeedsWAL(reln));
-
-	xlrec.latestRemovedXid = latestRemovedXid;
-	xlrec.nredirected = nredirected;
-	xlrec.ndead = ndead;
-
-	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, SizeOfHeapClean);
-
-	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-
-	/*
-	 * The OffsetNumber arrays are not actually in the buffer, but we pretend
-	 * that they are.  When XLogInsert stores the whole buffer, the offset
-	 * arrays need not be stored too.  Note that even if all three arrays are
-	 * empty, we want to expose the buffer as a candidate for whole-page
-	 * storage, since this record type implies a defragmentation operation
-	 * even if no line pointers changed state.
-	 */
-	if (nredirected > 0)
-		XLogRegisterBufData(0, (char *) redirected,
-							nredirected * sizeof(OffsetNumber) * 2);
-
-	if (ndead > 0)
-		XLogRegisterBufData(0, (char *) nowdead,
-							ndead * sizeof(OffsetNumber));
-
-	if (nunused > 0)
-		XLogRegisterBufData(0, (char *) nowunused,
-							nunused * sizeof(OffsetNumber));
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEAN);
-
-	return recptr;
-}
-
 /*
  * Perform XLogInsert for a heap-freeze operation.  Caller must have already
  * modified the buffer and marked it dirty.
@@ -8500,34 +8418,15 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed,
 }
 
 /*
- * Handles CLEANUP_INFO
+ * Handles XLOG_HEAP2_PRUNE record type.
+ *
+ * Acquires a super-exclusive lock.
  */
 static void
-heap_xlog_cleanup_info(XLogReaderState *record)
-{
-	xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
-
-	if (InHotStandby)
-		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
-
-	/*
-	 * Actual operation is a no-op. Record type exists to provide a means for
-	 * conflict processing to occur before we begin index vacuum actions. see
-	 * vacuumlazy.c and also comments in btvacuumpage()
-	 */
-
-	/* Backup blocks are not used in cleanup_info records */
-	Assert(!XLogRecHasAnyBlockRefs(record));
-}
-
-/*
- * Handles XLOG_HEAP2_CLEAN record type
- */
-static void
-heap_xlog_clean(XLogReaderState *record)
+heap_xlog_prune(XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
-	xl_heap_clean *xlrec = (xl_heap_clean *) XLogRecGetData(record);
+	xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record);
 	Buffer		buffer;
 	RelFileNode rnode;
 	BlockNumber blkno;
@@ -8538,12 +8437,8 @@ heap_xlog_clean(XLogReaderState *record)
 	/*
 	 * We're about to remove tuples. In Hot Standby mode, ensure that there's
 	 * no queries running for which the removed tuples are still visible.
-	 *
-	 * Not all HEAP2_CLEAN records remove tuples with xids, so we only want to
-	 * conflict on the records that cause MVCC failures for user queries. If
-	 * latestRemovedXid is invalid, skip conflict processing.
 	 */
-	if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
+	if (InHotStandby)
 		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
 
 	/*
@@ -8596,7 +8491,7 @@ heap_xlog_clean(XLogReaderState *record)
 		UnlockReleaseBuffer(buffer);
 
 		/*
-		 * After cleaning records from a page, it's useful to update the FSM
+		 * After pruning records from a page, it's useful to update the FSM
 		 * about it, as it may cause the page become target for insertions
 		 * later even if vacuum decides not to visit it (which is possible if
 		 * gets marked all-visible.)
@@ -8608,6 +8503,80 @@ heap_xlog_clean(XLogReaderState *record)
 	}
 }
 
+/*
+ * Handles XLOG_HEAP2_VACUUM record type.
+ *
+ * Acquires an exclusive lock only.
+ */
+static void
+heap_xlog_vacuum(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_heap_vacuum *xlrec = (xl_heap_vacuum *) XLogRecGetData(record);
+	Buffer		buffer;
+	BlockNumber blkno;
+	XLogRedoAction action;
+
+	/*
+	 * If we have a full-page image, restore it	(without using a cleanup lock)
+	 * and we're done.
+	 */
+	action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, false,
+										   &buffer);
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		page = (Page) BufferGetPage(buffer);
+		OffsetNumber *nowunused;
+		Size		datalen;
+		OffsetNumber *offnum;
+
+		nowunused = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
+
+		/* Shouldn't be a record unless there's something to do */
+		Assert(xlrec->nunused > 0);
+
+		/* Update all now-unused line pointers */
+		offnum = nowunused;
+		for (int i = 0; i < xlrec->nunused; i++)
+		{
+			OffsetNumber off = *offnum++;
+			ItemId		lp = PageGetItemId(page, off);
+
+			Assert(ItemIdIsDead(lp));
+			ItemIdSetUnused(lp);
+		}
+
+		/*
+		 * Update the page's hint bit about whether it has free pointers
+		 */
+		PageSetHasFreeLinePointers(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+
+	if (BufferIsValid(buffer))
+	{
+		Size		freespace = PageGetHeapFreeSpace(BufferGetPage(buffer));
+		RelFileNode rnode;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * After vacuuming LP_DEAD items from a page, it's useful to update
+		 * the FSM about it, as that may make the page a target for insertions
+		 * later even if vacuum decides not to visit it (which is possible if
+		 * it gets marked all-visible).
+		 *
+		 * Do this regardless of a full-page image being applied, since the
+		 * FSM data is not in the page anyway.
+		 */
+		XLogRecordPageWithFreeSpace(rnode, blkno, freespace);
+	}
+}
+
 /*
  * Replay XLOG_HEAP2_VISIBLE record.
  *
@@ -9712,15 +9681,15 @@ heap2_redo(XLogReaderState *record)
 
 	switch (info & XLOG_HEAP_OPMASK)
 	{
-		case XLOG_HEAP2_CLEAN:
-			heap_xlog_clean(record);
+		case XLOG_HEAP2_PRUNE:
+			heap_xlog_prune(record);
+			break;
+		case XLOG_HEAP2_VACUUM:
+			heap_xlog_vacuum(record);
 			break;
 		case XLOG_HEAP2_FREEZE_PAGE:
 			heap_xlog_freeze_page(record);
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			heap_xlog_cleanup_info(record);
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			heap_xlog_visible(record);
 			break;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406..f75502ca2c 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -182,13 +182,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		 */
 		if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
 		{
-			TransactionId ignore = InvalidTransactionId;	/* return value not
-															 * needed */
-
 			/* OK to prune */
 			(void) heap_page_prune(relation, buffer, vistest,
 								   limited_xmin, limited_ts,
-								   true, &ignore, NULL);
+								   true, NULL);
 		}
 
 		/* And release buffer lock */
@@ -213,8 +210,6 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * send its own new total to pgstats, and we don't want this delta applied
  * on top of that.)
  *
- * Sets latestRemovedXid for caller on return.
- *
  * off_loc is the offset location required by the caller to use in error
  * callback.
  *
@@ -225,7 +220,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 				GlobalVisState *vistest,
 				TransactionId old_snap_xmin,
 				TimestampTz old_snap_ts,
-				bool report_stats, TransactionId *latestRemovedXid,
+				bool report_stats,
 				OffsetNumber *off_loc)
 {
 	int			ndeleted = 0;
@@ -251,7 +246,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	prstate.old_snap_xmin = old_snap_xmin;
 	prstate.old_snap_ts = old_snap_ts;
 	prstate.old_snap_used = false;
-	prstate.latestRemovedXid = *latestRemovedXid;
+	prstate.latestRemovedXid = InvalidTransactionId;
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));
 
@@ -318,17 +313,41 @@ heap_page_prune(Relation relation, Buffer buffer,
 		MarkBufferDirty(buffer);
 
 		/*
-		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
+		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
 		if (RelationNeedsWAL(relation))
 		{
+			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
 
-			recptr = log_heap_clean(relation, buffer,
-									prstate.redirected, prstate.nredirected,
-									prstate.nowdead, prstate.ndead,
-									prstate.nowunused, prstate.nunused,
-									prstate.latestRemovedXid);
+			xlrec.latestRemovedXid = prstate.latestRemovedXid;
+			xlrec.nredirected = prstate.nredirected;
+			xlrec.ndead = prstate.ndead;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, SizeOfHeapPrune);
+
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+			/*
+			 * The OffsetNumber arrays are not actually in the buffer, but we
+			 * pretend that they are.  When XLogInsert stores the whole
+			 * buffer, the offset arrays need not be stored too.
+			 */
+			if (prstate.nredirected > 0)
+				XLogRegisterBufData(0, (char *) prstate.redirected,
+									prstate.nredirected *
+									sizeof(OffsetNumber) * 2);
+
+			if (prstate.ndead > 0)
+				XLogRegisterBufData(0, (char *) prstate.nowdead,
+									prstate.ndead * sizeof(OffsetNumber));
+
+			if (prstate.nunused > 0)
+				XLogRegisterBufData(0, (char *) prstate.nowunused,
+									prstate.nunused * sizeof(OffsetNumber));
+
+			recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE);
 
 			PageSetLSN(BufferGetPage(buffer), recptr);
 		}
@@ -363,8 +382,6 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (report_stats && ndeleted > prstate.ndead)
 		pgstat_update_heap_dead_tuples(relation, ndeleted - prstate.ndead);
 
-	*latestRemovedXid = prstate.latestRemovedXid;
-
 	/*
 	 * XXX Should we update the FSM information of this page ?
 	 *
@@ -809,12 +826,8 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
 
 /*
  * Perform the actual page changes needed by heap_page_prune.
- * It is expected that the caller has suitable pin and lock on the
- * buffer, and is inside a critical section.
- *
- * This is split out because it is also used by heap_xlog_clean()
- * to replay the WAL record when needed after a crash.  Note that the
- * arguments are identical to those of log_heap_clean().
+ * It is expected that the caller has a super-exclusive lock on the
+ * buffer.
  */
 void
 heap_page_prune_execute(Buffer buffer,
@@ -826,6 +839,9 @@ heap_page_prune_execute(Buffer buffer,
 	OffsetNumber *offnum;
 	int			i;
 
+	/* Shouldn't be called unless there's something to do */
+	Assert(nredirected > 0 || ndead > 0 || nunused > 0);
+
 	/* Update all redirected line pointers */
 	offnum = redirected;
 	for (i = 0; i < nredirected; i++)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 7c1047c745..74ec751466 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -327,7 +327,6 @@ typedef struct LVRelState
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	LVDeadTuples *dead_tuples;
 	int			num_index_scans;
-	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
 
 	/* Statistics output by index AMs */
@@ -403,8 +402,7 @@ static void lazy_prune_page_items(LVRelState *vacrel, Buffer buf,
 								  xl_heap_freeze_tuple *frozen,
 								  LVTempCounters *scancounts,
 								  LVPagePruneState *pageprunestate,
-								  LVPageVisMapState *pagevmstate,
-								  VacOptTernaryValue index_cleanup);
+								  LVPageVisMapState *pagevmstate);
 static void lazy_vacuum_all_pruned_items(LVRelState *vacrel);
 static void lazy_vacuum_heap(LVRelState *vacrel);
 static void lazy_vacuum_all_indexes(LVRelState *vacrel);
@@ -824,40 +822,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	}
 }
 
-/*
- * For Hot Standby we need to know the highest transaction id that will
- * be removed by any change. VACUUM proceeds in a number of passes so
- * we need to consider how each pass operates. The first phase runs
- * heap_page_prune(), which can issue XLOG_HEAP2_CLEAN records as it
- * progresses - these will have a latestRemovedXid on each record.
- * In some cases this removes all of the tuples to be removed, though
- * often we have dead tuples with index pointers so we must remember them
- * for removal in phase 3. Index records for those rows are removed
- * in phase 2 and index blocks do not have MVCC information attached.
- * So before we can allow removal of any index tuples we need to issue
- * a WAL record containing the latestRemovedXid of rows that will be
- * removed in phase three. This allows recovery queries to block at the
- * correct place, i.e. before phase two, rather than during phase three
- * which would be after the rows have become inaccessible.
- */
-static void
-vacuum_log_cleanup_info(LVRelState *vacrel)
-{
-	/*
-	 * Skip this for relations for which no WAL is to be written, or if we're
-	 * not trying to support archive recovery.
-	 */
-	if (!RelationNeedsWAL(vacrel->onerel) || !XLogIsNeeded())
-		return;
-
-	/*
-	 * No need to write the record at all unless it contains a valid value
-	 */
-	if (TransactionIdIsValid(vacrel->latestRemovedXid))
-		(void) log_heap_cleanup_info(vacrel->onerel->rd_node,
-									 vacrel->latestRemovedXid);
-}
-
 /*
  *	lazy_scan_heap() -- scan an open heap relation
  *
@@ -941,7 +905,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	vacrel->scanned_pages = 0;
 	vacrel->tupcount_pages = 0;
 	vacrel->nonempty_pages = 0;
-	vacrel->latestRemovedXid = InvalidTransactionId;
 
 	vistest = GlobalVisTestFor(vacrel->onerel);
 
@@ -1300,8 +1263,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		 * tuple headers left behind following pruning.
 		 */
 		lazy_prune_page_items(vacrel, buf, vistest, frozen, &scancounts,
-							  &pageprunestate, &pagevmstate,
-							  params->index_cleanup);
+							  &pageprunestate, &pagevmstate);
 
 		/*
 		 * Step 7 for block: Set up details for saving free space in FSM at
@@ -1756,29 +1718,51 @@ lazy_scan_setvmbit_page(LVRelState *vacrel, Buffer buf, Buffer vmbuffer,
  *	lazy_prune_page_items() -- lazy_scan_heap() pruning and freezing.
  *
  * Caller must hold pin and buffer cleanup lock on the buffer.
+ *
+ * Prior to PostgreSQL 14 there were very rare cases where lazy_scan_heap()
+ * treated tuples that still had storage after pruning as DEAD.  That happened
+ * when heap_page_prune() could not prune tuples that were nevertheless deemed
+ * DEAD by its own HeapTupleSatisfiesVacuum() call.  This created rare,
+ * hard-to-test cases.  It meant there was no sharp distinction between DEAD
+ * tuples and tuples that are to be kept and considered for freezing inside
+ * heap_prepare_freeze_tuple().  It also meant that lazy_vacuum_page() had to
+ * be prepared to remove items with storage (tuples with tuple headers) that
+ * didn't get pruned, which created a special case to handle recovery
+ * conflicts.
+ *
+ * The approach we take here now (to eliminate all of this complexity) is to
+ * simply restart pruning in these very rare cases -- cases where a concurrent
+ * abort of an xact makes our HeapTupleSatisfiesVacuum() call disagree with
+ * what heap_page_prune() thought about the tuple only microseconds earlier.
+ *
+ * Since we might have to prune a second time here, the code is structured to
+ * use a local per-page copy of the counters that caller accumulates.  We add
+ * our per-page counters to the per-VACUUM totals from caller last of all, to
+ * avoid double counting.
  */
 static void
 lazy_prune_page_items(LVRelState *vacrel, Buffer buf,
 					  GlobalVisState *vistest, xl_heap_freeze_tuple *frozen,
 					  LVTempCounters *scancounts,
 					  LVPagePruneState *pageprunestate,
-					  LVPageVisMapState *pagevmstate,
-					  VacOptTernaryValue index_cleanup)
+					  LVPageVisMapState *pagevmstate)
 {
 	Relation	onerel = vacrel->onerel;
 	BlockNumber blkno;
 	Page		page;
 	OffsetNumber offnum,
 				maxoff;
+	HTSV_Result tuplestate;
 	int			nfrozen,
 				ndead;
 	LVTempCounters pagecounts;
 	OffsetNumber deaditems[MaxHeapTuplesPerPage];
-	bool		tupgone;
 
 	blkno = BufferGetBlockNumber(buf);
 	page = BufferGetPage(buf);
 
+retry:
+
 	/* Initialize (or reset) page-level counters */
 	pagecounts.num_tuples = 0;
 	pagecounts.live_tuples = 0;
@@ -1794,12 +1778,14 @@ lazy_prune_page_items(LVRelState *vacrel, Buffer buf,
 	 */
 	pagecounts.tups_vacuumed = heap_page_prune(onerel, buf, vistest,
 											   InvalidTransactionId, 0, false,
-											   &vacrel->latestRemovedXid,
 											   &vacrel->offnum);
 
 	/*
 	 * Now scan the page to collect vacuumable items and check for tuples
 	 * requiring freezing.
+	 *
+	 * Note: it doesn't matter if we retry after having already set
+	 * pagevmstate.visibility_cutoff_xid -- the newest XMIN on the page
+	 * can't be missed this way.
 	 */
 	pageprunestate->hastup = false;
 	pageprunestate->has_dead_items = false;
@@ -1809,7 +1795,14 @@ lazy_prune_page_items(LVRelState *vacrel, Buffer buf,
 	ndead = 0;
 	maxoff = PageGetMaxOffsetNumber(page);
 
-	tupgone = false;
+#ifdef DEBUG
+
+	/*
+	 * Enable this to debug the retry logic -- it's actually quite hard to hit
+	 * even with this artificial delay
+	 */
+	pg_usleep(10000);
+#endif
 
 	/*
 	 * Note: If you change anything in the loop below, also look at
@@ -1821,6 +1814,7 @@ lazy_prune_page_items(LVRelState *vacrel, Buffer buf,
 	{
 		ItemId		itemid;
 		HeapTupleData tuple;
+		bool		tuple_totally_frozen;
 
 		/*
 		 * Set the offset number so that we can display it along with any
@@ -1869,6 +1863,18 @@ lazy_prune_page_items(LVRelState *vacrel, Buffer buf,
 		tuple.t_len = ItemIdGetLength(itemid);
 		tuple.t_tableOid = RelationGetRelid(onerel);
 
+		/*
+		 * DEAD tuples are almost always pruned into LP_DEAD line pointers by
+		 * heap_page_prune(), but it's possible that the tuple state changed
+		 * since heap_page_prune() looked.  Handle that here by restarting.
+		 * (See comments at the top of function for a full explanation.)
+		 */
+		tuplestate = HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin,
+											  buf);
+
+		if (unlikely(tuplestate == HEAPTUPLE_DEAD))
+			goto retry;
+
 		/*
 		 * The criteria for counting a tuple as live in this block need to
 		 * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
@@ -1879,42 +1885,8 @@ lazy_prune_page_items(LVRelState *vacrel, Buffer buf,
 		 * VACUUM can't run inside a transaction block, which makes some cases
 		 * impossible (e.g. in-progress insert from the same transaction).
 		 */
-		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+		switch (tuplestate)
 		{
-			case HEAPTUPLE_DEAD:
-
-				/*
-				 * Ordinarily, DEAD tuples would have been removed by
-				 * heap_page_prune(), but it's possible that the tuple state
-				 * changed since heap_page_prune() looked.  In particular an
-				 * INSERT_IN_PROGRESS tuple could have changed to DEAD if the
-				 * inserter aborted.  So this cannot be considered an error
-				 * condition.
-				 *
-				 * If the tuple is HOT-updated then it must only be removed by
-				 * a prune operation; so we keep it just as if it were
-				 * RECENTLY_DEAD.  Also, if it's a heap-only tuple, we choose
-				 * to keep it, because it'll be a lot cheaper to get rid of it
-				 * in the next pruning pass than to treat it like an indexed
-				 * tuple. Finally, if index cleanup is disabled, the second
-				 * heap pass will not execute, and the tuple will not get
-				 * removed, so we must treat it like any other dead tuple that
-				 * we choose to keep.
-				 *
-				 * If this were to happen for a tuple that actually needed to
-				 * be deleted, we'd be in trouble, because it'd possibly leave
-				 * a tuple below the relation's xmin horizon alive.
-				 * heap_prepare_freeze_tuple() is prepared to detect that case
-				 * and abort the transaction, preventing corruption.
-				 */
-				if (HeapTupleIsHotUpdated(&tuple) ||
-					HeapTupleIsHeapOnly(&tuple) ||
-					index_cleanup == VACOPT_TERNARY_DISABLED)
-					pagecounts.nkeep += 1;
-				else
-					tupgone = true; /* we can delete the tuple */
-				pageprunestate->all_visible = false;
-				break;
 			case HEAPTUPLE_LIVE:
 
 				/*
@@ -1996,37 +1968,24 @@ lazy_prune_page_items(LVRelState *vacrel, Buffer buf,
 				break;
 		}
 
-		if (tupgone)
-		{
-			deaditems[ndead++] = offnum;
-			HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-												   &vacrel->latestRemovedXid);
-			pagecounts.tups_vacuumed += 1;
-			pageprunestate->has_dead_items = true;
-		}
-		else
-		{
-			bool		tuple_totally_frozen;
+		/*
+		 * Each non-removable tuple must be checked to see if it needs
+		 * freezing
+		 */
+		if (heap_prepare_freeze_tuple(tuple.t_data,
+									  vacrel->relfrozenxid,
+									  vacrel->relminmxid,
+									  vacrel->FreezeLimit,
+									  vacrel->MultiXactCutoff,
+									  &frozen[nfrozen],
+									  &tuple_totally_frozen))
+			frozen[nfrozen++].offset = offnum;
 
-			/*
-			 * Each non-removable tuple must be checked to see if it needs
-			 * freezing
-			 */
-			if (heap_prepare_freeze_tuple(tuple.t_data,
-										  vacrel->relfrozenxid,
-										  vacrel->relminmxid,
-										  vacrel->FreezeLimit,
-										  vacrel->MultiXactCutoff,
-										  &frozen[nfrozen],
-										  &tuple_totally_frozen))
-				frozen[nfrozen++].offset = offnum;
+		pagecounts.num_tuples += 1;
+		pageprunestate->hastup = true;
 
-			pagecounts.num_tuples += 1;
-			pageprunestate->hastup = true;
-
-			if (!tuple_totally_frozen)
-				pageprunestate->all_frozen = false;
-		}
+		if (!tuple_totally_frozen)
+			pageprunestate->all_frozen = false;
 	}
 
 	/*
@@ -2148,9 +2107,6 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
 	Assert(MultiXactIdIsValid(vacrel->relminmxid));
 
-	/* Log cleanup info before we touch indexes */
-	vacuum_log_cleanup_info(vacrel);
-
 	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
@@ -2426,14 +2382,25 @@ lazy_vacuum_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_vacuum_page() -- free dead tuples on a page
- *					 and repair its fragmentation.
+ *	lazy_vacuum_page() -- free page's LP_DEAD items listed in the
+ *					 vacrel->dead_tuples array.
  *
- * Caller must hold pin and buffer cleanup lock on the buffer.
+ * Caller must have an exclusive buffer lock on the buffer (though a
+ * super-exclusive lock is also acceptable).
  *
- * tupindex is the index in vacrelstats->dead_tuples of the first dead
+ * tupindex is the index in vacrel->dead_tuples of the first dead
  * tuple for this page.  We assume the rest follow sequentially.
  * The return value is the first tupindex after the tuples of this page.
+ *
+ * Prior to PostgreSQL 14 there were rare cases where this routine had to set
+ * tuples with storage to unused.  These days it is strictly responsible for
+ * marking LP_DEAD stub line pointers as unused.  This only happens for those
+ * LP_DEAD items on the page that were determined to be LP_DEAD items back
+ * when the same heap page was visited by lazy_prune_page_items() (i.e. those
+ * whose TID was recorded in the dead_tuples array).
+ *
+ * We cannot defragment the page here because that isn't safe while only
+ * holding an exclusive lock.
  */
 static int
 lazy_vacuum_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
@@ -2469,11 +2436,15 @@ lazy_vacuum_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 			break;				/* past end of tuples for this block */
 		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
 		itemid = PageGetItemId(page, toff);
+
+		Assert(ItemIdIsDead(itemid));
 		ItemIdSetUnused(itemid);
 		unused[uncnt++] = toff;
 	}
 
-	PageRepairFragmentation(page);
+	Assert(uncnt > 0);
+
+	PageSetHasFreeLinePointers(page);
 
 	/*
 	 * Mark buffer dirty before we write WAL.
@@ -2483,12 +2454,19 @@ lazy_vacuum_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	/* XLOG stuff */
 	if (RelationNeedsWAL(vacrel->onerel))
 	{
+		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
 
-		recptr = log_heap_clean(vacrel->onerel, buffer,
-								NULL, 0, NULL, 0,
-								unused, uncnt,
-								vacrel->latestRemovedXid);
+		xlrec.nunused = uncnt;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHeapVacuum);
+
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) unused, uncnt * sizeof(OffsetNumber));
+
+		recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VACUUM);
+
 		PageSetLSN(page, recptr);
 	}
 
@@ -2501,10 +2479,10 @@ lazy_vacuum_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	END_CRIT_SECTION();
 
 	/*
-	 * Now that we have removed the dead tuples from the page, once again
+	 * Now that we have removed the LP_DEAD items from the page, once again
 	 * check if the page has become all-visible.  The page is already marked
 	 * dirty, exclusively locked, and, if needed, a full page image has been
-	 * emitted in the log_heap_clean() above.
+	 * emitted.
 	 */
 	if (heap_page_is_all_visible(vacrel, buffer, &visibility_cutoff_xid,
 								 &all_frozen))
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 9282c9ea22..1360ab80c1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1213,10 +1213,10 @@ backtrack:
 				 * as long as the callback function only considers whether the
 				 * index tuple refers to pre-cutoff heap tuples that were
 				 * certainly already pruned away during VACUUM's initial heap
-				 * scan by the time we get here. (heapam's XLOG_HEAP2_CLEAN
-				 * and XLOG_HEAP2_CLEANUP_INFO records produce conflicts using
-				 * a latestRemovedXid value for the pointed-to heap tuples, so
-				 * there is no need to produce our own conflict now.)
+				 * scan by the time we get here. (heapam's XLOG_HEAP2_PRUNE
+				 * records produce conflicts using a latestRemovedXid value
+				 * for the pointed-to heap tuples, so there is no need to
+				 * produce our own conflict now.)
 				 *
 				 * Backends with snapshots acquired after a VACUUM starts but
 				 * before it finishes could have visibility cutoff with a
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index e60e32b935..f8b4fb901b 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -121,11 +121,21 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
 	info &= XLOG_HEAP_OPMASK;
-	if (info == XLOG_HEAP2_CLEAN)
+	if (info == XLOG_HEAP2_PRUNE)
 	{
-		xl_heap_clean *xlrec = (xl_heap_clean *) rec;
+		xl_heap_prune *xlrec = (xl_heap_prune *) rec;
 
-		appendStringInfo(buf, "latestRemovedXid %u", xlrec->latestRemovedXid);
+		/* XXX Should display implicit 'nunused' field, too */
+		appendStringInfo(buf, "latestRemovedXid %u nredirected %u ndead %u",
+						 xlrec->latestRemovedXid,
+						 xlrec->nredirected,
+						 xlrec->ndead);
+	}
+	else if (info == XLOG_HEAP2_VACUUM)
+	{
+		xl_heap_vacuum *xlrec = (xl_heap_vacuum *) rec;
+
+		appendStringInfo(buf, "nunused %u", xlrec->nunused);
 	}
 	else if (info == XLOG_HEAP2_FREEZE_PAGE)
 	{
@@ -134,12 +144,6 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "cutoff xid %u ntuples %u",
 						 xlrec->cutoff_xid, xlrec->ntuples);
 	}
-	else if (info == XLOG_HEAP2_CLEANUP_INFO)
-	{
-		xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
-
-		appendStringInfo(buf, "latestRemovedXid %u", xlrec->latestRemovedXid);
-	}
 	else if (info == XLOG_HEAP2_VISIBLE)
 	{
 		xl_heap_visible *xlrec = (xl_heap_visible *) rec;
@@ -229,15 +233,15 @@ heap2_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_HEAP2_CLEAN:
-			id = "CLEAN";
+		case XLOG_HEAP2_PRUNE:
+			id = "PRUNE";
+			break;
+		case XLOG_HEAP2_VACUUM:
+			id = "VACUUM";
 			break;
 		case XLOG_HEAP2_FREEZE_PAGE:
 			id = "FREEZE_PAGE";
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			id = "CLEANUP_INFO";
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			id = "VISIBLE";
 			break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f596135b1..391caf7396 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -480,8 +480,8 @@ DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 * interested in.
 			 */
 		case XLOG_HEAP2_FREEZE_PAGE:
-		case XLOG_HEAP2_CLEAN:
-		case XLOG_HEAP2_CLEANUP_INFO:
+		case XLOG_HEAP2_PRUNE:
+		case XLOG_HEAP2_VACUUM:
 		case XLOG_HEAP2_VISIBLE:
 		case XLOG_HEAP2_LOCK_UPDATED:
 			break;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 9ac556b4ae..0c4c07503a 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -250,14 +250,18 @@ PageAddItemExtended(Page page,
 		/* if no free slot, we'll put it at limit (1st open slot) */
 		if (PageHasFreeLinePointers(phdr))
 		{
-			/*
-			 * Look for "recyclable" (unused) ItemId.  We check for no storage
-			 * as well, just to be paranoid --- unused items should never have
-			 * storage.
-			 */
+			/* Look for "recyclable" (unused) ItemId */
 			for (offsetNumber = 1; offsetNumber < limit; offsetNumber++)
 			{
 				itemId = PageGetItemId(phdr, offsetNumber);
+
+				/*
+				 * We check for no storage as well, just to be paranoid;
+				 * unused items should never have storage.  Assert() that the
+				 * invariant is respected too.
+				 */
+				Assert(ItemIdIsUsed(itemId) || !ItemIdHasStorage(itemId));
+
 				if (!ItemIdIsUsed(itemId) && !ItemIdHasStorage(itemId))
 					break;
 			}
@@ -676,7 +680,9 @@ compactify_tuples(itemIdCompact itemidbase, int nitems, Page page, bool presorte
  *
  * This routine is usable for heap pages only, but see PageIndexMultiDelete.
  *
- * As a side effect, the page's PD_HAS_FREE_LINES hint bit is updated.
+ * Caller had better have a super-exclusive lock on page's buffer.  As a side
+ * effect, the page's PD_HAS_FREE_LINES hint bit is updated in cases where our
+ * caller (the heap prune code) sets one or more line pointers unused.
  */
 void
 PageRepairFragmentation(Page page)
@@ -771,7 +777,7 @@ PageRepairFragmentation(Page page)
 		compactify_tuples(itemidbase, nstorage, page, presorted);
 	}
 
-	/* Set hint bit for PageAddItem */
+	/* Set hint bit for PageAddItemExtended */
 	if (nunused > 0)
 		PageSetHasFreeLinePointers(page);
 	else
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e6777e9d0..0a75dccb93 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3554,8 +3554,6 @@ xl_hash_split_complete
 xl_hash_squeeze_page
 xl_hash_update_meta_page
 xl_hash_vacuum_one_page
-xl_heap_clean
-xl_heap_cleanup_info
 xl_heap_confirm
 xl_heap_delete
 xl_heap_freeze_page
@@ -3567,9 +3565,11 @@ xl_heap_lock
 xl_heap_lock_updated
 xl_heap_multi_insert
 xl_heap_new_cid
+xl_heap_prune
 xl_heap_rewrite_mapping
 xl_heap_truncate
 xl_heap_update
+xl_heap_vacuum
 xl_heap_visible
 xl_invalid_page
 xl_invalid_page_key
-- 
2.27.0

Attachment: v7-0001-Centralize-state-for-each-VACUUM.patch (application/octet-stream)
From 903d31c405594254050bf4563477e36b2d3e0e9a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 13 Mar 2021 20:37:32 -0800
Subject: [PATCH v7 1/4] Centralize state for each VACUUM.

Simplify function signatures inside vacuumlazy.c by putting several
frequently used variables in a per-VACUUM state variable.  This makes
the general control flow easier to follow, and reduces clutter.

Also refactor the parallel VACUUM code.
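
For orientation, the kind of signature change involved looks like this (both
prototypes are taken from the vacuumlazy.c changes in the diff below):

/* Before: relation, stats struct, index array and count passed separately */
static void lazy_scan_heap(Relation onerel, VacuumParams *params,
						   LVRelStats *vacrelstats, Relation *Irel,
						   int nindexes, bool aggressive);

/* After: a single per-VACUUM state struct (LVRelState) carries all of that */
static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
						   bool aggressive);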
---
 src/include/access/genam.h           |    4 +-
 src/backend/access/heap/vacuumlazy.c | 2135 +++++++++++++-------------
 src/backend/access/index/indexam.c   |    8 +-
 3 files changed, 1109 insertions(+), 1038 deletions(-)

diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 4515401869..480a4762f5 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -177,11 +177,11 @@ extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
-												IndexBulkDeleteResult *stats,
+												IndexBulkDeleteResult *istat,
 												IndexBulkDeleteCallback callback,
 												void *callback_state);
 extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
-												   IndexBulkDeleteResult *stats);
+												   IndexBulkDeleteResult *istat);
 extern bool index_can_return(Relation indexRelation, int attno);
 extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
 									uint16 procnum);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index efe8761702..b5343d5d78 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -142,12 +142,6 @@
 #define PARALLEL_VACUUM_KEY_BUFFER_USAGE	4
 #define PARALLEL_VACUUM_KEY_WAL_USAGE		5
 
-/*
- * Macro to check if we are in a parallel vacuum.  If true, we are in the
- * parallel mode and the DSM segment is initialized.
- */
-#define ParallelVacuumIsActive(lps) PointerIsValid(lps)
-
 /* Phases of vacuum during which we report error context. */
 typedef enum
 {
@@ -191,7 +185,7 @@ typedef struct LVShared
 	 * Target table relid and log level.  These fields are not modified during
 	 * the lazy vacuum.
 	 */
-	Oid			relid;
+	Oid			onereloid;
 	int			elevel;
 
 	/*
@@ -264,7 +258,7 @@ typedef struct LVShared
 typedef struct LVSharedIndStats
 {
 	bool		updated;		/* are the stats updated? */
-	IndexBulkDeleteResult stats;
+	IndexBulkDeleteResult istat;
 } LVSharedIndStats;
 
 /* Struct for maintaining a parallel vacuum state. */
@@ -290,13 +284,32 @@ typedef struct LVParallelState
 	int			nindexes_parallel_condcleanup;
 } LVParallelState;
 
-typedef struct LVRelStats
+typedef struct LVRelState
 {
+	/* Target heap relation and its indexes */
+	Relation	onerel;
+	Relation   *indrels;
+	int			nindexes;
+
+	BufferAccessStrategy	vac_strategy;
+
+	/* onerel's initial relfrozenxid and relminmxid */
+	TransactionId	relfrozenxid;
+	MultiXactId		relminmxid;
+
+	/* VACUUM operation's cutoffs for dead items and freezing */
+	TransactionId	OldestXmin;
+	TransactionId	FreezeLimit;
+	MultiXactId		MultiXactCutoff;
+
+	/* Parallel VACUUM state */
+	LVParallelState *lps;
+
 	char	   *relnamespace;
 	char	   *relname;
 	/* useindex = true means two-pass strategy; false means one-pass */
 	bool		useindex;
-	/* Overall statistics about rel */
+	/* Overall statistics about onerel */
 	BlockNumber old_rel_pages;	/* previous value of pg_class.relpages */
 	BlockNumber rel_pages;		/* total number of pages */
 	BlockNumber scanned_pages;	/* number of pages we examined */
@@ -315,16 +328,15 @@ typedef struct LVRelStats
 	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
 
-	/* Statistics about indexes */
+	/* Statistics output by index AMs */
 	IndexBulkDeleteResult **indstats;
-	int			nindexes;
 
 	/* Used for error callback */
 	char	   *indname;
 	BlockNumber blkno;			/* used only for heap operations */
 	OffsetNumber offnum;		/* used only for heap operations */
 	VacErrPhase phase;
-} LVRelStats;
+} LVRelState;
 
 /* Struct for saving and restoring vacuum error information. */
 typedef struct LVSavedErrInfo
@@ -334,77 +346,71 @@ typedef struct LVSavedErrInfo
 	VacErrPhase phase;
 } LVSavedErrInfo;
 
-/* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
 
-static TransactionId OldestXmin;
-static TransactionId FreezeLimit;
-static MultiXactId MultiXactCutoff;
-
-static BufferAccessStrategy vac_strategy;
-
 
 /* non-export function prototypes */
-static void lazy_scan_heap(Relation onerel, VacuumParams *params,
-						   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
+static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
 						   bool aggressive);
-static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
+static void lazy_vacuum_heap(LVRelState *vacrel);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
-									LVRelStats *vacrelstats);
-static void lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
-									LVRelStats *vacrelstats, LVParallelState *lps,
-									int nindexes);
-static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
-							  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats);
-static void lazy_cleanup_index(Relation indrel,
-							   IndexBulkDeleteResult **stats,
-							   double reltuples, bool estimated_count, LVRelStats *vacrelstats);
-static int	lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-							 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
-static bool should_attempt_truncation(VacuumParams *params,
-									  LVRelStats *vacrelstats);
-static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
-static BlockNumber count_nondeletable_pages(Relation onerel,
-											LVRelStats *vacrelstats);
-static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
+									LVRelState *vacrel);
+static void lazy_vacuum_all_indexes(LVRelState *vacrel);
+static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
+													IndexBulkDeleteResult *istat,
+													double reltuples,
+													LVRelState *vacrel);
+static void lazy_cleanup_all_indexes(LVRelState *vacrel);
+static IndexBulkDeleteResult *lazy_cleanup_one_index(Relation indrel,
+													 IndexBulkDeleteResult *istat,
+													 double reltuples,
+													 bool estimated_count,
+													 LVRelState *vacrel);
+static int	lazy_vacuum_page(LVRelState *vacrel, BlockNumber blkno,
+							 Buffer buffer, int tupindex, Buffer *vmbuffer);
+static void update_index_statistics(LVRelState *vacrel);
+static bool should_attempt_truncation(LVRelState *vacrel, VacuumParams
+									  *params);
+static void lazy_truncate_heap(LVRelState *vacrel);
 static void lazy_record_dead_tuple(LVDeadTuples *dead_tuples,
 								   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
 static int	vac_cmp_itemptr(const void *left, const void *right);
-static bool heap_page_is_all_visible(Relation rel, Buffer buf,
-									 LVRelStats *vacrelstats,
+static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
-static void lazy_parallel_vacuum_indexes(Relation *Irel, LVRelStats *vacrelstats,
-										 LVParallelState *lps, int nindexes);
-static void parallel_vacuum_index(Relation *Irel, LVShared *lvshared,
-								  LVDeadTuples *dead_tuples, int nindexes,
-								  LVRelStats *vacrelstats);
-static void vacuum_indexes_leader(Relation *Irel, LVRelStats *vacrelstats,
-								  LVParallelState *lps, int nindexes);
-static void vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
-							 LVShared *lvshared, LVSharedIndStats *shared_indstats,
-							 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats);
-static void lazy_cleanup_all_indexes(Relation *Irel, LVRelStats *vacrelstats,
-									 LVParallelState *lps, int nindexes);
+static BlockNumber count_nondeletable_pages(LVRelState *vacrel);
 static long compute_max_dead_tuples(BlockNumber relblocks, bool hasindex);
-static int	compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
+static void lazy_space_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber relblocks);
+static void lazy_space_free(LVRelState *vacrel);
+static int	compute_parallel_vacuum_workers(LVRelState *vacrel,
+											int nrequested,
 											bool *can_parallel_vacuum);
-static void prepare_index_statistics(LVShared *lvshared, bool *can_parallel_vacuum,
-									 int nindexes);
-static void update_index_statistics(Relation *Irel, IndexBulkDeleteResult **stats,
-									int nindexes);
-static LVParallelState *begin_parallel_vacuum(Oid relid, Relation *Irel,
-											  LVRelStats *vacrelstats, BlockNumber nblocks,
-											  int nindexes, int nrequested);
-static void end_parallel_vacuum(IndexBulkDeleteResult **stats,
-								LVParallelState *lps, int nindexes);
-static LVSharedIndStats *get_indstats(LVShared *lvshared, int n);
-static bool skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared);
+static LVParallelState *begin_parallel_vacuum(LVRelState *vacrel,
+											  BlockNumber nblocks,
+											  int nrequested);
+static void end_parallel_vacuum(LVRelState *vacrel);
+static void do_parallel_lazy_vacuum_all_indexes(LVRelState *vacrel);
+static void do_parallel_lazy_cleanup_all_indexes(LVRelState *vacrel);
+static void do_parallel_vacuum_or_cleanup(LVRelState *vacrel, int nworkers);
+static void do_parallel_processing(LVRelState *vacrel,
+								   LVShared *lvshared);
+static void do_serial_processing_for_unsafe_indexes(LVRelState *vacrel,
+													LVShared *lvshared);
+static IndexBulkDeleteResult *parallel_process_one_index(Relation indrel,
+														 IndexBulkDeleteResult *istat,
+														 LVShared *lvshared,
+														 LVSharedIndStats *shared_indstats,
+														 LVRelState *vacrel);
+static LVSharedIndStats *parallel_stats_for_idx(LVShared *lvshared, int getidx);
+static bool parallel_processing_is_safe(Relation indrel, LVShared *lvshared);
 static void vacuum_error_callback(void *arg);
-static void update_vacuum_error_info(LVRelStats *errinfo, LVSavedErrInfo *saved_err_info,
+static void update_vacuum_error_info(LVRelState *vacrel,
+									 LVSavedErrInfo *saved_vacrel,
 									 int phase, BlockNumber blkno,
 									 OffsetNumber offnum);
-static void restore_vacuum_error_info(LVRelStats *errinfo, const LVSavedErrInfo *saved_err_info);
+static void restore_vacuum_error_info(LVRelState *vacrel,
+									  const LVSavedErrInfo *saved_vacrel);
 
 
 /*
@@ -420,8 +426,7 @@ void
 heap_vacuum_rel(Relation onerel, VacuumParams *params,
 				BufferAccessStrategy bstrategy)
 {
-	LVRelStats *vacrelstats;
-	Relation   *Irel;
+	LVRelState *vacrel;
 	int			nindexes;
 	PGRUsage	ru0;
 	TimestampTz starttime = 0;
@@ -444,15 +449,14 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	ErrorContextCallback errcallback;
 	PgStat_Counter startreadtime = 0;
 	PgStat_Counter startwritetime = 0;
+	TransactionId	OldestXmin;
+	TransactionId	FreezeLimit;
+	MultiXactId		MultiXactCutoff;
 
 	Assert(params != NULL);
 	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
-	/* not every AM requires these to be valid, but heap does */
-	Assert(TransactionIdIsNormal(onerel->rd_rel->relfrozenxid));
-	Assert(MultiXactIdIsValid(onerel->rd_rel->relminmxid));
-
 	/* measure elapsed time iff autovacuum logging requires it */
 	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
 	{
@@ -473,8 +477,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM,
 								  RelationGetRelid(onerel));
 
-	vac_strategy = bstrategy;
-
 	vacuum_set_xid_limits(onerel,
 						  params->freeze_min_age,
 						  params->freeze_table_age,
@@ -496,35 +498,46 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 		aggressive = true;
 
-	vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
+	vacrel = (LVRelState *) palloc0(sizeof(LVRelState));
 
-	vacrelstats->relnamespace = get_namespace_name(RelationGetNamespace(onerel));
-	vacrelstats->relname = pstrdup(RelationGetRelationName(onerel));
-	vacrelstats->indname = NULL;
-	vacrelstats->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
-	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
-	vacrelstats->old_live_tuples = onerel->rd_rel->reltuples;
-	vacrelstats->num_index_scans = 0;
-	vacrelstats->pages_removed = 0;
-	vacrelstats->lock_waiter_detected = false;
+	vacrel->onerel = onerel;
+	vacrel->lps = NULL;
+	vacrel->vac_strategy = bstrategy;
+	vacrel->relfrozenxid = onerel->rd_rel->relfrozenxid;
+	vacrel->relminmxid = onerel->rd_rel->relminmxid;
+	vacrel->OldestXmin = OldestXmin;
+	vacrel->FreezeLimit = FreezeLimit;
+	vacrel->MultiXactCutoff = MultiXactCutoff;
+	vacrel->relnamespace = get_namespace_name(RelationGetNamespace(onerel));
+	vacrel->relname = pstrdup(RelationGetRelationName(onerel));
+	vacrel->indname = NULL;
+	vacrel->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
+	vacrel->old_rel_pages = onerel->rd_rel->relpages;
+	vacrel->old_live_tuples = onerel->rd_rel->reltuples;
+	vacrel->num_index_scans = 0;
+	vacrel->pages_removed = 0;
+	vacrel->lock_waiter_detected = false;
 
 	/* Open all indexes of the relation */
-	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
-	vacrelstats->useindex = (nindexes > 0 &&
-							 params->index_cleanup == VACOPT_TERNARY_ENABLED);
+	vac_open_indexes(onerel, RowExclusiveLock, &nindexes,
+					 &vacrel->indrels);
+	vacrel->useindex = (nindexes > 0 &&
+						params->index_cleanup == VACOPT_TERNARY_ENABLED);
 
-	vacrelstats->indstats = (IndexBulkDeleteResult **)
+	vacrel->nindexes = nindexes;
+
+	vacrel->indstats = (IndexBulkDeleteResult **)
 		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
-	vacrelstats->nindexes = nindexes;
 
 	/* Save index names iff autovacuum logging requires it */
 	if (IsAutoVacuumWorkerProcess() &&
 		params->log_min_duration >= 0 &&
-		vacrelstats->nindexes > 0)
+		vacrel->nindexes > 0)
 	{
-		indnames = palloc(sizeof(char *) * vacrelstats->nindexes);
-		for (int i = 0; i < vacrelstats->nindexes; i++)
-			indnames[i] = pstrdup(RelationGetRelationName(Irel[i]));
+		indnames = palloc(sizeof(char *) * vacrel->nindexes);
+		for (int i = 0; i < vacrel->nindexes; i++)
+			indnames[i] =
+				pstrdup(RelationGetRelationName(vacrel->indrels[i]));
 	}
 
 	/*
@@ -539,15 +552,15 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	 * information is restored at the end of those phases.
 	 */
 	errcallback.callback = vacuum_error_callback;
-	errcallback.arg = vacrelstats;
+	errcallback.arg = vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
 
 	/* Do the vacuuming */
-	lazy_scan_heap(onerel, params, vacrelstats, Irel, nindexes, aggressive);
+	lazy_scan_heap(vacrel, params, aggressive);
 
 	/* Done with indexes */
-	vac_close_indexes(nindexes, Irel, NoLock);
+	vac_close_indexes(nindexes, vacrel->indrels, NoLock);
 
 	/*
 	 * Compute whether we actually scanned the all unfrozen pages. If we did,
@@ -556,8 +569,8 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	 * NB: We need to check this before truncating the relation, because that
 	 * will change ->rel_pages.
 	 */
-	if ((vacrelstats->scanned_pages + vacrelstats->frozenskipped_pages)
-		< vacrelstats->rel_pages)
+	if ((vacrel->scanned_pages + vacrel->frozenskipped_pages)
+		< vacrel->rel_pages)
 	{
 		Assert(!aggressive);
 		scanned_all_unfrozen = false;
@@ -568,17 +581,17 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	/*
 	 * Optionally truncate the relation.
 	 */
-	if (should_attempt_truncation(params, vacrelstats))
+	if (should_attempt_truncation(vacrel, params))
 	{
 		/*
 		 * Update error traceback information.  This is the last phase during
 		 * which we add context information to errors, so we don't need to
 		 * revert to the previous phase.
 		 */
-		update_vacuum_error_info(vacrelstats, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
-								 vacrelstats->nonempty_pages,
+		update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
+								 vacrel->nonempty_pages,
 								 InvalidOffsetNumber);
-		lazy_truncate_heap(onerel, vacrelstats);
+		lazy_truncate_heap(vacrel);
 	}
 
 	/* Pop the error context stack */
@@ -602,8 +615,8 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	 * Also, don't change relfrozenxid/relminmxid if we skipped any pages,
 	 * since then we don't know for certain that all tuples have a newer xmin.
 	 */
-	new_rel_pages = vacrelstats->rel_pages;
-	new_live_tuples = vacrelstats->new_live_tuples;
+	new_rel_pages = vacrel->rel_pages;
+	new_live_tuples = vacrel->new_live_tuples;
 
 	visibilitymap_count(onerel, &new_rel_allvisible, NULL);
 	if (new_rel_allvisible > new_rel_pages)
@@ -625,7 +638,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	pgstat_report_vacuum(RelationGetRelid(onerel),
 						 onerel->rd_rel->relisshared,
 						 Max(new_live_tuples, 0),
-						 vacrelstats->new_dead_tuples);
+						 vacrel->new_dead_tuples);
 	pgstat_progress_end_command();
 
 	/* and log the action if appropriate */
@@ -676,39 +689,39 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 			}
 			appendStringInfo(&buf, msgfmt,
 							 get_database_name(MyDatabaseId),
-							 vacrelstats->relnamespace,
-							 vacrelstats->relname,
-							 vacrelstats->num_index_scans);
+							 vacrel->relnamespace,
+							 vacrel->relname,
+							 vacrel->num_index_scans);
 			appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"),
-							 vacrelstats->pages_removed,
-							 vacrelstats->rel_pages,
-							 vacrelstats->pinskipped_pages,
-							 vacrelstats->frozenskipped_pages);
+							 vacrel->pages_removed,
+							 vacrel->rel_pages,
+							 vacrel->pinskipped_pages,
+							 vacrel->frozenskipped_pages);
 			appendStringInfo(&buf,
 							 _("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable, oldest xmin: %u\n"),
-							 vacrelstats->tuples_deleted,
-							 vacrelstats->new_rel_tuples,
-							 vacrelstats->new_dead_tuples,
+							 vacrel->tuples_deleted,
+							 vacrel->new_rel_tuples,
+							 vacrel->new_dead_tuples,
 							 OldestXmin);
 			appendStringInfo(&buf,
 							 _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
 							 (long long) VacuumPageHit,
 							 (long long) VacuumPageMiss,
 							 (long long) VacuumPageDirty);
-			for (int i = 0; i < vacrelstats->nindexes; i++)
+			for (int i = 0; i < vacrel->nindexes; i++)
 			{
-				IndexBulkDeleteResult *stats = vacrelstats->indstats[i];
+				IndexBulkDeleteResult *istat = vacrel->indstats[i];
 
-				if (!stats)
+				if (!istat)
 					continue;
 
 				appendStringInfo(&buf,
 								 _("index \"%s\": pages: %u in total, %u newly deleted, %u currently deleted, %u reusable\n"),
 								 indnames[i],
-								 stats->num_pages,
-								 stats->pages_newly_deleted,
-								 stats->pages_deleted,
-								 stats->pages_free);
+								 istat->num_pages,
+								 istat->pages_newly_deleted,
+								 istat->pages_deleted,
+								 istat->pages_free);
 			}
 			appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
 							 read_rate, write_rate);
@@ -737,10 +750,10 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	}
 
 	/* Cleanup index statistics and index names */
-	for (int i = 0; i < vacrelstats->nindexes; i++)
+	for (int i = 0; i < vacrel->nindexes; i++)
 	{
-		if (vacrelstats->indstats[i])
-			pfree(vacrelstats->indstats[i]);
+		if (vacrel->indstats[i])
+			pfree(vacrel->indstats[i]);
 
 		if (indnames && indnames[i])
 			pfree(indnames[i]);
@@ -764,20 +777,21 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
  * which would be after the rows have become inaccessible.
  */
 static void
-vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
+vacuum_log_cleanup_info(LVRelState *vacrel)
 {
 	/*
 	 * Skip this for relations for which no WAL is to be written, or if we're
 	 * not trying to support archive recovery.
 	 */
-	if (!RelationNeedsWAL(rel) || !XLogIsNeeded())
+	if (!RelationNeedsWAL(vacrel->onerel) || !XLogIsNeeded())
 		return;
 
 	/*
 	 * No need to write the record at all unless it contains a valid value
 	 */
-	if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
-		(void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+	if (TransactionIdIsValid(vacrel->latestRemovedXid))
+		(void) log_heap_cleanup_info(vacrel->onerel->rd_node,
+									 vacrel->latestRemovedXid);
 }
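To make the rest of the diff easier to follow: the point of the refactoring is that everything that used to be threaded separately through the call chain (onerel, Irel/nindexes, the LVRelStats counters, vac_strategy, the cutoff XIDs, and the parallel state) is now carried in a single LVRelState. The struct definition itself is not in this excerpt; the sketch below is reconstructed from the fields referenced at the call sites in these hunks plus the old LVRelStats members, so treat the exact layout and comments as illustrative rather than as the patch's definition.

/*
 * Illustrative reconstruction of LVRelState, inferred from the call sites
 * in this patch; it would live in vacuumlazy.c next to LVDeadTuples and
 * LVParallelState.  Not copied from the patch.
 */
typedef struct LVRelState
{
	/* Target heap relation and its indexes */
	Relation	onerel;
	Relation   *indrels;
	int			nindexes;
	bool		useindex;		/* two-pass strategy with index vacuuming? */

	/* Buffer access strategy and (optional) parallel vacuum state */
	BufferAccessStrategy vac_strategy;
	LVParallelState *lps;		/* NULL for serial vacuum */

	/* Cutoffs, copied once in heap_vacuum_rel() */
	TransactionId relfrozenxid;
	MultiXactId relminmxid;
	TransactionId OldestXmin;
	TransactionId FreezeLimit;
	MultiXactId MultiXactCutoff;

	/* Error traceback state */
	char	   *relnamespace;
	char	   *relname;
	char	   *indname;
	BlockNumber blkno;			/* current block being processed */
	OffsetNumber offnum;		/* current offset being processed */

	/* Working state */
	LVDeadTuples *dead_tuples;
	IndexBulkDeleteResult **indstats;
	int			num_index_scans;
	TransactionId latestRemovedXid;
	bool		lock_waiter_detected;

	/* Counters and statistics reported at the end */
	BlockNumber rel_pages;
	BlockNumber scanned_pages;
	BlockNumber pinskipped_pages;
	BlockNumber frozenskipped_pages;
	BlockNumber tupcount_pages;
	BlockNumber nonempty_pages;
	BlockNumber pages_removed;
	double		old_live_tuples;
	double		new_rel_tuples;
	double		new_live_tuples;
	double		new_dead_tuples;
	double		tuples_deleted;
} LVRelState;

With that in place, helpers such as lazy_truncate_heap() and lazy_vacuum_heap() only need the one pointer, which is what most of the signature changes below amount to.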
 
 /*
@@ -809,16 +823,12 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
  *		reference them have been killed.
  */
 static void
-lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
-			   Relation *Irel, int nindexes, bool aggressive)
+lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 {
-	LVParallelState *lps = NULL;
 	LVDeadTuples *dead_tuples;
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
-	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
-	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
 				vacuumed_pages,
 				next_fsm_block_to_vacuum;
@@ -847,63 +857,36 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	if (aggressive)
 		ereport(elevel,
 				(errmsg("aggressively vacuuming \"%s.%s\"",
-						vacrelstats->relnamespace,
-						vacrelstats->relname)));
+						vacrel->relnamespace,
+						vacrel->relname)));
 	else
 		ereport(elevel,
 				(errmsg("vacuuming \"%s.%s\"",
-						vacrelstats->relnamespace,
-						vacrelstats->relname)));
+						vacrel->relnamespace,
+						vacrel->relname)));
 
 	empty_pages = vacuumed_pages = 0;
 	next_fsm_block_to_vacuum = (BlockNumber) 0;
 	num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
 
-	nblocks = RelationGetNumberOfBlocks(onerel);
-	vacrelstats->rel_pages = nblocks;
-	vacrelstats->scanned_pages = 0;
-	vacrelstats->tupcount_pages = 0;
-	vacrelstats->nonempty_pages = 0;
-	vacrelstats->latestRemovedXid = InvalidTransactionId;
+	nblocks = RelationGetNumberOfBlocks(vacrel->onerel);
+	next_unskippable_block = 0;
+	next_fsm_block_to_vacuum = 0;
+	vacrel->rel_pages = nblocks;
+	vacrel->scanned_pages = 0;
+	vacrel->tupcount_pages = 0;
+	vacrel->nonempty_pages = 0;
+	vacrel->latestRemovedXid = InvalidTransactionId;
 
-	vistest = GlobalVisTestFor(onerel);
-
-	/*
-	 * Initialize state for a parallel vacuum.  As of now, only one worker can
-	 * be used for an index, so we invoke parallelism only if there are at
-	 * least two indexes on a table.
-	 */
-	if (params->nworkers >= 0 && vacrelstats->useindex && nindexes > 1)
-	{
-		/*
-		 * Since parallel workers cannot access data in temporary tables, we
-		 * can't perform parallel vacuum on them.
-		 */
-		if (RelationUsesLocalBuffers(onerel))
-		{
-			/*
-			 * Give warning only if the user explicitly tries to perform a
-			 * parallel vacuum on the temporary table.
-			 */
-			if (params->nworkers > 0)
-				ereport(WARNING,
-						(errmsg("disabling parallel option of vacuum on \"%s\" --- cannot vacuum temporary tables in parallel",
-								vacrelstats->relname)));
-		}
-		else
-			lps = begin_parallel_vacuum(RelationGetRelid(onerel), Irel,
-										vacrelstats, nblocks, nindexes,
-										params->nworkers);
-	}
+	vistest = GlobalVisTestFor(vacrel->onerel);
 
 	/*
 	 * Allocate the space for dead tuples in case parallel vacuum is not
 	 * initialized.
 	 */
-	if (!ParallelVacuumIsActive(lps))
-		lazy_space_alloc(vacrelstats, nblocks);
+	lazy_space_alloc(vacrel, params->nworkers, nblocks);
 
-	dead_tuples = vacrelstats->dead_tuples;
+	dead_tuples = vacrel->dead_tuples;
 	frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
 
 	/* Report that we're scanning the heap, advertising total # of blocks */
@@ -956,14 +939,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * the last page.  This is worth avoiding mainly because such a lock must
 	 * be replayed on any hot standby, where it can be disruptive.
 	 */
-	next_unskippable_block = 0;
 	if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
 	{
 		while (next_unskippable_block < nblocks)
 		{
 			uint8		vmstatus;
 
-			vmstatus = visibilitymap_get_status(onerel, next_unskippable_block,
+			vmstatus = visibilitymap_get_status(vacrel->onerel,
+												next_unskippable_block,
 												&vmbuffer);
 			if (aggressive)
 			{
@@ -1004,11 +987,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		/* see note above about forcing scanning of last page */
 #define FORCE_CHECK_PAGE() \
-		(blkno == nblocks - 1 && should_attempt_truncation(params, vacrelstats))
+		(blkno == nblocks - 1 && should_attempt_truncation(vacrel, params))
 
 		pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
-		update_vacuum_error_info(vacrelstats, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
+		update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
 								 blkno, InvalidOffsetNumber);
 
 		if (blkno == next_unskippable_block)
@@ -1021,7 +1004,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				{
 					uint8		vmskipflags;
 
-					vmskipflags = visibilitymap_get_status(onerel,
+					vmskipflags = visibilitymap_get_status(vacrel->onerel,
 														   next_unskippable_block,
 														   &vmbuffer);
 					if (aggressive)
@@ -1053,7 +1036,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * it's not all-visible.  But in an aggressive vacuum we know only
 			 * that it's not all-frozen, so it might still be all-visible.
 			 */
-			if (aggressive && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
+			if (aggressive && VM_ALL_VISIBLE(vacrel->onerel, blkno,
+											 &vmbuffer))
 				all_visible_according_to_vm = true;
 		}
 		else
@@ -1077,8 +1061,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * know whether it was all-frozen, so we have to recheck; but
 				 * in this case an approximate answer is OK.
 				 */
-				if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
-					vacrelstats->frozenskipped_pages++;
+				if (aggressive || VM_ALL_FROZEN(vacrel->onerel, blkno,
+												&vmbuffer))
+					vacrel->frozenskipped_pages++;
 				continue;
 			}
 			all_visible_according_to_vm = true;
@@ -1106,10 +1091,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(onerel, Irel, vacrelstats, lps, nindexes);
+			lazy_vacuum_all_indexes(vacrel);
 
 			/* Remove tuples from heap */
-			lazy_vacuum_heap(onerel, vacrelstats);
+			lazy_vacuum_heap(vacrel);
 
 			/*
 			 * Forget the now-vacuumed tuples, and press on, but be careful
@@ -1122,7 +1107,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * Vacuum the Free Space Map to make newly-freed space visible on
 			 * upper-level FSM pages.  Note we have not yet processed blkno.
 			 */
-			FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
+			FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum, blkno);
 			next_fsm_block_to_vacuum = blkno;
 
 			/* Report that we are once again scanning the heap */
@@ -1137,12 +1122,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * possible that (a) next_unskippable_block is covered by a different
 		 * VM page than the current block or (b) we released our pin and did a
 		 * cycle of index vacuuming.
-		 *
 		 */
-		visibilitymap_pin(onerel, blkno, &vmbuffer);
+		visibilitymap_pin(vacrel->onerel, blkno, &vmbuffer);
 
-		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
-								 RBM_NORMAL, vac_strategy);
+		buf = ReadBufferExtended(vacrel->onerel, MAIN_FORKNUM, blkno,
+								 RBM_NORMAL, vacrel->vac_strategy);
 
 		/* We need buffer cleanup lock so that we can prune HOT chains. */
 		if (!ConditionalLockBufferForCleanup(buf))
@@ -1156,7 +1140,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			if (!aggressive && !FORCE_CHECK_PAGE())
 			{
 				ReleaseBuffer(buf);
-				vacrelstats->pinskipped_pages++;
+				vacrel->pinskipped_pages++;
 				continue;
 			}
 
@@ -1177,13 +1161,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * to use lazy_check_needs_freeze() for both situations, though.
 			 */
 			LockBuffer(buf, BUFFER_LOCK_SHARE);
-			if (!lazy_check_needs_freeze(buf, &hastup, vacrelstats))
+			if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
 			{
 				UnlockReleaseBuffer(buf);
-				vacrelstats->scanned_pages++;
-				vacrelstats->pinskipped_pages++;
+				vacrel->scanned_pages++;
+				vacrel->pinskipped_pages++;
 				if (hastup)
-					vacrelstats->nonempty_pages = blkno + 1;
+					vacrel->nonempty_pages = blkno + 1;
 				continue;
 			}
 			if (!aggressive)
@@ -1193,9 +1177,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * to claiming that the page contains no freezable tuples.
 				 */
 				UnlockReleaseBuffer(buf);
-				vacrelstats->pinskipped_pages++;
+				vacrel->pinskipped_pages++;
 				if (hastup)
-					vacrelstats->nonempty_pages = blkno + 1;
+					vacrel->nonempty_pages = blkno + 1;
 				continue;
 			}
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
@@ -1203,8 +1187,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			/* drop through to normal processing */
 		}
 
-		vacrelstats->scanned_pages++;
-		vacrelstats->tupcount_pages++;
+		vacrel->scanned_pages++;
+		vacrel->tupcount_pages++;
 
 		page = BufferGetPage(buf);
 
@@ -1233,12 +1217,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 			empty_pages++;
 
-			if (GetRecordedFreeSpace(onerel, blkno) == 0)
+			if (GetRecordedFreeSpace(vacrel->onerel, blkno) == 0)
 			{
 				Size		freespace;
 
 				freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
-				RecordPageWithFreeSpace(onerel, blkno, freespace);
+				RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
 			}
 			continue;
 		}
@@ -1269,19 +1253,19 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (RelationNeedsWAL(vacrel->onerel) &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
-				visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+				visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
 								  vmbuffer, InvalidTransactionId,
 								  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
 				END_CRIT_SECTION();
 			}
 
 			UnlockReleaseBuffer(buf);
-			RecordPageWithFreeSpace(onerel, blkno, freespace);
+			RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
 			continue;
 		}
 
@@ -1291,10 +1275,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * We count tuples removed by the pruning step as removed by VACUUM
 		 * (existing LP_DEAD line pointers don't count).
 		 */
-		tups_vacuumed += heap_page_prune(onerel, buf, vistest,
+		tups_vacuumed += heap_page_prune(vacrel->onerel, buf, vistest,
 										 InvalidTransactionId, 0, false,
-										 &vacrelstats->latestRemovedXid,
-										 &vacrelstats->offnum);
+										 &vacrel->latestRemovedXid,
+										 &vacrel->offnum);
 
 		/*
 		 * Now scan the page to collect vacuumable items and check for tuples
@@ -1321,7 +1305,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * Set the offset number so that we can display it along with any
 			 * error that occurred while processing this tuple.
 			 */
-			vacrelstats->offnum = offnum;
+			vacrel->offnum = offnum;
 			itemid = PageGetItemId(page, offnum);
 
 			/* Unused items require no processing, but we count 'em */
@@ -1361,7 +1345,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
 			tuple.t_len = ItemIdGetLength(itemid);
-			tuple.t_tableOid = RelationGetRelid(onerel);
+			tuple.t_tableOid = RelationGetRelid(vacrel->onerel);
 
 			tupgone = false;
 
@@ -1376,7 +1360,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * cases impossible (e.g. in-progress insert from the same
 			 * transaction).
 			 */
-			switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+			switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
 			{
 				case HEAPTUPLE_DEAD:
 
@@ -1446,7 +1430,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 						 * enough that everyone sees it as committed?
 						 */
 						xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-						if (!TransactionIdPrecedes(xmin, OldestXmin))
+						if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
 						{
 							all_visible = false;
 							break;
@@ -1500,7 +1484,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			{
 				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
 				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-													   &vacrelstats->latestRemovedXid);
+													   &vacrel->latestRemovedXid);
 				tups_vacuumed += 1;
 				has_dead_items = true;
 			}
@@ -1516,8 +1500,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * freezing.  Note we already have exclusive buffer lock.
 				 */
 				if (heap_prepare_freeze_tuple(tuple.t_data,
-											  relfrozenxid, relminmxid,
-											  FreezeLimit, MultiXactCutoff,
+											  vacrel->relfrozenxid,
+											  vacrel->relminmxid,
+											  vacrel->FreezeLimit,
+											  vacrel->MultiXactCutoff,
 											  &frozen[nfrozen],
 											  &tuple_totally_frozen))
 					frozen[nfrozen++].offset = offnum;
@@ -1531,7 +1517,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * Clear the offset information once we have processed all the tuples
 		 * on the page.
 		 */
-		vacrelstats->offnum = InvalidOffsetNumber;
+		vacrel->offnum = InvalidOffsetNumber;
 
 		/*
 		 * If we froze any tuples, mark the buffer dirty, and write a WAL
@@ -1557,12 +1543,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (RelationNeedsWAL(vacrel->onerel))
 			{
 				XLogRecPtr	recptr;
 
-				recptr = log_heap_freeze(onerel, buf, FreezeLimit,
-										 frozen, nfrozen);
+				recptr = log_heap_freeze(vacrel->onerel, buf,
+										 vacrel->FreezeLimit, frozen, nfrozen);
 				PageSetLSN(page, recptr);
 			}
 
@@ -1574,12 +1560,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * doing a second scan. Also we don't do that but forget dead tuples
 		 * when index cleanup is disabled.
 		 */
-		if (!vacrelstats->useindex && dead_tuples->num_tuples > 0)
+		if (!vacrel->useindex && dead_tuples->num_tuples > 0)
 		{
-			if (nindexes == 0)
+			if (vacrel->nindexes == 0)
 			{
 				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+				lazy_vacuum_page(vacrel, blkno, buf, 0, &vmbuffer);
 				vacuumed_pages++;
 				has_dead_items = false;
 			}
@@ -1613,7 +1599,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
 			{
-				FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum,
+				FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum,
 										blkno);
 				next_fsm_block_to_vacuum = blkno;
 			}
@@ -1644,7 +1630,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			PageSetAllVisible(page);
 			MarkBufferDirty(buf);
-			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+			visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, visibility_cutoff_xid, flags);
 		}
 
@@ -1656,11 +1642,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * that something bad has happened.
 		 */
 		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
+				 && VM_ALL_VISIBLE(vacrel->onerel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
-				 vacrelstats->relname, blkno);
-			visibilitymap_clear(onerel, blkno, vmbuffer,
+				 vacrel->relname, blkno);
+			visibilitymap_clear(vacrel->onerel, blkno, vmbuffer,
 								VISIBILITYMAP_VALID_BITS);
 		}
 
@@ -1682,10 +1668,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		else if (PageIsAllVisible(page) && has_dead_items)
 		{
 			elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
-				 vacrelstats->relname, blkno);
+				 vacrel->relname, blkno);
 			PageClearAllVisible(page);
 			MarkBufferDirty(buf);
-			visibilitymap_clear(onerel, blkno, vmbuffer,
+			visibilitymap_clear(vacrel->onerel, blkno, vmbuffer,
 								VISIBILITYMAP_VALID_BITS);
 		}
 
@@ -1695,14 +1681,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * all_visible is true, so we must check both.
 		 */
 		else if (all_visible_according_to_vm && all_visible && all_frozen &&
-				 !VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+				 !VM_ALL_FROZEN(vacrel->onerel, blkno, &vmbuffer))
 		{
 			/*
 			 * We can pass InvalidTransactionId as the cutoff XID here,
 			 * because setting the all-frozen bit doesn't cause recovery
 			 * conflicts.
 			 */
-			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+			visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, InvalidTransactionId,
 							  VISIBILITYMAP_ALL_FROZEN);
 		}
@@ -1711,7 +1697,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		/* Remember the location of the last page with nonremovable tuples */
 		if (hastup)
-			vacrelstats->nonempty_pages = blkno + 1;
+			vacrel->nonempty_pages = blkno + 1;
 
 		/*
 		 * If we remembered any tuples for deletion, then the page will be
@@ -1721,33 +1707,32 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * taken if there are no indexes.)
 		 */
 		if (dead_tuples->num_tuples == prev_dead_count)
-			RecordPageWithFreeSpace(onerel, blkno, freespace);
+			RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
 	}
 
 	/* report that everything is scanned and vacuumed */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
 	/* Clear the block number information */
-	vacrelstats->blkno = InvalidBlockNumber;
+	vacrel->blkno = InvalidBlockNumber;
 
 	pfree(frozen);
 
 	/* save stats for use later */
-	vacrelstats->tuples_deleted = tups_vacuumed;
-	vacrelstats->new_dead_tuples = nkeep;
+	vacrel->tuples_deleted = tups_vacuumed;
+	vacrel->new_dead_tuples = nkeep;
 
 	/* now we can compute the new value for pg_class.reltuples */
-	vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
-														  nblocks,
-														  vacrelstats->tupcount_pages,
-														  live_tuples);
+	vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->onerel, nblocks,
+													 vacrel->tupcount_pages,
+													 live_tuples);
 
 	/*
 	 * Also compute the total number of surviving heap entries.  In the
 	 * (unlikely) scenario that new_live_tuples is -1, take it as zero.
 	 */
-	vacrelstats->new_rel_tuples =
-		Max(vacrelstats->new_live_tuples, 0) + vacrelstats->new_dead_tuples;
+	vacrel->new_rel_tuples =
+		Max(vacrel->new_live_tuples, 0) + vacrel->new_dead_tuples;
 
 	/*
 	 * Release any remaining pin on visibility map page.
@@ -1763,10 +1748,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	if (dead_tuples->num_tuples > 0)
 	{
 		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(onerel, Irel, vacrelstats, lps, nindexes);
+		lazy_vacuum_all_indexes(vacrel);
 
 		/* Remove tuples from heap */
-		lazy_vacuum_heap(onerel, vacrelstats);
+		lazy_vacuum_heap(vacrel);
 	}
 
 	/*
@@ -1774,47 +1759,43 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * not there were indexes.
 	 */
 	if (blkno > next_fsm_block_to_vacuum)
-		FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
+		FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum, blkno);
 
 	/* report all blocks vacuumed */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Do post-vacuum cleanup */
-	if (vacrelstats->useindex)
-		lazy_cleanup_all_indexes(Irel, vacrelstats, lps, nindexes);
+	if (vacrel->useindex)
+		lazy_cleanup_all_indexes(vacrel);
 
-	/*
-	 * End parallel mode before updating index statistics as we cannot write
-	 * during parallel mode.
-	 */
-	if (ParallelVacuumIsActive(lps))
-		end_parallel_vacuum(vacrelstats->indstats, lps, nindexes);
+	/* Free resources managed by lazy_space_alloc() */
+	lazy_space_free(vacrel);
 
 	/* Update index statistics */
-	if (vacrelstats->useindex)
-		update_index_statistics(Irel, vacrelstats->indstats, nindexes);
+	if (vacrel->useindex)
+		update_index_statistics(vacrel);
 
 	/* If no indexes, make log report that lazy_vacuum_heap would've made */
 	if (vacuumed_pages)
 		ereport(elevel,
 				(errmsg("\"%s\": removed %.0f row versions in %u pages",
-						vacrelstats->relname,
+						vacrel->relname,
 						tups_vacuumed, vacuumed_pages)));
 
 	initStringInfo(&buf);
 	appendStringInfo(&buf,
 					 _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"),
-					 nkeep, OldestXmin);
+					 nkeep, vacrel->OldestXmin);
 	appendStringInfo(&buf, _("There were %.0f unused item identifiers.\n"),
 					 nunused);
 	appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
 									"Skipped %u pages due to buffer pins, ",
-									vacrelstats->pinskipped_pages),
-					 vacrelstats->pinskipped_pages);
+									vacrel->pinskipped_pages),
+					 vacrel->pinskipped_pages);
 	appendStringInfo(&buf, ngettext("%u frozen page.\n",
 									"%u frozen pages.\n",
-									vacrelstats->frozenskipped_pages),
-					 vacrelstats->frozenskipped_pages);
+									vacrel->frozenskipped_pages),
+					 vacrel->frozenskipped_pages);
 	appendStringInfo(&buf, ngettext("%u page is entirely empty.\n",
 									"%u pages are entirely empty.\n",
 									empty_pages),
@@ -1823,258 +1804,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 	ereport(elevel,
 			(errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
-					vacrelstats->relname,
+					vacrel->relname,
 					tups_vacuumed, num_tuples,
-					vacrelstats->scanned_pages, nblocks),
+					vacrel->scanned_pages, nblocks),
 			 errdetail_internal("%s", buf.data)));
 	pfree(buf.data);
 }
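Note that the parallel-vacuum setup that used to sit inline in lazy_scan_heap() (the begin_parallel_vacuum() block removed above) is now hidden behind lazy_space_alloc(), with lazy_space_free() ending parallel mode after the final index pass and before update_index_statistics(). Neither function body appears in this excerpt; the sketch below only illustrates the intended split, reusing the conditions the removed block tested (nworkers >= 0, indexes in use, more than one index, not a temporary relation). lazy_space_alloc_local() is a hypothetical stand-in for whatever the patch actually does for the serial dead-tuple allocation.

/*
 * Sketch of the new division of labour; not the patch's actual code.
 */
static void
lazy_space_alloc(LVRelState *vacrel, int nworkers, BlockNumber nblocks)
{
	if (nworkers >= 0 && vacrel->useindex && vacrel->nindexes > 1 &&
		!RelationUsesLocalBuffers(vacrel->onerel))
	{
		/* Try to set up a DSM-backed dead_tuples array shared with workers */
		vacrel->lps = begin_parallel_vacuum(RelationGetRelid(vacrel->onerel),
											vacrel->indrels, vacrel, nblocks,
											vacrel->nindexes, nworkers);
		if (vacrel->lps != NULL)
			return;
	}

	/* Serial case: local allocation sized from nblocks, as before */
	lazy_space_alloc_local(vacrel, nblocks);	/* hypothetical helper */
	vacrel->lps = NULL;
}

static void
lazy_space_free(LVRelState *vacrel)
{
	if (vacrel->lps == NULL)
		return;

	/*
	 * End parallel mode (copying index stats out of DSM) before the caller
	 * goes on to write to pg_class in update_index_statistics().
	 */
	end_parallel_vacuum(vacrel->indstats, vacrel->lps, vacrel->nindexes);
	vacrel->lps = NULL;
}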
 
-/*
- *	lazy_vacuum_all_indexes() -- vacuum all indexes of relation.
- *
- * We process the indexes serially unless we are doing parallel vacuum.
- */
-static void
-lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
-						LVRelStats *vacrelstats, LVParallelState *lps,
-						int nindexes)
-{
-	Assert(!IsParallelWorker());
-	Assert(nindexes > 0);
-
-	/* Log cleanup info before we touch indexes */
-	vacuum_log_cleanup_info(onerel, vacrelstats);
-
-	/* Report that we are now vacuuming indexes */
-	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
-								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
-
-	/* Perform index vacuuming with parallel workers for parallel vacuum. */
-	if (ParallelVacuumIsActive(lps))
-	{
-		/* Tell parallel workers to do index vacuuming */
-		lps->lvshared->for_cleanup = false;
-		lps->lvshared->first_time = false;
-
-		/*
-		 * We can only provide an approximate value of num_heap_tuples in
-		 * vacuum cases.
-		 */
-		lps->lvshared->reltuples = vacrelstats->old_live_tuples;
-		lps->lvshared->estimated_count = true;
-
-		lazy_parallel_vacuum_indexes(Irel, vacrelstats, lps, nindexes);
-	}
-	else
-	{
-		int			idx;
-
-		for (idx = 0; idx < nindexes; idx++)
-			lazy_vacuum_index(Irel[idx], &(vacrelstats->indstats[idx]),
-							  vacrelstats->dead_tuples,
-							  vacrelstats->old_live_tuples, vacrelstats);
-	}
-
-	/* Increase and report the number of index scans */
-	vacrelstats->num_index_scans++;
-	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
-								 vacrelstats->num_index_scans);
-}
-
-
-/*
- *	lazy_vacuum_heap() -- second pass over the heap
- *
- *		This routine marks dead tuples as unused and compacts out free
- *		space on their pages.  Pages not having dead tuples recorded from
- *		lazy_scan_heap are not visited at all.
- *
- * Note: the reason for doing this as a second pass is we cannot remove
- * the tuples until we've removed their index entries, and we want to
- * process index entry removal in batches as large as possible.
- */
-static void
-lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
-{
-	int			tupindex;
-	int			npages;
-	PGRUsage	ru0;
-	Buffer		vmbuffer = InvalidBuffer;
-	LVSavedErrInfo saved_err_info;
-
-	/* Report that we are now vacuuming the heap */
-	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
-								 PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
-
-	/* Update error traceback information */
-	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
-							 InvalidBlockNumber, InvalidOffsetNumber);
-
-	pg_rusage_init(&ru0);
-	npages = 0;
-
-	tupindex = 0;
-	while (tupindex < vacrelstats->dead_tuples->num_tuples)
-	{
-		BlockNumber tblk;
-		Buffer		buf;
-		Page		page;
-		Size		freespace;
-
-		vacuum_delay_point();
-
-		tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples->itemptrs[tupindex]);
-		vacrelstats->blkno = tblk;
-		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
-								 vac_strategy);
-		if (!ConditionalLockBufferForCleanup(buf))
-		{
-			ReleaseBuffer(buf);
-			++tupindex;
-			continue;
-		}
-		tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
-									&vmbuffer);
-
-		/* Now that we've compacted the page, record its available space */
-		page = BufferGetPage(buf);
-		freespace = PageGetHeapFreeSpace(page);
-
-		UnlockReleaseBuffer(buf);
-		RecordPageWithFreeSpace(onerel, tblk, freespace);
-		npages++;
-	}
-
-	/* Clear the block number information */
-	vacrelstats->blkno = InvalidBlockNumber;
-
-	if (BufferIsValid(vmbuffer))
-	{
-		ReleaseBuffer(vmbuffer);
-		vmbuffer = InvalidBuffer;
-	}
-
-	ereport(elevel,
-			(errmsg("\"%s\": removed %d row versions in %d pages",
-					vacrelstats->relname,
-					tupindex, npages),
-			 errdetail_internal("%s", pg_rusage_show(&ru0))));
-
-	/* Revert to the previous phase information for error traceback */
-	restore_vacuum_error_info(vacrelstats, &saved_err_info);
-}
-
-/*
- *	lazy_vacuum_page() -- free dead tuples on a page
- *					 and repair its fragmentation.
- *
- * Caller must hold pin and buffer cleanup lock on the buffer.
- *
- * tupindex is the index in vacrelstats->dead_tuples of the first dead
- * tuple for this page.  We assume the rest follow sequentially.
- * The return value is the first tupindex after the tuples of this page.
- */
-static int
-lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-				 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer)
-{
-	LVDeadTuples *dead_tuples = vacrelstats->dead_tuples;
-	Page		page = BufferGetPage(buffer);
-	OffsetNumber unused[MaxOffsetNumber];
-	int			uncnt = 0;
-	TransactionId visibility_cutoff_xid;
-	bool		all_frozen;
-	LVSavedErrInfo saved_err_info;
-
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
-
-	/* Update error traceback information */
-	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
-							 blkno, InvalidOffsetNumber);
-
-	START_CRIT_SECTION();
-
-	for (; tupindex < dead_tuples->num_tuples; tupindex++)
-	{
-		BlockNumber tblk;
-		OffsetNumber toff;
-		ItemId		itemid;
-
-		tblk = ItemPointerGetBlockNumber(&dead_tuples->itemptrs[tupindex]);
-		if (tblk != blkno)
-			break;				/* past end of tuples for this block */
-		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
-		itemid = PageGetItemId(page, toff);
-		ItemIdSetUnused(itemid);
-		unused[uncnt++] = toff;
-	}
-
-	PageRepairFragmentation(page);
-
-	/*
-	 * Mark buffer dirty before we write WAL.
-	 */
-	MarkBufferDirty(buffer);
-
-	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
-	{
-		XLogRecPtr	recptr;
-
-		recptr = log_heap_clean(onerel, buffer,
-								NULL, 0, NULL, 0,
-								unused, uncnt,
-								vacrelstats->latestRemovedXid);
-		PageSetLSN(page, recptr);
-	}
-
-	/*
-	 * End critical section, so we safely can do visibility tests (which
-	 * possibly need to perform IO and allocate memory!). If we crash now the
-	 * page (including the corresponding vm bit) might not be marked all
-	 * visible, but that's fine. A later vacuum will fix that.
-	 */
-	END_CRIT_SECTION();
-
-	/*
-	 * Now that we have removed the dead tuples from the page, once again
-	 * check if the page has become all-visible.  The page is already marked
-	 * dirty, exclusively locked, and, if needed, a full page image has been
-	 * emitted in the log_heap_clean() above.
-	 */
-	if (heap_page_is_all_visible(onerel, buffer, vacrelstats,
-								 &visibility_cutoff_xid,
-								 &all_frozen))
-		PageSetAllVisible(page);
-
-	/*
-	 * All the changes to the heap page have been done. If the all-visible
-	 * flag is now set, also set the VM all-visible bit (and, if possible, the
-	 * all-frozen bit) unless this has already been done previously.
-	 */
-	if (PageIsAllVisible(page))
-	{
-		uint8		vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
-		uint8		flags = 0;
-
-		/* Set the VM all-frozen bit to flag, if needed */
-		if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
-			flags |= VISIBILITYMAP_ALL_VISIBLE;
-		if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
-			flags |= VISIBILITYMAP_ALL_FROZEN;
-
-		Assert(BufferIsValid(*vmbuffer));
-		if (flags != 0)
-			visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr,
-							  *vmbuffer, visibility_cutoff_xid, flags);
-	}
-
-	/* Revert to the previous phase information for error traceback */
-	restore_vacuum_error_info(vacrelstats, &saved_err_info);
-	return tupindex;
-}
-
 /*
  *	lazy_check_needs_freeze() -- scan page to see if any tuples
  *					 need to be cleaned to avoid wraparound
@@ -2083,7 +1819,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
  * Also returns a flag indicating whether page contains any tuples at all.
  */
 static bool
-lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelStats *vacrelstats)
+lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
 {
 	Page		page = BufferGetPage(buf);
 	OffsetNumber offnum,
@@ -2112,7 +1848,7 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelStats *vacrelstats)
 		 * Set the offset number so that we can display it along with any
 		 * error that occurred while processing this tuple.
 		 */
-		vacrelstats->offnum = offnum;
+		vacrel->offnum = offnum;
 		itemid = PageGetItemId(page, offnum);
 
 		/* this should match hastup test in count_nondeletable_pages() */
@@ -2125,363 +1861,79 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelStats *vacrelstats)
 
 		tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
 
-		if (heap_tuple_needs_freeze(tupleheader, FreezeLimit,
-									MultiXactCutoff, buf))
+		if (heap_tuple_needs_freeze(tupleheader, vacrel->FreezeLimit,
+									vacrel->MultiXactCutoff, buf))
 			break;
 	}							/* scan along page */
 
 	/* Clear the offset information once we have processed the given page. */
-	vacrelstats->offnum = InvalidOffsetNumber;
+	vacrel->offnum = InvalidOffsetNumber;
 
 	return (offnum <= maxoff);
 }
 
 /*
- * Perform index vacuum or index cleanup with parallel workers.  This function
- * must be used by the parallel vacuum leader process.  The caller must set
- * lps->lvshared->for_cleanup to indicate whether to perform vacuum or
- * cleanup.
- */
-static void
-lazy_parallel_vacuum_indexes(Relation *Irel, LVRelStats *vacrelstats,
-							 LVParallelState *lps, int nindexes)
-{
-	int			nworkers;
-
-	Assert(!IsParallelWorker());
-	Assert(ParallelVacuumIsActive(lps));
-	Assert(nindexes > 0);
-
-	/* Determine the number of parallel workers to launch */
-	if (lps->lvshared->for_cleanup)
-	{
-		if (lps->lvshared->first_time)
-			nworkers = lps->nindexes_parallel_cleanup +
-				lps->nindexes_parallel_condcleanup;
-		else
-			nworkers = lps->nindexes_parallel_cleanup;
-	}
-	else
-		nworkers = lps->nindexes_parallel_bulkdel;
-
-	/* The leader process will participate */
-	nworkers--;
-
-	/*
-	 * It is possible that parallel context is initialized with fewer workers
-	 * than the number of indexes that need a separate worker in the current
-	 * phase, so we need to consider it.  See compute_parallel_vacuum_workers.
-	 */
-	nworkers = Min(nworkers, lps->pcxt->nworkers);
-
-	/* Setup the shared cost-based vacuum delay and launch workers */
-	if (nworkers > 0)
-	{
-		if (vacrelstats->num_index_scans > 0)
-		{
-			/* Reset the parallel index processing counter */
-			pg_atomic_write_u32(&(lps->lvshared->idx), 0);
-
-			/* Reinitialize the parallel context to relaunch parallel workers */
-			ReinitializeParallelDSM(lps->pcxt);
-		}
-
-		/*
-		 * Set up shared cost balance and the number of active workers for
-		 * vacuum delay.  We need to do this before launching workers as
-		 * otherwise, they might not see the updated values for these
-		 * parameters.
-		 */
-		pg_atomic_write_u32(&(lps->lvshared->cost_balance), VacuumCostBalance);
-		pg_atomic_write_u32(&(lps->lvshared->active_nworkers), 0);
-
-		/*
-		 * The number of workers can vary between bulkdelete and cleanup
-		 * phase.
-		 */
-		ReinitializeParallelWorkers(lps->pcxt, nworkers);
-
-		LaunchParallelWorkers(lps->pcxt);
-
-		if (lps->pcxt->nworkers_launched > 0)
-		{
-			/*
-			 * Reset the local cost values for leader backend as we have
-			 * already accumulated the remaining balance of heap.
-			 */
-			VacuumCostBalance = 0;
-			VacuumCostBalanceLocal = 0;
-
-			/* Enable shared cost balance for leader backend */
-			VacuumSharedCostBalance = &(lps->lvshared->cost_balance);
-			VacuumActiveNWorkers = &(lps->lvshared->active_nworkers);
-		}
-
-		if (lps->lvshared->for_cleanup)
-			ereport(elevel,
-					(errmsg(ngettext("launched %d parallel vacuum worker for index cleanup (planned: %d)",
-									 "launched %d parallel vacuum workers for index cleanup (planned: %d)",
-									 lps->pcxt->nworkers_launched),
-							lps->pcxt->nworkers_launched, nworkers)));
-		else
-			ereport(elevel,
-					(errmsg(ngettext("launched %d parallel vacuum worker for index vacuuming (planned: %d)",
-									 "launched %d parallel vacuum workers for index vacuuming (planned: %d)",
-									 lps->pcxt->nworkers_launched),
-							lps->pcxt->nworkers_launched, nworkers)));
-	}
-
-	/* Process the indexes that can be processed by only leader process */
-	vacuum_indexes_leader(Irel, vacrelstats, lps, nindexes);
-
-	/*
-	 * Join as a parallel worker.  The leader process alone processes all the
-	 * indexes in the case where no workers are launched.
-	 */
-	parallel_vacuum_index(Irel, lps->lvshared, vacrelstats->dead_tuples,
-						  nindexes, vacrelstats);
-
-	/*
-	 * Next, accumulate buffer and WAL usage.  (This must wait for the workers
-	 * to finish, or we might get incomplete data.)
-	 */
-	if (nworkers > 0)
-	{
-		int			i;
-
-		/* Wait for all vacuum workers to finish */
-		WaitForParallelWorkersToFinish(lps->pcxt);
-
-		for (i = 0; i < lps->pcxt->nworkers_launched; i++)
-			InstrAccumParallelQuery(&lps->buffer_usage[i], &lps->wal_usage[i]);
-	}
-
-	/*
-	 * Carry the shared balance value to heap scan and disable shared costing
-	 */
-	if (VacuumSharedCostBalance)
-	{
-		VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance);
-		VacuumSharedCostBalance = NULL;
-		VacuumActiveNWorkers = NULL;
-	}
-}
-
-/*
- * Index vacuum/cleanup routine used by the leader process and parallel
- * vacuum worker processes to process the indexes in parallel.
- */
-static void
-parallel_vacuum_index(Relation *Irel, LVShared *lvshared,
-					  LVDeadTuples *dead_tuples, int nindexes,
-					  LVRelStats *vacrelstats)
-{
-	/*
-	 * Increment the active worker count if we are able to launch any worker.
-	 */
-	if (VacuumActiveNWorkers)
-		pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
-
-	/* Loop until all indexes are vacuumed */
-	for (;;)
-	{
-		int			idx;
-		LVSharedIndStats *shared_indstats;
-
-		/* Get an index number to process */
-		idx = pg_atomic_fetch_add_u32(&(lvshared->idx), 1);
-
-		/* Done for all indexes? */
-		if (idx >= nindexes)
-			break;
-
-		/* Get the index statistics of this index from DSM */
-		shared_indstats = get_indstats(lvshared, idx);
-
-		/*
-		 * Skip processing indexes that don't participate in parallel
-		 * operation
-		 */
-		if (shared_indstats == NULL ||
-			skip_parallel_vacuum_index(Irel[idx], lvshared))
-			continue;
-
-		/* Do vacuum or cleanup of the index */
-		vacuum_one_index(Irel[idx], &(vacrelstats->indstats[idx]), lvshared,
-						 shared_indstats, dead_tuples, vacrelstats);
-	}
-
-	/*
-	 * We have completed the index vacuum so decrement the active worker
-	 * count.
-	 */
-	if (VacuumActiveNWorkers)
-		pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
-}
-
-/*
- * Vacuum or cleanup indexes that can be processed by only the leader process
- * because these indexes don't support parallel operation at that phase.
- */
-static void
-vacuum_indexes_leader(Relation *Irel, LVRelStats *vacrelstats,
-					  LVParallelState *lps, int nindexes)
-{
-	int			i;
-
-	Assert(!IsParallelWorker());
-
-	/*
-	 * Increment the active worker count if we are able to launch any worker.
-	 */
-	if (VacuumActiveNWorkers)
-		pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
-
-	for (i = 0; i < nindexes; i++)
-	{
-		LVSharedIndStats *shared_indstats;
-
-		shared_indstats = get_indstats(lps->lvshared, i);
-
-		/* Process the indexes skipped by parallel workers */
-		if (shared_indstats == NULL ||
-			skip_parallel_vacuum_index(Irel[i], lps->lvshared))
-			vacuum_one_index(Irel[i], &(vacrelstats->indstats[i]), lps->lvshared,
-							 shared_indstats, vacrelstats->dead_tuples,
-							 vacrelstats);
-	}
-
-	/*
-	 * We have completed the index vacuum so decrement the active worker
-	 * count.
-	 */
-	if (VacuumActiveNWorkers)
-		pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
-}
-
-/*
- * Vacuum or cleanup index either by leader process or by one of the worker
- * process.  After processing the index this function copies the index
- * statistics returned from ambulkdelete and amvacuumcleanup to the DSM
- * segment.
- */
-static void
-vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
-				 LVShared *lvshared, LVSharedIndStats *shared_indstats,
-				 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats)
-{
-	IndexBulkDeleteResult *bulkdelete_res = NULL;
-
-	if (shared_indstats)
-	{
-		/* Get the space for IndexBulkDeleteResult */
-		bulkdelete_res = &(shared_indstats->stats);
-
-		/*
-		 * Update the pointer to the corresponding bulk-deletion result if
-		 * someone has already updated it.
-		 */
-		if (shared_indstats->updated && *stats == NULL)
-			*stats = bulkdelete_res;
-	}
-
-	/* Do vacuum or cleanup of the index */
-	if (lvshared->for_cleanup)
-		lazy_cleanup_index(indrel, stats, lvshared->reltuples,
-						   lvshared->estimated_count, vacrelstats);
-	else
-		lazy_vacuum_index(indrel, stats, dead_tuples,
-						  lvshared->reltuples, vacrelstats);
-
-	/*
-	 * Copy the index bulk-deletion result returned from ambulkdelete and
-	 * amvacuumcleanup to the DSM segment if it's the first cycle because they
-	 * allocate locally and it's possible that an index will be vacuumed by a
-	 * different vacuum process the next cycle.  Copying the result normally
-	 * happens only the first time an index is vacuumed.  For any additional
-	 * vacuum pass, we directly point to the result on the DSM segment and
-	 * pass it to vacuum index APIs so that workers can update it directly.
-	 *
-	 * Since all vacuum workers write the bulk-deletion result at different
-	 * slots we can write them without locking.
-	 */
-	if (shared_indstats && !shared_indstats->updated && *stats != NULL)
-	{
-		memcpy(bulkdelete_res, *stats, sizeof(IndexBulkDeleteResult));
-		shared_indstats->updated = true;
-
-		/*
-		 * Now that stats[idx] points to the DSM segment, we don't need the
-		 * locally allocated results.
-		 */
-		pfree(*stats);
-		*stats = bulkdelete_res;
-	}
-}
-
-/*
- *	lazy_cleanup_all_indexes() -- cleanup all indexes of relation.
+ *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
  *
- * Cleanup indexes.  We process the indexes serially unless we are doing
- * parallel vacuum.
+ * Should only be called through lazy_vacuum_all_pruned_items().
+ *
+ * We don't need a latestRemovedXid value for recovery conflicts here -- we
+ * rely on conflicts from heap pruning instead (i.e. a heap_page_prune() call
+ * that took place earlier, usually though not always during the ongoing
+ * VACUUM operation).
  */
 static void
-lazy_cleanup_all_indexes(Relation *Irel, LVRelStats *vacrelstats,
-						 LVParallelState *lps, int nindexes)
+lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
-	int			idx;
+	Assert(vacrel->nindexes > 0);
+	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
+	Assert(MultiXactIdIsValid(vacrel->relminmxid));
 
-	Assert(!IsParallelWorker());
-	Assert(nindexes > 0);
+	/* Log cleanup info before we touch indexes */
+	vacuum_log_cleanup_info(vacrel);
 
-	/* Report that we are now cleaning up indexes */
+	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
-								 PROGRESS_VACUUM_PHASE_INDEX_CLEANUP);
+								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
 
-	/*
-	 * If parallel vacuum is active we perform index cleanup with parallel
-	 * workers.
-	 */
-	if (ParallelVacuumIsActive(lps))
+	if (!vacrel->lps)
 	{
-		/* Tell parallel workers to do index cleanup */
-		lps->lvshared->for_cleanup = true;
-		lps->lvshared->first_time =
-			(vacrelstats->num_index_scans == 0);
+		for (int idx = 0; idx < vacrel->nindexes; idx++)
+		{
+			Relation	indrel = vacrel->indrels[idx];
+			IndexBulkDeleteResult *istat = vacrel->indstats[idx];
 
-		/*
-		 * Now we can provide a better estimate of total number of surviving
-		 * tuples (we assume indexes are more interested in that than in the
-		 * number of nominally live tuples).
-		 */
-		lps->lvshared->reltuples = vacrelstats->new_rel_tuples;
-		lps->lvshared->estimated_count =
-			(vacrelstats->tupcount_pages < vacrelstats->rel_pages);
-
-		lazy_parallel_vacuum_indexes(Irel, vacrelstats, lps, nindexes);
+			vacrel->indstats[idx] =
+				lazy_vacuum_one_index(indrel, istat, vacrel->old_live_tuples,
+									  vacrel);
+		}
 	}
 	else
 	{
-		for (idx = 0; idx < nindexes; idx++)
-			lazy_cleanup_index(Irel[idx], &(vacrelstats->indstats[idx]),
-							   vacrelstats->new_rel_tuples,
-							   vacrelstats->tupcount_pages < vacrelstats->rel_pages,
-							   vacrelstats);
+		/* Outsource everything to parallel variant */
+		do_parallel_lazy_vacuum_all_indexes(vacrel);
 	}
+
+	/* Increase and report the number of index scans */
+	vacrel->num_index_scans++;
+	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
+								 vacrel->num_index_scans);
 }
 
 /*
- *	lazy_vacuum_index() -- vacuum one index relation.
+ *	lazy_vacuum_one_index() -- vacuum index relation.
  *
  *		Delete all the index entries pointing to tuples listed in
  *		dead_tuples, and update running statistics.
  *
  *		reltuples is the number of heap tuples to be passed to the
  *		bulkdelete callback.  It's always assumed to be estimated.
+ *
+ * Returns bulk delete stats derived from input stats
  */
-static void
-lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
-				  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats)
+static IndexBulkDeleteResult *
+lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
+					  double reltuples, LVRelState *vacrel)
 {
 	IndexVacuumInfo ivinfo;
 	PGRUsage	ru0;
@@ -2495,7 +1947,7 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	ivinfo.estimated_count = true;
 	ivinfo.message_level = elevel;
 	ivinfo.num_heap_tuples = reltuples;
-	ivinfo.strategy = vac_strategy;
+	ivinfo.strategy = vacrel->vac_strategy;
 
 	/*
 	 * Update error traceback information.
@@ -2503,38 +1955,79 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	 * The index name is saved during this phase and restored immediately
 	 * after this phase.  See vacuum_error_callback.
 	 */
-	Assert(vacrelstats->indname == NULL);
-	vacrelstats->indname = pstrdup(RelationGetRelationName(indrel));
-	update_vacuum_error_info(vacrelstats, &saved_err_info,
+	Assert(vacrel->indname == NULL);
+	vacrel->indname = pstrdup(RelationGetRelationName(indrel));
+	update_vacuum_error_info(vacrel, &saved_err_info,
 							 VACUUM_ERRCB_PHASE_VACUUM_INDEX,
 							 InvalidBlockNumber, InvalidOffsetNumber);
 
 	/* Do bulk deletion */
-	*stats = index_bulk_delete(&ivinfo, *stats,
-							   lazy_tid_reaped, (void *) dead_tuples);
+	istat = index_bulk_delete(&ivinfo, istat, lazy_tid_reaped,
+							  (void *) vacrel->dead_tuples);
 
 	ereport(elevel,
 			(errmsg("scanned index \"%s\" to remove %d row versions",
-					vacrelstats->indname,
-					dead_tuples->num_tuples),
+					vacrel->indname, vacrel->dead_tuples->num_tuples),
 			 errdetail_internal("%s", pg_rusage_show(&ru0))));
 
 	/* Revert to the previous phase information for error traceback */
-	restore_vacuum_error_info(vacrelstats, &saved_err_info);
-	pfree(vacrelstats->indname);
-	vacrelstats->indname = NULL;
+	restore_vacuum_error_info(vacrel, &saved_err_info);
+	pfree(vacrel->indname);
+	vacrel->indname = NULL;
+
+	return istat;
 }
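For completeness, the callback handed to index_bulk_delete() above, lazy_tid_reaped(), is not shown in this excerpt. In vacuumlazy.c it simply binary-searches the dead-TID array that lazy_scan_heap() collected in heap order (LVDeadTuples being the existing struct with num_tuples and the itemptrs[] array), roughly along these lines (paraphrased, not copied from the patch):

static bool
lazy_tid_reaped(ItemPointer itemptr, void *state)
{
	LVDeadTuples *dead_tuples = (LVDeadTuples *) state;
	ItemPointer res;

	/* itemptrs[] is kept in (block, offset) order, so bsearch works */
	res = (ItemPointer) bsearch((void *) itemptr,
								(void *) dead_tuples->itemptrs,
								dead_tuples->num_tuples,
								sizeof(ItemPointerData),
								vac_cmp_itemptr);

	return (res != NULL);
}

Every dead TID check during ambulkdelete() goes through this, which is why the heap pass batches as many dead TIDs as it can before each round of index vacuuming.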
 
 /*
- *	lazy_cleanup_index() -- do post-vacuum cleanup for one index relation.
+ *	lazy_cleanup_all_indexes() -- cleanup all indexes of relation.
+ *
+ * Cleanup indexes.  We process the indexes serially unless we are doing
+ * parallel vacuum.
+ */
+static void
+lazy_cleanup_all_indexes(LVRelState *vacrel)
+{
+	Assert(vacrel->nindexes > 0);
+
+	/* Report that we are now cleaning up indexes */
+	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+								 PROGRESS_VACUUM_PHASE_INDEX_CLEANUP);
+
+	if (!vacrel->lps)
+	{
+		double		reltuples = vacrel->new_rel_tuples;
+		bool		estimated_count =
+		vacrel->tupcount_pages < vacrel->rel_pages;
+
+		for (int idx = 0; idx < vacrel->nindexes; idx++)
+		{
+			Relation	indrel = vacrel->indrels[idx];
+			IndexBulkDeleteResult *istat = vacrel->indstats[idx];
+
+			vacrel->indstats[idx] =
+				lazy_cleanup_one_index(indrel, istat, reltuples,
+									   estimated_count, vacrel);
+		}
+	}
+	else
+	{
+		/* Outsource everything to parallel variant */
+		do_parallel_lazy_cleanup_all_indexes(vacrel);
+	}
+}
+
+/*
+ *	lazy_cleanup_one_index() -- do post-vacuum cleanup for index relation.
  *
  *		reltuples is the number of heap tuples and estimated_count is true
  *		if reltuples is an estimated value.
+ *
+ * Returns bulk delete stats derived from input stats
  */
-static void
-lazy_cleanup_index(Relation indrel,
-				   IndexBulkDeleteResult **stats,
-				   double reltuples, bool estimated_count, LVRelStats *vacrelstats)
+static IndexBulkDeleteResult *
+lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
+					   double reltuples, bool estimated_count,
+					   LVRelState *vacrel)
 {
 	IndexVacuumInfo ivinfo;
 	PGRUsage	ru0;
@@ -2549,7 +2042,7 @@ lazy_cleanup_index(Relation indrel,
 	ivinfo.message_level = elevel;
 
 	ivinfo.num_heap_tuples = reltuples;
-	ivinfo.strategy = vac_strategy;
+	ivinfo.strategy = vacrel->vac_strategy;
 
 	/*
 	 * Update error traceback information.
@@ -2557,35 +2050,259 @@ lazy_cleanup_index(Relation indrel,
 	 * The index name is saved during this phase and restored immediately
 	 * after this phase.  See vacuum_error_callback.
 	 */
-	Assert(vacrelstats->indname == NULL);
-	vacrelstats->indname = pstrdup(RelationGetRelationName(indrel));
-	update_vacuum_error_info(vacrelstats, &saved_err_info,
+	Assert(vacrel->indname == NULL);
+	vacrel->indname = pstrdup(RelationGetRelationName(indrel));
+	update_vacuum_error_info(vacrel, &saved_err_info,
 							 VACUUM_ERRCB_PHASE_INDEX_CLEANUP,
 							 InvalidBlockNumber, InvalidOffsetNumber);
 
-	*stats = index_vacuum_cleanup(&ivinfo, *stats);
+	istat = index_vacuum_cleanup(&ivinfo, istat);
 
-	if (*stats)
+	if (istat)
 	{
 		ereport(elevel,
 				(errmsg("index \"%s\" now contains %.0f row versions in %u pages",
 						RelationGetRelationName(indrel),
-						(*stats)->num_index_tuples,
-						(*stats)->num_pages),
+						istat->num_index_tuples,
+						istat->num_pages),
 				 errdetail("%.0f index row versions were removed.\n"
 						   "%u index pages were newly deleted.\n"
 						   "%u index pages are currently deleted, of which %u are currently reusable.\n"
 						   "%s.",
-						   (*stats)->tuples_removed,
-						   (*stats)->pages_newly_deleted,
-						   (*stats)->pages_deleted, (*stats)->pages_free,
+						   istat->tuples_removed,
+						   istat->pages_newly_deleted,
+						   istat->pages_deleted, istat->pages_free,
 						   pg_rusage_show(&ru0))));
 	}
 
 	/* Revert to the previous phase information for error traceback */
-	restore_vacuum_error_info(vacrelstats, &saved_err_info);
-	pfree(vacrelstats->indname);
-	vacrelstats->indname = NULL;
+	restore_vacuum_error_info(vacrel, &saved_err_info);
+	pfree(vacrel->indname);
+	vacrel->indname = NULL;
+
+	return istat;
+}
+
+/*
+ *	lazy_vacuum_heap() -- second pass over the heap for the two-pass strategy
+ *
+ *		This routine marks dead tuples as unused and compacts out free
+ *		space on their pages.  Pages not having dead tuples recorded from
+ *		lazy_scan_heap are not visited at all.
+ *
+ * Should only be called through lazy_vacuum_all_pruned_items().
+ *
+ * We don't need a latestRemovedXid value for recovery conflicts here -- we
+ * rely on conflicts from heap pruning instead (i.e. a heap_page_prune() call
+ * that took place earlier, usually though not always during the ongoing
+ * VACUUM operation).
+ */
+static void
+lazy_vacuum_heap(LVRelState *vacrel)
+{
+	int			tupindex;
+	int			npages;
+	PGRUsage	ru0;
+	Buffer		vmbuffer = InvalidBuffer;
+	LVSavedErrInfo saved_err_info;
+
+	/* Report that we are now vacuuming the heap */
+	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+								 PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
+
+	/* Update error traceback information */
+	update_vacuum_error_info(vacrel, &saved_err_info,
+							 VACUUM_ERRCB_PHASE_VACUUM_HEAP,
+							 InvalidBlockNumber, InvalidOffsetNumber);
+
+	pg_rusage_init(&ru0);
+	npages = 0;
+
+	tupindex = 0;
+	while (tupindex < vacrel->dead_tuples->num_tuples)
+	{
+		BlockNumber tblk;
+		Buffer		buf;
+		Page		page;
+		Size		freespace;
+
+		vacuum_delay_point();
+
+		tblk = ItemPointerGetBlockNumber(&vacrel->dead_tuples->itemptrs[tupindex]);
+		vacrel->blkno = tblk;
+		buf = ReadBufferExtended(vacrel->onerel, MAIN_FORKNUM, tblk,
+								 RBM_NORMAL, vacrel->vac_strategy);
+		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+		tupindex = lazy_vacuum_page(vacrel, tblk, buf, tupindex,
+									&vmbuffer);
+
+		/* Now that we've compacted the page, record its available space */
+		page = BufferGetPage(buf);
+		freespace = PageGetHeapFreeSpace(page);
+
+		UnlockReleaseBuffer(buf);
+		RecordPageWithFreeSpace(vacrel->onerel, tblk, freespace);
+		npages++;
+	}
+
+	/* Clear the block number information */
+	vacrel->blkno = InvalidBlockNumber;
+
+	if (BufferIsValid(vmbuffer))
+	{
+		ReleaseBuffer(vmbuffer);
+		vmbuffer = InvalidBuffer;
+	}
+
+	ereport(elevel,
+			(errmsg("\"%s\": removed %d row versions in %d pages",
+					vacrel->relname, tupindex, npages),
+			 errdetail_internal("%s", pg_rusage_show(&ru0))));
+
+	/* Revert to the previous phase information for error traceback */
+	restore_vacuum_error_info(vacrel, &saved_err_info);
+}
+
+/*
+ *	lazy_vacuum_page() -- free dead tuples on a page
+ *					 and repair its fragmentation.
+ *
+ * Caller must hold pin and buffer cleanup lock on the buffer.
+ *
+ * tupindex is the index in vacrel->dead_tuples of the first dead
+ * tuple for this page.  We assume the rest follow sequentially.
+ * The return value is the first tupindex after the tuples of this page.
+ */
+static int
+lazy_vacuum_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
+				 int tupindex, Buffer *vmbuffer)
+{
+	LVDeadTuples *dead_tuples = vacrel->dead_tuples;
+	Page		page = BufferGetPage(buffer);
+	OffsetNumber unused[MaxOffsetNumber];
+	int			uncnt = 0;
+	TransactionId visibility_cutoff_xid;
+	bool		all_frozen;
+	LVSavedErrInfo saved_err_info;
+
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+
+	/* Update error traceback information */
+	update_vacuum_error_info(vacrel, &saved_err_info,
+							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
+							 InvalidOffsetNumber);
+
+	START_CRIT_SECTION();
+
+	for (; tupindex < dead_tuples->num_tuples; tupindex++)
+	{
+		BlockNumber tblk;
+		OffsetNumber toff;
+		ItemId		itemid;
+
+		tblk = ItemPointerGetBlockNumber(&dead_tuples->itemptrs[tupindex]);
+		if (tblk != blkno)
+			break;				/* past end of tuples for this block */
+		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
+		itemid = PageGetItemId(page, toff);
+		ItemIdSetUnused(itemid);
+		unused[uncnt++] = toff;
+	}
+
+	PageRepairFragmentation(page);
+
+	/*
+	 * Mark buffer dirty before we write WAL.
+	 */
+	MarkBufferDirty(buffer);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(vacrel->onerel))
+	{
+		XLogRecPtr	recptr;
+
+		recptr = log_heap_clean(vacrel->onerel, buffer,
+								NULL, 0, NULL, 0,
+								unused, uncnt,
+								vacrel->latestRemovedXid);
+		PageSetLSN(page, recptr);
+	}
+
+	/*
+	 * End critical section, so we safely can do visibility tests (which
+	 * possibly need to perform IO and allocate memory!). If we crash now the
+	 * page (including the corresponding vm bit) might not be marked all
+	 * visible, but that's fine. A later vacuum will fix that.
+	 */
+	END_CRIT_SECTION();
+
+	/*
+	 * Now that we have removed the dead tuples from the page, once again
+	 * check if the page has become all-visible.  The page is already marked
+	 * dirty, exclusively locked, and, if needed, a full page image has been
+	 * emitted in the log_heap_clean() above.
+	 */
+	if (heap_page_is_all_visible(vacrel, buffer, &visibility_cutoff_xid,
+								 &all_frozen))
+		PageSetAllVisible(page);
+
+	/*
+	 * All the changes to the heap page have been done. If the all-visible
+	 * flag is now set, also set the VM all-visible bit (and, if possible, the
+	 * all-frozen bit) unless this has already been done previously.
+	 */
+	if (PageIsAllVisible(page))
+	{
+		uint8		vm_status = visibilitymap_get_status(vacrel->onerel, blkno, vmbuffer);
+		uint8		flags = 0;
+
+		/* Set the VM all-frozen bit to flag, if needed */
+		if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
+			flags |= VISIBILITYMAP_ALL_VISIBLE;
+		if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
+			flags |= VISIBILITYMAP_ALL_FROZEN;
+
+		Assert(BufferIsValid(*vmbuffer));
+		if (flags != 0)
+			visibilitymap_set(vacrel->onerel, blkno, buffer, InvalidXLogRecPtr,
+							  *vmbuffer, visibility_cutoff_xid, flags);
+	}
+
+	/* Revert to the previous phase information for error traceback */
+	restore_vacuum_error_info(vacrel, &saved_err_info);
+	return tupindex;
+}
+
+/*
+ * Update index statistics in pg_class if the statistics are accurate.
+ */
+static void
+update_index_statistics(LVRelState *vacrel)
+{
+	Relation   *indrels = vacrel->indrels;
+	int			nindexes = vacrel->nindexes;
+	IndexBulkDeleteResult **indstats = vacrel->indstats;
+
+	Assert(!IsInParallelMode());
+
+	for (int idx = 0; idx < nindexes; idx++)
+	{
+		Relation	indrel = indrels[idx];
+		IndexBulkDeleteResult *istat = indstats[idx];
+
+		if (istat == NULL || istat->estimated_count)
+			continue;
+
+		/* Update index statistics */
+		vac_update_relstats(indrel,
+							istat->num_pages,
+							istat->num_index_tuples,
+							0,
+							false,
+							InvalidTransactionId,
+							InvalidMultiXactId,
+							false);
+	}
 }
 
 /*
@@ -2608,17 +2325,17 @@ lazy_cleanup_index(Relation indrel,
  * careful to depend only on fields that lazy_scan_heap updates on-the-fly.
  */
 static bool
-should_attempt_truncation(VacuumParams *params, LVRelStats *vacrelstats)
+should_attempt_truncation(LVRelState *vacrel, VacuumParams *params)
 {
 	BlockNumber possibly_freeable;
 
 	if (params->truncate == VACOPT_TERNARY_DISABLED)
 		return false;
 
-	possibly_freeable = vacrelstats->rel_pages - vacrelstats->nonempty_pages;
+	possibly_freeable = vacrel->rel_pages - vacrel->nonempty_pages;
 	if (possibly_freeable > 0 &&
 		(possibly_freeable >= REL_TRUNCATE_MINIMUM ||
-		 possibly_freeable >= vacrelstats->rel_pages / REL_TRUNCATE_FRACTION) &&
+		 possibly_freeable >= vacrel->rel_pages / REL_TRUNCATE_FRACTION) &&
 		old_snapshot_threshold < 0)
 		return true;
 	else
@@ -2629,9 +2346,10 @@ should_attempt_truncation(VacuumParams *params, LVRelStats *vacrelstats)
  * lazy_truncate_heap - try to truncate off any empty pages at the end
  */
 static void
-lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
+lazy_truncate_heap(LVRelState *vacrel)
 {
-	BlockNumber old_rel_pages = vacrelstats->rel_pages;
+	Relation	onerel = vacrel->onerel;
+	BlockNumber old_rel_pages = vacrel->rel_pages;
 	BlockNumber new_rel_pages;
 	int			lock_retry;
 
@@ -2655,7 +2373,7 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 		 * (which is quite possible considering we already hold a lower-grade
 		 * lock).
 		 */
-		vacrelstats->lock_waiter_detected = false;
+		vacrel->lock_waiter_detected = false;
 		lock_retry = 0;
 		while (true)
 		{
@@ -2675,10 +2393,10 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 				 * We failed to establish the lock in the specified number of
 				 * retries. This means we give up truncating.
 				 */
-				vacrelstats->lock_waiter_detected = true;
+				vacrel->lock_waiter_detected = true;
 				ereport(elevel,
 						(errmsg("\"%s\": stopping truncate due to conflicting lock request",
-								vacrelstats->relname)));
+								vacrel->relname)));
 				return;
 			}
 
@@ -2694,11 +2412,11 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 		if (new_rel_pages != old_rel_pages)
 		{
 			/*
-			 * Note: we intentionally don't update vacrelstats->rel_pages with
-			 * the new rel size here.  If we did, it would amount to assuming
-			 * that the new pages are empty, which is unlikely. Leaving the
-			 * numbers alone amounts to assuming that the new pages have the
-			 * same tuple density as existing ones, which is less unlikely.
+			 * Note: we intentionally don't update vacrel->rel_pages with the
+			 * new rel size here.  If we did, it would amount to assuming that
+			 * the new pages are empty, which is unlikely. Leaving the numbers
+			 * alone amounts to assuming that the new pages have the same
+			 * tuple density as existing ones, which is less unlikely.
 			 */
 			UnlockRelation(onerel, AccessExclusiveLock);
 			return;
@@ -2710,8 +2428,8 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 		 * other backends could have added tuples to these pages whilst we
 		 * were vacuuming.
 		 */
-		new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
-		vacrelstats->blkno = new_rel_pages;
+		new_rel_pages = count_nondeletable_pages(vacrel);
+		vacrel->blkno = new_rel_pages;
 
 		if (new_rel_pages >= old_rel_pages)
 		{
@@ -2739,18 +2457,18 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 		 * without also touching reltuples, since the tuple count wasn't
 		 * changed by the truncation.
 		 */
-		vacrelstats->pages_removed += old_rel_pages - new_rel_pages;
-		vacrelstats->rel_pages = new_rel_pages;
+		vacrel->pages_removed += old_rel_pages - new_rel_pages;
+		vacrel->rel_pages = new_rel_pages;
 
 		ereport(elevel,
 				(errmsg("\"%s\": truncated %u to %u pages",
-						vacrelstats->relname,
+						vacrel->relname,
 						old_rel_pages, new_rel_pages),
 				 errdetail_internal("%s",
 									pg_rusage_show(&ru0))));
 		old_rel_pages = new_rel_pages;
-	} while (new_rel_pages > vacrelstats->nonempty_pages &&
-			 vacrelstats->lock_waiter_detected);
+	} while (new_rel_pages > vacrel->nonempty_pages &&
+			 vacrel->lock_waiter_detected);
 }
 
 /*
@@ -2759,8 +2477,9 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
  * Returns number of nondeletable pages (last nonempty page + 1).
  */
 static BlockNumber
-count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
+count_nondeletable_pages(LVRelState *vacrel)
 {
+	Relation	onerel = vacrel->onerel;
 	BlockNumber blkno;
 	BlockNumber prefetchedUntil;
 	instr_time	starttime;
@@ -2774,11 +2493,11 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 	 * unsigned.)  To make the scan faster, we prefetch a few blocks at a time
 	 * in forward direction, so that OS-level readahead can kick in.
 	 */
-	blkno = vacrelstats->rel_pages;
+	blkno = vacrel->rel_pages;
 	StaticAssertStmt((PREFETCH_SIZE & (PREFETCH_SIZE - 1)) == 0,
 					 "prefetch size must be power of 2");
 	prefetchedUntil = InvalidBlockNumber;
-	while (blkno > vacrelstats->nonempty_pages)
+	while (blkno > vacrel->nonempty_pages)
 	{
 		Buffer		buf;
 		Page		page;
@@ -2809,9 +2528,9 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 				{
 					ereport(elevel,
 							(errmsg("\"%s\": suspending truncate due to conflicting lock request",
-									vacrelstats->relname)));
+									vacrel->relname)));
 
-					vacrelstats->lock_waiter_detected = true;
+					vacrel->lock_waiter_detected = true;
 					return blkno;
 				}
 				starttime = currenttime;
@@ -2842,8 +2561,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 			prefetchedUntil = prefetchStart;
 		}
 
-		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
-								 RBM_NORMAL, vac_strategy);
+		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								 vacrel->vac_strategy);
 
 		/* In this phase we only need shared access to the buffer */
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
@@ -2891,21 +2610,21 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 	 * pages still are; we need not bother to look at the last known-nonempty
 	 * page.
 	 */
-	return vacrelstats->nonempty_pages;
+	return vacrel->nonempty_pages;
 }
 
 /*
  * Return the maximum number of dead tuples we can record.
  */
 static long
-compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
+compute_max_dead_tuples(BlockNumber relblocks, bool hasindex)
 {
 	long		maxtuples;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
 	autovacuum_work_mem != -1 ?
 	autovacuum_work_mem : maintenance_work_mem;
 
-	if (useindex)
+	if (hasindex)
 	{
 		maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
 		maxtuples = Min(maxtuples, INT_MAX);
@@ -2930,18 +2649,62 @@ compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
  * See the comments at the head of this file for rationale.
  */
 static void
-lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
+lazy_space_alloc(LVRelState *vacrel, int nworkers, BlockNumber nblocks)
 {
-	LVDeadTuples *dead_tuples = NULL;
-	long		maxtuples;
 
-	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->useindex);
+	/*
+	 * Initialize state for a parallel vacuum.  As of now, only one worker can
+	 * be used for an index, so we invoke parallelism only if there are at
+	 * least two indexes on a table.
+	 */
+	if (nworkers >= 0 && vacrel->nindexes > 1)
+	{
+		/*
+		 * Since parallel workers cannot access data in temporary tables, we
+		 * can't perform parallel vacuum on them.
+		 */
+		if (RelationUsesLocalBuffers(vacrel->onerel))
+		{
+			/*
+			 * Give warning only if the user explicitly tries to perform a
+			 * parallel vacuum on the temporary table.
+			 */
+			if (nworkers > 0)
+				ereport(WARNING,
+						(errmsg("disabling parallel option of vacuum on \"%s\" --- cannot vacuum temporary tables in parallel",
+								vacrel->relname)));
+		}
+		else
+			vacrel->lps = begin_parallel_vacuum(vacrel, nblocks, nworkers);
+	}
 
-	dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
-	dead_tuples->num_tuples = 0;
-	dead_tuples->max_tuples = (int) maxtuples;
+	if (vacrel->lps == NULL)
+	{
+		LVDeadTuples *dead_tuples;
+		long		maxtuples;
 
-	vacrelstats->dead_tuples = dead_tuples;
+		maxtuples = compute_max_dead_tuples(nblocks, vacrel->nindexes > 0);
+
+		dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
+		dead_tuples->num_tuples = 0;
+		dead_tuples->max_tuples = (int) maxtuples;
+
+		vacrel->dead_tuples = dead_tuples;
+	}
+}
+
+/* Free space for dead tuples */
+static void
+lazy_space_free(LVRelState *vacrel)
+{
+	if (!vacrel->lps)
+		return;
+
+	/*
+	 * End parallel mode before updating index statistics as we cannot write
+	 * during parallel mode.
+	 */
+	end_parallel_vacuum(vacrel);
 }
 
 /*
@@ -3039,8 +2802,7 @@ vac_cmp_itemptr(const void *left, const void *right)
  * on this page is frozen.
  */
 static bool
-heap_page_is_all_visible(Relation rel, Buffer buf,
-						 LVRelStats *vacrelstats,
+heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 						 TransactionId *visibility_cutoff_xid,
 						 bool *all_frozen)
 {
@@ -3069,7 +2831,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 		 * Set the offset number so that we can display it along with any
 		 * error that occurred while processing this tuple.
 		 */
-		vacrelstats->offnum = offnum;
+		vacrel->offnum = offnum;
 		itemid = PageGetItemId(page, offnum);
 
 		/* Unused or redirect line pointers are of no interest */
@@ -3093,9 +2855,9 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 
 		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
 		tuple.t_len = ItemIdGetLength(itemid);
-		tuple.t_tableOid = RelationGetRelid(rel);
+		tuple.t_tableOid = RelationGetRelid(vacrel->onerel);
 
-		switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
 		{
 			case HEAPTUPLE_LIVE:
 				{
@@ -3114,7 +2876,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 					 * that everyone sees it as committed?
 					 */
 					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-					if (!TransactionIdPrecedes(xmin, OldestXmin))
+					if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
 					{
 						all_visible = false;
 						*all_frozen = false;
@@ -3148,7 +2910,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 	}							/* scan along page */
 
 	/* Clear the offset information once we have processed the given page. */
-	vacrelstats->offnum = InvalidOffsetNumber;
+	vacrel->offnum = InvalidOffsetNumber;
 
 	return all_visible;
 }
@@ -3167,14 +2929,13 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
  * vacuum.
  */
 static int
-compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
+compute_parallel_vacuum_workers(LVRelState *vacrel, int nrequested,
 								bool *can_parallel_vacuum)
 {
 	int			nindexes_parallel = 0;
 	int			nindexes_parallel_bulkdel = 0;
 	int			nindexes_parallel_cleanup = 0;
 	int			parallel_workers;
-	int			i;
 
 	/*
 	 * We don't allow performing parallel operation in standalone backend or
@@ -3186,15 +2947,16 @@ compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
 	/*
 	 * Compute the number of indexes that can participate in parallel vacuum.
 	 */
-	for (i = 0; i < nindexes; i++)
+	for (int idx = 0; idx < vacrel->nindexes; idx++)
 	{
-		uint8		vacoptions = Irel[i]->rd_indam->amparallelvacuumoptions;
+		Relation	indrel = vacrel->indrels[idx];
+		uint8		vacoptions = indrel->rd_indam->amparallelvacuumoptions;
 
 		if (vacoptions == VACUUM_OPTION_NO_PARALLEL ||
-			RelationGetNumberOfBlocks(Irel[i]) < min_parallel_index_scan_size)
+			RelationGetNumberOfBlocks(indrel) < min_parallel_index_scan_size)
 			continue;
 
-		can_parallel_vacuum[i] = true;
+		can_parallel_vacuum[idx] = true;
 
 		if ((vacoptions & VACUUM_OPTION_PARALLEL_BULKDEL) != 0)
 			nindexes_parallel_bulkdel++;
@@ -3223,70 +2985,19 @@ compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
 	return parallel_workers;
 }
 
-/*
- * Initialize variables for shared index statistics, set NULL bitmap and the
- * size of stats for each index.
- */
-static void
-prepare_index_statistics(LVShared *lvshared, bool *can_parallel_vacuum,
-						 int nindexes)
-{
-	int			i;
-
-	/* Currently, we don't support parallel vacuum for autovacuum */
-	Assert(!IsAutoVacuumWorkerProcess());
-
-	/* Set NULL for all indexes */
-	memset(lvshared->bitmap, 0x00, BITMAPLEN(nindexes));
-
-	for (i = 0; i < nindexes; i++)
-	{
-		if (!can_parallel_vacuum[i])
-			continue;
-
-		/* Set NOT NULL as this index does support parallelism */
-		lvshared->bitmap[i >> 3] |= 1 << (i & 0x07);
-	}
-}
-
-/*
- * Update index statistics in pg_class if the statistics are accurate.
- */
-static void
-update_index_statistics(Relation *Irel, IndexBulkDeleteResult **stats,
-						int nindexes)
-{
-	int			i;
-
-	Assert(!IsInParallelMode());
-
-	for (i = 0; i < nindexes; i++)
-	{
-		if (stats[i] == NULL || stats[i]->estimated_count)
-			continue;
-
-		/* Update index statistics */
-		vac_update_relstats(Irel[i],
-							stats[i]->num_pages,
-							stats[i]->num_index_tuples,
-							0,
-							false,
-							InvalidTransactionId,
-							InvalidMultiXactId,
-							false);
-	}
-}
-
 /*
  * This function prepares and returns parallel vacuum state if we can launch
  * even one worker.  This function is responsible for entering parallel mode,
  * create a parallel context, and then initialize the DSM segment.
  */
 static LVParallelState *
-begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
-					  BlockNumber nblocks, int nindexes, int nrequested)
+begin_parallel_vacuum(LVRelState *vacrel, BlockNumber nblocks,
+					  int nrequested)
 {
 	LVParallelState *lps = NULL;
+	Relation	onerel = vacrel->onerel;
+	Relation   *indrels = vacrel->indrels;
+	int			nindexes = vacrel->nindexes;
 	ParallelContext *pcxt;
 	LVShared   *shared;
 	LVDeadTuples *dead_tuples;
@@ -3299,7 +3010,6 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	int			nindexes_mwm = 0;
 	int			parallel_workers = 0;
 	int			querylen;
-	int			i;
 
 	/*
 	 * A parallel vacuum must be requested and there must be indexes on the
@@ -3312,7 +3022,7 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	 * Compute the number of parallel vacuum workers to launch
 	 */
 	can_parallel_vacuum = (bool *) palloc0(sizeof(bool) * nindexes);
-	parallel_workers = compute_parallel_vacuum_workers(Irel, nindexes,
+	parallel_workers = compute_parallel_vacuum_workers(vacrel,
 													   nrequested,
 													   can_parallel_vacuum);
 
@@ -3333,9 +3043,10 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 
 	/* Estimate size for shared information -- PARALLEL_VACUUM_KEY_SHARED */
 	est_shared = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN(nindexes)));
-	for (i = 0; i < nindexes; i++)
+	for (int idx = 0; idx < nindexes; idx++)
 	{
-		uint8		vacoptions = Irel[i]->rd_indam->amparallelvacuumoptions;
+		Relation	indrel = indrels[idx];
+		uint8		vacoptions = indrel->rd_indam->amparallelvacuumoptions;
 
 		/*
 		 * Cleanup option should be either disabled, always performing in
@@ -3346,10 +3057,10 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 		Assert(vacoptions <= VACUUM_OPTION_MAX_VALID_VALUE);
 
 		/* Skip indexes that don't participate in parallel vacuum */
-		if (!can_parallel_vacuum[i])
+		if (!can_parallel_vacuum[idx])
 			continue;
 
-		if (Irel[i]->rd_indam->amusemaintenanceworkmem)
+		if (indrel->rd_indam->amusemaintenanceworkmem)
 			nindexes_mwm++;
 
 		est_shared = add_size(est_shared, sizeof(LVSharedIndStats));
@@ -3404,7 +3115,7 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	/* Prepare shared information */
 	shared = (LVShared *) shm_toc_allocate(pcxt->toc, est_shared);
 	MemSet(shared, 0, est_shared);
-	shared->relid = relid;
+	shared->onereloid = RelationGetRelid(onerel);
 	shared->elevel = elevel;
 	shared->maintenance_work_mem_worker =
 		(nindexes_mwm > 0) ?
@@ -3415,7 +3126,20 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	pg_atomic_init_u32(&(shared->active_nworkers), 0);
 	pg_atomic_init_u32(&(shared->idx), 0);
 	shared->offset = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN(nindexes)));
-	prepare_index_statistics(shared, can_parallel_vacuum, nindexes);
+
+	/*
+	 * Initialize variables for shared index statistics, set NULL bitmap and
+	 * the size of stats for each index.
+	 */
+	memset(shared->bitmap, 0x00, BITMAPLEN(nindexes));
+	for (int idx = 0; idx < nindexes; idx++)
+	{
+		if (!can_parallel_vacuum[idx])
+			continue;
+
+		/* Set NOT NULL as this index does support parallelism */
+		shared->bitmap[idx >> 3] |= 1 << (idx & 0x07);
+	}
 
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
 	lps->lvshared = shared;
@@ -3426,7 +3150,7 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	dead_tuples->num_tuples = 0;
 	MemSet(dead_tuples->itemptrs, 0, sizeof(ItemPointerData) * maxtuples);
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_TUPLES, dead_tuples);
-	vacrelstats->dead_tuples = dead_tuples;
+	vacrel->dead_tuples = dead_tuples;
 
 	/*
 	 * Allocate space for each worker's BufferUsage and WalUsage; no need to
@@ -3467,32 +3191,35 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
  * context, but that won't be safe (see ExitParallelMode).
  */
 static void
-end_parallel_vacuum(IndexBulkDeleteResult **stats, LVParallelState *lps,
-					int nindexes)
+end_parallel_vacuum(LVRelState *vacrel)
 {
-	int			i;
+	IndexBulkDeleteResult **indstats = vacrel->indstats;
+	LVParallelState *lps = vacrel->lps;
+	int			nindexes = vacrel->nindexes;
 
 	Assert(!IsParallelWorker());
 
 	/* Copy the updated statistics */
-	for (i = 0; i < nindexes; i++)
+	for (int idx = 0; idx < nindexes; idx++)
 	{
-		LVSharedIndStats *indstats = get_indstats(lps->lvshared, i);
+		LVSharedIndStats *shared_istat;
+
+		shared_istat = parallel_stats_for_idx(lps->lvshared, idx);
 
 		/*
 		 * Skip unused slot.  The statistics of this index are already stored
 		 * in local memory.
 		 */
-		if (indstats == NULL)
+		if (shared_istat == NULL)
 			continue;
 
-		if (indstats->updated)
+		if (shared_istat->updated)
 		{
-			stats[i] = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-			memcpy(stats[i], &(indstats->stats), sizeof(IndexBulkDeleteResult));
+			indstats[idx] = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+			memcpy(indstats[idx], &(shared_istat->istat), sizeof(IndexBulkDeleteResult));
 		}
 		else
-			stats[i] = NULL;
+			indstats[idx] = NULL;
 	}
 
 	DestroyParallelContext(lps->pcxt);
@@ -3503,20 +3230,361 @@ end_parallel_vacuum(IndexBulkDeleteResult **stats, LVParallelState *lps,
 	lps = NULL;
 }
 
-/* Return the Nth index statistics or NULL */
-static LVSharedIndStats *
-get_indstats(LVShared *lvshared, int n)
+static void
+do_parallel_lazy_vacuum_all_indexes(LVRelState *vacrel)
+{
+	/* Tell parallel workers to do index vacuuming */
+	vacrel->lps->lvshared->for_cleanup = false;
+	vacrel->lps->lvshared->first_time = false;
+
+	/*
+	 * We can only provide an approximate value of num_heap_tuples in vacuum
+	 * cases.
+	 */
+	vacrel->lps->lvshared->reltuples = vacrel->old_live_tuples;
+	vacrel->lps->lvshared->estimated_count = true;
+
+	do_parallel_vacuum_or_cleanup(vacrel,
+								  vacrel->lps->nindexes_parallel_bulkdel);
+}
+
+static void
+do_parallel_lazy_cleanup_all_indexes(LVRelState *vacrel)
+{
+	int			nworkers;
+
+	/*
+	 * If parallel vacuum is active we perform index cleanup with parallel
+	 * workers.
+	 *
+	 * Tell parallel workers to do index cleanup.
+	 */
+	vacrel->lps->lvshared->for_cleanup = true;
+	vacrel->lps->lvshared->first_time = (vacrel->num_index_scans == 0);
+
+	/*
+	 * Now we can provide a better estimate of total number of surviving
+	 * tuples (we assume indexes are more interested in that than in the
+	 * number of nominally live tuples).
+	 */
+	vacrel->lps->lvshared->reltuples = vacrel->new_rel_tuples;
+	vacrel->lps->lvshared->estimated_count =
+		(vacrel->tupcount_pages < vacrel->rel_pages);
+
+	/* Determine the number of parallel workers to launch */
+	if (vacrel->lps->lvshared->first_time)
+		nworkers = vacrel->lps->nindexes_parallel_cleanup +
+			vacrel->lps->nindexes_parallel_condcleanup;
+	else
+		nworkers = vacrel->lps->nindexes_parallel_cleanup;
+
+	do_parallel_vacuum_or_cleanup(vacrel, nworkers);
+}
+
+/*
+ * Perform index vacuum or index cleanup with parallel workers.  This function
+ * must be used by the parallel vacuum leader process.  The caller must set
+ * lps->lvshared->for_cleanup to indicate whether to perform vacuum or
+ * cleanup.
+ */
+static void
+do_parallel_vacuum_or_cleanup(LVRelState *vacrel, int nworkers)
+{
+	LVParallelState *lps = vacrel->lps;
+
+	Assert(!IsParallelWorker());
+	Assert(vacrel->nindexes > 0);
+
+	/* The leader process will participate */
+	nworkers--;
+
+	/*
+	 * It is possible that parallel context is initialized with fewer workers
+	 * than the number of indexes that need a separate worker in the current
+	 * phase, so we need to consider it.  See compute_parallel_vacuum_workers.
+	 */
+	nworkers = Min(nworkers, lps->pcxt->nworkers);
+
+	/* Setup the shared cost-based vacuum delay and launch workers */
+	if (nworkers > 0)
+	{
+		if (vacrel->num_index_scans > 0)
+		{
+			/* Reset the parallel index processing counter */
+			pg_atomic_write_u32(&(lps->lvshared->idx), 0);
+
+			/* Reinitialize the parallel context to relaunch parallel workers */
+			ReinitializeParallelDSM(lps->pcxt);
+		}
+
+		/*
+		 * Set up shared cost balance and the number of active workers for
+		 * vacuum delay.  We need to do this before launching workers as
+		 * otherwise, they might not see the updated values for these
+		 * parameters.
+		 */
+		pg_atomic_write_u32(&(lps->lvshared->cost_balance), VacuumCostBalance);
+		pg_atomic_write_u32(&(lps->lvshared->active_nworkers), 0);
+
+		/*
+		 * The number of workers can vary between bulkdelete and cleanup
+		 * phase.
+		 */
+		ReinitializeParallelWorkers(lps->pcxt, nworkers);
+
+		LaunchParallelWorkers(lps->pcxt);
+
+		if (lps->pcxt->nworkers_launched > 0)
+		{
+			/*
+			 * Reset the local cost values for leader backend as we have
+			 * already accumulated the remaining balance of heap.
+			 */
+			VacuumCostBalance = 0;
+			VacuumCostBalanceLocal = 0;
+
+			/* Enable shared cost balance for leader backend */
+			VacuumSharedCostBalance = &(lps->lvshared->cost_balance);
+			VacuumActiveNWorkers = &(lps->lvshared->active_nworkers);
+		}
+
+		if (lps->lvshared->for_cleanup)
+			ereport(elevel,
+					(errmsg(ngettext("launched %d parallel vacuum worker for index cleanup (planned: %d)",
+									 "launched %d parallel vacuum workers for index cleanup (planned: %d)",
+									 lps->pcxt->nworkers_launched),
+							lps->pcxt->nworkers_launched, nworkers)));
+		else
+			ereport(elevel,
+					(errmsg(ngettext("launched %d parallel vacuum worker for index vacuuming (planned: %d)",
+									 "launched %d parallel vacuum workers for index vacuuming (planned: %d)",
+									 lps->pcxt->nworkers_launched),
+							lps->pcxt->nworkers_launched, nworkers)));
+	}
+
+	/* Process the indexes that can be processed by only leader process */
+	do_serial_processing_for_unsafe_indexes(vacrel, lps->lvshared);
+
+	/*
+	 * Join as a parallel worker.  The leader process alone processes all the
+	 * indexes in the case where no workers are launched.
+	 */
+	do_parallel_processing(vacrel, lps->lvshared);
+
+	/*
+	 * Next, accumulate buffer and WAL usage.  (This must wait for the workers
+	 * to finish, or we might get incomplete data.)
+	 */
+	if (nworkers > 0)
+	{
+		/* Wait for all vacuum workers to finish */
+		WaitForParallelWorkersToFinish(lps->pcxt);
+
+		for (int i = 0; i < lps->pcxt->nworkers_launched; i++)
+			InstrAccumParallelQuery(&lps->buffer_usage[i], &lps->wal_usage[i]);
+	}
+
+	/*
+	 * Carry the shared balance value to heap scan and disable shared costing
+	 */
+	if (VacuumSharedCostBalance)
+	{
+		VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance);
+		VacuumSharedCostBalance = NULL;
+		VacuumActiveNWorkers = NULL;
+	}
+}
+
+/*
+ * Index vacuum/cleanup routine used by the leader process and parallel
+ * vacuum worker processes to process the indexes in parallel.
+ */
+static void
+do_parallel_processing(LVRelState *vacrel, LVShared *lvshared)
+{
+	/*
+	 * Increment the active worker count if we are able to launch any worker.
+	 */
+	if (VacuumActiveNWorkers)
+		pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
+
+	/* Loop until all indexes are vacuumed */
+	for (;;)
+	{
+		int			idx;
+		LVSharedIndStats *shared_istat;
+		Relation	indrel;
+		IndexBulkDeleteResult *istat;
+
+		/* Get an index number to process */
+		idx = pg_atomic_fetch_add_u32(&(lvshared->idx), 1);
+
+		/* Done for all indexes? */
+		if (idx >= vacrel->nindexes)
+			break;
+
+		/* Get the index statistics of this index from DSM */
+		shared_istat = parallel_stats_for_idx(lvshared, idx);
+
+		/* Skip indexes not participating in parallelism */
+		if (shared_istat == NULL)
+			continue;
+
+		indrel = vacrel->indrels[idx];
+
+		/*
+		 * Skip processing indexes that are unsafe for workers (these are
+		 * processed in do_serial_processing_for_unsafe_indexes() by leader)
+		 */
+		if (!parallel_processing_is_safe(indrel, lvshared))
+			continue;
+
+		/* Do vacuum or cleanup of the index */
+		istat = (vacrel->indstats[idx]);
+		vacrel->indstats[idx] = parallel_process_one_index(indrel, istat,
+														   lvshared,
+														   shared_istat,
+														   vacrel);
+	}
+
+	/*
+	 * We have completed the index vacuum so decrement the active worker
+	 * count.
+	 */
+	if (VacuumActiveNWorkers)
+		pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
+}
+
+/*
+ * Vacuum or cleanup indexes that can be processed by only the leader process
+ * because these indexes don't support parallel operation at that phase.
+ */
+static void
+do_serial_processing_for_unsafe_indexes(LVRelState *vacrel, LVShared *lvshared)
+{
+	Assert(!IsParallelWorker());
+
+	/*
+	 * Increment the active worker count if we are able to launch any worker.
+	 */
+	if (VacuumActiveNWorkers)
+		pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
+
+	for (int idx = 0; idx < vacrel->nindexes; idx++)
+	{
+		LVSharedIndStats *shared_istat;
+		Relation	indrel;
+		IndexBulkDeleteResult *istat;
+
+		shared_istat = parallel_stats_for_idx(lvshared, idx);
+
+		/* Skip indexes that have a shared stats slot (processed in parallel) */
+		if (shared_istat != NULL)
+			continue;
+
+		indrel = vacrel->indrels[idx];
+
+		/*
+		 * We're only here for the unsafe indexes
+		 */
+		if (parallel_processing_is_safe(indrel, lvshared))
+			continue;
+
+		/* Do vacuum or cleanup of the index */
+		istat = (vacrel->indstats[idx]);
+		vacrel->indstats[idx] = parallel_process_one_index(indrel, istat,
+														   lvshared,
+														   shared_istat,
+														   vacrel);
+	}
+
+	/*
+	 * We have completed the index vacuum so decrement the active worker
+	 * count.
+	 */
+	if (VacuumActiveNWorkers)
+		pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
+}
+
+/*
+ * Vacuum or cleanup index either by leader process or by one of the worker
+ * processes.  After processing the index, this function copies the index
+ * statistics returned from ambulkdelete and amvacuumcleanup to the DSM
+ * segment.
+ */
+static IndexBulkDeleteResult *
+parallel_process_one_index(Relation indrel,
+						   IndexBulkDeleteResult *istat,
+						   LVShared *lvshared,
+						   LVSharedIndStats *shared_istat,
+						   LVRelState *vacrel)
+{
+	IndexBulkDeleteResult *bulkdelete_res = NULL;
+
+	if (shared_istat)
+	{
+		/* Get the space for IndexBulkDeleteResult */
+		bulkdelete_res = &(shared_istat->istat);
+
+		/*
+		 * Update the pointer to the corresponding bulk-deletion result if
+		 * someone has already updated it.
+		 */
+		if (shared_istat->updated && istat == NULL)
+			istat = bulkdelete_res;
+	}
+
+	/* Do vacuum or cleanup of the index */
+	if (lvshared->for_cleanup)
+		istat = lazy_cleanup_one_index(indrel, istat, lvshared->reltuples,
+									   lvshared->estimated_count, vacrel);
+	else
+		istat = lazy_vacuum_one_index(indrel, istat, lvshared->reltuples,
+									  vacrel);
+
+	/*
+	 * Copy the index bulk-deletion result returned from ambulkdelete and
+	 * amvacuumcleanup to the DSM segment if it's the first cycle because they
+	 * allocate locally and it's possible that an index will be vacuumed by a
+	 * different vacuum process the next cycle.  Copying the result normally
+	 * happens only the first time an index is vacuumed.  For any additional
+	 * vacuum pass, we directly point to the result on the DSM segment and
+	 * pass it to vacuum index APIs so that workers can update it directly.
+	 *
+	 * Since all vacuum workers write the bulk-deletion result at different
+	 * slots we can write them without locking.
+	 */
+	if (shared_istat && !shared_istat->updated && istat != NULL)
+	{
+		memcpy(bulkdelete_res, istat, sizeof(IndexBulkDeleteResult));
+		shared_istat->updated = true;
+
+		/*
+		 * Now that top-level indstats[idx] points to the DSM segment, we
+		 * don't need the locally allocated results.
+		 */
+		pfree(istat);
+		istat = bulkdelete_res;
+	}
+
+	return istat;
+}
+
+/*
+ * Return shared memory statistics for index at offset 'getidx', if any
+ */
+static LVSharedIndStats *
+parallel_stats_for_idx(LVShared *lvshared, int getidx)
 {
-	int			i;
 	char	   *p;
 
-	if (IndStatsIsNull(lvshared, n))
+	if (IndStatsIsNull(lvshared, getidx))
 		return NULL;
 
 	p = (char *) GetSharedIndStats(lvshared);
-	for (i = 0; i < n; i++)
+	for (int idx = 0; idx < getidx; idx++)
 	{
-		if (IndStatsIsNull(lvshared, i))
+		if (IndStatsIsNull(lvshared, idx))
 			continue;
 
 		p += sizeof(LVSharedIndStats);
@@ -3526,11 +3594,11 @@ get_indstats(LVShared *lvshared, int n)
 }
 
 /*
- * Returns true, if the given index can't participate in parallel index vacuum
- * or parallel index cleanup, false, otherwise.
+ * Returns false if the given index can't participate in parallel index
+ * vacuum or parallel index cleanup; otherwise returns true.
  */
 static bool
-skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared)
+parallel_processing_is_safe(Relation indrel, LVShared *lvshared)
 {
 	uint8		vacoptions = indrel->rd_indam->amparallelvacuumoptions;
 
@@ -3552,15 +3620,15 @@ skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared)
 		 */
 		if (!lvshared->first_time &&
 			((vacoptions & VACUUM_OPTION_PARALLEL_COND_CLEANUP) != 0))
-			return true;
+			return false;
 	}
 	else if ((vacoptions & VACUUM_OPTION_PARALLEL_BULKDEL) == 0)
 	{
 		/* Skip if the index does not support parallel bulk deletion */
-		return true;
+		return false;
 	}
 
-	return false;
+	return true;
 }
 
 /*
@@ -3580,7 +3648,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	WalUsage   *wal_usage;
 	int			nindexes;
 	char	   *sharedquery;
-	LVRelStats	vacrelstats;
+	LVRelState	vacrel;
 	ErrorContextCallback errcallback;
 
 	lvshared = (LVShared *) shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_SHARED,
@@ -3602,7 +3670,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	 * okay because the lock mode does not conflict among the parallel
 	 * workers.
 	 */
-	onerel = table_open(lvshared->relid, ShareUpdateExclusiveLock);
+	onerel = table_open(lvshared->onereloid, ShareUpdateExclusiveLock);
 
 	/*
 	 * Open all indexes. indrels are sorted in order by OID, which should be
@@ -3626,24 +3694,27 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	VacuumSharedCostBalance = &(lvshared->cost_balance);
 	VacuumActiveNWorkers = &(lvshared->active_nworkers);
 
-	vacrelstats.indstats = (IndexBulkDeleteResult **)
+	vacrel.onerel = onerel;
+	vacrel.indrels = indrels;
+	vacrel.nindexes = nindexes;
+	vacrel.indstats = (IndexBulkDeleteResult **)
 		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
 
 	if (lvshared->maintenance_work_mem_worker > 0)
 		maintenance_work_mem = lvshared->maintenance_work_mem_worker;
 
 	/*
-	 * Initialize vacrelstats for use as error callback arg by parallel
-	 * worker.
+	 * Initialize vacrel for use as error callback arg by parallel worker.
 	 */
-	vacrelstats.relnamespace = get_namespace_name(RelationGetNamespace(onerel));
-	vacrelstats.relname = pstrdup(RelationGetRelationName(onerel));
-	vacrelstats.indname = NULL;
-	vacrelstats.phase = VACUUM_ERRCB_PHASE_UNKNOWN; /* Not yet processing */
+	vacrel.relnamespace = get_namespace_name(RelationGetNamespace(onerel));
+	vacrel.relname = pstrdup(RelationGetRelationName(onerel));
+	vacrel.indname = NULL;
+	vacrel.phase = VACUUM_ERRCB_PHASE_UNKNOWN;	/* Not yet processing */
+	vacrel.dead_tuples = dead_tuples;
 
 	/* Setup error traceback support for ereport() */
 	errcallback.callback = vacuum_error_callback;
-	errcallback.arg = &vacrelstats;
+	errcallback.arg = &vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
 
@@ -3651,8 +3722,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	InstrStartParallelQuery();
 
 	/* Process indexes to perform vacuum/cleanup */
-	parallel_vacuum_index(indrels, lvshared, dead_tuples, nindexes,
-						  &vacrelstats);
+	do_parallel_processing(&vacrel, lvshared);
 
 	/* Report buffer/WAL usage during parallel execution */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
@@ -3665,7 +3735,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 
 	vac_close_indexes(nindexes, indrels, RowExclusiveLock);
 	table_close(onerel, ShareUpdateExclusiveLock);
-	pfree(vacrelstats.indstats);
+	pfree(vacrel.indstats);
 }
 
 /*
@@ -3674,7 +3744,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 static void
 vacuum_error_callback(void *arg)
 {
-	LVRelStats *errinfo = arg;
+	LVRelState *errinfo = arg;
 
 	switch (errinfo->phase)
 	{
@@ -3736,28 +3806,29 @@ vacuum_error_callback(void *arg)
  * the current information which can be later restored via restore_vacuum_error_info.
  */
 static void
-update_vacuum_error_info(LVRelStats *errinfo, LVSavedErrInfo *saved_err_info, int phase,
-						 BlockNumber blkno, OffsetNumber offnum)
+update_vacuum_error_info(LVRelState *vacrel, LVSavedErrInfo *saved_vacrel,
+						 int phase, BlockNumber blkno, OffsetNumber offnum)
 {
-	if (saved_err_info)
+	if (saved_vacrel)
 	{
-		saved_err_info->offnum = errinfo->offnum;
-		saved_err_info->blkno = errinfo->blkno;
-		saved_err_info->phase = errinfo->phase;
+		saved_vacrel->offnum = vacrel->offnum;
+		saved_vacrel->blkno = vacrel->blkno;
+		saved_vacrel->phase = vacrel->phase;
 	}
 
-	errinfo->blkno = blkno;
-	errinfo->offnum = offnum;
-	errinfo->phase = phase;
+	vacrel->blkno = blkno;
+	vacrel->offnum = offnum;
+	vacrel->phase = phase;
 }
 
 /*
  * Restores the vacuum information saved via a prior call to update_vacuum_error_info.
  */
 static void
-restore_vacuum_error_info(LVRelStats *errinfo, const LVSavedErrInfo *saved_err_info)
+restore_vacuum_error_info(LVRelState *vacrel,
+						  const LVSavedErrInfo *saved_vacrel)
 {
-	errinfo->blkno = saved_err_info->blkno;
-	errinfo->offnum = saved_err_info->offnum;
-	errinfo->phase = saved_err_info->phase;
+	vacrel->blkno = saved_vacrel->blkno;
+	vacrel->offnum = saved_vacrel->offnum;
+	vacrel->phase = saved_vacrel->phase;
 }
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 3d2dbed708..9b5afa12ad 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -689,7 +689,7 @@ index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap)
  */
 IndexBulkDeleteResult *
 index_bulk_delete(IndexVacuumInfo *info,
-				  IndexBulkDeleteResult *stats,
+				  IndexBulkDeleteResult *istat,
 				  IndexBulkDeleteCallback callback,
 				  void *callback_state)
 {
@@ -698,7 +698,7 @@ index_bulk_delete(IndexVacuumInfo *info,
 	RELATION_CHECKS;
 	CHECK_REL_PROCEDURE(ambulkdelete);
 
-	return indexRelation->rd_indam->ambulkdelete(info, stats,
+	return indexRelation->rd_indam->ambulkdelete(info, istat,
 												 callback, callback_state);
 }
 
@@ -710,14 +710,14 @@ index_bulk_delete(IndexVacuumInfo *info,
  */
 IndexBulkDeleteResult *
 index_vacuum_cleanup(IndexVacuumInfo *info,
-					 IndexBulkDeleteResult *stats)
+					 IndexBulkDeleteResult *istat)
 {
 	Relation	indexRelation = info->index;
 
 	RELATION_CHECKS;
 	CHECK_REL_PROCEDURE(amvacuumcleanup);
 
-	return indexRelation->rd_indam->amvacuumcleanup(info, stats);
+	return indexRelation->rd_indam->amvacuumcleanup(info, istat);
 }
 
 /* ----------------
-- 
2.27.0
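
For anyone reviewing the DSM layout changes above: the slot lookup that
parallel_stats_for_idx() (formerly get_indstats()) performs can be a little
hard to follow from the diff alone.  Below is a small standalone C sketch of
the same scheme -- only indexes flagged as participating own a packed stats
slot, so the lookup skips over the slots owned by lower-numbered indexes.
All Model* names are invented for illustration; they are simplified stand-ins,
not the real LVShared/LVSharedIndStats types or the bitmap representation.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the IndexBulkDeleteResult kept in each shared slot */
typedef struct ModelIndStats
{
	bool		updated;
	int			num_pages;
} ModelIndStats;

/* Stand-in for LVShared: a flag per index plus densely packed slots */
typedef struct ModelShared
{
	const bool *has_slot;		/* plays the role of the NULL bitmap */
	ModelIndStats *slots;		/* only participating indexes get a slot */
} ModelShared;

/* Same walk as parallel_stats_for_idx(): count earlier slot owners */
static ModelIndStats *
model_stats_for_idx(ModelShared *shared, int getidx)
{
	int			slotno = 0;

	if (!shared->has_slot[getidx])
		return NULL;			/* index doesn't participate in parallelism */

	for (int idx = 0; idx < getidx; idx++)
	{
		if (shared->has_slot[idx])
			slotno++;
	}
	return &shared->slots[slotno];
}

int
main(void)
{
	bool		has_slot[] = {true, false, true};
	ModelIndStats slots[] = {{true, 10}, {true, 27}};
	ModelShared shared = {has_slot, slots};

	/* Index 2 owns the second packed slot; index 1 owns none */
	printf("idx 2 -> num_pages=%d\n",
		   model_stats_for_idx(&shared, 2)->num_pages);
	printf("idx 1 -> %s\n",
		   model_stats_for_idx(&shared, 1) ? "slot" : "NULL");
	return 0;
}

Running it prints the slot for index 2 and NULL for index 1, mirroring how
IndStatsIsNull() gates the pointer walk in the patch.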

Attachment: v7-0002-Break-lazy_scan_heap-up-into-functions.patch (application/octet-stream)
From 6b3e361b499af572cf06f5639de8fd56862dfec1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 24 Mar 2021 20:53:54 -0700
Subject: [PATCH v7 2/4] Break lazy_scan_heap() up into functions.

Aside from being useful cleanup work in its own right, this is also
preparation for an upcoming patch that removes the "tupgone" special
case from vacuumlazy.c.

The INDEX_CLEANUP=off case no longer uses the one-pass code path used
when vacuuming a table with no indexes.  It doesn't make sense to think
of the two cases as equivalent because only the no-indexes case can do
heap vacuuming.  The INDEX_CLEANUP=off case is now structured as a
two-pass VACUUM that opts to not do index vacuuming (and so naturally
cannot safely perform heap vacuuming).
---
 src/backend/access/heap/vacuumlazy.c  | 1363 +++++++++++++++----------
 contrib/pg_visibility/pg_visibility.c |    8 +-
 contrib/pgstattuple/pgstatapprox.c    |    9 +-
 3 files changed, 823 insertions(+), 557 deletions(-)
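
Before diving into the diff, here is a rough standalone C sketch of the
control flow the refactored per-block loop in lazy_scan_heap() follows: new
and empty pages go to dedicated helpers, everything else goes through
lazy_prune_page_items(), and free space is recorded right away only when the
page will not be revisited by the second heap pass.  The model_* names and
printf placeholders are invented for illustration and are not part of the
patch.

#include <stdbool.h>
#include <stdio.h>

typedef enum { PAGE_NEW, PAGE_EMPTY, PAGE_NORMAL } ModelPageKind;

typedef struct ModelPage
{
	ModelPageKind kind;
	bool		has_dead_items;	/* would be set by the pruning step */
} ModelPage;

/* One iteration of the (simplified) per-block loop */
static void
model_scan_block(ModelPage *page, bool has_indexes, bool do_index_vacuuming)
{
	if (page->kind == PAGE_NEW)
	{
		printf("new page: just make sure the FSM knows about its space\n");
		return;					/* cf. lazy_scan_new_page() */
	}
	if (page->kind == PAGE_EMPTY)
	{
		printf("empty page: set all-visible/all-frozen, record free space\n");
		return;					/* cf. lazy_scan_empty_page() */
	}

	/* Normal page: prune, collect LP_DEAD items, consider freezing */
	printf("prune page\n");		/* cf. lazy_prune_page_items() */

	if (has_indexes && page->has_dead_items && do_index_vacuuming)
		printf("defer free space until the second heap pass\n");
	else
		printf("record free space now\n");
}

int
main(void)
{
	ModelPage pages[] = {
		{PAGE_NEW, false}, {PAGE_EMPTY, false},
		{PAGE_NORMAL, true}, {PAGE_NORMAL, false}
	};

	for (int i = 0; i < 4; i++)
		model_scan_block(&pages[i], true, true);
	return 0;
}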

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b5343d5d78..7c1047c745 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -305,10 +305,12 @@ typedef struct LVRelState
 	/* Parallel VACUUM state */
 	LVParallelState *lps;
 
+	/* Do index and/or heap vacuuming (don't skip them)? */
+	bool		do_index_vacuuming;
+	bool		do_index_cleanup;
+
 	char	   *relnamespace;
 	char	   *relname;
-	/* useindex = true means two-pass strategy; false means one-pass */
-	bool		useindex;
 	/* Overall statistics about onerel */
 	BlockNumber old_rel_pages;	/* previous value of pg_class.relpages */
 	BlockNumber rel_pages;		/* total number of pages */
@@ -348,13 +350,63 @@ typedef struct LVSavedErrInfo
 
 static int	elevel = -1;
 
+/*
+ * Counters maintained by lazy_scan_heap() (and lazy_prune_page_items())
+ */
+typedef struct LVTempCounters
+{
+	double		num_tuples;		/* total number of nonremovable tuples */
+	double		live_tuples;	/* live tuples (reltuples estimate) */
+	double		tups_vacuumed;	/* tuples cleaned up by current vacuum */
+	double		nkeep;			/* dead-but-not-removable tuples */
+	double		nunused;		/* # existing unused line pointers */
+} LVTempCounters;
+
+/*
+ * State output by lazy_prune_page_items()
+ */
+typedef struct LVPagePruneState
+{
+	bool		hastup;			/* Page prevents relation truncation? */
+	bool		has_dead_items; /* includes existing LP_DEAD items */
+	bool		all_visible;	/* Every item visible to all? */
+	bool		all_frozen;		/* provided all_visible is also true */
+} LVPagePruneState;
+
+/*
+ * State set up and maintained in lazy_scan_heap() (also maintained in
+ * lazy_prune_page_items()) that represents VM bit status.
+ *
+ * Used by lazy_scan_setvmbit_page() when we're done pruning.
+ */
+typedef struct LVPageVisMapState
+{
+	bool		all_visible_according_to_vm;
+	TransactionId visibility_cutoff_xid;
+} LVPageVisMapState;
+
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
 						   bool aggressive);
+static bool lazy_scan_needs_freeze(Buffer buf, bool *hastup,
+								   LVRelState *vacrel);
+static void lazy_scan_new_page(LVRelState *vacrel, Buffer buf);
+static void lazy_scan_empty_page(LVRelState *vacrel, Buffer buf,
+								 Buffer vmbuffer);
+static void lazy_scan_setvmbit_page(LVRelState *vacrel, Buffer buf,
+									Buffer vmbuffer,
+									LVPagePruneState *pageprunestate,
+									LVPageVisMapState *pagevmstate);
+static void lazy_prune_page_items(LVRelState *vacrel, Buffer buf,
+								  GlobalVisState *vistest,
+								  xl_heap_freeze_tuple *frozen,
+								  LVTempCounters *scancounts,
+								  LVPagePruneState *pageprunestate,
+								  LVPageVisMapState *pagevmstate,
+								  VacOptTernaryValue index_cleanup);
+static void lazy_vacuum_all_pruned_items(LVRelState *vacrel);
 static void lazy_vacuum_heap(LVRelState *vacrel);
-static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
-									LVRelState *vacrel);
 static void lazy_vacuum_all_indexes(LVRelState *vacrel);
 static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
 													IndexBulkDeleteResult *istat,
@@ -378,7 +430,7 @@ static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
 static int	vac_cmp_itemptr(const void *left, const void *right);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
-static BlockNumber count_nondeletable_pages(LVRelState *vacrel);
+static BlockNumber lazy_truncate_count_nondeletable(LVRelState *vacrel);
 static long compute_max_dead_tuples(BlockNumber relblocks, bool hasindex);
 static void lazy_space_alloc(LVRelState *vacrel, int nworkers,
 							 BlockNumber relblocks);
@@ -502,6 +554,9 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 
 	vacrel->onerel = onerel;
 	vacrel->lps = NULL;
+	vacrel->do_index_vacuuming = true;
+	vacrel->do_index_cleanup = true;
+
 	vacrel->vac_strategy = bstrategy;
 	vacrel->relfrozenxid = onerel->rd_rel->relfrozenxid;
 	vacrel->relminmxid = onerel->rd_rel->relminmxid;
@@ -521,11 +576,20 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	/* Open all indexes of the relation */
 	vac_open_indexes(onerel, RowExclusiveLock, &nindexes,
 					 &vacrel->indrels);
-	vacrel->useindex = (nindexes > 0 &&
-						params->index_cleanup == VACOPT_TERNARY_ENABLED);
 
 	vacrel->nindexes = nindexes;
 
+	/*
+	 * Determine if we should skip index vacuuming and cleanup based on user's
+	 * preference.  Note that this is structured as orthogonal to the one-pass
+	 * (nindexes == 0) case to make various assertions do the right thing.
+	 */
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+	{
+		vacrel->do_index_vacuuming = false;
+		vacrel->do_index_cleanup = false;
+	}
+
 	vacrel->indstats = (IndexBulkDeleteResult **)
 		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
 
@@ -802,9 +866,9 @@ vacuum_log_cleanup_info(LVRelState *vacrel)
  *		page, and set commit status bits (see heap_page_prune).  It also builds
  *		lists of dead tuples and pages with free space, calculates statistics
  *		on the number of live tuples in the heap, and marks pages as
- *		all-visible if appropriate.  When done, or when we run low on space for
- *		dead-tuple TIDs, invoke vacuuming of indexes and call lazy_vacuum_heap
- *		to reclaim dead line pointers.
+ *		all-visible if appropriate.  When done, or when we run low on space
+ *		for dead-tuple TIDs, invoke lazy_vacuum_all_pruned_items to vacuum
+ *		indexes, and then vacuum the heap during a second heap pass.
  *
  *		If the table has at least two indexes, we execute both index vacuum
  *		and index cleanup with parallel workers unless parallel vacuum is
@@ -827,20 +891,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 {
 	LVDeadTuples *dead_tuples;
 	BlockNumber nblocks,
-				blkno;
-	HeapTupleData tuple;
-	BlockNumber empty_pages,
-				vacuumed_pages,
+				blkno,
+				next_unskippable_block,
 				next_fsm_block_to_vacuum;
-	double		num_tuples,		/* total number of nonremovable tuples */
-				live_tuples,	/* live tuples (reltuples estimate) */
-				tups_vacuumed,	/* tuples cleaned up by current vacuum */
-				nkeep,			/* dead-but-not-removable tuples */
-				nunused;		/* # existing unused line pointers */
-	int			i;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
-	BlockNumber next_unskippable_block;
 	bool		skipping_blocks;
 	xl_heap_freeze_tuple *frozen;
 	StringInfoData buf;
@@ -851,6 +906,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	};
 	int64		initprog_val[3];
 	GlobalVisState *vistest;
+	LVTempCounters scancounts;
+
+	/* Counters of # blocks in onerel: */
+	BlockNumber empty_pages,
+				vacuumed_pages;
 
 	pg_rusage_init(&ru0);
 
@@ -866,8 +926,13 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 						vacrel->relname)));
 
 	empty_pages = vacuumed_pages = 0;
-	next_fsm_block_to_vacuum = (BlockNumber) 0;
-	num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
+
+	/* Initialize counters */
+	scancounts.num_tuples = 0;
+	scancounts.live_tuples = 0;
+	scancounts.tups_vacuumed = 0;
+	scancounts.nkeep = 0;
+	scancounts.nunused = 0;
 
 	nblocks = RelationGetNumberOfBlocks(vacrel->onerel);
 	next_unskippable_block = 0;
@@ -972,20 +1037,25 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	{
 		Buffer		buf;
 		Page		page;
-		OffsetNumber offnum,
-					maxoff;
-		bool		tupgone,
-					hastup;
-		int			prev_dead_count;
-		int			nfrozen;
+		LVPageVisMapState pagevmstate;
+		LVPagePruneState pageprunestate;
+		bool		savefreespace;
 		Size		freespace;
-		bool		all_visible_according_to_vm = false;
-		bool		all_visible;
-		bool		all_frozen = true;	/* provided all_visible is also true */
-		bool		has_dead_items;		/* includes existing LP_DEAD items */
-		TransactionId visibility_cutoff_xid = InvalidTransactionId;
 
-		/* see note above about forcing scanning of last page */
+		/*
+		 * Initialize vm state for page
+		 *
+		 * Can't touch pageprunestate for page until we reach
+		 * lazy_prune_page_items(), though -- that's output state only
+		 */
+		pagevmstate.all_visible_according_to_vm = false;
+		pagevmstate.visibility_cutoff_xid = InvalidTransactionId;
+
+		/*
+		 * Step 1 for block: Consider need to skip blocks.
+		 *
+		 * See note above about forcing scanning of last page.
+		 */
 #define FORCE_CHECK_PAGE() \
 		(blkno == nblocks - 1 && should_attempt_truncation(vacrel, params))
 
@@ -1038,7 +1108,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (aggressive && VM_ALL_VISIBLE(vacrel->onerel, blkno,
 											 &vmbuffer))
-				all_visible_according_to_vm = true;
+				pagevmstate.all_visible_according_to_vm = true;
 		}
 		else
 		{
@@ -1066,12 +1136,15 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 					vacrel->frozenskipped_pages++;
 				continue;
 			}
-			all_visible_according_to_vm = true;
+			pagevmstate.all_visible_according_to_vm = true;
 		}
 
 		vacuum_delay_point();
 
 		/*
+		 * Step 2 for block: Consider if we definitely have enough space to
+		 * process TIDs on page already.
+		 *
 		 * If we are close to overrunning the available space for dead-tuple
 		 * TIDs, pause and do a cycle of vacuuming before we tackle this page.
 		 */
@@ -1090,24 +1163,18 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				vmbuffer = InvalidBuffer;
 			}
 
-			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(vacrel);
-
-			/* Remove tuples from heap */
-			lazy_vacuum_heap(vacrel);
-
-			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
-			 */
-			dead_tuples->num_tuples = 0;
+			/* Remove the collected garbage tuples from table and indexes */
+			lazy_vacuum_all_pruned_items(vacrel);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
 			 * upper-level FSM pages.  Note we have not yet processed blkno.
+			 * Even if we skipped heap vacuum, FSM vacuuming could be
+			 * worthwhile since we could have updated the freespace of empty
+			 * pages.
 			 */
-			FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum, blkno);
+			FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum,
+									blkno);
 			next_fsm_block_to_vacuum = blkno;
 
 			/* Report that we are once again scanning the heap */
@@ -1116,6 +1183,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		}
 
 		/*
+		 * Step 3 for block: Set up visibility map page as needed.
+		 *
 		 * Pin the visibility map page in case we need to mark the page
 		 * all-visible.  In most cases this will be very cheap, because we'll
 		 * already have the correct page pinned anyway.  However, it's
@@ -1128,9 +1197,15 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		buf = ReadBufferExtended(vacrel->onerel, MAIN_FORKNUM, blkno,
 								 RBM_NORMAL, vacrel->vac_strategy);
 
-		/* We need buffer cleanup lock so that we can prune HOT chains. */
+		/*
+		 * Step 4 for block: Acquire super-exclusive lock for pruning.
+		 *
+		 * We need buffer cleanup lock so that we can prune HOT chains.
+		 */
 		if (!ConditionalLockBufferForCleanup(buf))
 		{
+			bool		hastup;
+
 			/*
 			 * If we're not performing an aggressive scan to guard against XID
 			 * wraparound, and we don't want to forcibly check the page, then
@@ -1161,7 +1236,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 * to use lazy_check_needs_freeze() for both situations, though.
 			 */
 			LockBuffer(buf, BUFFER_LOCK_SHARE);
-			if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
+			if (!lazy_scan_needs_freeze(buf, &hastup, vacrel))
 			{
 				UnlockReleaseBuffer(buf);
 				vacrel->scanned_pages++;
@@ -1187,6 +1262,12 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			/* drop through to normal processing */
 		}
 
+		/*
+		 * Step 5 for block: Handle empty/new pages.
+		 *
+		 * By here we have a super-exclusive lock, and it's clear that this
+		 * page is one that we consider scanned
+		 */
 		vacrel->scanned_pages++;
 		vacrel->tupcount_pages++;
 
@@ -1194,401 +1275,92 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 
 		if (PageIsNew(page))
 		{
-			/*
-			 * All-zeroes pages can be left over if either a backend extends
-			 * the relation by a single page, but crashes before the newly
-			 * initialized page has been written out, or when bulk-extending
-			 * the relation (which creates a number of empty pages at the tail
-			 * end of the relation, but enters them into the FSM).
-			 *
-			 * Note we do not enter the page into the visibilitymap. That has
-			 * the downside that we repeatedly visit this page in subsequent
-			 * vacuums, but otherwise we'll never not discover the space on a
-			 * promoted standby. The harm of repeated checking ought to
-			 * normally not be too bad - the space usually should be used at
-			 * some point, otherwise there wouldn't be any regular vacuums.
-			 *
-			 * Make sure these pages are in the FSM, to ensure they can be
-			 * reused. Do that by testing if there's any space recorded for
-			 * the page. If not, enter it. We do so after releasing the lock
-			 * on the heap page, the FSM is approximate, after all.
-			 */
-			UnlockReleaseBuffer(buf);
-
 			empty_pages++;
-
-			if (GetRecordedFreeSpace(vacrel->onerel, blkno) == 0)
-			{
-				Size		freespace;
-
-				freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
-				RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
-			}
+			/* Releases lock on buf for us: */
+			lazy_scan_new_page(vacrel, buf);
 			continue;
 		}
-
-		if (PageIsEmpty(page))
+		else if (PageIsEmpty(page))
 		{
 			empty_pages++;
-			freespace = PageGetHeapFreeSpace(page);
-
-			/*
-			 * Empty pages are always all-visible and all-frozen (note that
-			 * the same is currently not true for new pages, see above).
-			 */
-			if (!PageIsAllVisible(page))
-			{
-				START_CRIT_SECTION();
-
-				/* mark buffer dirty before writing a WAL record */
-				MarkBufferDirty(buf);
-
-				/*
-				 * It's possible that another backend has extended the heap,
-				 * initialized the page, and then failed to WAL-log the page
-				 * due to an ERROR.  Since heap extension is not WAL-logged,
-				 * recovery might try to replay our record setting the page
-				 * all-visible and find that the page isn't initialized, which
-				 * will cause a PANIC.  To prevent that, check whether the
-				 * page has been previously WAL-logged, and if not, do that
-				 * now.
-				 */
-				if (RelationNeedsWAL(vacrel->onerel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
-					log_newpage_buffer(buf, true);
-
-				PageSetAllVisible(page);
-				visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
-								  vmbuffer, InvalidTransactionId,
-								  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
-				END_CRIT_SECTION();
-			}
-
-			UnlockReleaseBuffer(buf);
-			RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
+			/* Releases lock on buf for us (though keeps vmbuffer pin): */
+			lazy_scan_empty_page(vacrel, buf, vmbuffer);
 			continue;
 		}
 
 		/*
-		 * Prune all HOT-update chains in this page.
+		 * Step 6 for block: Do pruning.
 		 *
-		 * We count tuples removed by the pruning step as removed by VACUUM
-		 * (existing LP_DEAD line pointers don't count).
+		 * Also accumulates details of remaining LP_DEAD line pointers on page
+		 * in dead tuple list.  This includes LP_DEAD line pointers that we
+		 * ourselves just pruned, as well as existing LP_DEAD line pointers
+		 * pruned earlier.
+		 *
+		 * Also handles tuple freezing -- considers freezing XIDs from all
+		 * tuple headers left behind following pruning.
 		 */
-		tups_vacuumed += heap_page_prune(vacrel->onerel, buf, vistest,
-										 InvalidTransactionId, 0, false,
-										 &vacrel->latestRemovedXid,
-										 &vacrel->offnum);
+		lazy_prune_page_items(vacrel, buf, vistest, frozen, &scancounts,
+							  &pageprunestate, &pagevmstate,
+							  params->index_cleanup);
 
 		/*
-		 * Now scan the page to collect vacuumable items and check for tuples
-		 * requiring freezing.
+		 * Step 7 for block: Set up details for saving free space in FSM at
+		 * end of loop.  (Also performs extra single pass strategy steps in
+		 * "nindexes == 0" case.)
+		 *
+		 * If we have any LP_DEAD items on this page (i.e. any new dead_tuples
+		 * entries compared to just before lazy_prune_page_items()) then the
+		 * page will be visited again by lazy_vacuum_heap(), which will
+		 * compute and record its post-compaction free space.  If not, then
+		 * we're done with this page, so remember its free space as-is.
 		 */
-		all_visible = true;
-		has_dead_items = false;
-		nfrozen = 0;
-		hastup = false;
-		prev_dead_count = dead_tuples->num_tuples;
-		maxoff = PageGetMaxOffsetNumber(page);
-
-		/*
-		 * Note: If you change anything in the loop below, also look at
-		 * heap_page_is_all_visible to see if that needs to be changed.
-		 */
-		for (offnum = FirstOffsetNumber;
-			 offnum <= maxoff;
-			 offnum = OffsetNumberNext(offnum))
+		savefreespace = false;
+		freespace = 0;
+		if (vacrel->nindexes > 0 && pageprunestate.has_dead_items &&
+			vacrel->do_index_vacuuming)
 		{
-			ItemId		itemid;
-
 			/*
-			 * Set the offset number so that we can display it along with any
-			 * error that occurred while processing this tuple.
-			 */
-			vacrel->offnum = offnum;
-			itemid = PageGetItemId(page, offnum);
-
-			/* Unused items require no processing, but we count 'em */
-			if (!ItemIdIsUsed(itemid))
-			{
-				nunused += 1;
-				continue;
-			}
-
-			/* Redirect items mustn't be touched */
-			if (ItemIdIsRedirected(itemid))
-			{
-				hastup = true;	/* this page won't be truncatable */
-				continue;
-			}
-
-			ItemPointerSet(&(tuple.t_self), blkno, offnum);
-
-			/*
-			 * LP_DEAD line pointers are to be vacuumed normally; but we don't
-			 * count them in tups_vacuumed, else we'd be double-counting (at
-			 * least in the common case where heap_page_prune() just freed up
-			 * a non-HOT tuple).  Note also that the final tups_vacuumed value
-			 * might be very low for tables where opportunistic page pruning
-			 * happens to occur very frequently (via heap_page_prune_opt()
-			 * calls that free up non-HOT tuples).
-			 */
-			if (ItemIdIsDead(itemid))
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				all_visible = false;
-				has_dead_items = true;
-				continue;
-			}
-
-			Assert(ItemIdIsNormal(itemid));
-
-			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
-			tuple.t_len = ItemIdGetLength(itemid);
-			tuple.t_tableOid = RelationGetRelid(vacrel->onerel);
-
-			tupgone = false;
-
-			/*
-			 * The criteria for counting a tuple as live in this block need to
-			 * match what analyze.c's acquire_sample_rows() does, otherwise
-			 * VACUUM and ANALYZE may produce wildly different reltuples
-			 * values, e.g. when there are many recently-dead tuples.
+			 * Wait until lazy_vacuum_heap() to save free space.
 			 *
-			 * The logic here is a bit simpler than acquire_sample_rows(), as
-			 * VACUUM can't run inside a transaction block, which makes some
-			 * cases impossible (e.g. in-progress insert from the same
-			 * transaction).
+			 * Note: It's not in fact 100% certain that we really will call
+			 * lazy_vacuum_heap() -- lazy_vacuum_all_pruned_items() might opt
+			 * to skip index vacuuming (and so must skip heap vacuuming).
+			 * This is deemed okay because it only happens in emergencies.
 			 */
-			switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
-			{
-				case HEAPTUPLE_DEAD:
-
-					/*
-					 * Ordinarily, DEAD tuples would have been removed by
-					 * heap_page_prune(), but it's possible that the tuple
-					 * state changed since heap_page_prune() looked.  In
-					 * particular an INSERT_IN_PROGRESS tuple could have
-					 * changed to DEAD if the inserter aborted.  So this
-					 * cannot be considered an error condition.
-					 *
-					 * If the tuple is HOT-updated then it must only be
-					 * removed by a prune operation; so we keep it just as if
-					 * it were RECENTLY_DEAD.  Also, if it's a heap-only
-					 * tuple, we choose to keep it, because it'll be a lot
-					 * cheaper to get rid of it in the next pruning pass than
-					 * to treat it like an indexed tuple. Finally, if index
-					 * cleanup is disabled, the second heap pass will not
-					 * execute, and the tuple will not get removed, so we must
-					 * treat it like any other dead tuple that we choose to
-					 * keep.
-					 *
-					 * If this were to happen for a tuple that actually needed
-					 * to be deleted, we'd be in trouble, because it'd
-					 * possibly leave a tuple below the relation's xmin
-					 * horizon alive.  heap_prepare_freeze_tuple() is prepared
-					 * to detect that case and abort the transaction,
-					 * preventing corruption.
-					 */
-					if (HeapTupleIsHotUpdated(&tuple) ||
-						HeapTupleIsHeapOnly(&tuple) ||
-						params->index_cleanup == VACOPT_TERNARY_DISABLED)
-						nkeep += 1;
-					else
-						tupgone = true; /* we can delete the tuple */
-					all_visible = false;
-					break;
-				case HEAPTUPLE_LIVE:
-
-					/*
-					 * Count it as live.  Not only is this natural, but it's
-					 * also what acquire_sample_rows() does.
-					 */
-					live_tuples += 1;
-
-					/*
-					 * Is the tuple definitely visible to all transactions?
-					 *
-					 * NB: Like with per-tuple hint bits, we can't set the
-					 * PD_ALL_VISIBLE flag if the inserter committed
-					 * asynchronously. See SetHintBits for more info. Check
-					 * that the tuple is hinted xmin-committed because of
-					 * that.
-					 */
-					if (all_visible)
-					{
-						TransactionId xmin;
-
-						if (!HeapTupleHeaderXminCommitted(tuple.t_data))
-						{
-							all_visible = false;
-							break;
-						}
-
-						/*
-						 * The inserter definitely committed. But is it old
-						 * enough that everyone sees it as committed?
-						 */
-						xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-						if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
-						{
-							all_visible = false;
-							break;
-						}
-
-						/* Track newest xmin on page. */
-						if (TransactionIdFollows(xmin, visibility_cutoff_xid))
-							visibility_cutoff_xid = xmin;
-					}
-					break;
-				case HEAPTUPLE_RECENTLY_DEAD:
-
-					/*
-					 * If tuple is recently deleted then we must not remove it
-					 * from relation.
-					 */
-					nkeep += 1;
-					all_visible = false;
-					break;
-				case HEAPTUPLE_INSERT_IN_PROGRESS:
-
-					/*
-					 * This is an expected case during concurrent vacuum.
-					 *
-					 * We do not count these rows as live, because we expect
-					 * the inserting transaction to update the counters at
-					 * commit, and we assume that will happen only after we
-					 * report our results.  This assumption is a bit shaky,
-					 * but it is what acquire_sample_rows() does, so be
-					 * consistent.
-					 */
-					all_visible = false;
-					break;
-				case HEAPTUPLE_DELETE_IN_PROGRESS:
-					/* This is an expected case during concurrent vacuum */
-					all_visible = false;
-
-					/*
-					 * Count such rows as live.  As above, we assume the
-					 * deleting transaction will commit and update the
-					 * counters after we report.
-					 */
-					live_tuples += 1;
-					break;
-				default:
-					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
-					break;
-			}
-
-			if (tupgone)
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-													   &vacrel->latestRemovedXid);
-				tups_vacuumed += 1;
-				has_dead_items = true;
-			}
-			else
-			{
-				bool		tuple_totally_frozen;
-
-				num_tuples += 1;
-				hastup = true;
-
-				/*
-				 * Each non-removable tuple must be checked to see if it needs
-				 * freezing.  Note we already have exclusive buffer lock.
-				 */
-				if (heap_prepare_freeze_tuple(tuple.t_data,
-											  vacrel->relfrozenxid,
-											  vacrel->relminmxid,
-											  vacrel->FreezeLimit,
-											  vacrel->MultiXactCutoff,
-											  &frozen[nfrozen],
-											  &tuple_totally_frozen))
-					frozen[nfrozen++].offset = offnum;
-
-				if (!tuple_totally_frozen)
-					all_frozen = false;
-			}
-		}						/* scan along page */
-
-		/*
-		 * Clear the offset information once we have processed all the tuples
-		 * on the page.
-		 */
-		vacrel->offnum = InvalidOffsetNumber;
-
-		/*
-		 * If we froze any tuples, mark the buffer dirty, and write a WAL
-		 * record recording the changes.  We must log the changes to be
-		 * crash-safe against future truncation of CLOG.
-		 */
-		if (nfrozen > 0)
+		}
+		else
 		{
-			START_CRIT_SECTION();
-
-			MarkBufferDirty(buf);
-
-			/* execute collected freezes */
-			for (i = 0; i < nfrozen; i++)
-			{
-				ItemId		itemid;
-				HeapTupleHeader htup;
-
-				itemid = PageGetItemId(page, frozen[i].offset);
-				htup = (HeapTupleHeader) PageGetItem(page, itemid);
-
-				heap_execute_freeze_tuple(htup, &frozen[i]);
-			}
-
-			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(vacrel->onerel))
-			{
-				XLogRecPtr	recptr;
-
-				recptr = log_heap_freeze(vacrel->onerel, buf,
-										 vacrel->FreezeLimit, frozen, nfrozen);
-				PageSetLSN(page, recptr);
-			}
-
-			END_CRIT_SECTION();
+			/*
+			 * Will never reach lazy_vacuum_heap() (or will, but won't reach
+			 * this specific page)
+			 */
+			savefreespace = true;
+			freespace = PageGetHeapFreeSpace(page);
 		}
 
-		/*
-		 * If there are no indexes we can vacuum the page right now instead of
-		 * doing a second scan. Also we don't do that but forget dead tuples
-		 * when index cleanup is disabled.
-		 */
-		if (!vacrel->useindex && dead_tuples->num_tuples > 0)
+		if (vacrel->nindexes == 0 && pageprunestate.has_dead_items)
 		{
-			if (vacrel->nindexes == 0)
-			{
-				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_page(vacrel, blkno, buf, 0, &vmbuffer);
-				vacuumed_pages++;
-				has_dead_items = false;
-			}
-			else
-			{
-				/*
-				 * Here, we have indexes but index cleanup is disabled.
-				 * Instead of vacuuming the dead tuples on the heap, we just
-				 * forget them.
-				 *
-				 * Note that vacrelstats->dead_tuples could have tuples which
-				 * became dead after HOT-pruning but are not marked dead yet.
-				 * We do not process them because it's a very rare condition,
-				 * and the next vacuum will process them anyway.
-				 */
-				Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
-			}
+			Assert(dead_tuples->num_tuples > 0);
 
 			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
+			 * One pass strategy (no indexes) case.
+			 *
+			 * Mark LP_DEAD item pointers LP_UNUSED now, since there won't
+			 * be a second pass in lazy_vacuum_heap().
 			 */
+			lazy_vacuum_page(vacrel, blkno, buf, 0, &vmbuffer);
+			vacuumed_pages++;
+
+			/* This won't have changed: */
+			Assert(savefreespace && freespace == PageGetHeapFreeSpace(page));
+
+			/*
+			 * Make sure lazy_scan_setvmbit_page() won't stop setting VM due
+			 * to now-vacuumed LP_DEAD items:
+			 */
+			pageprunestate.has_dead_items = false;
+
+			/* Forget the now-vacuumed tuples */
 			dead_tuples->num_tuples = 0;
 
 			/*
@@ -1599,115 +1371,34 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
 			{
-				FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum,
-										blkno);
+				FreeSpaceMapVacuumRange(vacrel->onerel,
+										next_fsm_block_to_vacuum, blkno);
 				next_fsm_block_to_vacuum = blkno;
 			}
 		}
 
-		freespace = PageGetHeapFreeSpace(page);
-
-		/* mark page all-visible, if appropriate */
-		if (all_visible && !all_visible_according_to_vm)
-		{
-			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
-
-			if (all_frozen)
-				flags |= VISIBILITYMAP_ALL_FROZEN;
-
-			/*
-			 * It should never be the case that the visibility map page is set
-			 * while the page-level bit is clear, but the reverse is allowed
-			 * (if checksums are not enabled).  Regardless, set both bits so
-			 * that we get back in sync.
-			 *
-			 * NB: If the heap page is all-visible but the VM bit is not set,
-			 * we don't need to dirty the heap page.  However, if checksums
-			 * are enabled, we do need to make sure that the heap page is
-			 * dirtied before passing it to visibilitymap_set(), because it
-			 * may be logged.  Given that this situation should only happen in
-			 * rare cases after a crash, it is not worth optimizing.
-			 */
-			PageSetAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, visibility_cutoff_xid, flags);
-		}
+		/* One pass strategy had better have no dead tuples by now: */
+		Assert(vacrel->nindexes > 0 || dead_tuples->num_tuples == 0);
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after we checked it and before we took the buffer
-		 * content lock, so we must recheck before jumping to the conclusion
-		 * that something bad has happened.
+		 * Step 8 for block: Handle setting visibility map bit as appropriate
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(vacrel->onerel, blkno, &vmbuffer))
-		{
-			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
-				 vacrel->relname, blkno);
-			visibilitymap_clear(vacrel->onerel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
+		lazy_scan_setvmbit_page(vacrel, buf, vmbuffer, &pageprunestate,
+								&pagevmstate);
 
 		/*
-		 * It's possible for the value returned by
-		 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
-		 * wrong for us to see tuples that appear to not be visible to
-		 * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
-		 * xmin value never moves backwards, but
-		 * GetOldestNonRemovableTransactionId() is conservative and sometimes
-		 * returns a value that's unnecessarily small, so if we see that
-		 * contradiction it just means that the tuples that we think are not
-		 * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
-		 * is correct.
-		 *
-		 * There should never be dead tuples on a page with PD_ALL_VISIBLE
-		 * set, however.
+		 * Step 9 for block: drop super-exclusive lock, finalize page by
+		 * recording its free space in the FSM as appropriate
 		 */
-		else if (PageIsAllVisible(page) && has_dead_items)
-		{
-			elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
-				 vacrel->relname, blkno);
-			PageClearAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_clear(vacrel->onerel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
-
-		/*
-		 * If the all-visible page is all-frozen but not marked as such yet,
-		 * mark it as all-frozen.  Note that all_frozen is only valid if
-		 * all_visible is true, so we must check both.
-		 */
-		else if (all_visible_according_to_vm && all_visible && all_frozen &&
-				 !VM_ALL_FROZEN(vacrel->onerel, blkno, &vmbuffer))
-		{
-			/*
-			 * We can pass InvalidTransactionId as the cutoff XID here,
-			 * because setting the all-frozen bit doesn't cause recovery
-			 * conflicts.
-			 */
-			visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, InvalidTransactionId,
-							  VISIBILITYMAP_ALL_FROZEN);
-		}
 
 		UnlockReleaseBuffer(buf);
-
 		/* Remember the location of the last page with nonremovable tuples */
-		if (hastup)
+		if (pageprunestate.hastup)
 			vacrel->nonempty_pages = blkno + 1;
-
-		/*
-		 * If we remembered any tuples for deletion, then the page will be
-		 * visited again by lazy_vacuum_heap, which will compute and record
-		 * its post-compaction free space.  If not, then we're done with this
-		 * page, so remember its free space as-is.  (This path will always be
-		 * taken if there are no indexes.)
-		 */
-		if (dead_tuples->num_tuples == prev_dead_count)
+		if (savefreespace)
 			RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
+
+		/* Finished all steps for block by here (at the latest) */
 	}
 
 	/* report that everything is scanned and vacuumed */
@@ -1719,13 +1410,13 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	pfree(frozen);
 
 	/* save stats for use later */
-	vacrel->tuples_deleted = tups_vacuumed;
-	vacrel->new_dead_tuples = nkeep;
+	vacrel->tuples_deleted = scancounts.tups_vacuumed;
+	vacrel->new_dead_tuples = scancounts.nkeep;
 
 	/* now we can compute the new value for pg_class.reltuples */
 	vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->onerel, nblocks,
 													 vacrel->tupcount_pages,
-													 live_tuples);
+													 scancounts.live_tuples);
 
 	/*
 	 * Also compute the total number of surviving heap entries.  In the
@@ -1744,50 +1435,49 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	}
 
 	/* If any tuples need to be deleted, perform final vacuum cycle */
-	/* XXX put a threshold on min number of tuples here? */
+	Assert(vacrel->nindexes > 0 || dead_tuples->num_tuples == 0);
 	if (dead_tuples->num_tuples > 0)
-	{
-		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(vacrel);
-
-		/* Remove tuples from heap */
-		lazy_vacuum_heap(vacrel);
-	}
+		lazy_vacuum_all_pruned_items(vacrel);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
-	 * not there were indexes.
+	 * not there were indexes, and whether or not we skipped index vacuuming.
 	 */
 	if (blkno > next_fsm_block_to_vacuum)
-		FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum, blkno);
+		FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum,
+								blkno);
 
 	/* report all blocks vacuumed */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Do post-vacuum cleanup */
-	if (vacrel->useindex)
+	if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
 		lazy_cleanup_all_indexes(vacrel);
 
 	/* Free resources managed by lazy_space_alloc() */
 	lazy_space_free(vacrel);
 
 	/* Update index statistics */
-	if (vacrel->useindex)
+	if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
 		update_index_statistics(vacrel);
 
-	/* If no indexes, make log report that lazy_vacuum_heap would've made */
-	if (vacuumed_pages)
+	/*
+	 * If no indexes, make log report that lazy_vacuum_all_pruned_items()
+	 * would've made
+	 */
+	Assert(vacrel->nindexes == 0 || vacuumed_pages == 0);
+	if (vacrel->nindexes == 0)
 		ereport(elevel,
 				(errmsg("\"%s\": removed %.0f row versions in %u pages",
-						vacrel->relname,
-						tups_vacuumed, vacuumed_pages)));
+						vacrel->relname, vacrel->tuples_deleted,
+						vacuumed_pages)));
 
 	initStringInfo(&buf);
 	appendStringInfo(&buf,
 					 _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"),
-					 nkeep, vacrel->OldestXmin);
+					 scancounts.nkeep, vacrel->OldestXmin);
 	appendStringInfo(&buf, _("There were %.0f unused item identifiers.\n"),
-					 nunused);
+					 scancounts.nunused);
 	appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
 									"Skipped %u pages due to buffer pins, ",
 									vacrel->pinskipped_pages),
@@ -1803,23 +1493,22 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
 
 	ereport(elevel,
-			(errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
-					vacrel->relname,
-					tups_vacuumed, num_tuples,
-					vacrel->scanned_pages, nblocks),
+			(errmsg("\"%s\": newly pruned %.0f items, found %.0f nonremovable items in %u out of %u pages",
+					vacrel->relname, scancounts.tups_vacuumed,
+					scancounts.num_tuples, vacrel->scanned_pages, nblocks),
 			 errdetail_internal("%s", buf.data)));
 	pfree(buf.data);
 }
 
 /*
- *	lazy_check_needs_freeze() -- scan page to see if any tuples
- *					 need to be cleaned to avoid wraparound
+ *	lazy_scan_needs_freeze() -- see if any tuples need to be cleaned to avoid
+ *	wraparound
  *
  * Returns true if the page needs to be vacuumed using cleanup lock.
  * Also returns a flag indicating whether page contains any tuples at all.
  */
 static bool
-lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
+lazy_scan_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
 {
 	Page		page = BufferGetPage(buf);
 	OffsetNumber offnum,
@@ -1851,7 +1540,9 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
 		vacrel->offnum = offnum;
 		itemid = PageGetItemId(page, offnum);
 
-		/* this should match hastup test in count_nondeletable_pages() */
+		/*
+		 * This should match hastup test in lazy_truncate_count_nondeletable()
+		 */
 		if (ItemIdIsUsed(itemid))
 			*hastup = true;
 
@@ -1872,6 +1563,574 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
 	return (offnum <= maxoff);
 }
 
+/*
+ * Handle new page during lazy_scan_heap().
+ *
+ * Caller must hold pin and buffer cleanup lock on buf.
+ *
+ * All-zeroes pages can be left over if either a backend extends the relation
+ * by a single page, but crashes before the newly initialized page has been
+ * written out, or when bulk-extending the relation (which creates a number of
+ * empty pages at the tail end of the relation, but enters them into the FSM).
+ *
+ * Note we do not enter the page into the visibilitymap. That has the downside
+ * that we repeatedly visit this page in subsequent vacuums, but otherwise
+ * we'll never not discover the space on a promoted standby. The harm of
+ * repeated checking ought to normally not be too bad - the space usually
+ * should be used at some point, otherwise there wouldn't be any regular
+ * vacuums.
+ *
+ * Make sure these pages are in the FSM, to ensure they can be reused. Do that
+ * by testing if there's any space recorded for the page. If not, enter it. We
+ * do so after releasing the lock on the heap page, the FSM is approximate,
+ * after all.
+ */
+static void
+lazy_scan_new_page(LVRelState *vacrel, Buffer buf)
+{
+	Relation	onerel = vacrel->onerel;
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+
+	if (GetRecordedFreeSpace(onerel, blkno) == 0)
+	{
+		Size		freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
+
+		UnlockReleaseBuffer(buf);
+		RecordPageWithFreeSpace(onerel, blkno, freespace);
+		return;
+	}
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Handle empty page during lazy_scan_heap().
+ *
+ * Caller must hold pin and buffer cleanup lock on buf, as well as a pin (but
+ * not a lock) on vmbuffer.
+ */
+static void
+lazy_scan_empty_page(LVRelState *vacrel, Buffer buf, Buffer vmbuffer)
+{
+	Relation	onerel = vacrel->onerel;
+	Page		page = BufferGetPage(buf);
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+	Size		freespace = PageGetHeapFreeSpace(page);
+
+	/*
+	 * Empty pages are always all-visible and all-frozen (note that the same
+	 * is currently not true for new pages, see lazy_scan_new_page()).
+	 */
+	if (!PageIsAllVisible(page))
+	{
+		START_CRIT_SECTION();
+
+		/* mark buffer dirty before writing a WAL record */
+		MarkBufferDirty(buf);
+
+		/*
+		 * It's possible that another backend has extended the heap,
+		 * initialized the page, and then failed to WAL-log the page due to an
+		 * ERROR.  Since heap extension is not WAL-logged, recovery might try
+		 * to replay our record setting the page all-visible and find that the
+		 * page isn't initialized, which will cause a PANIC.  To prevent that,
+		 * check whether the page has been previously WAL-logged, and if not,
+		 * do that now.
+		 */
+		if (RelationNeedsWAL(onerel) &&
+			PageGetLSN(page) == InvalidXLogRecPtr)
+			log_newpage_buffer(buf, true);
+
+		PageSetAllVisible(page);
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, InvalidTransactionId,
+						  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
+		END_CRIT_SECTION();
+	}
+
+	UnlockReleaseBuffer(buf);
+	RecordPageWithFreeSpace(onerel, blkno, freespace);
+}
+
+/*
+ * Handle setting VM bit inside lazy_scan_heap(), after pruning and freezing.
+ */
+static void
+lazy_scan_setvmbit_page(LVRelState *vacrel, Buffer buf, Buffer vmbuffer,
+						LVPagePruneState *pageprunestate,
+						LVPageVisMapState *pagevmstate)
+{
+	Relation	onerel = vacrel->onerel;
+	Page		page = BufferGetPage(buf);
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+
+	/* mark page all-visible, if appropriate */
+	if (pageprunestate->all_visible &&
+		!pagevmstate->all_visible_according_to_vm)
+	{
+		uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
+
+		if (pageprunestate->all_frozen)
+			flags |= VISIBILITYMAP_ALL_FROZEN;
+
+		/*
+		 * It should never be the case that the visibility map page is set
+		 * while the page-level bit is clear, but the reverse is allowed (if
+		 * checksums are not enabled).  Regardless, set both bits so that we
+		 * get back in sync.
+		 *
+		 * NB: If the heap page is all-visible but the VM bit is not set, we
+		 * don't need to dirty the heap page.  However, if checksums are
+		 * enabled, we do need to make sure that the heap page is dirtied
+		 * before passing it to visibilitymap_set(), because it may be logged.
+		 * Given that this situation should only happen in rare cases after a
+		 * crash, it is not worth optimizing.
+		 */
+		PageSetAllVisible(page);
+		MarkBufferDirty(buf);
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+						  pagevmstate->visibility_cutoff_xid, flags);
+	}
+
+	/*
+	 * The visibility map bit should never be set if the page-level bit is
+	 * clear.  However, it's possible that the bit got cleared after we
+	 * checked it and before we took the buffer content lock, so we must
+	 * recheck before jumping to the conclusion that something bad has
+	 * happened.
+	 */
+	else if (pagevmstate->all_visible_according_to_vm &&
+			 !PageIsAllVisible(page) && VM_ALL_VISIBLE(onerel, blkno,
+													   &vmbuffer))
+	{
+		elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+			 RelationGetRelationName(onerel), blkno);
+		visibilitymap_clear(onerel, blkno, vmbuffer,
+							VISIBILITYMAP_VALID_BITS);
+	}
+
+	/*
+	 * It's possible for the value returned by
+	 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
+	 * wrong for us to see tuples that appear to not be visible to everyone
+	 * yet, while PD_ALL_VISIBLE is already set. The real safe xmin value
+	 * never moves backwards, but GetOldestNonRemovableTransactionId() is
+	 * conservative and sometimes returns a value that's unnecessarily small,
+	 * so if we see that contradiction it just means that the tuples that we
+	 * think are not visible to everyone yet actually are, and the
+	 * PD_ALL_VISIBLE flag is correct.
+	 *
+	 * There should never be dead tuples on a page with PD_ALL_VISIBLE set,
+	 * however.
+	 */
+	else if (PageIsAllVisible(page) && pageprunestate->has_dead_items)
+	{
+		elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+			 RelationGetRelationName(onerel), blkno);
+		PageClearAllVisible(page);
+		MarkBufferDirty(buf);
+		visibilitymap_clear(onerel, blkno, vmbuffer,
+							VISIBILITYMAP_VALID_BITS);
+	}
+
+	/*
+	 * If the all-visible page is all-frozen but not marked as such yet, mark
+	 * it as all-frozen.  Note that all_frozen is only valid if all_visible is
+	 * true, so we must check both.
+	 */
+	else if (pagevmstate->all_visible_according_to_vm &&
+			 pageprunestate->all_visible && pageprunestate->all_frozen &&
+			 !VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+	{
+		/*
+		 * We can pass InvalidTransactionId as the cutoff XID here, because
+		 * setting the all-frozen bit doesn't cause recovery conflicts.
+		 */
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, InvalidTransactionId,
+						  VISIBILITYMAP_ALL_FROZEN);
+	}
+}
+
+/*
+ *	lazy_prune_page_items() -- lazy_scan_heap() pruning and freezing.
+ *
+ * Caller must hold pin and buffer cleanup lock on the buffer.
+ */
+static void
+lazy_prune_page_items(LVRelState *vacrel, Buffer buf,
+					  GlobalVisState *vistest, xl_heap_freeze_tuple *frozen,
+					  LVTempCounters *scancounts,
+					  LVPagePruneState *pageprunestate,
+					  LVPageVisMapState *pagevmstate,
+					  VacOptTernaryValue index_cleanup)
+{
+	Relation	onerel = vacrel->onerel;
+	BlockNumber blkno;
+	Page		page;
+	OffsetNumber offnum,
+				maxoff;
+	int			nfrozen,
+				ndead;
+	LVTempCounters pagecounts;
+	OffsetNumber deaditems[MaxHeapTuplesPerPage];
+	bool		tupgone;
+
+	blkno = BufferGetBlockNumber(buf);
+	page = BufferGetPage(buf);
+
+	/* Initialize (or reset) page-level counters */
+	pagecounts.num_tuples = 0;
+	pagecounts.live_tuples = 0;
+	pagecounts.tups_vacuumed = 0;
+	pagecounts.nkeep = 0;
+	pagecounts.nunused = 0;
+
+	/*
+	 * Prune all HOT-update chains in this page.
+	 *
+	 * We count tuples removed by the pruning step as removed by VACUUM
+	 * (existing LP_DEAD line pointers don't count).
+	 */
+	pagecounts.tups_vacuumed = heap_page_prune(onerel, buf, vistest,
+											   InvalidTransactionId, 0, false,
+											   &vacrel->latestRemovedXid,
+											   &vacrel->offnum);
+
+	/*
+	 * Now scan the page to collect vacuumable items and check for tuples
+	 * requiring freezing.
+	 */
+	pageprunestate->hastup = false;
+	pageprunestate->has_dead_items = false;
+	pageprunestate->all_visible = true;
+	pageprunestate->all_frozen = true;
+	nfrozen = 0;
+	ndead = 0;
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	tupgone = false;
+
+	/*
+	 * Note: If you change anything in the loop below, also look at
+	 * heap_page_is_all_visible to see if that needs to be changed.
+	 */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+		HeapTupleData tuple;
+
+		/*
+		 * Set the offset number so that we can display it along with any
+		 * error that occurred while processing this tuple.
+		 */
+		vacrel->offnum = offnum;
+		itemid = PageGetItemId(page, offnum);
+
+		/* Unused items require no processing, but we count 'em */
+		if (!ItemIdIsUsed(itemid))
+		{
+			pagecounts.nunused += 1;
+			continue;
+		}
+
+		/* Redirect items mustn't be touched */
+		if (ItemIdIsRedirected(itemid))
+		{
+			pageprunestate->hastup = true;	/* page won't be truncatable */
+			continue;
+		}
+
+		/*
+		 * LP_DEAD line pointers are to be vacuumed normally; but we don't
+		 * count them in tups_vacuumed, else we'd be double-counting (at least
+		 * in the common case where heap_page_prune() just freed up a non-HOT
+		 * tuple).
+		 *
+		 * Note also that the final tups_vacuumed value might be very low for
+		 * tables where opportunistic page pruning happens to occur very
+		 * frequently (via heap_page_prune_opt() calls that free up non-HOT
+		 * tuples).
+		 */
+		if (ItemIdIsDead(itemid))
+		{
+			deaditems[ndead++] = offnum;
+			pageprunestate->all_visible = false;
+			pageprunestate->has_dead_items = true;
+			continue;
+		}
+
+		Assert(ItemIdIsNormal(itemid));
+
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationGetRelid(onerel);
+
+		/*
+		 * The criteria for counting a tuple as live in this block need to
+		 * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
+		 * and ANALYZE may produce wildly different reltuples values, e.g.
+		 * when there are many recently-dead tuples.
+		 *
+		 * The logic here is a bit simpler than acquire_sample_rows(), as
+		 * VACUUM can't run inside a transaction block, which makes some cases
+		 * impossible (e.g. in-progress insert from the same transaction).
+		 */
+		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+		{
+			case HEAPTUPLE_DEAD:
+
+				/*
+				 * Ordinarily, DEAD tuples would have been removed by
+				 * heap_page_prune(), but it's possible that the tuple state
+				 * changed since heap_page_prune() looked.  In particular an
+				 * INSERT_IN_PROGRESS tuple could have changed to DEAD if the
+				 * inserter aborted.  So this cannot be considered an error
+				 * condition.
+				 *
+				 * If the tuple is HOT-updated then it must only be removed by
+				 * a prune operation; so we keep it just as if it were
+				 * RECENTLY_DEAD.  Also, if it's a heap-only tuple, we choose
+				 * to keep it, because it'll be a lot cheaper to get rid of it
+				 * in the next pruning pass than to treat it like an indexed
+				 * tuple. Finally, if index cleanup is disabled, the second
+				 * heap pass will not execute, and the tuple will not get
+				 * removed, so we must treat it like any other dead tuple that
+				 * we choose to keep.
+				 *
+				 * If this were to happen for a tuple that actually needed to
+				 * be deleted, we'd be in trouble, because it'd possibly leave
+				 * a tuple below the relation's xmin horizon alive.
+				 * heap_prepare_freeze_tuple() is prepared to detect that case
+				 * and abort the transaction, preventing corruption.
+				 */
+				if (HeapTupleIsHotUpdated(&tuple) ||
+					HeapTupleIsHeapOnly(&tuple) ||
+					index_cleanup == VACOPT_TERNARY_DISABLED)
+					pagecounts.nkeep += 1;
+				else
+					tupgone = true; /* we can delete the tuple */
+				pageprunestate->all_visible = false;
+				break;
+			case HEAPTUPLE_LIVE:
+
+				/*
+				 * Count it as live.  Not only is this natural, but it's also
+				 * what acquire_sample_rows() does.
+				 */
+				pagecounts.live_tuples += 1;
+
+				/*
+				 * Is the tuple definitely visible to all transactions?
+				 *
+				 * NB: Like with per-tuple hint bits, we can't set the
+				 * PD_ALL_VISIBLE flag if the inserter committed
+				 * asynchronously. See SetHintBits for more info. Check that
+				 * the tuple is hinted xmin-committed because of that.
+				 */
+				if (pageprunestate->all_visible)
+				{
+					TransactionId xmin;
+
+					if (!HeapTupleHeaderXminCommitted(tuple.t_data))
+					{
+						pageprunestate->all_visible = false;
+						break;
+					}
+
+					/*
+					 * The inserter definitely committed. But is it old enough
+					 * that everyone sees it as committed?
+					 */
+					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
+					if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
+					{
+						pageprunestate->all_visible = false;
+						break;
+					}
+
+					/* Track newest xmin on page. */
+					if (TransactionIdFollows(xmin,
+											 pagevmstate->visibility_cutoff_xid))
+						pagevmstate->visibility_cutoff_xid = xmin;
+				}
+				break;
+			case HEAPTUPLE_RECENTLY_DEAD:
+
+				/*
+				 * If tuple is recently deleted then we must not remove it
+				 * from relation.
+				 */
+				pagecounts.nkeep += 1;
+				pageprunestate->all_visible = false;
+				break;
+			case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+				/*
+				 * This is an expected case during concurrent vacuum.
+				 *
+				 * We do not count these rows as live, because we expect the
+				 * inserting transaction to update the counters at commit, and
+				 * we assume that will happen only after we report our
+				 * results.  This assumption is a bit shaky, but it is what
+				 * acquire_sample_rows() does, so be consistent.
+				 */
+				pageprunestate->all_visible = false;
+				break;
+			case HEAPTUPLE_DELETE_IN_PROGRESS:
+				/* This is an expected case during concurrent vacuum */
+				pageprunestate->all_visible = false;
+
+				/*
+				 * Count such rows as live.  As above, we assume the deleting
+				 * transaction will commit and update the counters after we
+				 * report.
+				 */
+				pagecounts.live_tuples += 1;
+				break;
+			default:
+				elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+				break;
+		}
+
+		if (tupgone)
+		{
+			deaditems[ndead++] = offnum;
+			HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
+												   &vacrel->latestRemovedXid);
+			pagecounts.tups_vacuumed += 1;
+			pageprunestate->has_dead_items = true;
+		}
+		else
+		{
+			bool		tuple_totally_frozen;
+
+			/*
+			 * Each non-removable tuple must be checked to see if it needs
+			 * freezing
+			 */
+			if (heap_prepare_freeze_tuple(tuple.t_data,
+										  vacrel->relfrozenxid,
+										  vacrel->relminmxid,
+										  vacrel->FreezeLimit,
+										  vacrel->MultiXactCutoff,
+										  &frozen[nfrozen],
+										  &tuple_totally_frozen))
+				frozen[nfrozen++].offset = offnum;
+
+			pagecounts.num_tuples += 1;
+			pageprunestate->hastup = true;
+
+			if (!tuple_totally_frozen)
+				pageprunestate->all_frozen = false;
+		}
+	}
+
+	/*
+	 * Success -- we're done pruning, and have determined which tuples are to
+	 * be recorded as dead in local array.  We've also prepared the details of
+	 * which remaining tuples are to be frozen.
+	 *
+	 * First clear the offset information once we have processed all the
+	 * tuples on the page.
+	 */
+	vacrel->offnum = InvalidOffsetNumber;
+
+	/*
+	 * Next add page level counters to caller's counts
+	 */
+	scancounts->num_tuples += pagecounts.num_tuples;
+	scancounts->live_tuples += pagecounts.live_tuples;
+	scancounts->tups_vacuumed += pagecounts.tups_vacuumed;
+	scancounts->nkeep += pagecounts.nkeep;
+	scancounts->nunused += pagecounts.nunused;
+
+	/*
+	 * Now save the local dead items array to VACUUM's dead_tuples array.
+	 */
+	for (int i = 0; i < ndead; i++)
+	{
+		ItemPointerData itemptr;
+
+		ItemPointerSet(&itemptr, blkno, deaditems[i]);
+		lazy_record_dead_tuple(vacrel->dead_tuples, &itemptr);
+	}
+
+	/*
+	 * Finally, execute tuple freezing as planned.
+	 *
+	 * If we need to freeze any tuples we'll mark the buffer dirty, and write
+	 * a WAL record recording the changes.  We must log the changes to be
+	 * crash-safe against future truncation of CLOG.
+	 */
+	if (nfrozen > 0)
+	{
+		START_CRIT_SECTION();
+
+		MarkBufferDirty(buf);
+
+		/* execute collected freezes */
+		for (int i = 0; i < nfrozen; i++)
+		{
+			ItemId		itemid;
+			HeapTupleHeader htup;
+
+			itemid = PageGetItemId(page, frozen[i].offset);
+			htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+			heap_execute_freeze_tuple(htup, &frozen[i]);
+		}
+
+		/* Now WAL-log freezing if necessary */
+		if (RelationNeedsWAL(onerel))
+		{
+			XLogRecPtr	recptr;
+
+			recptr = log_heap_freeze(onerel, buf, vacrel->FreezeLimit, frozen,
+									 nfrozen);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+	}
+}
+
+/*
+ * Remove the collected garbage tuples from the table and its indexes.
+ */
+static void
+lazy_vacuum_all_pruned_items(LVRelState *vacrel)
+{
+	/* Should not end up here with no indexes */
+	Assert(vacrel->nindexes > 0);
+	Assert(!IsParallelWorker());
+
+	if (!vacrel->do_index_vacuuming)
+	{
+		/*
+		 * Just ignore second or subsequent calls when INDEX_CLEANUP off
+		 * was specified
+		 */
+		Assert(!vacrel->do_index_cleanup);
+		vacrel->dead_tuples->num_tuples = 0;
+		return;
+	}
+
+	/* Okay, we're going to do index vacuuming */
+	lazy_vacuum_all_indexes(vacrel);
+
+	/* Remove tuples from heap */
+	lazy_vacuum_heap(vacrel);
+
+	/*
+	 * Forget the now-vacuumed tuples -- just press on
+	 */
+	vacrel->dead_tuples->num_tuples = 0;
+}
+
 /*
  *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
  *
@@ -2106,6 +2365,9 @@ lazy_vacuum_heap(LVRelState *vacrel)
 	Buffer		vmbuffer = InvalidBuffer;
 	LVSavedErrInfo saved_err_info;
 
+	Assert(vacrel->do_index_vacuuming);
+	Assert(vacrel->do_index_cleanup);
+
 	/* Report that we are now vacuuming the heap */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
@@ -2185,6 +2447,8 @@ lazy_vacuum_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
 
+	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
+
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Update error traceback information */
@@ -2428,7 +2692,7 @@ lazy_truncate_heap(LVRelState *vacrel)
 		 * other backends could have added tuples to these pages whilst we
 		 * were vacuuming.
 		 */
-		new_rel_pages = count_nondeletable_pages(vacrel);
+		new_rel_pages = lazy_truncate_count_nondeletable(vacrel);
 		vacrel->blkno = new_rel_pages;
 
 		if (new_rel_pages >= old_rel_pages)
@@ -2477,7 +2741,7 @@ lazy_truncate_heap(LVRelState *vacrel)
  * Returns number of nondeletable pages (last nonempty page + 1).
  */
 static BlockNumber
-count_nondeletable_pages(LVRelState *vacrel)
+lazy_truncate_count_nondeletable(LVRelState *vacrel)
 {
 	Relation	onerel = vacrel->onerel;
 	BlockNumber blkno;
@@ -2817,7 +3081,8 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 
 	/*
 	 * This is a stripped down version of the line pointer scan in
-	 * lazy_scan_heap(). So if you change anything here, also check that code.
+	 * lazy_prune_page_items(). So if you change anything here, also check
+	 * that code.
 	 */
 	maxoff = PageGetMaxOffsetNumber(page);
 	for (offnum = FirstOffsetNumber;
@@ -2863,7 +3128,7 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 				{
 					TransactionId xmin;
 
-					/* Check comments in lazy_scan_heap. */
+					/* Check comments in lazy_prune_page_items() */
 					if (!HeapTupleHeaderXminCommitted(tuple.t_data))
 					{
 						all_visible = false;
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index dd0c124e62..6bfc48c64a 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -756,10 +756,10 @@ tuple_all_visible(HeapTuple tup, TransactionId OldestXmin, Buffer buffer)
 		return false;			/* all-visible implies live */
 
 	/*
-	 * Neither lazy_scan_heap nor heap_page_is_all_visible will mark a page
-	 * all-visible unless every tuple is hinted committed. However, those hint
-	 * bits could be lost after a crash, so we can't be certain that they'll
-	 * be set here.  So just check the xmin.
+	 * Neither lazy_scan_heap/lazy_prune_page_items nor heap_page_is_all_visible
+	 * will mark a page all-visible unless every tuple is hinted committed.
+	 * However, those hint bits could be lost after a crash, so we can't be
+	 * certain that they'll be set here.  So just check the xmin.
 	 */
 
 	xmin = HeapTupleHeaderGetXmin(tup->t_data);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 1fe193bb25..adf4a61aac 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -58,8 +58,8 @@ typedef struct output_type
  * and approximate tuple_len on that basis. For the others, we count
  * the exact number of dead tuples etc.
  *
- * This scan is loosely based on vacuumlazy.c:lazy_scan_heap(), but
- * we do not try to avoid skipping single pages.
+ * This scan is loosely based on vacuumlazy.c:lazy_scan_heap and
+ * lazy_prune_page_items, but we do not try to avoid skipping single pages.
  */
 static void
 statapprox_heap(Relation rel, output_type *stat)
@@ -126,8 +126,9 @@ statapprox_heap(Relation rel, output_type *stat)
 
 		/*
 		 * Look at each tuple on the page and decide whether it's live or
-		 * dead, then count it and its size. Unlike lazy_scan_heap, we can
-		 * afford to ignore problems and special cases.
+		 * dead, then count it and its size. Unlike lazy_scan_heap and
+		 * lazy_prune_page_items, we can afford to ignore problems and special
+		 * cases.
 		 */
 		maxoff = PageGetMaxOffsetNumber(page);
 
-- 
2.27.0

In reply to: Peter Geoghegan (#88)
4 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Thu, Mar 25, 2021 at 6:58 PM Peter Geoghegan <pg@bowt.ie> wrote:

> Attached is v7, which takes the last two patches from your v6 and
> rebases them on top of my recent work.

And now here's v8, which has the following additional cleanup:

* Added useful log_autovacuum output.

This should provide DBAs with a useful tool for seeing how effective
this optimization is. But I think that they'll also end up using it to
monitor things like how effective HOT is with certain tables over
time. If regular autovacuums indicate that there is no need to do
index vacuuming, then HOT must be working well, whereas if autovacuums
continually require index vacuuming, it might well be taken as a sign
that heap fill factor should be reduced. There are complicated reasons
why HOT might not work quite as well as expected, and having
near-real-time insight into it strikes me as valuable.

* Added this assertion to the patch that removes the tupgone special
case, which seems really useful to me:

@@ -2421,6 +2374,12 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 		vmbuffer = InvalidBuffer;
 	}
 
+	/*
+	 * We set all LP_DEAD items from the first heap pass to LP_UNUSED during
+	 * the second heap pass.  No more, no less.
+	 */
+	Assert(vacrel->num_index_scans > 1 || tupindex == vacrel->lpdead_items);
+
 	ereport(elevel,
 			(errmsg("\"%s\": removed %d dead item identifiers in %u pages",
 					vacrel->relname, tupindex, vacuumed_pages),

This assertion verifies that the number of items that we have vacuumed
in a second pass of the heap precisely matches the number of LP_DEAD
items encountered in the first pass of the heap. Of course, these
LP_DEAD items are now exactly the same thing as dead_tuples array TIDs
that we vacuum/remove from indexes, before finally vacuuming/removing
them from the heap.
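
To make that invariant concrete, here's a toy, self-contained C sketch. It
is not code from the patch -- only the names dead_tuples, lpdead_items and
tupindex echo it, and the page layout is invented -- but it shows how the
first heap pass, index vacuuming, and the second heap pass relate:

/*
 * Toy illustration of the "no more, no less" invariant (not PostgreSQL
 * code; the page contents and counters here are made up for the example).
 */
#include <assert.h>
#include <stdio.h>

#define NITEMS 100

enum lp_state { LP_NORMAL, LP_DEAD, LP_UNUSED };

int
main(void)
{
	enum lp_state page[NITEMS];
	int		dead_tuples[NITEMS];	/* stand-in for the dead_tuples TID array */
	int		lpdead_items = 0;		/* counted during the first heap pass */
	int		tupindex = 0;			/* advanced during the second heap pass */

	/* First heap pass ("pruning"): pretend every third item ends up LP_DEAD */
	for (int i = 0; i < NITEMS; i++)
	{
		page[i] = (i % 3 == 0) ? LP_DEAD : LP_NORMAL;
		if (page[i] == LP_DEAD)
			dead_tuples[lpdead_items++] = i;
	}

	/* (Index vacuuming would delete exactly these TIDs from every index.) */

	/* Second heap pass: set each recorded item LP_UNUSED */
	for (; tupindex < lpdead_items; tupindex++)
		page[dead_tuples[tupindex]] = LP_UNUSED;

	/* The assertion above, in miniature: no more, no less */
	assert(tupindex == lpdead_items);
	printf("removed %d dead item identifiers\n", tupindex);
	return 0;
}

Just an illustration of the accounting, of course -- the real code works
with ItemPointers and per-page offsets rather than plain array indexes.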

* A lot more polishing in the first patch, which refactors the
vacuumlazy.c state quite a bit. I now use int64 instead of double for
some of the counters, which enables various assertions, including the
one I mentioned.

The instrumentation state in vacuumlazy.c has always been a mess. I
spotted a bug in the process of cleaning it up, at this point:

	/* If no indexes, make log report that lazy_vacuum_heap would've made */
	if (vacuumed_pages)
		ereport(elevel,
				(errmsg("\"%s\": removed %.0f row versions in %u pages",
						vacrelstats->relname,
						tups_vacuumed, vacuumed_pages)));

This is wrong because lazy_vacuum_heap() doesn't report tups_vacuumed.
It actually reports what I'm calling lpdead_items, which can have a
very different value to tups_vacuumed/tuples_deleted.
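
To spell out how the two counters diverge, here's a tiny standalone sketch
(again not patch code; the numbers are invented). LP_DEAD stubs left behind
by opportunistic pruning before VACUUM started count as dead item
identifiers, but deliberately aren't counted as tuples deleted by this
VACUUM:

#include <stdio.h>

int
main(void)
{
	long	tuples_deleted = 0; /* tuples this VACUUM pruned away itself */
	long	lpdead_items = 0;	/* dead item identifiers to delete from indexes */

	/* 5 dead non-HOT tuples pruned by this VACUUM: each is counted as
	 * deleted, and each leaves an LP_DEAD stub behind */
	tuples_deleted += 5;
	lpdead_items += 5;

	/* 20 LP_DEAD stubs already left by earlier opportunistic pruning
	 * (heap_page_prune_opt): only the dead-item count grows */
	lpdead_items += 20;

	printf("removed %ld row versions, %ld dead item identifiers\n",
		   tuples_deleted, lpdead_items);
	return 0;
}

So a table that gets lots of opportunistic pruning can legitimately report
far more dead item identifiers removed than row versions removed.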

--
Peter Geoghegan

Attachments:

v8-0001-Centralize-state-for-each-VACUUM.patch
From 12b730bd2a18264421621d4d3afb64d3f732851d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 28 Mar 2021 20:55:54 -0700
Subject: [PATCH v8 1/4] Centralize state for each VACUUM.

Simplify function signatures inside vacuumlazy.c by putting several
frequently used variables in a per-VACUUM state variable.  This makes
the general control flow easier to follow, and reduces clutter.

Also refactor the parallel VACUUM code.
---
 src/include/access/genam.h           |    4 +-
 src/backend/access/heap/vacuumlazy.c | 2204 +++++++++++++-------------
 src/backend/access/index/indexam.c   |    8 +-
 3 files changed, 1144 insertions(+), 1072 deletions(-)

diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 4515401869..480a4762f5 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -177,11 +177,11 @@ extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
-												IndexBulkDeleteResult *stats,
+												IndexBulkDeleteResult *istat,
 												IndexBulkDeleteCallback callback,
 												void *callback_state);
 extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
-												   IndexBulkDeleteResult *stats);
+												   IndexBulkDeleteResult *istat);
 extern bool index_can_return(Relation indexRelation, int attno);
 extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
 									uint16 procnum);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index efe8761702..9c1cfe42e1 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -142,12 +142,6 @@
 #define PARALLEL_VACUUM_KEY_BUFFER_USAGE	4
 #define PARALLEL_VACUUM_KEY_WAL_USAGE		5
 
-/*
- * Macro to check if we are in a parallel vacuum.  If true, we are in the
- * parallel mode and the DSM segment is initialized.
- */
-#define ParallelVacuumIsActive(lps) PointerIsValid(lps)
-
 /* Phases of vacuum during which we report error context. */
 typedef enum
 {
@@ -160,9 +154,10 @@ typedef enum
 } VacErrPhase;
 
 /*
- * LVDeadTuples stores the dead tuple TIDs collected during the heap scan.
- * This is allocated in the DSM segment in parallel mode and in local memory
- * in non-parallel mode.
+ * LVDeadTuples stores TIDs that are gathered during pruning/the initial heap
+ * scan.  These get deleted from indexes during index vacuuming.  They're then
+ * removed from the heap during a second heap pass that performs heap
+ * vacuuming.
  */
 typedef struct LVDeadTuples
 {
@@ -191,7 +186,7 @@ typedef struct LVShared
 	 * Target table relid and log level.  These fields are not modified during
 	 * the lazy vacuum.
 	 */
-	Oid			relid;
+	Oid			onereloid;
 	int			elevel;
 
 	/*
@@ -264,7 +259,7 @@ typedef struct LVShared
 typedef struct LVSharedIndStats
 {
 	bool		updated;		/* are the stats updated? */
-	IndexBulkDeleteResult stats;
+	IndexBulkDeleteResult istat;
 } LVSharedIndStats;
 
 /* Struct for maintaining a parallel vacuum state. */
@@ -290,41 +285,71 @@ typedef struct LVParallelState
 	int			nindexes_parallel_condcleanup;
 } LVParallelState;
 
-typedef struct LVRelStats
+typedef struct LVRelState
 {
-	char	   *relnamespace;
-	char	   *relname;
+	/* Target heap relation and its indexes */
+	Relation	onerel;
+	Relation   *indrels;
+	int			nindexes;
 	/* useindex = true means two-pass strategy; false means one-pass */
 	bool		useindex;
-	/* Overall statistics about rel */
+
+	/* Buffer access strategy and parallel state */
+	BufferAccessStrategy bstrategy;
+	LVParallelState *lps;
+
+	/* Statistics from pg_class when we start out */
 	BlockNumber old_rel_pages;	/* previous value of pg_class.relpages */
-	BlockNumber rel_pages;		/* total number of pages */
-	BlockNumber scanned_pages;	/* number of pages we examined */
-	BlockNumber pinskipped_pages;	/* # of pages we skipped due to a pin */
-	BlockNumber frozenskipped_pages;	/* # of frozen pages we skipped */
-	BlockNumber tupcount_pages; /* pages whose tuples we counted */
 	double		old_live_tuples;	/* previous value of pg_class.reltuples */
-	double		new_rel_tuples; /* new estimated total # of tuples */
-	double		new_live_tuples;	/* new estimated total # of live tuples */
-	double		new_dead_tuples;	/* new estimated total # of dead tuples */
-	BlockNumber pages_removed;
-	double		tuples_deleted;
-	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
-	LVDeadTuples *dead_tuples;
-	int			num_index_scans;
+	/* onerel's initial relfrozenxid and relminmxid */
+	TransactionId relfrozenxid;
+	MultiXactId relminmxid;
 	TransactionId latestRemovedXid;
-	bool		lock_waiter_detected;
 
-	/* Statistics about indexes */
-	IndexBulkDeleteResult **indstats;
-	int			nindexes;
+	/* VACUUM operation's cutoff for pruning */
+	TransactionId OldestXmin;
+	/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
+	TransactionId FreezeLimit;
+	MultiXactId MultiXactCutoff;
 
-	/* Used for error callback */
+	/* Error reporting state */
+	char	   *relnamespace;
+	char	   *relname;
 	char	   *indname;
 	BlockNumber blkno;			/* used only for heap operations */
 	OffsetNumber offnum;		/* used only for heap operations */
 	VacErrPhase phase;
-} LVRelStats;
+
+	/*
+	 * State managed by lazy_scan_heap() follows
+	 */
+	LVDeadTuples *dead_tuples;	/* items to vacuum from indexes */
+	BlockNumber rel_pages;		/* total number of pages */
+	BlockNumber scanned_pages;	/* number of pages we examined */
+	BlockNumber pinskipped_pages;	/* # of pages skipped due to a pin */
+	BlockNumber frozenskipped_pages;	/* # of frozen pages we skipped */
+	BlockNumber tupcount_pages; /* pages whose tuples we counted */
+	BlockNumber pages_removed;	/* pages removed by truncation */
+	BlockNumber lpdead_item_pages;	/* total number of pages with dead items */
+	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+	bool		lock_waiter_detected;
+
+	/* Statistics output by us, for table */
+	double		new_rel_tuples; /* new estimated total # of tuples */
+	double		new_live_tuples;	/* new estimated total # of live tuples */
+	/* Statistics output by index AMs */
+	IndexBulkDeleteResult **indstats;
+
+	/* Instrumentation counters */
+	int			num_index_scans;
+	int64		tuples_deleted; /* # deleted from table */
+	int64		lpdead_items;	/* # deleted from indexes */
+	int64		new_dead_tuples;	/* new estimated total # of dead items in
+									 * table */
+	int64		num_tuples;		/* total number of nonremovable tuples */
+	int64		live_tuples;	/* live tuples (reltuples estimate) */
+	int64		nunused;		/* # existing unused line pointers */
+} LVRelState;
 
 /* Struct for saving and restoring vacuum error information. */
 typedef struct LVSavedErrInfo
@@ -334,77 +359,72 @@ typedef struct LVSavedErrInfo
 	VacErrPhase phase;
 } LVSavedErrInfo;
 
-/* A few variables that don't seem worth passing around as parameters */
+/* elevel controls whole VACUUM's verbosity */
 static int	elevel = -1;
 
-static TransactionId OldestXmin;
-static TransactionId FreezeLimit;
-static MultiXactId MultiXactCutoff;
-
-static BufferAccessStrategy vac_strategy;
-
 
 /* non-export function prototypes */
-static void lazy_scan_heap(Relation onerel, VacuumParams *params,
-						   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
+static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
 						   bool aggressive);
-static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
-									LVRelStats *vacrelstats);
-static void lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
-									LVRelStats *vacrelstats, LVParallelState *lps,
-									int nindexes);
-static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
-							  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats);
-static void lazy_cleanup_index(Relation indrel,
-							   IndexBulkDeleteResult **stats,
-							   double reltuples, bool estimated_count, LVRelStats *vacrelstats);
-static int	lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-							 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
-static bool should_attempt_truncation(VacuumParams *params,
-									  LVRelStats *vacrelstats);
-static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
-static BlockNumber count_nondeletable_pages(Relation onerel,
-											LVRelStats *vacrelstats);
-static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
+									LVRelState *vacrel);
+static void lazy_vacuum_all_indexes(LVRelState *vacrel);
+static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
+													IndexBulkDeleteResult *istat,
+													double reltuples,
+													LVRelState *vacrel);
+static void lazy_cleanup_all_indexes(LVRelState *vacrel);
+static IndexBulkDeleteResult *lazy_cleanup_one_index(Relation indrel,
+													 IndexBulkDeleteResult *istat,
+													 double reltuples,
+													 bool estimated_count,
+													 LVRelState *vacrel);
+static void lazy_vacuum_heap_rel(LVRelState *vacrel);
+static int	lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+								  Buffer buffer, int tupindex, Buffer *vmbuffer);
+static void update_index_statistics(LVRelState *vacrel);
+static bool should_attempt_truncation(LVRelState *vacrel,
+									  VacuumParams *params);
+static void lazy_truncate_heap(LVRelState *vacrel);
 static void lazy_record_dead_tuple(LVDeadTuples *dead_tuples,
 								   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
 static int	vac_cmp_itemptr(const void *left, const void *right);
-static bool heap_page_is_all_visible(Relation rel, Buffer buf,
-									 LVRelStats *vacrelstats,
+static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
-static void lazy_parallel_vacuum_indexes(Relation *Irel, LVRelStats *vacrelstats,
-										 LVParallelState *lps, int nindexes);
-static void parallel_vacuum_index(Relation *Irel, LVShared *lvshared,
-								  LVDeadTuples *dead_tuples, int nindexes,
-								  LVRelStats *vacrelstats);
-static void vacuum_indexes_leader(Relation *Irel, LVRelStats *vacrelstats,
-								  LVParallelState *lps, int nindexes);
-static void vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
-							 LVShared *lvshared, LVSharedIndStats *shared_indstats,
-							 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats);
-static void lazy_cleanup_all_indexes(Relation *Irel, LVRelStats *vacrelstats,
-									 LVParallelState *lps, int nindexes);
+static BlockNumber count_nondeletable_pages(LVRelState *vacrel);
 static long compute_max_dead_tuples(BlockNumber relblocks, bool hasindex);
-static int	compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
+static void lazy_space_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber relblocks);
+static void lazy_space_free(LVRelState *vacrel);
+static int	compute_parallel_vacuum_workers(LVRelState *vacrel,
+											int nrequested,
 											bool *can_parallel_vacuum);
-static void prepare_index_statistics(LVShared *lvshared, bool *can_parallel_vacuum,
-									 int nindexes);
-static void update_index_statistics(Relation *Irel, IndexBulkDeleteResult **stats,
-									int nindexes);
-static LVParallelState *begin_parallel_vacuum(Oid relid, Relation *Irel,
-											  LVRelStats *vacrelstats, BlockNumber nblocks,
-											  int nindexes, int nrequested);
-static void end_parallel_vacuum(IndexBulkDeleteResult **stats,
-								LVParallelState *lps, int nindexes);
-static LVSharedIndStats *get_indstats(LVShared *lvshared, int n);
-static bool skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared);
+static LVParallelState *begin_parallel_vacuum(LVRelState *vacrel,
+											  BlockNumber nblocks,
+											  int nrequested);
+static void end_parallel_vacuum(LVRelState *vacrel);
+static void do_parallel_lazy_vacuum_all_indexes(LVRelState *vacrel);
+static void do_parallel_lazy_cleanup_all_indexes(LVRelState *vacrel);
+static void do_parallel_vacuum_or_cleanup(LVRelState *vacrel, int nworkers);
+static void do_parallel_processing(LVRelState *vacrel,
+								   LVShared *lvshared);
+static void do_serial_processing_for_unsafe_indexes(LVRelState *vacrel,
+													LVShared *lvshared);
+static IndexBulkDeleteResult *parallel_process_one_index(Relation indrel,
+														 IndexBulkDeleteResult *istat,
+														 LVShared *lvshared,
+														 LVSharedIndStats *shared_indstats,
+														 LVRelState *vacrel);
+static LVSharedIndStats *parallel_stats_for_idx(LVShared *lvshared, int getidx);
+static bool parallel_processing_is_safe(Relation indrel, LVShared *lvshared);
 static void vacuum_error_callback(void *arg);
-static void update_vacuum_error_info(LVRelStats *errinfo, LVSavedErrInfo *saved_err_info,
+static void update_vacuum_error_info(LVRelState *vacrel,
+									 LVSavedErrInfo *saved_vacrel,
 									 int phase, BlockNumber blkno,
 									 OffsetNumber offnum);
-static void restore_vacuum_error_info(LVRelStats *errinfo, const LVSavedErrInfo *saved_err_info);
+static void restore_vacuum_error_info(LVRelState *vacrel,
+									  const LVSavedErrInfo *saved_vacrel);
 
 
 /*
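(For orientation while reading the hunks that follow: the patch folds the old LVRelStats struct and the removed file-level statics above into a single LVRelState that is passed to every routine. The struct definition itself is earlier in the patch and is not quoted in this excerpt, so the sketch below is an abridged reconstruction assembled only from the vacrel-> field references visible below, with types inferred from usage; member order, grouping, and completeness will differ from the real definition.)

typedef struct LVRelState
{
	/* Target heap relation and its indexes */
	Relation	onerel;
	Relation   *indrels;
	int			nindexes;
	bool		useindex;		/* do index vacuuming/cleanup? */
	BufferAccessStrategy bstrategy; /* replaces the old vac_strategy static */
	LVParallelState *lps;		/* parallel vacuum state, NULL when serial */

	/* onerel's initial relfrozenxid/relminmxid */
	TransactionId relfrozenxid;
	MultiXactId relminmxid;
	TransactionId latestRemovedXid;

	/* Cutoffs for the entire VACUUM (formerly file-level statics) */
	TransactionId OldestXmin;
	TransactionId FreezeLimit;
	MultiXactId MultiXactCutoff;

	/* Error-reporting state used by vacuum_error_callback() */
	char	   *relnamespace;
	char	   *relname;
	char	   *indname;
	BlockNumber blkno;
	OffsetNumber offnum;
	VacErrPhase phase;

	/* Dead tuple TIDs and per-index bulkdelete/cleanup results */
	LVDeadTuples *dead_tuples;
	IndexBulkDeleteResult **indstats;

	/* Page counts and instrumentation counters (widths inferred) */
	BlockNumber old_rel_pages;
	double		old_live_tuples;
	BlockNumber rel_pages;
	BlockNumber scanned_pages;
	BlockNumber pinskipped_pages;
	BlockNumber frozenskipped_pages;
	BlockNumber tupcount_pages;
	BlockNumber pages_removed;
	BlockNumber lpdead_item_pages;
	BlockNumber nonempty_pages;
	bool		lock_waiter_detected;
	double		new_rel_tuples;
	double		new_live_tuples;
	int			num_index_scans;
	int64		tuples_deleted;
	int64		lpdead_items;
	int64		new_dead_tuples;
	int64		num_tuples;
	int64		live_tuples;
	int64		nunused;
	/* ... the real definition likely has additional members ... */
} LVRelState;
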
@@ -420,9 +440,7 @@ void
 heap_vacuum_rel(Relation onerel, VacuumParams *params,
 				BufferAccessStrategy bstrategy)
 {
-	LVRelStats *vacrelstats;
-	Relation   *Irel;
-	int			nindexes;
+	LVRelState *vacrel;
 	PGRUsage	ru0;
 	TimestampTz starttime = 0;
 	WalUsage	walusage_start = pgWalUsage;
@@ -444,15 +462,14 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	ErrorContextCallback errcallback;
 	PgStat_Counter startreadtime = 0;
 	PgStat_Counter startwritetime = 0;
+	TransactionId OldestXmin;
+	TransactionId FreezeLimit;
+	MultiXactId MultiXactCutoff;
 
 	Assert(params != NULL);
 	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
-	/* not every AM requires these to be valid, but heap does */
-	Assert(TransactionIdIsNormal(onerel->rd_rel->relfrozenxid));
-	Assert(MultiXactIdIsValid(onerel->rd_rel->relminmxid));
-
 	/* measure elapsed time iff autovacuum logging requires it */
 	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
 	{
@@ -473,8 +490,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM,
 								  RelationGetRelid(onerel));
 
-	vac_strategy = bstrategy;
-
 	vacuum_set_xid_limits(onerel,
 						  params->freeze_min_age,
 						  params->freeze_table_age,
@@ -496,35 +511,40 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 		aggressive = true;
 
-	vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
+	vacrel = (LVRelState *) palloc0(sizeof(LVRelState));
 
-	vacrelstats->relnamespace = get_namespace_name(RelationGetNamespace(onerel));
-	vacrelstats->relname = pstrdup(RelationGetRelationName(onerel));
-	vacrelstats->indname = NULL;
-	vacrelstats->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
-	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
-	vacrelstats->old_live_tuples = onerel->rd_rel->reltuples;
-	vacrelstats->num_index_scans = 0;
-	vacrelstats->pages_removed = 0;
-	vacrelstats->lock_waiter_detected = false;
+	/* Set up high-level state about onerel */
+	vacrel->onerel = onerel;
+	vac_open_indexes(vacrel->onerel, RowExclusiveLock, &vacrel->nindexes,
+					 &vacrel->indrels);
+	vacrel->useindex = (vacrel->nindexes > 0 &&
+						params->index_cleanup == VACOPT_TERNARY_ENABLED);
+	vacrel->bstrategy = bstrategy;
+	vacrel->lps = NULL;			/* for now */
+	vacrel->old_rel_pages = onerel->rd_rel->relpages;
+	vacrel->old_live_tuples = onerel->rd_rel->reltuples;
+	vacrel->relfrozenxid = onerel->rd_rel->relfrozenxid;
+	vacrel->relminmxid = onerel->rd_rel->relminmxid;
+	vacrel->latestRemovedXid = InvalidTransactionId;
 
-	/* Open all indexes of the relation */
-	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
-	vacrelstats->useindex = (nindexes > 0 &&
-							 params->index_cleanup == VACOPT_TERNARY_ENABLED);
+	/* Set cutoffs for entire VACUUM */
+	vacrel->OldestXmin = OldestXmin;
+	vacrel->FreezeLimit = FreezeLimit;
+	vacrel->MultiXactCutoff = MultiXactCutoff;
 
-	vacrelstats->indstats = (IndexBulkDeleteResult **)
-		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
-	vacrelstats->nindexes = nindexes;
+	vacrel->relnamespace = get_namespace_name(RelationGetNamespace(onerel));
+	vacrel->relname = pstrdup(RelationGetRelationName(onerel));
+	vacrel->indname = NULL;
+	vacrel->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
 
 	/* Save index names iff autovacuum logging requires it */
-	if (IsAutoVacuumWorkerProcess() &&
-		params->log_min_duration >= 0 &&
-		vacrelstats->nindexes > 0)
+	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0 &&
+		vacrel->nindexes > 0)
 	{
-		indnames = palloc(sizeof(char *) * vacrelstats->nindexes);
-		for (int i = 0; i < vacrelstats->nindexes; i++)
-			indnames[i] = pstrdup(RelationGetRelationName(Irel[i]));
+		indnames = palloc(sizeof(char *) * vacrel->nindexes);
+		for (int i = 0; i < vacrel->nindexes; i++)
+			indnames[i] =
+				pstrdup(RelationGetRelationName(vacrel->indrels[i]));
 	}
 
 	/*
@@ -539,15 +559,15 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	 * information is restored at the end of those phases.
 	 */
 	errcallback.callback = vacuum_error_callback;
-	errcallback.arg = vacrelstats;
+	errcallback.arg = vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
 
 	/* Do the vacuuming */
-	lazy_scan_heap(onerel, params, vacrelstats, Irel, nindexes, aggressive);
+	lazy_scan_heap(vacrel, params, aggressive);
 
 	/* Done with indexes */
-	vac_close_indexes(nindexes, Irel, NoLock);
+	vac_close_indexes(vacrel->nindexes, vacrel->indrels, NoLock);
 
 	/*
 	 * Compute whether we actually scanned all the unfrozen pages. If we did,
@@ -556,8 +576,8 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	 * NB: We need to check this before truncating the relation, because that
 	 * will change ->rel_pages.
 	 */
-	if ((vacrelstats->scanned_pages + vacrelstats->frozenskipped_pages)
-		< vacrelstats->rel_pages)
+	if ((vacrel->scanned_pages + vacrel->frozenskipped_pages)
+		< vacrel->rel_pages)
 	{
 		Assert(!aggressive);
 		scanned_all_unfrozen = false;
@@ -568,17 +588,17 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	/*
 	 * Optionally truncate the relation.
 	 */
-	if (should_attempt_truncation(params, vacrelstats))
+	if (should_attempt_truncation(vacrel, params))
 	{
 		/*
 		 * Update error traceback information.  This is the last phase during
 		 * which we add context information to errors, so we don't need to
 		 * revert to the previous phase.
 		 */
-		update_vacuum_error_info(vacrelstats, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
-								 vacrelstats->nonempty_pages,
+		update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
+								 vacrel->nonempty_pages,
 								 InvalidOffsetNumber);
-		lazy_truncate_heap(onerel, vacrelstats);
+		lazy_truncate_heap(vacrel);
 	}
 
 	/* Pop the error context stack */
@@ -602,8 +622,8 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	 * Also, don't change relfrozenxid/relminmxid if we skipped any pages,
 	 * since then we don't know for certain that all tuples have a newer xmin.
 	 */
-	new_rel_pages = vacrelstats->rel_pages;
-	new_live_tuples = vacrelstats->new_live_tuples;
+	new_rel_pages = vacrel->rel_pages;
+	new_live_tuples = vacrel->new_live_tuples;
 
 	visibilitymap_count(onerel, &new_rel_allvisible, NULL);
 	if (new_rel_allvisible > new_rel_pages)
@@ -616,7 +636,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 						new_rel_pages,
 						new_live_tuples,
 						new_rel_allvisible,
-						nindexes > 0,
+						vacrel->nindexes > 0,
 						new_frozen_xid,
 						new_min_multi,
 						false);
@@ -625,7 +645,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	pgstat_report_vacuum(RelationGetRelid(onerel),
 						 onerel->rd_rel->relisshared,
 						 Max(new_live_tuples, 0),
-						 vacrelstats->new_dead_tuples);
+						 vacrel->new_dead_tuples);
 	pgstat_progress_end_command();
 
 	/* and log the action if appropriate */
@@ -676,39 +696,39 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 			}
 			appendStringInfo(&buf, msgfmt,
 							 get_database_name(MyDatabaseId),
-							 vacrelstats->relnamespace,
-							 vacrelstats->relname,
-							 vacrelstats->num_index_scans);
+							 vacrel->relnamespace,
+							 vacrel->relname,
+							 vacrel->num_index_scans);
 			appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"),
-							 vacrelstats->pages_removed,
-							 vacrelstats->rel_pages,
-							 vacrelstats->pinskipped_pages,
-							 vacrelstats->frozenskipped_pages);
+							 vacrel->pages_removed,
+							 vacrel->rel_pages,
+							 vacrel->pinskipped_pages,
+							 vacrel->frozenskipped_pages);
 			appendStringInfo(&buf,
-							 _("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable, oldest xmin: %u\n"),
-							 vacrelstats->tuples_deleted,
-							 vacrelstats->new_rel_tuples,
-							 vacrelstats->new_dead_tuples,
+							 _("tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"),
+							 (long long) vacrel->tuples_deleted,
+							 (long long) vacrel->new_rel_tuples,
+							 (long long) vacrel->new_dead_tuples,
 							 OldestXmin);
 			appendStringInfo(&buf,
 							 _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
 							 (long long) VacuumPageHit,
 							 (long long) VacuumPageMiss,
 							 (long long) VacuumPageDirty);
-			for (int i = 0; i < vacrelstats->nindexes; i++)
+			for (int i = 0; i < vacrel->nindexes; i++)
 			{
-				IndexBulkDeleteResult *stats = vacrelstats->indstats[i];
+				IndexBulkDeleteResult *istat = vacrel->indstats[i];
 
-				if (!stats)
+				if (!istat)
 					continue;
 
 				appendStringInfo(&buf,
 								 _("index \"%s\": pages: %u in total, %u newly deleted, %u currently deleted, %u reusable\n"),
 								 indnames[i],
-								 stats->num_pages,
-								 stats->pages_newly_deleted,
-								 stats->pages_deleted,
-								 stats->pages_free);
+								 istat->num_pages,
+								 istat->pages_newly_deleted,
+								 istat->pages_deleted,
+								 istat->pages_free);
 			}
 			appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
 							 read_rate, write_rate);
@@ -737,10 +757,10 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	}
 
 	/* Cleanup index statistics and index names */
-	for (int i = 0; i < vacrelstats->nindexes; i++)
+	for (int i = 0; i < vacrel->nindexes; i++)
 	{
-		if (vacrelstats->indstats[i])
-			pfree(vacrelstats->indstats[i]);
+		if (vacrel->indstats[i])
+			pfree(vacrel->indstats[i]);
 
 		if (indnames && indnames[i])
 			pfree(indnames[i]);
@@ -764,20 +784,21 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
  * which would be after the rows have become inaccessible.
  */
 static void
-vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
+vacuum_log_cleanup_info(LVRelState *vacrel)
 {
 	/*
 	 * Skip this for relations for which no WAL is to be written, or if we're
 	 * not trying to support archive recovery.
 	 */
-	if (!RelationNeedsWAL(rel) || !XLogIsNeeded())
+	if (!RelationNeedsWAL(vacrel->onerel) || !XLogIsNeeded())
 		return;
 
 	/*
 	 * No need to write the record at all unless it contains a valid value
 	 */
-	if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
-		(void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+	if (TransactionIdIsValid(vacrel->latestRemovedXid))
+		(void) log_heap_cleanup_info(vacrel->onerel->rd_node,
+									 vacrel->latestRemovedXid);
 }
 
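(The update_vacuum_error_info()/restore_vacuum_error_info() pair, now taking LVRelState per the prototypes above, keeps the existing save/restore idiom for the error-context callback. A minimal, purely illustrative sketch of that idiom follows; example_phase is a made-up name, and the phase constant is simply one of the existing VACUUM_ERRCB_PHASE_* values used below.)

static void
example_phase(LVRelState *vacrel, BlockNumber blkno)
{
	LVSavedErrInfo saved_err_info;

	/* Save the caller's phase/position, then advertise ours */
	update_vacuum_error_info(vacrel, &saved_err_info,
							 VACUUM_ERRCB_PHASE_VACUUM_HEAP,
							 blkno, InvalidOffsetNumber);

	/* ... work whose errors should be reported against this phase/block ... */

	/* Put back whatever phase/position the caller had advertised */
	restore_vacuum_error_info(vacrel, &saved_err_info);
}
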
 /*
@@ -788,9 +809,9 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
  *		page, and set commit status bits (see heap_page_prune).  It also builds
  *		lists of dead tuples and pages with free space, calculates statistics
  *		on the number of live tuples in the heap, and marks pages as
- *		all-visible if appropriate.  When done, or when we run low on space for
- *		dead-tuple TIDs, invoke vacuuming of indexes and call lazy_vacuum_heap
- *		to reclaim dead line pointers.
+ *		all-visible if appropriate.  When done, or when we run low on space
+ *		for dead-tuple TIDs, invoke vacuuming of indexes and reclaim dead line
+ *		pointers.
  *
  *		If the table has at least two indexes, we execute both index vacuum
  *		and index cleanup with parallel workers unless parallel vacuum is
@@ -809,16 +830,12 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
  *		reference them have been killed.
  */
 static void
-lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
-			   Relation *Irel, int nindexes, bool aggressive)
+lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 {
-	LVParallelState *lps = NULL;
 	LVDeadTuples *dead_tuples;
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
-	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
-	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
 				vacuumed_pages,
 				next_fsm_block_to_vacuum;
@@ -847,63 +864,51 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	if (aggressive)
 		ereport(elevel,
 				(errmsg("aggressively vacuuming \"%s.%s\"",
-						vacrelstats->relnamespace,
-						vacrelstats->relname)));
+						vacrel->relnamespace,
+						vacrel->relname)));
 	else
 		ereport(elevel,
 				(errmsg("vacuuming \"%s.%s\"",
-						vacrelstats->relnamespace,
-						vacrelstats->relname)));
+						vacrel->relnamespace,
+						vacrel->relname)));
 
 	empty_pages = vacuumed_pages = 0;
 	next_fsm_block_to_vacuum = (BlockNumber) 0;
 	num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
 
-	nblocks = RelationGetNumberOfBlocks(onerel);
-	vacrelstats->rel_pages = nblocks;
-	vacrelstats->scanned_pages = 0;
-	vacrelstats->tupcount_pages = 0;
-	vacrelstats->nonempty_pages = 0;
-	vacrelstats->latestRemovedXid = InvalidTransactionId;
+	nblocks = RelationGetNumberOfBlocks(vacrel->onerel);
+	next_unskippable_block = 0;
+	next_fsm_block_to_vacuum = 0;
+	vacrel->rel_pages = nblocks;
+	vacrel->scanned_pages = 0;
+	vacrel->pinskipped_pages = 0;
+	vacrel->frozenskipped_pages = 0;
+	vacrel->tupcount_pages = 0;
+	vacrel->pages_removed = 0;
+	vacrel->lpdead_item_pages = 0;
+	vacrel->nonempty_pages = 0;
+	vacrel->lock_waiter_detected = false;
 
-	vistest = GlobalVisTestFor(onerel);
+	/* Initialize instrumentation counters */
+	vacrel->num_index_scans = 0;
+	vacrel->tuples_deleted = 0;
+	vacrel->lpdead_items = 0;
+	vacrel->new_dead_tuples = 0;
+	vacrel->num_tuples = 0;
+	vacrel->live_tuples = 0;
+	vacrel->nunused = 0;
 
-	/*
-	 * Initialize state for a parallel vacuum.  As of now, only one worker can
-	 * be used for an index, so we invoke parallelism only if there are at
-	 * least two indexes on a table.
-	 */
-	if (params->nworkers >= 0 && vacrelstats->useindex && nindexes > 1)
-	{
-		/*
-		 * Since parallel workers cannot access data in temporary tables, we
-		 * can't perform parallel vacuum on them.
-		 */
-		if (RelationUsesLocalBuffers(onerel))
-		{
-			/*
-			 * Give warning only if the user explicitly tries to perform a
-			 * parallel vacuum on the temporary table.
-			 */
-			if (params->nworkers > 0)
-				ereport(WARNING,
-						(errmsg("disabling parallel option of vacuum on \"%s\" --- cannot vacuum temporary tables in parallel",
-								vacrelstats->relname)));
-		}
-		else
-			lps = begin_parallel_vacuum(RelationGetRelid(onerel), Irel,
-										vacrelstats, nblocks, nindexes,
-										params->nworkers);
-	}
+	vistest = GlobalVisTestFor(vacrel->onerel);
+
+	vacrel->indstats = (IndexBulkDeleteResult **)
+		palloc0(vacrel->nindexes * sizeof(IndexBulkDeleteResult *));
 
-	/*
-	 * Allocate the space for dead tuples in case parallel vacuum is not
-	 * initialized.
-	 */
+	/*
+	 * Allocate the space for dead tuples.  Note that this handles parallel
+	 * VACUUM initialization as part of allocating the dead tuples space.
+	 */
-	if (!ParallelVacuumIsActive(lps))
-		lazy_space_alloc(vacrelstats, nblocks);
-
-	dead_tuples = vacrelstats->dead_tuples;
+	lazy_space_alloc(vacrel, params->nworkers, nblocks);
+	dead_tuples = vacrel->dead_tuples;
 	frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
 
 	/* Report that we're scanning the heap, advertising total # of blocks */
@@ -956,14 +961,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * the last page.  This is worth avoiding mainly because such a lock must
 	 * be replayed on any hot standby, where it can be disruptive.
 	 */
-	next_unskippable_block = 0;
 	if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
 	{
 		while (next_unskippable_block < nblocks)
 		{
 			uint8		vmstatus;
 
-			vmstatus = visibilitymap_get_status(onerel, next_unskippable_block,
+			vmstatus = visibilitymap_get_status(vacrel->onerel,
+												next_unskippable_block,
 												&vmbuffer);
 			if (aggressive)
 			{
@@ -1004,11 +1009,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		/* see note above about forcing scanning of last page */
 #define FORCE_CHECK_PAGE() \
-		(blkno == nblocks - 1 && should_attempt_truncation(params, vacrelstats))
+		(blkno == nblocks - 1 && should_attempt_truncation(vacrel, params))
 
 		pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
-		update_vacuum_error_info(vacrelstats, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
+		update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
 								 blkno, InvalidOffsetNumber);
 
 		if (blkno == next_unskippable_block)
@@ -1021,7 +1026,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				{
 					uint8		vmskipflags;
 
-					vmskipflags = visibilitymap_get_status(onerel,
+					vmskipflags = visibilitymap_get_status(vacrel->onerel,
 														   next_unskippable_block,
 														   &vmbuffer);
 					if (aggressive)
@@ -1053,7 +1058,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * it's not all-visible.  But in an aggressive vacuum we know only
 			 * that it's not all-frozen, so it might still be all-visible.
 			 */
-			if (aggressive && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
+			if (aggressive && VM_ALL_VISIBLE(vacrel->onerel, blkno,
+											 &vmbuffer))
 				all_visible_according_to_vm = true;
 		}
 		else
@@ -1077,8 +1083,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * know whether it was all-frozen, so we have to recheck; but
 				 * in this case an approximate answer is OK.
 				 */
-				if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
-					vacrelstats->frozenskipped_pages++;
+				if (aggressive || VM_ALL_FROZEN(vacrel->onerel, blkno,
+												&vmbuffer))
+					vacrel->frozenskipped_pages++;
 				continue;
 			}
 			all_visible_according_to_vm = true;
@@ -1106,10 +1113,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(onerel, Irel, vacrelstats, lps, nindexes);
+			lazy_vacuum_all_indexes(vacrel);
 
 			/* Remove tuples from heap */
-			lazy_vacuum_heap(onerel, vacrelstats);
+			lazy_vacuum_heap_rel(vacrel);
 
 			/*
 			 * Forget the now-vacuumed tuples, and press on, but be careful
@@ -1122,7 +1129,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * Vacuum the Free Space Map to make newly-freed space visible on
 			 * upper-level FSM pages.  Note we have not yet processed blkno.
 			 */
-			FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
+			FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum, blkno);
 			next_fsm_block_to_vacuum = blkno;
 
 			/* Report that we are once again scanning the heap */
@@ -1137,12 +1144,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * possible that (a) next_unskippable_block is covered by a different
 		 * VM page than the current block or (b) we released our pin and did a
 		 * cycle of index vacuuming.
-		 *
 		 */
-		visibilitymap_pin(onerel, blkno, &vmbuffer);
+		visibilitymap_pin(vacrel->onerel, blkno, &vmbuffer);
 
-		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
-								 RBM_NORMAL, vac_strategy);
+		buf = ReadBufferExtended(vacrel->onerel, MAIN_FORKNUM, blkno,
+								 RBM_NORMAL, vacrel->bstrategy);
 
 		/* We need buffer cleanup lock so that we can prune HOT chains. */
 		if (!ConditionalLockBufferForCleanup(buf))
@@ -1156,7 +1162,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			if (!aggressive && !FORCE_CHECK_PAGE())
 			{
 				ReleaseBuffer(buf);
-				vacrelstats->pinskipped_pages++;
+				vacrel->pinskipped_pages++;
 				continue;
 			}
 
@@ -1177,13 +1183,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * to use lazy_check_needs_freeze() for both situations, though.
 			 */
 			LockBuffer(buf, BUFFER_LOCK_SHARE);
-			if (!lazy_check_needs_freeze(buf, &hastup, vacrelstats))
+			if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
 			{
 				UnlockReleaseBuffer(buf);
-				vacrelstats->scanned_pages++;
-				vacrelstats->pinskipped_pages++;
+				vacrel->scanned_pages++;
+				vacrel->pinskipped_pages++;
 				if (hastup)
-					vacrelstats->nonempty_pages = blkno + 1;
+					vacrel->nonempty_pages = blkno + 1;
 				continue;
 			}
 			if (!aggressive)
@@ -1193,9 +1199,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * to claiming that the page contains no freezable tuples.
 				 */
 				UnlockReleaseBuffer(buf);
-				vacrelstats->pinskipped_pages++;
+				vacrel->pinskipped_pages++;
 				if (hastup)
-					vacrelstats->nonempty_pages = blkno + 1;
+					vacrel->nonempty_pages = blkno + 1;
 				continue;
 			}
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
@@ -1203,8 +1209,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			/* drop through to normal processing */
 		}
 
-		vacrelstats->scanned_pages++;
-		vacrelstats->tupcount_pages++;
+		vacrel->scanned_pages++;
+		vacrel->tupcount_pages++;
 
 		page = BufferGetPage(buf);
 
@@ -1233,12 +1239,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 			empty_pages++;
 
-			if (GetRecordedFreeSpace(onerel, blkno) == 0)
+			if (GetRecordedFreeSpace(vacrel->onerel, blkno) == 0)
 			{
 				Size		freespace;
 
 				freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
-				RecordPageWithFreeSpace(onerel, blkno, freespace);
+				RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
 			}
 			continue;
 		}
@@ -1269,19 +1275,19 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (RelationNeedsWAL(vacrel->onerel) &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
-				visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+				visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
 								  vmbuffer, InvalidTransactionId,
 								  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
 				END_CRIT_SECTION();
 			}
 
 			UnlockReleaseBuffer(buf);
-			RecordPageWithFreeSpace(onerel, blkno, freespace);
+			RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
 			continue;
 		}
 
@@ -1291,10 +1297,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * We count tuples removed by the pruning step as removed by VACUUM
 		 * (existing LP_DEAD line pointers don't count).
 		 */
-		tups_vacuumed += heap_page_prune(onerel, buf, vistest,
+		tups_vacuumed += heap_page_prune(vacrel->onerel, buf, vistest,
 										 InvalidTransactionId, 0, false,
-										 &vacrelstats->latestRemovedXid,
-										 &vacrelstats->offnum);
+										 &vacrel->latestRemovedXid,
+										 &vacrel->offnum);
 
 		/*
 		 * Now scan the page to collect vacuumable items and check for tuples
@@ -1321,7 +1327,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * Set the offset number so that we can display it along with any
 			 * error that occurred while processing this tuple.
 			 */
-			vacrelstats->offnum = offnum;
+			vacrel->offnum = offnum;
 			itemid = PageGetItemId(page, offnum);
 
 			/* Unused items require no processing, but we count 'em */
@@ -1361,7 +1367,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
 			tuple.t_len = ItemIdGetLength(itemid);
-			tuple.t_tableOid = RelationGetRelid(onerel);
+			tuple.t_tableOid = RelationGetRelid(vacrel->onerel);
 
 			tupgone = false;
 
@@ -1376,7 +1382,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * cases impossible (e.g. in-progress insert from the same
 			 * transaction).
 			 */
-			switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+			switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
 			{
 				case HEAPTUPLE_DEAD:
 
@@ -1446,7 +1452,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 						 * enough that everyone sees it as committed?
 						 */
 						xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-						if (!TransactionIdPrecedes(xmin, OldestXmin))
+						if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
 						{
 							all_visible = false;
 							break;
@@ -1500,7 +1506,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			{
 				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
 				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-													   &vacrelstats->latestRemovedXid);
+													   &vacrel->latestRemovedXid);
 				tups_vacuumed += 1;
 				has_dead_items = true;
 			}
@@ -1516,8 +1522,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * freezing.  Note we already have exclusive buffer lock.
 				 */
 				if (heap_prepare_freeze_tuple(tuple.t_data,
-											  relfrozenxid, relminmxid,
-											  FreezeLimit, MultiXactCutoff,
+											  vacrel->relfrozenxid,
+											  vacrel->relminmxid,
+											  vacrel->FreezeLimit,
+											  vacrel->MultiXactCutoff,
 											  &frozen[nfrozen],
 											  &tuple_totally_frozen))
 					frozen[nfrozen++].offset = offnum;
@@ -1531,7 +1539,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * Clear the offset information once we have processed all the tuples
 		 * on the page.
 		 */
-		vacrelstats->offnum = InvalidOffsetNumber;
+		vacrel->offnum = InvalidOffsetNumber;
 
 		/*
 		 * If we froze any tuples, mark the buffer dirty, and write a WAL
@@ -1557,12 +1565,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (RelationNeedsWAL(vacrel->onerel))
 			{
 				XLogRecPtr	recptr;
 
-				recptr = log_heap_freeze(onerel, buf, FreezeLimit,
-										 frozen, nfrozen);
+				recptr = log_heap_freeze(vacrel->onerel, buf,
+										 vacrel->FreezeLimit, frozen, nfrozen);
 				PageSetLSN(page, recptr);
 			}
 
@@ -1574,12 +1582,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * doing a second scan. Also we don't do that but forget dead tuples
 		 * when index cleanup is disabled.
 		 */
-		if (!vacrelstats->useindex && dead_tuples->num_tuples > 0)
+		if (!vacrel->useindex && dead_tuples->num_tuples > 0)
 		{
-			if (nindexes == 0)
+			if (vacrel->nindexes == 0)
 			{
 				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
 				vacuumed_pages++;
 				has_dead_items = false;
 			}
@@ -1589,11 +1597,6 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * Here, we have indexes but index cleanup is disabled.
 				 * Instead of vacuuming the dead tuples on the heap, we just
 				 * forget them.
-				 *
-				 * Note that vacrelstats->dead_tuples could have tuples which
-				 * became dead after HOT-pruning but are not marked dead yet.
-				 * We do not process them because it's a very rare condition,
-				 * and the next vacuum will process them anyway.
 				 */
 				Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
 			}
@@ -1613,7 +1616,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
 			{
-				FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum,
+				FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum,
 										blkno);
 				next_fsm_block_to_vacuum = blkno;
 			}
@@ -1644,7 +1647,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			PageSetAllVisible(page);
 			MarkBufferDirty(buf);
-			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+			visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, visibility_cutoff_xid, flags);
 		}
 
@@ -1656,11 +1659,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * that something bad has happened.
 		 */
 		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
+				 && VM_ALL_VISIBLE(vacrel->onerel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
-				 vacrelstats->relname, blkno);
-			visibilitymap_clear(onerel, blkno, vmbuffer,
+				 vacrel->relname, blkno);
+			visibilitymap_clear(vacrel->onerel, blkno, vmbuffer,
 								VISIBILITYMAP_VALID_BITS);
 		}
 
@@ -1682,10 +1685,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		else if (PageIsAllVisible(page) && has_dead_items)
 		{
 			elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
-				 vacrelstats->relname, blkno);
+				 vacrel->relname, blkno);
 			PageClearAllVisible(page);
 			MarkBufferDirty(buf);
-			visibilitymap_clear(onerel, blkno, vmbuffer,
+			visibilitymap_clear(vacrel->onerel, blkno, vmbuffer,
 								VISIBILITYMAP_VALID_BITS);
 		}
 
@@ -1695,14 +1698,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * all_visible is true, so we must check both.
 		 */
 		else if (all_visible_according_to_vm && all_visible && all_frozen &&
-				 !VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+				 !VM_ALL_FROZEN(vacrel->onerel, blkno, &vmbuffer))
 		{
 			/*
 			 * We can pass InvalidTransactionId as the cutoff XID here,
 			 * because setting the all-frozen bit doesn't cause recovery
 			 * conflicts.
 			 */
-			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+			visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, InvalidTransactionId,
 							  VISIBILITYMAP_ALL_FROZEN);
 		}
@@ -1711,43 +1714,42 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		/* Remember the location of the last page with nonremovable tuples */
 		if (hastup)
-			vacrelstats->nonempty_pages = blkno + 1;
+			vacrel->nonempty_pages = blkno + 1;
 
 		/*
 		 * If we remembered any tuples for deletion, then the page will be
-		 * visited again by lazy_vacuum_heap, which will compute and record
+		 * visited again by lazy_vacuum_heap_rel, which will compute and record
 		 * its post-compaction free space.  If not, then we're done with this
 		 * page, so remember its free space as-is.  (This path will always be
 		 * taken if there are no indexes.)
 		 */
 		if (dead_tuples->num_tuples == prev_dead_count)
-			RecordPageWithFreeSpace(onerel, blkno, freespace);
+			RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
 	}
 
 	/* report that everything is scanned and vacuumed */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
 	/* Clear the block number information */
-	vacrelstats->blkno = InvalidBlockNumber;
+	vacrel->blkno = InvalidBlockNumber;
 
 	pfree(frozen);
 
 	/* save stats for use later */
-	vacrelstats->tuples_deleted = tups_vacuumed;
-	vacrelstats->new_dead_tuples = nkeep;
+	vacrel->tuples_deleted = tups_vacuumed;
+	vacrel->new_dead_tuples = nkeep;
 
 	/* now we can compute the new value for pg_class.reltuples */
-	vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
-														  nblocks,
-														  vacrelstats->tupcount_pages,
-														  live_tuples);
+	vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->onerel, nblocks,
+													 vacrel->tupcount_pages,
+													 live_tuples);
 
 	/*
 	 * Also compute the total number of surviving heap entries.  In the
 	 * (unlikely) scenario that new_live_tuples is -1, take it as zero.
 	 */
-	vacrelstats->new_rel_tuples =
-		Max(vacrelstats->new_live_tuples, 0) + vacrelstats->new_dead_tuples;
+	vacrel->new_rel_tuples =
+		Max(vacrel->new_live_tuples, 0) + vacrel->new_dead_tuples;
 
 	/*
 	 * Release any remaining pin on visibility map page.
@@ -1763,10 +1765,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	if (dead_tuples->num_tuples > 0)
 	{
 		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(onerel, Irel, vacrelstats, lps, nindexes);
+		lazy_vacuum_all_indexes(vacrel);
 
 		/* Remove tuples from heap */
-		lazy_vacuum_heap(onerel, vacrelstats);
+		lazy_vacuum_heap_rel(vacrel);
 	}
 
 	/*
@@ -1774,47 +1776,44 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * not there were indexes.
 	 */
 	if (blkno > next_fsm_block_to_vacuum)
-		FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
+		FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum,
+								blkno);
 
 	/* report all blocks vacuumed */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Do post-vacuum cleanup */
-	if (vacrelstats->useindex)
-		lazy_cleanup_all_indexes(Irel, vacrelstats, lps, nindexes);
+	if (vacrel->useindex)
+		lazy_cleanup_all_indexes(vacrel);
 
-	/*
-	 * End parallel mode before updating index statistics as we cannot write
-	 * during parallel mode.
-	 */
-	if (ParallelVacuumIsActive(lps))
-		end_parallel_vacuum(vacrelstats->indstats, lps, nindexes);
+	/* Free resources managed by lazy_space_alloc() */
+	lazy_space_free(vacrel);
 
 	/* Update index statistics */
-	if (vacrelstats->useindex)
-		update_index_statistics(Irel, vacrelstats->indstats, nindexes);
+	if (vacrel->useindex)
+		update_index_statistics(vacrel);
 
-	/* If no indexes, make log report that lazy_vacuum_heap would've made */
+	/* If no indexes, make log report that lazy_vacuum_heap_rel would've made */
 	if (vacuumed_pages)
 		ereport(elevel,
 				(errmsg("\"%s\": removed %.0f row versions in %u pages",
-						vacrelstats->relname,
+						vacrel->relname,
 						tups_vacuumed, vacuumed_pages)));
 
 	initStringInfo(&buf);
 	appendStringInfo(&buf,
 					 _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"),
-					 nkeep, OldestXmin);
+					 nkeep, vacrel->OldestXmin);
 	appendStringInfo(&buf, _("There were %.0f unused item identifiers.\n"),
 					 nunused);
 	appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
 									"Skipped %u pages due to buffer pins, ",
-									vacrelstats->pinskipped_pages),
-					 vacrelstats->pinskipped_pages);
+									vacrel->pinskipped_pages),
+					 vacrel->pinskipped_pages);
 	appendStringInfo(&buf, ngettext("%u frozen page.\n",
 									"%u frozen pages.\n",
-									vacrelstats->frozenskipped_pages),
-					 vacrelstats->frozenskipped_pages);
+									vacrel->frozenskipped_pages),
+					 vacrel->frozenskipped_pages);
 	appendStringInfo(&buf, ngettext("%u page is entirely empty.\n",
 									"%u pages are entirely empty.\n",
 									empty_pages),
@@ -1823,258 +1822,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 	ereport(elevel,
 			(errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
-					vacrelstats->relname,
+					vacrel->relname,
 					tups_vacuumed, num_tuples,
-					vacrelstats->scanned_pages, nblocks),
+					vacrel->scanned_pages, nblocks),
 			 errdetail_internal("%s", buf.data)));
 	pfree(buf.data);
 }
 
-/*
- *	lazy_vacuum_all_indexes() -- vacuum all indexes of relation.
- *
- * We process the indexes serially unless we are doing parallel vacuum.
- */
-static void
-lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
-						LVRelStats *vacrelstats, LVParallelState *lps,
-						int nindexes)
-{
-	Assert(!IsParallelWorker());
-	Assert(nindexes > 0);
-
-	/* Log cleanup info before we touch indexes */
-	vacuum_log_cleanup_info(onerel, vacrelstats);
-
-	/* Report that we are now vacuuming indexes */
-	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
-								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
-
-	/* Perform index vacuuming with parallel workers for parallel vacuum. */
-	if (ParallelVacuumIsActive(lps))
-	{
-		/* Tell parallel workers to do index vacuuming */
-		lps->lvshared->for_cleanup = false;
-		lps->lvshared->first_time = false;
-
-		/*
-		 * We can only provide an approximate value of num_heap_tuples in
-		 * vacuum cases.
-		 */
-		lps->lvshared->reltuples = vacrelstats->old_live_tuples;
-		lps->lvshared->estimated_count = true;
-
-		lazy_parallel_vacuum_indexes(Irel, vacrelstats, lps, nindexes);
-	}
-	else
-	{
-		int			idx;
-
-		for (idx = 0; idx < nindexes; idx++)
-			lazy_vacuum_index(Irel[idx], &(vacrelstats->indstats[idx]),
-							  vacrelstats->dead_tuples,
-							  vacrelstats->old_live_tuples, vacrelstats);
-	}
-
-	/* Increase and report the number of index scans */
-	vacrelstats->num_index_scans++;
-	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
-								 vacrelstats->num_index_scans);
-}
-
-
-/*
- *	lazy_vacuum_heap() -- second pass over the heap
- *
- *		This routine marks dead tuples as unused and compacts out free
- *		space on their pages.  Pages not having dead tuples recorded from
- *		lazy_scan_heap are not visited at all.
- *
- * Note: the reason for doing this as a second pass is we cannot remove
- * the tuples until we've removed their index entries, and we want to
- * process index entry removal in batches as large as possible.
- */
-static void
-lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
-{
-	int			tupindex;
-	int			npages;
-	PGRUsage	ru0;
-	Buffer		vmbuffer = InvalidBuffer;
-	LVSavedErrInfo saved_err_info;
-
-	/* Report that we are now vacuuming the heap */
-	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
-								 PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
-
-	/* Update error traceback information */
-	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
-							 InvalidBlockNumber, InvalidOffsetNumber);
-
-	pg_rusage_init(&ru0);
-	npages = 0;
-
-	tupindex = 0;
-	while (tupindex < vacrelstats->dead_tuples->num_tuples)
-	{
-		BlockNumber tblk;
-		Buffer		buf;
-		Page		page;
-		Size		freespace;
-
-		vacuum_delay_point();
-
-		tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples->itemptrs[tupindex]);
-		vacrelstats->blkno = tblk;
-		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
-								 vac_strategy);
-		if (!ConditionalLockBufferForCleanup(buf))
-		{
-			ReleaseBuffer(buf);
-			++tupindex;
-			continue;
-		}
-		tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
-									&vmbuffer);
-
-		/* Now that we've compacted the page, record its available space */
-		page = BufferGetPage(buf);
-		freespace = PageGetHeapFreeSpace(page);
-
-		UnlockReleaseBuffer(buf);
-		RecordPageWithFreeSpace(onerel, tblk, freespace);
-		npages++;
-	}
-
-	/* Clear the block number information */
-	vacrelstats->blkno = InvalidBlockNumber;
-
-	if (BufferIsValid(vmbuffer))
-	{
-		ReleaseBuffer(vmbuffer);
-		vmbuffer = InvalidBuffer;
-	}
-
-	ereport(elevel,
-			(errmsg("\"%s\": removed %d row versions in %d pages",
-					vacrelstats->relname,
-					tupindex, npages),
-			 errdetail_internal("%s", pg_rusage_show(&ru0))));
-
-	/* Revert to the previous phase information for error traceback */
-	restore_vacuum_error_info(vacrelstats, &saved_err_info);
-}
-
-/*
- *	lazy_vacuum_page() -- free dead tuples on a page
- *					 and repair its fragmentation.
- *
- * Caller must hold pin and buffer cleanup lock on the buffer.
- *
- * tupindex is the index in vacrelstats->dead_tuples of the first dead
- * tuple for this page.  We assume the rest follow sequentially.
- * The return value is the first tupindex after the tuples of this page.
- */
-static int
-lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-				 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer)
-{
-	LVDeadTuples *dead_tuples = vacrelstats->dead_tuples;
-	Page		page = BufferGetPage(buffer);
-	OffsetNumber unused[MaxOffsetNumber];
-	int			uncnt = 0;
-	TransactionId visibility_cutoff_xid;
-	bool		all_frozen;
-	LVSavedErrInfo saved_err_info;
-
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
-
-	/* Update error traceback information */
-	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
-							 blkno, InvalidOffsetNumber);
-
-	START_CRIT_SECTION();
-
-	for (; tupindex < dead_tuples->num_tuples; tupindex++)
-	{
-		BlockNumber tblk;
-		OffsetNumber toff;
-		ItemId		itemid;
-
-		tblk = ItemPointerGetBlockNumber(&dead_tuples->itemptrs[tupindex]);
-		if (tblk != blkno)
-			break;				/* past end of tuples for this block */
-		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
-		itemid = PageGetItemId(page, toff);
-		ItemIdSetUnused(itemid);
-		unused[uncnt++] = toff;
-	}
-
-	PageRepairFragmentation(page);
-
-	/*
-	 * Mark buffer dirty before we write WAL.
-	 */
-	MarkBufferDirty(buffer);
-
-	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
-	{
-		XLogRecPtr	recptr;
-
-		recptr = log_heap_clean(onerel, buffer,
-								NULL, 0, NULL, 0,
-								unused, uncnt,
-								vacrelstats->latestRemovedXid);
-		PageSetLSN(page, recptr);
-	}
-
-	/*
-	 * End critical section, so we safely can do visibility tests (which
-	 * possibly need to perform IO and allocate memory!). If we crash now the
-	 * page (including the corresponding vm bit) might not be marked all
-	 * visible, but that's fine. A later vacuum will fix that.
-	 */
-	END_CRIT_SECTION();
-
-	/*
-	 * Now that we have removed the dead tuples from the page, once again
-	 * check if the page has become all-visible.  The page is already marked
-	 * dirty, exclusively locked, and, if needed, a full page image has been
-	 * emitted in the log_heap_clean() above.
-	 */
-	if (heap_page_is_all_visible(onerel, buffer, vacrelstats,
-								 &visibility_cutoff_xid,
-								 &all_frozen))
-		PageSetAllVisible(page);
-
-	/*
-	 * All the changes to the heap page have been done. If the all-visible
-	 * flag is now set, also set the VM all-visible bit (and, if possible, the
-	 * all-frozen bit) unless this has already been done previously.
-	 */
-	if (PageIsAllVisible(page))
-	{
-		uint8		vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
-		uint8		flags = 0;
-
-		/* Set the VM all-frozen bit to flag, if needed */
-		if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
-			flags |= VISIBILITYMAP_ALL_VISIBLE;
-		if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
-			flags |= VISIBILITYMAP_ALL_FROZEN;
-
-		Assert(BufferIsValid(*vmbuffer));
-		if (flags != 0)
-			visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr,
-							  *vmbuffer, visibility_cutoff_xid, flags);
-	}
-
-	/* Revert to the previous phase information for error traceback */
-	restore_vacuum_error_info(vacrelstats, &saved_err_info);
-	return tupindex;
-}
-
 /*
  *	lazy_check_needs_freeze() -- scan page to see if any tuples
  *					 need to be cleaned to avoid wraparound
@@ -2083,7 +1837,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
  * Also returns a flag indicating whether page contains any tuples at all.
  */
 static bool
-lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelStats *vacrelstats)
+lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
 {
 	Page		page = BufferGetPage(buf);
 	OffsetNumber offnum,
@@ -2112,7 +1866,7 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelStats *vacrelstats)
 		 * Set the offset number so that we can display it along with any
 		 * error that occurred while processing this tuple.
 		 */
-		vacrelstats->offnum = offnum;
+		vacrel->offnum = offnum;
 		itemid = PageGetItemId(page, offnum);
 
 		/* this should match hastup test in count_nondeletable_pages() */
@@ -2125,363 +1879,72 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelStats *vacrelstats)
 
 		tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
 
-		if (heap_tuple_needs_freeze(tupleheader, FreezeLimit,
-									MultiXactCutoff, buf))
+		if (heap_tuple_needs_freeze(tupleheader, vacrel->FreezeLimit,
+									vacrel->MultiXactCutoff, buf))
 			break;
 	}							/* scan along page */
 
 	/* Clear the offset information once we have processed the given page. */
-	vacrelstats->offnum = InvalidOffsetNumber;
+	vacrel->offnum = InvalidOffsetNumber;
 
 	return (offnum <= maxoff);
 }
 
 /*
- * Perform index vacuum or index cleanup with parallel workers.  This function
- * must be used by the parallel vacuum leader process.  The caller must set
- * lps->lvshared->for_cleanup to indicate whether to perform vacuum or
- * cleanup.
+ *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
  */
 static void
-lazy_parallel_vacuum_indexes(Relation *Irel, LVRelStats *vacrelstats,
-							 LVParallelState *lps, int nindexes)
+lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
-	int			nworkers;
+	Assert(vacrel->nindexes > 0);
+	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
+	Assert(MultiXactIdIsValid(vacrel->relminmxid));
 
-	Assert(!IsParallelWorker());
-	Assert(ParallelVacuumIsActive(lps));
-	Assert(nindexes > 0);
+	/* Log cleanup info before we touch indexes */
+	vacuum_log_cleanup_info(vacrel);
 
-	/* Determine the number of parallel workers to launch */
-	if (lps->lvshared->for_cleanup)
-	{
-		if (lps->lvshared->first_time)
-			nworkers = lps->nindexes_parallel_cleanup +
-				lps->nindexes_parallel_condcleanup;
-		else
-			nworkers = lps->nindexes_parallel_cleanup;
-	}
-	else
-		nworkers = lps->nindexes_parallel_bulkdel;
-
-	/* The leader process will participate */
-	nworkers--;
-
-	/*
-	 * It is possible that parallel context is initialized with fewer workers
-	 * than the number of indexes that need a separate worker in the current
-	 * phase, so we need to consider it.  See compute_parallel_vacuum_workers.
-	 */
-	nworkers = Min(nworkers, lps->pcxt->nworkers);
-
-	/* Setup the shared cost-based vacuum delay and launch workers */
-	if (nworkers > 0)
-	{
-		if (vacrelstats->num_index_scans > 0)
-		{
-			/* Reset the parallel index processing counter */
-			pg_atomic_write_u32(&(lps->lvshared->idx), 0);
-
-			/* Reinitialize the parallel context to relaunch parallel workers */
-			ReinitializeParallelDSM(lps->pcxt);
-		}
-
-		/*
-		 * Set up shared cost balance and the number of active workers for
-		 * vacuum delay.  We need to do this before launching workers as
-		 * otherwise, they might not see the updated values for these
-		 * parameters.
-		 */
-		pg_atomic_write_u32(&(lps->lvshared->cost_balance), VacuumCostBalance);
-		pg_atomic_write_u32(&(lps->lvshared->active_nworkers), 0);
-
-		/*
-		 * The number of workers can vary between bulkdelete and cleanup
-		 * phase.
-		 */
-		ReinitializeParallelWorkers(lps->pcxt, nworkers);
-
-		LaunchParallelWorkers(lps->pcxt);
-
-		if (lps->pcxt->nworkers_launched > 0)
-		{
-			/*
-			 * Reset the local cost values for leader backend as we have
-			 * already accumulated the remaining balance of heap.
-			 */
-			VacuumCostBalance = 0;
-			VacuumCostBalanceLocal = 0;
-
-			/* Enable shared cost balance for leader backend */
-			VacuumSharedCostBalance = &(lps->lvshared->cost_balance);
-			VacuumActiveNWorkers = &(lps->lvshared->active_nworkers);
-		}
-
-		if (lps->lvshared->for_cleanup)
-			ereport(elevel,
-					(errmsg(ngettext("launched %d parallel vacuum worker for index cleanup (planned: %d)",
-									 "launched %d parallel vacuum workers for index cleanup (planned: %d)",
-									 lps->pcxt->nworkers_launched),
-							lps->pcxt->nworkers_launched, nworkers)));
-		else
-			ereport(elevel,
-					(errmsg(ngettext("launched %d parallel vacuum worker for index vacuuming (planned: %d)",
-									 "launched %d parallel vacuum workers for index vacuuming (planned: %d)",
-									 lps->pcxt->nworkers_launched),
-							lps->pcxt->nworkers_launched, nworkers)));
-	}
-
-	/* Process the indexes that can be processed by only leader process */
-	vacuum_indexes_leader(Irel, vacrelstats, lps, nindexes);
-
-	/*
-	 * Join as a parallel worker.  The leader process alone processes all the
-	 * indexes in the case where no workers are launched.
-	 */
-	parallel_vacuum_index(Irel, lps->lvshared, vacrelstats->dead_tuples,
-						  nindexes, vacrelstats);
-
-	/*
-	 * Next, accumulate buffer and WAL usage.  (This must wait for the workers
-	 * to finish, or we might get incomplete data.)
-	 */
-	if (nworkers > 0)
-	{
-		int			i;
-
-		/* Wait for all vacuum workers to finish */
-		WaitForParallelWorkersToFinish(lps->pcxt);
-
-		for (i = 0; i < lps->pcxt->nworkers_launched; i++)
-			InstrAccumParallelQuery(&lps->buffer_usage[i], &lps->wal_usage[i]);
-	}
-
-	/*
-	 * Carry the shared balance value to heap scan and disable shared costing
-	 */
-	if (VacuumSharedCostBalance)
-	{
-		VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance);
-		VacuumSharedCostBalance = NULL;
-		VacuumActiveNWorkers = NULL;
-	}
-}
-
-/*
- * Index vacuum/cleanup routine used by the leader process and parallel
- * vacuum worker processes to process the indexes in parallel.
- */
-static void
-parallel_vacuum_index(Relation *Irel, LVShared *lvshared,
-					  LVDeadTuples *dead_tuples, int nindexes,
-					  LVRelStats *vacrelstats)
-{
-	/*
-	 * Increment the active worker count if we are able to launch any worker.
-	 */
-	if (VacuumActiveNWorkers)
-		pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
-
-	/* Loop until all indexes are vacuumed */
-	for (;;)
-	{
-		int			idx;
-		LVSharedIndStats *shared_indstats;
-
-		/* Get an index number to process */
-		idx = pg_atomic_fetch_add_u32(&(lvshared->idx), 1);
-
-		/* Done for all indexes? */
-		if (idx >= nindexes)
-			break;
-
-		/* Get the index statistics of this index from DSM */
-		shared_indstats = get_indstats(lvshared, idx);
-
-		/*
-		 * Skip processing indexes that don't participate in parallel
-		 * operation
-		 */
-		if (shared_indstats == NULL ||
-			skip_parallel_vacuum_index(Irel[idx], lvshared))
-			continue;
-
-		/* Do vacuum or cleanup of the index */
-		vacuum_one_index(Irel[idx], &(vacrelstats->indstats[idx]), lvshared,
-						 shared_indstats, dead_tuples, vacrelstats);
-	}
-
-	/*
-	 * We have completed the index vacuum so decrement the active worker
-	 * count.
-	 */
-	if (VacuumActiveNWorkers)
-		pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
-}
-
-/*
- * Vacuum or cleanup indexes that can be processed by only the leader process
- * because these indexes don't support parallel operation at that phase.
- */
-static void
-vacuum_indexes_leader(Relation *Irel, LVRelStats *vacrelstats,
-					  LVParallelState *lps, int nindexes)
-{
-	int			i;
-
-	Assert(!IsParallelWorker());
-
-	/*
-	 * Increment the active worker count if we are able to launch any worker.
-	 */
-	if (VacuumActiveNWorkers)
-		pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
-
-	for (i = 0; i < nindexes; i++)
-	{
-		LVSharedIndStats *shared_indstats;
-
-		shared_indstats = get_indstats(lps->lvshared, i);
-
-		/* Process the indexes skipped by parallel workers */
-		if (shared_indstats == NULL ||
-			skip_parallel_vacuum_index(Irel[i], lps->lvshared))
-			vacuum_one_index(Irel[i], &(vacrelstats->indstats[i]), lps->lvshared,
-							 shared_indstats, vacrelstats->dead_tuples,
-							 vacrelstats);
-	}
-
-	/*
-	 * We have completed the index vacuum so decrement the active worker
-	 * count.
-	 */
-	if (VacuumActiveNWorkers)
-		pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
-}
-
-/*
- * Vacuum or cleanup index either by leader process or by one of the worker
- * process.  After processing the index this function copies the index
- * statistics returned from ambulkdelete and amvacuumcleanup to the DSM
- * segment.
- */
-static void
-vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
-				 LVShared *lvshared, LVSharedIndStats *shared_indstats,
-				 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats)
-{
-	IndexBulkDeleteResult *bulkdelete_res = NULL;
-
-	if (shared_indstats)
-	{
-		/* Get the space for IndexBulkDeleteResult */
-		bulkdelete_res = &(shared_indstats->stats);
-
-		/*
-		 * Update the pointer to the corresponding bulk-deletion result if
-		 * someone has already updated it.
-		 */
-		if (shared_indstats->updated && *stats == NULL)
-			*stats = bulkdelete_res;
-	}
-
-	/* Do vacuum or cleanup of the index */
-	if (lvshared->for_cleanup)
-		lazy_cleanup_index(indrel, stats, lvshared->reltuples,
-						   lvshared->estimated_count, vacrelstats);
-	else
-		lazy_vacuum_index(indrel, stats, dead_tuples,
-						  lvshared->reltuples, vacrelstats);
-
-	/*
-	 * Copy the index bulk-deletion result returned from ambulkdelete and
-	 * amvacuumcleanup to the DSM segment if it's the first cycle because they
-	 * allocate locally and it's possible that an index will be vacuumed by a
-	 * different vacuum process the next cycle.  Copying the result normally
-	 * happens only the first time an index is vacuumed.  For any additional
-	 * vacuum pass, we directly point to the result on the DSM segment and
-	 * pass it to vacuum index APIs so that workers can update it directly.
-	 *
-	 * Since all vacuum workers write the bulk-deletion result at different
-	 * slots we can write them without locking.
-	 */
-	if (shared_indstats && !shared_indstats->updated && *stats != NULL)
-	{
-		memcpy(bulkdelete_res, *stats, sizeof(IndexBulkDeleteResult));
-		shared_indstats->updated = true;
-
-		/*
-		 * Now that stats[idx] points to the DSM segment, we don't need the
-		 * locally allocated results.
-		 */
-		pfree(*stats);
-		*stats = bulkdelete_res;
-	}
-}
-
-/*
- *	lazy_cleanup_all_indexes() -- cleanup all indexes of relation.
- *
- * Cleanup indexes.  We process the indexes serially unless we are doing
- * parallel vacuum.
- */
-static void
-lazy_cleanup_all_indexes(Relation *Irel, LVRelStats *vacrelstats,
-						 LVParallelState *lps, int nindexes)
-{
-	int			idx;
-
-	Assert(!IsParallelWorker());
-	Assert(nindexes > 0);
-
-	/* Report that we are now cleaning up indexes */
+	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
-								 PROGRESS_VACUUM_PHASE_INDEX_CLEANUP);
+								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
 
-	/*
-	 * If parallel vacuum is active we perform index cleanup with parallel
-	 * workers.
-	 */
-	if (ParallelVacuumIsActive(lps))
+	if (!vacrel->lps)
 	{
-		/* Tell parallel workers to do index cleanup */
-		lps->lvshared->for_cleanup = true;
-		lps->lvshared->first_time =
-			(vacrelstats->num_index_scans == 0);
+		for (int idx = 0; idx < vacrel->nindexes; idx++)
+		{
+			Relation	indrel = vacrel->indrels[idx];
+			IndexBulkDeleteResult *istat = vacrel->indstats[idx];
 
-		/*
-		 * Now we can provide a better estimate of total number of surviving
-		 * tuples (we assume indexes are more interested in that than in the
-		 * number of nominally live tuples).
-		 */
-		lps->lvshared->reltuples = vacrelstats->new_rel_tuples;
-		lps->lvshared->estimated_count =
-			(vacrelstats->tupcount_pages < vacrelstats->rel_pages);
-
-		lazy_parallel_vacuum_indexes(Irel, vacrelstats, lps, nindexes);
+			vacrel->indstats[idx] =
+				lazy_vacuum_one_index(indrel, istat, vacrel->old_live_tuples,
+									  vacrel);
+		}
 	}
 	else
 	{
-		for (idx = 0; idx < nindexes; idx++)
-			lazy_cleanup_index(Irel[idx], &(vacrelstats->indstats[idx]),
-							   vacrelstats->new_rel_tuples,
-							   vacrelstats->tupcount_pages < vacrelstats->rel_pages,
-							   vacrelstats);
+		/* Outsource everything to parallel variant */
+		do_parallel_lazy_vacuum_all_indexes(vacrel);
 	}
+
+	/* Increase and report the number of index scans */
+	vacrel->num_index_scans++;
+	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
+								 vacrel->num_index_scans);
 }
 
 /*
- *	lazy_vacuum_index() -- vacuum one index relation.
+ *	lazy_vacuum_one_index() -- vacuum index relation.
  *
  *		Delete all the index entries pointing to tuples listed in
  *		dead_tuples, and update running statistics.
  *
  *		reltuples is the number of heap tuples to be passed to the
  *		bulkdelete callback.  It's always assumed to be estimated.
+ *
+ * Returns bulk delete stats derived from input stats
  */
-static void
-lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
-				  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats)
+static IndexBulkDeleteResult *
+lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
+					  double reltuples, LVRelState *vacrel)
 {
 	IndexVacuumInfo ivinfo;
 	PGRUsage	ru0;
@@ -2495,7 +1958,7 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	ivinfo.estimated_count = true;
 	ivinfo.message_level = elevel;
 	ivinfo.num_heap_tuples = reltuples;
-	ivinfo.strategy = vac_strategy;
+	ivinfo.strategy = vacrel->bstrategy;
 
 	/*
 	 * Update error traceback information.
@@ -2503,38 +1966,76 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	 * The index name is saved during this phase and restored immediately
 	 * after this phase.  See vacuum_error_callback.
 	 */
-	Assert(vacrelstats->indname == NULL);
-	vacrelstats->indname = pstrdup(RelationGetRelationName(indrel));
-	update_vacuum_error_info(vacrelstats, &saved_err_info,
+	Assert(vacrel->indname == NULL);
+	vacrel->indname = pstrdup(RelationGetRelationName(indrel));
+	update_vacuum_error_info(vacrel, &saved_err_info,
 							 VACUUM_ERRCB_PHASE_VACUUM_INDEX,
 							 InvalidBlockNumber, InvalidOffsetNumber);
 
 	/* Do bulk deletion */
-	*stats = index_bulk_delete(&ivinfo, *stats,
-							   lazy_tid_reaped, (void *) dead_tuples);
+	istat = index_bulk_delete(&ivinfo, istat, lazy_tid_reaped,
+							  (void *) vacrel->dead_tuples);
 
 	ereport(elevel,
 			(errmsg("scanned index \"%s\" to remove %d row versions",
-					vacrelstats->indname,
-					dead_tuples->num_tuples),
+					vacrel->indname, vacrel->dead_tuples->num_tuples),
 			 errdetail_internal("%s", pg_rusage_show(&ru0))));
 
 	/* Revert to the previous phase information for error traceback */
-	restore_vacuum_error_info(vacrelstats, &saved_err_info);
-	pfree(vacrelstats->indname);
-	vacrelstats->indname = NULL;
+	restore_vacuum_error_info(vacrel, &saved_err_info);
+	pfree(vacrel->indname);
+	vacrel->indname = NULL;
+
+	return istat;
 }
 
 /*
- *	lazy_cleanup_index() -- do post-vacuum cleanup for one index relation.
+ *	lazy_cleanup_all_indexes() -- cleanup all indexes of relation.
+ */
+static void
+lazy_cleanup_all_indexes(LVRelState *vacrel)
+{
+	Assert(vacrel->nindexes > 0);
+
+	/* Report that we are now cleaning up indexes */
+	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+								 PROGRESS_VACUUM_PHASE_INDEX_CLEANUP);
+
+	if (!vacrel->lps)
+	{
+		double		reltuples = vacrel->new_rel_tuples;
+		bool		estimated_count =
+		vacrel->tupcount_pages < vacrel->rel_pages;
+
+		for (int idx = 0; idx < vacrel->nindexes; idx++)
+		{
+			Relation	indrel = vacrel->indrels[idx];
+			IndexBulkDeleteResult *istat = vacrel->indstats[idx];
+
+			vacrel->indstats[idx] =
+				lazy_cleanup_one_index(indrel, istat, reltuples,
+									   estimated_count, vacrel);
+		}
+	}
+	else
+	{
+		/* Outsource everything to parallel variant */
+		do_parallel_lazy_cleanup_all_indexes(vacrel);
+	}
+}
+
+/*
+ *	lazy_cleanup_one_index() -- do post-vacuum cleanup for index relation.
  *
  *		reltuples is the number of heap tuples and estimated_count is true
  *		if reltuples is an estimated value.
+ *
+ * Returns bulk delete stats derived from input stats
  */
-static void
-lazy_cleanup_index(Relation indrel,
-				   IndexBulkDeleteResult **stats,
-				   double reltuples, bool estimated_count, LVRelStats *vacrelstats)
+static IndexBulkDeleteResult *
+lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
+					   double reltuples, bool estimated_count,
+					   LVRelState *vacrel)
 {
 	IndexVacuumInfo ivinfo;
 	PGRUsage	ru0;
@@ -2549,7 +2050,7 @@ lazy_cleanup_index(Relation indrel,
 	ivinfo.message_level = elevel;
 
 	ivinfo.num_heap_tuples = reltuples;
-	ivinfo.strategy = vac_strategy;
+	ivinfo.strategy = vacrel->bstrategy;
 
 	/*
 	 * Update error traceback information.
@@ -2557,35 +2058,252 @@ lazy_cleanup_index(Relation indrel,
 	 * The index name is saved during this phase and restored immediately
 	 * after this phase.  See vacuum_error_callback.
 	 */
-	Assert(vacrelstats->indname == NULL);
-	vacrelstats->indname = pstrdup(RelationGetRelationName(indrel));
-	update_vacuum_error_info(vacrelstats, &saved_err_info,
+	Assert(vacrel->indname == NULL);
+	vacrel->indname = pstrdup(RelationGetRelationName(indrel));
+	update_vacuum_error_info(vacrel, &saved_err_info,
 							 VACUUM_ERRCB_PHASE_INDEX_CLEANUP,
 							 InvalidBlockNumber, InvalidOffsetNumber);
 
-	*stats = index_vacuum_cleanup(&ivinfo, *stats);
+	istat = index_vacuum_cleanup(&ivinfo, istat);
 
-	if (*stats)
+	if (istat)
 	{
 		ereport(elevel,
 				(errmsg("index \"%s\" now contains %.0f row versions in %u pages",
 						RelationGetRelationName(indrel),
-						(*stats)->num_index_tuples,
-						(*stats)->num_pages),
+						(istat)->num_index_tuples,
+						(istat)->num_pages),
 				 errdetail("%.0f index row versions were removed.\n"
 						   "%u index pages were newly deleted.\n"
 						   "%u index pages are currently deleted, of which %u are currently reusable.\n"
 						   "%s.",
-						   (*stats)->tuples_removed,
-						   (*stats)->pages_newly_deleted,
-						   (*stats)->pages_deleted, (*stats)->pages_free,
+						   (istat)->tuples_removed,
+						   (istat)->pages_newly_deleted,
+						   (istat)->pages_deleted, (istat)->pages_free,
 						   pg_rusage_show(&ru0))));
 	}
 
 	/* Revert to the previous phase information for error traceback */
-	restore_vacuum_error_info(vacrelstats, &saved_err_info);
-	pfree(vacrelstats->indname);
-	vacrelstats->indname = NULL;
+	restore_vacuum_error_info(vacrel, &saved_err_info);
+	pfree(vacrel->indname);
+	vacrel->indname = NULL;
+
+	return istat;
+}
+
+/*
+ *	lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
+ *
+ *		This routine marks dead tuples as unused and compacts out free
+ *		space on their pages.  Pages not having dead tuples recorded from
+ *		lazy_scan_heap are not visited at all.
+ */
+static void
+lazy_vacuum_heap_rel(LVRelState *vacrel)
+{
+	int			tupindex;
+	int			vacuumed_pages;
+	PGRUsage	ru0;
+	Buffer		vmbuffer = InvalidBuffer;
+	LVSavedErrInfo saved_err_info;
+
+	/* Report that we are now vacuuming the heap */
+	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+								 PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
+
+	/* Update error traceback information */
+	update_vacuum_error_info(vacrel, &saved_err_info,
+							 VACUUM_ERRCB_PHASE_VACUUM_HEAP,
+							 InvalidBlockNumber, InvalidOffsetNumber);
+
+	pg_rusage_init(&ru0);
+	vacuumed_pages = 0;
+
+	tupindex = 0;
+	while (tupindex < vacrel->dead_tuples->num_tuples)
+	{
+		BlockNumber tblk;
+		Buffer		buf;
+		Page		page;
+		Size		freespace;
+
+		vacuum_delay_point();
+
+		tblk = ItemPointerGetBlockNumber(&vacrel->dead_tuples->itemptrs[tupindex]);
+		vacrel->blkno = tblk;
+		buf = ReadBufferExtended(vacrel->onerel, MAIN_FORKNUM, tblk,
+								 RBM_NORMAL, vacrel->bstrategy);
+		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+		tupindex = lazy_vacuum_heap_page(vacrel, tblk, buf, tupindex,
+										 &vmbuffer);
+
+		/* Now that we've compacted the page, record its available space */
+		page = BufferGetPage(buf);
+		freespace = PageGetHeapFreeSpace(page);
+
+		UnlockReleaseBuffer(buf);
+		RecordPageWithFreeSpace(vacrel->onerel, tblk, freespace);
+		vacuumed_pages++;
+	}
+
+	/* Clear the block number information */
+	vacrel->blkno = InvalidBlockNumber;
+
+	if (BufferIsValid(vmbuffer))
+	{
+		ReleaseBuffer(vmbuffer);
+		vmbuffer = InvalidBuffer;
+	}
+
+	ereport(elevel,
+			(errmsg("\"%s\": removed %d dead item identifiers in %u pages",
+					vacrel->relname, tupindex, vacuumed_pages),
+			 errdetail_internal("%s", pg_rusage_show(&ru0))));
+
+	/* Revert to the previous phase information for error traceback */
+	restore_vacuum_error_info(vacrel, &saved_err_info);
+}
+
+/*
+ *	lazy_vacuum_heap_page() -- free dead tuples on a page
+ *						  and repair its fragmentation.
+ *
+ * Caller must hold pin and buffer cleanup lock on the buffer.
+ *
+ * tupindex is the index in vacrel->dead_tuples of the first dead tuple for
+ * this page.  We assume the rest follow sequentially.  The return value is
+ * the first tupindex after the tuples of this page.
+ */
+static int
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
+					  int tupindex, Buffer *vmbuffer)
+{
+	LVDeadTuples *dead_tuples = vacrel->dead_tuples;
+	Page		page = BufferGetPage(buffer);
+	OffsetNumber unused[MaxOffsetNumber];
+	int			uncnt = 0;
+	TransactionId visibility_cutoff_xid;
+	bool		all_frozen;
+	LVSavedErrInfo saved_err_info;
+
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+
+	/* Update error traceback information */
+	update_vacuum_error_info(vacrel, &saved_err_info,
+							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
+							 InvalidOffsetNumber);
+
+	START_CRIT_SECTION();
+
+	for (; tupindex < dead_tuples->num_tuples; tupindex++)
+	{
+		BlockNumber tblk;
+		OffsetNumber toff;
+		ItemId		itemid;
+
+		tblk = ItemPointerGetBlockNumber(&dead_tuples->itemptrs[tupindex]);
+		if (tblk != blkno)
+			break;				/* past end of tuples for this block */
+		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
+		itemid = PageGetItemId(page, toff);
+		ItemIdSetUnused(itemid);
+		unused[uncnt++] = toff;
+	}
+
+	PageRepairFragmentation(page);
+
+	/*
+	 * Mark buffer dirty before we write WAL.
+	 */
+	MarkBufferDirty(buffer);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(vacrel->onerel))
+	{
+		XLogRecPtr	recptr;
+
+		recptr = log_heap_clean(vacrel->onerel, buffer,
+								NULL, 0, NULL, 0,
+								unused, uncnt,
+								vacrel->latestRemovedXid);
+		PageSetLSN(page, recptr);
+	}
+
+	/*
+	 * End critical section, so we safely can do visibility tests (which
+	 * possibly need to perform IO and allocate memory!). If we crash now the
+	 * page (including the corresponding vm bit) might not be marked all
+	 * visible, but that's fine. A later vacuum will fix that.
+	 */
+	END_CRIT_SECTION();
+
+	/*
+	 * Now that we have removed the dead tuples from the page, once again
+	 * check if the page has become all-visible.  The page is already marked
+	 * dirty, exclusively locked, and, if needed, a full page image has been
+	 * emitted in the log_heap_clean() above.
+	 */
+	if (heap_page_is_all_visible(vacrel, buffer, &visibility_cutoff_xid,
+								 &all_frozen))
+		PageSetAllVisible(page);
+
+	/*
+	 * All the changes to the heap page have been done. If the all-visible
+	 * flag is now set, also set the VM all-visible bit (and, if possible, the
+	 * all-frozen bit) unless this has already been done previously.
+	 */
+	if (PageIsAllVisible(page))
+	{
+		uint8		vm_status = visibilitymap_get_status(vacrel->onerel, blkno, vmbuffer);
+		uint8		flags = 0;
+
+		/* Set the VM all-frozen bit to flag, if needed */
+		if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
+			flags |= VISIBILITYMAP_ALL_VISIBLE;
+		if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
+			flags |= VISIBILITYMAP_ALL_FROZEN;
+
+		Assert(BufferIsValid(*vmbuffer));
+		if (flags != 0)
+			visibilitymap_set(vacrel->onerel, blkno, buffer, InvalidXLogRecPtr,
+							  *vmbuffer, visibility_cutoff_xid, flags);
+	}
+
+	/* Revert to the previous phase information for error traceback */
+	restore_vacuum_error_info(vacrel, &saved_err_info);
+	return tupindex;
+}
+
+/*
+ * Update index statistics in pg_class if the statistics are accurate.
+ */
+static void
+update_index_statistics(LVRelState *vacrel)
+{
+	Relation   *indrels = vacrel->indrels;
+	int			nindexes = vacrel->nindexes;
+	IndexBulkDeleteResult **indstats = vacrel->indstats;
+
+	Assert(!IsInParallelMode());
+
+	for (int idx = 0; idx < nindexes; idx++)
+	{
+		Relation	indrel = indrels[idx];
+		IndexBulkDeleteResult *istat = indstats[idx];
+
+		if (istat == NULL || istat->estimated_count)
+			continue;
+
+		/* Update index statistics */
+		vac_update_relstats(indrel,
+							istat->num_pages,
+							istat->num_index_tuples,
+							0,
+							false,
+							InvalidTransactionId,
+							InvalidMultiXactId,
+							false);
+	}
 }
 
 /*
@@ -2608,17 +2326,17 @@ lazy_cleanup_index(Relation indrel,
  * careful to depend only on fields that lazy_scan_heap updates on-the-fly.
  */
 static bool
-should_attempt_truncation(VacuumParams *params, LVRelStats *vacrelstats)
+should_attempt_truncation(LVRelState *vacrel, VacuumParams *params)
 {
 	BlockNumber possibly_freeable;
 
 	if (params->truncate == VACOPT_TERNARY_DISABLED)
 		return false;
 
-	possibly_freeable = vacrelstats->rel_pages - vacrelstats->nonempty_pages;
+	possibly_freeable = vacrel->rel_pages - vacrel->nonempty_pages;
 	if (possibly_freeable > 0 &&
 		(possibly_freeable >= REL_TRUNCATE_MINIMUM ||
-		 possibly_freeable >= vacrelstats->rel_pages / REL_TRUNCATE_FRACTION) &&
+		 possibly_freeable >= vacrel->rel_pages / REL_TRUNCATE_FRACTION) &&
 		old_snapshot_threshold < 0)
 		return true;
 	else
@@ -2629,9 +2347,10 @@ should_attempt_truncation(VacuumParams *params, LVRelStats *vacrelstats)
  * lazy_truncate_heap - try to truncate off any empty pages at the end
  */
 static void
-lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
+lazy_truncate_heap(LVRelState *vacrel)
 {
-	BlockNumber old_rel_pages = vacrelstats->rel_pages;
+	Relation	onerel = vacrel->onerel;
+	BlockNumber old_rel_pages = vacrel->rel_pages;
 	BlockNumber new_rel_pages;
 	int			lock_retry;
 
@@ -2655,7 +2374,7 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 		 * (which is quite possible considering we already hold a lower-grade
 		 * lock).
 		 */
-		vacrelstats->lock_waiter_detected = false;
+		vacrel->lock_waiter_detected = false;
 		lock_retry = 0;
 		while (true)
 		{
@@ -2675,10 +2394,10 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 				 * We failed to establish the lock in the specified number of
 				 * retries. This means we give up truncating.
 				 */
-				vacrelstats->lock_waiter_detected = true;
+				vacrel->lock_waiter_detected = true;
 				ereport(elevel,
 						(errmsg("\"%s\": stopping truncate due to conflicting lock request",
-								vacrelstats->relname)));
+								vacrel->relname)));
 				return;
 			}
 
@@ -2694,11 +2413,11 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 		if (new_rel_pages != old_rel_pages)
 		{
 			/*
-			 * Note: we intentionally don't update vacrelstats->rel_pages with
-			 * the new rel size here.  If we did, it would amount to assuming
-			 * that the new pages are empty, which is unlikely. Leaving the
-			 * numbers alone amounts to assuming that the new pages have the
-			 * same tuple density as existing ones, which is less unlikely.
+			 * Note: we intentionally don't update vacrel->rel_pages with the
+			 * new rel size here.  If we did, it would amount to assuming that
+			 * the new pages are empty, which is unlikely. Leaving the numbers
+			 * alone amounts to assuming that the new pages have the same
+			 * tuple density as existing ones, which is less unlikely.
 			 */
 			UnlockRelation(onerel, AccessExclusiveLock);
 			return;
@@ -2710,8 +2429,8 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 		 * other backends could have added tuples to these pages whilst we
 		 * were vacuuming.
 		 */
-		new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
-		vacrelstats->blkno = new_rel_pages;
+		new_rel_pages = count_nondeletable_pages(vacrel);
+		vacrel->blkno = new_rel_pages;
 
 		if (new_rel_pages >= old_rel_pages)
 		{
@@ -2739,18 +2458,18 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 		 * without also touching reltuples, since the tuple count wasn't
 		 * changed by the truncation.
 		 */
-		vacrelstats->pages_removed += old_rel_pages - new_rel_pages;
-		vacrelstats->rel_pages = new_rel_pages;
+		vacrel->pages_removed += old_rel_pages - new_rel_pages;
+		vacrel->rel_pages = new_rel_pages;
 
 		ereport(elevel,
 				(errmsg("\"%s\": truncated %u to %u pages",
-						vacrelstats->relname,
+						vacrel->relname,
 						old_rel_pages, new_rel_pages),
 				 errdetail_internal("%s",
 									pg_rusage_show(&ru0))));
 		old_rel_pages = new_rel_pages;
-	} while (new_rel_pages > vacrelstats->nonempty_pages &&
-			 vacrelstats->lock_waiter_detected);
+	} while (new_rel_pages > vacrel->nonempty_pages &&
+			 vacrel->lock_waiter_detected);
 }
 
 /*
@@ -2759,8 +2478,9 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
  * Returns number of nondeletable pages (last nonempty page + 1).
  */
 static BlockNumber
-count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
+count_nondeletable_pages(LVRelState *vacrel)
 {
+	Relation	onerel = vacrel->onerel;
 	BlockNumber blkno;
 	BlockNumber prefetchedUntil;
 	instr_time	starttime;
@@ -2774,11 +2494,11 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 	 * unsigned.)  To make the scan faster, we prefetch a few blocks at a time
 	 * in forward direction, so that OS-level readahead can kick in.
 	 */
-	blkno = vacrelstats->rel_pages;
+	blkno = vacrel->rel_pages;
 	StaticAssertStmt((PREFETCH_SIZE & (PREFETCH_SIZE - 1)) == 0,
 					 "prefetch size must be power of 2");
 	prefetchedUntil = InvalidBlockNumber;
-	while (blkno > vacrelstats->nonempty_pages)
+	while (blkno > vacrel->nonempty_pages)
 	{
 		Buffer		buf;
 		Page		page;
@@ -2809,9 +2529,9 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 				{
 					ereport(elevel,
 							(errmsg("\"%s\": suspending truncate due to conflicting lock request",
-									vacrelstats->relname)));
+									vacrel->relname)));
 
-					vacrelstats->lock_waiter_detected = true;
+					vacrel->lock_waiter_detected = true;
 					return blkno;
 				}
 				starttime = currenttime;
@@ -2842,8 +2562,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 			prefetchedUntil = prefetchStart;
 		}
 
-		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
-								 RBM_NORMAL, vac_strategy);
+		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								 vacrel->bstrategy);
 
 		/* In this phase we only need shared access to the buffer */
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
@@ -2891,7 +2611,7 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 	 * pages still are; we need not bother to look at the last known-nonempty
 	 * page.
 	 */
-	return vacrelstats->nonempty_pages;
+	return vacrel->nonempty_pages;
 }
 
 /*
@@ -2930,18 +2650,62 @@ compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
  * See the comments at the head of this file for rationale.
  */
 static void
-lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
+lazy_space_alloc(LVRelState *vacrel, int nworkers, BlockNumber nblocks)
 {
-	LVDeadTuples *dead_tuples = NULL;
+	LVDeadTuples *dead_tuples;
 	long		maxtuples;
 
-	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->useindex);
+	/*
+	 * Initialize state for a parallel vacuum.  As of now, only one worker can
+	 * be used for an index, so we invoke parallelism only if there are at
+	 * least two indexes on a table.
+	 */
+	if (nworkers >= 0 && vacrel->nindexes > 1)
+	{
+		/*
+		 * Since parallel workers cannot access data in temporary tables, we
+		 * can't perform parallel vacuum on them.
+		 */
+		if (RelationUsesLocalBuffers(vacrel->onerel))
+		{
+			/*
+			 * Give warning only if the user explicitly tries to perform a
+			 * parallel vacuum on the temporary table.
+			 */
+			if (nworkers > 0)
+				ereport(WARNING,
+						(errmsg("disabling parallel option of vacuum on \"%s\" --- cannot vacuum temporary tables in parallel",
+								vacrel->relname)));
+		}
+		else
+			vacrel->lps = begin_parallel_vacuum(vacrel, nblocks, nworkers);
+
+		/* If parallel mode started, we're done */
+		if (vacrel->lps != NULL)
+			return;
+	}
+
+	maxtuples = compute_max_dead_tuples(nblocks, vacrel->nindexes > 0);
 
 	dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
 	dead_tuples->num_tuples = 0;
 	dead_tuples->max_tuples = (int) maxtuples;
 
-	vacrelstats->dead_tuples = dead_tuples;
+	vacrel->dead_tuples = dead_tuples;
+}
+
+/* Free space for dead tuples */
+static void
+lazy_space_free(LVRelState *vacrel)
+{
+	if (!vacrel->lps)
+		return;
+
+	/*
+	 * End parallel mode before updating index statistics as we cannot write
+	 * during parallel mode.
+	 */
+	end_parallel_vacuum(vacrel);
 }
 
 /*
@@ -3039,8 +2803,7 @@ vac_cmp_itemptr(const void *left, const void *right)
  * on this page is frozen.
  */
 static bool
-heap_page_is_all_visible(Relation rel, Buffer buf,
-						 LVRelStats *vacrelstats,
+heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 						 TransactionId *visibility_cutoff_xid,
 						 bool *all_frozen)
 {
@@ -3069,7 +2832,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 		 * Set the offset number so that we can display it along with any
 		 * error that occurred while processing this tuple.
 		 */
-		vacrelstats->offnum = offnum;
+		vacrel->offnum = offnum;
 		itemid = PageGetItemId(page, offnum);
 
 		/* Unused or redirect line pointers are of no interest */
@@ -3093,9 +2856,9 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 
 		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
 		tuple.t_len = ItemIdGetLength(itemid);
-		tuple.t_tableOid = RelationGetRelid(rel);
+		tuple.t_tableOid = RelationGetRelid(vacrel->onerel);
 
-		switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
 		{
 			case HEAPTUPLE_LIVE:
 				{
@@ -3114,7 +2877,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 					 * that everyone sees it as committed?
 					 */
 					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-					if (!TransactionIdPrecedes(xmin, OldestXmin))
+					if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
 					{
 						all_visible = false;
 						*all_frozen = false;
@@ -3148,7 +2911,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 	}							/* scan along page */
 
 	/* Clear the offset information once we have processed the given page. */
-	vacrelstats->offnum = InvalidOffsetNumber;
+	vacrel->offnum = InvalidOffsetNumber;
 
 	return all_visible;
 }
@@ -3167,14 +2930,13 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
  * vacuum.
  */
 static int
-compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
+compute_parallel_vacuum_workers(LVRelState *vacrel, int nrequested,
 								bool *can_parallel_vacuum)
 {
 	int			nindexes_parallel = 0;
 	int			nindexes_parallel_bulkdel = 0;
 	int			nindexes_parallel_cleanup = 0;
 	int			parallel_workers;
-	int			i;
 
 	/*
 	 * We don't allow performing parallel operation in standalone backend or
@@ -3186,15 +2948,16 @@ compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
 	/*
 	 * Compute the number of indexes that can participate in parallel vacuum.
 	 */
-	for (i = 0; i < nindexes; i++)
+	for (int idx = 0; idx < vacrel->nindexes; idx++)
 	{
-		uint8		vacoptions = Irel[i]->rd_indam->amparallelvacuumoptions;
+		Relation	indrel = vacrel->indrels[idx];
+		uint8		vacoptions = indrel->rd_indam->amparallelvacuumoptions;
 
 		if (vacoptions == VACUUM_OPTION_NO_PARALLEL ||
-			RelationGetNumberOfBlocks(Irel[i]) < min_parallel_index_scan_size)
+			RelationGetNumberOfBlocks(indrel) < min_parallel_index_scan_size)
 			continue;
 
-		can_parallel_vacuum[i] = true;
+		can_parallel_vacuum[idx] = true;
 
 		if ((vacoptions & VACUUM_OPTION_PARALLEL_BULKDEL) != 0)
 			nindexes_parallel_bulkdel++;
@@ -3223,70 +2986,19 @@ compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
 	return parallel_workers;
 }
 
-/*
- * Initialize variables for shared index statistics, set NULL bitmap and the
- * size of stats for each index.
- */
-static void
-prepare_index_statistics(LVShared *lvshared, bool *can_parallel_vacuum,
-						 int nindexes)
-{
-	int			i;
-
-	/* Currently, we don't support parallel vacuum for autovacuum */
-	Assert(!IsAutoVacuumWorkerProcess());
-
-	/* Set NULL for all indexes */
-	memset(lvshared->bitmap, 0x00, BITMAPLEN(nindexes));
-
-	for (i = 0; i < nindexes; i++)
-	{
-		if (!can_parallel_vacuum[i])
-			continue;
-
-		/* Set NOT NULL as this index does support parallelism */
-		lvshared->bitmap[i >> 3] |= 1 << (i & 0x07);
-	}
-}
-
-/*
- * Update index statistics in pg_class if the statistics are accurate.
- */
-static void
-update_index_statistics(Relation *Irel, IndexBulkDeleteResult **stats,
-						int nindexes)
-{
-	int			i;
-
-	Assert(!IsInParallelMode());
-
-	for (i = 0; i < nindexes; i++)
-	{
-		if (stats[i] == NULL || stats[i]->estimated_count)
-			continue;
-
-		/* Update index statistics */
-		vac_update_relstats(Irel[i],
-							stats[i]->num_pages,
-							stats[i]->num_index_tuples,
-							0,
-							false,
-							InvalidTransactionId,
-							InvalidMultiXactId,
-							false);
-	}
-}
-
 /*
  * This function prepares and returns parallel vacuum state if we can launch
  * even one worker.  This function is responsible for entering parallel mode,
  * create a parallel context, and then initialize the DSM segment.
  */
 static LVParallelState *
-begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
-					  BlockNumber nblocks, int nindexes, int nrequested)
+begin_parallel_vacuum(LVRelState *vacrel, BlockNumber nblocks,
+					  int nrequested)
 {
 	LVParallelState *lps = NULL;
+	Relation	onerel = vacrel->onerel;
+	Relation   *indrels = vacrel->indrels;
+	int			nindexes = vacrel->nindexes;
 	ParallelContext *pcxt;
 	LVShared   *shared;
 	LVDeadTuples *dead_tuples;
@@ -3299,7 +3011,6 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	int			nindexes_mwm = 0;
 	int			parallel_workers = 0;
 	int			querylen;
-	int			i;
 
 	/*
 	 * A parallel vacuum must be requested and there must be indexes on the
@@ -3312,7 +3023,7 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	 * Compute the number of parallel vacuum workers to launch
 	 */
 	can_parallel_vacuum = (bool *) palloc0(sizeof(bool) * nindexes);
-	parallel_workers = compute_parallel_vacuum_workers(Irel, nindexes,
+	parallel_workers = compute_parallel_vacuum_workers(vacrel,
 													   nrequested,
 													   can_parallel_vacuum);
 
@@ -3333,9 +3044,10 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 
 	/* Estimate size for shared information -- PARALLEL_VACUUM_KEY_SHARED */
 	est_shared = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN(nindexes)));
-	for (i = 0; i < nindexes; i++)
+	for (int idx = 0; idx < nindexes; idx++)
 	{
-		uint8		vacoptions = Irel[i]->rd_indam->amparallelvacuumoptions;
+		Relation	indrel = indrels[idx];
+		uint8		vacoptions = indrel->rd_indam->amparallelvacuumoptions;
 
 		/*
 		 * Cleanup option should be either disabled, always performing in
@@ -3346,10 +3058,10 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 		Assert(vacoptions <= VACUUM_OPTION_MAX_VALID_VALUE);
 
 		/* Skip indexes that don't participate in parallel vacuum */
-		if (!can_parallel_vacuum[i])
+		if (!can_parallel_vacuum[idx])
 			continue;
 
-		if (Irel[i]->rd_indam->amusemaintenanceworkmem)
+		if (indrel->rd_indam->amusemaintenanceworkmem)
 			nindexes_mwm++;
 
 		est_shared = add_size(est_shared, sizeof(LVSharedIndStats));
@@ -3404,7 +3116,7 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	/* Prepare shared information */
 	shared = (LVShared *) shm_toc_allocate(pcxt->toc, est_shared);
 	MemSet(shared, 0, est_shared);
-	shared->relid = relid;
+	shared->onereloid = RelationGetRelid(onerel);
 	shared->elevel = elevel;
 	shared->maintenance_work_mem_worker =
 		(nindexes_mwm > 0) ?
@@ -3415,7 +3127,20 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	pg_atomic_init_u32(&(shared->active_nworkers), 0);
 	pg_atomic_init_u32(&(shared->idx), 0);
 	shared->offset = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN(nindexes)));
-	prepare_index_statistics(shared, can_parallel_vacuum, nindexes);
+
+	/*
+	 * Initialize variables for shared index statistics, set NULL bitmap and
+	 * the size of stats for each index.
+	 */
+	memset(shared->bitmap, 0x00, BITMAPLEN(nindexes));
+	for (int idx = 0; idx < nindexes; idx++)
+	{
+		if (!can_parallel_vacuum[idx])
+			continue;
+
+		/* Set NOT NULL as this index does support parallelism */
+		shared->bitmap[idx >> 3] |= 1 << (idx & 0x07);
+	}
 
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
 	lps->lvshared = shared;
@@ -3426,7 +3151,7 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	dead_tuples->num_tuples = 0;
 	MemSet(dead_tuples->itemptrs, 0, sizeof(ItemPointerData) * maxtuples);
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_TUPLES, dead_tuples);
-	vacrelstats->dead_tuples = dead_tuples;
+	vacrel->dead_tuples = dead_tuples;
 
 	/*
 	 * Allocate space for each worker's BufferUsage and WalUsage; no need to
@@ -3467,32 +3192,35 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
  * context, but that won't be safe (see ExitParallelMode).
  */
 static void
-end_parallel_vacuum(IndexBulkDeleteResult **stats, LVParallelState *lps,
-					int nindexes)
+end_parallel_vacuum(LVRelState *vacrel)
 {
-	int			i;
+	IndexBulkDeleteResult **indstats = vacrel->indstats;
+	LVParallelState *lps = vacrel->lps;
+	int			nindexes = vacrel->nindexes;
 
 	Assert(!IsParallelWorker());
 
 	/* Copy the updated statistics */
-	for (i = 0; i < nindexes; i++)
+	for (int idx = 0; idx < nindexes; idx++)
 	{
-		LVSharedIndStats *indstats = get_indstats(lps->lvshared, i);
+		LVSharedIndStats *shared_istat;
+
+		shared_istat = parallel_stats_for_idx(lps->lvshared, idx);
 
 		/*
 		 * Skip unused slot.  The statistics of this index are already stored
 		 * in local memory.
 		 */
-		if (indstats == NULL)
+		if (shared_istat == NULL)
 			continue;
 
-		if (indstats->updated)
+		if (shared_istat->updated)
 		{
-			stats[i] = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-			memcpy(stats[i], &(indstats->stats), sizeof(IndexBulkDeleteResult));
+			indstats[idx] = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+			memcpy(indstats[idx], &(shared_istat->istat), sizeof(IndexBulkDeleteResult));
 		}
 		else
-			stats[i] = NULL;
+			indstats[idx] = NULL;
 	}
 
 	DestroyParallelContext(lps->pcxt);
@@ -3500,23 +3228,364 @@ end_parallel_vacuum(IndexBulkDeleteResult **stats, LVParallelState *lps,
 
 	/* Deactivate parallel vacuum */
 	pfree(lps);
-	lps = NULL;
+	vacrel->lps = NULL;
 }
 
-/* Return the Nth index statistics or NULL */
-static LVSharedIndStats *
-get_indstats(LVShared *lvshared, int n)
+static void
+do_parallel_lazy_vacuum_all_indexes(LVRelState *vacrel)
+{
+	/* Tell parallel workers to do index vacuuming */
+	vacrel->lps->lvshared->for_cleanup = false;
+	vacrel->lps->lvshared->first_time = false;
+
+	/*
+	 * We can only provide an approximate value of num_heap_tuples in vacuum
+	 * cases.
+	 */
+	vacrel->lps->lvshared->reltuples = vacrel->old_live_tuples;
+	vacrel->lps->lvshared->estimated_count = true;
+
+	do_parallel_vacuum_or_cleanup(vacrel,
+								  vacrel->lps->nindexes_parallel_bulkdel);
+}
+
+static void
+do_parallel_lazy_cleanup_all_indexes(LVRelState *vacrel)
+{
+	int			nworkers;
+
+	/*
+	 * If parallel vacuum is active we perform index cleanup with parallel
+	 * workers.
+	 *
+	 * Tell parallel workers to do index cleanup.
+	 */
+	vacrel->lps->lvshared->for_cleanup = true;
+	vacrel->lps->lvshared->first_time = (vacrel->num_index_scans == 0);
+
+	/*
+	 * Now we can provide a better estimate of total number of surviving
+	 * tuples (we assume indexes are more interested in that than in the
+	 * number of nominally live tuples).
+	 */
+	vacrel->lps->lvshared->reltuples = vacrel->new_rel_tuples;
+	vacrel->lps->lvshared->estimated_count =
+		(vacrel->tupcount_pages < vacrel->rel_pages);
+
+	/* Determine the number of parallel workers to launch */
+	if (vacrel->lps->lvshared->first_time)
+		nworkers = vacrel->lps->nindexes_parallel_cleanup +
+			vacrel->lps->nindexes_parallel_condcleanup;
+	else
+		nworkers = vacrel->lps->nindexes_parallel_cleanup;
+
+	do_parallel_vacuum_or_cleanup(vacrel, nworkers);
+}
+
+/*
+ * Perform index vacuum or index cleanup with parallel workers.  This function
+ * must be used by the parallel vacuum leader process.  The caller must set
+ * lps->lvshared->for_cleanup to indicate whether to perform vacuum or
+ * cleanup.
+ */
+static void
+do_parallel_vacuum_or_cleanup(LVRelState *vacrel, int nworkers)
+{
+	LVParallelState *lps = vacrel->lps;
+
+	Assert(!IsParallelWorker());
+	Assert(vacrel->nindexes > 0);
+
+	/* The leader process will participate */
+	nworkers--;
+
+	/*
+	 * It is possible that parallel context is initialized with fewer workers
+	 * than the number of indexes that need a separate worker in the current
+	 * phase, so we need to consider it.  See compute_parallel_vacuum_workers.
+	 */
+	nworkers = Min(nworkers, lps->pcxt->nworkers);
+
+	/* Setup the shared cost-based vacuum delay and launch workers */
+	if (nworkers > 0)
+	{
+		if (vacrel->num_index_scans > 0)
+		{
+			/* Reset the parallel index processing counter */
+			pg_atomic_write_u32(&(lps->lvshared->idx), 0);
+
+			/* Reinitialize the parallel context to relaunch parallel workers */
+			ReinitializeParallelDSM(lps->pcxt);
+		}
+
+		/*
+		 * Set up shared cost balance and the number of active workers for
+		 * vacuum delay.  We need to do this before launching workers as
+		 * otherwise, they might not see the updated values for these
+		 * parameters.
+		 */
+		pg_atomic_write_u32(&(lps->lvshared->cost_balance), VacuumCostBalance);
+		pg_atomic_write_u32(&(lps->lvshared->active_nworkers), 0);
+
+		/*
+		 * The number of workers can vary between bulkdelete and cleanup
+		 * phase.
+		 */
+		ReinitializeParallelWorkers(lps->pcxt, nworkers);
+
+		LaunchParallelWorkers(lps->pcxt);
+
+		if (lps->pcxt->nworkers_launched > 0)
+		{
+			/*
+			 * Reset the local cost values for leader backend as we have
+			 * already accumulated the remaining balance of heap.
+			 */
+			VacuumCostBalance = 0;
+			VacuumCostBalanceLocal = 0;
+
+			/* Enable shared cost balance for leader backend */
+			VacuumSharedCostBalance = &(lps->lvshared->cost_balance);
+			VacuumActiveNWorkers = &(lps->lvshared->active_nworkers);
+		}
+
+		if (lps->lvshared->for_cleanup)
+			ereport(elevel,
+					(errmsg(ngettext("launched %d parallel vacuum worker for index cleanup (planned: %d)",
+									 "launched %d parallel vacuum workers for index cleanup (planned: %d)",
+									 lps->pcxt->nworkers_launched),
+							lps->pcxt->nworkers_launched, nworkers)));
+		else
+			ereport(elevel,
+					(errmsg(ngettext("launched %d parallel vacuum worker for index vacuuming (planned: %d)",
+									 "launched %d parallel vacuum workers for index vacuuming (planned: %d)",
+									 lps->pcxt->nworkers_launched),
+							lps->pcxt->nworkers_launched, nworkers)));
+	}
+
+	/* Process the indexes that can be processed by only leader process */
+	do_serial_processing_for_unsafe_indexes(vacrel, lps->lvshared);
+
+	/*
+	 * Join as a parallel worker.  The leader process alone processes all the
+	 * indexes in the case where no workers are launched.
+	 */
+	do_parallel_processing(vacrel, lps->lvshared);
+
+	/*
+	 * Next, accumulate buffer and WAL usage.  (This must wait for the workers
+	 * to finish, or we might get incomplete data.)
+	 */
+	if (nworkers > 0)
+	{
+		/* Wait for all vacuum workers to finish */
+		WaitForParallelWorkersToFinish(lps->pcxt);
+
+		for (int i = 0; i < lps->pcxt->nworkers_launched; i++)
+			InstrAccumParallelQuery(&lps->buffer_usage[i], &lps->wal_usage[i]);
+	}
+
+	/*
+	 * Carry the shared balance value to heap scan and disable shared costing
+	 */
+	if (VacuumSharedCostBalance)
+	{
+		VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance);
+		VacuumSharedCostBalance = NULL;
+		VacuumActiveNWorkers = NULL;
+	}
+}
+
+/*
+ * Index vacuum/cleanup routine used by the leader process and parallel
+ * vacuum worker processes to process the indexes in parallel.
+ */
+static void
+do_parallel_processing(LVRelState *vacrel, LVShared *lvshared)
+{
+	/*
+	 * Increment the active worker count if we are able to launch any worker.
+	 */
+	if (VacuumActiveNWorkers)
+		pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
+
+	/* Loop until all indexes are vacuumed */
+	for (;;)
+	{
+		int			idx;
+		LVSharedIndStats *shared_istat;
+		Relation	indrel;
+		IndexBulkDeleteResult *istat;
+
+		/* Get an index number to process */
+		idx = pg_atomic_fetch_add_u32(&(lvshared->idx), 1);
+
+		/* Done for all indexes? */
+		if (idx >= vacrel->nindexes)
+			break;
+
+		/* Get the index statistics of this index from DSM */
+		shared_istat = parallel_stats_for_idx(lvshared, idx);
+
+		/* Skip indexes not participating in parallelism */
+		if (shared_istat == NULL)
+			continue;
+
+		indrel = vacrel->indrels[idx];
+
+		/*
+		 * Skip processing indexes that are unsafe for workers (these are
+		 * processed in do_serial_processing_for_unsafe_indexes() by leader)
+		 */
+		if (!parallel_processing_is_safe(indrel, lvshared))
+			continue;
+
+		/* Do vacuum or cleanup of the index */
+		istat = (vacrel->indstats[idx]);
+		vacrel->indstats[idx] = parallel_process_one_index(indrel, istat,
+														   lvshared,
+														   shared_istat,
+														   vacrel);
+	}
+
+	/*
+	 * We have completed the index vacuum so decrement the active worker
+	 * count.
+	 */
+	if (VacuumActiveNWorkers)
+		pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
+}
+
+/*
+ * Vacuum or cleanup indexes that must be processed by the leader: indexes not
+ * participating in parallel vacuum, or unsafe for workers in this phase.
+ */
+static void
+do_serial_processing_for_unsafe_indexes(LVRelState *vacrel, LVShared *lvshared)
+{
+	Assert(!IsParallelWorker());
+
+	/*
+	 * Increment the active worker count if we are able to launch any worker.
+	 */
+	if (VacuumActiveNWorkers)
+		pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
+
+	for (int idx = 0; idx < vacrel->nindexes; idx++)
+	{
+		LVSharedIndStats *shared_istat;
+		Relation	indrel;
+		IndexBulkDeleteResult *istat;
+
+		shared_istat = parallel_stats_for_idx(lvshared, idx);
+		indrel = vacrel->indrels[idx];
+
+		/*
+		 * The leader must process any index the workers won't touch: an
+		 * index with no shared stats slot doesn't participate in parallel
+		 * vacuum at all, and an index that is unsafe for workers in this
+		 * phase is skipped by do_parallel_processing().  Everything else
+		 * is handled in parallel, so skip it here.
+		 */
+		if (shared_istat != NULL &&
+			parallel_processing_is_safe(indrel, lvshared))
+			continue;
+
+		/* Do vacuum or cleanup of the index */
+		istat = (vacrel->indstats[idx]);
+		vacrel->indstats[idx] = parallel_process_one_index(indrel, istat,
+														   lvshared,
+														   shared_istat,
+														   vacrel);
+	}
+
+	/*
+	 * We have completed the index vacuum so decrement the active worker
+	 * count.
+	 */
+	if (VacuumActiveNWorkers)
+		pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
+}
+
+/*
+ * Vacuum or cleanup index either by leader process or by one of the worker
+ * processes.  After processing the index this function copies the index
+ * statistics returned from ambulkdelete and amvacuumcleanup to the DSM
+ * segment.
+ */
+static IndexBulkDeleteResult *
+parallel_process_one_index(Relation indrel,
+						   IndexBulkDeleteResult *istat,
+						   LVShared *lvshared,
+						   LVSharedIndStats *shared_istat,
+						   LVRelState *vacrel)
+{
+	IndexBulkDeleteResult *bulkdelete_res = NULL;
+
+	if (shared_istat)
+	{
+		/* Get the space for IndexBulkDeleteResult */
+		bulkdelete_res = &(shared_istat->istat);
+
+		/*
+		 * Update the pointer to the corresponding bulk-deletion result if
+		 * someone has already updated it.
+		 */
+		if (shared_istat->updated && istat == NULL)
+			istat = bulkdelete_res;
+	}
+
+	/* Do vacuum or cleanup of the index */
+	if (lvshared->for_cleanup)
+		istat = lazy_cleanup_one_index(indrel, istat, lvshared->reltuples,
+									   lvshared->estimated_count, vacrel);
+	else
+		istat = lazy_vacuum_one_index(indrel, istat, lvshared->reltuples,
+									  vacrel);
+
+	/*
+	 * Copy the index bulk-deletion result returned from ambulkdelete and
+	 * amvacuumcleanup to the DSM segment if it's the first cycle because they
+	 * allocate locally and it's possible that an index will be vacuumed by a
+	 * different vacuum process the next cycle.  Copying the result normally
+	 * happens only the first time an index is vacuumed.  For any additional
+	 * vacuum pass, we directly point to the result on the DSM segment and
+	 * pass it to vacuum index APIs so that workers can update it directly.
+	 *
+	 * Since all vacuum workers write the bulk-deletion result at different
+	 * slots we can write them without locking.
+	 */
+	if (shared_istat && !shared_istat->updated && istat != NULL)
+	{
+		memcpy(bulkdelete_res, istat, sizeof(IndexBulkDeleteResult));
+		shared_istat->updated = true;
+
+		/*
+		 * Now that top-level indstats[idx] points to the DSM segment, we
+		 * don't need the locally allocated results.
+		 */
+		pfree(istat);
+		istat = bulkdelete_res;
+	}
+
+	return istat;
+}
+
+/*
+ * Return shared memory statistics for index at offset 'getidx', if any
+ */
+static LVSharedIndStats *
+parallel_stats_for_idx(LVShared *lvshared, int getidx)
 {
-	int			i;
 	char	   *p;
 
-	if (IndStatsIsNull(lvshared, n))
+	if (IndStatsIsNull(lvshared, getidx))
 		return NULL;
 
 	p = (char *) GetSharedIndStats(lvshared);
-	for (i = 0; i < n; i++)
+	for (int idx = 0; idx < getidx; idx++)
 	{
-		if (IndStatsIsNull(lvshared, i))
+		if (IndStatsIsNull(lvshared, idx))
 			continue;
 
 		p += sizeof(LVSharedIndStats);
@@ -3526,11 +3595,11 @@ get_indstats(LVShared *lvshared, int n)
 }
 
 /*
- * Returns true, if the given index can't participate in parallel index vacuum
- * or parallel index cleanup, false, otherwise.
+ * Returns false if the given index can't participate in parallel index
+ * vacuum or parallel index cleanup.
  */
 static bool
-skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared)
+parallel_processing_is_safe(Relation indrel, LVShared *lvshared)
 {
 	uint8		vacoptions = indrel->rd_indam->amparallelvacuumoptions;
 
@@ -3552,15 +3621,15 @@ skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared)
 		 */
 		if (!lvshared->first_time &&
 			((vacoptions & VACUUM_OPTION_PARALLEL_COND_CLEANUP) != 0))
-			return true;
+			return false;
 	}
 	else if ((vacoptions & VACUUM_OPTION_PARALLEL_BULKDEL) == 0)
 	{
 		/* Skip if the index does not support parallel bulk deletion */
-		return true;
+		return false;
 	}
 
-	return false;
+	return true;
 }
 
 /*
@@ -3580,7 +3649,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	WalUsage   *wal_usage;
 	int			nindexes;
 	char	   *sharedquery;
-	LVRelStats	vacrelstats;
+	LVRelState	vacrel;
 	ErrorContextCallback errcallback;
 
 	lvshared = (LVShared *) shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_SHARED,
@@ -3602,7 +3671,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	 * okay because the lock mode does not conflict among the parallel
 	 * workers.
 	 */
-	onerel = table_open(lvshared->relid, ShareUpdateExclusiveLock);
+	onerel = table_open(lvshared->onereloid, ShareUpdateExclusiveLock);
 
 	/*
 	 * Open all indexes. indrels are sorted in order by OID, which should be
@@ -3626,24 +3695,27 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	VacuumSharedCostBalance = &(lvshared->cost_balance);
 	VacuumActiveNWorkers = &(lvshared->active_nworkers);
 
-	vacrelstats.indstats = (IndexBulkDeleteResult **)
+	vacrel.onerel = onerel;
+	vacrel.indrels = indrels;
+	vacrel.nindexes = nindexes;
+	vacrel.indstats = (IndexBulkDeleteResult **)
 		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
 
 	if (lvshared->maintenance_work_mem_worker > 0)
 		maintenance_work_mem = lvshared->maintenance_work_mem_worker;
 
 	/*
-	 * Initialize vacrelstats for use as error callback arg by parallel
-	 * worker.
+	 * Initialize vacrel for use as error callback arg by parallel worker.
 	 */
-	vacrelstats.relnamespace = get_namespace_name(RelationGetNamespace(onerel));
-	vacrelstats.relname = pstrdup(RelationGetRelationName(onerel));
-	vacrelstats.indname = NULL;
-	vacrelstats.phase = VACUUM_ERRCB_PHASE_UNKNOWN; /* Not yet processing */
+	vacrel.relnamespace = get_namespace_name(RelationGetNamespace(onerel));
+	vacrel.relname = pstrdup(RelationGetRelationName(onerel));
+	vacrel.indname = NULL;
+	vacrel.phase = VACUUM_ERRCB_PHASE_UNKNOWN;	/* Not yet processing */
+	vacrel.dead_tuples = dead_tuples;
 
 	/* Setup error traceback support for ereport() */
 	errcallback.callback = vacuum_error_callback;
-	errcallback.arg = &vacrelstats;
+	errcallback.arg = &vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
 
@@ -3651,8 +3723,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	InstrStartParallelQuery();
 
 	/* Process indexes to perform vacuum/cleanup */
-	parallel_vacuum_index(indrels, lvshared, dead_tuples, nindexes,
-						  &vacrelstats);
+	do_parallel_processing(&vacrel, lvshared);
 
 	/* Report buffer/WAL usage during parallel execution */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
@@ -3665,7 +3736,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 
 	vac_close_indexes(nindexes, indrels, RowExclusiveLock);
 	table_close(onerel, ShareUpdateExclusiveLock);
-	pfree(vacrelstats.indstats);
+	pfree(vacrel.indstats);
 }
 
 /*
@@ -3674,7 +3745,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 static void
 vacuum_error_callback(void *arg)
 {
-	LVRelStats *errinfo = arg;
+	LVRelState *errinfo = arg;
 
 	switch (errinfo->phase)
 	{
@@ -3736,28 +3807,29 @@ vacuum_error_callback(void *arg)
  * the current information which can be later restored via restore_vacuum_error_info.
  */
 static void
-update_vacuum_error_info(LVRelStats *errinfo, LVSavedErrInfo *saved_err_info, int phase,
-						 BlockNumber blkno, OffsetNumber offnum)
+update_vacuum_error_info(LVRelState *vacrel, LVSavedErrInfo *saved_vacrel,
+						 int phase, BlockNumber blkno, OffsetNumber offnum)
 {
-	if (saved_err_info)
+	if (saved_vacrel)
 	{
-		saved_err_info->offnum = errinfo->offnum;
-		saved_err_info->blkno = errinfo->blkno;
-		saved_err_info->phase = errinfo->phase;
+		saved_vacrel->offnum = vacrel->offnum;
+		saved_vacrel->blkno = vacrel->blkno;
+		saved_vacrel->phase = vacrel->phase;
 	}
 
-	errinfo->blkno = blkno;
-	errinfo->offnum = offnum;
-	errinfo->phase = phase;
+	vacrel->blkno = blkno;
+	vacrel->offnum = offnum;
+	vacrel->phase = phase;
 }
 
 /*
  * Restores the vacuum information saved via a prior call to update_vacuum_error_info.
  */
 static void
-restore_vacuum_error_info(LVRelStats *errinfo, const LVSavedErrInfo *saved_err_info)
+restore_vacuum_error_info(LVRelState *vacrel,
+						  const LVSavedErrInfo *saved_vacrel)
 {
-	errinfo->blkno = saved_err_info->blkno;
-	errinfo->offnum = saved_err_info->offnum;
-	errinfo->phase = saved_err_info->phase;
+	vacrel->blkno = saved_vacrel->blkno;
+	vacrel->offnum = saved_vacrel->offnum;
+	vacrel->phase = saved_vacrel->phase;
 }
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 3d2dbed708..9b5afa12ad 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -689,7 +689,7 @@ index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap)
  */
 IndexBulkDeleteResult *
 index_bulk_delete(IndexVacuumInfo *info,
-				  IndexBulkDeleteResult *stats,
+				  IndexBulkDeleteResult *istat,
 				  IndexBulkDeleteCallback callback,
 				  void *callback_state)
 {
@@ -698,7 +698,7 @@ index_bulk_delete(IndexVacuumInfo *info,
 	RELATION_CHECKS;
 	CHECK_REL_PROCEDURE(ambulkdelete);
 
-	return indexRelation->rd_indam->ambulkdelete(info, stats,
+	return indexRelation->rd_indam->ambulkdelete(info, istat,
 												 callback, callback_state);
 }
 
@@ -710,14 +710,14 @@ index_bulk_delete(IndexVacuumInfo *info,
  */
 IndexBulkDeleteResult *
 index_vacuum_cleanup(IndexVacuumInfo *info,
-					 IndexBulkDeleteResult *stats)
+					 IndexBulkDeleteResult *istat)
 {
 	Relation	indexRelation = info->index;
 
 	RELATION_CHECKS;
 	CHECK_REL_PROCEDURE(amvacuumcleanup);
 
-	return indexRelation->rd_indam->amvacuumcleanup(info, stats);
+	return indexRelation->rd_indam->amvacuumcleanup(info, istat);
 }
 
 /* ----------------
-- 
2.27.0
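
For quick reference while reading the hunks above: the refactoring replaces
the old mix of Relation, LVRelStats *, LVDeadTuples *, and LVParallelState *
arguments with a single LVRelState pointer that is threaded through every
routine.  A rough sketch of that struct, limited to the fields the hunks
above actually touch (the full definition appears earlier in the patch, and
the member types here are inferred from usage, so treat it as illustrative
only):

typedef struct LVRelState
{
	/* Target heap relation and all of its indexes */
	Relation	onerel;
	Relation   *indrels;
	int			nindexes;

	/* Buffer access strategy and parallel vacuum state */
	BufferAccessStrategy bstrategy;
	LVParallelState *lps;

	/* Dead tuple TIDs and per-index bulk-delete results */
	LVDeadTuples *dead_tuples;
	IndexBulkDeleteResult **indstats;

	/* Counters carried between the heap passes and the index passes */
	BlockNumber rel_pages;
	BlockNumber tupcount_pages;
	BlockNumber nonempty_pages;
	BlockNumber pages_removed;
	double		old_live_tuples;
	double		new_rel_tuples;
	int			num_index_scans;
	TransactionId OldestXmin;
	TransactionId latestRemovedXid;
	bool		lock_waiter_detected;

	/* Error reporting state used by vacuum_error_callback() */
	char	   *relnamespace;
	char	   *relname;
	char	   *indname;
	BlockNumber blkno;
	OffsetNumber offnum;
	int			phase;			/* VACUUM_ERRCB_PHASE_* value */
} LVRelState;

This is also the struct that parallel_vacuum_main() partially fills in for
each worker, which is why the error callback fields live here rather than in
a separate error-context struct.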

Attachment: v8-0003-Remove-tupgone-special-case-from-vacuumlazy.c.patch (application/octet-stream)
From 946c2742e5da7e82f624ca4fae07c0f105575117 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 28 Mar 2021 20:55:55 -0700
Subject: [PATCH v8 3/4] Remove tupgone special case from vacuumlazy.c.

Retry the call to heap_page_prune() for the buffer being pruned and
vacuumed in rare cases where there is disagreement between the first
heap_page_prune() call and VACUUM's HeapTupleSatisfiesVacuum() call.
This was possible when a concurrently aborting transaction rendered a
live tuple dead in the tiny window between each check.  As a result,
VACUUM's definition of dead tuples (tuples that are to be deleted from
indexes during VACUUM) is simplified: it is always LP_DEAD stub line
pointers from the first scan of the heap.  Note that in general VACUUM
may not have actually done all the pruning that rendered tuples LP_DEAD.

This has the effect of decoupling index vacuuming (and heap page
vacuuming) from pruning during VACUUM's first heap pass.  The index
vacuum skipping performed by the INDEX_CLEANUP mechanism added by commit
a96c41f introduced one case where index vacuuming could be skipped, but
there are reasons to doubt that its approach was 100% robust.  Whereas
simply retrying pruning (and eliminating the tupgone steps entirely)
makes everything far simpler for heap vacuuming, and so far simpler in
general.

Heap vacuuming can now be thought of as conceptually similar to index
vacuuming and conceptually dissimilar to heap pruning.  Heap pruning now
has sole responsibility for anything involving the logical contents of
the database (e.g., managing transaction status information, recovery
conflicts, considering what to do with chains of tuples caused by
UPDATEs).  Whereas index vacuuming and heap vacuuming are now strictly
concerned with removing garbage tuples from a physical data structure
that backs the logical database.

This work enables INDEX_CLEANUP-style skipping of index vacuuming to be
pushed a lot further -- the decision can now be made dynamically (since
there is no question about leaving behind a dead tuple with storage due
to skipping the second heap pass/heap vacuuming).  An upcoming patch
from Masahiko Sawada will teach VACUUM to skip index vacuuming
dynamically, based on criteria involving the number of dead tuples.  The
only truly essential steps for VACUUM now all take place during the
first heap pass.  These are heap pruning and tuple freezing.  Everything
else is now an optional adjunct, at least in principle.

VACUUM can even change its mind about indexes (it can decide to give up
on deleting tuples from indexes).  There is no fundamental difference
between a VACUUM that decides to skip index vacuuming before it even
began, and a VACUUM that skips index vacuuming having already done a
certain amount of it.

Also remove XLOG_HEAP2_CLEANUP_INFO records.  These are no longer
necessary because we now rely entirely on heap pruning to take care of
recovery conflicts during VACUUM -- there is no longer any need to have
extra recovery conflicts due to the tupgone case allowing tuples that
still have storage (i.e. are not LP_DEAD) nevertheless being considered
dead tuples by VACUUM.  Note that heap vacuuming now uses exactly the
same strategy for recovery conflicts as index vacuuming.  Both
mechanisms now completely rely on heap pruning to generate all the
recovery conflicts that they require.

Also stop acquiring a super-exclusive lock for heap pages when they're
vacuumed during VACUUM's second heap pass.  A regular exclusive lock is
enough.  This is correct because heap page vacuuming is now strictly a
matter of setting the LP_DEAD line pointers to LP_UNUSED.  No other
backend can have a pointer to a tuple located in a pinned buffer that
can be invalidated by a concurrent heap page vacuum operation.  Note
that the page is no longer defragmented during heap page vacuuming,
because that is unsafe without a super-exclusive lock.

Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record
changes.

Credit for the idea of retrying pruning a page to avoid the tupgone case
goes to Andres Freund.
---
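
(Below the "---" cut line, so "git am" will ignore this note.)

To make the retry-instead-of-tupgone idea concrete, here is a minimal sketch
of the shape the first heap pass takes.  heap_page_prune() follows the
signature from the heapam.h hunk below and HeapTupleSatisfiesVacuum() is the
existing visibility routine; every other name in the sketch is illustrative
rather than taken from the patch:

#include "postgres.h"

#include "access/heapam.h"
#include "access/htup_details.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"

/*
 * Illustrative only -- not the patch's code.  If HeapTupleSatisfiesVacuum()
 * reports a DEAD tuple that still has storage after pruning (a concurrent
 * abort hit the tiny window between the two checks), prune the page again
 * rather than inventing a "tupgone" tuple.
 */
static void
sketch_scan_prune(Relation rel, Buffer buf, GlobalVisState *vistest,
				  TransactionId OldestXmin)
{
	Page		page = BufferGetPage(buf);
	BlockNumber blkno = BufferGetBlockNumber(buf);
	OffsetNumber offnum,
				maxoff;
	OffsetNumber dummy_off_loc;
	HeapTupleData tuple;

retry:
	/* Prune first; pruning is now the sole source of recovery conflicts */
	heap_page_prune(rel, buf, vistest, InvalidTransactionId, 0, false,
					&dummy_off_loc);

	maxoff = PageGetMaxOffsetNumber(page);
	for (offnum = FirstOffsetNumber;
		 offnum <= maxoff;
		 offnum = OffsetNumberNext(offnum))
	{
		ItemId		itemid = PageGetItemId(page, offnum);

		/* LP_UNUSED/LP_DEAD/LP_REDIRECT items can't disagree with pruning */
		if (!ItemIdIsNormal(itemid))
			continue;

		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
		tuple.t_len = ItemIdGetLength(itemid);
		tuple.t_tableOid = RelationGetRelid(rel);
		ItemPointerSet(&(tuple.t_self), blkno, offnum);

		if (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf) == HEAPTUPLE_DEAD)
			goto retry;			/* rare: concurrent abort, just prune again */

		/* ... freezing and visibility bookkeeping for surviving tuples ... */
	}
}

In the patch itself the corresponding logic lives in vacuumlazy.c's first
heap pass; the point of the sketch is only that the loop restarts instead of
ever treating a tuple with storage as one of VACUUM's dead tuples.
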
 src/include/access/heapam.h              |   2 +-
 src/include/access/heapam_xlog.h         |  41 ++---
 src/backend/access/gist/gistxlog.c       |   8 +-
 src/backend/access/hash/hash_xlog.c      |   8 +-
 src/backend/access/heap/heapam.c         | 205 ++++++++++------------
 src/backend/access/heap/pruneheap.c      |  60 ++++---
 src/backend/access/heap/vacuumlazy.c     | 211 +++++++++++------------
 src/backend/access/nbtree/nbtree.c       |   8 +-
 src/backend/access/rmgrdesc/heapdesc.c   |  32 ++--
 src/backend/replication/logical/decode.c |   4 +-
 src/tools/pgindent/typedefs.list         |   4 +-
 11 files changed, 275 insertions(+), 308 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bc0936bc2d..0bef090420 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -180,7 +180,7 @@ extern int	heap_page_prune(Relation relation, Buffer buffer,
 							struct GlobalVisState *vistest,
 							TransactionId old_snap_xmin,
 							TimestampTz old_snap_ts_ts,
-							bool report_stats, TransactionId *latestRemovedXid,
+							bool report_stats,
 							OffsetNumber *off_loc);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 178d49710a..d5df7c20df 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -51,9 +51,9 @@
  * these, too.
  */
 #define XLOG_HEAP2_REWRITE		0x00
-#define XLOG_HEAP2_CLEAN		0x10
-#define XLOG_HEAP2_FREEZE_PAGE	0x20
-#define XLOG_HEAP2_CLEANUP_INFO 0x30
+#define XLOG_HEAP2_PRUNE		0x10
+#define XLOG_HEAP2_VACUUM		0x20
+#define XLOG_HEAP2_FREEZE_PAGE	0x30
 #define XLOG_HEAP2_VISIBLE		0x40
 #define XLOG_HEAP2_MULTI_INSERT 0x50
 #define XLOG_HEAP2_LOCK_UPDATED 0x60
@@ -227,7 +227,8 @@ typedef struct xl_heap_update
 #define SizeOfHeapUpdate	(offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
 
 /*
- * This is what we need to know about vacuum page cleanup/redirect
+ * This is what we need to know about page pruning (both during VACUUM and
+ * during opportunistic pruning)
  *
  * The array of OffsetNumbers following the fixed part of the record contains:
  *	* for each redirected item: the item offset, then the offset redirected to
@@ -236,29 +237,32 @@ typedef struct xl_heap_update
  * The total number of OffsetNumbers is therefore 2*nredirected+ndead+nunused.
  * Note that nunused is not explicitly stored, but may be found by reference
  * to the total record length.
+ *
+ * Requires a super-exclusive lock.
  */
-typedef struct xl_heap_clean
+typedef struct xl_heap_prune
 {
 	TransactionId latestRemovedXid;
 	uint16		nredirected;
 	uint16		ndead;
 	/* OFFSET NUMBERS are in the block reference 0 */
-} xl_heap_clean;
+} xl_heap_prune;
 
-#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+#define SizeOfHeapPrune (offsetof(xl_heap_prune, ndead) + sizeof(uint16))
 
 /*
- * Cleanup_info is required in some cases during a lazy VACUUM.
- * Used for reporting the results of HeapTupleHeaderAdvanceLatestRemovedXid()
- * see vacuumlazy.c for full explanation
+ * The vacuum page record is similar to the prune record, but can only mark
+ * already dead items as unused
+ *
+ * Used by heap vacuuming only.  Does not require a super-exclusive lock.
  */
-typedef struct xl_heap_cleanup_info
+typedef struct xl_heap_vacuum
 {
-	RelFileNode node;
-	TransactionId latestRemovedXid;
-} xl_heap_cleanup_info;
+	uint16		nunused;
+	/* OFFSET NUMBERS are in the block reference 0 */
+} xl_heap_vacuum;
 
-#define SizeOfHeapCleanupInfo (sizeof(xl_heap_cleanup_info))
+#define SizeOfHeapVacuum (offsetof(xl_heap_vacuum, nunused) + sizeof(uint16))
 
 /* flags for infobits_set */
 #define XLHL_XMAX_IS_MULTI		0x01
@@ -397,13 +401,6 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
 extern void heap_xlog_logical_rewrite(XLogReaderState *r);
 
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
-										TransactionId latestRemovedXid);
-extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
-								 OffsetNumber *redirected, int nredirected,
-								 OffsetNumber *nowdead, int ndead,
-								 OffsetNumber *nowunused, int nunused,
-								 TransactionId latestRemovedXid);
 extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
 								  TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
 								  int ntuples);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1c80eae044..6464cb9281 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -184,10 +184,10 @@ gistRedoDeleteRecord(XLogReaderState *record)
 	 *
 	 * GiST delete records can conflict with standby queries.  You might think
 	 * that vacuum records would conflict as well, but we've handled that
-	 * already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual gist vacuum records on that index.
+	 * already.  XLOG_HEAP2_PRUNE records provide the highest xid cleaned by
+	 * the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual gist vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 02d9e6cdfd..af35a991fc 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -992,10 +992,10 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
 	 * Hash index records that are marked as LP_DEAD and being removed during
 	 * hash index tuple insertion can conflict with standby queries. You might
 	 * think that vacuum records would conflict as well, but we've handled
-	 * that already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual hash index vacuum records on that index.
+	 * that already.  XLOG_HEAP2_PRUNE records provide the highest xid cleaned
+	 * by the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual hash index vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 90711b2fcd..93bd57118e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7528,7 +7528,7 @@ heap_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
 			 * must have considered the original tuple header as part of
 			 * generating its own latestRemovedXid value.
 			 *
-			 * Relying on XLOG_HEAP2_CLEAN records like this is the same
+			 * Relying on XLOG_HEAP2_PRUNE records like this is the same
 			 * strategy that index vacuuming uses in all cases.  Index VACUUM
 			 * WAL records don't even have a latestRemovedXid field of their
 			 * own for this reason.
@@ -7947,88 +7947,6 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
 	return nblocksfavorable;
 }
 
-/*
- * Perform XLogInsert to register a heap cleanup info message. These
- * messages are sent once per VACUUM and are required because
- * of the phasing of removal operations during a lazy VACUUM.
- * see comments for vacuum_log_cleanup_info().
- */
-XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
-{
-	xl_heap_cleanup_info xlrec;
-	XLogRecPtr	recptr;
-
-	xlrec.node = rnode;
-	xlrec.latestRemovedXid = latestRemovedXid;
-
-	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, SizeOfHeapCleanupInfo);
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEANUP_INFO);
-
-	return recptr;
-}
-
-/*
- * Perform XLogInsert for a heap-clean operation.  Caller must already
- * have modified the buffer and marked it dirty.
- *
- * Note: prior to Postgres 8.3, the entries in the nowunused[] array were
- * zero-based tuple indexes.  Now they are one-based like other uses
- * of OffsetNumber.
- *
- * We also include latestRemovedXid, which is the greatest XID present in
- * the removed tuples. That allows recovery processing to cancel or wait
- * for long standby queries that can still see these tuples.
- */
-XLogRecPtr
-log_heap_clean(Relation reln, Buffer buffer,
-			   OffsetNumber *redirected, int nredirected,
-			   OffsetNumber *nowdead, int ndead,
-			   OffsetNumber *nowunused, int nunused,
-			   TransactionId latestRemovedXid)
-{
-	xl_heap_clean xlrec;
-	XLogRecPtr	recptr;
-
-	/* Caller should not call me on a non-WAL-logged relation */
-	Assert(RelationNeedsWAL(reln));
-
-	xlrec.latestRemovedXid = latestRemovedXid;
-	xlrec.nredirected = nredirected;
-	xlrec.ndead = ndead;
-
-	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, SizeOfHeapClean);
-
-	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-
-	/*
-	 * The OffsetNumber arrays are not actually in the buffer, but we pretend
-	 * that they are.  When XLogInsert stores the whole buffer, the offset
-	 * arrays need not be stored too.  Note that even if all three arrays are
-	 * empty, we want to expose the buffer as a candidate for whole-page
-	 * storage, since this record type implies a defragmentation operation
-	 * even if no line pointers changed state.
-	 */
-	if (nredirected > 0)
-		XLogRegisterBufData(0, (char *) redirected,
-							nredirected * sizeof(OffsetNumber) * 2);
-
-	if (ndead > 0)
-		XLogRegisterBufData(0, (char *) nowdead,
-							ndead * sizeof(OffsetNumber));
-
-	if (nunused > 0)
-		XLogRegisterBufData(0, (char *) nowunused,
-							nunused * sizeof(OffsetNumber));
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEAN);
-
-	return recptr;
-}
-
 /*
  * Perform XLogInsert for a heap-freeze operation.  Caller must have already
  * modified the buffer and marked it dirty.
@@ -8500,34 +8418,15 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed,
 }
 
 /*
- * Handles CLEANUP_INFO
+ * Handles XLOG_HEAP2_PRUNE record type.
+ *
+ * Acquires a super-exclusive lock.
  */
 static void
-heap_xlog_cleanup_info(XLogReaderState *record)
-{
-	xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
-
-	if (InHotStandby)
-		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
-
-	/*
-	 * Actual operation is a no-op. Record type exists to provide a means for
-	 * conflict processing to occur before we begin index vacuum actions. see
-	 * vacuumlazy.c and also comments in btvacuumpage()
-	 */
-
-	/* Backup blocks are not used in cleanup_info records */
-	Assert(!XLogRecHasAnyBlockRefs(record));
-}
-
-/*
- * Handles XLOG_HEAP2_CLEAN record type
- */
-static void
-heap_xlog_clean(XLogReaderState *record)
+heap_xlog_prune(XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
-	xl_heap_clean *xlrec = (xl_heap_clean *) XLogRecGetData(record);
+	xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record);
 	Buffer		buffer;
 	RelFileNode rnode;
 	BlockNumber blkno;
@@ -8538,12 +8437,8 @@ heap_xlog_clean(XLogReaderState *record)
 	/*
 	 * We're about to remove tuples. In Hot Standby mode, ensure that there's
 	 * no queries running for which the removed tuples are still visible.
-	 *
-	 * Not all HEAP2_CLEAN records remove tuples with xids, so we only want to
-	 * conflict on the records that cause MVCC failures for user queries. If
-	 * latestRemovedXid is invalid, skip conflict processing.
 	 */
-	if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
+	if (InHotStandby)
 		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
 
 	/*
@@ -8596,7 +8491,7 @@ heap_xlog_clean(XLogReaderState *record)
 		UnlockReleaseBuffer(buffer);
 
 		/*
-		 * After cleaning records from a page, it's useful to update the FSM
+		 * After pruning records from a page, it's useful to update the FSM
 		 * about it, as it may cause the page become target for insertions
 		 * later even if vacuum decides not to visit it (which is possible if
 		 * gets marked all-visible.)
@@ -8608,6 +8503,80 @@ heap_xlog_clean(XLogReaderState *record)
 	}
 }
 
+/*
+ * Handles XLOG_HEAP2_VACUUM record type.
+ *
+ * Acquires an exclusive lock only.
+ */
+static void
+heap_xlog_vacuum(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_heap_vacuum *xlrec = (xl_heap_vacuum *) XLogRecGetData(record);
+	Buffer		buffer;
+	BlockNumber blkno;
+	XLogRedoAction action;
+
+	/*
+	 * If we have a full-page image, restore it	(without using a cleanup lock)
+	 * and we're done.
+	 */
+	action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, false,
+										   &buffer);
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		page = (Page) BufferGetPage(buffer);
+		OffsetNumber *nowunused;
+		Size		datalen;
+		OffsetNumber *offnum;
+
+		nowunused = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
+
+		/* Shouldn't be a record unless there's something to do */
+		Assert(xlrec->nunused > 0);
+
+		/* Update all now-unused line pointers */
+		offnum = nowunused;
+		for (int i = 0; i < xlrec->nunused; i++)
+		{
+			OffsetNumber off = *offnum++;
+			ItemId		lp = PageGetItemId(page, off);
+
+			Assert(ItemIdIsDead(lp));
+			ItemIdSetUnused(lp);
+		}
+
+		/*
+		 * Update the page's hint bit about whether it has free pointers
+		 */
+		PageSetHasFreeLinePointers(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+
+	if (BufferIsValid(buffer))
+	{
+		Size		freespace = PageGetHeapFreeSpace(BufferGetPage(buffer));
+		RelFileNode rnode;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * After vacuuming LP_DEAD items from a page, it's useful to update
+		 * the FSM about it, as it may cause the page to become a target for
+		 * insertions later even if vacuum decides not to visit it (which is
+		 * possible if it gets marked all-visible).
+		 *
+		 * Do this regardless of a full-page image being applied, since the
+		 * FSM data is not in the page anyway.
+		 */
+		XLogRecordPageWithFreeSpace(rnode, blkno, freespace);
+	}
+}
+
 /*
  * Replay XLOG_HEAP2_VISIBLE record.
  *
@@ -9712,15 +9681,15 @@ heap2_redo(XLogReaderState *record)
 
 	switch (info & XLOG_HEAP_OPMASK)
 	{
-		case XLOG_HEAP2_CLEAN:
-			heap_xlog_clean(record);
+		case XLOG_HEAP2_PRUNE:
+			heap_xlog_prune(record);
+			break;
+		case XLOG_HEAP2_VACUUM:
+			heap_xlog_vacuum(record);
 			break;
 		case XLOG_HEAP2_FREEZE_PAGE:
 			heap_xlog_freeze_page(record);
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			heap_xlog_cleanup_info(record);
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			heap_xlog_visible(record);
 			break;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406..f75502ca2c 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -182,13 +182,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		 */
 		if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
 		{
-			TransactionId ignore = InvalidTransactionId;	/* return value not
-															 * needed */
-
 			/* OK to prune */
 			(void) heap_page_prune(relation, buffer, vistest,
 								   limited_xmin, limited_ts,
-								   true, &ignore, NULL);
+								   true, NULL);
 		}
 
 		/* And release buffer lock */
@@ -213,8 +210,6 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * send its own new total to pgstats, and we don't want this delta applied
  * on top of that.)
  *
- * Sets latestRemovedXid for caller on return.
- *
  * off_loc is the offset location required by the caller to use in error
  * callback.
  *
@@ -225,7 +220,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 				GlobalVisState *vistest,
 				TransactionId old_snap_xmin,
 				TimestampTz old_snap_ts,
-				bool report_stats, TransactionId *latestRemovedXid,
+				bool report_stats,
 				OffsetNumber *off_loc)
 {
 	int			ndeleted = 0;
@@ -251,7 +246,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	prstate.old_snap_xmin = old_snap_xmin;
 	prstate.old_snap_ts = old_snap_ts;
 	prstate.old_snap_used = false;
-	prstate.latestRemovedXid = *latestRemovedXid;
+	prstate.latestRemovedXid = InvalidTransactionId;
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));
 
@@ -318,17 +313,41 @@ heap_page_prune(Relation relation, Buffer buffer,
 		MarkBufferDirty(buffer);
 
 		/*
-		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
+		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
 		if (RelationNeedsWAL(relation))
 		{
+			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
 
-			recptr = log_heap_clean(relation, buffer,
-									prstate.redirected, prstate.nredirected,
-									prstate.nowdead, prstate.ndead,
-									prstate.nowunused, prstate.nunused,
-									prstate.latestRemovedXid);
+			xlrec.latestRemovedXid = prstate.latestRemovedXid;
+			xlrec.nredirected = prstate.nredirected;
+			xlrec.ndead = prstate.ndead;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, SizeOfHeapPrune);
+
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+			/*
+			 * The OffsetNumber arrays are not actually in the buffer, but we
+			 * pretend that they are.  When XLogInsert stores the whole
+			 * buffer, the offset arrays need not be stored too.
+			 */
+			if (prstate.nredirected > 0)
+				XLogRegisterBufData(0, (char *) prstate.redirected,
+									prstate.nredirected *
+									sizeof(OffsetNumber) * 2);
+
+			if (prstate.ndead > 0)
+				XLogRegisterBufData(0, (char *) prstate.nowdead,
+									prstate.ndead * sizeof(OffsetNumber));
+
+			if (prstate.nunused > 0)
+				XLogRegisterBufData(0, (char *) prstate.nowunused,
+									prstate.nunused * sizeof(OffsetNumber));
+
+			recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE);
 
 			PageSetLSN(BufferGetPage(buffer), recptr);
 		}
@@ -363,8 +382,6 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (report_stats && ndeleted > prstate.ndead)
 		pgstat_update_heap_dead_tuples(relation, ndeleted - prstate.ndead);
 
-	*latestRemovedXid = prstate.latestRemovedXid;
-
 	/*
 	 * XXX Should we update the FSM information of this page ?
 	 *
@@ -809,12 +826,8 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
 
 /*
  * Perform the actual page changes needed by heap_page_prune.
- * It is expected that the caller has suitable pin and lock on the
- * buffer, and is inside a critical section.
- *
- * This is split out because it is also used by heap_xlog_clean()
- * to replay the WAL record when needed after a crash.  Note that the
- * arguments are identical to those of log_heap_clean().
+ * It is expected that the caller has a super-exclusive lock on the
+ * buffer.
  */
 void
 heap_page_prune_execute(Buffer buffer,
@@ -826,6 +839,9 @@ heap_page_prune_execute(Buffer buffer,
 	OffsetNumber *offnum;
 	int			i;
 
+	/* Shouldn't be called unless there's something to do */
+	Assert(nredirected > 0 || ndead > 0 || nunused > 0);
+
 	/* Update all redirected line pointers */
 	offnum = redirected;
 	for (i = 0; i < nredirected; i++)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 72cb066e0a..e146c20e33 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -305,7 +305,6 @@ typedef struct LVRelState
 	/* onerel's initial relfrozenxid and relminmxid */
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
-	TransactionId latestRemovedXid;
 
 	/* VACUUM operation's cutoff for pruning */
 	TransactionId OldestXmin;
@@ -402,8 +401,7 @@ static void lazy_scan_setvmbit(LVRelState *vacrel, Buffer buf,
 static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 							GlobalVisState *vistest,
 							LVPagePruneState *pageprunestate,
-							LVPageVisMapState *pagevmstate,
-							VacOptTernaryValue index_cleanup);
+							LVPageVisMapState *pagevmstate);
 static void lazy_vacuum(LVRelState *vacrel);
 static void lazy_vacuum_all_indexes(LVRelState *vacrel);
 static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -565,7 +563,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	vacrel->old_live_tuples = onerel->rd_rel->reltuples;
 	vacrel->relfrozenxid = onerel->rd_rel->relfrozenxid;
 	vacrel->relminmxid = onerel->rd_rel->relminmxid;
-	vacrel->latestRemovedXid = InvalidTransactionId;
 
 	/* Set cutoffs for entire VACUUM */
 	vacrel->OldestXmin = OldestXmin;
@@ -807,40 +804,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	}
 }
 
-/*
- * For Hot Standby we need to know the highest transaction id that will
- * be removed by any change. VACUUM proceeds in a number of passes so
- * we need to consider how each pass operates. The first phase runs
- * heap_page_prune(), which can issue XLOG_HEAP2_CLEAN records as it
- * progresses - these will have a latestRemovedXid on each record.
- * In some cases this removes all of the tuples to be removed, though
- * often we have dead tuples with index pointers so we must remember them
- * for removal in phase 3. Index records for those rows are removed
- * in phase 2 and index blocks do not have MVCC information attached.
- * So before we can allow removal of any index tuples we need to issue
- * a WAL record containing the latestRemovedXid of rows that will be
- * removed in phase three. This allows recovery queries to block at the
- * correct place, i.e. before phase two, rather than during phase three
- * which would be after the rows have become inaccessible.
- */
-static void
-vacuum_log_cleanup_info(LVRelState *vacrel)
-{
-	/*
-	 * Skip this for relations for which no WAL is to be written, or if we're
-	 * not trying to support archive recovery.
-	 */
-	if (!RelationNeedsWAL(vacrel->onerel) || !XLogIsNeeded())
-		return;
-
-	/*
-	 * No need to write the record at all unless it contains a valid value
-	 */
-	if (TransactionIdIsValid(vacrel->latestRemovedXid))
-		(void) log_heap_cleanup_info(vacrel->onerel->rd_node,
-									 vacrel->latestRemovedXid);
-}
-
 /*
  *	lazy_scan_heap() -- scan an open heap relation
  *
@@ -1287,8 +1250,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		 * Also handles tuple freezing -- considers freezing XIDs from all
 		 * tuple headers left behind following pruning.
 		 */
-		lazy_scan_prune(vacrel, buf, vistest, &pageprunestate, &pagevmstate,
-						params->index_cleanup);
+		lazy_scan_prune(vacrel, buf, vistest, &pageprunestate, &pagevmstate);
 
 		/*
 		 * Step 7 for block: Set up details for saving free space in FSM at
@@ -1730,21 +1692,41 @@ lazy_scan_setvmbit(LVRelState *vacrel, Buffer buf, Buffer vmbuffer,
  *	lazy_scan_prune() -- lazy_scan_heap() pruning and freezing.
  *
  * Caller must hold pin and buffer cleanup lock on the buffer.
+ *
+ * Prior to PostgreSQL 14 there were very rare cases where heap_page_prune()
+ * was allowed to disagree with our HeapTupleSatisfiesVacuum() call about
+ * whether or not a tuple should be considered DEAD.  This happened when an
+ * inserting transaction concurrently aborted (after our heap_page_prune()
+ * call, before our HeapTupleSatisfiesVacuum() call).  Aborted transactions
+ * have tuples that we can treat as DEAD without caring about where their
+ * tuple header XIDs are with respect to the OldestXid cutoff.
+ *
+ * This created rare, hard to test cases -- exceptions to the general rule
+ * that TIDs that we enter into the dead_tuples array are in fact just LP_DEAD
+ * items without storage.  We had rather a lot of complexity to account for
+ * tuples that were dead, but still had storage, and so still had a tuple
+ * header with XIDs that were not quite unambiguously after the FreezeLimit
+ * cutoff.
+ *
+ * The approach we take here now is a little crude, but it's also simple and
+ * robust: we restart pruning when the race condition is detected.  This
+ * guarantees that any items that make it into the dead_tuples array are
+ * simple LP_DEAD line pointers, and that every item with tuple storage is
+ * considered as a candidate for freezing.
  */
 static void
 lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 				LVPagePruneState *pageprunestate,
-				LVPageVisMapState *pagevmstate,
-				VacOptTernaryValue index_cleanup)
+				LVPageVisMapState *pagevmstate)
 {
 	Relation	onerel = vacrel->onerel;
-	bool		tupgone;
 	BlockNumber blkno;
 	Page		page;
 	OffsetNumber offnum,
 				maxoff;
 	ItemId		itemid;
 	HeapTupleData tuple;
+	HTSV_Result res;
 	int			tuples_deleted,
 				lpdead_items,
 				new_dead_tuples,
@@ -1759,6 +1741,8 @@ lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 	blkno = BufferGetBlockNumber(buf);
 	page = BufferGetPage(buf);
 
+retry:
+
 	/* Initialize (or reset) page-level counters */
 	tuples_deleted = 0;
 	lpdead_items = 0;
@@ -1776,19 +1760,20 @@ lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 	 */
 	tuples_deleted = heap_page_prune(onerel, buf, vistest,
 									 InvalidTransactionId, 0, false,
-									 &vacrel->latestRemovedXid,
 									 &vacrel->offnum);
 
 	/*
 	 * Now scan the page to collect vacuumable items and check for tuples
 	 * requiring freezing.
+	 *
+	 * Note: If we retry having set pagevmstate.visibility_cutoff_xid it
+	 * doesn't matter -- the newest XMIN on page can't be missed this way.
 	 */
 	pageprunestate->hastup = false;
 	pageprunestate->has_lpdead_items = false;
 	pageprunestate->all_visible = true;
 	pageprunestate->all_frozen = true;
 	ntupoffsets = 0;
-	tupgone = false;
 	maxoff = PageGetMaxOffsetNumber(page);
 
 	/*
@@ -1845,6 +1830,17 @@ lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 		tuple.t_len = ItemIdGetLength(itemid);
 		tuple.t_tableOid = RelationGetRelid(onerel);
 
+		/*
+		 * DEAD tuples are almost always pruned into LP_DEAD line pointers by
+		 * heap_page_prune(), but it's possible that the tuple state changed
+		 * since heap_page_prune() looked.  Handle that here by restarting.
+		 * (See comments at the top of function for a full explanation.)
+		 */
+		res = HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf);
+
+		if (unlikely(res == HEAPTUPLE_DEAD))
+			goto retry;
+
 		/*
 		 * The criteria for counting a tuple as live in this block need to
 		 * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
@@ -1855,42 +1851,8 @@ lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 		 * VACUUM can't run inside a transaction block, which makes some cases
 		 * impossible (e.g. in-progress insert from the same transaction).
 		 */
-		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+		switch (res)
 		{
-			case HEAPTUPLE_DEAD:
-
-				/*
-				 * Ordinarily, DEAD tuples would have been removed by
-				 * heap_page_prune(), but it's possible that the tuple state
-				 * changed since heap_page_prune() looked.  In particular an
-				 * INSERT_IN_PROGRESS tuple could have changed to DEAD if the
-				 * inserter aborted.  So this cannot be considered an error
-				 * condition.
-				 *
-				 * If the tuple is HOT-updated then it must only be removed by
-				 * a prune operation; so we keep it just as if it were
-				 * RECENTLY_DEAD.  Also, if it's a heap-only tuple, we choose
-				 * to keep it, because it'll be a lot cheaper to get rid of it
-				 * in the next pruning pass than to treat it like an indexed
-				 * tuple. Finally, if index cleanup is disabled, the second
-				 * heap pass will not execute, and the tuple will not get
-				 * removed, so we must treat it like any other dead tuple that
-				 * we choose to keep.
-				 *
-				 * If this were to happen for a tuple that actually needed to
-				 * be deleted, we'd be in trouble, because it'd possibly leave
-				 * a tuple below the relation's xmin horizon alive.
-				 * heap_prepare_freeze_tuple() is prepared to detect that case
-				 * and abort the transaction, preventing corruption.
-				 */
-				if (HeapTupleIsHotUpdated(&tuple) ||
-					HeapTupleIsHeapOnly(&tuple) ||
-					index_cleanup == VACOPT_TERNARY_DISABLED)
-					new_dead_tuples++;
-				else
-					tupgone = true; /* we can delete the tuple */
-				pageprunestate->all_visible = false;
-				break;
 			case HEAPTUPLE_LIVE:
 
 				/*
@@ -1938,7 +1900,8 @@ lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 
 				/*
 				 * If tuple is recently deleted then we must not remove it
-				 * from relation.
+				 * from relation.  (We only remove items that are LP_DEAD from
+				 * pruning.)
 				 */
 				new_dead_tuples++;
 				pageprunestate->all_visible = false;
@@ -1972,24 +1935,13 @@ lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 				break;
 		}
 
-		if (tupgone)
-		{
-			/* Pretend that this is an LP_DEAD item  */
-			deadoffsets[lpdead_items++] = offnum;
-			/* But remember it for XLOG_HEAP2_CLEANUP_INFO record */
-			HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-												   &vacrel->latestRemovedXid);
-		}
-		else
-		{
-			/*
-			 * Each non-removable tuple must be checked to see if it needs
-			 * freezing
-			 */
-			tupoffsets[ntupoffsets++] = offnum;
-			num_tuples++;
-			pageprunestate->hastup = true;
-		}
+		/*
+		 * Each non-removable tuple must be checked to see if it needs
+		 * freezing
+		 */
+		tupoffsets[ntupoffsets++] = offnum;
+		num_tuples++;
+		pageprunestate->hastup = true;
 	}
 
 	/*
@@ -2000,9 +1952,6 @@ lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 	 *
 	 * Add page level counters to caller's counts, and then actually process
 	 * LP_DEAD and LP_NORMAL items.
-	 *
-	 * TODO: Remove tupgone logic entirely in next commit -- we shouldn't have
-	 * to pretend that DEAD items are LP_DEAD items.
 	 */
 	Assert(lpdead_items + ntupoffsets + nunused + nredirect == maxoff);
 	vacrel->offnum = InvalidOffsetNumber;
@@ -2162,9 +2111,6 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
 	Assert(MultiXactIdIsValid(vacrel->relminmxid));
 
-	/* Log cleanup info before we touch indexes */
-	vacuum_log_cleanup_info(vacrel);
-
 	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
@@ -2187,6 +2133,13 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 		do_parallel_lazy_vacuum_all_indexes(vacrel);
 	}
 
+	/*
+	 * We delete all LP_DEAD items from the first heap pass in all indexes on
+	 * each call here.  This makes call to lazy_vacuum_heap_rel() safe.
+	 */
+	Assert(vacrel->num_index_scans > 1 ||
+		   vacrel->dead_tuples->num_tuples == vacrel->lpdead_items);
+
 	/* Increase and report the number of index scans */
 	vacrel->num_index_scans++;
 	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
@@ -2421,6 +2374,12 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 		vmbuffer = InvalidBuffer;
 	}
 
+	/*
+	 * We set all LP_DEAD items from the first heap pass to LP_UNUSED during
+	 * the second heap pass.  No more, no less.
+	 */
+	Assert(vacrel->num_index_scans > 1 || tupindex == vacrel->lpdead_items);
+
 	ereport(elevel,
 			(errmsg("\"%s\": removed %d dead item identifiers in %u pages",
 					vacrel->relname, tupindex, vacuumed_pages),
@@ -2431,14 +2390,25 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 }
 
 /*
- *	lazy_vacuum_heap_page() -- free dead tuples on a page
- *						  and repair its fragmentation.
+ *	lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
+ *						  vacrel->dead_tuples array.
  *
- * Caller must hold pin and buffer cleanup lock on the buffer.
+ * Caller must have an exclusive buffer lock on the buffer (though a
+ * super-exclusive lock is also acceptable).
  *
  * tupindex is the index in vacrel->dead_tuples of the first dead tuple for
  * this page.  We assume the rest follow sequentially.  The return value is
  * the first tupindex after the tuples of this page.
+ *
+ * Prior to PostgreSQL 14 there were rare cases where this routine had to set
+ * tuples with storage to unused.  These days it is strictly responsible for
+ * marking LP_DEAD stub line pointers as unused.  This only happens for those
+ * LP_DEAD items on the page that were determined to be LP_DEAD items back
+ * when the same page was visited by lazy_scan_prune() (i.e. those whose TID
+ * was recorded in the dead_tuples array).
+ *
+ * We cannot defragment the page here because that isn't safe while only
+ * holding an exclusive lock.
  */
 static int
 lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
@@ -2474,11 +2444,15 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 			break;				/* past end of tuples for this block */
 		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
 		itemid = PageGetItemId(page, toff);
+
+		Assert(ItemIdIsDead(itemid));
 		ItemIdSetUnused(itemid);
 		unused[uncnt++] = toff;
 	}
 
-	PageRepairFragmentation(page);
+	Assert(uncnt > 0);
+
+	PageSetHasFreeLinePointers(page);
 
 	/*
 	 * Mark buffer dirty before we write WAL.
@@ -2488,12 +2462,19 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	/* XLOG stuff */
 	if (RelationNeedsWAL(vacrel->onerel))
 	{
+		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
 
-		recptr = log_heap_clean(vacrel->onerel, buffer,
-								NULL, 0, NULL, 0,
-								unused, uncnt,
-								vacrel->latestRemovedXid);
+		xlrec.nunused = uncnt;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHeapVacuum);
+
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) unused, uncnt * sizeof(OffsetNumber));
+
+		recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VACUUM);
+
 		PageSetLSN(page, recptr);
 	}
 
@@ -2506,10 +2487,10 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	END_CRIT_SECTION();
 
 	/*
-	 * Now that we have removed the dead tuples from the page, once again
+	 * Now that we have removed the LP_DEAD items from the page, once again
 	 * check if the page has become all-visible.  The page is already marked
 	 * dirty, exclusively locked, and, if needed, a full page image has been
-	 * emitted in the log_heap_clean() above.
+	 * emitted.
 	 */
 	if (heap_page_is_all_visible(vacrel, buffer, &visibility_cutoff_xid,
 								 &all_frozen))
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 9282c9ea22..1360ab80c1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1213,10 +1213,10 @@ backtrack:
 				 * as long as the callback function only considers whether the
 				 * index tuple refers to pre-cutoff heap tuples that were
 				 * certainly already pruned away during VACUUM's initial heap
-				 * scan by the time we get here. (heapam's XLOG_HEAP2_CLEAN
-				 * and XLOG_HEAP2_CLEANUP_INFO records produce conflicts using
-				 * a latestRemovedXid value for the pointed-to heap tuples, so
-				 * there is no need to produce our own conflict now.)
+				 * scan by the time we get here. (heapam's XLOG_HEAP2_PRUNE
+				 * records produce conflicts using a latestRemovedXid value
+				 * for the pointed-to heap tuples, so there is no need to
+				 * produce our own conflict now.)
 				 *
 				 * Backends with snapshots acquired after a VACUUM starts but
 				 * before it finishes could have visibility cutoff with a
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index e60e32b935..f8b4fb901b 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -121,11 +121,21 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
 	info &= XLOG_HEAP_OPMASK;
-	if (info == XLOG_HEAP2_CLEAN)
+	if (info == XLOG_HEAP2_PRUNE)
 	{
-		xl_heap_clean *xlrec = (xl_heap_clean *) rec;
+		xl_heap_prune *xlrec = (xl_heap_prune *) rec;
 
-		appendStringInfo(buf, "latestRemovedXid %u", xlrec->latestRemovedXid);
+		/* XXX Should display implicit 'nunused' field, too */
+		appendStringInfo(buf, "latestRemovedXid %u nredirected %u ndead %u",
+						 xlrec->latestRemovedXid,
+						 xlrec->nredirected,
+						 xlrec->ndead);
+	}
+	else if (info == XLOG_HEAP2_VACUUM)
+	{
+		xl_heap_vacuum *xlrec = (xl_heap_vacuum *) rec;
+
+		appendStringInfo(buf, "nunused %u", xlrec->nunused);
 	}
 	else if (info == XLOG_HEAP2_FREEZE_PAGE)
 	{
@@ -134,12 +144,6 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "cutoff xid %u ntuples %u",
 						 xlrec->cutoff_xid, xlrec->ntuples);
 	}
-	else if (info == XLOG_HEAP2_CLEANUP_INFO)
-	{
-		xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
-
-		appendStringInfo(buf, "latestRemovedXid %u", xlrec->latestRemovedXid);
-	}
 	else if (info == XLOG_HEAP2_VISIBLE)
 	{
 		xl_heap_visible *xlrec = (xl_heap_visible *) rec;
@@ -229,15 +233,15 @@ heap2_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_HEAP2_CLEAN:
-			id = "CLEAN";
+		case XLOG_HEAP2_PRUNE:
+			id = "PRUNE";
+			break;
+		case XLOG_HEAP2_VACUUM:
+			id = "VACUUM";
 			break;
 		case XLOG_HEAP2_FREEZE_PAGE:
 			id = "FREEZE_PAGE";
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			id = "CLEANUP_INFO";
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			id = "VISIBLE";
 			break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5f596135b1..391caf7396 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -480,8 +480,8 @@ DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 * interested in.
 			 */
 		case XLOG_HEAP2_FREEZE_PAGE:
-		case XLOG_HEAP2_CLEAN:
-		case XLOG_HEAP2_CLEANUP_INFO:
+		case XLOG_HEAP2_PRUNE:
+		case XLOG_HEAP2_VACUUM:
 		case XLOG_HEAP2_VISIBLE:
 		case XLOG_HEAP2_LOCK_UPDATED:
 			break;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e6777e9d0..0a75dccb93 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3554,8 +3554,6 @@ xl_hash_split_complete
 xl_hash_squeeze_page
 xl_hash_update_meta_page
 xl_hash_vacuum_one_page
-xl_heap_clean
-xl_heap_cleanup_info
 xl_heap_confirm
 xl_heap_delete
 xl_heap_freeze_page
@@ -3567,9 +3565,11 @@ xl_heap_lock
 xl_heap_lock_updated
 xl_heap_multi_insert
 xl_heap_new_cid
+xl_heap_prune
 xl_heap_rewrite_mapping
 xl_heap_truncate
 xl_heap_update
+xl_heap_vacuum
 xl_heap_visible
 xl_invalid_page
 xl_invalid_page_key
-- 
2.27.0

Attachment: v8-0004-Skip-index-vacuuming-in-some-cases.patch (application/octet-stream)
From fdee9ec280063fd393af968c80dac56e5738d866 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 28 Mar 2021 20:55:55 -0700
Subject: [PATCH v8 4/4] Skip index vacuuming in some cases.

Skip index vacuuming in two cases: The case where there are so few dead
tuples that index vacuuming seems unnecessary, and the case where the
relfrozenxid of the table being vacuumed is dangerously far in the past.

This commit adds new GUC parameters vacuum_skip_index_age and
vacuum_multixact_skip_index_age, which specify the age at which VACUUM
should skip index cleanup in order to finish quickly and advance
relfrozenxid/relminmxid.

After each round of index vacuuming (in the non-parallel vacuum case),
we check whether the table's relfrozenxid/relminmxid have become too
old according to those new GUC parameters.  If so, we skip further
index vacuuming within the vacuum operation.

This behavior is intended to deal with the risk of XID wraparound, so
the default values are much higher than the freeze settings: 1.8
billion.

Although users can set those parameters, VACUUM silently adjusts the
effective value to no less than 105% of
autovacuum_freeze_max_age/autovacuum_multixact_freeze_max_age, so that
only anti-wraparound autovacuums and aggressive scans have a chance to
skip index vacuuming.
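
(A rough, standalone sketch of the clamping rule above -- not the
patch's code, and the function names below are invented for the
illustration.  It uses plain 64-bit ages and ignores wraparound-aware
TransactionId comparison; the patch itself performs this check in
vacuum_xid_limit_emergency() using TransactionIdPrecedes().)

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Effective skip age: never below 105% of autovacuum_freeze_max_age */
static int64_t
effective_skip_age(int64_t vacuum_skip_index_age,
				   int64_t autovacuum_freeze_max_age)
{
	int64_t		min_age = (int64_t) (autovacuum_freeze_max_age * 1.05);

	return vacuum_skip_index_age > min_age ? vacuum_skip_index_age : min_age;
}

/* Would this table's relfrozenxid age trigger skipping index vacuuming? */
static bool
skip_index_vacuum(int64_t relfrozenxid_age, int64_t vacuum_skip_index_age,
				  int64_t autovacuum_freeze_max_age)
{
	return relfrozenxid_age > effective_skip_age(vacuum_skip_index_age,
												 autovacuum_freeze_max_age);
}

int
main(void)
{
	/* Defaults: vacuum_skip_index_age = 1.8 billion, freeze_max_age = 200 million */
	printf("effective skip age with defaults: %lld\n",
		   (long long) effective_skip_age(1800000000, 200000000));
	/* Setting the GUC very low is silently clamped to 210 million here */
	printf("effective skip age with GUC set to 0: %lld\n",
		   (long long) effective_skip_age(0, 200000000));
	printf("skip at relfrozenxid age 1.9 billion? %d\n",
		   skip_index_vacuum(1900000000, 1800000000, 200000000));
	return 0;
}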
---
 src/include/commands/vacuum.h                 |   4 +
 src/backend/access/heap/vacuumlazy.c          | 249 +++++++++++++++++-
 src/backend/commands/vacuum.c                 |  61 +++++
 src/backend/utils/misc/guc.c                  |  25 +-
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 doc/src/sgml/config.sgml                      |  51 ++++
 doc/src/sgml/maintenance.sgml                 |  10 +-
 7 files changed, 385 insertions(+), 17 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d029da5ac0..d3d44d9bac 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -235,6 +235,8 @@ extern int	vacuum_freeze_min_age;
 extern int	vacuum_freeze_table_age;
 extern int	vacuum_multixact_freeze_min_age;
 extern int	vacuum_multixact_freeze_table_age;
+extern int	vacuum_skip_index_age;
+extern int	vacuum_multixact_skip_index_age;
 
 /* Variables for cost-based parallel vacuum */
 extern pg_atomic_uint32 *VacuumSharedCostBalance;
@@ -270,6 +272,8 @@ extern void vacuum_set_xid_limits(Relation rel,
 								  TransactionId *xidFullScanLimit,
 								  MultiXactId *multiXactCutoff,
 								  MultiXactId *mxactFullScanLimit);
+extern bool vacuum_xid_limit_emergency(TransactionId relfrozenxid,
+									   MultiXactId   relminmxid);
 extern void vac_update_datfrozenxid(void);
 extern void vacuum_delay_point(void);
 extern bool vacuum_is_relation_owner(Oid relid, Form_pg_class reltuple,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index e146c20e33..384a89b74d 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -131,6 +131,12 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+/*
+ * The threshold of the percentage of heap blocks having LP_DEAD line pointer
+ * above which index vacuuming goes ahead.
+ */
+#define SKIP_VACUUM_PAGES_RATIO		0.02
+
 /*
  * DSM keys for parallel vacuum.  Unlike other parallel execution code, since
  * we don't need to worry about DSM keys conflicting with plan_node_id we can
@@ -402,8 +408,8 @@ static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 							GlobalVisState *vistest,
 							LVPagePruneState *pageprunestate,
 							LVPageVisMapState *pagevmstate);
-static void lazy_vacuum(LVRelState *vacrel);
-static void lazy_vacuum_all_indexes(LVRelState *vacrel);
+static void lazy_vacuum(LVRelState *vacrel, bool onecall);
+static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
 													IndexBulkDeleteResult *istat,
 													double reltuples,
@@ -752,6 +758,31 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 							 (long long) VacuumPageHit,
 							 (long long) VacuumPageMiss,
 							 (long long) VacuumPageDirty);
+			if (vacrel->rel_pages > 0)
+			{
+				if (vacrel->do_index_vacuuming)
+				{
+					if (vacrel->num_index_scans == 0)
+						appendStringInfo(&buf, _("index scan not needed:"));
+					else
+						appendStringInfo(&buf, _("index scan needed:"));
+					msgfmt = _(" %u pages from table (%.2f%% of total) had %lld dead item identifiers removed\n");
+				}
+				else
+				{
+					Assert(vacrel->nindexes > 0);
+
+					if (vacrel->do_index_cleanup)
+						appendStringInfo(&buf, _("index scan bypassed:"));
+					else
+						appendStringInfo(&buf, _("index scan bypassed due to emergency:"));
+					msgfmt = _(" %u pages from table (%.2f%% of total) have %lld dead item identifiers\n");
+				}
+				appendStringInfo(&buf, msgfmt,
+								 vacrel->lpdead_item_pages,
+								 100.0 * vacrel->lpdead_item_pages / vacrel->rel_pages,
+								 (long long) vacrel->lpdead_items);
+			}
 			for (int i = 0; i < vacrel->nindexes; i++)
 			{
 				IndexBulkDeleteResult *istat = vacrel->indstats[i];
@@ -842,7 +873,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				next_fsm_block_to_vacuum;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		skipping_blocks;
+	bool		skipping_blocks,
+				have_vacuumed_indexes = false;
 	StringInfoData buf;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
@@ -1114,8 +1146,16 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				vmbuffer = InvalidBuffer;
 			}
 
+			/*
+			 * Definitely won't be skipping index vacuuming due to finding
+			 * very few dead items during this VACUUM operation -- that's only
+			 * something that lazy_vacuum() is willing to do when it is only
+			 * called once during the entire VACUUM operation.
+			 */
+			have_vacuumed_indexes = true;
+
 			/* Remove the collected garbage tuples from table and indexes */
-			lazy_vacuum(vacrel);
+			lazy_vacuum(vacrel, false);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
@@ -1268,7 +1308,15 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		if (vacrel->nindexes > 0 && pageprunestate.has_lpdead_items &&
 			vacrel->do_index_vacuuming)
 		{
-			/* Wait until lazy_vacuum_heap_rel() to save free space */
+			/*
+			 * Wait until lazy_vacuum_heap_rel() to save free space.
+			 *
+			 * Note: It's not in fact 100% certain that we really will call
+			 * lazy_vacuum_heap_rel() -- lazy_vacuum() might opt to skip index
+			 * vacuuming (and so must skip heap vacuuming).  This is deemed
+			 * okay because it only happens in emergencies, or when there is
+			 * very little free space anyway.
+			 */
 		}
 		else
 		{
@@ -1370,7 +1418,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	/* If any tuples need to be deleted, perform final vacuum cycle */
 	Assert(vacrel->nindexes > 0 || dead_tuples->num_tuples == 0);
 	if (dead_tuples->num_tuples > 0)
-		lazy_vacuum(vacrel);
+		lazy_vacuum(vacrel, !have_vacuumed_indexes);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
@@ -1398,6 +1446,16 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	 * If table has no indexes and at least one heap pages was vacuumed, make
 	 * log report that lazy_vacuum_heap_rel would've made had there been
 	 * indexes (having indexes implies using the two pass strategy).
+	 *
+	 * We deliberately don't do this in the case where there are indexes but
+	 * index vacuuming was bypassed.  We make a similar report at the point
+	 * that index vacuuming is bypassed, but that's actually quite different
+	 * in one important sense: it shows information about work we _haven't_
+	 * done.
+	 *
+	 * log_autovacuum output does things differently; it consistently presents
+	 * information about LP_DEAD items for the VACUUM as a whole.  We always
+	 * report on each round of index and heap vacuuming separately, though.
 	 */
 	Assert(vacrel->nindexes == 0 || vacuumed_pages == 0);
 	if (vacuumed_pages > 0)
@@ -2072,10 +2130,15 @@ retry:
 
 /*
  * Remove the collected garbage tuples from the table and its indexes.
+ *
+ * We may be able to skip index vacuuming (we may even be required to do so by
+ * reloption)
  */
 static void
-lazy_vacuum(LVRelState *vacrel)
+lazy_vacuum(LVRelState *vacrel, bool onecall)
 {
+	bool		applyskipoptimization;
+
 	/* Should not end up here with no indexes */
 	Assert(vacrel->nindexes > 0);
 	Assert(!IsParallelWorker());
@@ -2087,11 +2150,137 @@ lazy_vacuum(LVRelState *vacrel)
 		return;
 	}
 
-	/* Okay, we're going to do index vacuuming */
-	lazy_vacuum_all_indexes(vacrel);
+	/*
+	 * Consider applying the optimization where we skip index vacuuming to
+	 * save work in indexes that is likely to have little upside.  This is
+	 * expected to help in the extreme (though still common) case where
+	 * autovacuum generally only triggers VACUUMs against the table because of
+	 * the need to freeze tuples and/or the need to set visibility map bits.
+	 * The overall effect is that cases where the table is slightly less than
+	 * 100% append-only (where there are some dead tuples, but very few) tend
+	 * to behave almost as if they really were 100% append-only.
+	 *
+	 * Our approach is to skip index vacuuming when there are very few heap
+	 * pages with dead items.  Even then, it must be the first and last call
+	 * here for the VACUUM (we never apply the optimization when we're low on
+	 * space for TIDs).  This threshold allows us to not give too much weight
+	 * to items that are concentrated in relatively few heap pages.  These are
+	 * usually due to correlated non-HOT UPDATEs.
+	 *
+	 * It's important that we avoid putting off a VACUUM that eventually
+	 * dirties index pages more often than would happen if we didn't skip.
+	 * It's also important to avoid allowing relatively many heap pages that
+	 * can never have their visibility map bit set to stay that way
+	 * indefinitely.
+	 *
+	 * In general the criteria that we apply here must not create distinct new
+	 * problems for the logic that schedules autovacuum workers.  For example,
+	 * we cannot allow autovacuum_vacuum_insert_scale_factor-driven autovacuum
+	 * workers to do little or no useful work due to misapplication of this
+	 * optimization.  While the optimization is expressly designed to avoid
+	 * work that has non-zero value to the system, the value of that work
+	 * should be close to zero.  There should be a natural asymmetry between
+	 * the costs and the benefits of skipping.
+	 */
+	applyskipoptimization = false;
+	if (onecall && vacrel->rel_pages > 0)
+	{
+		BlockNumber threshold;
 
-	/* Remove tuples from heap */
-	lazy_vacuum_heap_rel(vacrel);
+		Assert(vacrel->num_index_scans == 0);
+		Assert(vacrel->do_index_vacuuming);
+		Assert(vacrel->do_index_cleanup);
+
+		threshold = (double) vacrel->rel_pages * SKIP_VACUUM_PAGES_RATIO;
+
+		applyskipoptimization = (vacrel->lpdead_item_pages < threshold);
+	}
+
+	if (applyskipoptimization)
+	{
+		/*
+		 * Skip index vacuuming, but don't skip index cleanup.
+		 *
+		 * It wouldn't make sense to not do cleanup just because this
+		 * optimization was applied.  (As a general rule, the case where there
+		 * are _almost_ zero dead items when vacuuming a large table should
+		 * not behave very differently from the case where there are precisely
+		 * zero dead items.)
+		 */
+		vacrel->do_index_vacuuming = false;
+		ereport(elevel,
+				(errmsg("\"%s\": index scan bypassed: %u pages from table (%.2f%% of total) have %lld dead item identifiers",
+						vacrel->relname, vacrel->lpdead_item_pages,
+						100.0 * vacrel->lpdead_item_pages / vacrel->rel_pages,
+						(long long) vacrel->lpdead_items)));
+	}
+	else if (lazy_vacuum_all_indexes(vacrel))
+	{
+		/*
+		 * We successfully completed a round of index vacuuming.  Do related
+		 * heap vacuuming now.
+		 *
+		 * There will be no calls to vacuum_xid_limit_emergency() to check for
+		 * issues with the age of the table's relfrozenxid unless and until
+		 * there is another call here -- heap vacuuming doesn't do that. This
+		 * should be okay, because the cost of a round of heap vacuuming is
+		 * much more linear.  Also, it has costs that are unaffected by the
+		 * number of indexes total.
+		 */
+		lazy_vacuum_heap_rel(vacrel);
+	}
+	else
+	{
+		/*
+		 * Emergency case:  We attempted index vacuuming, but didn't finish
+		 * a round of index vacuuming (or at least not one that reliably
+		 * deleted tuples from all of the table's indexes).  This happens
+		 * when the table's relfrozenxid is too far in the past.
+		 *
+		 * From this point on the VACUUM operation will do no further index
+		 * vacuuming or heap vacuuming.  It will do any remaining pruning that
+		 * is required, plus other heap-related and relation-level maintenance
+		 * tasks.  But that's it.  We also disable a cost delay when a delay
+		 * is in effect.
+		 *
+		 * Note that we deliberately don't vary our behavior based on factors
+		 * like whether or not the ongoing VACUUM is aggressive.  If it's not
+		 * aggressive we probably won't be able to advance relfrozenxid during
+		 * this VACUUM.  If we can't, then an anti-wraparound VACUUM should
+		 * take place immediately after we finish up.  We should be able to
+		 * skip all index vacuuming for the later anti-wraparound VACUUM.
+		 *
+		 * This is very much like the "INDEX_CLEANUP = off" case, except we
+		 * determine that index vacuuming will be skipped dynamically. Another
+		 * difference is that we don't warn the user in the INDEX_CLEANUP off
+		 * case, and we don't presume to stop applying a cost delay.
+		 */
+		Assert(vacrel->do_index_vacuuming);
+		Assert(vacrel->do_index_cleanup);
+
+		vacrel->do_index_vacuuming = false;
+		vacrel->do_index_cleanup = false;
+		ereport(WARNING,
+				(errmsg("abandoned index vacuuming of table \"%s.%s.%s\" as a fail safe after %d index scans",
+						get_database_name(MyDatabaseId),
+						vacrel->relname,
+						vacrel->relname,
+						vacrel->num_index_scans),
+				 errdetail("table's relfrozenxid or relminmxid is too far in the past"),
+				 errhint("Consider increasing configuration parameter \"maintenance_work_mem\" or \"autovacuum_work_mem\".\n"
+						 "You might also need to consider other ways for VACUUM to keep up with the allocation of transaction IDs.")));
+
+		/* Stop applying cost limits from this point on */
+		VacuumCostActive = false;
+		VacuumCostBalance = 0;
+	}
+
+	/*
+	 * TODO:
+	 *
+	 * Call lazy_space_free() and arrange to stop even recording TIDs (i.e.
+	 * make lazy_record_dead_item() into a no-op)
+	 */
 
 	/*
 	 * Forget the now-vacuumed tuples -- just press on
@@ -2101,16 +2290,30 @@ lazy_vacuum(LVRelState *vacrel)
 
 /*
  *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
+ *
+ * Returns true in the common case when all indexes were successfully
+ * vacuumed.  Returns false in rare cases where we determined that the ongoing
+ * VACUUM operation is at risk of taking too long to finish, leading to
+ * wraparound failure.
  */
-static void
+static bool
 lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
+	bool		allindexes = true;
+
 	Assert(vacrel->nindexes > 0);
 	Assert(vacrel->do_index_vacuuming);
 	Assert(vacrel->do_index_cleanup);
 	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
 	Assert(MultiXactIdIsValid(vacrel->relminmxid));
 
+	/* Precheck for XID wraparound emergencies */
+	if (vacuum_xid_limit_emergency(vacrel->relfrozenxid, vacrel->relminmxid))
+	{
+		/* Wraparound emergency -- don't even start an index scan */
+		return false;
+	}
+
 	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
@@ -2125,25 +2328,43 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 			vacrel->indstats[idx] =
 				lazy_vacuum_one_index(indrel, istat, vacrel->old_live_tuples,
 									  vacrel);
+
+			if (vacuum_xid_limit_emergency(vacrel->relfrozenxid,
+										   vacrel->relminmxid))
+			{
+				/* Wraparound emergency -- end current index scan */
+				allindexes = false;
+				break;
+			}
 		}
 	}
 	else
 	{
+		/* Note: parallel VACUUM only gets the precheck */
+		allindexes = true;
+
 		/* Outsource everything to parallel variant */
 		do_parallel_lazy_vacuum_all_indexes(vacrel);
 	}
 
 	/*
 	 * We delete all LP_DEAD items from the first heap pass in all indexes on
-	 * each call here.  This makes call to lazy_vacuum_heap_rel() safe.
+	 * each call here (except calls where we don't finish all indexes).  This
+	 * makes call to lazy_vacuum_heap_rel() safe.
 	 */
 	Assert(vacrel->num_index_scans > 1 ||
 		   vacrel->dead_tuples->num_tuples == vacrel->lpdead_items);
 
-	/* Increase and report the number of index scans */
+	/*
+	 * Increase and report the number of index scans.  Note that we include
+	 * the case where we started a round of index scanning that we weren't able
+	 * to finish.
+	 */
 	vacrel->num_index_scans++;
 	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
 								 vacrel->num_index_scans);
+
+	return allindexes;
 }
 
 /*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c064352e23..063113cd38 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -62,6 +62,8 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_skip_index_age;
+int			vacuum_multixact_skip_index_age;
 
 
 /* A few variables that don't seem worth passing around as parameters */
@@ -1134,6 +1136,65 @@ vacuum_set_xid_limits(Relation rel,
 	}
 }
 
+/*
+ * vacuum_xid_limit_emergency() -- Handle wraparound emergencies
+ *
+ * Input parameters are the target relation's relfrozenxid and relminmxid.
+ */
+bool
+vacuum_xid_limit_emergency(TransactionId relfrozenxid, MultiXactId relminmxid)
+{
+	TransactionId xid_skip_limit;
+	MultiXactId	  multi_skip_limit;
+	int			  skip_index_vacuum;
+
+	Assert(TransactionIdIsNormal(relfrozenxid));
+	Assert(MultiXactIdIsValid(relminmxid));
+
+	/*
+	 * Determine the index skipping age to use. In any case not less than
+	 * autovacuum_freeze_max_age * 1.05, so that VACUUM always does an
+	 * aggressive scan.
+	 */
+	skip_index_vacuum = Max(vacuum_skip_index_age, autovacuum_freeze_max_age * 1.05);
+
+	xid_skip_limit = ReadNextTransactionId() - skip_index_vacuum;
+	if (!TransactionIdIsNormal(xid_skip_limit))
+		xid_skip_limit = FirstNormalTransactionId;
+
+	if (TransactionIdIsNormal(relfrozenxid) &&
+		TransactionIdPrecedes(relfrozenxid, xid_skip_limit))
+	{
+		/* The table's relfrozenxid is too old */
+		return true;
+	}
+
+	/*
+	 * Similar to above, determine the index skipping age to use for multixact.
+	 * In any case not less than autovacuum_multixact_freeze_max_age * 1.05.
+	 */
+	skip_index_vacuum = Max(vacuum_multixact_skip_index_age,
+							autovacuum_multixact_freeze_max_age * 1.05);
+
+	/*
+	 * Compute the multixact cutoff: if the table's relminmxid precedes it,
+	 * the table is considered dangerously old and further index vacuuming
+	 * should be skipped.
+	 */
+	multi_skip_limit = ReadNextMultiXactId() - skip_index_vacuum;
+	if (multi_skip_limit < FirstMultiXactId)
+		multi_skip_limit = FirstMultiXactId;
+
+	if (MultiXactIdIsValid(relminmxid) &&
+		MultiXactIdPrecedes(relminmxid, multi_skip_limit))
+	{
+		/* The table's relminmxid is too old */
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * vac_estimate_reltuples() -- estimate the new value for pg_class.reltuples
  *
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0c5dc4d3e8..24fb736a72 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2622,6 +2622,26 @@ static struct config_int ConfigureNamesInt[] =
 		0, 0, 1000000,		/* see ComputeXidHorizons */
 		NULL, NULL, NULL
 	},
+	{
+		{"vacuum_skip_index_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Age at which VACUUM should skip index vacuuming."),
+			NULL
+		},
+		&vacuum_skip_index_age,
+		/* The upper limit is 1.05 * autovacuum_freeze_max_age's upper limit */
+		1800000000, 0, 2100000000,
+		NULL, NULL, NULL
+	},
+	{
+		{"vacuum_multixact_skip_index_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Multixact age at which VACUUM should skip index vacuuming."),
+			NULL
+		},
+		&vacuum_multixact_skip_index_age,
+		/* The upper limit is 1.05 * autovacuum_multixact_freeze_max_age's upper limit */
+		1800000000, 0, 2100000000,
+		NULL, NULL, NULL
+	},
 
 	/*
 	 * See also CheckRequiredParameterValues() if this parameter changes
@@ -3222,7 +3242,10 @@ static struct config_int ConfigureNamesInt[] =
 			NULL
 		},
 		&autovacuum_freeze_max_age,
-		/* see pg_resetwal if you change the upper-limit value */
+		/*
+		 * see pg_resetwal and vacuum_skip_index_age if you change the
+		 * upper-limit value.
+		 */
 		200000000, 100000, 2000000000,
 		NULL, NULL, NULL
 	},
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b234a6bfe6..7d6564e17f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -673,6 +673,8 @@
 #vacuum_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_freeze_table_age = 150000000
+#vacuum_skip_index_age = 1800000000
+#vacuum_multixact_skip_index_age = 1800000000
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ddc6d789d8..9a21e4a402 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8528,6 +8528,31 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-skip-index-age" xreflabel="vacuum_skip_index_age">
+      <term><varname>vacuum_skip_index_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_skip_index_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        <command>VACUUM</command> skips index cleanup if the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
+        the age specified by this setting.  Skipping index cleanup lets
+        <command>VACUUM</command> finish, and thus advance
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>,
+        as quickly as possible.  The behavior is equivalent to setting the
+        <literal>INDEX_CLEANUP</literal> option to <literal>OFF</literal>, except that
+        this parameter can cause index cleanup to be skipped even in the middle of
+        a vacuum operation.  The default is 1.8 billion transactions.  Although
+        users can set this value anywhere from zero to 2.1 billion,
+        <command>VACUUM</command> will silently adjust the effective value to no
+        less than 105% of <xref linkend="guc-autovacuum-freeze-max-age"/>, so that
+        only anti-wraparound autovacuums and aggressive scans can skip index cleanup.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-multixact-freeze-table-age" xreflabel="vacuum_multixact_freeze_table_age">
       <term><varname>vacuum_multixact_freeze_table_age</varname> (<type>integer</type>)
       <indexterm>
@@ -8574,6 +8599,32 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-multixact-vacuum-skip-index-age" xreflabel="vacuum_multixact_skip_index_age">
+      <term><varname>vacuum_multixact_skip_index_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_multixact_skip_index_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        <command>VACUUM</command> skips index cleanup if the table's
+        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
+        the age specified by this setting.  Skipping index cleanup lets
+        <command>VACUUM</command> finish, and thus advance
+        <structname>pg_class</structname>.<structfield>relminmxid</structfield>,
+        as quickly as possible.  The behavior is equivalent to setting the
+        <literal>INDEX_CLEANUP</literal> option to <literal>OFF</literal>, except that
+        this parameter can cause index cleanup to be skipped even in the middle of
+        a vacuum operation.  The default is 1.8 billion multixacts.  Although
+        users can set this value anywhere from zero to 2.1 billion,
+        <command>VACUUM</command> will silently adjust the effective value to no
+        less than 105% of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
+        so that only anti-wraparound autovacuums and aggressive scans can skip
+        index cleanup.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 4d8ad754f8..4d3674c1b4 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -607,8 +607,14 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
 
    <para>
     If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
+    system will begin to skip index cleanup so that vacuum operations can
+    finish sooner. <xref linkend="guc-vacuum-skip-index-age"/> controls when
+    <command>VACUUM</command> and autovacuum do that.
+   </para>
+
+    <para>
+     The system emits warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
 
 <programlisting>
 WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-- 
2.27.0

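To make the user-visible knobs in 0001 concrete, here is a minimal sketch of
how they might be exercised (the table name and the lowered settings are
illustrative test values only, not something the patches require):

-- Lower the skip ages well below the 1.8 billion defaults so the behavior
-- is reachable in a test setup (illustrative values only).
SET vacuum_skip_index_age = 500000000;
SET vacuum_multixact_skip_index_age = 500000000;

-- A plain VACUUM should now end the current round of index vacuuming once
-- the table's relfrozenxid/relminmxid age crosses the effective limit.
VACUUM VERBOSE pgbench_accounts;

-- Requesting it explicitly disables index vacuuming for the whole run.
VACUUM (VERBOSE, INDEX_CLEANUP OFF) pgbench_accounts;

Note that with default settings these GUCs only matter once a table is
already far behind on freezing, so on a fresh cluster the SET statements
above should change nothing observable.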
Attachment: v8-0002-Break-lazy_scan_heap-up-into-functions.patch (application/octet-stream)
From 702de7c3cf081e860923644c6871b18741792aaa Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 28 Mar 2021 20:55:55 -0700
Subject: [PATCH v8 2/4] Break lazy_scan_heap() up into functions.

Aside from being useful cleanup work in its own right, this is also
preparation for an upcoming patch that removes the "tupgone" special
case from vacuumlazy.c.

The INDEX_CLEANUP=off case no longer uses the one-pass code path used
when vacuuming a table with no indexes.  It doesn't make sense to think
of the two cases as equivalent because only the no-indexes case can do
heap vacuuming.  The INDEX_CLEANUP=off case is now structured as a
two-pass VACUUM that opts to not do index vacuuming (and so naturally
cannot safely perform heap vacuuming).
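
As a concrete way to exercise the case described above (the table name is
illustrative only):

-- With this patch applied, INDEX_CLEANUP=off runs as a two-pass VACUUM
-- that opts out of index vacuuming, and therefore also skips the second
-- heap pass rather than taking the no-indexes one-pass path.
VACUUM (VERBOSE, INDEX_CLEANUP OFF) pgbench_accounts;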
---
 src/backend/access/heap/vacuumlazy.c  | 1403 +++++++++++++++----------
 contrib/pg_visibility/pg_visibility.c |    8 +-
 contrib/pgstattuple/pgstatapprox.c    |    9 +-
 3 files changed, 835 insertions(+), 585 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9c1cfe42e1..72cb066e0a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -291,8 +291,9 @@ typedef struct LVRelState
 	Relation	onerel;
 	Relation   *indrels;
 	int			nindexes;
-	/* useindex = true means two-pass strategy; false means one-pass */
-	bool		useindex;
+	/* Do index and/or heap vacuuming (don't skip them)? */
+	bool		do_index_vacuuming;
+	bool		do_index_cleanup;
 
 	/* Buffer access strategy and parallel state */
 	BufferAccessStrategy bstrategy;
@@ -351,6 +352,29 @@ typedef struct LVRelState
 	int64		nunused;		/* # existing unused line pointers */
 } LVRelState;
 
+/*
+ * State set up and maintained in lazy_scan_heap() (also maintained in
+ * lazy_scan_prune()) that represents VM bit status.
+ *
+ * Used by lazy_scan_setvmbit() when we're done pruning.
+ */
+typedef struct LVPageVisMapState
+{
+	bool		all_visible_according_to_vm;
+	TransactionId visibility_cutoff_xid;
+} LVPageVisMapState;
+
+/*
+ * State output by lazy_scan_prune()
+ */
+typedef struct LVPagePruneState
+{
+	bool		hastup;			/* Page prevents rel truncation? */
+	bool		has_lpdead_items;	/* includes existing LP_DEAD items */
+	bool		all_visible;	/* Every item visible to all? */
+	bool		all_frozen;		/* provided all_visible is also true */
+} LVPagePruneState;
+
 /* Struct for saving and restoring vacuum error information. */
 typedef struct LVSavedErrInfo
 {
@@ -366,8 +390,21 @@ static int	elevel = -1;
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
 						   bool aggressive);
-static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
-									LVRelState *vacrel);
+static bool lazy_scan_needs_freeze(Buffer buf, bool *hastup,
+								   LVRelState *vacrel);
+static void lazy_scan_new_page(LVRelState *vacrel, Buffer buf);
+static void lazy_scan_empty_page(LVRelState *vacrel, Buffer buf,
+								 Buffer vmbuffer);
+static void lazy_scan_setvmbit(LVRelState *vacrel, Buffer buf,
+							   Buffer vmbuffer,
+							   LVPagePruneState *pageprunestate,
+							   LVPageVisMapState *pagevmstate);
+static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
+							GlobalVisState *vistest,
+							LVPagePruneState *pageprunestate,
+							LVPageVisMapState *pagevmstate,
+							VacOptTernaryValue index_cleanup);
+static void lazy_vacuum(LVRelState *vacrel);
 static void lazy_vacuum_all_indexes(LVRelState *vacrel);
 static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
 													IndexBulkDeleteResult *istat,
@@ -386,13 +423,11 @@ static void update_index_statistics(LVRelState *vacrel);
 static bool should_attempt_truncation(LVRelState *vacrel,
 									  VacuumParams *params);
 static void lazy_truncate_heap(LVRelState *vacrel);
-static void lazy_record_dead_tuple(LVDeadTuples *dead_tuples,
-								   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
 static int	vac_cmp_itemptr(const void *left, const void *right);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
-static BlockNumber count_nondeletable_pages(LVRelState *vacrel);
+static BlockNumber lazy_truncate_count_nondeletable(LVRelState *vacrel);
 static long compute_max_dead_tuples(BlockNumber relblocks, bool hasindex);
 static void lazy_space_alloc(LVRelState *vacrel, int nworkers,
 							 BlockNumber relblocks);
@@ -517,8 +552,13 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	vacrel->onerel = onerel;
 	vac_open_indexes(vacrel->onerel, RowExclusiveLock, &vacrel->nindexes,
 					 &vacrel->indrels);
-	vacrel->useindex = (vacrel->nindexes > 0 &&
-						params->index_cleanup == VACOPT_TERNARY_ENABLED);
+	vacrel->do_index_vacuuming = true;
+	vacrel->do_index_cleanup = true;
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+	{
+		vacrel->do_index_vacuuming = false;
+		vacrel->do_index_cleanup = false;
+	}
 	vacrel->bstrategy = bstrategy;
 	vacrel->lps = NULL;			/* for now */
 	vacrel->old_rel_pages = onerel->rd_rel->relpages;
@@ -810,8 +850,8 @@ vacuum_log_cleanup_info(LVRelState *vacrel)
  *		lists of dead tuples and pages with free space, calculates statistics
  *		on the number of live tuples in the heap, and marks pages as
  *		all-visible if appropriate.  When done, or when we run low on space
- *		for dead-tuple TIDs, invoke vacuuming of indexes and reclaim dead line
- *		pointers.
+ *		for dead-tuple TIDs, invoke lazy_vacuum to vacuum indexes and vacuum
+ *		heap relation during its own second pass over the heap.
  *
  *		If the table has at least two indexes, we execute both index vacuum
  *		and index cleanup with parallel workers unless parallel vacuum is
@@ -834,22 +874,12 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 {
 	LVDeadTuples *dead_tuples;
 	BlockNumber nblocks,
-				blkno;
-	HeapTupleData tuple;
-	BlockNumber empty_pages,
-				vacuumed_pages,
+				blkno,
+				next_unskippable_block,
 				next_fsm_block_to_vacuum;
-	double		num_tuples,		/* total number of nonremovable tuples */
-				live_tuples,	/* live tuples (reltuples estimate) */
-				tups_vacuumed,	/* tuples cleaned up by current vacuum */
-				nkeep,			/* dead-but-not-removable tuples */
-				nunused;		/* # existing unused line pointers */
-	int			i;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
-	BlockNumber next_unskippable_block;
 	bool		skipping_blocks;
-	xl_heap_freeze_tuple *frozen;
 	StringInfoData buf;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
@@ -859,6 +889,10 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	int64		initprog_val[3];
 	GlobalVisState *vistest;
 
+	/* Counters of # blocks in onerel: */
+	BlockNumber empty_pages,
+				vacuumed_pages;
+
 	pg_rusage_init(&ru0);
 
 	if (aggressive)
@@ -873,8 +907,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 						vacrel->relname)));
 
 	empty_pages = vacuumed_pages = 0;
-	next_fsm_block_to_vacuum = (BlockNumber) 0;
-	num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
 
 	nblocks = RelationGetNumberOfBlocks(vacrel->onerel);
 	next_unskippable_block = 0;
@@ -909,7 +941,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	 */
 	lazy_space_alloc(vacrel, params->nworkers, nblocks);
 	dead_tuples = vacrel->dead_tuples;
-	frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
 
 	/* Report that we're scanning the heap, advertising total # of blocks */
 	initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
@@ -994,20 +1025,25 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	{
 		Buffer		buf;
 		Page		page;
-		OffsetNumber offnum,
-					maxoff;
-		bool		tupgone,
-					hastup;
-		int			prev_dead_count;
-		int			nfrozen;
+		LVPageVisMapState pagevmstate;
+		LVPagePruneState pageprunestate;
+		bool		savefreespace;
 		Size		freespace;
-		bool		all_visible_according_to_vm = false;
-		bool		all_visible;
-		bool		all_frozen = true;	/* provided all_visible is also true */
-		bool		has_dead_items;		/* includes existing LP_DEAD items */
-		TransactionId visibility_cutoff_xid = InvalidTransactionId;
 
-		/* see note above about forcing scanning of last page */
+		/*
+		 * Initialize vm state for page
+		 *
+		 * Can't touch pageprunestate for page until we reach
+		 * lazy_scan_prune(), though -- that's output state only
+		 */
+		pagevmstate.all_visible_according_to_vm = false;
+		pagevmstate.visibility_cutoff_xid = InvalidTransactionId;
+
+		/*
+		 * Step 1 for block: Consider need to skip blocks.
+		 *
+		 * See note above about forcing scanning of last page.
+		 */
 #define FORCE_CHECK_PAGE() \
 		(blkno == nblocks - 1 && should_attempt_truncation(vacrel, params))
 
@@ -1060,7 +1096,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (aggressive && VM_ALL_VISIBLE(vacrel->onerel, blkno,
 											 &vmbuffer))
-				all_visible_according_to_vm = true;
+				pagevmstate.all_visible_according_to_vm = true;
 		}
 		else
 		{
@@ -1088,12 +1124,15 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 					vacrel->frozenskipped_pages++;
 				continue;
 			}
-			all_visible_according_to_vm = true;
+			pagevmstate.all_visible_according_to_vm = true;
 		}
 
 		vacuum_delay_point();
 
 		/*
+		 * Step 2 for block: Consider if we definitely have enough space to
+		 * process TIDs on page already.
+		 *
 		 * If we are close to overrunning the available space for dead-tuple
 		 * TIDs, pause and do a cycle of vacuuming before we tackle this page.
 		 */
@@ -1112,24 +1151,18 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				vmbuffer = InvalidBuffer;
 			}
 
-			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(vacrel);
-
-			/* Remove tuples from heap */
-			lazy_vacuum_heap_rel(vacrel);
-
-			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
-			 */
-			dead_tuples->num_tuples = 0;
+			/* Remove the collected garbage tuples from table and indexes */
+			lazy_vacuum(vacrel);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
 			 * upper-level FSM pages.  Note we have not yet processed blkno.
+			 * Even if we skipped heap vacuum, FSM vacuuming could be
+			 * worthwhile since we could have updated the freespace of empty
+			 * pages.
 			 */
-			FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum, blkno);
+			FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum,
+									blkno);
 			next_fsm_block_to_vacuum = blkno;
 
 			/* Report that we are once again scanning the heap */
@@ -1138,6 +1171,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		}
 
 		/*
+		 * Step 3 for block: Set up visibility map page as needed.
+		 *
 		 * Pin the visibility map page in case we need to mark the page
 		 * all-visible.  In most cases this will be very cheap, because we'll
 		 * already have the correct page pinned anyway.  However, it's
@@ -1150,9 +1185,15 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		buf = ReadBufferExtended(vacrel->onerel, MAIN_FORKNUM, blkno,
 								 RBM_NORMAL, vacrel->bstrategy);
 
-		/* We need buffer cleanup lock so that we can prune HOT chains. */
+		/*
+		 * Step 4 for block: Acquire super-exclusive lock for pruning.
+		 *
+		 * We need buffer cleanup lock so that we can prune HOT chains.
+		 */
 		if (!ConditionalLockBufferForCleanup(buf))
 		{
+			bool		hastup;
+
 			/*
 			 * If we're not performing an aggressive scan to guard against XID
 			 * wraparound, and we don't want to forcibly check the page, then
@@ -1183,7 +1224,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 * to use lazy_check_needs_freeze() for both situations, though.
 			 */
 			LockBuffer(buf, BUFFER_LOCK_SHARE);
-			if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
+			if (!lazy_scan_needs_freeze(buf, &hastup, vacrel))
 			{
 				UnlockReleaseBuffer(buf);
 				vacrel->scanned_pages++;
@@ -1209,6 +1250,12 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			/* drop through to normal processing */
 		}
 
+		/*
+		 * Step 5 for block: Handle empty/new pages.
+		 *
+		 * By here we have a super-exclusive lock, and it's clear that this
+		 * page is one that we consider scanned
+		 */
 		vacrel->scanned_pages++;
 		vacrel->tupcount_pages++;
 
@@ -1216,396 +1263,81 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 
 		if (PageIsNew(page))
 		{
-			/*
-			 * All-zeroes pages can be left over if either a backend extends
-			 * the relation by a single page, but crashes before the newly
-			 * initialized page has been written out, or when bulk-extending
-			 * the relation (which creates a number of empty pages at the tail
-			 * end of the relation, but enters them into the FSM).
-			 *
-			 * Note we do not enter the page into the visibilitymap. That has
-			 * the downside that we repeatedly visit this page in subsequent
-			 * vacuums, but otherwise we'll never not discover the space on a
-			 * promoted standby. The harm of repeated checking ought to
-			 * normally not be too bad - the space usually should be used at
-			 * some point, otherwise there wouldn't be any regular vacuums.
-			 *
-			 * Make sure these pages are in the FSM, to ensure they can be
-			 * reused. Do that by testing if there's any space recorded for
-			 * the page. If not, enter it. We do so after releasing the lock
-			 * on the heap page, the FSM is approximate, after all.
-			 */
-			UnlockReleaseBuffer(buf);
-
 			empty_pages++;
-
-			if (GetRecordedFreeSpace(vacrel->onerel, blkno) == 0)
-			{
-				Size		freespace;
-
-				freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
-				RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
-			}
+			/* Releases lock on buf for us: */
+			lazy_scan_new_page(vacrel, buf);
 			continue;
 		}
-
-		if (PageIsEmpty(page))
+		else if (PageIsEmpty(page))
 		{
 			empty_pages++;
-			freespace = PageGetHeapFreeSpace(page);
-
-			/*
-			 * Empty pages are always all-visible and all-frozen (note that
-			 * the same is currently not true for new pages, see above).
-			 */
-			if (!PageIsAllVisible(page))
-			{
-				START_CRIT_SECTION();
-
-				/* mark buffer dirty before writing a WAL record */
-				MarkBufferDirty(buf);
-
-				/*
-				 * It's possible that another backend has extended the heap,
-				 * initialized the page, and then failed to WAL-log the page
-				 * due to an ERROR.  Since heap extension is not WAL-logged,
-				 * recovery might try to replay our record setting the page
-				 * all-visible and find that the page isn't initialized, which
-				 * will cause a PANIC.  To prevent that, check whether the
-				 * page has been previously WAL-logged, and if not, do that
-				 * now.
-				 */
-				if (RelationNeedsWAL(vacrel->onerel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
-					log_newpage_buffer(buf, true);
-
-				PageSetAllVisible(page);
-				visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
-								  vmbuffer, InvalidTransactionId,
-								  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
-				END_CRIT_SECTION();
-			}
-
-			UnlockReleaseBuffer(buf);
-			RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
+			/* Releases lock on buf for us (though keeps vmbuffer pin): */
+			lazy_scan_empty_page(vacrel, buf, vmbuffer);
 			continue;
 		}
 
 		/*
-		 * Prune all HOT-update chains in this page.
+		 * Step 6 for block: Do pruning.
 		 *
-		 * We count tuples removed by the pruning step as removed by VACUUM
-		 * (existing LP_DEAD line pointers don't count).
+		 * Also accumulates details of remaining LP_DEAD line pointers on page
+		 * in dead tuple list.  This includes LP_DEAD line pointers that we
+		 * ourselves just pruned, as well as existing LP_DEAD line pointers
+		 * pruned earlier.
+		 *
+		 * Also handles tuple freezing -- considers freezing XIDs from all
+		 * tuple headers left behind following pruning.
 		 */
-		tups_vacuumed += heap_page_prune(vacrel->onerel, buf, vistest,
-										 InvalidTransactionId, 0, false,
-										 &vacrel->latestRemovedXid,
-										 &vacrel->offnum);
+		lazy_scan_prune(vacrel, buf, vistest, &pageprunestate, &pagevmstate,
+						params->index_cleanup);
 
 		/*
-		 * Now scan the page to collect vacuumable items and check for tuples
-		 * requiring freezing.
+		 * Step 7 for block: Set up details for saving free space in FSM at
+		 * end of loop.  (Also performs extra single pass strategy steps in
+		 * "nindexes == 0" case.)
+		 *
+		 * If we have any LP_DEAD items on this page (i.e. any new dead_tuples
+		 * entries compared to just before lazy_scan_prune()) then the page
+		 * will be visited again by lazy_vacuum_heap_rel(), which will compute
+		 * and record its post-compaction free space.  If not, then we're done
+		 * with this page, so remember its free space as-is.
 		 */
-		all_visible = true;
-		has_dead_items = false;
-		nfrozen = 0;
-		hastup = false;
-		prev_dead_count = dead_tuples->num_tuples;
-		maxoff = PageGetMaxOffsetNumber(page);
-
-		/*
-		 * Note: If you change anything in the loop below, also look at
-		 * heap_page_is_all_visible to see if that needs to be changed.
-		 */
-		for (offnum = FirstOffsetNumber;
-			 offnum <= maxoff;
-			 offnum = OffsetNumberNext(offnum))
+		savefreespace = false;
+		freespace = 0;
+		if (vacrel->nindexes > 0 && pageprunestate.has_lpdead_items &&
+			vacrel->do_index_vacuuming)
 		{
-			ItemId		itemid;
-
-			/*
-			 * Set the offset number so that we can display it along with any
-			 * error that occurred while processing this tuple.
-			 */
-			vacrel->offnum = offnum;
-			itemid = PageGetItemId(page, offnum);
-
-			/* Unused items require no processing, but we count 'em */
-			if (!ItemIdIsUsed(itemid))
-			{
-				nunused += 1;
-				continue;
-			}
-
-			/* Redirect items mustn't be touched */
-			if (ItemIdIsRedirected(itemid))
-			{
-				hastup = true;	/* this page won't be truncatable */
-				continue;
-			}
-
-			ItemPointerSet(&(tuple.t_self), blkno, offnum);
-
-			/*
-			 * LP_DEAD line pointers are to be vacuumed normally; but we don't
-			 * count them in tups_vacuumed, else we'd be double-counting (at
-			 * least in the common case where heap_page_prune() just freed up
-			 * a non-HOT tuple).  Note also that the final tups_vacuumed value
-			 * might be very low for tables where opportunistic page pruning
-			 * happens to occur very frequently (via heap_page_prune_opt()
-			 * calls that free up non-HOT tuples).
-			 */
-			if (ItemIdIsDead(itemid))
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				all_visible = false;
-				has_dead_items = true;
-				continue;
-			}
-
-			Assert(ItemIdIsNormal(itemid));
-
-			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
-			tuple.t_len = ItemIdGetLength(itemid);
-			tuple.t_tableOid = RelationGetRelid(vacrel->onerel);
-
-			tupgone = false;
-
-			/*
-			 * The criteria for counting a tuple as live in this block need to
-			 * match what analyze.c's acquire_sample_rows() does, otherwise
-			 * VACUUM and ANALYZE may produce wildly different reltuples
-			 * values, e.g. when there are many recently-dead tuples.
-			 *
-			 * The logic here is a bit simpler than acquire_sample_rows(), as
-			 * VACUUM can't run inside a transaction block, which makes some
-			 * cases impossible (e.g. in-progress insert from the same
-			 * transaction).
-			 */
-			switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
-			{
-				case HEAPTUPLE_DEAD:
-
-					/*
-					 * Ordinarily, DEAD tuples would have been removed by
-					 * heap_page_prune(), but it's possible that the tuple
-					 * state changed since heap_page_prune() looked.  In
-					 * particular an INSERT_IN_PROGRESS tuple could have
-					 * changed to DEAD if the inserter aborted.  So this
-					 * cannot be considered an error condition.
-					 *
-					 * If the tuple is HOT-updated then it must only be
-					 * removed by a prune operation; so we keep it just as if
-					 * it were RECENTLY_DEAD.  Also, if it's a heap-only
-					 * tuple, we choose to keep it, because it'll be a lot
-					 * cheaper to get rid of it in the next pruning pass than
-					 * to treat it like an indexed tuple. Finally, if index
-					 * cleanup is disabled, the second heap pass will not
-					 * execute, and the tuple will not get removed, so we must
-					 * treat it like any other dead tuple that we choose to
-					 * keep.
-					 *
-					 * If this were to happen for a tuple that actually needed
-					 * to be deleted, we'd be in trouble, because it'd
-					 * possibly leave a tuple below the relation's xmin
-					 * horizon alive.  heap_prepare_freeze_tuple() is prepared
-					 * to detect that case and abort the transaction,
-					 * preventing corruption.
-					 */
-					if (HeapTupleIsHotUpdated(&tuple) ||
-						HeapTupleIsHeapOnly(&tuple) ||
-						params->index_cleanup == VACOPT_TERNARY_DISABLED)
-						nkeep += 1;
-					else
-						tupgone = true; /* we can delete the tuple */
-					all_visible = false;
-					break;
-				case HEAPTUPLE_LIVE:
-
-					/*
-					 * Count it as live.  Not only is this natural, but it's
-					 * also what acquire_sample_rows() does.
-					 */
-					live_tuples += 1;
-
-					/*
-					 * Is the tuple definitely visible to all transactions?
-					 *
-					 * NB: Like with per-tuple hint bits, we can't set the
-					 * PD_ALL_VISIBLE flag if the inserter committed
-					 * asynchronously. See SetHintBits for more info. Check
-					 * that the tuple is hinted xmin-committed because of
-					 * that.
-					 */
-					if (all_visible)
-					{
-						TransactionId xmin;
-
-						if (!HeapTupleHeaderXminCommitted(tuple.t_data))
-						{
-							all_visible = false;
-							break;
-						}
-
-						/*
-						 * The inserter definitely committed. But is it old
-						 * enough that everyone sees it as committed?
-						 */
-						xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-						if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
-						{
-							all_visible = false;
-							break;
-						}
-
-						/* Track newest xmin on page. */
-						if (TransactionIdFollows(xmin, visibility_cutoff_xid))
-							visibility_cutoff_xid = xmin;
-					}
-					break;
-				case HEAPTUPLE_RECENTLY_DEAD:
-
-					/*
-					 * If tuple is recently deleted then we must not remove it
-					 * from relation.
-					 */
-					nkeep += 1;
-					all_visible = false;
-					break;
-				case HEAPTUPLE_INSERT_IN_PROGRESS:
-
-					/*
-					 * This is an expected case during concurrent vacuum.
-					 *
-					 * We do not count these rows as live, because we expect
-					 * the inserting transaction to update the counters at
-					 * commit, and we assume that will happen only after we
-					 * report our results.  This assumption is a bit shaky,
-					 * but it is what acquire_sample_rows() does, so be
-					 * consistent.
-					 */
-					all_visible = false;
-					break;
-				case HEAPTUPLE_DELETE_IN_PROGRESS:
-					/* This is an expected case during concurrent vacuum */
-					all_visible = false;
-
-					/*
-					 * Count such rows as live.  As above, we assume the
-					 * deleting transaction will commit and update the
-					 * counters after we report.
-					 */
-					live_tuples += 1;
-					break;
-				default:
-					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
-					break;
-			}
-
-			if (tupgone)
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-													   &vacrel->latestRemovedXid);
-				tups_vacuumed += 1;
-				has_dead_items = true;
-			}
-			else
-			{
-				bool		tuple_totally_frozen;
-
-				num_tuples += 1;
-				hastup = true;
-
-				/*
-				 * Each non-removable tuple must be checked to see if it needs
-				 * freezing.  Note we already have exclusive buffer lock.
-				 */
-				if (heap_prepare_freeze_tuple(tuple.t_data,
-											  vacrel->relfrozenxid,
-											  vacrel->relminmxid,
-											  vacrel->FreezeLimit,
-											  vacrel->MultiXactCutoff,
-											  &frozen[nfrozen],
-											  &tuple_totally_frozen))
-					frozen[nfrozen++].offset = offnum;
-
-				if (!tuple_totally_frozen)
-					all_frozen = false;
-			}
-		}						/* scan along page */
-
-		/*
-		 * Clear the offset information once we have processed all the tuples
-		 * on the page.
-		 */
-		vacrel->offnum = InvalidOffsetNumber;
-
-		/*
-		 * If we froze any tuples, mark the buffer dirty, and write a WAL
-		 * record recording the changes.  We must log the changes to be
-		 * crash-safe against future truncation of CLOG.
-		 */
-		if (nfrozen > 0)
+			/* Wait until lazy_vacuum_heap_rel() to save free space */
+		}
+		else
 		{
-			START_CRIT_SECTION();
-
-			MarkBufferDirty(buf);
-
-			/* execute collected freezes */
-			for (i = 0; i < nfrozen; i++)
-			{
-				ItemId		itemid;
-				HeapTupleHeader htup;
-
-				itemid = PageGetItemId(page, frozen[i].offset);
-				htup = (HeapTupleHeader) PageGetItem(page, itemid);
-
-				heap_execute_freeze_tuple(htup, &frozen[i]);
-			}
-
-			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(vacrel->onerel))
-			{
-				XLogRecPtr	recptr;
-
-				recptr = log_heap_freeze(vacrel->onerel, buf,
-										 vacrel->FreezeLimit, frozen, nfrozen);
-				PageSetLSN(page, recptr);
-			}
-
-			END_CRIT_SECTION();
+			/* Save space right away */
+			savefreespace = true;
+			freespace = PageGetHeapFreeSpace(page);
 		}
 
-		/*
-		 * If there are no indexes we can vacuum the page right now instead of
-		 * doing a second scan. Also we don't do that but forget dead tuples
-		 * when index cleanup is disabled.
-		 */
-		if (!vacrel->useindex && dead_tuples->num_tuples > 0)
+		if (vacrel->nindexes == 0 && pageprunestate.has_lpdead_items)
 		{
-			if (vacrel->nindexes == 0)
-			{
-				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
-				vacuumed_pages++;
-				has_dead_items = false;
-			}
-			else
-			{
-				/*
-				 * Here, we have indexes but index cleanup is disabled.
-				 * Instead of vacuuming the dead tuples on the heap, we just
-				 * forget them.
-				 */
-				Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
-			}
+			Assert(dead_tuples->num_tuples > 0);
 
 			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
+			 * One pass strategy (no indexes) case.
+			 *
+			 * Mark LP_DEAD item pointers as LP_UNUSED now, since there won't
+			 * be a second pass in lazy_vacuum_heap_rel().
 			 */
+			lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+			vacuumed_pages++;
+
+			/* This won't have changed: */
+			Assert(savefreespace && freespace == PageGetHeapFreeSpace(page));
+
+			/*
+			 * Make sure the now-vacuumed LP_DEAD items don't prevent
+			 * lazy_scan_setvmbit() from setting the VM bit:
+			 */
+			pageprunestate.has_lpdead_items = false;
+
+			/* Forget the now-vacuumed tuples */
 			dead_tuples->num_tuples = 0;
 
 			/*
@@ -1616,115 +1348,34 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
 			{
-				FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum,
-										blkno);
+				FreeSpaceMapVacuumRange(vacrel->onerel,
+										next_fsm_block_to_vacuum, blkno);
 				next_fsm_block_to_vacuum = blkno;
 			}
 		}
 
-		freespace = PageGetHeapFreeSpace(page);
-
-		/* mark page all-visible, if appropriate */
-		if (all_visible && !all_visible_according_to_vm)
-		{
-			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
-
-			if (all_frozen)
-				flags |= VISIBILITYMAP_ALL_FROZEN;
-
-			/*
-			 * It should never be the case that the visibility map page is set
-			 * while the page-level bit is clear, but the reverse is allowed
-			 * (if checksums are not enabled).  Regardless, set both bits so
-			 * that we get back in sync.
-			 *
-			 * NB: If the heap page is all-visible but the VM bit is not set,
-			 * we don't need to dirty the heap page.  However, if checksums
-			 * are enabled, we do need to make sure that the heap page is
-			 * dirtied before passing it to visibilitymap_set(), because it
-			 * may be logged.  Given that this situation should only happen in
-			 * rare cases after a crash, it is not worth optimizing.
-			 */
-			PageSetAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, visibility_cutoff_xid, flags);
-		}
+		/* One pass strategy had better have no dead tuples by now: */
+		Assert(vacrel->nindexes > 0 || dead_tuples->num_tuples == 0);
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after we checked it and before we took the buffer
-		 * content lock, so we must recheck before jumping to the conclusion
-		 * that something bad has happened.
+		 * Step 8 for block: Handle setting visibility map bit as appropriate
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(vacrel->onerel, blkno, &vmbuffer))
-		{
-			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
-				 vacrel->relname, blkno);
-			visibilitymap_clear(vacrel->onerel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
+		lazy_scan_setvmbit(vacrel, buf, vmbuffer, &pageprunestate,
+						   &pagevmstate);
 
 		/*
-		 * It's possible for the value returned by
-		 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
-		 * wrong for us to see tuples that appear to not be visible to
-		 * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
-		 * xmin value never moves backwards, but
-		 * GetOldestNonRemovableTransactionId() is conservative and sometimes
-		 * returns a value that's unnecessarily small, so if we see that
-		 * contradiction it just means that the tuples that we think are not
-		 * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
-		 * is correct.
-		 *
-		 * There should never be dead tuples on a page with PD_ALL_VISIBLE
-		 * set, however.
+		 * Step 9 for block: drop super-exclusive lock, finalize page by
+		 * recording its free space in the FSM as appropriate
 		 */
-		else if (PageIsAllVisible(page) && has_dead_items)
-		{
-			elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
-				 vacrel->relname, blkno);
-			PageClearAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_clear(vacrel->onerel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
-
-		/*
-		 * If the all-visible page is all-frozen but not marked as such yet,
-		 * mark it as all-frozen.  Note that all_frozen is only valid if
-		 * all_visible is true, so we must check both.
-		 */
-		else if (all_visible_according_to_vm && all_visible && all_frozen &&
-				 !VM_ALL_FROZEN(vacrel->onerel, blkno, &vmbuffer))
-		{
-			/*
-			 * We can pass InvalidTransactionId as the cutoff XID here,
-			 * because setting the all-frozen bit doesn't cause recovery
-			 * conflicts.
-			 */
-			visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, InvalidTransactionId,
-							  VISIBILITYMAP_ALL_FROZEN);
-		}
 
 		UnlockReleaseBuffer(buf);
-
 		/* Remember the location of the last page with nonremovable tuples */
-		if (hastup)
+		if (pageprunestate.hastup)
 			vacrel->nonempty_pages = blkno + 1;
-
-		/*
-		 * If we remembered any tuples for deletion, then the page will be
-		 * visited again by lazy_vacuum_heap_rel, which will compute and record
-		 * its post-compaction free space.  If not, then we're done with this
-		 * page, so remember its free space as-is.  (This path will always be
-		 * taken if there are no indexes.)
-		 */
-		if (dead_tuples->num_tuples == prev_dead_count)
+		if (savefreespace)
 			RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
+
+		/* Finished all steps for block by here (at the latest) */
 	}
 
 	/* report that everything is scanned and vacuumed */
@@ -1733,16 +1384,10 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	/* Clear the block number information */
 	vacrel->blkno = InvalidBlockNumber;
 
-	pfree(frozen);
-
-	/* save stats for use later */
-	vacrel->tuples_deleted = tups_vacuumed;
-	vacrel->new_dead_tuples = nkeep;
-
 	/* now we can compute the new value for pg_class.reltuples */
 	vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->onerel, nblocks,
 													 vacrel->tupcount_pages,
-													 live_tuples);
+													 vacrel->live_tuples);
 
 	/*
 	 * Also compute the total number of surviving heap entries.  In the
@@ -1761,19 +1406,13 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	}
 
 	/* If any tuples need to be deleted, perform final vacuum cycle */
-	/* XXX put a threshold on min number of tuples here? */
+	Assert(vacrel->nindexes > 0 || dead_tuples->num_tuples == 0);
 	if (dead_tuples->num_tuples > 0)
-	{
-		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(vacrel);
-
-		/* Remove tuples from heap */
-		lazy_vacuum_heap_rel(vacrel);
-	}
+		lazy_vacuum(vacrel);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
-	 * not there were indexes.
+	 * not there were indexes, and whether or not we skipped index vacuuming.
 	 */
 	if (blkno > next_fsm_block_to_vacuum)
 		FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum,
@@ -1783,29 +1422,34 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Do post-vacuum cleanup */
-	if (vacrel->useindex)
+	if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
 		lazy_cleanup_all_indexes(vacrel);
 
 	/* Free resources managed by lazy_space_alloc() */
 	lazy_space_free(vacrel);
 
 	/* Update index statistics */
-	if (vacrel->useindex)
+	if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
 		update_index_statistics(vacrel);
 
-	/* If no indexes, make log report that lazy_vacuum_heap_rel would've made */
-	if (vacuumed_pages)
+	/*
+	 * If the table has no indexes and at least one heap page was vacuumed,
+	 * make the log report that lazy_vacuum_heap_rel would've made had there been
+	 * indexes (having indexes implies using the two pass strategy).
+	 */
+	Assert(vacrel->nindexes == 0 || vacuumed_pages == 0);
+	if (vacuumed_pages > 0)
 		ereport(elevel,
-				(errmsg("\"%s\": removed %.0f row versions in %u pages",
-						vacrel->relname,
-						tups_vacuumed, vacuumed_pages)));
+				(errmsg("\"%s\": removed %lld dead item identifiers in %u pages",
+						vacrel->relname, (long long) vacrel->lpdead_items,
+						vacuumed_pages)));
 
 	initStringInfo(&buf);
 	appendStringInfo(&buf,
-					 _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"),
-					 nkeep, vacrel->OldestXmin);
-	appendStringInfo(&buf, _("There were %.0f unused item identifiers.\n"),
-					 nunused);
+					 _("%lld dead row versions cannot be removed yet, oldest xmin: %u\n"),
+					 (long long) vacrel->new_dead_tuples, vacrel->OldestXmin);
+	appendStringInfo(&buf, _("There were %lld unused item identifiers.\n"),
+					 (long long) vacrel->nunused);
 	appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
 									"Skipped %u pages due to buffer pins, ",
 									vacrel->pinskipped_pages),
@@ -1821,23 +1465,24 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
 
 	ereport(elevel,
-			(errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
+			(errmsg("\"%s\": found %lld removable, %lld nonremovable row versions in %u out of %u pages",
 					vacrel->relname,
-					tups_vacuumed, num_tuples,
-					vacrel->scanned_pages, nblocks),
+					(long long) vacrel->tuples_deleted,
+					(long long) vacrel->num_tuples, vacrel->scanned_pages,
+					nblocks),
 			 errdetail_internal("%s", buf.data)));
 	pfree(buf.data);
 }
 
 /*
- *	lazy_check_needs_freeze() -- scan page to see if any tuples
- *					 need to be cleaned to avoid wraparound
+ *	lazy_scan_needs_freeze() -- see if any tuples need to be cleaned to avoid
+ *	wraparound
  *
  * Returns true if the page needs to be vacuumed using cleanup lock.
  * Also returns a flag indicating whether page contains any tuples at all.
  */
 static bool
-lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
+lazy_scan_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
 {
 	Page		page = BufferGetPage(buf);
 	OffsetNumber offnum,
@@ -1869,7 +1514,9 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
 		vacrel->offnum = offnum;
 		itemid = PageGetItemId(page, offnum);
 
-		/* this should match hastup test in count_nondeletable_pages() */
+		/*
+		 * This should match hastup test in lazy_truncate_count_nondeletable()
+		 */
 		if (ItemIdIsUsed(itemid))
 			*hastup = true;
 
@@ -1890,6 +1537,619 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
 	return (offnum <= maxoff);
 }
 
+/*
+ * Handle new page during lazy_scan_heap().
+ *
+ * Caller must hold pin and buffer cleanup lock on buf.
+ *
+ * All-zeroes pages can be left over if either a backend extends the relation
+ * by a single page, but crashes before the newly initialized page has been
+ * written out, or when bulk-extending the relation (which creates a number of
+ * empty pages at the tail end of the relation, but enters them into the FSM).
+ *
+ * Note we do not enter the page into the visibilitymap. That has the downside
+ * that we repeatedly visit this page in subsequent vacuums, but otherwise
+ * we'll never not discover the space on a promoted standby. The harm of
+ * repeated checking ought to normally not be too bad - the space usually
+ * should be used at some point, otherwise there wouldn't be any regular
+ * vacuums.
+ *
+ * Make sure these pages are in the FSM, to ensure they can be reused. Do that
+ * by testing if there's any space recorded for the page. If not, enter it. We
+ * do so after releasing the lock on the heap page, the FSM is approximate,
+ * after all.
+ */
+static void
+lazy_scan_new_page(LVRelState *vacrel, Buffer buf)
+{
+	Relation	onerel = vacrel->onerel;
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+
+	if (GetRecordedFreeSpace(onerel, blkno) == 0)
+	{
+		Size		freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
+
+		UnlockReleaseBuffer(buf);
+		RecordPageWithFreeSpace(onerel, blkno, freespace);
+		return;
+	}
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Handle empty page during lazy_scan_heap().
+ *
+ * Caller must hold pin and buffer cleanup lock on buf, as well as a pin (but
+ * not a lock) on vmbuffer.
+ */
+static void
+lazy_scan_empty_page(LVRelState *vacrel, Buffer buf, Buffer vmbuffer)
+{
+	Relation	onerel = vacrel->onerel;
+	Page		page = BufferGetPage(buf);
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+	Size		freespace = PageGetHeapFreeSpace(page);
+
+	/*
+	 * Empty pages are always all-visible and all-frozen (note that the same
+	 * is currently not true for new pages, see lazy_scan_new_page()).
+	 */
+	if (!PageIsAllVisible(page))
+	{
+		START_CRIT_SECTION();
+
+		/* mark buffer dirty before writing a WAL record */
+		MarkBufferDirty(buf);
+
+		/*
+		 * It's possible that another backend has extended the heap,
+		 * initialized the page, and then failed to WAL-log the page due to an
+		 * ERROR.  Since heap extension is not WAL-logged, recovery might try
+		 * to replay our record setting the page all-visible and find that the
+		 * page isn't initialized, which will cause a PANIC.  To prevent that,
+		 * check whether the page has been previously WAL-logged, and if not,
+		 * do that now.
+		 */
+		if (RelationNeedsWAL(onerel) &&
+			PageGetLSN(page) == InvalidXLogRecPtr)
+			log_newpage_buffer(buf, true);
+
+		PageSetAllVisible(page);
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, InvalidTransactionId,
+						  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
+		END_CRIT_SECTION();
+	}
+
+	UnlockReleaseBuffer(buf);
+	RecordPageWithFreeSpace(onerel, blkno, freespace);
+}
+
+/*
+ * Handle setting VM bit inside lazy_scan_heap(), after pruning and freezing.
+ */
+static void
+lazy_scan_setvmbit(LVRelState *vacrel, Buffer buf, Buffer vmbuffer,
+				   LVPagePruneState *pageprunestate,
+				   LVPageVisMapState *pagevmstate)
+{
+	Relation	onerel = vacrel->onerel;
+	Page		page = BufferGetPage(buf);
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+
+	/* mark page all-visible, if appropriate */
+	if (pageprunestate->all_visible &&
+		!pagevmstate->all_visible_according_to_vm)
+	{
+		uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
+
+		if (pageprunestate->all_frozen)
+			flags |= VISIBILITYMAP_ALL_FROZEN;
+
+		/*
+		 * It should never be the case that the visibility map page is set
+		 * while the page-level bit is clear, but the reverse is allowed (if
+		 * checksums are not enabled).  Regardless, set both bits so that we
+		 * get back in sync.
+		 *
+		 * NB: If the heap page is all-visible but the VM bit is not set, we
+		 * don't need to dirty the heap page.  However, if checksums are
+		 * enabled, we do need to make sure that the heap page is dirtied
+		 * before passing it to visibilitymap_set(), because it may be logged.
+		 * Given that this situation should only happen in rare cases after a
+		 * crash, it is not worth optimizing.
+		 */
+		PageSetAllVisible(page);
+		MarkBufferDirty(buf);
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+						  pagevmstate->visibility_cutoff_xid, flags);
+	}
+
+	/*
+	 * The visibility map bit should never be set if the page-level bit is
+	 * clear.  However, it's possible that the bit got cleared after we
+	 * checked it and before we took the buffer content lock, so we must
+	 * recheck before jumping to the conclusion that something bad has
+	 * happened.
+	 */
+	else if (pagevmstate->all_visible_according_to_vm &&
+			 !PageIsAllVisible(page) && VM_ALL_VISIBLE(onerel, blkno,
+													   &vmbuffer))
+	{
+		elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+			 RelationGetRelationName(onerel), blkno);
+		visibilitymap_clear(onerel, blkno, vmbuffer,
+							VISIBILITYMAP_VALID_BITS);
+	}
+
+	/*
+	 * It's possible for the value returned by
+	 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
+	 * wrong for us to see tuples that appear to not be visible to everyone
+	 * yet, while PD_ALL_VISIBLE is already set. The real safe xmin value
+	 * never moves backwards, but GetOldestNonRemovableTransactionId() is
+	 * conservative and sometimes returns a value that's unnecessarily small,
+	 * so if we see that contradiction it just means that the tuples that we
+	 * think are not visible to everyone yet actually are, and the
+	 * PD_ALL_VISIBLE flag is correct.
+	 *
+	 * There should never be dead tuples on a page with PD_ALL_VISIBLE set,
+	 * however.
+	 */
+	else if (PageIsAllVisible(page) && pageprunestate->has_lpdead_items)
+	{
+		elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+			 RelationGetRelationName(onerel), blkno);
+		PageClearAllVisible(page);
+		MarkBufferDirty(buf);
+		visibilitymap_clear(onerel, blkno, vmbuffer,
+							VISIBILITYMAP_VALID_BITS);
+	}
+
+	/*
+	 * If the all-visible page is all-frozen but not marked as such yet, mark
+	 * it as all-frozen.  Note that all_frozen is only valid if all_visible is
+	 * true, so we must check both.
+	 */
+	else if (pagevmstate->all_visible_according_to_vm &&
+			 pageprunestate->all_visible && pageprunestate->all_frozen &&
+			 !VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+	{
+		/*
+		 * We can pass InvalidTransactionId as the cutoff XID here, because
+		 * setting the all-frozen bit doesn't cause recovery conflicts.
+		 */
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, InvalidTransactionId,
+						  VISIBILITYMAP_ALL_FROZEN);
+	}
+}
+
+/*
+ *	lazy_scan_prune() -- lazy_scan_heap() pruning and freezing.
+ *
+ * Caller must hold pin and buffer cleanup lock on the buffer.
+ */
+static void
+lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
+				LVPagePruneState *pageprunestate,
+				LVPageVisMapState *pagevmstate,
+				VacOptTernaryValue index_cleanup)
+{
+	Relation	onerel = vacrel->onerel;
+	bool		tupgone;
+	BlockNumber blkno;
+	Page		page;
+	OffsetNumber offnum,
+				maxoff;
+	ItemId		itemid;
+	HeapTupleData tuple;
+	int			tuples_deleted,
+				lpdead_items,
+				new_dead_tuples,
+				num_tuples,
+				live_tuples,
+				nunused;
+	int			nredirect PG_USED_FOR_ASSERTS_ONLY;
+	int			ntupoffsets;
+	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+	OffsetNumber tupoffsets[MaxHeapTuplesPerPage];
+
+	blkno = BufferGetBlockNumber(buf);
+	page = BufferGetPage(buf);
+
+	/* Initialize (or reset) page-level counters */
+	tuples_deleted = 0;
+	lpdead_items = 0;
+	new_dead_tuples = 0;
+	num_tuples = 0;
+	live_tuples = 0;
+	nunused = 0;
+	nredirect = 0;
+
+	/*
+	 * Prune all HOT-update chains in this page.
+	 *
+	 * We count tuples removed by the pruning step as removed by VACUUM
+	 * (existing LP_DEAD line pointers don't count).
+	 */
+	tuples_deleted = heap_page_prune(onerel, buf, vistest,
+									 InvalidTransactionId, 0, false,
+									 &vacrel->latestRemovedXid,
+									 &vacrel->offnum);
+
+	/*
+	 * Now scan the page to collect vacuumable items and check for tuples
+	 * requiring freezing.
+	 */
+	pageprunestate->hastup = false;
+	pageprunestate->has_lpdead_items = false;
+	pageprunestate->all_visible = true;
+	pageprunestate->all_frozen = true;
+	ntupoffsets = 0;
+	tupgone = false;
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Note: If you change anything in the loop below, also look at
+	 * heap_page_is_all_visible to see if that needs to be changed.
+	 */
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		/*
+		 * Set the offset number so that we can display it along with any
+		 * error that occurred while processing this tuple.
+		 */
+		vacrel->offnum = offnum;
+		itemid = PageGetItemId(page, offnum);
+
+		/* Unused items require no processing, but we count 'em */
+		if (!ItemIdIsUsed(itemid))
+		{
+			nunused++;
+			continue;
+		}
+
+		/* Redirect items mustn't be touched */
+		if (ItemIdIsRedirected(itemid))
+		{
+			pageprunestate->hastup = true;	/* page won't be truncatable */
+			nredirect++;
+			continue;
+		}
+
+		/*
+		 * LP_DEAD line pointers are to be vacuumed normally; but we don't
+		 * count them in tuples_deleted, else we'd be double-counting (at
+		 * least in the common case where heap_page_prune() just freed up a
+		 * non-HOT tuple).
+		 *
+		 * We are usually able to log lpdead_items separately, though, which
+		 * shows a count of precisely these dead items -- items that we'll
+		 * delete from indexes.  It's treated as index-related
+		 * instrumentation.
+		 */
+		if (ItemIdIsDead(itemid))
+		{
+			deadoffsets[lpdead_items++] = offnum;
+			continue;
+		}
+
+		Assert(ItemIdIsNormal(itemid));
+
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationGetRelid(onerel);
+
+		/*
+		 * The criteria for counting a tuple as live in this block need to
+		 * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
+		 * and ANALYZE may produce wildly different reltuples values, e.g.
+		 * when there are many recently-dead tuples.
+		 *
+		 * The logic here is a bit simpler than acquire_sample_rows(), as
+		 * VACUUM can't run inside a transaction block, which makes some cases
+		 * impossible (e.g. in-progress insert from the same transaction).
+		 */
+		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+		{
+			case HEAPTUPLE_DEAD:
+
+				/*
+				 * Ordinarily, DEAD tuples would have been removed by
+				 * heap_page_prune(), but it's possible that the tuple state
+				 * changed since heap_page_prune() looked.  In particular an
+				 * INSERT_IN_PROGRESS tuple could have changed to DEAD if the
+				 * inserter aborted.  So this cannot be considered an error
+				 * condition.
+				 *
+				 * If the tuple is HOT-updated then it must only be removed by
+				 * a prune operation; so we keep it just as if it were
+				 * RECENTLY_DEAD.  Also, if it's a heap-only tuple, we choose
+				 * to keep it, because it'll be a lot cheaper to get rid of it
+				 * in the next pruning pass than to treat it like an indexed
+				 * tuple. Finally, if index cleanup is disabled, the second
+				 * heap pass will not execute, and the tuple will not get
+				 * removed, so we must treat it like any other dead tuple that
+				 * we choose to keep.
+				 *
+				 * If this were to happen for a tuple that actually needed to
+				 * be deleted, we'd be in trouble, because it'd possibly leave
+				 * a tuple below the relation's xmin horizon alive.
+				 * heap_prepare_freeze_tuple() is prepared to detect that case
+				 * and abort the transaction, preventing corruption.
+				 */
+				if (HeapTupleIsHotUpdated(&tuple) ||
+					HeapTupleIsHeapOnly(&tuple) ||
+					index_cleanup == VACOPT_TERNARY_DISABLED)
+					new_dead_tuples++;
+				else
+					tupgone = true; /* we can delete the tuple */
+				pageprunestate->all_visible = false;
+				break;
+			case HEAPTUPLE_LIVE:
+
+				/*
+				 * Count it as live.  Not only is this natural, but it's also
+				 * what acquire_sample_rows() does.
+				 */
+				live_tuples++;
+
+				/*
+				 * Is the tuple definitely visible to all transactions?
+				 *
+				 * NB: Like with per-tuple hint bits, we can't set the
+				 * PD_ALL_VISIBLE flag if the inserter committed
+				 * asynchronously. See SetHintBits for more info. Check that
+				 * the tuple is hinted xmin-committed because of that.
+				 */
+				if (pageprunestate->all_visible)
+				{
+					TransactionId xmin;
+
+					if (!HeapTupleHeaderXminCommitted(tuple.t_data))
+					{
+						pageprunestate->all_visible = false;
+						break;
+					}
+
+					/*
+					 * The inserter definitely committed. But is it old enough
+					 * that everyone sees it as committed?
+					 */
+					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
+					if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
+					{
+						pageprunestate->all_visible = false;
+						break;
+					}
+
+					/* Track newest xmin on page. */
+					if (TransactionIdFollows(xmin,
+											 pagevmstate->visibility_cutoff_xid))
+						pagevmstate->visibility_cutoff_xid = xmin;
+				}
+				break;
+			case HEAPTUPLE_RECENTLY_DEAD:
+
+				/*
+				 * If tuple is recently deleted then we must not remove it
+				 * from relation.
+				 */
+				new_dead_tuples++;
+				pageprunestate->all_visible = false;
+				break;
+			case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+				/*
+				 * This is an expected case during concurrent vacuum.
+				 *
+				 * We do not count these rows as live, because we expect the
+				 * inserting transaction to update the counters at commit, and
+				 * we assume that will happen only after we report our
+				 * results.  This assumption is a bit shaky, but it is what
+				 * acquire_sample_rows() does, so be consistent.
+				 */
+				pageprunestate->all_visible = false;
+				break;
+			case HEAPTUPLE_DELETE_IN_PROGRESS:
+				/* This is an expected case during concurrent vacuum */
+				pageprunestate->all_visible = false;
+
+				/*
+				 * Count such rows as live.  As above, we assume the deleting
+				 * transaction will commit and update the counters after we
+				 * report.
+				 */
+				live_tuples++;
+				break;
+			default:
+				elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+				break;
+		}
+
+		if (tupgone)
+		{
+			/* Pretend that this is an LP_DEAD item  */
+			deadoffsets[lpdead_items++] = offnum;
+			/* But remember it for XLOG_HEAP2_CLEANUP_INFO record */
+			HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
+												   &vacrel->latestRemovedXid);
+		}
+		else
+		{
+			/*
+			 * Each non-removable tuple must be checked to see if it needs
+			 * freezing
+			 */
+			tupoffsets[ntupoffsets++] = offnum;
+			num_tuples++;
+			pageprunestate->hastup = true;
+		}
+	}
+
+	/*
+	 * We have now divided every item on the page into either an LP_DEAD item
+	 * that will need to be vacuumed in indexes later, or an LP_NORMAL tuple
+	 * that remains and needs to be considered for freezing now (LP_UNUSED and
+	 * LP_REDIRECT items also remain, but are of no further interest to us).
+	 *
+	 * Add page level counters to caller's counts, and then actually process
+	 * LP_DEAD and LP_NORMAL items.
+	 *
+	 * TODO: Remove tupgone logic entirely in next commit -- we shouldn't have
+	 * to pretend that DEAD items are LP_DEAD items.
+	 */
+	Assert(lpdead_items + ntupoffsets + nunused + nredirect == maxoff);
+	vacrel->offnum = InvalidOffsetNumber;
+
+	vacrel->tuples_deleted += tuples_deleted;
+	vacrel->lpdead_items += lpdead_items;
+	vacrel->new_dead_tuples += new_dead_tuples;
+	vacrel->num_tuples += num_tuples;
+	vacrel->live_tuples += live_tuples;
+	vacrel->nunused += nunused;
+
+	/*
+	 * Consider the need to freeze any items with tuple storage from the page
+	 * first (the ordering relative to the LP_DEAD item processing below is
+	 * arbitrary)
+	 */
+	if (ntupoffsets > 0)
+	{
+		xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+		int					 nfrozen = 0;
+
+		Assert(pageprunestate->hastup);
+
+		for (int i = 0; i < ntupoffsets; i++)
+		{
+			OffsetNumber item = tupoffsets[i];
+			bool		tuple_totally_frozen;
+
+			ItemPointerSet(&(tuple.t_self), blkno, item);
+			itemid = PageGetItemId(page, item);
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			Assert(ItemIdIsNormal(itemid) && ItemIdHasStorage(itemid));
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationGetRelid(vacrel->onerel);
+			if (heap_prepare_freeze_tuple(tuple.t_data,
+										  vacrel->relfrozenxid,
+										  vacrel->relminmxid,
+										  vacrel->FreezeLimit,
+										  vacrel->MultiXactCutoff,
+										  &frozen[nfrozen],
+										  &tuple_totally_frozen))
+				frozen[nfrozen++].offset = item;
+			if (!tuple_totally_frozen)
+				pageprunestate->all_frozen = false;
+		}
+
+		if (nfrozen > 0)
+		{
+			/*
+			 * At least one tuple with storage needs to be frozen -- execute
+			 * that now.
+			 *
+			 * We'll mark the buffer dirty, and write a WAL record recording
+			 * the changes.  We must log the changes to be crash-safe against
+			 * future truncation of CLOG.
+			 */
+			START_CRIT_SECTION();
+
+			MarkBufferDirty(buf);
+
+			/* execute collected freezes */
+			for (int i = 0; i < nfrozen; i++)
+			{
+				HeapTupleHeader htup;
+
+				itemid = PageGetItemId(page, frozen[i].offset);
+				htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+				heap_execute_freeze_tuple(htup, &frozen[i]);
+			}
+
+			/* Now WAL-log freezing if necessary */
+			if (RelationNeedsWAL(vacrel->onerel))
+			{
+				XLogRecPtr	recptr;
+
+				recptr = log_heap_freeze(vacrel->onerel, buf, vacrel->FreezeLimit,
+										 frozen, nfrozen);
+				PageSetLSN(page, recptr);
+			}
+
+			END_CRIT_SECTION();
+		}
+	}
+
+	/*
+	 * Now save details of the LP_DEAD items from the page in the dead_tuples
+	 * array.  Also record that the page has dead items in the per-page
+	 * prunestate.
+	 */
+	if (lpdead_items > 0)
+	{
+		LVDeadTuples *dead_tuples = vacrel->dead_tuples;
+		ItemPointerData tmp;
+
+		pageprunestate->all_visible = false;
+		pageprunestate->has_lpdead_items = true;
+		vacrel->lpdead_item_pages++;
+
+		/*
+		 * Don't actually save the items when it is known for sure that
+		 * neither index vacuuming nor heap vacuuming can go ahead during
+		 * the ongoing VACUUM
+		 */
+		if (!vacrel->do_index_vacuuming && vacrel->nindexes > 0)
+			return;
+
+		ItemPointerSetBlockNumber(&tmp, blkno);
+
+		for (int i = 0; i < lpdead_items; i++)
+		{
+			ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
+			dead_tuples->itemptrs[dead_tuples->num_tuples++] = tmp;
+		}
+
+		Assert(dead_tuples->num_tuples <= dead_tuples->max_tuples);
+		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+									 dead_tuples->num_tuples);
+	}
+}
+
+/*
+ * Remove the collected garbage tuples from the table and its indexes.
+ */
+static void
+lazy_vacuum(LVRelState *vacrel)
+{
+	/* Should not end up here with no indexes */
+	Assert(vacrel->nindexes > 0);
+	Assert(!IsParallelWorker());
+
+	if (!vacrel->do_index_vacuuming)
+	{
+		Assert(!vacrel->do_index_cleanup);
+		vacrel->dead_tuples->num_tuples = 0;
+		return;
+	}
+
+	/* Okay, we're going to do index vacuuming */
+	lazy_vacuum_all_indexes(vacrel);
+
+	/* Remove tuples from heap */
+	lazy_vacuum_heap_rel(vacrel);
+
+	/*
+	 * Forget the now-vacuumed tuples -- just press on
+	 */
+	vacrel->dead_tuples->num_tuples = 0;
+}
+
 /*
  *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
  */
@@ -1897,6 +2157,8 @@ static void
 lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
 	Assert(vacrel->nindexes > 0);
+	Assert(vacrel->do_index_vacuuming);
+	Assert(vacrel->do_index_cleanup);
 	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
 	Assert(MultiXactIdIsValid(vacrel->relminmxid));
 
@@ -2107,6 +2369,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 	Buffer		vmbuffer = InvalidBuffer;
 	LVSavedErrInfo saved_err_info;
 
+	Assert(vacrel->do_index_vacuuming);
+	Assert(vacrel->do_index_cleanup);
+	Assert(vacrel->num_index_scans > 0);
+
 	/* Report that we are now vacuuming the heap */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
@@ -2186,6 +2452,8 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
 
+	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
+
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Update error traceback information */
@@ -2429,7 +2697,7 @@ lazy_truncate_heap(LVRelState *vacrel)
 		 * other backends could have added tuples to these pages whilst we
 		 * were vacuuming.
 		 */
-		new_rel_pages = count_nondeletable_pages(vacrel);
+		new_rel_pages = lazy_truncate_count_nondeletable(vacrel);
 		vacrel->blkno = new_rel_pages;
 
 		if (new_rel_pages >= old_rel_pages)
@@ -2478,7 +2746,7 @@ lazy_truncate_heap(LVRelState *vacrel)
  * Returns number of nondeletable pages (last nonempty page + 1).
  */
 static BlockNumber
-count_nondeletable_pages(LVRelState *vacrel)
+lazy_truncate_count_nondeletable(LVRelState *vacrel)
 {
 	Relation	onerel = vacrel->onerel;
 	BlockNumber blkno;
@@ -2618,14 +2886,14 @@ count_nondeletable_pages(LVRelState *vacrel)
  * Return the maximum number of dead tuples we can record.
  */
 static long
-compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
+compute_max_dead_tuples(BlockNumber relblocks, bool hasindex)
 {
 	long		maxtuples;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
 	autovacuum_work_mem != -1 ?
 	autovacuum_work_mem : maintenance_work_mem;
 
-	if (useindex)
+	if (hasindex)
 	{
 		maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
 		maxtuples = Min(maxtuples, INT_MAX);
@@ -2708,26 +2976,6 @@ lazy_space_free(LVRelState *vacrel)
 	end_parallel_vacuum(vacrel);
 }
 
-/*
- * lazy_record_dead_tuple - remember one deletable tuple
- */
-static void
-lazy_record_dead_tuple(LVDeadTuples *dead_tuples, ItemPointer itemptr)
-{
-	/*
-	 * The array shouldn't overflow under normal behavior, but perhaps it
-	 * could if we are given a really small maintenance_work_mem. In that
-	 * case, just forget the last few tuples (we'll get 'em next time).
-	 */
-	if (dead_tuples->num_tuples < dead_tuples->max_tuples)
-	{
-		dead_tuples->itemptrs[dead_tuples->num_tuples] = *itemptr;
-		dead_tuples->num_tuples++;
-		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
-									 dead_tuples->num_tuples);
-	}
-}
-
 /*
  *	lazy_tid_reaped() -- is a particular tid deletable?
  *
@@ -2818,7 +3066,8 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 
 	/*
 	 * This is a stripped down version of the line pointer scan in
-	 * lazy_scan_heap(). So if you change anything here, also check that code.
+	 * lazy_scan_new_page().  So if you change anything here, also check that
+	 * code.
 	 */
 	maxoff = PageGetMaxOffsetNumber(page);
 	for (offnum = FirstOffsetNumber;
@@ -2864,7 +3113,7 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 				{
 					TransactionId xmin;
 
-					/* Check comments in lazy_scan_heap. */
+					/* Check comments in lazy_scan_new_page() */
 					if (!HeapTupleHeaderXminCommitted(tuple.t_data))
 					{
 						all_visible = false;
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index dd0c124e62..6bfc48c64a 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -756,10 +756,10 @@ tuple_all_visible(HeapTuple tup, TransactionId OldestXmin, Buffer buffer)
 		return false;			/* all-visible implies live */
 
 	/*
-	 * Neither lazy_scan_heap nor heap_page_is_all_visible will mark a page
-	 * all-visible unless every tuple is hinted committed. However, those hint
-	 * bits could be lost after a crash, so we can't be certain that they'll
-	 * be set here.  So just check the xmin.
+	 * Neither lazy_scan_heap/lazy_scan_new_page nor heap_page_is_all_visible
+	 * will mark a page all-visible unless every tuple is hinted committed.
+	 * However, those hint bits could be lost after a crash, so we can't be
+	 * certain that they'll be set here.  So just check the xmin.
 	 */
 
 	xmin = HeapTupleHeaderGetXmin(tup->t_data);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 1fe193bb25..adf4a61aac 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -58,8 +58,8 @@ typedef struct output_type
  * and approximate tuple_len on that basis. For the others, we count
  * the exact number of dead tuples etc.
  *
- * This scan is loosely based on vacuumlazy.c:lazy_scan_heap(), but
- * we do not try to avoid skipping single pages.
+ * This scan is loosely based on vacuumlazy.c:lazy_scan_heap and
+ * lazy_scan_new_page, but we do not try to avoid skipping single pages.
  */
 static void
 statapprox_heap(Relation rel, output_type *stat)
@@ -126,8 +126,9 @@ statapprox_heap(Relation rel, output_type *stat)
 
 		/*
 		 * Look at each tuple on the page and decide whether it's live or
-		 * dead, then count it and its size. Unlike lazy_scan_heap, we can
-		 * afford to ignore problems and special cases.
+		 * dead, then count it and its size. Unlike lazy_scan_heap and
+		 * lazy_scan_new_page, we can afford to ignore problems and special
+		 * cases.
 		 */
 		maxoff = PageGetMaxOffsetNumber(page);
 
-- 
2.27.0

In reply to: Peter Geoghegan (#89)
4 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Sun, Mar 28, 2021 at 9:16 PM Peter Geoghegan <pg@bowt.ie> wrote:

> And now here's v8, which has the following additional cleanup:

And here's v9, which has improved commit messages for the first 2
patches, and many small tweaks within all 4 patches.

The most interesting change is that lazy_scan_heap() now has a fairly
elaborate assertion that verifies that its idea about whether or not
the page is all_visible and all_frozen matches what
heap_page_is_all_visible() reports -- heap_page_is_all_visible() is a
stripped-down version of the logic that now lives in lazy_scan_heap(),
and exists so that the second pass over the heap can set visibility
map bits.
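
To give a rough idea of the shape of that cross-check, here is a
minimal sketch (illustrative only -- it borrows the pageprunestate,
vacrel, and buf names from the patch hunks earlier in the thread, and
is not necessarily the exact assertion that appears in v9):

#ifdef USE_ASSERT_CHECKING
	/*
	 * If the first heap pass concluded that the page is all-visible,
	 * heap_page_is_all_visible() had better agree, since the second
	 * heap pass relies on it when setting visibility map bits.
	 */
	if (pageprunestate->all_visible)
	{
		TransactionId debug_cutoff;
		bool		debug_all_frozen;

		Assert(heap_page_is_all_visible(vacrel, buf, &debug_cutoff,
										&debug_all_frozen));
		Assert(!pageprunestate->all_frozen || debug_all_frozen);
	}
#endif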

--
Peter Geoghegan

Attachments:

v9-0001-Simplify-state-managed-by-VACUUM.patch
From 5f5e4275e3feb8f48a735316e03ad4976d200b57 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 28 Mar 2021 20:55:54 -0700
Subject: [PATCH v9 1/4] Simplify state managed by VACUUM.

Reorganize the state struct used by VACUUM -- group related items
together to make it easier to understand.  Also stop relying on stack
variables inside lazy_scan_heap() -- move those into the state struct
instead.  Doing things this way simplifies large groups of related
functions whose signatures had a lot of unnecessary redundancy.

Switch over to using int64 for the struct fields used to count things
that are reported to the user via log_autovacuum and VACUUM VERBOSE
output.  We were using double, but that doesn't seem to have any
advantages.  Using int64 makes it possible to add assertions that verify
that the first pass over the heap (pruning) encounters precisely the
same number of LP_DEAD items that get deleted from indexes later on, in
the second pass over the heap.  These assertions will be added in later
commits.

Finally, reorder functions so that those containing the important and
essential steps of VACUUM appear before less important functions.  Also
try to order related functions based on the order in which they're
called during VACUUM.
---
 src/include/access/genam.h           |    4 +-
 src/backend/access/heap/vacuumlazy.c | 2206 +++++++++++++-------------
 src/backend/access/index/indexam.c   |    8 +-
 3 files changed, 1147 insertions(+), 1071 deletions(-)

diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 4515401869..480a4762f5 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -177,11 +177,11 @@ extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
-												IndexBulkDeleteResult *stats,
+												IndexBulkDeleteResult *istat,
 												IndexBulkDeleteCallback callback,
 												void *callback_state);
 extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
-												   IndexBulkDeleteResult *stats);
+												   IndexBulkDeleteResult *istat);
 extern bool index_can_return(Relation indexRelation, int attno);
 extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
 									uint16 procnum);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index efe8761702..e8d56fa060 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -142,12 +142,6 @@
 #define PARALLEL_VACUUM_KEY_BUFFER_USAGE	4
 #define PARALLEL_VACUUM_KEY_WAL_USAGE		5
 
-/*
- * Macro to check if we are in a parallel vacuum.  If true, we are in the
- * parallel mode and the DSM segment is initialized.
- */
-#define ParallelVacuumIsActive(lps) PointerIsValid(lps)
-
 /* Phases of vacuum during which we report error context. */
 typedef enum
 {
@@ -160,9 +154,10 @@ typedef enum
 } VacErrPhase;
 
 /*
- * LVDeadTuples stores the dead tuple TIDs collected during the heap scan.
- * This is allocated in the DSM segment in parallel mode and in local memory
- * in non-parallel mode.
+ * LVDeadTuples stores TIDs that are gathered during pruning/the initial heap
+ * scan.  These get deleted from indexes during index vacuuming.  They're then
+ * removed from the heap during a second heap pass that performs heap
+ * vacuuming.
  */
 typedef struct LVDeadTuples
 {
@@ -191,7 +186,7 @@ typedef struct LVShared
 	 * Target table relid and log level.  These fields are not modified during
 	 * the lazy vacuum.
 	 */
-	Oid			relid;
+	Oid			onereloid;
 	int			elevel;
 
 	/*
@@ -264,7 +259,7 @@ typedef struct LVShared
 typedef struct LVSharedIndStats
 {
 	bool		updated;		/* are the stats updated? */
-	IndexBulkDeleteResult stats;
+	IndexBulkDeleteResult istat;
 } LVSharedIndStats;
 
 /* Struct for maintaining a parallel vacuum state. */
@@ -290,41 +285,69 @@ typedef struct LVParallelState
 	int			nindexes_parallel_condcleanup;
 } LVParallelState;
 
-typedef struct LVRelStats
+typedef struct LVRelState
 {
-	char	   *relnamespace;
-	char	   *relname;
+	/* Target heap relation and its indexes */
+	Relation	onerel;
+	Relation   *indrels;
+	int			nindexes;
 	/* useindex = true means two-pass strategy; false means one-pass */
 	bool		useindex;
-	/* Overall statistics about rel */
+
+	/* Buffer access strategy and parallel state */
+	BufferAccessStrategy bstrategy;
+	LVParallelState *lps;
+
+	/* Statistics from pg_class when we start out */
 	BlockNumber old_rel_pages;	/* previous value of pg_class.relpages */
-	BlockNumber rel_pages;		/* total number of pages */
-	BlockNumber scanned_pages;	/* number of pages we examined */
-	BlockNumber pinskipped_pages;	/* # of pages we skipped due to a pin */
-	BlockNumber frozenskipped_pages;	/* # of frozen pages we skipped */
-	BlockNumber tupcount_pages; /* pages whose tuples we counted */
 	double		old_live_tuples;	/* previous value of pg_class.reltuples */
-	double		new_rel_tuples; /* new estimated total # of tuples */
-	double		new_live_tuples;	/* new estimated total # of live tuples */
-	double		new_dead_tuples;	/* new estimated total # of dead tuples */
-	BlockNumber pages_removed;
-	double		tuples_deleted;
-	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
-	LVDeadTuples *dead_tuples;
-	int			num_index_scans;
+	/* onerel's initial relfrozenxid and relminmxid */
+	TransactionId relfrozenxid;
+	MultiXactId relminmxid;
 	TransactionId latestRemovedXid;
-	bool		lock_waiter_detected;
 
-	/* Statistics about indexes */
-	IndexBulkDeleteResult **indstats;
-	int			nindexes;
+	/* VACUUM operation's cutoff for pruning */
+	TransactionId OldestXmin;
+	/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
+	TransactionId FreezeLimit;
+	MultiXactId MultiXactCutoff;
 
-	/* Used for error callback */
+	/* Error reporting state */
+	char	   *relnamespace;
+	char	   *relname;
 	char	   *indname;
 	BlockNumber blkno;			/* used only for heap operations */
 	OffsetNumber offnum;		/* used only for heap operations */
 	VacErrPhase phase;
-} LVRelStats;
+
+	/*
+	 * State managed by lazy_scan_heap() follows
+	 */
+	LVDeadTuples *dead_tuples;	/* items to vacuum from indexes */
+	BlockNumber rel_pages;		/* total number of pages */
+	BlockNumber scanned_pages;	/* number of pages we examined */
+	BlockNumber pinskipped_pages;	/* # of pages skipped due to a pin */
+	BlockNumber frozenskipped_pages;	/* # of frozen pages we skipped */
+	BlockNumber tupcount_pages; /* pages whose tuples we counted */
+	BlockNumber pages_removed;	/* pages removed by truncation */
+	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+	bool		lock_waiter_detected;
+
+	/* Statistics output by us, for table */
+	double		new_rel_tuples; /* new estimated total # of tuples */
+	double		new_live_tuples;	/* new estimated total # of live tuples */
+	/* Statistics output by index AMs */
+	IndexBulkDeleteResult **indstats;
+
+	/* Instrumentation counters */
+	int			num_index_scans;
+	int64		tuples_deleted; /* # deleted from table */
+	int64		new_dead_tuples;	/* new estimated total # of dead items in
+									 * table */
+	int64		num_tuples;		/* total number of nonremovable tuples */
+	int64		live_tuples;	/* live tuples (reltuples estimate) */
+	int64		nunused;		/* # existing unused line pointers */
+} LVRelState;
 
 /* Struct for saving and restoring vacuum error information. */
 typedef struct LVSavedErrInfo
@@ -334,77 +357,72 @@ typedef struct LVSavedErrInfo
 	VacErrPhase phase;
 } LVSavedErrInfo;
 
-/* A few variables that don't seem worth passing around as parameters */
+/* elevel controls the verbosity of the whole VACUUM */
 static int	elevel = -1;
 
-static TransactionId OldestXmin;
-static TransactionId FreezeLimit;
-static MultiXactId MultiXactCutoff;
-
-static BufferAccessStrategy vac_strategy;
-
 
 /* non-export function prototypes */
-static void lazy_scan_heap(Relation onerel, VacuumParams *params,
-						   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
+static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
 						   bool aggressive);
-static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
-									LVRelStats *vacrelstats);
-static void lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
-									LVRelStats *vacrelstats, LVParallelState *lps,
-									int nindexes);
-static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
-							  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats);
-static void lazy_cleanup_index(Relation indrel,
-							   IndexBulkDeleteResult **stats,
-							   double reltuples, bool estimated_count, LVRelStats *vacrelstats);
-static int	lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-							 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
-static bool should_attempt_truncation(VacuumParams *params,
-									  LVRelStats *vacrelstats);
-static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
-static BlockNumber count_nondeletable_pages(Relation onerel,
-											LVRelStats *vacrelstats);
-static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
+									LVRelState *vacrel);
+static void lazy_vacuum_all_indexes(LVRelState *vacrel);
+static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
+													IndexBulkDeleteResult *istat,
+													double reltuples,
+													LVRelState *vacrel);
+static void lazy_cleanup_all_indexes(LVRelState *vacrel);
+static IndexBulkDeleteResult *lazy_cleanup_one_index(Relation indrel,
+													 IndexBulkDeleteResult *istat,
+													 double reltuples,
+													 bool estimated_count,
+													 LVRelState *vacrel);
+static void lazy_vacuum_heap_rel(LVRelState *vacrel);
+static int	lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+								  Buffer buffer, int tupindex, Buffer *vmbuffer);
+static void update_index_statistics(LVRelState *vacrel);
+static bool should_attempt_truncation(LVRelState *vacrel,
+									  VacuumParams *params);
+static void lazy_truncate_heap(LVRelState *vacrel);
 static void lazy_record_dead_tuple(LVDeadTuples *dead_tuples,
 								   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
 static int	vac_cmp_itemptr(const void *left, const void *right);
-static bool heap_page_is_all_visible(Relation rel, Buffer buf,
-									 LVRelStats *vacrelstats,
+static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
-static void lazy_parallel_vacuum_indexes(Relation *Irel, LVRelStats *vacrelstats,
-										 LVParallelState *lps, int nindexes);
-static void parallel_vacuum_index(Relation *Irel, LVShared *lvshared,
-								  LVDeadTuples *dead_tuples, int nindexes,
-								  LVRelStats *vacrelstats);
-static void vacuum_indexes_leader(Relation *Irel, LVRelStats *vacrelstats,
-								  LVParallelState *lps, int nindexes);
-static void vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
-							 LVShared *lvshared, LVSharedIndStats *shared_indstats,
-							 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats);
-static void lazy_cleanup_all_indexes(Relation *Irel, LVRelStats *vacrelstats,
-									 LVParallelState *lps, int nindexes);
+static BlockNumber count_nondeletable_pages(LVRelState *vacrel);
 static long compute_max_dead_tuples(BlockNumber relblocks, bool hasindex);
-static int	compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
+static void lazy_space_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber relblocks);
+static void lazy_space_free(LVRelState *vacrel);
+static int	compute_parallel_vacuum_workers(LVRelState *vacrel,
+											int nrequested,
 											bool *can_parallel_vacuum);
-static void prepare_index_statistics(LVShared *lvshared, bool *can_parallel_vacuum,
-									 int nindexes);
-static void update_index_statistics(Relation *Irel, IndexBulkDeleteResult **stats,
-									int nindexes);
-static LVParallelState *begin_parallel_vacuum(Oid relid, Relation *Irel,
-											  LVRelStats *vacrelstats, BlockNumber nblocks,
-											  int nindexes, int nrequested);
-static void end_parallel_vacuum(IndexBulkDeleteResult **stats,
-								LVParallelState *lps, int nindexes);
-static LVSharedIndStats *get_indstats(LVShared *lvshared, int n);
-static bool skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared);
+static LVParallelState *begin_parallel_vacuum(LVRelState *vacrel,
+											  BlockNumber nblocks,
+											  int nrequested);
+static void end_parallel_vacuum(LVRelState *vacrel);
+static void do_parallel_lazy_vacuum_all_indexes(LVRelState *vacrel);
+static void do_parallel_lazy_cleanup_all_indexes(LVRelState *vacrel);
+static void do_parallel_vacuum_or_cleanup(LVRelState *vacrel, int nworkers);
+static void do_parallel_processing(LVRelState *vacrel,
+								   LVShared *lvshared);
+static void do_serial_processing_for_unsafe_indexes(LVRelState *vacrel,
+													LVShared *lvshared);
+static IndexBulkDeleteResult *parallel_process_one_index(Relation indrel,
+														 IndexBulkDeleteResult *istat,
+														 LVShared *lvshared,
+														 LVSharedIndStats *shared_indstats,
+														 LVRelState *vacrel);
+static LVSharedIndStats *parallel_stats_for_idx(LVShared *lvshared, int getidx);
+static bool parallel_processing_is_safe(Relation indrel, LVShared *lvshared);
 static void vacuum_error_callback(void *arg);
-static void update_vacuum_error_info(LVRelStats *errinfo, LVSavedErrInfo *saved_err_info,
+static void update_vacuum_error_info(LVRelState *vacrel,
+									 LVSavedErrInfo *saved_vacrel,
 									 int phase, BlockNumber blkno,
 									 OffsetNumber offnum);
-static void restore_vacuum_error_info(LVRelStats *errinfo, const LVSavedErrInfo *saved_err_info);
+static void restore_vacuum_error_info(LVRelState *vacrel,
+									  const LVSavedErrInfo *saved_vacrel);
 
 
 /*
@@ -420,9 +438,7 @@ void
 heap_vacuum_rel(Relation onerel, VacuumParams *params,
 				BufferAccessStrategy bstrategy)
 {
-	LVRelStats *vacrelstats;
-	Relation   *Irel;
-	int			nindexes;
+	LVRelState *vacrel;
 	PGRUsage	ru0;
 	TimestampTz starttime = 0;
 	WalUsage	walusage_start = pgWalUsage;
@@ -444,15 +460,14 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	ErrorContextCallback errcallback;
 	PgStat_Counter startreadtime = 0;
 	PgStat_Counter startwritetime = 0;
+	TransactionId OldestXmin;
+	TransactionId FreezeLimit;
+	MultiXactId MultiXactCutoff;
 
 	Assert(params != NULL);
 	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
-	/* not every AM requires these to be valid, but heap does */
-	Assert(TransactionIdIsNormal(onerel->rd_rel->relfrozenxid));
-	Assert(MultiXactIdIsValid(onerel->rd_rel->relminmxid));
-
 	/* measure elapsed time iff autovacuum logging requires it */
 	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
 	{
@@ -473,8 +488,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM,
 								  RelationGetRelid(onerel));
 
-	vac_strategy = bstrategy;
-
 	vacuum_set_xid_limits(onerel,
 						  params->freeze_min_age,
 						  params->freeze_table_age,
@@ -496,35 +509,40 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 		aggressive = true;
 
-	vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
+	vacrel = (LVRelState *) palloc0(sizeof(LVRelState));
 
-	vacrelstats->relnamespace = get_namespace_name(RelationGetNamespace(onerel));
-	vacrelstats->relname = pstrdup(RelationGetRelationName(onerel));
-	vacrelstats->indname = NULL;
-	vacrelstats->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
-	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
-	vacrelstats->old_live_tuples = onerel->rd_rel->reltuples;
-	vacrelstats->num_index_scans = 0;
-	vacrelstats->pages_removed = 0;
-	vacrelstats->lock_waiter_detected = false;
+	/* Set up high level stuff about onerel */
+	vacrel->onerel = onerel;
+	vac_open_indexes(vacrel->onerel, RowExclusiveLock, &vacrel->nindexes,
+					 &vacrel->indrels);
+	vacrel->useindex = (vacrel->nindexes > 0 &&
+						params->index_cleanup == VACOPT_TERNARY_ENABLED);
+	vacrel->bstrategy = bstrategy;
+	vacrel->lps = NULL;			/* for now */
+	vacrel->old_rel_pages = onerel->rd_rel->relpages;
+	vacrel->old_live_tuples = onerel->rd_rel->reltuples;
+	vacrel->relfrozenxid = onerel->rd_rel->relfrozenxid;
+	vacrel->relminmxid = onerel->rd_rel->relminmxid;
+	vacrel->latestRemovedXid = InvalidTransactionId;
 
-	/* Open all indexes of the relation */
-	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
-	vacrelstats->useindex = (nindexes > 0 &&
-							 params->index_cleanup == VACOPT_TERNARY_ENABLED);
+	/* Set cutoffs for entire VACUUM */
+	vacrel->OldestXmin = OldestXmin;
+	vacrel->FreezeLimit = FreezeLimit;
+	vacrel->MultiXactCutoff = MultiXactCutoff;
 
-	vacrelstats->indstats = (IndexBulkDeleteResult **)
-		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
-	vacrelstats->nindexes = nindexes;
+	vacrel->relnamespace = get_namespace_name(RelationGetNamespace(onerel));
+	vacrel->relname = pstrdup(RelationGetRelationName(onerel));
+	vacrel->indname = NULL;
+	vacrel->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
 
 	/* Save index names iff autovacuum logging requires it */
-	if (IsAutoVacuumWorkerProcess() &&
-		params->log_min_duration >= 0 &&
-		vacrelstats->nindexes > 0)
+	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0 &&
+		vacrel->nindexes > 0)
 	{
-		indnames = palloc(sizeof(char *) * vacrelstats->nindexes);
-		for (int i = 0; i < vacrelstats->nindexes; i++)
-			indnames[i] = pstrdup(RelationGetRelationName(Irel[i]));
+		indnames = palloc(sizeof(char *) * vacrel->nindexes);
+		for (int i = 0; i < vacrel->nindexes; i++)
+			indnames[i] =
+				pstrdup(RelationGetRelationName(vacrel->indrels[i]));
 	}
 
 	/*
@@ -539,15 +557,15 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	 * information is restored at the end of those phases.
 	 */
 	errcallback.callback = vacuum_error_callback;
-	errcallback.arg = vacrelstats;
+	errcallback.arg = vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
 
 	/* Do the vacuuming */
-	lazy_scan_heap(onerel, params, vacrelstats, Irel, nindexes, aggressive);
+	lazy_scan_heap(vacrel, params, aggressive);
 
 	/* Done with indexes */
-	vac_close_indexes(nindexes, Irel, NoLock);
+	vac_close_indexes(vacrel->nindexes, vacrel->indrels, NoLock);
 
 	/*
 	 * Compute whether we actually scanned the all unfrozen pages. If we did,
@@ -556,8 +574,8 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	 * NB: We need to check this before truncating the relation, because that
 	 * will change ->rel_pages.
 	 */
-	if ((vacrelstats->scanned_pages + vacrelstats->frozenskipped_pages)
-		< vacrelstats->rel_pages)
+	if ((vacrel->scanned_pages + vacrel->frozenskipped_pages)
+		< vacrel->rel_pages)
 	{
 		Assert(!aggressive);
 		scanned_all_unfrozen = false;
@@ -568,17 +586,17 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	/*
 	 * Optionally truncate the relation.
 	 */
-	if (should_attempt_truncation(params, vacrelstats))
+	if (should_attempt_truncation(vacrel, params))
 	{
 		/*
 		 * Update error traceback information.  This is the last phase during
 		 * which we add context information to errors, so we don't need to
 		 * revert to the previous phase.
 		 */
-		update_vacuum_error_info(vacrelstats, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
-								 vacrelstats->nonempty_pages,
+		update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
+								 vacrel->nonempty_pages,
 								 InvalidOffsetNumber);
-		lazy_truncate_heap(onerel, vacrelstats);
+		lazy_truncate_heap(vacrel);
 	}
 
 	/* Pop the error context stack */
@@ -602,8 +620,8 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	 * Also, don't change relfrozenxid/relminmxid if we skipped any pages,
 	 * since then we don't know for certain that all tuples have a newer xmin.
 	 */
-	new_rel_pages = vacrelstats->rel_pages;
-	new_live_tuples = vacrelstats->new_live_tuples;
+	new_rel_pages = vacrel->rel_pages;
+	new_live_tuples = vacrel->new_live_tuples;
 
 	visibilitymap_count(onerel, &new_rel_allvisible, NULL);
 	if (new_rel_allvisible > new_rel_pages)
@@ -616,7 +634,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 						new_rel_pages,
 						new_live_tuples,
 						new_rel_allvisible,
-						nindexes > 0,
+						vacrel->nindexes > 0,
 						new_frozen_xid,
 						new_min_multi,
 						false);
@@ -625,7 +643,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	pgstat_report_vacuum(RelationGetRelid(onerel),
 						 onerel->rd_rel->relisshared,
 						 Max(new_live_tuples, 0),
-						 vacrelstats->new_dead_tuples);
+						 vacrel->new_dead_tuples);
 	pgstat_progress_end_command();
 
 	/* and log the action if appropriate */
@@ -676,39 +694,39 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 			}
 			appendStringInfo(&buf, msgfmt,
 							 get_database_name(MyDatabaseId),
-							 vacrelstats->relnamespace,
-							 vacrelstats->relname,
-							 vacrelstats->num_index_scans);
+							 vacrel->relnamespace,
+							 vacrel->relname,
+							 vacrel->num_index_scans);
 			appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"),
-							 vacrelstats->pages_removed,
-							 vacrelstats->rel_pages,
-							 vacrelstats->pinskipped_pages,
-							 vacrelstats->frozenskipped_pages);
+							 vacrel->pages_removed,
+							 vacrel->rel_pages,
+							 vacrel->pinskipped_pages,
+							 vacrel->frozenskipped_pages);
 			appendStringInfo(&buf,
-							 _("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable, oldest xmin: %u\n"),
-							 vacrelstats->tuples_deleted,
-							 vacrelstats->new_rel_tuples,
-							 vacrelstats->new_dead_tuples,
+							 _("tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"),
+							 (long long) vacrel->tuples_deleted,
+							 (long long) vacrel->new_rel_tuples,
+							 (long long) vacrel->new_dead_tuples,
 							 OldestXmin);
 			appendStringInfo(&buf,
 							 _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
 							 (long long) VacuumPageHit,
 							 (long long) VacuumPageMiss,
 							 (long long) VacuumPageDirty);
-			for (int i = 0; i < vacrelstats->nindexes; i++)
+			for (int i = 0; i < vacrel->nindexes; i++)
 			{
-				IndexBulkDeleteResult *stats = vacrelstats->indstats[i];
+				IndexBulkDeleteResult *istat = vacrel->indstats[i];
 
-				if (!stats)
+				if (!istat)
 					continue;
 
 				appendStringInfo(&buf,
 								 _("index \"%s\": pages: %u in total, %u newly deleted, %u currently deleted, %u reusable\n"),
 								 indnames[i],
-								 stats->num_pages,
-								 stats->pages_newly_deleted,
-								 stats->pages_deleted,
-								 stats->pages_free);
+								 istat->num_pages,
+								 istat->pages_newly_deleted,
+								 istat->pages_deleted,
+								 istat->pages_free);
 			}
 			appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
 							 read_rate, write_rate);
@@ -737,10 +755,10 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	}
 
 	/* Cleanup index statistics and index names */
-	for (int i = 0; i < vacrelstats->nindexes; i++)
+	for (int i = 0; i < vacrel->nindexes; i++)
 	{
-		if (vacrelstats->indstats[i])
-			pfree(vacrelstats->indstats[i]);
+		if (vacrel->indstats[i])
+			pfree(vacrel->indstats[i]);
 
 		if (indnames && indnames[i])
 			pfree(indnames[i]);
@@ -764,20 +782,21 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
  * which would be after the rows have become inaccessible.
  */
 static void
-vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
+vacuum_log_cleanup_info(LVRelState *vacrel)
 {
 	/*
 	 * Skip this for relations for which no WAL is to be written, or if we're
 	 * not trying to support archive recovery.
 	 */
-	if (!RelationNeedsWAL(rel) || !XLogIsNeeded())
+	if (!RelationNeedsWAL(vacrel->onerel) || !XLogIsNeeded())
 		return;
 
 	/*
 	 * No need to write the record at all unless it contains a valid value
 	 */
-	if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
-		(void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+	if (TransactionIdIsValid(vacrel->latestRemovedXid))
+		(void) log_heap_cleanup_info(vacrel->onerel->rd_node,
+									 vacrel->latestRemovedXid);
 }
 
 /*
@@ -788,9 +807,9 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
  *		page, and set commit status bits (see heap_page_prune).  It also builds
  *		lists of dead tuples and pages with free space, calculates statistics
  *		on the number of live tuples in the heap, and marks pages as
- *		all-visible if appropriate.  When done, or when we run low on space for
- *		dead-tuple TIDs, invoke vacuuming of indexes and call lazy_vacuum_heap
- *		to reclaim dead line pointers.
+ *		all-visible if appropriate.  When done, or when we run low on space
+ *		for dead-tuple TIDs, invoke vacuuming of indexes and reclaim dead line
+ *		pointers.
  *
  *		If the table has at least two indexes, we execute both index vacuum
  *		and index cleanup with parallel workers unless parallel vacuum is
@@ -809,16 +828,12 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
  *		reference them have been killed.
  */
 static void
-lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
-			   Relation *Irel, int nindexes, bool aggressive)
+lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 {
-	LVParallelState *lps = NULL;
 	LVDeadTuples *dead_tuples;
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
-	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
-	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
 				vacuumed_pages,
 				next_fsm_block_to_vacuum;
@@ -847,63 +862,47 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	if (aggressive)
 		ereport(elevel,
 				(errmsg("aggressively vacuuming \"%s.%s\"",
-						vacrelstats->relnamespace,
-						vacrelstats->relname)));
+						vacrel->relnamespace,
+						vacrel->relname)));
 	else
 		ereport(elevel,
 				(errmsg("vacuuming \"%s.%s\"",
-						vacrelstats->relnamespace,
-						vacrelstats->relname)));
+						vacrel->relnamespace,
+						vacrel->relname)));
 
 	empty_pages = vacuumed_pages = 0;
 	next_fsm_block_to_vacuum = (BlockNumber) 0;
 	num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
 
-	nblocks = RelationGetNumberOfBlocks(onerel);
-	vacrelstats->rel_pages = nblocks;
-	vacrelstats->scanned_pages = 0;
-	vacrelstats->tupcount_pages = 0;
-	vacrelstats->nonempty_pages = 0;
-	vacrelstats->latestRemovedXid = InvalidTransactionId;
+	nblocks = RelationGetNumberOfBlocks(vacrel->onerel);
+	vacrel->rel_pages = nblocks;
+	vacrel->scanned_pages = 0;
+	vacrel->pinskipped_pages = 0;
+	vacrel->frozenskipped_pages = 0;
+	vacrel->tupcount_pages = 0;
+	vacrel->pages_removed = 0;
+	vacrel->nonempty_pages = 0;
+	vacrel->lock_waiter_detected = false;
 
-	vistest = GlobalVisTestFor(onerel);
+	/* Initialize instrumentation counters */
+	vacrel->num_index_scans = 0;
+	vacrel->tuples_deleted = 0;
+	vacrel->new_dead_tuples = 0;
+	vacrel->num_tuples = 0;
+	vacrel->live_tuples = 0;
+	vacrel->nunused = 0;
 
-	/*
-	 * Initialize state for a parallel vacuum.  As of now, only one worker can
-	 * be used for an index, so we invoke parallelism only if there are at
-	 * least two indexes on a table.
-	 */
-	if (params->nworkers >= 0 && vacrelstats->useindex && nindexes > 1)
-	{
-		/*
-		 * Since parallel workers cannot access data in temporary tables, we
-		 * can't perform parallel vacuum on them.
-		 */
-		if (RelationUsesLocalBuffers(onerel))
-		{
-			/*
-			 * Give warning only if the user explicitly tries to perform a
-			 * parallel vacuum on the temporary table.
-			 */
-			if (params->nworkers > 0)
-				ereport(WARNING,
-						(errmsg("disabling parallel option of vacuum on \"%s\" --- cannot vacuum temporary tables in parallel",
-								vacrelstats->relname)));
-		}
-		else
-			lps = begin_parallel_vacuum(RelationGetRelid(onerel), Irel,
-										vacrelstats, nblocks, nindexes,
-										params->nworkers);
-	}
+	vistest = GlobalVisTestFor(vacrel->onerel);
+
+	vacrel->indstats = (IndexBulkDeleteResult **)
+		palloc0(vacrel->nindexes * sizeof(IndexBulkDeleteResult *));
 
 	/*
 	 * Allocate the space for dead tuples in case parallel vacuum is not
 	 * initialized.
 	 */
-	if (!ParallelVacuumIsActive(lps))
-		lazy_space_alloc(vacrelstats, nblocks);
-
-	dead_tuples = vacrelstats->dead_tuples;
+	lazy_space_alloc(vacrel, params->nworkers, nblocks);
+	dead_tuples = vacrel->dead_tuples;
 	frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
 
 	/* Report that we're scanning the heap, advertising total # of blocks */
@@ -963,7 +962,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		{
 			uint8		vmstatus;
 
-			vmstatus = visibilitymap_get_status(onerel, next_unskippable_block,
+			vmstatus = visibilitymap_get_status(vacrel->onerel,
+												next_unskippable_block,
 												&vmbuffer);
 			if (aggressive)
 			{
@@ -1004,11 +1004,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		/* see note above about forcing scanning of last page */
 #define FORCE_CHECK_PAGE() \
-		(blkno == nblocks - 1 && should_attempt_truncation(params, vacrelstats))
+		(blkno == nblocks - 1 && should_attempt_truncation(vacrel, params))
 
 		pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
-		update_vacuum_error_info(vacrelstats, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
+		update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
 								 blkno, InvalidOffsetNumber);
 
 		if (blkno == next_unskippable_block)
@@ -1021,7 +1021,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				{
 					uint8		vmskipflags;
 
-					vmskipflags = visibilitymap_get_status(onerel,
+					vmskipflags = visibilitymap_get_status(vacrel->onerel,
 														   next_unskippable_block,
 														   &vmbuffer);
 					if (aggressive)
@@ -1053,7 +1053,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * it's not all-visible.  But in an aggressive vacuum we know only
 			 * that it's not all-frozen, so it might still be all-visible.
 			 */
-			if (aggressive && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
+			if (aggressive && VM_ALL_VISIBLE(vacrel->onerel, blkno,
+											 &vmbuffer))
 				all_visible_according_to_vm = true;
 		}
 		else
@@ -1077,8 +1078,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * know whether it was all-frozen, so we have to recheck; but
 				 * in this case an approximate answer is OK.
 				 */
-				if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
-					vacrelstats->frozenskipped_pages++;
+				if (aggressive || VM_ALL_FROZEN(vacrel->onerel, blkno,
+												&vmbuffer))
+					vacrel->frozenskipped_pages++;
 				continue;
 			}
 			all_visible_according_to_vm = true;
@@ -1106,10 +1108,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(onerel, Irel, vacrelstats, lps, nindexes);
+			lazy_vacuum_all_indexes(vacrel);
 
 			/* Remove tuples from heap */
-			lazy_vacuum_heap(onerel, vacrelstats);
+			lazy_vacuum_heap_rel(vacrel);
 
 			/*
 			 * Forget the now-vacuumed tuples, and press on, but be careful
@@ -1122,7 +1124,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * Vacuum the Free Space Map to make newly-freed space visible on
 			 * upper-level FSM pages.  Note we have not yet processed blkno.
 			 */
-			FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
+			FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum, blkno);
 			next_fsm_block_to_vacuum = blkno;
 
 			/* Report that we are once again scanning the heap */
@@ -1137,12 +1139,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * possible that (a) next_unskippable_block is covered by a different
 		 * VM page than the current block or (b) we released our pin and did a
 		 * cycle of index vacuuming.
-		 *
 		 */
-		visibilitymap_pin(onerel, blkno, &vmbuffer);
+		visibilitymap_pin(vacrel->onerel, blkno, &vmbuffer);
 
-		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
-								 RBM_NORMAL, vac_strategy);
+		buf = ReadBufferExtended(vacrel->onerel, MAIN_FORKNUM, blkno,
+								 RBM_NORMAL, vacrel->bstrategy);
 
 		/* We need buffer cleanup lock so that we can prune HOT chains. */
 		if (!ConditionalLockBufferForCleanup(buf))
@@ -1156,7 +1157,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			if (!aggressive && !FORCE_CHECK_PAGE())
 			{
 				ReleaseBuffer(buf);
-				vacrelstats->pinskipped_pages++;
+				vacrel->pinskipped_pages++;
 				continue;
 			}
 
@@ -1177,13 +1178,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * to use lazy_check_needs_freeze() for both situations, though.
 			 */
 			LockBuffer(buf, BUFFER_LOCK_SHARE);
-			if (!lazy_check_needs_freeze(buf, &hastup, vacrelstats))
+			if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
 			{
 				UnlockReleaseBuffer(buf);
-				vacrelstats->scanned_pages++;
-				vacrelstats->pinskipped_pages++;
+				vacrel->scanned_pages++;
+				vacrel->pinskipped_pages++;
 				if (hastup)
-					vacrelstats->nonempty_pages = blkno + 1;
+					vacrel->nonempty_pages = blkno + 1;
 				continue;
 			}
 			if (!aggressive)
@@ -1193,9 +1194,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * to claiming that the page contains no freezable tuples.
 				 */
 				UnlockReleaseBuffer(buf);
-				vacrelstats->pinskipped_pages++;
+				vacrel->pinskipped_pages++;
 				if (hastup)
-					vacrelstats->nonempty_pages = blkno + 1;
+					vacrel->nonempty_pages = blkno + 1;
 				continue;
 			}
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
@@ -1203,8 +1204,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			/* drop through to normal processing */
 		}
 
-		vacrelstats->scanned_pages++;
-		vacrelstats->tupcount_pages++;
+		vacrel->scanned_pages++;
+		vacrel->tupcount_pages++;
 
 		page = BufferGetPage(buf);
 
@@ -1233,12 +1234,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 			empty_pages++;
 
-			if (GetRecordedFreeSpace(onerel, blkno) == 0)
+			if (GetRecordedFreeSpace(vacrel->onerel, blkno) == 0)
 			{
 				Size		freespace;
 
 				freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
-				RecordPageWithFreeSpace(onerel, blkno, freespace);
+				RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
 			}
 			continue;
 		}
@@ -1269,19 +1270,19 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (RelationNeedsWAL(vacrel->onerel) &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
-				visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+				visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
 								  vmbuffer, InvalidTransactionId,
 								  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
 				END_CRIT_SECTION();
 			}
 
 			UnlockReleaseBuffer(buf);
-			RecordPageWithFreeSpace(onerel, blkno, freespace);
+			RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
 			continue;
 		}
 
@@ -1291,10 +1292,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * We count tuples removed by the pruning step as removed by VACUUM
 		 * (existing LP_DEAD line pointers don't count).
 		 */
-		tups_vacuumed += heap_page_prune(onerel, buf, vistest,
+		tups_vacuumed += heap_page_prune(vacrel->onerel, buf, vistest,
 										 InvalidTransactionId, 0, false,
-										 &vacrelstats->latestRemovedXid,
-										 &vacrelstats->offnum);
+										 &vacrel->latestRemovedXid,
+										 &vacrel->offnum);
 
 		/*
 		 * Now scan the page to collect vacuumable items and check for tuples
@@ -1321,7 +1322,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * Set the offset number so that we can display it along with any
 			 * error that occurred while processing this tuple.
 			 */
-			vacrelstats->offnum = offnum;
+			vacrel->offnum = offnum;
 			itemid = PageGetItemId(page, offnum);
 
 			/* Unused items require no processing, but we count 'em */
@@ -1361,7 +1362,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
 			tuple.t_len = ItemIdGetLength(itemid);
-			tuple.t_tableOid = RelationGetRelid(onerel);
+			tuple.t_tableOid = RelationGetRelid(vacrel->onerel);
 
 			tupgone = false;
 
@@ -1376,7 +1377,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * cases impossible (e.g. in-progress insert from the same
 			 * transaction).
 			 */
-			switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+			switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
 			{
 				case HEAPTUPLE_DEAD:
 
@@ -1446,7 +1447,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 						 * enough that everyone sees it as committed?
 						 */
 						xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-						if (!TransactionIdPrecedes(xmin, OldestXmin))
+						if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
 						{
 							all_visible = false;
 							break;
@@ -1500,7 +1501,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			{
 				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
 				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-													   &vacrelstats->latestRemovedXid);
+													   &vacrel->latestRemovedXid);
 				tups_vacuumed += 1;
 				has_dead_items = true;
 			}
@@ -1516,8 +1517,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * freezing.  Note we already have exclusive buffer lock.
 				 */
 				if (heap_prepare_freeze_tuple(tuple.t_data,
-											  relfrozenxid, relminmxid,
-											  FreezeLimit, MultiXactCutoff,
+											  vacrel->relfrozenxid,
+											  vacrel->relminmxid,
+											  vacrel->FreezeLimit,
+											  vacrel->MultiXactCutoff,
 											  &frozen[nfrozen],
 											  &tuple_totally_frozen))
 					frozen[nfrozen++].offset = offnum;
@@ -1531,7 +1534,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * Clear the offset information once we have processed all the tuples
 		 * on the page.
 		 */
-		vacrelstats->offnum = InvalidOffsetNumber;
+		vacrel->offnum = InvalidOffsetNumber;
 
 		/*
 		 * If we froze any tuples, mark the buffer dirty, and write a WAL
@@ -1557,12 +1560,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (RelationNeedsWAL(vacrel->onerel))
 			{
 				XLogRecPtr	recptr;
 
-				recptr = log_heap_freeze(onerel, buf, FreezeLimit,
-										 frozen, nfrozen);
+				recptr = log_heap_freeze(vacrel->onerel, buf,
+										 vacrel->FreezeLimit, frozen, nfrozen);
 				PageSetLSN(page, recptr);
 			}
 
@@ -1574,12 +1577,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * doing a second scan. Also we don't do that but forget dead tuples
 		 * when index cleanup is disabled.
 		 */
-		if (!vacrelstats->useindex && dead_tuples->num_tuples > 0)
+		if (!vacrel->useindex && dead_tuples->num_tuples > 0)
 		{
-			if (nindexes == 0)
+			if (vacrel->nindexes == 0)
 			{
 				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
 				vacuumed_pages++;
 				has_dead_items = false;
 			}
@@ -1589,11 +1592,6 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * Here, we have indexes but index cleanup is disabled.
 				 * Instead of vacuuming the dead tuples on the heap, we just
 				 * forget them.
-				 *
-				 * Note that vacrelstats->dead_tuples could have tuples which
-				 * became dead after HOT-pruning but are not marked dead yet.
-				 * We do not process them because it's a very rare condition,
-				 * and the next vacuum will process them anyway.
 				 */
 				Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
 			}
@@ -1613,7 +1611,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
 			{
-				FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum,
+				FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum,
 										blkno);
 				next_fsm_block_to_vacuum = blkno;
 			}
@@ -1644,7 +1642,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			PageSetAllVisible(page);
 			MarkBufferDirty(buf);
-			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+			visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, visibility_cutoff_xid, flags);
 		}
 
@@ -1656,11 +1654,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * that something bad has happened.
 		 */
 		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
+				 && VM_ALL_VISIBLE(vacrel->onerel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
-				 vacrelstats->relname, blkno);
-			visibilitymap_clear(onerel, blkno, vmbuffer,
+				 vacrel->relname, blkno);
+			visibilitymap_clear(vacrel->onerel, blkno, vmbuffer,
 								VISIBILITYMAP_VALID_BITS);
 		}
 
@@ -1682,10 +1680,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		else if (PageIsAllVisible(page) && has_dead_items)
 		{
 			elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
-				 vacrelstats->relname, blkno);
+				 vacrel->relname, blkno);
 			PageClearAllVisible(page);
 			MarkBufferDirty(buf);
-			visibilitymap_clear(onerel, blkno, vmbuffer,
+			visibilitymap_clear(vacrel->onerel, blkno, vmbuffer,
 								VISIBILITYMAP_VALID_BITS);
 		}
 
@@ -1695,14 +1693,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * all_visible is true, so we must check both.
 		 */
 		else if (all_visible_according_to_vm && all_visible && all_frozen &&
-				 !VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+				 !VM_ALL_FROZEN(vacrel->onerel, blkno, &vmbuffer))
 		{
 			/*
 			 * We can pass InvalidTransactionId as the cutoff XID here,
 			 * because setting the all-frozen bit doesn't cause recovery
 			 * conflicts.
 			 */
-			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+			visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, InvalidTransactionId,
 							  VISIBILITYMAP_ALL_FROZEN);
 		}
@@ -1711,43 +1709,42 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		/* Remember the location of the last page with nonremovable tuples */
 		if (hastup)
-			vacrelstats->nonempty_pages = blkno + 1;
+			vacrel->nonempty_pages = blkno + 1;
 
 		/*
 		 * If we remembered any tuples for deletion, then the page will be
-		 * visited again by lazy_vacuum_heap, which will compute and record
+		 * visited again by lazy_vacuum_heap_rel, which will compute and record
 		 * its post-compaction free space.  If not, then we're done with this
 		 * page, so remember its free space as-is.  (This path will always be
 		 * taken if there are no indexes.)
 		 */
 		if (dead_tuples->num_tuples == prev_dead_count)
-			RecordPageWithFreeSpace(onerel, blkno, freespace);
+			RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
 	}
 
 	/* report that everything is scanned and vacuumed */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
 	/* Clear the block number information */
-	vacrelstats->blkno = InvalidBlockNumber;
+	vacrel->blkno = InvalidBlockNumber;
 
 	pfree(frozen);
 
 	/* save stats for use later */
-	vacrelstats->tuples_deleted = tups_vacuumed;
-	vacrelstats->new_dead_tuples = nkeep;
+	vacrel->tuples_deleted = tups_vacuumed;
+	vacrel->new_dead_tuples = nkeep;
 
 	/* now we can compute the new value for pg_class.reltuples */
-	vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
-														  nblocks,
-														  vacrelstats->tupcount_pages,
-														  live_tuples);
+	vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->onerel, nblocks,
+													 vacrel->tupcount_pages,
+													 live_tuples);
 
 	/*
 	 * Also compute the total number of surviving heap entries.  In the
 	 * (unlikely) scenario that new_live_tuples is -1, take it as zero.
 	 */
-	vacrelstats->new_rel_tuples =
-		Max(vacrelstats->new_live_tuples, 0) + vacrelstats->new_dead_tuples;
+	vacrel->new_rel_tuples =
+		Max(vacrel->new_live_tuples, 0) + vacrel->new_dead_tuples;
 
 	/*
 	 * Release any remaining pin on visibility map page.
@@ -1763,10 +1760,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	if (dead_tuples->num_tuples > 0)
 	{
 		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(onerel, Irel, vacrelstats, lps, nindexes);
+		lazy_vacuum_all_indexes(vacrel);
 
 		/* Remove tuples from heap */
-		lazy_vacuum_heap(onerel, vacrelstats);
+		lazy_vacuum_heap_rel(vacrel);
 	}
 
 	/*
@@ -1774,47 +1771,44 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * not there were indexes.
 	 */
 	if (blkno > next_fsm_block_to_vacuum)
-		FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
+		FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum,
+								blkno);
 
 	/* report all blocks vacuumed */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Do post-vacuum cleanup */
-	if (vacrelstats->useindex)
-		lazy_cleanup_all_indexes(Irel, vacrelstats, lps, nindexes);
+	if (vacrel->useindex)
+		lazy_cleanup_all_indexes(vacrel);
 
-	/*
-	 * End parallel mode before updating index statistics as we cannot write
-	 * during parallel mode.
-	 */
-	if (ParallelVacuumIsActive(lps))
-		end_parallel_vacuum(vacrelstats->indstats, lps, nindexes);
+	/* Free resources managed by lazy_space_alloc() */
+	lazy_space_free(vacrel);
 
 	/* Update index statistics */
-	if (vacrelstats->useindex)
-		update_index_statistics(Irel, vacrelstats->indstats, nindexes);
+	if (vacrel->useindex)
+		update_index_statistics(vacrel);
 
-	/* If no indexes, make log report that lazy_vacuum_heap would've made */
+	/* If no indexes, make log report that lazy_vacuum_heap_rel would've made */
 	if (vacuumed_pages)
 		ereport(elevel,
 				(errmsg("\"%s\": removed %.0f row versions in %u pages",
-						vacrelstats->relname,
+						vacrel->relname,
 						tups_vacuumed, vacuumed_pages)));
 
 	initStringInfo(&buf);
 	appendStringInfo(&buf,
 					 _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"),
-					 nkeep, OldestXmin);
+					 nkeep, vacrel->OldestXmin);
 	appendStringInfo(&buf, _("There were %.0f unused item identifiers.\n"),
 					 nunused);
 	appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
 									"Skipped %u pages due to buffer pins, ",
-									vacrelstats->pinskipped_pages),
-					 vacrelstats->pinskipped_pages);
+									vacrel->pinskipped_pages),
+					 vacrel->pinskipped_pages);
 	appendStringInfo(&buf, ngettext("%u frozen page.\n",
 									"%u frozen pages.\n",
-									vacrelstats->frozenskipped_pages),
-					 vacrelstats->frozenskipped_pages);
+									vacrel->frozenskipped_pages),
+					 vacrel->frozenskipped_pages);
 	appendStringInfo(&buf, ngettext("%u page is entirely empty.\n",
 									"%u pages are entirely empty.\n",
 									empty_pages),
@@ -1823,258 +1817,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 	ereport(elevel,
 			(errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
-					vacrelstats->relname,
+					vacrel->relname,
 					tups_vacuumed, num_tuples,
-					vacrelstats->scanned_pages, nblocks),
+					vacrel->scanned_pages, nblocks),
 			 errdetail_internal("%s", buf.data)));
 	pfree(buf.data);
 }
 
-/*
- *	lazy_vacuum_all_indexes() -- vacuum all indexes of relation.
- *
- * We process the indexes serially unless we are doing parallel vacuum.
- */
-static void
-lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
-						LVRelStats *vacrelstats, LVParallelState *lps,
-						int nindexes)
-{
-	Assert(!IsParallelWorker());
-	Assert(nindexes > 0);
-
-	/* Log cleanup info before we touch indexes */
-	vacuum_log_cleanup_info(onerel, vacrelstats);
-
-	/* Report that we are now vacuuming indexes */
-	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
-								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
-
-	/* Perform index vacuuming with parallel workers for parallel vacuum. */
-	if (ParallelVacuumIsActive(lps))
-	{
-		/* Tell parallel workers to do index vacuuming */
-		lps->lvshared->for_cleanup = false;
-		lps->lvshared->first_time = false;
-
-		/*
-		 * We can only provide an approximate value of num_heap_tuples in
-		 * vacuum cases.
-		 */
-		lps->lvshared->reltuples = vacrelstats->old_live_tuples;
-		lps->lvshared->estimated_count = true;
-
-		lazy_parallel_vacuum_indexes(Irel, vacrelstats, lps, nindexes);
-	}
-	else
-	{
-		int			idx;
-
-		for (idx = 0; idx < nindexes; idx++)
-			lazy_vacuum_index(Irel[idx], &(vacrelstats->indstats[idx]),
-							  vacrelstats->dead_tuples,
-							  vacrelstats->old_live_tuples, vacrelstats);
-	}
-
-	/* Increase and report the number of index scans */
-	vacrelstats->num_index_scans++;
-	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
-								 vacrelstats->num_index_scans);
-}
-
-
-/*
- *	lazy_vacuum_heap() -- second pass over the heap
- *
- *		This routine marks dead tuples as unused and compacts out free
- *		space on their pages.  Pages not having dead tuples recorded from
- *		lazy_scan_heap are not visited at all.
- *
- * Note: the reason for doing this as a second pass is we cannot remove
- * the tuples until we've removed their index entries, and we want to
- * process index entry removal in batches as large as possible.
- */
-static void
-lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
-{
-	int			tupindex;
-	int			npages;
-	PGRUsage	ru0;
-	Buffer		vmbuffer = InvalidBuffer;
-	LVSavedErrInfo saved_err_info;
-
-	/* Report that we are now vacuuming the heap */
-	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
-								 PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
-
-	/* Update error traceback information */
-	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
-							 InvalidBlockNumber, InvalidOffsetNumber);
-
-	pg_rusage_init(&ru0);
-	npages = 0;
-
-	tupindex = 0;
-	while (tupindex < vacrelstats->dead_tuples->num_tuples)
-	{
-		BlockNumber tblk;
-		Buffer		buf;
-		Page		page;
-		Size		freespace;
-
-		vacuum_delay_point();
-
-		tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples->itemptrs[tupindex]);
-		vacrelstats->blkno = tblk;
-		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
-								 vac_strategy);
-		if (!ConditionalLockBufferForCleanup(buf))
-		{
-			ReleaseBuffer(buf);
-			++tupindex;
-			continue;
-		}
-		tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
-									&vmbuffer);
-
-		/* Now that we've compacted the page, record its available space */
-		page = BufferGetPage(buf);
-		freespace = PageGetHeapFreeSpace(page);
-
-		UnlockReleaseBuffer(buf);
-		RecordPageWithFreeSpace(onerel, tblk, freespace);
-		npages++;
-	}
-
-	/* Clear the block number information */
-	vacrelstats->blkno = InvalidBlockNumber;
-
-	if (BufferIsValid(vmbuffer))
-	{
-		ReleaseBuffer(vmbuffer);
-		vmbuffer = InvalidBuffer;
-	}
-
-	ereport(elevel,
-			(errmsg("\"%s\": removed %d row versions in %d pages",
-					vacrelstats->relname,
-					tupindex, npages),
-			 errdetail_internal("%s", pg_rusage_show(&ru0))));
-
-	/* Revert to the previous phase information for error traceback */
-	restore_vacuum_error_info(vacrelstats, &saved_err_info);
-}
-
-/*
- *	lazy_vacuum_page() -- free dead tuples on a page
- *					 and repair its fragmentation.
- *
- * Caller must hold pin and buffer cleanup lock on the buffer.
- *
- * tupindex is the index in vacrelstats->dead_tuples of the first dead
- * tuple for this page.  We assume the rest follow sequentially.
- * The return value is the first tupindex after the tuples of this page.
- */
-static int
-lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-				 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer)
-{
-	LVDeadTuples *dead_tuples = vacrelstats->dead_tuples;
-	Page		page = BufferGetPage(buffer);
-	OffsetNumber unused[MaxOffsetNumber];
-	int			uncnt = 0;
-	TransactionId visibility_cutoff_xid;
-	bool		all_frozen;
-	LVSavedErrInfo saved_err_info;
-
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
-
-	/* Update error traceback information */
-	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
-							 blkno, InvalidOffsetNumber);
-
-	START_CRIT_SECTION();
-
-	for (; tupindex < dead_tuples->num_tuples; tupindex++)
-	{
-		BlockNumber tblk;
-		OffsetNumber toff;
-		ItemId		itemid;
-
-		tblk = ItemPointerGetBlockNumber(&dead_tuples->itemptrs[tupindex]);
-		if (tblk != blkno)
-			break;				/* past end of tuples for this block */
-		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
-		itemid = PageGetItemId(page, toff);
-		ItemIdSetUnused(itemid);
-		unused[uncnt++] = toff;
-	}
-
-	PageRepairFragmentation(page);
-
-	/*
-	 * Mark buffer dirty before we write WAL.
-	 */
-	MarkBufferDirty(buffer);
-
-	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
-	{
-		XLogRecPtr	recptr;
-
-		recptr = log_heap_clean(onerel, buffer,
-								NULL, 0, NULL, 0,
-								unused, uncnt,
-								vacrelstats->latestRemovedXid);
-		PageSetLSN(page, recptr);
-	}
-
-	/*
-	 * End critical section, so we safely can do visibility tests (which
-	 * possibly need to perform IO and allocate memory!). If we crash now the
-	 * page (including the corresponding vm bit) might not be marked all
-	 * visible, but that's fine. A later vacuum will fix that.
-	 */
-	END_CRIT_SECTION();
-
-	/*
-	 * Now that we have removed the dead tuples from the page, once again
-	 * check if the page has become all-visible.  The page is already marked
-	 * dirty, exclusively locked, and, if needed, a full page image has been
-	 * emitted in the log_heap_clean() above.
-	 */
-	if (heap_page_is_all_visible(onerel, buffer, vacrelstats,
-								 &visibility_cutoff_xid,
-								 &all_frozen))
-		PageSetAllVisible(page);
-
-	/*
-	 * All the changes to the heap page have been done. If the all-visible
-	 * flag is now set, also set the VM all-visible bit (and, if possible, the
-	 * all-frozen bit) unless this has already been done previously.
-	 */
-	if (PageIsAllVisible(page))
-	{
-		uint8		vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
-		uint8		flags = 0;
-
-		/* Set the VM all-frozen bit to flag, if needed */
-		if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
-			flags |= VISIBILITYMAP_ALL_VISIBLE;
-		if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
-			flags |= VISIBILITYMAP_ALL_FROZEN;
-
-		Assert(BufferIsValid(*vmbuffer));
-		if (flags != 0)
-			visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr,
-							  *vmbuffer, visibility_cutoff_xid, flags);
-	}
-
-	/* Revert to the previous phase information for error traceback */
-	restore_vacuum_error_info(vacrelstats, &saved_err_info);
-	return tupindex;
-}
-
 /*
  *	lazy_check_needs_freeze() -- scan page to see if any tuples
  *					 need to be cleaned to avoid wraparound
@@ -2083,7 +1832,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
  * Also returns a flag indicating whether page contains any tuples at all.
  */
 static bool
-lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelStats *vacrelstats)
+lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
 {
 	Page		page = BufferGetPage(buf);
 	OffsetNumber offnum,
@@ -2112,7 +1861,7 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelStats *vacrelstats)
 		 * Set the offset number so that we can display it along with any
 		 * error that occurred while processing this tuple.
 		 */
-		vacrelstats->offnum = offnum;
+		vacrel->offnum = offnum;
 		itemid = PageGetItemId(page, offnum);
 
 		/* this should match hastup test in count_nondeletable_pages() */
@@ -2125,363 +1874,72 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelStats *vacrelstats)
 
 		tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
 
-		if (heap_tuple_needs_freeze(tupleheader, FreezeLimit,
-									MultiXactCutoff, buf))
+		if (heap_tuple_needs_freeze(tupleheader, vacrel->FreezeLimit,
+									vacrel->MultiXactCutoff, buf))
 			break;
 	}							/* scan along page */
 
 	/* Clear the offset information once we have processed the given page. */
-	vacrelstats->offnum = InvalidOffsetNumber;
+	vacrel->offnum = InvalidOffsetNumber;
 
 	return (offnum <= maxoff);
 }
 
 /*
- * Perform index vacuum or index cleanup with parallel workers.  This function
- * must be used by the parallel vacuum leader process.  The caller must set
- * lps->lvshared->for_cleanup to indicate whether to perform vacuum or
- * cleanup.
+ *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
  */
 static void
-lazy_parallel_vacuum_indexes(Relation *Irel, LVRelStats *vacrelstats,
-							 LVParallelState *lps, int nindexes)
+lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
-	int			nworkers;
+	Assert(vacrel->nindexes > 0);
+	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
+	Assert(MultiXactIdIsValid(vacrel->relminmxid));
 
-	Assert(!IsParallelWorker());
-	Assert(ParallelVacuumIsActive(lps));
-	Assert(nindexes > 0);
+	/* Log cleanup info before we touch indexes */
+	vacuum_log_cleanup_info(vacrel);
 
-	/* Determine the number of parallel workers to launch */
-	if (lps->lvshared->for_cleanup)
-	{
-		if (lps->lvshared->first_time)
-			nworkers = lps->nindexes_parallel_cleanup +
-				lps->nindexes_parallel_condcleanup;
-		else
-			nworkers = lps->nindexes_parallel_cleanup;
-	}
-	else
-		nworkers = lps->nindexes_parallel_bulkdel;
-
-	/* The leader process will participate */
-	nworkers--;
-
-	/*
-	 * It is possible that parallel context is initialized with fewer workers
-	 * than the number of indexes that need a separate worker in the current
-	 * phase, so we need to consider it.  See compute_parallel_vacuum_workers.
-	 */
-	nworkers = Min(nworkers, lps->pcxt->nworkers);
-
-	/* Setup the shared cost-based vacuum delay and launch workers */
-	if (nworkers > 0)
-	{
-		if (vacrelstats->num_index_scans > 0)
-		{
-			/* Reset the parallel index processing counter */
-			pg_atomic_write_u32(&(lps->lvshared->idx), 0);
-
-			/* Reinitialize the parallel context to relaunch parallel workers */
-			ReinitializeParallelDSM(lps->pcxt);
-		}
-
-		/*
-		 * Set up shared cost balance and the number of active workers for
-		 * vacuum delay.  We need to do this before launching workers as
-		 * otherwise, they might not see the updated values for these
-		 * parameters.
-		 */
-		pg_atomic_write_u32(&(lps->lvshared->cost_balance), VacuumCostBalance);
-		pg_atomic_write_u32(&(lps->lvshared->active_nworkers), 0);
-
-		/*
-		 * The number of workers can vary between bulkdelete and cleanup
-		 * phase.
-		 */
-		ReinitializeParallelWorkers(lps->pcxt, nworkers);
-
-		LaunchParallelWorkers(lps->pcxt);
-
-		if (lps->pcxt->nworkers_launched > 0)
-		{
-			/*
-			 * Reset the local cost values for leader backend as we have
-			 * already accumulated the remaining balance of heap.
-			 */
-			VacuumCostBalance = 0;
-			VacuumCostBalanceLocal = 0;
-
-			/* Enable shared cost balance for leader backend */
-			VacuumSharedCostBalance = &(lps->lvshared->cost_balance);
-			VacuumActiveNWorkers = &(lps->lvshared->active_nworkers);
-		}
-
-		if (lps->lvshared->for_cleanup)
-			ereport(elevel,
-					(errmsg(ngettext("launched %d parallel vacuum worker for index cleanup (planned: %d)",
-									 "launched %d parallel vacuum workers for index cleanup (planned: %d)",
-									 lps->pcxt->nworkers_launched),
-							lps->pcxt->nworkers_launched, nworkers)));
-		else
-			ereport(elevel,
-					(errmsg(ngettext("launched %d parallel vacuum worker for index vacuuming (planned: %d)",
-									 "launched %d parallel vacuum workers for index vacuuming (planned: %d)",
-									 lps->pcxt->nworkers_launched),
-							lps->pcxt->nworkers_launched, nworkers)));
-	}
-
-	/* Process the indexes that can be processed by only leader process */
-	vacuum_indexes_leader(Irel, vacrelstats, lps, nindexes);
-
-	/*
-	 * Join as a parallel worker.  The leader process alone processes all the
-	 * indexes in the case where no workers are launched.
-	 */
-	parallel_vacuum_index(Irel, lps->lvshared, vacrelstats->dead_tuples,
-						  nindexes, vacrelstats);
-
-	/*
-	 * Next, accumulate buffer and WAL usage.  (This must wait for the workers
-	 * to finish, or we might get incomplete data.)
-	 */
-	if (nworkers > 0)
-	{
-		int			i;
-
-		/* Wait for all vacuum workers to finish */
-		WaitForParallelWorkersToFinish(lps->pcxt);
-
-		for (i = 0; i < lps->pcxt->nworkers_launched; i++)
-			InstrAccumParallelQuery(&lps->buffer_usage[i], &lps->wal_usage[i]);
-	}
-
-	/*
-	 * Carry the shared balance value to heap scan and disable shared costing
-	 */
-	if (VacuumSharedCostBalance)
-	{
-		VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance);
-		VacuumSharedCostBalance = NULL;
-		VacuumActiveNWorkers = NULL;
-	}
-}
-
-/*
- * Index vacuum/cleanup routine used by the leader process and parallel
- * vacuum worker processes to process the indexes in parallel.
- */
-static void
-parallel_vacuum_index(Relation *Irel, LVShared *lvshared,
-					  LVDeadTuples *dead_tuples, int nindexes,
-					  LVRelStats *vacrelstats)
-{
-	/*
-	 * Increment the active worker count if we are able to launch any worker.
-	 */
-	if (VacuumActiveNWorkers)
-		pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
-
-	/* Loop until all indexes are vacuumed */
-	for (;;)
-	{
-		int			idx;
-		LVSharedIndStats *shared_indstats;
-
-		/* Get an index number to process */
-		idx = pg_atomic_fetch_add_u32(&(lvshared->idx), 1);
-
-		/* Done for all indexes? */
-		if (idx >= nindexes)
-			break;
-
-		/* Get the index statistics of this index from DSM */
-		shared_indstats = get_indstats(lvshared, idx);
-
-		/*
-		 * Skip processing indexes that don't participate in parallel
-		 * operation
-		 */
-		if (shared_indstats == NULL ||
-			skip_parallel_vacuum_index(Irel[idx], lvshared))
-			continue;
-
-		/* Do vacuum or cleanup of the index */
-		vacuum_one_index(Irel[idx], &(vacrelstats->indstats[idx]), lvshared,
-						 shared_indstats, dead_tuples, vacrelstats);
-	}
-
-	/*
-	 * We have completed the index vacuum so decrement the active worker
-	 * count.
-	 */
-	if (VacuumActiveNWorkers)
-		pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
-}
-
-/*
- * Vacuum or cleanup indexes that can be processed by only the leader process
- * because these indexes don't support parallel operation at that phase.
- */
-static void
-vacuum_indexes_leader(Relation *Irel, LVRelStats *vacrelstats,
-					  LVParallelState *lps, int nindexes)
-{
-	int			i;
-
-	Assert(!IsParallelWorker());
-
-	/*
-	 * Increment the active worker count if we are able to launch any worker.
-	 */
-	if (VacuumActiveNWorkers)
-		pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
-
-	for (i = 0; i < nindexes; i++)
-	{
-		LVSharedIndStats *shared_indstats;
-
-		shared_indstats = get_indstats(lps->lvshared, i);
-
-		/* Process the indexes skipped by parallel workers */
-		if (shared_indstats == NULL ||
-			skip_parallel_vacuum_index(Irel[i], lps->lvshared))
-			vacuum_one_index(Irel[i], &(vacrelstats->indstats[i]), lps->lvshared,
-							 shared_indstats, vacrelstats->dead_tuples,
-							 vacrelstats);
-	}
-
-	/*
-	 * We have completed the index vacuum so decrement the active worker
-	 * count.
-	 */
-	if (VacuumActiveNWorkers)
-		pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
-}
-
-/*
- * Vacuum or cleanup index either by leader process or by one of the worker
- * process.  After processing the index this function copies the index
- * statistics returned from ambulkdelete and amvacuumcleanup to the DSM
- * segment.
- */
-static void
-vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
-				 LVShared *lvshared, LVSharedIndStats *shared_indstats,
-				 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats)
-{
-	IndexBulkDeleteResult *bulkdelete_res = NULL;
-
-	if (shared_indstats)
-	{
-		/* Get the space for IndexBulkDeleteResult */
-		bulkdelete_res = &(shared_indstats->stats);
-
-		/*
-		 * Update the pointer to the corresponding bulk-deletion result if
-		 * someone has already updated it.
-		 */
-		if (shared_indstats->updated && *stats == NULL)
-			*stats = bulkdelete_res;
-	}
-
-	/* Do vacuum or cleanup of the index */
-	if (lvshared->for_cleanup)
-		lazy_cleanup_index(indrel, stats, lvshared->reltuples,
-						   lvshared->estimated_count, vacrelstats);
-	else
-		lazy_vacuum_index(indrel, stats, dead_tuples,
-						  lvshared->reltuples, vacrelstats);
-
-	/*
-	 * Copy the index bulk-deletion result returned from ambulkdelete and
-	 * amvacuumcleanup to the DSM segment if it's the first cycle because they
-	 * allocate locally and it's possible that an index will be vacuumed by a
-	 * different vacuum process the next cycle.  Copying the result normally
-	 * happens only the first time an index is vacuumed.  For any additional
-	 * vacuum pass, we directly point to the result on the DSM segment and
-	 * pass it to vacuum index APIs so that workers can update it directly.
-	 *
-	 * Since all vacuum workers write the bulk-deletion result at different
-	 * slots we can write them without locking.
-	 */
-	if (shared_indstats && !shared_indstats->updated && *stats != NULL)
-	{
-		memcpy(bulkdelete_res, *stats, sizeof(IndexBulkDeleteResult));
-		shared_indstats->updated = true;
-
-		/*
-		 * Now that stats[idx] points to the DSM segment, we don't need the
-		 * locally allocated results.
-		 */
-		pfree(*stats);
-		*stats = bulkdelete_res;
-	}
-}
-
-/*
- *	lazy_cleanup_all_indexes() -- cleanup all indexes of relation.
- *
- * Cleanup indexes.  We process the indexes serially unless we are doing
- * parallel vacuum.
- */
-static void
-lazy_cleanup_all_indexes(Relation *Irel, LVRelStats *vacrelstats,
-						 LVParallelState *lps, int nindexes)
-{
-	int			idx;
-
-	Assert(!IsParallelWorker());
-	Assert(nindexes > 0);
-
-	/* Report that we are now cleaning up indexes */
+	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
-								 PROGRESS_VACUUM_PHASE_INDEX_CLEANUP);
+								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
 
-	/*
-	 * If parallel vacuum is active we perform index cleanup with parallel
-	 * workers.
-	 */
-	if (ParallelVacuumIsActive(lps))
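+	/* No parallel vacuum state: vacuum each index serially in this backend */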
+	if (!vacrel->lps)
 	{
-		/* Tell parallel workers to do index cleanup */
-		lps->lvshared->for_cleanup = true;
-		lps->lvshared->first_time =
-			(vacrelstats->num_index_scans == 0);
+		for (int idx = 0; idx < vacrel->nindexes; idx++)
+		{
+			Relation	indrel = vacrel->indrels[idx];
+			IndexBulkDeleteResult *istat = vacrel->indstats[idx];
 
-		/*
-		 * Now we can provide a better estimate of total number of surviving
-		 * tuples (we assume indexes are more interested in that than in the
-		 * number of nominally live tuples).
-		 */
-		lps->lvshared->reltuples = vacrelstats->new_rel_tuples;
-		lps->lvshared->estimated_count =
-			(vacrelstats->tupcount_pages < vacrelstats->rel_pages);
-
-		lazy_parallel_vacuum_indexes(Irel, vacrelstats, lps, nindexes);
+			vacrel->indstats[idx] =
+				lazy_vacuum_one_index(indrel, istat, vacrel->old_live_tuples,
+									  vacrel);
+		}
 	}
 	else
 	{
-		for (idx = 0; idx < nindexes; idx++)
-			lazy_cleanup_index(Irel[idx], &(vacrelstats->indstats[idx]),
-							   vacrelstats->new_rel_tuples,
-							   vacrelstats->tupcount_pages < vacrelstats->rel_pages,
-							   vacrelstats);
+		/* Outsource everything to parallel variant */
+		do_parallel_lazy_vacuum_all_indexes(vacrel);
 	}
+
+	/* Increase and report the number of index scans */
+	vacrel->num_index_scans++;
+	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
+								 vacrel->num_index_scans);
 }
 
 /*
- *	lazy_vacuum_index() -- vacuum one index relation.
+ *	lazy_vacuum_one_index() -- vacuum index relation.
  *
  *		Delete all the index entries pointing to tuples listed in
  *		dead_tuples, and update running statistics.
  *
  *		reltuples is the number of heap tuples to be passed to the
  *		bulkdelete callback.  It's always assumed to be estimated.
+ *
+ * Returns bulk delete stats derived from input stats
  */
-static void
-lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
-				  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats)
+static IndexBulkDeleteResult *
+lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
+					  double reltuples, LVRelState *vacrel)
 {
 	IndexVacuumInfo ivinfo;
 	PGRUsage	ru0;
@@ -2495,7 +1953,7 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	ivinfo.estimated_count = true;
 	ivinfo.message_level = elevel;
 	ivinfo.num_heap_tuples = reltuples;
-	ivinfo.strategy = vac_strategy;
+	ivinfo.strategy = vacrel->bstrategy;
 
 	/*
 	 * Update error traceback information.
@@ -2503,38 +1961,76 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	 * The index name is saved during this phase and restored immediately
 	 * after this phase.  See vacuum_error_callback.
 	 */
-	Assert(vacrelstats->indname == NULL);
-	vacrelstats->indname = pstrdup(RelationGetRelationName(indrel));
-	update_vacuum_error_info(vacrelstats, &saved_err_info,
+	Assert(vacrel->indname == NULL);
+	vacrel->indname = pstrdup(RelationGetRelationName(indrel));
+	update_vacuum_error_info(vacrel, &saved_err_info,
 							 VACUUM_ERRCB_PHASE_VACUUM_INDEX,
 							 InvalidBlockNumber, InvalidOffsetNumber);
 
 	/* Do bulk deletion */
-	*stats = index_bulk_delete(&ivinfo, *stats,
-							   lazy_tid_reaped, (void *) dead_tuples);
+	istat = index_bulk_delete(&ivinfo, istat, lazy_tid_reaped,
+							  (void *) vacrel->dead_tuples);
 
 	ereport(elevel,
 			(errmsg("scanned index \"%s\" to remove %d row versions",
-					vacrelstats->indname,
-					dead_tuples->num_tuples),
+					vacrel->indname, vacrel->dead_tuples->num_tuples),
 			 errdetail_internal("%s", pg_rusage_show(&ru0))));
 
 	/* Revert to the previous phase information for error traceback */
-	restore_vacuum_error_info(vacrelstats, &saved_err_info);
-	pfree(vacrelstats->indname);
-	vacrelstats->indname = NULL;
+	restore_vacuum_error_info(vacrel, &saved_err_info);
+	pfree(vacrel->indname);
+	vacrel->indname = NULL;
+
+	return istat;
 }
 
 /*
- *	lazy_cleanup_index() -- do post-vacuum cleanup for one index relation.
+ *	lazy_cleanup_all_indexes() -- cleanup all indexes of relation.
+ */
+static void
+lazy_cleanup_all_indexes(LVRelState *vacrel)
+{
+	Assert(vacrel->nindexes > 0);
+
+	/* Report that we are now cleaning up indexes */
+	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+								 PROGRESS_VACUUM_PHASE_INDEX_CLEANUP);
+
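+	/* No parallel vacuum state: clean up each index serially in this backend */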
+	if (!vacrel->lps)
+	{
+		double		reltuples = vacrel->new_rel_tuples;
+		bool		estimated_count =
+		vacrel->tupcount_pages < vacrel->rel_pages;
+
+		for (int idx = 0; idx < vacrel->nindexes; idx++)
+		{
+			Relation	indrel = vacrel->indrels[idx];
+			IndexBulkDeleteResult *istat = vacrel->indstats[idx];
+
+			vacrel->indstats[idx] =
+				lazy_cleanup_one_index(indrel, istat, reltuples,
+									   estimated_count, vacrel);
+		}
+	}
+	else
+	{
+		/* Outsource everything to parallel variant */
+		do_parallel_lazy_cleanup_all_indexes(vacrel);
+	}
+}
+
+/*
+ *	lazy_cleanup_one_index() -- do post-vacuum cleanup for index relation.
  *
  *		reltuples is the number of heap tuples and estimated_count is true
  *		if reltuples is an estimated value.
+ *
+ * Returns bulk delete stats derived from input stats
  */
-static void
-lazy_cleanup_index(Relation indrel,
-				   IndexBulkDeleteResult **stats,
-				   double reltuples, bool estimated_count, LVRelStats *vacrelstats)
+static IndexBulkDeleteResult *
+lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
+					   double reltuples, bool estimated_count,
+					   LVRelState *vacrel)
 {
 	IndexVacuumInfo ivinfo;
 	PGRUsage	ru0;
@@ -2549,7 +2045,7 @@ lazy_cleanup_index(Relation indrel,
 	ivinfo.message_level = elevel;
 
 	ivinfo.num_heap_tuples = reltuples;
-	ivinfo.strategy = vac_strategy;
+	ivinfo.strategy = vacrel->bstrategy;
 
 	/*
 	 * Update error traceback information.
@@ -2557,35 +2053,261 @@ lazy_cleanup_index(Relation indrel,
 	 * The index name is saved during this phase and restored immediately
 	 * after this phase.  See vacuum_error_callback.
 	 */
-	Assert(vacrelstats->indname == NULL);
-	vacrelstats->indname = pstrdup(RelationGetRelationName(indrel));
-	update_vacuum_error_info(vacrelstats, &saved_err_info,
+	Assert(vacrel->indname == NULL);
+	vacrel->indname = pstrdup(RelationGetRelationName(indrel));
+	update_vacuum_error_info(vacrel, &saved_err_info,
 							 VACUUM_ERRCB_PHASE_INDEX_CLEANUP,
 							 InvalidBlockNumber, InvalidOffsetNumber);
 
-	*stats = index_vacuum_cleanup(&ivinfo, *stats);
+	istat = index_vacuum_cleanup(&ivinfo, istat);
 
-	if (*stats)
+	if (istat)
 	{
 		ereport(elevel,
 				(errmsg("index \"%s\" now contains %.0f row versions in %u pages",
 						RelationGetRelationName(indrel),
-						(*stats)->num_index_tuples,
-						(*stats)->num_pages),
+						istat->num_index_tuples,
+						istat->num_pages),
 				 errdetail("%.0f index row versions were removed.\n"
 						   "%u index pages were newly deleted.\n"
 						   "%u index pages are currently deleted, of which %u are currently reusable.\n"
 						   "%s.",
-						   (*stats)->tuples_removed,
-						   (*stats)->pages_newly_deleted,
-						   (*stats)->pages_deleted, (*stats)->pages_free,
+						   istat->tuples_removed,
+						   istat->pages_newly_deleted,
+						   istat->pages_deleted, istat->pages_free,
 						   pg_rusage_show(&ru0))));
 	}
 
 	/* Revert to the previous phase information for error traceback */
-	restore_vacuum_error_info(vacrelstats, &saved_err_info);
-	pfree(vacrelstats->indname);
-	vacrelstats->indname = NULL;
+	restore_vacuum_error_info(vacrel, &saved_err_info);
+	pfree(vacrel->indname);
+	vacrel->indname = NULL;
+
+	return istat;
+}
+
+/*
+ *	lazy_vacuum_heap_rel() -- second pass over the heap (two-pass strategy)
+ *
+ * This routine marks dead tuples as unused and compacts out free space on
+ * their pages.  Pages not having dead tuples recorded from lazy_scan_heap are
+ * not visited at all.
+ *
+ * Note: the reason for doing this as a second pass is we cannot remove the
+ * tuples until we've removed their index entries, and we want to process
+ * index entry removal in batches as large as possible.
+ */
+static void
+lazy_vacuum_heap_rel(LVRelState *vacrel)
+{
+	int			tupindex;
+	int			vacuumed_pages;
+	PGRUsage	ru0;
+	Buffer		vmbuffer = InvalidBuffer;
+	LVSavedErrInfo saved_err_info;
+
+	/* Report that we are now vacuuming the heap */
+	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+								 PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
+
+	/* Update error traceback information */
+	update_vacuum_error_info(vacrel, &saved_err_info,
+							 VACUUM_ERRCB_PHASE_VACUUM_HEAP,
+							 InvalidBlockNumber, InvalidOffsetNumber);
+
+	pg_rusage_init(&ru0);
+	vacuumed_pages = 0;
+
+	tupindex = 0;
+	while (tupindex < vacrel->dead_tuples->num_tuples)
+	{
+		BlockNumber tblk;
+		Buffer		buf;
+		Page		page;
+		Size		freespace;
+
+		vacuum_delay_point();
+
+		tblk = ItemPointerGetBlockNumber(&vacrel->dead_tuples->itemptrs[tupindex]);
+		vacrel->blkno = tblk;
+		buf = ReadBufferExtended(vacrel->onerel, MAIN_FORKNUM, tblk,
+								 RBM_NORMAL, vacrel->bstrategy);
+		if (!ConditionalLockBufferForCleanup(buf))
+		{
+			ReleaseBuffer(buf);
+			++tupindex;
+			continue;
+		}
+		tupindex = lazy_vacuum_heap_page(vacrel, tblk, buf, tupindex,
+										 &vmbuffer);
+
+		/* Now that we've compacted the page, record its available space */
+		page = BufferGetPage(buf);
+		freespace = PageGetHeapFreeSpace(page);
+
+		UnlockReleaseBuffer(buf);
+		RecordPageWithFreeSpace(vacrel->onerel, tblk, freespace);
+		vacuumed_pages++;
+	}
+
+	/* Clear the block number information */
+	vacrel->blkno = InvalidBlockNumber;
+
+	if (BufferIsValid(vmbuffer))
+	{
+		ReleaseBuffer(vmbuffer);
+		vmbuffer = InvalidBuffer;
+	}
+
+	ereport(elevel,
+			(errmsg("\"%s\": removed %d dead item identifiers in %u pages",
+					vacrel->relname, tupindex, vacuumed_pages),
+			 errdetail_internal("%s", pg_rusage_show(&ru0))));
+
+	/* Revert to the previous phase information for error traceback */
+	restore_vacuum_error_info(vacrel, &saved_err_info);
+}
+
+/*
+ *	lazy_vacuum_heap_page() -- free dead tuples on a page
+ *						  and repair its fragmentation.
+ *
+ * Caller must hold pin and buffer cleanup lock on the buffer.
+ *
+ * tupindex is the index in vacrel->dead_tuples of the first dead tuple for
+ * this page.  We assume the rest follow sequentially.  The return value is
+ * the first tupindex after the tuples of this page.
+ */
+static int
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
+					  int tupindex, Buffer *vmbuffer)
+{
+	LVDeadTuples *dead_tuples = vacrel->dead_tuples;
+	Page		page = BufferGetPage(buffer);
+	OffsetNumber unused[MaxOffsetNumber];
+	int			uncnt = 0;
+	TransactionId visibility_cutoff_xid;
+	bool		all_frozen;
+	LVSavedErrInfo saved_err_info;
+
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+
+	/* Update error traceback information */
+	update_vacuum_error_info(vacrel, &saved_err_info,
+							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
+							 InvalidOffsetNumber);
+
+	START_CRIT_SECTION();
+
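+	/* Mark this page's dead items unused; remember their offsets for WAL */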
+	for (; tupindex < dead_tuples->num_tuples; tupindex++)
+	{
+		BlockNumber tblk;
+		OffsetNumber toff;
+		ItemId		itemid;
+
+		tblk = ItemPointerGetBlockNumber(&dead_tuples->itemptrs[tupindex]);
+		if (tblk != blkno)
+			break;				/* past end of tuples for this block */
+		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
+		itemid = PageGetItemId(page, toff);
+		ItemIdSetUnused(itemid);
+		unused[uncnt++] = toff;
+	}
+
+	PageRepairFragmentation(page);
+
+	/*
+	 * Mark buffer dirty before we write WAL.
+	 */
+	MarkBufferDirty(buffer);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(vacrel->onerel))
+	{
+		XLogRecPtr	recptr;
+
+		recptr = log_heap_clean(vacrel->onerel, buffer,
+								NULL, 0, NULL, 0,
+								unused, uncnt,
+								vacrel->latestRemovedXid);
+		PageSetLSN(page, recptr);
+	}
+
+	/*
+	 * End critical section, so we safely can do visibility tests (which
+	 * possibly need to perform IO and allocate memory!). If we crash now the
+	 * page (including the corresponding vm bit) might not be marked all
+	 * visible, but that's fine. A later vacuum will fix that.
+	 */
+	END_CRIT_SECTION();
+
+	/*
+	 * Now that we have removed the dead tuples from the page, once again
+	 * check if the page has become all-visible.  The page is already marked
+	 * dirty, exclusively locked, and, if needed, a full page image has been
+	 * emitted in the log_heap_clean() above.
+	 */
+	if (heap_page_is_all_visible(vacrel, buffer, &visibility_cutoff_xid,
+								 &all_frozen))
+		PageSetAllVisible(page);
+
+	/*
+	 * All the changes to the heap page have been done. If the all-visible
+	 * flag is now set, also set the VM all-visible bit (and, if possible, the
+	 * all-frozen bit) unless this has already been done previously.
+	 */
+	if (PageIsAllVisible(page))
+	{
+		uint8		vm_status = visibilitymap_get_status(vacrel->onerel, blkno, vmbuffer);
+		uint8		flags = 0;
+
+		/* Set the VM all-frozen bit to flag, if needed */
+		if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
+			flags |= VISIBILITYMAP_ALL_VISIBLE;
+		if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
+			flags |= VISIBILITYMAP_ALL_FROZEN;
+
+		Assert(BufferIsValid(*vmbuffer));
+		if (flags != 0)
+			visibilitymap_set(vacrel->onerel, blkno, buffer, InvalidXLogRecPtr,
+							  *vmbuffer, visibility_cutoff_xid, flags);
+	}
+
+	/* Revert to the previous phase information for error traceback */
+	restore_vacuum_error_info(vacrel, &saved_err_info);
+	return tupindex;
+}
+
+/*
+ * Update index statistics in pg_class if the statistics are accurate.
+ */
+static void
+update_index_statistics(LVRelState *vacrel)
+{
+	Relation   *indrels = vacrel->indrels;
+	int			nindexes = vacrel->nindexes;
+	IndexBulkDeleteResult **indstats = vacrel->indstats;
+
+	Assert(!IsInParallelMode());
+
+	for (int idx = 0; idx < nindexes; idx++)
+	{
+		Relation	indrel = indrels[idx];
+		IndexBulkDeleteResult *istat = indstats[idx];
+
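+		/* Skip if there are no stats, or only an estimated tuple count */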
+		if (istat == NULL || istat->estimated_count)
+			continue;
+
+		/* Update index statistics */
+		vac_update_relstats(indrel,
+							istat->num_pages,
+							istat->num_index_tuples,
+							0,
+							false,
+							InvalidTransactionId,
+							InvalidMultiXactId,
+							false);
+	}
 }
 
 /*
@@ -2608,17 +2330,17 @@ lazy_cleanup_index(Relation indrel,
  * careful to depend only on fields that lazy_scan_heap updates on-the-fly.
  */
 static bool
-should_attempt_truncation(VacuumParams *params, LVRelStats *vacrelstats)
+should_attempt_truncation(LVRelState *vacrel, VacuumParams *params)
 {
 	BlockNumber possibly_freeable;
 
 	if (params->truncate == VACOPT_TERNARY_DISABLED)
 		return false;
 
-	possibly_freeable = vacrelstats->rel_pages - vacrelstats->nonempty_pages;
+	possibly_freeable = vacrel->rel_pages - vacrel->nonempty_pages;
 	if (possibly_freeable > 0 &&
 		(possibly_freeable >= REL_TRUNCATE_MINIMUM ||
-		 possibly_freeable >= vacrelstats->rel_pages / REL_TRUNCATE_FRACTION) &&
+		 possibly_freeable >= vacrel->rel_pages / REL_TRUNCATE_FRACTION) &&
 		old_snapshot_threshold < 0)
 		return true;
 	else
@@ -2629,9 +2351,10 @@ should_attempt_truncation(VacuumParams *params, LVRelStats *vacrelstats)
  * lazy_truncate_heap - try to truncate off any empty pages at the end
  */
 static void
-lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
+lazy_truncate_heap(LVRelState *vacrel)
 {
-	BlockNumber old_rel_pages = vacrelstats->rel_pages;
+	Relation	onerel = vacrel->onerel;
+	BlockNumber old_rel_pages = vacrel->rel_pages;
 	BlockNumber new_rel_pages;
 	int			lock_retry;
 
@@ -2655,7 +2378,7 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 		 * (which is quite possible considering we already hold a lower-grade
 		 * lock).
 		 */
-		vacrelstats->lock_waiter_detected = false;
+		vacrel->lock_waiter_detected = false;
 		lock_retry = 0;
 		while (true)
 		{
@@ -2675,10 +2398,10 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 				 * We failed to establish the lock in the specified number of
 				 * retries. This means we give up truncating.
 				 */
-				vacrelstats->lock_waiter_detected = true;
+				vacrel->lock_waiter_detected = true;
 				ereport(elevel,
 						(errmsg("\"%s\": stopping truncate due to conflicting lock request",
-								vacrelstats->relname)));
+								vacrel->relname)));
 				return;
 			}
 
@@ -2694,11 +2417,11 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 		if (new_rel_pages != old_rel_pages)
 		{
 			/*
-			 * Note: we intentionally don't update vacrelstats->rel_pages with
-			 * the new rel size here.  If we did, it would amount to assuming
-			 * that the new pages are empty, which is unlikely. Leaving the
-			 * numbers alone amounts to assuming that the new pages have the
-			 * same tuple density as existing ones, which is less unlikely.
+			 * Note: we intentionally don't update vacrel->rel_pages with the
+			 * new rel size here.  If we did, it would amount to assuming that
+			 * the new pages are empty, which is unlikely. Leaving the numbers
+			 * alone amounts to assuming that the new pages have the same
+			 * tuple density as existing ones, which is less unlikely.
 			 */
 			UnlockRelation(onerel, AccessExclusiveLock);
 			return;
@@ -2710,8 +2433,8 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 		 * other backends could have added tuples to these pages whilst we
 		 * were vacuuming.
 		 */
-		new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
-		vacrelstats->blkno = new_rel_pages;
+		new_rel_pages = count_nondeletable_pages(vacrel);
+		vacrel->blkno = new_rel_pages;
 
 		if (new_rel_pages >= old_rel_pages)
 		{
@@ -2739,18 +2462,18 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 		 * without also touching reltuples, since the tuple count wasn't
 		 * changed by the truncation.
 		 */
-		vacrelstats->pages_removed += old_rel_pages - new_rel_pages;
-		vacrelstats->rel_pages = new_rel_pages;
+		vacrel->pages_removed += old_rel_pages - new_rel_pages;
+		vacrel->rel_pages = new_rel_pages;
 
 		ereport(elevel,
 				(errmsg("\"%s\": truncated %u to %u pages",
-						vacrelstats->relname,
+						vacrel->relname,
 						old_rel_pages, new_rel_pages),
 				 errdetail_internal("%s",
 									pg_rusage_show(&ru0))));
 		old_rel_pages = new_rel_pages;
-	} while (new_rel_pages > vacrelstats->nonempty_pages &&
-			 vacrelstats->lock_waiter_detected);
+	} while (new_rel_pages > vacrel->nonempty_pages &&
+			 vacrel->lock_waiter_detected);
 }
 
 /*
@@ -2759,8 +2482,9 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
  * Returns number of nondeletable pages (last nonempty page + 1).
  */
 static BlockNumber
-count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
+count_nondeletable_pages(LVRelState *vacrel)
 {
+	Relation	onerel = vacrel->onerel;
 	BlockNumber blkno;
 	BlockNumber prefetchedUntil;
 	instr_time	starttime;
@@ -2774,11 +2498,11 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 	 * unsigned.)  To make the scan faster, we prefetch a few blocks at a time
 	 * in forward direction, so that OS-level readahead can kick in.
 	 */
-	blkno = vacrelstats->rel_pages;
+	blkno = vacrel->rel_pages;
 	StaticAssertStmt((PREFETCH_SIZE & (PREFETCH_SIZE - 1)) == 0,
 					 "prefetch size must be power of 2");
 	prefetchedUntil = InvalidBlockNumber;
-	while (blkno > vacrelstats->nonempty_pages)
+	while (blkno > vacrel->nonempty_pages)
 	{
 		Buffer		buf;
 		Page		page;
@@ -2809,9 +2533,9 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 				{
 					ereport(elevel,
 							(errmsg("\"%s\": suspending truncate due to conflicting lock request",
-									vacrelstats->relname)));
+									vacrel->relname)));
 
-					vacrelstats->lock_waiter_detected = true;
+					vacrel->lock_waiter_detected = true;
 					return blkno;
 				}
 				starttime = currenttime;
@@ -2842,8 +2566,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 			prefetchedUntil = prefetchStart;
 		}
 
-		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
-								 RBM_NORMAL, vac_strategy);
+		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								 vacrel->bstrategy);
 
 		/* In this phase we only need shared access to the buffer */
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
@@ -2891,7 +2615,7 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 	 * pages still are; we need not bother to look at the last known-nonempty
 	 * page.
 	 */
-	return vacrelstats->nonempty_pages;
+	return vacrel->nonempty_pages;
 }
 
 /*
@@ -2930,18 +2654,62 @@ compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
  * See the comments at the head of this file for rationale.
  */
 static void
-lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
+lazy_space_alloc(LVRelState *vacrel, int nworkers, BlockNumber nblocks)
 {
-	LVDeadTuples *dead_tuples = NULL;
+	LVDeadTuples *dead_tuples;
 	long		maxtuples;
 
-	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->useindex);
+	/*
+	 * Initialize state for a parallel vacuum.  As of now, only one worker can
+	 * be used for an index, so we invoke parallelism only if there are at
+	 * least two indexes on a table.
+	 */
+	if (nworkers >= 0 && vacrel->nindexes > 1)
+	{
+		/*
+		 * Since parallel workers cannot access data in temporary tables, we
+		 * can't perform parallel vacuum on them.
+		 */
+		if (RelationUsesLocalBuffers(vacrel->onerel))
+		{
+			/*
+			 * Give warning only if the user explicitly tries to perform a
+			 * parallel vacuum on the temporary table.
+			 */
+			if (nworkers > 0)
+				ereport(WARNING,
+						(errmsg("disabling parallel option of vacuum on \"%s\" --- cannot vacuum temporary tables in parallel",
+								vacrel->relname)));
+		}
+		else
+			vacrel->lps = begin_parallel_vacuum(vacrel, nblocks, nworkers);
+
+		/* If parallel mode started, we're done */
+		if (vacrel->lps != NULL)
+			return;
+	}
+
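+	/* Serial VACUUM case: allocate the dead tuple array in local memory */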
+	maxtuples = compute_max_dead_tuples(nblocks, vacrel->nindexes > 0);
 
 	dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
 	dead_tuples->num_tuples = 0;
 	dead_tuples->max_tuples = (int) maxtuples;
 
-	vacrelstats->dead_tuples = dead_tuples;
+	vacrel->dead_tuples = dead_tuples;
+}
+
+/* Free resources managed by lazy_space_alloc() */
+static void
+lazy_space_free(LVRelState *vacrel)
+{
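+	/* Only parallel vacuum state requires explicit cleanup here */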
+	if (!vacrel->lps)
+		return;
+
+	/*
+	 * End parallel mode before updating index statistics as we cannot write
+	 * during parallel mode.
+	 */
+	end_parallel_vacuum(vacrel);
 }
 
 /*
@@ -3039,8 +2807,7 @@ vac_cmp_itemptr(const void *left, const void *right)
  * on this page is frozen.
  */
 static bool
-heap_page_is_all_visible(Relation rel, Buffer buf,
-						 LVRelStats *vacrelstats,
+heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 						 TransactionId *visibility_cutoff_xid,
 						 bool *all_frozen)
 {
@@ -3069,7 +2836,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 		 * Set the offset number so that we can display it along with any
 		 * error that occurred while processing this tuple.
 		 */
-		vacrelstats->offnum = offnum;
+		vacrel->offnum = offnum;
 		itemid = PageGetItemId(page, offnum);
 
 		/* Unused or redirect line pointers are of no interest */
@@ -3093,9 +2860,9 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 
 		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
 		tuple.t_len = ItemIdGetLength(itemid);
-		tuple.t_tableOid = RelationGetRelid(rel);
+		tuple.t_tableOid = RelationGetRelid(vacrel->onerel);
 
-		switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
 		{
 			case HEAPTUPLE_LIVE:
 				{
@@ -3114,7 +2881,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 					 * that everyone sees it as committed?
 					 */
 					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-					if (!TransactionIdPrecedes(xmin, OldestXmin))
+					if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
 					{
 						all_visible = false;
 						*all_frozen = false;
@@ -3148,7 +2915,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 	}							/* scan along page */
 
 	/* Clear the offset information once we have processed the given page. */
-	vacrelstats->offnum = InvalidOffsetNumber;
+	vacrel->offnum = InvalidOffsetNumber;
 
 	return all_visible;
 }
@@ -3167,14 +2934,13 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
  * vacuum.
  */
 static int
-compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
+compute_parallel_vacuum_workers(LVRelState *vacrel, int nrequested,
 								bool *can_parallel_vacuum)
 {
 	int			nindexes_parallel = 0;
 	int			nindexes_parallel_bulkdel = 0;
 	int			nindexes_parallel_cleanup = 0;
 	int			parallel_workers;
-	int			i;
 
 	/*
 	 * We don't allow performing parallel operation in standalone backend or
@@ -3186,15 +2952,16 @@ compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
 	/*
 	 * Compute the number of indexes that can participate in parallel vacuum.
 	 */
-	for (i = 0; i < nindexes; i++)
+	for (int idx = 0; idx < vacrel->nindexes; idx++)
 	{
-		uint8		vacoptions = Irel[i]->rd_indam->amparallelvacuumoptions;
+		Relation	indrel = vacrel->indrels[idx];
+		uint8		vacoptions = indrel->rd_indam->amparallelvacuumoptions;
 
 		if (vacoptions == VACUUM_OPTION_NO_PARALLEL ||
-			RelationGetNumberOfBlocks(Irel[i]) < min_parallel_index_scan_size)
+			RelationGetNumberOfBlocks(indrel) < min_parallel_index_scan_size)
 			continue;
 
-		can_parallel_vacuum[i] = true;
+		can_parallel_vacuum[idx] = true;
 
 		if ((vacoptions & VACUUM_OPTION_PARALLEL_BULKDEL) != 0)
 			nindexes_parallel_bulkdel++;
@@ -3223,70 +2990,19 @@ compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
 	return parallel_workers;
 }
 
-/*
- * Initialize variables for shared index statistics, set NULL bitmap and the
- * size of stats for each index.
- */
-static void
-prepare_index_statistics(LVShared *lvshared, bool *can_parallel_vacuum,
-						 int nindexes)
-{
-	int			i;
-
-	/* Currently, we don't support parallel vacuum for autovacuum */
-	Assert(!IsAutoVacuumWorkerProcess());
-
-	/* Set NULL for all indexes */
-	memset(lvshared->bitmap, 0x00, BITMAPLEN(nindexes));
-
-	for (i = 0; i < nindexes; i++)
-	{
-		if (!can_parallel_vacuum[i])
-			continue;
-
-		/* Set NOT NULL as this index does support parallelism */
-		lvshared->bitmap[i >> 3] |= 1 << (i & 0x07);
-	}
-}
-
-/*
- * Update index statistics in pg_class if the statistics are accurate.
- */
-static void
-update_index_statistics(Relation *Irel, IndexBulkDeleteResult **stats,
-						int nindexes)
-{
-	int			i;
-
-	Assert(!IsInParallelMode());
-
-	for (i = 0; i < nindexes; i++)
-	{
-		if (stats[i] == NULL || stats[i]->estimated_count)
-			continue;
-
-		/* Update index statistics */
-		vac_update_relstats(Irel[i],
-							stats[i]->num_pages,
-							stats[i]->num_index_tuples,
-							0,
-							false,
-							InvalidTransactionId,
-							InvalidMultiXactId,
-							false);
-	}
-}
-
 /*
  * This function prepares and returns parallel vacuum state if we can launch
  * even one worker.  This function is responsible for entering parallel mode,
  * create a parallel context, and then initialize the DSM segment.
  */
 static LVParallelState *
-begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
-					  BlockNumber nblocks, int nindexes, int nrequested)
+begin_parallel_vacuum(LVRelState *vacrel, BlockNumber nblocks,
+					  int nrequested)
 {
 	LVParallelState *lps = NULL;
+	Relation	onerel = vacrel->onerel;
+	Relation   *indrels = vacrel->indrels;
+	int			nindexes = vacrel->nindexes;
 	ParallelContext *pcxt;
 	LVShared   *shared;
 	LVDeadTuples *dead_tuples;
@@ -3299,7 +3015,6 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	int			nindexes_mwm = 0;
 	int			parallel_workers = 0;
 	int			querylen;
-	int			i;
 
 	/*
 	 * A parallel vacuum must be requested and there must be indexes on the
@@ -3312,7 +3027,7 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	 * Compute the number of parallel vacuum workers to launch
 	 */
 	can_parallel_vacuum = (bool *) palloc0(sizeof(bool) * nindexes);
-	parallel_workers = compute_parallel_vacuum_workers(Irel, nindexes,
+	parallel_workers = compute_parallel_vacuum_workers(vacrel,
 													   nrequested,
 													   can_parallel_vacuum);
 
@@ -3333,9 +3048,10 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 
 	/* Estimate size for shared information -- PARALLEL_VACUUM_KEY_SHARED */
 	est_shared = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN(nindexes)));
-	for (i = 0; i < nindexes; i++)
+	for (int idx = 0; idx < nindexes; idx++)
 	{
-		uint8		vacoptions = Irel[i]->rd_indam->amparallelvacuumoptions;
+		Relation	indrel = indrels[idx];
+		uint8		vacoptions = indrel->rd_indam->amparallelvacuumoptions;
 
 		/*
 		 * Cleanup option should be either disabled, always performing in
@@ -3346,10 +3062,10 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 		Assert(vacoptions <= VACUUM_OPTION_MAX_VALID_VALUE);
 
 		/* Skip indexes that don't participate in parallel vacuum */
-		if (!can_parallel_vacuum[i])
+		if (!can_parallel_vacuum[idx])
 			continue;
 
-		if (Irel[i]->rd_indam->amusemaintenanceworkmem)
+		if (indrel->rd_indam->amusemaintenanceworkmem)
 			nindexes_mwm++;
 
 		est_shared = add_size(est_shared, sizeof(LVSharedIndStats));
@@ -3404,7 +3120,7 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	/* Prepare shared information */
 	shared = (LVShared *) shm_toc_allocate(pcxt->toc, est_shared);
 	MemSet(shared, 0, est_shared);
-	shared->relid = relid;
+	shared->onereloid = RelationGetRelid(onerel);
 	shared->elevel = elevel;
 	shared->maintenance_work_mem_worker =
 		(nindexes_mwm > 0) ?
@@ -3415,7 +3131,20 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	pg_atomic_init_u32(&(shared->active_nworkers), 0);
 	pg_atomic_init_u32(&(shared->idx), 0);
 	shared->offset = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN(nindexes)));
-	prepare_index_statistics(shared, can_parallel_vacuum, nindexes);
+
+	/*
+	 * Initialize the bitmap of shared index statistics: set the bit for each
+	 * index that participates in parallel vacuum.
+	 */
+	memset(shared->bitmap, 0x00, BITMAPLEN(nindexes));
+	for (int idx = 0; idx < nindexes; idx++)
+	{
+		if (!can_parallel_vacuum[idx])
+			continue;
+
+		/* Set NOT NULL as this index does support parallelism */
+		shared->bitmap[idx >> 3] |= 1 << (idx & 0x07);
+	}
 
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
 	lps->lvshared = shared;
@@ -3426,7 +3155,7 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	dead_tuples->num_tuples = 0;
 	MemSet(dead_tuples->itemptrs, 0, sizeof(ItemPointerData) * maxtuples);
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_TUPLES, dead_tuples);
-	vacrelstats->dead_tuples = dead_tuples;
+	vacrel->dead_tuples = dead_tuples;
 
 	/*
 	 * Allocate space for each worker's BufferUsage and WalUsage; no need to
@@ -3467,32 +3196,35 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
  * context, but that won't be safe (see ExitParallelMode).
  */
 static void
-end_parallel_vacuum(IndexBulkDeleteResult **stats, LVParallelState *lps,
-					int nindexes)
+end_parallel_vacuum(LVRelState *vacrel)
 {
-	int			i;
+	IndexBulkDeleteResult **indstats = vacrel->indstats;
+	LVParallelState *lps = vacrel->lps;
+	int			nindexes = vacrel->nindexes;
 
 	Assert(!IsParallelWorker());
 
 	/* Copy the updated statistics */
-	for (i = 0; i < nindexes; i++)
+	for (int idx = 0; idx < nindexes; idx++)
 	{
-		LVSharedIndStats *indstats = get_indstats(lps->lvshared, i);
+		LVSharedIndStats *shared_istat;
+
+		shared_istat = parallel_stats_for_idx(lps->lvshared, idx);
 
 		/*
 		 * Skip unused slot.  The statistics of this index are already stored
 		 * in local memory.
 		 */
-		if (indstats == NULL)
+		if (shared_istat == NULL)
 			continue;
 
-		if (indstats->updated)
+		if (shared_istat->updated)
 		{
-			stats[i] = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-			memcpy(stats[i], &(indstats->stats), sizeof(IndexBulkDeleteResult));
+			indstats[idx] = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+			memcpy(indstats[idx], &(shared_istat->istat), sizeof(IndexBulkDeleteResult));
 		}
 		else
-			stats[i] = NULL;
+			indstats[idx] = NULL;
 	}
 
 	DestroyParallelContext(lps->pcxt);
@@ -3500,23 +3232,364 @@ end_parallel_vacuum(IndexBulkDeleteResult **stats, LVParallelState *lps,
 
 	/* Deactivate parallel vacuum */
 	pfree(lps);
-	lps = NULL;
+	vacrel->lps = NULL;
 }
 
-/* Return the Nth index statistics or NULL */
-static LVSharedIndStats *
-get_indstats(LVShared *lvshared, int n)
+static void
+do_parallel_lazy_vacuum_all_indexes(LVRelState *vacrel)
+{
+	/* Tell parallel workers to do index vacuuming */
+	vacrel->lps->lvshared->for_cleanup = false;
+	vacrel->lps->lvshared->first_time = false;
+
+	/*
+	 * We can only provide an approximate value of num_heap_tuples in vacuum
+	 * cases.
+	 */
+	vacrel->lps->lvshared->reltuples = vacrel->old_live_tuples;
+	vacrel->lps->lvshared->estimated_count = true;
+
+	do_parallel_vacuum_or_cleanup(vacrel,
+								  vacrel->lps->nindexes_parallel_bulkdel);
+}
+
+static void
+do_parallel_lazy_cleanup_all_indexes(LVRelState *vacrel)
+{
+	int			nworkers;
+
+	/*
+	 * If parallel vacuum is active we perform index cleanup with parallel
+	 * workers.
+	 *
+	 * Tell parallel workers to do index cleanup.
+	 */
+	vacrel->lps->lvshared->for_cleanup = true;
+	vacrel->lps->lvshared->first_time = (vacrel->num_index_scans == 0);
+
+	/*
+	 * Now we can provide a better estimate of total number of surviving
+	 * tuples (we assume indexes are more interested in that than in the
+	 * number of nominally live tuples).
+	 */
+	vacrel->lps->lvshared->reltuples = vacrel->new_rel_tuples;
+	vacrel->lps->lvshared->estimated_count =
+		(vacrel->tupcount_pages < vacrel->rel_pages);
+
+	/* Determine the number of parallel workers to launch */
+	if (vacrel->lps->lvshared->first_time)
+		nworkers = vacrel->lps->nindexes_parallel_cleanup +
+			vacrel->lps->nindexes_parallel_condcleanup;
+	else
+		nworkers = vacrel->lps->nindexes_parallel_cleanup;
+
+	do_parallel_vacuum_or_cleanup(vacrel, nworkers);
+}
+
+/*
+ * Perform index vacuum or index cleanup with parallel workers.  This function
+ * must be used by the parallel vacuum leader process.  The caller must set
+ * lps->lvshared->for_cleanup to indicate whether to perform vacuum or
+ * cleanup.
+ */
+static void
+do_parallel_vacuum_or_cleanup(LVRelState *vacrel, int nworkers)
+{
+	LVParallelState *lps = vacrel->lps;
+
+	Assert(!IsParallelWorker());
+	Assert(vacrel->nindexes > 0);
+
+	/* The leader process will participate */
+	nworkers--;
+
+	/*
+	 * It is possible that parallel context is initialized with fewer workers
+	 * than the number of indexes that need a separate worker in the current
+	 * phase, so we need to consider it.  See compute_parallel_vacuum_workers.
+	 */
+	nworkers = Min(nworkers, lps->pcxt->nworkers);
+
+	/* Setup the shared cost-based vacuum delay and launch workers */
+	if (nworkers > 0)
+	{
+		if (vacrel->num_index_scans > 0)
+		{
+			/* Reset the parallel index processing counter */
+			pg_atomic_write_u32(&(lps->lvshared->idx), 0);
+
+			/* Reinitialize the parallel context to relaunch parallel workers */
+			ReinitializeParallelDSM(lps->pcxt);
+		}
+
+		/*
+		 * Set up shared cost balance and the number of active workers for
+		 * vacuum delay.  We need to do this before launching workers as
+		 * otherwise, they might not see the updated values for these
+		 * parameters.
+		 */
+		pg_atomic_write_u32(&(lps->lvshared->cost_balance), VacuumCostBalance);
+		pg_atomic_write_u32(&(lps->lvshared->active_nworkers), 0);
+
+		/*
+		 * The number of workers can vary between bulkdelete and cleanup
+		 * phase.
+		 */
+		ReinitializeParallelWorkers(lps->pcxt, nworkers);
+
+		LaunchParallelWorkers(lps->pcxt);
+
+		if (lps->pcxt->nworkers_launched > 0)
+		{
+			/*
+			 * Reset the local cost values for leader backend as we have
+			 * already accumulated the remaining balance of heap.
+			 */
+			VacuumCostBalance = 0;
+			VacuumCostBalanceLocal = 0;
+
+			/* Enable shared cost balance for leader backend */
+			VacuumSharedCostBalance = &(lps->lvshared->cost_balance);
+			VacuumActiveNWorkers = &(lps->lvshared->active_nworkers);
+		}
+
+		if (lps->lvshared->for_cleanup)
+			ereport(elevel,
+					(errmsg(ngettext("launched %d parallel vacuum worker for index cleanup (planned: %d)",
+									 "launched %d parallel vacuum workers for index cleanup (planned: %d)",
+									 lps->pcxt->nworkers_launched),
+							lps->pcxt->nworkers_launched, nworkers)));
+		else
+			ereport(elevel,
+					(errmsg(ngettext("launched %d parallel vacuum worker for index vacuuming (planned: %d)",
+									 "launched %d parallel vacuum workers for index vacuuming (planned: %d)",
+									 lps->pcxt->nworkers_launched),
+							lps->pcxt->nworkers_launched, nworkers)));
+	}
+
+	/* Process the indexes that can be processed by only leader process */
+	do_serial_processing_for_unsafe_indexes(vacrel, lps->lvshared);
+
+	/*
+	 * Join as a parallel worker.  The leader process alone processes all the
+	 * indexes in the case where no workers are launched.
+	 */
+	do_parallel_processing(vacrel, lps->lvshared);
+
+	/*
+	 * Next, accumulate buffer and WAL usage.  (This must wait for the workers
+	 * to finish, or we might get incomplete data.)
+	 */
+	if (nworkers > 0)
+	{
+		/* Wait for all vacuum workers to finish */
+		WaitForParallelWorkersToFinish(lps->pcxt);
+
+		for (int i = 0; i < lps->pcxt->nworkers_launched; i++)
+			InstrAccumParallelQuery(&lps->buffer_usage[i], &lps->wal_usage[i]);
+	}
+
+	/*
+	 * Carry the shared balance value to heap scan and disable shared costing
+	 */
+	if (VacuumSharedCostBalance)
+	{
+		VacuumCostBalance = pg_atomic_read_u32(VacuumSharedCostBalance);
+		VacuumSharedCostBalance = NULL;
+		VacuumActiveNWorkers = NULL;
+	}
+}
+
+/*
+ * Index vacuum/cleanup routine used by the leader process and parallel
+ * vacuum worker processes to process the indexes in parallel.
+ */
+static void
+do_parallel_processing(LVRelState *vacrel, LVShared *lvshared)
+{
+	/*
+	 * Increment the active worker count if we are able to launch any worker.
+	 */
+	if (VacuumActiveNWorkers)
+		pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
+
+	/* Loop until all indexes are vacuumed */
+	for (;;)
+	{
+		int			idx;
+		LVSharedIndStats *shared_istat;
+		Relation	indrel;
+		IndexBulkDeleteResult *istat;
+
+		/* Get an index number to process */
+		idx = pg_atomic_fetch_add_u32(&(lvshared->idx), 1);
+
+		/* Done for all indexes? */
+		if (idx >= vacrel->nindexes)
+			break;
+
+		/* Get the index statistics of this index from DSM */
+		shared_istat = parallel_stats_for_idx(lvshared, idx);
+
+		/* Skip indexes not participating in parallelism */
+		if (shared_istat == NULL)
+			continue;
+
+		indrel = vacrel->indrels[idx];
+
+		/*
+		 * Skip processing indexes that are unsafe for workers (these are
+		 * processed in do_serial_processing_for_unsafe_indexes() by leader)
+		 */
+		if (!parallel_processing_is_safe(indrel, lvshared))
+			continue;
+
+		/* Do vacuum or cleanup of the index */
+		istat = (vacrel->indstats[idx]);
+		vacrel->indstats[idx] = parallel_process_one_index(indrel, istat,
+														   lvshared,
+														   shared_istat,
+														   vacrel);
+	}
+
+	/*
+	 * We have completed the index vacuum so decrement the active worker
+	 * count.
+	 */
+	if (VacuumActiveNWorkers)
+		pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
+}
+
+/*
+ * Vacuum or cleanup indexes that can be processed by only the leader process
+ * because these indexes don't support parallel operation at that phase.
+ */
+static void
+do_serial_processing_for_unsafe_indexes(LVRelState *vacrel, LVShared *lvshared)
+{
+	Assert(!IsParallelWorker());
+
+	/*
+	 * Increment the active worker count if we are able to launch any worker.
+	 */
+	if (VacuumActiveNWorkers)
+		pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
+
+	for (int idx = 0; idx < vacrel->nindexes; idx++)
+	{
+		LVSharedIndStats *shared_istat;
+		Relation	indrel;
+		IndexBulkDeleteResult *istat;
+
+		shared_istat = parallel_stats_for_idx(lvshared, idx);
+
+		/* Skip already-complete indexes */
+		if (shared_istat != NULL)
+			continue;
+
+		indrel = vacrel->indrels[idx];
+
+		/*
+		 * We're only here for the unsafe indexes
+		 */
+		if (parallel_processing_is_safe(indrel, lvshared))
+			continue;
+
+		/* Do vacuum or cleanup of the index */
+		istat = (vacrel->indstats[idx]);
+		vacrel->indstats[idx] = parallel_process_one_index(indrel, istat,
+														   lvshared,
+														   shared_istat,
+														   vacrel);
+	}
+
+	/*
+	 * We have completed the index vacuum so decrement the active worker
+	 * count.
+	 */
+	if (VacuumActiveNWorkers)
+		pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1);
+}
+
+/*
+ * Vacuum or cleanup an index either by the leader process or by one of the
+ * worker processes.  After processing the index this function copies the index
+ * statistics returned from ambulkdelete and amvacuumcleanup to the DSM
+ * segment.
+ */
+static IndexBulkDeleteResult *
+parallel_process_one_index(Relation indrel,
+						   IndexBulkDeleteResult *istat,
+						   LVShared *lvshared,
+						   LVSharedIndStats *shared_istat,
+						   LVRelState *vacrel)
+{
+	IndexBulkDeleteResult *bulkdelete_res = NULL;
+
+	if (shared_istat)
+	{
+		/* Get the space for IndexBulkDeleteResult */
+		bulkdelete_res = &(shared_istat->istat);
+
+		/*
+		 * Update the pointer to the corresponding bulk-deletion result if
+		 * someone has already updated it.
+		 */
+		if (shared_istat->updated && istat == NULL)
+			istat = bulkdelete_res;
+	}
+
+	/* Do vacuum or cleanup of the index */
+	if (lvshared->for_cleanup)
+		istat = lazy_cleanup_one_index(indrel, istat, lvshared->reltuples,
+									   lvshared->estimated_count, vacrel);
+	else
+		istat = lazy_vacuum_one_index(indrel, istat, lvshared->reltuples,
+									  vacrel);
+
+	/*
+	 * Copy the index bulk-deletion result returned from ambulkdelete and
+	 * amvacuumcleanup to the DSM segment if it's the first cycle because they
+	 * allocate locally and it's possible that an index will be vacuumed by a
+	 * different vacuum process the next cycle.  Copying the result normally
+	 * happens only the first time an index is vacuumed.  For any additional
+	 * vacuum pass, we directly point to the result on the DSM segment and
+	 * pass it to vacuum index APIs so that workers can update it directly.
+	 *
+	 * Since all vacuum workers write the bulk-deletion result at different
+	 * slots we can write them without locking.
+	 */
+	if (shared_istat && !shared_istat->updated && istat != NULL)
+	{
+		memcpy(bulkdelete_res, istat, sizeof(IndexBulkDeleteResult));
+		shared_istat->updated = true;
+
+		/*
+		 * Now that top-level indstats[idx] points to the DSM segment, we
+		 * don't need the locally allocated results.
+		 */
+		pfree(istat);
+		istat = bulkdelete_res;
+	}
+
+	return istat;
+}
+
+/*
+ * Return shared memory statistics for index at offset 'getidx', if any
+ */
+static LVSharedIndStats *
+parallel_stats_for_idx(LVShared *lvshared, int getidx)
 {
-	int			i;
 	char	   *p;
 
-	if (IndStatsIsNull(lvshared, n))
+	if (IndStatsIsNull(lvshared, getidx))
 		return NULL;
 
 	p = (char *) GetSharedIndStats(lvshared);
-	for (i = 0; i < n; i++)
+	for (int idx = 0; idx < getidx; idx++)
 	{
-		if (IndStatsIsNull(lvshared, i))
+		if (IndStatsIsNull(lvshared, idx))
 			continue;
 
 		p += sizeof(LVSharedIndStats);
@@ -3526,11 +3599,11 @@ get_indstats(LVShared *lvshared, int n)
 }
 
 /*
- * Returns true, if the given index can't participate in parallel index vacuum
- * or parallel index cleanup, false, otherwise.
+ * Returns false if the given index can't participate in parallel index
+ * vacuum or parallel index cleanup
  */
 static bool
-skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared)
+parallel_processing_is_safe(Relation indrel, LVShared *lvshared)
 {
 	uint8		vacoptions = indrel->rd_indam->amparallelvacuumoptions;
 
@@ -3552,15 +3625,15 @@ skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared)
 		 */
 		if (!lvshared->first_time &&
 			((vacoptions & VACUUM_OPTION_PARALLEL_COND_CLEANUP) != 0))
-			return true;
+			return false;
 	}
 	else if ((vacoptions & VACUUM_OPTION_PARALLEL_BULKDEL) == 0)
 	{
 		/* Skip if the index does not support parallel bulk deletion */
-		return true;
+		return false;
 	}
 
-	return false;
+	return true;
 }
 
 /*
@@ -3580,7 +3653,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	WalUsage   *wal_usage;
 	int			nindexes;
 	char	   *sharedquery;
-	LVRelStats	vacrelstats;
+	LVRelState	vacrel;
 	ErrorContextCallback errcallback;
 
 	lvshared = (LVShared *) shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_SHARED,
@@ -3602,7 +3675,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	 * okay because the lock mode does not conflict among the parallel
 	 * workers.
 	 */
-	onerel = table_open(lvshared->relid, ShareUpdateExclusiveLock);
+	onerel = table_open(lvshared->onereloid, ShareUpdateExclusiveLock);
 
 	/*
 	 * Open all indexes. indrels are sorted in order by OID, which should be
@@ -3626,24 +3699,27 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	VacuumSharedCostBalance = &(lvshared->cost_balance);
 	VacuumActiveNWorkers = &(lvshared->active_nworkers);
 
-	vacrelstats.indstats = (IndexBulkDeleteResult **)
+	vacrel.onerel = onerel;
+	vacrel.indrels = indrels;
+	vacrel.nindexes = nindexes;
+	vacrel.indstats = (IndexBulkDeleteResult **)
 		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
 
 	if (lvshared->maintenance_work_mem_worker > 0)
 		maintenance_work_mem = lvshared->maintenance_work_mem_worker;
 
 	/*
-	 * Initialize vacrelstats for use as error callback arg by parallel
-	 * worker.
+	 * Initialize vacrel for use as error callback arg by parallel worker.
 	 */
-	vacrelstats.relnamespace = get_namespace_name(RelationGetNamespace(onerel));
-	vacrelstats.relname = pstrdup(RelationGetRelationName(onerel));
-	vacrelstats.indname = NULL;
-	vacrelstats.phase = VACUUM_ERRCB_PHASE_UNKNOWN; /* Not yet processing */
+	vacrel.relnamespace = get_namespace_name(RelationGetNamespace(onerel));
+	vacrel.relname = pstrdup(RelationGetRelationName(onerel));
+	vacrel.indname = NULL;
+	vacrel.phase = VACUUM_ERRCB_PHASE_UNKNOWN;	/* Not yet processing */
+	vacrel.dead_tuples = dead_tuples;
 
 	/* Setup error traceback support for ereport() */
 	errcallback.callback = vacuum_error_callback;
-	errcallback.arg = &vacrelstats;
+	errcallback.arg = &vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
 
@@ -3651,8 +3727,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	InstrStartParallelQuery();
 
 	/* Process indexes to perform vacuum/cleanup */
-	parallel_vacuum_index(indrels, lvshared, dead_tuples, nindexes,
-						  &vacrelstats);
+	do_parallel_processing(&vacrel, lvshared);
 
 	/* Report buffer/WAL usage during parallel execution */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
@@ -3665,7 +3740,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 
 	vac_close_indexes(nindexes, indrels, RowExclusiveLock);
 	table_close(onerel, ShareUpdateExclusiveLock);
-	pfree(vacrelstats.indstats);
+	pfree(vacrel.indstats);
 }
 
 /*
@@ -3674,7 +3749,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 static void
 vacuum_error_callback(void *arg)
 {
-	LVRelStats *errinfo = arg;
+	LVRelState *errinfo = arg;
 
 	switch (errinfo->phase)
 	{
@@ -3736,28 +3811,29 @@ vacuum_error_callback(void *arg)
  * the current information which can be later restored via restore_vacuum_error_info.
  */
 static void
-update_vacuum_error_info(LVRelStats *errinfo, LVSavedErrInfo *saved_err_info, int phase,
-						 BlockNumber blkno, OffsetNumber offnum)
+update_vacuum_error_info(LVRelState *vacrel, LVSavedErrInfo *saved_vacrel,
+						 int phase, BlockNumber blkno, OffsetNumber offnum)
 {
-	if (saved_err_info)
+	if (saved_vacrel)
 	{
-		saved_err_info->offnum = errinfo->offnum;
-		saved_err_info->blkno = errinfo->blkno;
-		saved_err_info->phase = errinfo->phase;
+		saved_vacrel->offnum = vacrel->offnum;
+		saved_vacrel->blkno = vacrel->blkno;
+		saved_vacrel->phase = vacrel->phase;
 	}
 
-	errinfo->blkno = blkno;
-	errinfo->offnum = offnum;
-	errinfo->phase = phase;
+	vacrel->blkno = blkno;
+	vacrel->offnum = offnum;
+	vacrel->phase = phase;
 }
 
 /*
  * Restores the vacuum information saved via a prior call to update_vacuum_error_info.
  */
 static void
-restore_vacuum_error_info(LVRelStats *errinfo, const LVSavedErrInfo *saved_err_info)
+restore_vacuum_error_info(LVRelState *vacrel,
+						  const LVSavedErrInfo *saved_vacrel)
 {
-	errinfo->blkno = saved_err_info->blkno;
-	errinfo->offnum = saved_err_info->offnum;
-	errinfo->phase = saved_err_info->phase;
+	vacrel->blkno = saved_vacrel->blkno;
+	vacrel->offnum = saved_vacrel->offnum;
+	vacrel->phase = saved_vacrel->phase;
 }
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 3d2dbed708..9b5afa12ad 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -689,7 +689,7 @@ index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap)
  */
 IndexBulkDeleteResult *
 index_bulk_delete(IndexVacuumInfo *info,
-				  IndexBulkDeleteResult *stats,
+				  IndexBulkDeleteResult *istat,
 				  IndexBulkDeleteCallback callback,
 				  void *callback_state)
 {
@@ -698,7 +698,7 @@ index_bulk_delete(IndexVacuumInfo *info,
 	RELATION_CHECKS;
 	CHECK_REL_PROCEDURE(ambulkdelete);
 
-	return indexRelation->rd_indam->ambulkdelete(info, stats,
+	return indexRelation->rd_indam->ambulkdelete(info, istat,
 												 callback, callback_state);
 }
 
@@ -710,14 +710,14 @@ index_bulk_delete(IndexVacuumInfo *info,
  */
 IndexBulkDeleteResult *
 index_vacuum_cleanup(IndexVacuumInfo *info,
-					 IndexBulkDeleteResult *stats)
+					 IndexBulkDeleteResult *istat)
 {
 	Relation	indexRelation = info->index;
 
 	RELATION_CHECKS;
 	CHECK_REL_PROCEDURE(amvacuumcleanup);
 
-	return indexRelation->rd_indam->amvacuumcleanup(info, stats);
+	return indexRelation->rd_indam->amvacuumcleanup(info, istat);
 }
 
 /* ----------------
-- 
2.27.0

v9-0003-Remove-tupgone-special-case-from-vacuumlazy.c.patchapplication/octet-stream; name=v9-0003-Remove-tupgone-special-case-from-vacuumlazy.c.patchDownload
From 99fa83ab0cfecd77b8eea924499ea37bcf95e2ad Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 28 Mar 2021 20:55:55 -0700
Subject: [PATCH v9 3/4] Remove tupgone special case from vacuumlazy.c.

Retry the call to heap_page_prune() for the buffer being pruned and
vacuumed in rare cases where there is disagreement between the first
heap_page_prune() call and VACUUM's HeapTupleSatisfiesVacuum() call.
This was possible when a concurrently aborting transaction rendered a
live tuple dead in the tiny window between the two checks.  As a result,
VACUUM's definition of dead tuples (tuples that are to be deleted from
indexes during VACUUM) is simplified: it is always LP_DEAD stub line
pointers from the first scan of the heap.  Note that in general VACUUM
may not have actually done all the pruning that rendered tuples LP_DEAD.
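
As a rough sketch of the new control flow (a toy standalone C program,
not PostgreSQL code and not part of this patch; prune_page() and
classify() merely stand in for heap_page_prune() and
HeapTupleSatisfiesVacuum(), and only the retry shape matters):

#include <stdbool.h>
#include <stdio.h>

typedef enum { ITEM_LIVE, ITEM_DEAD, ITEM_LP_DEAD } ItemState;

static bool concurrent_abort_pending = true;	/* fires exactly once */

static void
prune_page(ItemState *items, int nitems)
{
	for (int i = 0; i < nitems; i++)
		if (items[i] == ITEM_DEAD)
			items[i] = ITEM_LP_DEAD;	/* prune to a stub line pointer */
}

static ItemState
classify(ItemState *items, int i)
{
	/* Simulate an inserter aborting between the two checks */
	if (concurrent_abort_pending && items[i] == ITEM_LIVE)
	{
		concurrent_abort_pending = false;
		items[i] = ITEM_DEAD;
	}
	return items[i];
}

int
main(void)
{
	ItemState	items[] = {ITEM_LIVE, ITEM_LIVE, ITEM_DEAD};
	int			nitems = 3;
	int			lpdead_items;

retry:
	lpdead_items = 0;
	prune_page(items, nitems);
	for (int i = 0; i < nitems; i++)
	{
		ItemState	st = classify(items, i);

		if (st == ITEM_DEAD)
			goto retry;			/* disagreement with pruning: redo the page */
		if (st == ITEM_LP_DEAD)
			lpdead_items++;		/* only stub line pointers are collected */
	}
	printf("LP_DEAD items collected: %d\n", lpdead_items);
	return 0;
}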

This has the effect of decoupling index vacuuming (and heap page
vacuuming) from pruning during VACUUM's first heap pass.  The index
vacuum skipping performed by the INDEX_CLEANUP mechanism added by commit
a96c41f introduced one case where index vacuuming could be skipped, but
there are reasons to doubt that its approach was 100% robust.  Whereas
simply retrying pruning (and eliminating the tupgone steps entirely)
makes everything far simpler for heap vacuuming, and so far simpler in
general.

Heap vacuuming can now be thought of as conceptually similar to index
vacuuming and conceptually dissimilar to heap pruning.  Heap pruning now
has sole responsibility for anything involving the logical contents of
the database (e.g., managing transaction status information, recovery
conflicts, considering what to do with chains of tuples caused by
UPDATEs), whereas index vacuuming and heap vacuuming are now strictly
concerned with removing garbage tuples from a physical data structure
that backs the logical database.

This work enables INDEX_CLEANUP-style skipping of index vacuuming to be
pushed a lot further -- the decision can now be made dynamically (since
there is no question about leaving behind a dead tuple with storage due
to skipping the second heap pass/heap vacuuming).  An upcoming patch
from Masahiko Sawada will teach VACUUM to skip index vacuuming
dynamically, based on criteria involving the number of dead tuples.  The
only truly essential steps for VACUUM now all take place during the
first heap pass.  These are heap pruning and tuple freezing.  Everything
else is now an optional adjunct, at least in principle.
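
Purely as an illustration of the sort of dynamic decision this enables
(the function name and the 2% figure below are invented for this sketch
and are not taken from any posted patch):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical heuristic: skip index vacuuming when there are very few
 * LP_DEAD items relative to the size of the table. */
static bool
should_skip_index_vacuuming(int64_t lpdead_items, int64_t rel_pages)
{
	return lpdead_items < rel_pages / 50;	/* ~2%, made up for this example */
}

int
main(void)
{
	printf("%d\n", (int) should_skip_index_vacuuming(10, 10000));	/* 1: skip */
	printf("%d\n", (int) should_skip_index_vacuuming(500, 10000));	/* 0: vacuum */
	return 0;
}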

VACUUM can even change its mind about indexes (it can decide to give up
on deleting tuples from indexes).  There is no fundamental difference
between a VACUUM that decides to skip index vacuuming before it even
began, and a VACUUM that skips index vacuuming having already done a
certain amount of it.

Also remove XLOG_HEAP2_CLEANUP_INFO records.  These are no longer
necessary because we now rely entirely on heap pruning to take care of
recovery conflicts during VACUUM -- with the tupgone case gone, there is
no longer any need for extra recovery conflicts covering tuples that
still have storage (i.e. are not LP_DEAD) yet were nevertheless treated
as dead tuples by VACUUM.  Note that heap vacuuming now uses exactly the
same strategy for recovery conflicts as index vacuuming.  Both
mechanisms now completely rely on heap pruning to generate all the
recovery conflicts that they require.

Also stop acquiring a super-exclusive lock for heap pages when they're
vacuumed during VACUUM's second heap pass.  A regular exclusive lock is
enough.  This is correct because heap page vacuuming is now strictly a
matter of setting the LP_DEAD line pointers to LP_UNUSED.  No other
backend can have a pointer to a tuple located in a pinned buffer that
can be invalidated by a concurrent heap page vacuum operation.  Note
that the page is no longer defragmented during heap page vacuuming,
because that is unsafe without a super-exclusive lock.
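
A minimal standalone sketch of what the narrowed second pass now amounts
to (toy code, not PostgreSQL code -- real VACUUM manipulates ItemId
arrays in shared buffers, but the only state change is the same):

#include <stdio.h>

typedef enum { ITEM_UNUSED, ITEM_NORMAL, ITEM_DEAD_STUB } LinePointer;

int
main(void)
{
	/* A page's line pointer array as left behind by the first heap pass */
	LinePointer page[] = {ITEM_NORMAL, ITEM_DEAD_STUB, ITEM_NORMAL, ITEM_DEAD_STUB};
	int			deadoffsets[] = {1, 3};	/* TIDs recorded by the first pass */
	int			ndead = 2;

	/* The entire "heap page vacuum" step: flip recorded items to unused */
	for (int i = 0; i < ndead; i++)
		page[deadoffsets[i]] = ITEM_UNUSED;

	/* Nothing was moved, so no defragmentation is needed here */
	for (int off = 0; off < 4; off++)
		printf("item %d: %d\n", off, (int) page[off]);
	return 0;
}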

Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record
changes.

Credit for the idea of retrying pruning a page to avoid the tupgone case
goes to Andres Freund.
---
 src/include/access/heapam.h              |   2 +-
 src/include/access/heapam_xlog.h         |  41 ++--
 src/backend/access/gist/gistxlog.c       |   8 +-
 src/backend/access/hash/hash_xlog.c      |   8 +-
 src/backend/access/heap/heapam.c         | 205 +++++++++-----------
 src/backend/access/heap/pruneheap.c      |  60 +++---
 src/backend/access/heap/vacuumlazy.c     | 230 ++++++++++-------------
 src/backend/access/nbtree/nbtree.c       |   8 +-
 src/backend/access/rmgrdesc/heapdesc.c   |  32 ++--
 src/backend/replication/logical/decode.c |   4 +-
 src/tools/pgindent/typedefs.list         |   4 +-
 11 files changed, 282 insertions(+), 320 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index d803f27787..49d3193231 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -186,7 +186,7 @@ extern int	heap_page_prune(Relation relation, Buffer buffer,
 							struct GlobalVisState *vistest,
 							TransactionId old_snap_xmin,
 							TimestampTz old_snap_ts_ts,
-							bool report_stats, TransactionId *latestRemovedXid,
+							bool report_stats,
 							OffsetNumber *off_loc);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 178d49710a..d5df7c20df 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -51,9 +51,9 @@
  * these, too.
  */
 #define XLOG_HEAP2_REWRITE		0x00
-#define XLOG_HEAP2_CLEAN		0x10
-#define XLOG_HEAP2_FREEZE_PAGE	0x20
-#define XLOG_HEAP2_CLEANUP_INFO 0x30
+#define XLOG_HEAP2_PRUNE		0x10
+#define XLOG_HEAP2_VACUUM		0x20
+#define XLOG_HEAP2_FREEZE_PAGE	0x30
 #define XLOG_HEAP2_VISIBLE		0x40
 #define XLOG_HEAP2_MULTI_INSERT 0x50
 #define XLOG_HEAP2_LOCK_UPDATED 0x60
@@ -227,7 +227,8 @@ typedef struct xl_heap_update
 #define SizeOfHeapUpdate	(offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
 
 /*
- * This is what we need to know about vacuum page cleanup/redirect
+ * This is what we need to know about page pruning (both during VACUUM and
+ * during opportunistic pruning)
  *
  * The array of OffsetNumbers following the fixed part of the record contains:
  *	* for each redirected item: the item offset, then the offset redirected to
@@ -236,29 +237,32 @@ typedef struct xl_heap_update
  * The total number of OffsetNumbers is therefore 2*nredirected+ndead+nunused.
  * Note that nunused is not explicitly stored, but may be found by reference
  * to the total record length.
+ *
+ * Requires a super-exclusive lock.
  */
-typedef struct xl_heap_clean
+typedef struct xl_heap_prune
 {
 	TransactionId latestRemovedXid;
 	uint16		nredirected;
 	uint16		ndead;
 	/* OFFSET NUMBERS are in the block reference 0 */
-} xl_heap_clean;
+} xl_heap_prune;
 
-#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+#define SizeOfHeapPrune (offsetof(xl_heap_prune, ndead) + sizeof(uint16))
 
 /*
- * Cleanup_info is required in some cases during a lazy VACUUM.
- * Used for reporting the results of HeapTupleHeaderAdvanceLatestRemovedXid()
- * see vacuumlazy.c for full explanation
+ * The vacuum page record is similar to the prune record, but can only mark
+ * already dead items as unused
+ *
+ * Used by heap vacuuming only.  Does not require a super-exclusive lock.
  */
-typedef struct xl_heap_cleanup_info
+typedef struct xl_heap_vacuum
 {
-	RelFileNode node;
-	TransactionId latestRemovedXid;
-} xl_heap_cleanup_info;
+	uint16		nunused;
+	/* OFFSET NUMBERS are in the block reference 0 */
+} xl_heap_vacuum;
 
-#define SizeOfHeapCleanupInfo (sizeof(xl_heap_cleanup_info))
+#define SizeOfHeapVacuum (offsetof(xl_heap_vacuum, nunused) + sizeof(uint16))
 
 /* flags for infobits_set */
 #define XLHL_XMAX_IS_MULTI		0x01
@@ -397,13 +401,6 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
 extern void heap_xlog_logical_rewrite(XLogReaderState *r);
 
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
-										TransactionId latestRemovedXid);
-extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
-								 OffsetNumber *redirected, int nredirected,
-								 OffsetNumber *nowdead, int ndead,
-								 OffsetNumber *nowunused, int nunused,
-								 TransactionId latestRemovedXid);
 extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
 								  TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
 								  int ntuples);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1c80eae044..6464cb9281 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -184,10 +184,10 @@ gistRedoDeleteRecord(XLogReaderState *record)
 	 *
 	 * GiST delete records can conflict with standby queries.  You might think
 	 * that vacuum records would conflict as well, but we've handled that
-	 * already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual gist vacuum records on that index.
+	 * already.  XLOG_HEAP2_PRUNE records provide the highest xid cleaned by
+	 * the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual gist vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 02d9e6cdfd..af35a991fc 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -992,10 +992,10 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
 	 * Hash index records that are marked as LP_DEAD and being removed during
 	 * hash index tuple insertion can conflict with standby queries. You might
 	 * think that vacuum records would conflict as well, but we've handled
-	 * that already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual hash index vacuum records on that index.
+	 * that already.  XLOG_HEAP2_PRUNE records provide the highest xid cleaned
+	 * by the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual hash index vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 595310ba1b..9cbc161d7a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7538,7 +7538,7 @@ heap_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
 			 * must have considered the original tuple header as part of
 			 * generating its own latestRemovedXid value.
 			 *
-			 * Relying on XLOG_HEAP2_CLEAN records like this is the same
+			 * Relying on XLOG_HEAP2_PRUNE records like this is the same
 			 * strategy that index vacuuming uses in all cases.  Index VACUUM
 			 * WAL records don't even have a latestRemovedXid field of their
 			 * own for this reason.
@@ -7957,88 +7957,6 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
 	return nblocksfavorable;
 }
 
-/*
- * Perform XLogInsert to register a heap cleanup info message. These
- * messages are sent once per VACUUM and are required because
- * of the phasing of removal operations during a lazy VACUUM.
- * see comments for vacuum_log_cleanup_info().
- */
-XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
-{
-	xl_heap_cleanup_info xlrec;
-	XLogRecPtr	recptr;
-
-	xlrec.node = rnode;
-	xlrec.latestRemovedXid = latestRemovedXid;
-
-	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, SizeOfHeapCleanupInfo);
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEANUP_INFO);
-
-	return recptr;
-}
-
-/*
- * Perform XLogInsert for a heap-clean operation.  Caller must already
- * have modified the buffer and marked it dirty.
- *
- * Note: prior to Postgres 8.3, the entries in the nowunused[] array were
- * zero-based tuple indexes.  Now they are one-based like other uses
- * of OffsetNumber.
- *
- * We also include latestRemovedXid, which is the greatest XID present in
- * the removed tuples. That allows recovery processing to cancel or wait
- * for long standby queries that can still see these tuples.
- */
-XLogRecPtr
-log_heap_clean(Relation reln, Buffer buffer,
-			   OffsetNumber *redirected, int nredirected,
-			   OffsetNumber *nowdead, int ndead,
-			   OffsetNumber *nowunused, int nunused,
-			   TransactionId latestRemovedXid)
-{
-	xl_heap_clean xlrec;
-	XLogRecPtr	recptr;
-
-	/* Caller should not call me on a non-WAL-logged relation */
-	Assert(RelationNeedsWAL(reln));
-
-	xlrec.latestRemovedXid = latestRemovedXid;
-	xlrec.nredirected = nredirected;
-	xlrec.ndead = ndead;
-
-	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, SizeOfHeapClean);
-
-	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-
-	/*
-	 * The OffsetNumber arrays are not actually in the buffer, but we pretend
-	 * that they are.  When XLogInsert stores the whole buffer, the offset
-	 * arrays need not be stored too.  Note that even if all three arrays are
-	 * empty, we want to expose the buffer as a candidate for whole-page
-	 * storage, since this record type implies a defragmentation operation
-	 * even if no line pointers changed state.
-	 */
-	if (nredirected > 0)
-		XLogRegisterBufData(0, (char *) redirected,
-							nredirected * sizeof(OffsetNumber) * 2);
-
-	if (ndead > 0)
-		XLogRegisterBufData(0, (char *) nowdead,
-							ndead * sizeof(OffsetNumber));
-
-	if (nunused > 0)
-		XLogRegisterBufData(0, (char *) nowunused,
-							nunused * sizeof(OffsetNumber));
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEAN);
-
-	return recptr;
-}
-
 /*
  * Perform XLogInsert for a heap-freeze operation.  Caller must have already
  * modified the buffer and marked it dirty.
@@ -8510,34 +8428,15 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed,
 }
 
 /*
- * Handles CLEANUP_INFO
+ * Handles XLOG_HEAP2_PRUNE record type.
+ *
+ * Acquires a super-exclusive lock.
  */
 static void
-heap_xlog_cleanup_info(XLogReaderState *record)
-{
-	xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
-
-	if (InHotStandby)
-		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
-
-	/*
-	 * Actual operation is a no-op. Record type exists to provide a means for
-	 * conflict processing to occur before we begin index vacuum actions. see
-	 * vacuumlazy.c and also comments in btvacuumpage()
-	 */
-
-	/* Backup blocks are not used in cleanup_info records */
-	Assert(!XLogRecHasAnyBlockRefs(record));
-}
-
-/*
- * Handles XLOG_HEAP2_CLEAN record type
- */
-static void
-heap_xlog_clean(XLogReaderState *record)
+heap_xlog_prune(XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
-	xl_heap_clean *xlrec = (xl_heap_clean *) XLogRecGetData(record);
+	xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record);
 	Buffer		buffer;
 	RelFileNode rnode;
 	BlockNumber blkno;
@@ -8548,12 +8447,8 @@ heap_xlog_clean(XLogReaderState *record)
 	/*
 	 * We're about to remove tuples. In Hot Standby mode, ensure that there's
 	 * no queries running for which the removed tuples are still visible.
-	 *
-	 * Not all HEAP2_CLEAN records remove tuples with xids, so we only want to
-	 * conflict on the records that cause MVCC failures for user queries. If
-	 * latestRemovedXid is invalid, skip conflict processing.
 	 */
-	if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
+	if (InHotStandby)
 		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
 
 	/*
@@ -8606,7 +8501,7 @@ heap_xlog_clean(XLogReaderState *record)
 		UnlockReleaseBuffer(buffer);
 
 		/*
-		 * After cleaning records from a page, it's useful to update the FSM
+		 * After pruning records from a page, it's useful to update the FSM
 		 * about it, as it may cause the page become target for insertions
 		 * later even if vacuum decides not to visit it (which is possible if
 		 * gets marked all-visible.)
@@ -8618,6 +8513,80 @@ heap_xlog_clean(XLogReaderState *record)
 	}
 }
 
+/*
+ * Handles XLOG_HEAP2_VACUUM record type.
+ *
+ * Acquires an exclusive lock only.
+ */
+static void
+heap_xlog_vacuum(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_heap_vacuum *xlrec = (xl_heap_vacuum *) XLogRecGetData(record);
+	Buffer		buffer;
+	BlockNumber blkno;
+	XLogRedoAction action;
+
+	/*
+	 * If we have a full-page image, restore it (without using a cleanup lock)
+	 * and we're done.
+	 */
+	action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, false,
+										   &buffer);
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		page = (Page) BufferGetPage(buffer);
+		OffsetNumber *nowunused;
+		Size		datalen;
+		OffsetNumber *offnum;
+
+		nowunused = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
+
+		/* Shouldn't be a record unless there's something to do */
+		Assert(xlrec->nunused > 0);
+
+		/* Update all now-unused line pointers */
+		offnum = nowunused;
+		for (int i = 0; i < xlrec->nunused; i++)
+		{
+			OffsetNumber off = *offnum++;
+			ItemId		lp = PageGetItemId(page, off);
+
+			Assert(ItemIdIsDead(lp) && !ItemIdHasStorage(lp));
+			ItemIdSetUnused(lp);
+		}
+
+		/*
+		 * Update the page's hint bit about whether it has free pointers
+		 */
+		PageSetHasFreeLinePointers(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+
+	if (BufferIsValid(buffer))
+	{
+		Size		freespace = PageGetHeapFreeSpace(BufferGetPage(buffer));
+		RelFileNode rnode;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * After vacuuming LP_DEAD items from a page, it's useful to update
+		 * the FSM about it, as it may cause the page to become a target for
+		 * insertions later even if vacuum decides not to visit it (which is
+		 * possible if it gets marked all-visible.)
+		 *
+		 * Do this regardless of a full-page image being applied, since the
+		 * FSM data is not in the page anyway.
+		 */
+		XLogRecordPageWithFreeSpace(rnode, blkno, freespace);
+	}
+}
+
 /*
  * Replay XLOG_HEAP2_VISIBLE record.
  *
@@ -9722,15 +9691,15 @@ heap2_redo(XLogReaderState *record)
 
 	switch (info & XLOG_HEAP_OPMASK)
 	{
-		case XLOG_HEAP2_CLEAN:
-			heap_xlog_clean(record);
+		case XLOG_HEAP2_PRUNE:
+			heap_xlog_prune(record);
+			break;
+		case XLOG_HEAP2_VACUUM:
+			heap_xlog_vacuum(record);
 			break;
 		case XLOG_HEAP2_FREEZE_PAGE:
 			heap_xlog_freeze_page(record);
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			heap_xlog_cleanup_info(record);
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			heap_xlog_visible(record);
 			break;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406..f75502ca2c 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -182,13 +182,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		 */
 		if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
 		{
-			TransactionId ignore = InvalidTransactionId;	/* return value not
-															 * needed */
-
 			/* OK to prune */
 			(void) heap_page_prune(relation, buffer, vistest,
 								   limited_xmin, limited_ts,
-								   true, &ignore, NULL);
+								   true, NULL);
 		}
 
 		/* And release buffer lock */
@@ -213,8 +210,6 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * send its own new total to pgstats, and we don't want this delta applied
  * on top of that.)
  *
- * Sets latestRemovedXid for caller on return.
- *
  * off_loc is the offset location required by the caller to use in error
  * callback.
  *
@@ -225,7 +220,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 				GlobalVisState *vistest,
 				TransactionId old_snap_xmin,
 				TimestampTz old_snap_ts,
-				bool report_stats, TransactionId *latestRemovedXid,
+				bool report_stats,
 				OffsetNumber *off_loc)
 {
 	int			ndeleted = 0;
@@ -251,7 +246,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	prstate.old_snap_xmin = old_snap_xmin;
 	prstate.old_snap_ts = old_snap_ts;
 	prstate.old_snap_used = false;
-	prstate.latestRemovedXid = *latestRemovedXid;
+	prstate.latestRemovedXid = InvalidTransactionId;
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));
 
@@ -318,17 +313,41 @@ heap_page_prune(Relation relation, Buffer buffer,
 		MarkBufferDirty(buffer);
 
 		/*
-		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
+		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
 		if (RelationNeedsWAL(relation))
 		{
+			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
 
-			recptr = log_heap_clean(relation, buffer,
-									prstate.redirected, prstate.nredirected,
-									prstate.nowdead, prstate.ndead,
-									prstate.nowunused, prstate.nunused,
-									prstate.latestRemovedXid);
+			xlrec.latestRemovedXid = prstate.latestRemovedXid;
+			xlrec.nredirected = prstate.nredirected;
+			xlrec.ndead = prstate.ndead;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, SizeOfHeapPrune);
+
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+			/*
+			 * The OffsetNumber arrays are not actually in the buffer, but we
+			 * pretend that they are.  When XLogInsert stores the whole
+			 * buffer, the offset arrays need not be stored too.
+			 */
+			if (prstate.nredirected > 0)
+				XLogRegisterBufData(0, (char *) prstate.redirected,
+									prstate.nredirected *
+									sizeof(OffsetNumber) * 2);
+
+			if (prstate.ndead > 0)
+				XLogRegisterBufData(0, (char *) prstate.nowdead,
+									prstate.ndead * sizeof(OffsetNumber));
+
+			if (prstate.nunused > 0)
+				XLogRegisterBufData(0, (char *) prstate.nowunused,
+									prstate.nunused * sizeof(OffsetNumber));
+
+			recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE);
 
 			PageSetLSN(BufferGetPage(buffer), recptr);
 		}
@@ -363,8 +382,6 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (report_stats && ndeleted > prstate.ndead)
 		pgstat_update_heap_dead_tuples(relation, ndeleted - prstate.ndead);
 
-	*latestRemovedXid = prstate.latestRemovedXid;
-
 	/*
 	 * XXX Should we update the FSM information of this page ?
 	 *
@@ -809,12 +826,8 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
 
 /*
  * Perform the actual page changes needed by heap_page_prune.
- * It is expected that the caller has suitable pin and lock on the
- * buffer, and is inside a critical section.
- *
- * This is split out because it is also used by heap_xlog_clean()
- * to replay the WAL record when needed after a crash.  Note that the
- * arguments are identical to those of log_heap_clean().
+ * It is expected that the caller has a super-exclusive lock on the
+ * buffer.
  */
 void
 heap_page_prune_execute(Buffer buffer,
@@ -826,6 +839,9 @@ heap_page_prune_execute(Buffer buffer,
 	OffsetNumber *offnum;
 	int			i;
 
+	/* Shouldn't be called unless there's something to do */
+	Assert(nredirected > 0 || ndead > 0 || nunused > 0);
+
 	/* Update all redirected line pointers */
 	offnum = redirected;
 	for (i = 0; i < nredirected; i++)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a36f0afd1e..d4123048b6 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -305,7 +305,6 @@ typedef struct LVRelState
 	/* onerel's initial relfrozenxid and relminmxid */
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
-	TransactionId latestRemovedXid;
 
 	/* VACUUM operation's cutoff for pruning */
 	TransactionId OldestXmin;
@@ -402,8 +401,7 @@ static void lazy_scan_setvmbit(LVRelState *vacrel, Buffer buf,
 static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 							GlobalVisState *vistest,
 							LVPagePruneState *pageprunestate,
-							LVPageVisMapState *pagevmstate,
-							VacOptTernaryValue index_cleanup);
+							LVPageVisMapState *pagevmstate);
 static void lazy_vacuum(LVRelState *vacrel);
 static void lazy_vacuum_all_indexes(LVRelState *vacrel);
 static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -565,7 +563,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	vacrel->old_live_tuples = onerel->rd_rel->reltuples;
 	vacrel->relfrozenxid = onerel->rd_rel->relfrozenxid;
 	vacrel->relminmxid = onerel->rd_rel->relminmxid;
-	vacrel->latestRemovedXid = InvalidTransactionId;
 
 	/* Set cutoffs for entire VACUUM */
 	vacrel->OldestXmin = OldestXmin;
@@ -807,40 +804,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	}
 }
 
-/*
- * For Hot Standby we need to know the highest transaction id that will
- * be removed by any change. VACUUM proceeds in a number of passes so
- * we need to consider how each pass operates. The first phase runs
- * heap_page_prune(), which can issue XLOG_HEAP2_CLEAN records as it
- * progresses - these will have a latestRemovedXid on each record.
- * In some cases this removes all of the tuples to be removed, though
- * often we have dead tuples with index pointers so we must remember them
- * for removal in phase 3. Index records for those rows are removed
- * in phase 2 and index blocks do not have MVCC information attached.
- * So before we can allow removal of any index tuples we need to issue
- * a WAL record containing the latestRemovedXid of rows that will be
- * removed in phase three. This allows recovery queries to block at the
- * correct place, i.e. before phase two, rather than during phase three
- * which would be after the rows have become inaccessible.
- */
-static void
-vacuum_log_cleanup_info(LVRelState *vacrel)
-{
-	/*
-	 * Skip this for relations for which no WAL is to be written, or if we're
-	 * not trying to support archive recovery.
-	 */
-	if (!RelationNeedsWAL(vacrel->onerel) || !XLogIsNeeded())
-		return;
-
-	/*
-	 * No need to write the record at all unless it contains a valid value
-	 */
-	if (TransactionIdIsValid(vacrel->latestRemovedXid))
-		(void) log_heap_cleanup_info(vacrel->onerel->rd_node,
-									 vacrel->latestRemovedXid);
-}
-
 /*
  *	lazy_scan_heap() -- scan an open heap relation
  *
@@ -1276,8 +1239,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		 * Also handles tuple freezing -- considers freezing XIDs from all
 		 * tuple headers left behind following pruning.
 		 */
-		lazy_scan_prune(vacrel, buf, vistest, &pageprunestate, &pagevmstate,
-						params->index_cleanup);
+		lazy_scan_prune(vacrel, buf, vistest, &pageprunestate, &pagevmstate);
 
 		/*
 		 * Step 7 for block: Set up details for saving free space in FSM at
@@ -1717,12 +1679,32 @@ lazy_scan_setvmbit(LVRelState *vacrel, Buffer buf, Buffer vmbuffer,
  *	lazy_scan_prune() -- lazy_scan_heap() pruning and freezing.
  *
  * Caller must hold pin and buffer cleanup lock on the buffer.
+ *
+ * Prior to PostgreSQL 14 there were very rare cases where heap_page_prune()
+ * was allowed to disagree with our HeapTupleSatisfiesVacuum() call about
+ * whether or not a tuple should be considered DEAD.  This happened when an
+ * inserting transaction concurrently aborted (after our heap_page_prune()
+ * call, before our HeapTupleSatisfiesVacuum() call).  Aborted transactions
+ * have tuples that we can treat as DEAD without caring about where their
+ * tuple header XIDs are with respect to the OldestXmin cutoff.
+ *
+ * This created rare, hard to test cases -- exceptions to the general rule
+ * that TIDs that we enter into the dead_tuples array are in fact just LP_DEAD
+ * items without storage.  We had rather a lot of complexity to account for
+ * tuples that were dead, but still had storage, and so still had a tuple
+ * header with XIDs that were not quite unambiguously after the FreezeLimit
+ * cutoff.
+ *
+ * The approach we take here now is a little crude, but it's also simple and
+ * robust: we restart pruning when the race condition is detected.  This
+ * guarantees that any items that make it into the dead_tuples array are
+ * simple LP_DEAD line pointers, and that every item with tuple storage is
+ * considered as a candidate for freezing.
  */
 static void
 lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 				LVPagePruneState *pageprunestate,
-				LVPageVisMapState *pagevmstate,
-				VacOptTernaryValue index_cleanup)
+				LVPageVisMapState *pagevmstate)
 {
 	Relation	onerel = vacrel->onerel;
 	BlockNumber blkno;
@@ -1731,6 +1713,7 @@ lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 				maxoff;
 	ItemId		itemid;
 	HeapTupleData tuple;
+	HTSV_Result res;
 	int			tuples_deleted,
 				lpdead_items,
 				new_dead_tuples,
@@ -1745,6 +1728,8 @@ lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 	page = BufferGetPage(buf);
 	maxoff = PageGetMaxOffsetNumber(page);
 
+retry:
+
 	/* Initialize (or reset) page-level counters */
 	tuples_deleted = 0;
 	lpdead_items = 0;
@@ -1765,12 +1750,14 @@ lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 	 */
 	tuples_deleted = heap_page_prune(onerel, buf, vistest,
 									 InvalidTransactionId, 0, false,
-									 &vacrel->latestRemovedXid,
 									 &vacrel->offnum);
 
 	/*
 	 * Now scan the page to collect vacuumable items and check for tuples
 	 * requiring freezing.
+	 *
+	 * Note: It doesn't matter if we retry after having already set
+	 * pagevmstate.visibility_cutoff_xid -- the newest XMIN on the page
+	 * can't be missed this way.
 	 */
 	pageprunestate->hastup = false;
 	pageprunestate->has_lpdead_items = false;
@@ -1781,8 +1768,6 @@ lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 		 offnum <= maxoff;
 		 offnum = OffsetNumberNext(offnum))
 	{
-		bool		tupgone = false;
-
 		/*
 		 * Set the offset number so that we can display it along with any
 		 * error that occurred while processing this tuple.
@@ -1821,6 +1806,17 @@ lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 		tuple.t_len = ItemIdGetLength(itemid);
 		tuple.t_tableOid = RelationGetRelid(onerel);
 
+		/*
+		 * DEAD tuples are almost always pruned into LP_DEAD line pointers by
+		 * heap_page_prune(), but it's possible that the tuple state changed
+		 * since heap_page_prune() looked.  Handle that here by restarting.
+		 * (See comments at the top of function for a full explanation.)
+		 */
+		res = HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf);
+
+		if (unlikely(res == HEAPTUPLE_DEAD))
+			goto retry;
+
 		/*
 		 * The criteria for counting a tuple as live in this block need to
 		 * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
@@ -1831,42 +1827,8 @@ lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 		 * VACUUM can't run inside a transaction block, which makes some cases
 		 * impossible (e.g. in-progress insert from the same transaction).
 		 */
-		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+		switch (res)
 		{
-			case HEAPTUPLE_DEAD:
-
-				/*
-				 * Ordinarily, DEAD tuples would have been removed by
-				 * heap_page_prune(), but it's possible that the tuple state
-				 * changed since heap_page_prune() looked.  In particular an
-				 * INSERT_IN_PROGRESS tuple could have changed to DEAD if the
-				 * inserter aborted.  So this cannot be considered an error
-				 * condition.
-				 *
-				 * If the tuple is HOT-updated then it must only be removed by
-				 * a prune operation; so we keep it just as if it were
-				 * RECENTLY_DEAD.  Also, if it's a heap-only tuple, we choose
-				 * to keep it, because it'll be a lot cheaper to get rid of it
-				 * in the next pruning pass than to treat it like an indexed
-				 * tuple. Finally, if index cleanup is disabled, the second
-				 * heap pass will not execute, and the tuple will not get
-				 * removed, so we must treat it like any other dead tuple that
-				 * we choose to keep.
-				 *
-				 * If this were to happen for a tuple that actually needed to
-				 * be deleted, we'd be in trouble, because it'd possibly leave
-				 * a tuple below the relation's xmin horizon alive.
-				 * heap_prepare_freeze_tuple() is prepared to detect that case
-				 * and abort the transaction, preventing corruption.
-				 */
-				if (HeapTupleIsHotUpdated(&tuple) ||
-					HeapTupleIsHeapOnly(&tuple) ||
-					index_cleanup == VACOPT_TERNARY_DISABLED)
-					new_dead_tuples++;
-				else
-					tupgone = true; /* we can delete the tuple */
-				pageprunestate->all_visible = false;
-				break;
 			case HEAPTUPLE_LIVE:
 
 				/*
@@ -1914,7 +1876,8 @@ lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 
 				/*
 				 * If tuple is recently deleted then we must not remove it
-				 * from relation.
+				 * from relation.  (We only remove items that are LP_DEAD from
+				 * pruning.)
 				 */
 				new_dead_tuples++;
 				pageprunestate->all_visible = false;
@@ -1946,27 +1909,13 @@ lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 				break;
 		}
 
-		if (tupgone)
-		{
-			/* Pretend that this is an LP_DEAD item  */
-			deadoffsets[lpdead_items++] = offnum;
-			pageprunestate->all_visible = false;
-			pageprunestate->has_lpdead_items = true;
-
-			/* But remember it for XLOG_HEAP2_CLEANUP_INFO record */
-			HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-												   &vacrel->latestRemovedXid);
-		}
-		else
-		{
-			/*
-			 * Each non-removable tuple must be checked to see if it needs
-			 * freezing
-			 */
-			tupoffsets[num_tuples++] = offnum;
-			pageprunestate->hastup = true;
-			/* Consider pageprunestate->all_frozen below, during freezing */
-		}
+		/*
+		 * Each non-removable tuple must be checked to see if it needs
+		 * freezing
+		 */
+		tupoffsets[num_tuples++] = offnum;
+		pageprunestate->hastup = true;
+		/* Consider pageprunestate->all_frozen below, during freezing */
 	}
 
 	/*
@@ -1977,9 +1926,6 @@ lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
 	 *
 	 * Add page level counters to caller's counts, and then actually process
 	 * LP_DEAD and LP_NORMAL items.
-	 *
-	 * TODO: Remove tupgone logic entirely in next commit -- we shouldn't have
-	 * to pretend that DEAD items are LP_DEAD items.
 	 */
 	Assert(lpdead_items + num_tuples + nunused + nredirect == maxoff);
 	vacrel->offnum = InvalidOffsetNumber;
@@ -2178,9 +2124,6 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
 	Assert(MultiXactIdIsValid(vacrel->relminmxid));
 
-	/* Log cleanup info before we touch indexes */
-	vacuum_log_cleanup_info(vacrel);
-
 	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
@@ -2203,6 +2146,14 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 		do_parallel_lazy_vacuum_all_indexes(vacrel);
 	}
 
+	/*
+	 * We delete all LP_DEAD items from the first heap pass in all indexes on
+	 * each call here.  This makes the next call to lazy_vacuum_heap_rel()
+	 * safe.
+	 */
+	Assert(vacrel->num_index_scans > 0 ||
+		   vacrel->dead_tuples->num_tuples == vacrel->lpdead_items);
+
 	/* Increase and report the number of index scans */
 	vacrel->num_index_scans++;
 	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
@@ -2372,9 +2323,9 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 /*
  *	lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
  *
- * This routine marks dead tuples as unused and compacts out free space on
- * their pages.  Pages not having dead tuples recorded from lazy_scan_heap are
- * not visited at all.
+ * This routine marks LP_DEAD items in vacrel->dead_tuples array as LP_UNUSED.
+ * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
+ * at all.
  *
  * Note: the reason for doing this as a second pass is we cannot remove the
  * tuples until we've removed their index entries, and we want to process
@@ -2419,12 +2370,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 		vacrel->blkno = tblk;
 		buf = ReadBufferExtended(vacrel->onerel, MAIN_FORKNUM, tblk,
 								 RBM_NORMAL, vacrel->bstrategy);
-		if (!ConditionalLockBufferForCleanup(buf))
-		{
-			ReleaseBuffer(buf);
-			++tupindex;
-			continue;
-		}
+		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 		tupindex = lazy_vacuum_heap_page(vacrel, tblk, buf, tupindex,
 										 &vmbuffer);
 
@@ -2446,6 +2392,14 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 		vmbuffer = InvalidBuffer;
 	}
 
+	/*
+	 * We set all LP_DEAD items from the first heap pass to LP_UNUSED during
+	 * the second heap pass.  No more, no less.
+	 */
+	Assert(vacrel->num_index_scans > 1 ||
+		   (tupindex == vacrel->lpdead_items &&
+			vacuumed_pages == vacrel->lpdead_item_pages));
+
 	ereport(elevel,
 			(errmsg("\"%s\": removed %d dead item identifiers in %u pages",
 					vacrel->relname, tupindex, vacuumed_pages),
@@ -2456,14 +2410,25 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 }
 
 /*
- *	lazy_vacuum_heap_page() -- free dead tuples on a page
- *						  and repair its fragmentation.
+ *	lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
+ *						  vacrel->dead_tuples array.
  *
- * Caller must hold pin and buffer cleanup lock on the buffer.
+ * Caller must have an exclusive buffer lock on the buffer (though a
+ * super-exclusive lock is also acceptable).
  *
  * tupindex is the index in vacrel->dead_tuples of the first dead tuple for
  * this page.  We assume the rest follow sequentially.  The return value is
  * the first tupindex after the tuples of this page.
+ *
+ * Prior to PostgreSQL 14 there were rare cases where this routine had to set
+ * tuples with storage to unused.  These days it is strictly responsible for
+ * marking LP_DEAD stub line pointers as unused.  This only happens for those
+ * LP_DEAD items on the page that were determined to be LP_DEAD items back
+ * when the same page was visited by lazy_scan_prune() (i.e. those whose TID
+ * was recorded in the dead_tuples array).
+ *
+ * We cannot defragment the page here because that isn't safe while only
+ * holding an exclusive lock.
  */
 static int
 lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
@@ -2499,11 +2464,15 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 			break;				/* past end of tuples for this block */
 		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
 		itemid = PageGetItemId(page, toff);
+
+		Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
 		ItemIdSetUnused(itemid);
 		unused[uncnt++] = toff;
 	}
 
-	PageRepairFragmentation(page);
+	Assert(uncnt > 0);
+
+	PageSetHasFreeLinePointers(page);
 
 	/*
 	 * Mark buffer dirty before we write WAL.
@@ -2513,12 +2482,19 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	/* XLOG stuff */
 	if (RelationNeedsWAL(vacrel->onerel))
 	{
+		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
 
-		recptr = log_heap_clean(vacrel->onerel, buffer,
-								NULL, 0, NULL, 0,
-								unused, uncnt,
-								vacrel->latestRemovedXid);
+		xlrec.nunused = uncnt;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHeapVacuum);
+
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) unused, uncnt * sizeof(OffsetNumber));
+
+		recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VACUUM);
+
 		PageSetLSN(page, recptr);
 	}
 
@@ -2531,10 +2507,10 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	END_CRIT_SECTION();
 
 	/*
-	 * Now that we have removed the dead tuples from the page, once again
+	 * Now that we have removed the LP_DEAD items from the page, once again
 	 * check if the page has become all-visible.  The page is already marked
 	 * dirty, exclusively locked, and, if needed, a full page image has been
-	 * emitted in the log_heap_clean() above.
+	 * emitted.
 	 */
 	if (heap_page_is_all_visible(vacrel, buffer, &visibility_cutoff_xid,
 								 &all_frozen))
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 9282c9ea22..1360ab80c1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1213,10 +1213,10 @@ backtrack:
 				 * as long as the callback function only considers whether the
 				 * index tuple refers to pre-cutoff heap tuples that were
 				 * certainly already pruned away during VACUUM's initial heap
-				 * scan by the time we get here. (heapam's XLOG_HEAP2_CLEAN
-				 * and XLOG_HEAP2_CLEANUP_INFO records produce conflicts using
-				 * a latestRemovedXid value for the pointed-to heap tuples, so
-				 * there is no need to produce our own conflict now.)
+				 * scan by the time we get here. (heapam's XLOG_HEAP2_PRUNE
+				 * records produce conflicts using a latestRemovedXid value
+				 * for the pointed-to heap tuples, so there is no need to
+				 * produce our own conflict now.)
 				 *
 				 * Backends with snapshots acquired after a VACUUM starts but
 				 * before it finishes could have visibility cutoff with a
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index e60e32b935..f8b4fb901b 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -121,11 +121,21 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
 	info &= XLOG_HEAP_OPMASK;
-	if (info == XLOG_HEAP2_CLEAN)
+	if (info == XLOG_HEAP2_PRUNE)
 	{
-		xl_heap_clean *xlrec = (xl_heap_clean *) rec;
+		xl_heap_prune *xlrec = (xl_heap_prune *) rec;
 
-		appendStringInfo(buf, "latestRemovedXid %u", xlrec->latestRemovedXid);
+		/* XXX Should display implicit 'nunused' field, too */
+		appendStringInfo(buf, "latestRemovedXid %u nredirected %u ndead %u",
+						 xlrec->latestRemovedXid,
+						 xlrec->nredirected,
+						 xlrec->ndead);
+	}
+	else if (info == XLOG_HEAP2_VACUUM)
+	{
+		xl_heap_vacuum *xlrec = (xl_heap_vacuum *) rec;
+
+		appendStringInfo(buf, "nunused %u", xlrec->nunused);
 	}
 	else if (info == XLOG_HEAP2_FREEZE_PAGE)
 	{
@@ -134,12 +144,6 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "cutoff xid %u ntuples %u",
 						 xlrec->cutoff_xid, xlrec->ntuples);
 	}
-	else if (info == XLOG_HEAP2_CLEANUP_INFO)
-	{
-		xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
-
-		appendStringInfo(buf, "latestRemovedXid %u", xlrec->latestRemovedXid);
-	}
 	else if (info == XLOG_HEAP2_VISIBLE)
 	{
 		xl_heap_visible *xlrec = (xl_heap_visible *) rec;
@@ -229,15 +233,15 @@ heap2_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_HEAP2_CLEAN:
-			id = "CLEAN";
+		case XLOG_HEAP2_PRUNE:
+			id = "PRUNE";
+			break;
+		case XLOG_HEAP2_VACUUM:
+			id = "VACUUM";
 			break;
 		case XLOG_HEAP2_FREEZE_PAGE:
 			id = "FREEZE_PAGE";
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			id = "CLEANUP_INFO";
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			id = "VISIBLE";
 			break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 97be4b0f23..9aab713684 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -484,8 +484,8 @@ DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 * interested in.
 			 */
 		case XLOG_HEAP2_FREEZE_PAGE:
-		case XLOG_HEAP2_CLEAN:
-		case XLOG_HEAP2_CLEANUP_INFO:
+		case XLOG_HEAP2_PRUNE:
+		case XLOG_HEAP2_VACUUM:
 		case XLOG_HEAP2_VISIBLE:
 		case XLOG_HEAP2_LOCK_UPDATED:
 			break;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e6777e9d0..0a75dccb93 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3554,8 +3554,6 @@ xl_hash_split_complete
 xl_hash_squeeze_page
 xl_hash_update_meta_page
 xl_hash_vacuum_one_page
-xl_heap_clean
-xl_heap_cleanup_info
 xl_heap_confirm
 xl_heap_delete
 xl_heap_freeze_page
@@ -3567,9 +3565,11 @@ xl_heap_lock
 xl_heap_lock_updated
 xl_heap_multi_insert
 xl_heap_new_cid
+xl_heap_prune
 xl_heap_rewrite_mapping
 xl_heap_truncate
 xl_heap_update
+xl_heap_vacuum
 xl_heap_visible
 xl_invalid_page
 xl_invalid_page_key
-- 
2.27.0
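
To make the new WAL record concrete, here is a rough sketch of what redo
of XLOG_HEAP2_VACUUM could look like, mirroring how lazy_vacuum_heap_page()
registers the record above (set the listed LP_DEAD offsets to LP_UNUSED and
set the free-line-pointer hint; no defragmentation, and no conflict handling
since only LP_DEAD stubs are touched).  The function name heap_xlog_vacuum
and the exact layout of xl_heap_vacuum are assumptions here, not something
shown in this patch, and the sketch presumes heapam.c's usual includes:

static void
heap_xlog_vacuum(XLogReaderState *record)
{
	XLogRecPtr	lsn = record->EndRecPtr;
	xl_heap_vacuum *xlrec = (xl_heap_vacuum *) XLogRecGetData(record);
	Buffer		buffer;

	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
	{
		Page		page = BufferGetPage(buffer);
		OffsetNumber *unused;
		Size		datalen;

		/* The unused offsets were registered as block 0's data */
		unused = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
		Assert(datalen == xlrec->nunused * sizeof(OffsetNumber));

		for (int i = 0; i < xlrec->nunused; i++)
			ItemIdSetUnused(PageGetItemId(page, unused[i]));

		/* Mirror the primary: no PageRepairFragmentation(), just the hint */
		PageSetHasFreeLinePointers(page);

		PageSetLSN(page, lsn);
		MarkBufferDirty(buffer);
	}
	if (BufferIsValid(buffer))
		UnlockReleaseBuffer(buffer);
}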

v9-0004-Bypass-index-vacuuming-in-some-cases.patchapplication/octet-stream; name=v9-0004-Bypass-index-vacuuming-in-some-cases.patchDownload
From 87eded0876c21ea512906200a4a95c2586ba3153 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 28 Mar 2021 20:55:55 -0700
Subject: [PATCH v9 4/4] Bypass index vacuuming in some cases.

Bypass index vacuuming in two cases: the case where there are so few
dead tuples that index vacuuming seems unnecessary, and the case where
the relfrozenxid of the table being vacuumed is dangerously far in the
past.

This commit adds new GUC parameters vacuum_skip_index_age and
vacuum_multixact_skip_index_age that specify the age at which VACUUM
should skip index vacuuming so that it can finish quickly and advance
relfrozenxid/relminmxid.

After each round of index vacuuming (in the non-parallel vacuum case),
we check whether the table's relfrozenxid/relminmxid are too old
according to these new GUC parameters.  If so, we skip further index
vacuuming within the vacuum operation.

This behavior is intended to deal with the risk of XID wraparound; the
default values are accordingly very high, at 1.8 billion.

Although users can set these parameters, VACUUM will silently adjust
the effective value to at least 105% of
autovacuum_freeze_max_age/autovacuum_multixact_freeze_max_age, so that
only anti-wraparound autovacuums and aggressive scans have a chance to
skip index vacuuming.
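
(To illustrate the clamp with the defaults in this patch:
autovacuum_freeze_max_age defaults to 200 million and
vacuum_skip_index_age to 1.8 billion, so the effective cutoff is
Max(1800000000, 200000000 * 1.05) = 1.8 billion XIDs.  Index vacuuming
is therefore only ever skipped long after an anti-wraparound autovacuum
has already been launched.)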
---
 src/include/commands/vacuum.h                 |   4 +
 src/backend/access/heap/vacuumlazy.c          | 264 ++++++++++++++++--
 src/backend/commands/vacuum.c                 |  61 ++++
 src/backend/utils/misc/guc.c                  |  25 +-
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 doc/src/sgml/config.sgml                      |  51 ++++
 doc/src/sgml/maintenance.sgml                 |  10 +-
 7 files changed, 397 insertions(+), 20 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d029da5ac0..d3d44d9bac 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -235,6 +235,8 @@ extern int	vacuum_freeze_min_age;
 extern int	vacuum_freeze_table_age;
 extern int	vacuum_multixact_freeze_min_age;
 extern int	vacuum_multixact_freeze_table_age;
+extern int	vacuum_skip_index_age;
+extern int	vacuum_multixact_skip_index_age;
 
 /* Variables for cost-based parallel vacuum */
 extern pg_atomic_uint32 *VacuumSharedCostBalance;
@@ -270,6 +272,8 @@ extern void vacuum_set_xid_limits(Relation rel,
 								  TransactionId *xidFullScanLimit,
 								  MultiXactId *multiXactCutoff,
 								  MultiXactId *mxactFullScanLimit);
+extern bool vacuum_xid_limit_emergency(TransactionId relfrozenxid,
+									   MultiXactId   relminmxid);
 extern void vac_update_datfrozenxid(void);
 extern void vacuum_delay_point(void);
 extern bool vacuum_is_relation_owner(Oid relid, Form_pg_class reltuple,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d4123048b6..90630d109e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -103,6 +103,14 @@
 #define VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL		50	/* ms */
 #define VACUUM_TRUNCATE_LOCK_TIMEOUT			5000	/* ms */
 
+/*
+ * Threshold that controls whether we bypass index vacuuming and heap
+ * vacuuming.  When we're under the threshold they're deemed unnecessary.
+ * BYPASS_THRESHOLD_NPAGES is applied as a multiplier on the table's rel_pages
+ * for those pages known to contain one or more LP_DEAD items.
+ */
+#define BYPASS_THRESHOLD_NPAGES	0.02	/* i.e. 2% of rel_pages */
+
 /*
  * When a table has no indexes, vacuum the FSM after every 8GB, approximately
  * (it won't be exact because we only vacuum FSM after processing a heap page
@@ -402,8 +410,8 @@ static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 							GlobalVisState *vistest,
 							LVPagePruneState *pageprunestate,
 							LVPageVisMapState *pagevmstate);
-static void lazy_vacuum(LVRelState *vacrel);
-static void lazy_vacuum_all_indexes(LVRelState *vacrel);
+static void lazy_vacuum(LVRelState *vacrel, bool onecall);
+static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
 													IndexBulkDeleteResult *istat,
 													double reltuples,
@@ -752,6 +760,31 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 							 (long long) VacuumPageHit,
 							 (long long) VacuumPageMiss,
 							 (long long) VacuumPageDirty);
+			if (vacrel->rel_pages > 0)
+			{
+				if (vacrel->do_index_vacuuming)
+				{
+					if (vacrel->num_index_scans == 0)
+						appendStringInfo(&buf, _("index scan not needed:"));
+					else
+						appendStringInfo(&buf, _("index scan needed:"));
+					msgfmt = _(" %u pages from table (%.2f%% of total) had %lld dead item identifiers removed\n");
+				}
+				else
+				{
+					Assert(vacrel->nindexes > 0);
+
+					if (vacrel->do_index_cleanup)
+						appendStringInfo(&buf, _("index scan bypassed:"));
+					else
+						appendStringInfo(&buf, _("index scan bypassed due to emergency:"));
+					msgfmt = _(" %u pages from table (%.2f%% of total) have %lld dead item identifiers\n");
+				}
+				appendStringInfo(&buf, msgfmt,
+								 vacrel->lpdead_item_pages,
+								 100.0 * vacrel->lpdead_item_pages / vacrel->rel_pages,
+								 (long long) vacrel->lpdead_items);
+			}
 			for (int i = 0; i < vacrel->nindexes; i++)
 			{
 				IndexBulkDeleteResult *istat = vacrel->indstats[i];
@@ -842,7 +875,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				next_fsm_block_to_vacuum;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		skipping_blocks;
+	bool		skipping_blocks,
+				have_vacuumed_indexes = false;
 	StringInfoData buf;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
@@ -1109,11 +1143,22 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			}
 
 			/* Remove the collected garbage tuples from table and indexes */
-			lazy_vacuum(vacrel);
+			lazy_vacuum(vacrel, false);
+			have_vacuumed_indexes = true;
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
 			 * upper-level FSM pages.  Note we have not yet processed blkno.
+			 *
+			 * Note also that it's possible that the call to lazy_vacuum()
+			 * decided to end index vacuuming due to an emergency (though not
+			 * for any other reason).  When that happens we can miss out on
+			 * some of the free space that we originally expected to be able
+			 * to pick up within lazy_vacuum_heap_rel().
+			 *
+			 * We do at least start saving free space eagerly from this point
+			 * on should this happen.  That is, we set 'savefreespace' from
+			 * here on (just like the single heap pass/"nindexes == 0" case).
 			 */
 			FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum,
 									blkno);
@@ -1257,7 +1302,15 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		if (vacrel->nindexes > 0 && pageprunestate.has_lpdead_items &&
 			vacrel->do_index_vacuuming)
 		{
-			/* Wait until lazy_vacuum_heap_rel() to save free space */
+			/*
+			 * Wait until lazy_vacuum_heap_rel() to save free space.
+			 *
+			 * Note: It's not in fact 100% certain that we really will call
+			 * lazy_vacuum_heap_rel() -- lazy_vacuum() might opt to skip index
+			 * vacuuming (and so must skip heap vacuuming).  This is deemed
+			 * okay because it only happens in emergencies, or when there is
+			 * very little free space anyway.
+			 */
 		}
 		else
 		{
@@ -1356,13 +1409,12 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	}
 
 	/* If any tuples need to be deleted, perform final vacuum cycle */
-	/* XXX put a threshold on min number of tuples here? */
 	if (dead_tuples->num_tuples > 0)
-		lazy_vacuum(vacrel);
+		lazy_vacuum(vacrel, !have_vacuumed_indexes);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
-	 * not there were indexes.
+	 * not there were indexes, and whether or not we bypassed index vacuuming.
 	 */
 	if (blkno > next_fsm_block_to_vacuum)
 		FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum,
@@ -1386,6 +1438,16 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	 * If table has no indexes and at least one heap pages was vacuumed, make
 	 * log report that lazy_vacuum_heap_rel would've made had there been
 	 * indexes (having indexes implies using the two pass strategy).
+	 *
+	 * We deliberately don't do this in the case where there are indexes but
+	 * index vacuuming was bypassed.  We make a similar report at the point
+	 * that index vacuuming is bypassed, but that's actually quite different
+	 * in one important sense: it shows information about work we _haven't_
+	 * done.
+	 *
+	 * log_autovacuum output does things differently; it consistently presents
+	 * information about LP_DEAD items for the VACUUM as a whole.  We always
+	 * report on each round of index and heap vacuuming separately, though.
 	 */
 	if (vacrel->nindexes == 0 && vacrel->lpdead_item_pages > 0)
 		ereport(elevel,
@@ -2084,10 +2146,19 @@ retry:
 
 /*
  * Remove the collected garbage tuples from the table and its indexes.
+ *
+ * We may choose to bypass index vacuuming at this point.
+ *
+ * In rare emergencies, the ongoing VACUUM operation can be made to skip both
+ * index vacuuming and index cleanup at the point we're called.  This avoids
+ * having the whole system refuse to allocate further XIDs/MultiXactIds due to
+ * wraparound.
  */
 static void
-lazy_vacuum(LVRelState *vacrel)
+lazy_vacuum(LVRelState *vacrel, bool onecall)
 {
+	bool		do_bypass_optimization;
+
 	/* Should not end up here with no indexes */
 	Assert(vacrel->nindexes > 0);
 	Assert(!IsParallelWorker());
@@ -2100,11 +2171,139 @@ lazy_vacuum(LVRelState *vacrel)
 		return;
 	}
 
-	/* Okay, we're going to do index vacuuming */
-	lazy_vacuum_all_indexes(vacrel);
+	/*
+	 * Consider bypassing index vacuuming (and heap vacuuming) entirely.
+	 *
+	 * It's far from clear how we might assess the point at which bypassing
+	 * index vacuuming starts to make sense.  But it is at least clear that
+	 * VACUUM should not go ahead with index vacuuming in certain extreme
+	 * (though still fairly common) cases.  These are the cases where we have
+	 * _close to_ zero LP_DEAD items/TIDs to delete from indexes.  It would be
+	 * totally arbitrary to perform a round of full index scans in that case,
+	 * while not also doing the same thing when we happen to have _precisely_
+	 * zero TIDs -- so we do neither.  This avoids sharp discontinuities in
+	 * the duration and overhead of successive VACUUM operations that run
+	 * against the same table with the same workload.
+	 *
+	 * Our approach is to bypass index vacuuming only when there are very few
+	 * heap pages with dead items.  Even then, it must be the first and last
+	 * call here for the VACUUM.  We never apply the optimization when
+	 * multiple index scans will be required -- we cannot accumulate "debt"
+	 * without bound.
+	 *
+	 * Because the threshold is expressed in heap pages rather than items, we
+	 * give less weight to items concentrated in few heap pages.  Concentrated
+	 * build-up of LP_DEAD items tends to occur with workloads that have
+	 * non-HOT updates that affect the same logical rows again and again.  It
+	 * is probably not possible for us to keep the visibility map bits for
+	 * these pages set for a useful amount of time anyway.
+	 *
+	 * We apply one further check: the space currently used to store the TIDs
+	 * (the TIDs that tie back to the index tuples we're thinking about not
+	 * deleting this time around) must not exceed 64MB.  This limits the risk
+	 * that we will bypass index vacuuming again and again until eventually
+	 * there is a VACUUM whose dead_tuples space is not resident in L3 cache.
+	 *
+	 * We can be conservative about avoiding eventually reaching some kind of
+	 * cliff edge while still avoiding almost all truly unnecessary index
+	 * vacuuming.
+	 */
+	do_bypass_optimization = false;
+	if (onecall && vacrel->rel_pages > 0)
+	{
+		BlockNumber threshold;
 
-	/* Remove tuples from heap */
-	lazy_vacuum_heap_rel(vacrel);
+		Assert(vacrel->num_index_scans == 0);
+		Assert(vacrel->lpdead_items == vacrel->dead_tuples->num_tuples);
+		Assert(vacrel->do_index_vacuuming);
+		Assert(vacrel->do_index_cleanup);
+
+		threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_NPAGES;
+
+		do_bypass_optimization =
+				(vacrel->lpdead_item_pages < threshold &&
+				 vacrel->lpdead_items < MAXDEADTUPLES(64L * 1024L * 1024L));
+	}
+
+	if (do_bypass_optimization)
+	{
+		/*
+		 * Bypass index vacuuming.
+		 *
+		 * Since VACUUM aims to behave as if there were precisely zero index
+		 * tuples, even when there are actually slightly more than zero, we
+		 * will still do index cleanup.  This is expected to have practically
+		 * no overhead with tables where bypassing index vacuuming helps.
+		 */
+		vacrel->do_index_vacuuming = false;
+		ereport(elevel,
+				(errmsg("\"%s\": index scan bypassed: %u pages from table (%.2f%% of total) have %lld dead item identifiers",
+						vacrel->relname, vacrel->lpdead_item_pages,
+						100.0 * vacrel->lpdead_item_pages / vacrel->rel_pages,
+						(long long) vacrel->lpdead_items)));
+	}
+	else if (lazy_vacuum_all_indexes(vacrel))
+	{
+		/*
+		 * We successfully completed a round of index vacuuming.  Do related
+		 * heap vacuuming now.
+		 *
+		 * There will be no calls to vacuum_xid_limit_emergency() to check for
+		 * issues with the age of the table's relfrozenxid unless and until
+		 * there is another call here -- heap vacuuming doesn't do that. This
+		 * should be okay, because the cost of a round of heap vacuuming is
+		 * far more predictable: it scales with the number of heap pages to
+		 * visit, and is unaffected by the total number of indexes.
+		 */
+		lazy_vacuum_heap_rel(vacrel);
+	}
+	else
+	{
+		/*
+		 * Emergency case:  We attempted a round of index vacuuming but did
+		 * not finish it (or at least not a round that reliably deleted
+		 * tuples from all of the table's indexes).  This happens
+		 * when the table's relfrozenxid is too far in the past.
+		 *
+		 * From this point on the VACUUM operation will do no further index
+		 * vacuuming or heap vacuuming.  It will do any remaining pruning that
+		 * is required, plus other heap-related and relation-level maintenance
+		 * tasks.  But that's it.  We also disable a cost delay when a delay
+		 * is in effect.
+		 *
+		 * Note that we deliberately don't vary our behavior based on factors
+		 * like whether or not the ongoing VACUUM is aggressive.  If it's not
+		 * aggressive we probably won't be able to advance relfrozenxid during
+		 * this VACUUM.  If we can't, then an anti-wraparound VACUUM should
+		 * take place immediately after we finish up.  We should be able to
+		 * bypass all index vacuuming for the later anti-wraparound VACUUM.
+		 */
+		Assert(vacrel->do_index_vacuuming);
+		Assert(vacrel->do_index_cleanup);
+
+		vacrel->do_index_vacuuming = false;
+		vacrel->do_index_cleanup = false;
+		ereport(WARNING,
+				(errmsg("abandoned index vacuuming of table \"%s.%s.%s\" as a fail safe after %d index scans",
+						get_database_name(MyDatabaseId),
+						vacrel->relname,
+						vacrel->relnamespace,
+						vacrel->num_index_scans),
+				 errdetail("table's relfrozenxid or relminmxid is too far in the past"),
+				 errhint("Consider increasing configuration parameter \"maintenance_work_mem\" or \"autovacuum_work_mem\".\n"
+						 "You might also need to consider other ways for VACUUM to keep up with the allocation of transaction IDs.")));
+
+		/* Stop applying cost limits from this point on */
+		VacuumCostActive = false;
+		VacuumCostBalance = 0;
+	}
+
+	/*
+	 * TODO:
+	 *
+	 * Call lazy_space_free() and arrange to stop even recording TIDs (i.e.
+	 * make lazy_record_dead_item() into a no-op)
+	 */
 
 	/*
 	 * Forget the now-vacuumed tuples -- just press on
@@ -2114,16 +2313,30 @@ lazy_vacuum(LVRelState *vacrel)
 
 /*
  *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
+ *
+ * Returns true in the common case when all indexes were successfully
+ * vacuumed.  Returns false in rare cases where we determined that the ongoing
+ * VACUUM operation is at risk of taking too long to finish, leading to
+ * wraparound failure.
  */
-static void
+static bool
 lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
+	bool		allindexes = true;
+
 	Assert(vacrel->nindexes > 0);
 	Assert(vacrel->do_index_vacuuming);
 	Assert(vacrel->do_index_cleanup);
 	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
 	Assert(MultiXactIdIsValid(vacrel->relminmxid));
 
+	/* Precheck for XID wraparound emergencies */
+	if (vacuum_xid_limit_emergency(vacrel->relfrozenxid, vacrel->relminmxid))
+	{
+		/* Wraparound emergency -- don't even start an index scan */
+		return false;
+	}
+
 	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
@@ -2138,26 +2351,43 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 			vacrel->indstats[idx] =
 				lazy_vacuum_one_index(indrel, istat, vacrel->old_live_tuples,
 									  vacrel);
+
+			if (vacuum_xid_limit_emergency(vacrel->relfrozenxid,
+										   vacrel->relminmxid))
+			{
+				/* Wraparound emergency -- end current index scan */
+				allindexes = false;
+				break;
+			}
 		}
 	}
 	else
 	{
+		/* Note: parallel VACUUM only gets the precheck */
+		allindexes = true;
+
 		/* Outsource everything to parallel variant */
 		do_parallel_lazy_vacuum_all_indexes(vacrel);
 	}
 
 	/*
 	 * We delete all LP_DEAD items from the first heap pass in all indexes on
-	 * each call here.  This makes the next call to lazy_vacuum_heap_rel()
-	 * safe.
+	 * each call here (except calls where we don't finish all indexes).  This
+	 * makes the next call to lazy_vacuum_heap_rel() safe.
 	 */
 	Assert(vacrel->num_index_scans > 0 ||
 		   vacrel->dead_tuples->num_tuples == vacrel->lpdead_items);
 
-	/* Increase and report the number of index scans */
+	/*
+	 * Increase and report the number of index scans.  Note that we include
+	 * the case where we started a round of index scanning that we weren't able
+	 * to finish.
+	 */
 	vacrel->num_index_scans++;
 	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
 								 vacrel->num_index_scans);
+
+	return allindexes;
 }
 
 /*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 662aff04b4..d3ff2de81c 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -62,6 +62,8 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_skip_index_age;
+int			vacuum_multixact_skip_index_age;
 
 
 /* A few variables that don't seem worth passing around as parameters */
@@ -1134,6 +1136,65 @@ vacuum_set_xid_limits(Relation rel,
 	}
 }
 
+/*
+ * vacuum_xid_limit_emergency() -- Handle wraparound emergencies
+ *
+ * Input parameters are the target relation's relfrozenxid and relminmxid.
+ */
+bool
+vacuum_xid_limit_emergency(TransactionId relfrozenxid, MultiXactId relminmxid)
+{
+	TransactionId xid_skip_limit;
+	MultiXactId	  multi_skip_limit;
+	int			  skip_index_vacuum;
+
+	Assert(TransactionIdIsNormal(relfrozenxid));
+	Assert(MultiXactIdIsValid(relminmxid));
+
+	/*
+	 * Determine the index skipping age to use.  In any case it is not less
+	 * than autovacuum_freeze_max_age * 1.05, so that only anti-wraparound
+	 * autovacuums and aggressive scans can end up skipping index vacuuming.
+	 */
+	skip_index_vacuum = Max(vacuum_skip_index_age, autovacuum_freeze_max_age * 1.05);
+
+	xid_skip_limit = ReadNextTransactionId() - skip_index_vacuum;
+	if (!TransactionIdIsNormal(xid_skip_limit))
+		xid_skip_limit = FirstNormalTransactionId;
+
+	if (TransactionIdIsNormal(relfrozenxid) &&
+		TransactionIdPrecedes(relfrozenxid, xid_skip_limit))
+	{
+		/* The table's relfrozenxid is too old */
+		return true;
+	}
+
+	/*
+	 * Similar to above, determine the index skipping age to use for multixact.
+	 * In any case not less than autovacuum_multixact_freeze_max_age * 1.05.
+	 */
+	skip_index_vacuum = Max(vacuum_multixact_skip_index_age,
+							autovacuum_multixact_freeze_max_age * 1.05);
+
+	/*
+	 * Compute the multixact skip limit.  If the table's relminmxid precedes
+	 * this cutoff, further index vacuuming should be skipped so that
+	 * relminmxid can be advanced as soon as possible.
+	 */
+	multi_skip_limit = ReadNextMultiXactId() - skip_index_vacuum;
+	if (multi_skip_limit < FirstMultiXactId)
+		multi_skip_limit = FirstMultiXactId;
+
+	if (MultiXactIdIsValid(relminmxid) &&
+		MultiXactIdPrecedes(relminmxid, multi_skip_limit))
+	{
+		/* The table's relminmxid is too old */
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * vac_estimate_reltuples() -- estimate the new value for pg_class.reltuples
  *
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0c5dc4d3e8..24fb736a72 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2622,6 +2622,26 @@ static struct config_int ConfigureNamesInt[] =
 		0, 0, 1000000,		/* see ComputeXidHorizons */
 		NULL, NULL, NULL
 	},
+	{
+		{"vacuum_skip_index_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Age at which VACUUM should skip index vacuuming."),
+			NULL
+		},
+		&vacuum_skip_index_age,
+		/* effective value is silently raised to at least 1.05 * autovacuum_freeze_max_age */
+		1800000000, 0, 2100000000,
+		NULL, NULL, NULL
+	},
+	{
+		{"vacuum_multixact_skip_index_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Multixact age at which VACUUM should skip index vacuuming."),
+			NULL
+		},
+		&vacuum_multixact_skip_index_age,
+		/* effective value is silently raised to at least 1.05 * autovacuum_multixact_freeze_max_age */
+		1800000000, 0, 2100000000,
+		NULL, NULL, NULL
+	},
 
 	/*
 	 * See also CheckRequiredParameterValues() if this parameter changes
@@ -3222,7 +3242,10 @@ static struct config_int ConfigureNamesInt[] =
 			NULL
 		},
 		&autovacuum_freeze_max_age,
-		/* see pg_resetwal if you change the upper-limit value */
+		/*
+		 * see pg_resetwal and vacuum_skip_index_age if you change the
+		 * upper-limit value.
+		 */
 		200000000, 100000, 2000000000,
 		NULL, NULL, NULL
 	},
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b234a6bfe6..7d6564e17f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -673,6 +673,8 @@
 #vacuum_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_freeze_table_age = 150000000
+#vacuum_skip_index_age = 1800000000
+#vacuum_multixact_skip_index_age = 1800000000
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ddc6d789d8..9a21e4a402 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8528,6 +8528,31 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-skip-index-age" xreflabel="vacuum_skip_index_age">
+      <term><varname>vacuum_skip_index_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_skip_index_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        <command>VACUUM</command> skips index cleanup if the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
+        the age specified by this setting.  Skipping index cleanup lets
+        <command>VACUUM</command> finish, and thus advance
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>,
+        as quickly as possible.  The behavior is equivalent to setting the
+        <literal>INDEX_CLEANUP</literal> option to <literal>OFF</literal>, except that
+        this parameter can take effect even in the middle of a vacuum operation.
+        The default is 1.8 billion transactions.  Although users can set this value
+        anywhere from zero to 2.1 billion, <command>VACUUM</command> will silently
+        adjust the effective value to at least 105% of
+        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that only anti-wraparound
+        autovacuums and aggressive scans have a chance to skip index cleanup.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-multixact-freeze-table-age" xreflabel="vacuum_multixact_freeze_table_age">
       <term><varname>vacuum_multixact_freeze_table_age</varname> (<type>integer</type>)
       <indexterm>
@@ -8574,6 +8599,32 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-multixact-vacuum-skip-index-age" xreflabel="vacuum_multixact_skip_index_age">
+      <term><varname>vacuum_multixact_skip_index_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_multixact_skip_index_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        <command>VACUUM</command> skips index cleanup if the table's
+        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
+        the age specified by this setting.  Skipping index cleanup lets
+        <command>VACUUM</command> finish, and thus advance
+        <structname>pg_class</structname>.<structfield>relminmxid</structfield>,
+        as quickly as possible.  The behavior is equivalent to setting the
+        <literal>INDEX_CLEANUP</literal> option to <literal>OFF</literal>, except that
+        this parameter can take effect even in the middle of a vacuum operation.
+        The default is 1.8 billion multixacts.  Although users can set this value
+        anywhere from zero to 2.1 billion, <command>VACUUM</command> will silently
+        adjust the effective value to at least 105% of
+        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that only
+        anti-wraparound autovacuums and aggressive scans have a chance to skip
+        index cleanup.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 4d8ad754f8..4d3674c1b4 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -607,8 +607,14 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
 
    <para>
     If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
+    system will begin to skip index cleanup to hurry finishing vacuum
+    system will begin to skip index cleanup so that vacuum operations can
+    finish more quickly. <xref linkend="guc-vacuum-skip-index-age"/> controls
+    when <command>VACUUM</command> and autovacuum do that.
+   </para>
+
+   <para>
+    The system emits warning messages like this when the database's
+    oldest XIDs reach forty million transactions from the wraparound point:
 <programlisting>
 WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-- 
2.27.0
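
To make the bypass threshold in this patch concrete, here is a standalone
restatement of the decision lazy_vacuum() makes before its first (and only)
round of index vacuuming.  This is an illustration, not code from the patch;
in particular the 6-bytes-per-TID figure is an approximation standing in for
MAXDEADTUPLES(64MB):

#include <stdbool.h>
#include <stdint.h>

/*
 * Would a one-and-only lazy_vacuum() call bypass index vacuuming?
 * Mirrors the rule above: fewer than 2% of rel_pages may have LP_DEAD
 * items (BYPASS_THRESHOLD_NPAGES), and the collected TIDs must fit
 * comfortably in 64MB.
 */
static bool
would_bypass_index_vacuuming(uint32_t rel_pages,
							 uint32_t lpdead_item_pages,
							 int64_t lpdead_items,
							 int num_index_scans)
{
	double		page_threshold = (double) rel_pages * 0.02;
	int64_t		tid_threshold = (64LL * 1024 * 1024) / 6;	/* ~11.2M TIDs */

	/* Never bypass when multiple index scans are (or were) required */
	if (num_index_scans > 0 || rel_pages == 0)
		return false;

	return lpdead_item_pages < page_threshold &&
		lpdead_items < tid_threshold;
}

For example, a 1,000,000-page table (about 7.6GB) would bypass index
vacuuming only if at most 19,999 of its pages carry LP_DEAD items and fewer
than roughly 11 million TIDs were collected.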

v9-0002-Refactor-lazy_scan_heap.patchapplication/octet-stream; name=v9-0002-Refactor-lazy_scan_heap.patchDownload
From 2ad82c44e240478f23f318234ec272112d5193d0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 28 Mar 2021 20:55:55 -0700
Subject: [PATCH v9 2/4] Refactor lazy_scan_heap().

Break lazy_scan_heap() up into several new subsidiary functions.  The
largest and most important new subsidiary function handles heap pruning
and tuple freezing.  This is preparation for an upcoming patch to remove
the "tupgone" special case from vacuumlazy.c.

Also cleanly separate the logic used by a VACUUM with INDEX_CLEANUP=off
from the logic used by single-heap-pass VACUUMs.  The former case is now
structured as the omission of index and heap vacuuming by a two pass
VACUUM.  The latter case goes back to being used only when the table
happens to have no indexes.  This is simpler and more natural -- the
whole point of INDEX_CLEANUP=off is to skip the index and heap vacuuming
that would otherwise take place.  The single-heap-pass case doesn't skip
anything, though -- it just does heap vacuuming in the same single pass
over the heap as pruning (which is only safe with a table that happens
to have no indexes).

Also fix a very old bug in single-pass VACUUM VERBOSE output.  We were
reporting the number of tuples deleted via pruning as a direct
substitute for reporting the number of LP_DEAD items removed in a
function that deals with the second pass over the heap.  But that
doesn't work at all -- they're two different things.

To fix, start tracking the total number of LP_DEAD items encountered
during pruning, and use that in the report instead.  A single pass
VACUUM will always vacuum away whatever LP_DEAD items a heap page has
immediately after it is pruned, so the total number of LP_DEAD items
encountered during pruning equals the total number vacuumed-away.
(They are _not_ equal in the INDEX_CLEANUP=off case, but that's okay
because skipping index vacuuming is now a totally orthogonal concept to
one-pass VACUUM.)
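
Concretely: if pruning across the whole table leaves, say, 1,000 LP_DEAD
stub items behind in a table with no indexes, each page's stubs are set
LP_UNUSED during the same page visit, so the per-VACUUM total of LP_DEAD
items encountered during pruning (1,000) is also the number removed, and
that is the figure VERBOSE now reports.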

Also stop reporting empty_pages in VACUUM VERBOSE output, and start
reporting pages_removed instead.  This makes the output of VACUUM
VERBOSE more consistent with log_autovacuum's output.  The empty_pages
item doesn't seem very useful.
---
 src/backend/access/heap/vacuumlazy.c  | 1438 +++++++++++++++----------
 contrib/pg_visibility/pg_visibility.c |    8 +-
 contrib/pgstattuple/pgstatapprox.c    |    9 +-
 3 files changed, 863 insertions(+), 592 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index e8d56fa060..a36f0afd1e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -291,8 +291,9 @@ typedef struct LVRelState
 	Relation	onerel;
 	Relation   *indrels;
 	int			nindexes;
-	/* useindex = true means two-pass strategy; false means one-pass */
-	bool		useindex;
+	/* Do index vacuuming/cleanup? */
+	bool		do_index_vacuuming;
+	bool		do_index_cleanup;
 
 	/* Buffer access strategy and parallel state */
 	BufferAccessStrategy bstrategy;
@@ -330,6 +331,7 @@ typedef struct LVRelState
 	BlockNumber frozenskipped_pages;	/* # of frozen pages we skipped */
 	BlockNumber tupcount_pages; /* pages whose tuples we counted */
 	BlockNumber pages_removed;	/* pages remove by truncation */
+	BlockNumber lpdead_item_pages;	/* # pages with LP_DEAD items */
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	bool		lock_waiter_detected;
 
@@ -342,6 +344,7 @@ typedef struct LVRelState
 	/* Instrumentation counters */
 	int			num_index_scans;
 	int64		tuples_deleted; /* # deleted from table */
+	int64		lpdead_items;	/* # deleted from indexes */
 	int64		new_dead_tuples;	/* new estimated total # of dead items in
 									 * table */
 	int64		num_tuples;		/* total number of nonremovable tuples */
@@ -349,6 +352,29 @@ typedef struct LVRelState
 	int64		nunused;		/* # existing unused line pointers */
 } LVRelState;
 
+/*
+ * State set up and maintained in lazy_scan_heap() (also maintained in
+ * lazy_scan_prune()) that represents VM bit status.
+ *
+ * Used by lazy_scan_setvmbit() when we're done pruning.
+ */
+typedef struct LVPageVisMapState
+{
+	bool		all_visible_according_to_vm;
+	TransactionId visibility_cutoff_xid;
+} LVPageVisMapState;
+
+/*
+ * State output by lazy_scan_prune()
+ */
+typedef struct LVPagePruneState
+{
+	bool		hastup;			/* Page prevents rel truncation? */
+	bool		has_lpdead_items;	/* includes existing LP_DEAD items */
+	bool		all_visible;	/* Every item visible to all? */
+	bool		all_frozen;		/* provided all_visible is also true */
+} LVPagePruneState;
+
 /* Struct for saving and restoring vacuum error information. */
 typedef struct LVSavedErrInfo
 {
@@ -364,8 +390,21 @@ static int	elevel = -1;
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
 						   bool aggressive);
-static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
-									LVRelState *vacrel);
+static bool lazy_scan_needs_freeze(Buffer buf, bool *hastup,
+								   LVRelState *vacrel);
+static void lazy_scan_new_page(LVRelState *vacrel, Buffer buf);
+static void lazy_scan_empty_page(LVRelState *vacrel, Buffer buf,
+								 Buffer vmbuffer);
+static void lazy_scan_setvmbit(LVRelState *vacrel, Buffer buf,
+							   Buffer vmbuffer,
+							   LVPagePruneState *pageprunestate,
+							   LVPageVisMapState *pagevmstate);
+static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
+							GlobalVisState *vistest,
+							LVPagePruneState *pageprunestate,
+							LVPageVisMapState *pagevmstate,
+							VacOptTernaryValue index_cleanup);
+static void lazy_vacuum(LVRelState *vacrel);
 static void lazy_vacuum_all_indexes(LVRelState *vacrel);
 static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
 													IndexBulkDeleteResult *istat,
@@ -384,13 +423,11 @@ static void update_index_statistics(LVRelState *vacrel);
 static bool should_attempt_truncation(LVRelState *vacrel,
 									  VacuumParams *params);
 static void lazy_truncate_heap(LVRelState *vacrel);
-static void lazy_record_dead_tuple(LVDeadTuples *dead_tuples,
-								   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
 static int	vac_cmp_itemptr(const void *left, const void *right);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
-static BlockNumber count_nondeletable_pages(LVRelState *vacrel);
+static BlockNumber lazy_truncate_count_nondeletable(LVRelState *vacrel);
 static long compute_max_dead_tuples(BlockNumber relblocks, bool hasindex);
 static void lazy_space_alloc(LVRelState *vacrel, int nworkers,
 							 BlockNumber relblocks);
@@ -515,8 +552,13 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	vacrel->onerel = onerel;
 	vac_open_indexes(vacrel->onerel, RowExclusiveLock, &vacrel->nindexes,
 					 &vacrel->indrels);
-	vacrel->useindex = (vacrel->nindexes > 0 &&
-						params->index_cleanup == VACOPT_TERNARY_ENABLED);
+	vacrel->do_index_vacuuming = true;
+	vacrel->do_index_cleanup = true;
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+	{
+		vacrel->do_index_vacuuming = false;
+		vacrel->do_index_cleanup = false;
+	}
 	vacrel->bstrategy = bstrategy;
 	vacrel->lps = NULL;			/* for now */
 	vacrel->old_rel_pages = onerel->rd_rel->relpages;
@@ -808,8 +850,8 @@ vacuum_log_cleanup_info(LVRelState *vacrel)
  *		lists of dead tuples and pages with free space, calculates statistics
  *		on the number of live tuples in the heap, and marks pages as
  *		all-visible if appropriate.  When done, or when we run low on space
- *		for dead-tuple TIDs, invoke vacuuming of indexes and reclaim dead line
- *		pointers.
+ *		for dead-tuple TIDs, invoke lazy_vacuum, which vacuums the indexes and
+ *		then the heap relation during its own second pass over the heap.
  *
  *		If the table has at least two indexes, we execute both index vacuum
  *		and index cleanup with parallel workers unless parallel vacuum is
@@ -832,22 +874,12 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 {
 	LVDeadTuples *dead_tuples;
 	BlockNumber nblocks,
-				blkno;
-	HeapTupleData tuple;
-	BlockNumber empty_pages,
-				vacuumed_pages,
+				blkno,
+				next_unskippable_block,
 				next_fsm_block_to_vacuum;
-	double		num_tuples,		/* total number of nonremovable tuples */
-				live_tuples,	/* live tuples (reltuples estimate) */
-				tups_vacuumed,	/* tuples cleaned up by current vacuum */
-				nkeep,			/* dead-but-not-removable tuples */
-				nunused;		/* # existing unused line pointers */
-	int			i;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
-	BlockNumber next_unskippable_block;
 	bool		skipping_blocks;
-	xl_heap_freeze_tuple *frozen;
 	StringInfoData buf;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
@@ -870,23 +902,23 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 						vacrel->relnamespace,
 						vacrel->relname)));
 
-	empty_pages = vacuumed_pages = 0;
-	next_fsm_block_to_vacuum = (BlockNumber) 0;
-	num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
-
 	nblocks = RelationGetNumberOfBlocks(vacrel->onerel);
+	next_unskippable_block = 0;
+	next_fsm_block_to_vacuum = 0;
 	vacrel->rel_pages = nblocks;
 	vacrel->scanned_pages = 0;
 	vacrel->pinskipped_pages = 0;
 	vacrel->frozenskipped_pages = 0;
 	vacrel->tupcount_pages = 0;
 	vacrel->pages_removed = 0;
+	vacrel->lpdead_item_pages = 0;
 	vacrel->nonempty_pages = 0;
 	vacrel->lock_waiter_detected = false;
 
 	/* Initialize instrumentation counters */
 	vacrel->num_index_scans = 0;
 	vacrel->tuples_deleted = 0;
+	vacrel->lpdead_items = 0;
 	vacrel->new_dead_tuples = 0;
 	vacrel->num_tuples = 0;
 	vacrel->live_tuples = 0;
@@ -903,7 +935,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	 */
 	lazy_space_alloc(vacrel, params->nworkers, nblocks);
 	dead_tuples = vacrel->dead_tuples;
-	frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
 
 	/* Report that we're scanning the heap, advertising total # of blocks */
 	initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
@@ -955,7 +986,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	 * the last page.  This is worth avoiding mainly because such a lock must
 	 * be replayed on any hot standby, where it can be disruptive.
 	 */
-	next_unskippable_block = 0;
 	if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
 	{
 		while (next_unskippable_block < nblocks)
@@ -989,20 +1019,25 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	{
 		Buffer		buf;
 		Page		page;
-		OffsetNumber offnum,
-					maxoff;
-		bool		tupgone,
-					hastup;
-		int			prev_dead_count;
-		int			nfrozen;
+		LVPageVisMapState pagevmstate;
+		LVPagePruneState pageprunestate;
+		bool		savefreespace;
 		Size		freespace;
-		bool		all_visible_according_to_vm = false;
-		bool		all_visible;
-		bool		all_frozen = true;	/* provided all_visible is also true */
-		bool		has_dead_items;		/* includes existing LP_DEAD items */
-		TransactionId visibility_cutoff_xid = InvalidTransactionId;
 
-		/* see note above about forcing scanning of last page */
+		/*
+		 * Initialize vm state for page
+		 *
+		 * Can't touch pageprunestate for page until we reach
+		 * lazy_scan_prune(), though -- that's output state only
+		 */
+		pagevmstate.all_visible_according_to_vm = false;
+		pagevmstate.visibility_cutoff_xid = InvalidTransactionId;
+
+		/*
+		 * Step 1 for block: Consider need to skip blocks.
+		 *
+		 * See note above about forcing scanning of last page.
+		 */
 #define FORCE_CHECK_PAGE() \
 		(blkno == nblocks - 1 && should_attempt_truncation(vacrel, params))
 
@@ -1055,7 +1090,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (aggressive && VM_ALL_VISIBLE(vacrel->onerel, blkno,
 											 &vmbuffer))
-				all_visible_according_to_vm = true;
+				pagevmstate.all_visible_according_to_vm = true;
 		}
 		else
 		{
@@ -1083,12 +1118,15 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 					vacrel->frozenskipped_pages++;
 				continue;
 			}
-			all_visible_according_to_vm = true;
+			pagevmstate.all_visible_according_to_vm = true;
 		}
 
 		vacuum_delay_point();
 
 		/*
+		 * Step 2 for block: Consider if we definitely have enough space to
+		 * process TIDs on page already.
+		 *
 		 * If we are close to overrunning the available space for dead-tuple
 		 * TIDs, pause and do a cycle of vacuuming before we tackle this page.
 		 */
@@ -1107,24 +1145,15 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				vmbuffer = InvalidBuffer;
 			}
 
-			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(vacrel);
-
-			/* Remove tuples from heap */
-			lazy_vacuum_heap_rel(vacrel);
-
-			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
-			 */
-			dead_tuples->num_tuples = 0;
+			/* Remove the collected garbage tuples from table and indexes */
+			lazy_vacuum(vacrel);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
 			 * upper-level FSM pages.  Note we have not yet processed blkno.
 			 */
-			FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum, blkno);
+			FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum,
+									blkno);
 			next_fsm_block_to_vacuum = blkno;
 
 			/* Report that we are once again scanning the heap */
@@ -1133,6 +1162,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		}
 
 		/*
+		 * Step 3 for block: Set up visibility map page as needed.
+		 *
 		 * Pin the visibility map page in case we need to mark the page
 		 * all-visible.  In most cases this will be very cheap, because we'll
 		 * already have the correct page pinned anyway.  However, it's
@@ -1145,9 +1176,15 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		buf = ReadBufferExtended(vacrel->onerel, MAIN_FORKNUM, blkno,
 								 RBM_NORMAL, vacrel->bstrategy);
 
-		/* We need buffer cleanup lock so that we can prune HOT chains. */
+		/*
+		 * Step 4 for block: Acquire super-exclusive lock for pruning.
+		 *
+		 * We need buffer cleanup lock so that we can prune HOT chains.
+		 */
 		if (!ConditionalLockBufferForCleanup(buf))
 		{
+			bool		hastup;
+
 			/*
 			 * If we're not performing an aggressive scan to guard against XID
 			 * wraparound, and we don't want to forcibly check the page, then
@@ -1178,7 +1215,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 * to use lazy_check_needs_freeze() for both situations, though.
 			 */
 			LockBuffer(buf, BUFFER_LOCK_SHARE);
-			if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
+			if (!lazy_scan_needs_freeze(buf, &hastup, vacrel))
 			{
 				UnlockReleaseBuffer(buf);
 				vacrel->scanned_pages++;
@@ -1204,6 +1241,12 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			/* drop through to normal processing */
 		}
 
+		/*
+		 * Step 5 for block: Handle empty/new pages.
+		 *
+		 * By here we have a super-exclusive lock, and it's clear that this
+		 * page is one that we consider scanned.
+		 */
 		vacrel->scanned_pages++;
 		vacrel->tupcount_pages++;
 
@@ -1211,396 +1254,78 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 
 		if (PageIsNew(page))
 		{
-			/*
-			 * All-zeroes pages can be left over if either a backend extends
-			 * the relation by a single page, but crashes before the newly
-			 * initialized page has been written out, or when bulk-extending
-			 * the relation (which creates a number of empty pages at the tail
-			 * end of the relation, but enters them into the FSM).
-			 *
-			 * Note we do not enter the page into the visibilitymap. That has
-			 * the downside that we repeatedly visit this page in subsequent
-			 * vacuums, but otherwise we'll never not discover the space on a
-			 * promoted standby. The harm of repeated checking ought to
-			 * normally not be too bad - the space usually should be used at
-			 * some point, otherwise there wouldn't be any regular vacuums.
-			 *
-			 * Make sure these pages are in the FSM, to ensure they can be
-			 * reused. Do that by testing if there's any space recorded for
-			 * the page. If not, enter it. We do so after releasing the lock
-			 * on the heap page, the FSM is approximate, after all.
-			 */
-			UnlockReleaseBuffer(buf);
-
-			empty_pages++;
-
-			if (GetRecordedFreeSpace(vacrel->onerel, blkno) == 0)
-			{
-				Size		freespace;
-
-				freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
-				RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
-			}
+			/* Releases lock on buf for us: */
+			lazy_scan_new_page(vacrel, buf);
 			continue;
 		}
-
-		if (PageIsEmpty(page))
+		else if (PageIsEmpty(page))
 		{
-			empty_pages++;
-			freespace = PageGetHeapFreeSpace(page);
-
-			/*
-			 * Empty pages are always all-visible and all-frozen (note that
-			 * the same is currently not true for new pages, see above).
-			 */
-			if (!PageIsAllVisible(page))
-			{
-				START_CRIT_SECTION();
-
-				/* mark buffer dirty before writing a WAL record */
-				MarkBufferDirty(buf);
-
-				/*
-				 * It's possible that another backend has extended the heap,
-				 * initialized the page, and then failed to WAL-log the page
-				 * due to an ERROR.  Since heap extension is not WAL-logged,
-				 * recovery might try to replay our record setting the page
-				 * all-visible and find that the page isn't initialized, which
-				 * will cause a PANIC.  To prevent that, check whether the
-				 * page has been previously WAL-logged, and if not, do that
-				 * now.
-				 */
-				if (RelationNeedsWAL(vacrel->onerel) &&
-					PageGetLSN(page) == InvalidXLogRecPtr)
-					log_newpage_buffer(buf, true);
-
-				PageSetAllVisible(page);
-				visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
-								  vmbuffer, InvalidTransactionId,
-								  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
-				END_CRIT_SECTION();
-			}
-
-			UnlockReleaseBuffer(buf);
-			RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
+			/* Releases lock on buf for us (though keeps vmbuffer pin): */
+			lazy_scan_empty_page(vacrel, buf, vmbuffer);
 			continue;
 		}
 
 		/*
-		 * Prune all HOT-update chains in this page.
+		 * Step 6 for block: Do pruning.
 		 *
-		 * We count tuples removed by the pruning step as removed by VACUUM
-		 * (existing LP_DEAD line pointers don't count).
+		 * Also accumulates details of remaining LP_DEAD line pointers on page
+		 * in dead tuple list.  This includes LP_DEAD line pointers that we
+		 * ourselves just pruned, as well as existing LP_DEAD line pointers
+		 * pruned earlier.
+		 *
+		 * Also handles tuple freezing -- considers freezing XIDs from all
+		 * tuple headers left behind following pruning.
 		 */
-		tups_vacuumed += heap_page_prune(vacrel->onerel, buf, vistest,
-										 InvalidTransactionId, 0, false,
-										 &vacrel->latestRemovedXid,
-										 &vacrel->offnum);
+		lazy_scan_prune(vacrel, buf, vistest, &pageprunestate, &pagevmstate,
+						params->index_cleanup);
 
 		/*
-		 * Now scan the page to collect vacuumable items and check for tuples
-		 * requiring freezing.
+		 * Step 7 for block: Set up details for saving free space in FSM at
+		 * end of loop.  (Also performs extra single pass strategy steps in
+		 * "nindexes == 0" case.)
+		 *
+		 * If we have any LP_DEAD items on this page (i.e. any new dead_tuples
+		 * entries compared to just before lazy_scan_prune()) then the page
+		 * will be visited again by lazy_vacuum_heap_rel(), which will compute
+		 * and record its post-compaction free space.  If not, then we're done
+		 * with this page, so remember its free space as-is.
 		 */
-		all_visible = true;
-		has_dead_items = false;
-		nfrozen = 0;
-		hastup = false;
-		prev_dead_count = dead_tuples->num_tuples;
-		maxoff = PageGetMaxOffsetNumber(page);
-
-		/*
-		 * Note: If you change anything in the loop below, also look at
-		 * heap_page_is_all_visible to see if that needs to be changed.
-		 */
-		for (offnum = FirstOffsetNumber;
-			 offnum <= maxoff;
-			 offnum = OffsetNumberNext(offnum))
+		savefreespace = false;
+		freespace = 0;
+		if (vacrel->nindexes > 0 && pageprunestate.has_lpdead_items &&
+			vacrel->do_index_vacuuming)
 		{
-			ItemId		itemid;
-
-			/*
-			 * Set the offset number so that we can display it along with any
-			 * error that occurred while processing this tuple.
-			 */
-			vacrel->offnum = offnum;
-			itemid = PageGetItemId(page, offnum);
-
-			/* Unused items require no processing, but we count 'em */
-			if (!ItemIdIsUsed(itemid))
-			{
-				nunused += 1;
-				continue;
-			}
-
-			/* Redirect items mustn't be touched */
-			if (ItemIdIsRedirected(itemid))
-			{
-				hastup = true;	/* this page won't be truncatable */
-				continue;
-			}
-
-			ItemPointerSet(&(tuple.t_self), blkno, offnum);
-
-			/*
-			 * LP_DEAD line pointers are to be vacuumed normally; but we don't
-			 * count them in tups_vacuumed, else we'd be double-counting (at
-			 * least in the common case where heap_page_prune() just freed up
-			 * a non-HOT tuple).  Note also that the final tups_vacuumed value
-			 * might be very low for tables where opportunistic page pruning
-			 * happens to occur very frequently (via heap_page_prune_opt()
-			 * calls that free up non-HOT tuples).
-			 */
-			if (ItemIdIsDead(itemid))
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				all_visible = false;
-				has_dead_items = true;
-				continue;
-			}
-
-			Assert(ItemIdIsNormal(itemid));
-
-			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
-			tuple.t_len = ItemIdGetLength(itemid);
-			tuple.t_tableOid = RelationGetRelid(vacrel->onerel);
-
-			tupgone = false;
-
-			/*
-			 * The criteria for counting a tuple as live in this block need to
-			 * match what analyze.c's acquire_sample_rows() does, otherwise
-			 * VACUUM and ANALYZE may produce wildly different reltuples
-			 * values, e.g. when there are many recently-dead tuples.
-			 *
-			 * The logic here is a bit simpler than acquire_sample_rows(), as
-			 * VACUUM can't run inside a transaction block, which makes some
-			 * cases impossible (e.g. in-progress insert from the same
-			 * transaction).
-			 */
-			switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
-			{
-				case HEAPTUPLE_DEAD:
-
-					/*
-					 * Ordinarily, DEAD tuples would have been removed by
-					 * heap_page_prune(), but it's possible that the tuple
-					 * state changed since heap_page_prune() looked.  In
-					 * particular an INSERT_IN_PROGRESS tuple could have
-					 * changed to DEAD if the inserter aborted.  So this
-					 * cannot be considered an error condition.
-					 *
-					 * If the tuple is HOT-updated then it must only be
-					 * removed by a prune operation; so we keep it just as if
-					 * it were RECENTLY_DEAD.  Also, if it's a heap-only
-					 * tuple, we choose to keep it, because it'll be a lot
-					 * cheaper to get rid of it in the next pruning pass than
-					 * to treat it like an indexed tuple. Finally, if index
-					 * cleanup is disabled, the second heap pass will not
-					 * execute, and the tuple will not get removed, so we must
-					 * treat it like any other dead tuple that we choose to
-					 * keep.
-					 *
-					 * If this were to happen for a tuple that actually needed
-					 * to be deleted, we'd be in trouble, because it'd
-					 * possibly leave a tuple below the relation's xmin
-					 * horizon alive.  heap_prepare_freeze_tuple() is prepared
-					 * to detect that case and abort the transaction,
-					 * preventing corruption.
-					 */
-					if (HeapTupleIsHotUpdated(&tuple) ||
-						HeapTupleIsHeapOnly(&tuple) ||
-						params->index_cleanup == VACOPT_TERNARY_DISABLED)
-						nkeep += 1;
-					else
-						tupgone = true; /* we can delete the tuple */
-					all_visible = false;
-					break;
-				case HEAPTUPLE_LIVE:
-
-					/*
-					 * Count it as live.  Not only is this natural, but it's
-					 * also what acquire_sample_rows() does.
-					 */
-					live_tuples += 1;
-
-					/*
-					 * Is the tuple definitely visible to all transactions?
-					 *
-					 * NB: Like with per-tuple hint bits, we can't set the
-					 * PD_ALL_VISIBLE flag if the inserter committed
-					 * asynchronously. See SetHintBits for more info. Check
-					 * that the tuple is hinted xmin-committed because of
-					 * that.
-					 */
-					if (all_visible)
-					{
-						TransactionId xmin;
-
-						if (!HeapTupleHeaderXminCommitted(tuple.t_data))
-						{
-							all_visible = false;
-							break;
-						}
-
-						/*
-						 * The inserter definitely committed. But is it old
-						 * enough that everyone sees it as committed?
-						 */
-						xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-						if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
-						{
-							all_visible = false;
-							break;
-						}
-
-						/* Track newest xmin on page. */
-						if (TransactionIdFollows(xmin, visibility_cutoff_xid))
-							visibility_cutoff_xid = xmin;
-					}
-					break;
-				case HEAPTUPLE_RECENTLY_DEAD:
-
-					/*
-					 * If tuple is recently deleted then we must not remove it
-					 * from relation.
-					 */
-					nkeep += 1;
-					all_visible = false;
-					break;
-				case HEAPTUPLE_INSERT_IN_PROGRESS:
-
-					/*
-					 * This is an expected case during concurrent vacuum.
-					 *
-					 * We do not count these rows as live, because we expect
-					 * the inserting transaction to update the counters at
-					 * commit, and we assume that will happen only after we
-					 * report our results.  This assumption is a bit shaky,
-					 * but it is what acquire_sample_rows() does, so be
-					 * consistent.
-					 */
-					all_visible = false;
-					break;
-				case HEAPTUPLE_DELETE_IN_PROGRESS:
-					/* This is an expected case during concurrent vacuum */
-					all_visible = false;
-
-					/*
-					 * Count such rows as live.  As above, we assume the
-					 * deleting transaction will commit and update the
-					 * counters after we report.
-					 */
-					live_tuples += 1;
-					break;
-				default:
-					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
-					break;
-			}
-
-			if (tupgone)
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-													   &vacrel->latestRemovedXid);
-				tups_vacuumed += 1;
-				has_dead_items = true;
-			}
-			else
-			{
-				bool		tuple_totally_frozen;
-
-				num_tuples += 1;
-				hastup = true;
-
-				/*
-				 * Each non-removable tuple must be checked to see if it needs
-				 * freezing.  Note we already have exclusive buffer lock.
-				 */
-				if (heap_prepare_freeze_tuple(tuple.t_data,
-											  vacrel->relfrozenxid,
-											  vacrel->relminmxid,
-											  vacrel->FreezeLimit,
-											  vacrel->MultiXactCutoff,
-											  &frozen[nfrozen],
-											  &tuple_totally_frozen))
-					frozen[nfrozen++].offset = offnum;
-
-				if (!tuple_totally_frozen)
-					all_frozen = false;
-			}
-		}						/* scan along page */
-
-		/*
-		 * Clear the offset information once we have processed all the tuples
-		 * on the page.
-		 */
-		vacrel->offnum = InvalidOffsetNumber;
-
-		/*
-		 * If we froze any tuples, mark the buffer dirty, and write a WAL
-		 * record recording the changes.  We must log the changes to be
-		 * crash-safe against future truncation of CLOG.
-		 */
-		if (nfrozen > 0)
+			/* Wait until lazy_vacuum_heap_rel() to save free space */
+		}
+		else
 		{
-			START_CRIT_SECTION();
-
-			MarkBufferDirty(buf);
-
-			/* execute collected freezes */
-			for (i = 0; i < nfrozen; i++)
-			{
-				ItemId		itemid;
-				HeapTupleHeader htup;
-
-				itemid = PageGetItemId(page, frozen[i].offset);
-				htup = (HeapTupleHeader) PageGetItem(page, itemid);
-
-				heap_execute_freeze_tuple(htup, &frozen[i]);
-			}
-
-			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(vacrel->onerel))
-			{
-				XLogRecPtr	recptr;
-
-				recptr = log_heap_freeze(vacrel->onerel, buf,
-										 vacrel->FreezeLimit, frozen, nfrozen);
-				PageSetLSN(page, recptr);
-			}
-
-			END_CRIT_SECTION();
+			/* Save space right away */
+			savefreespace = true;
+			freespace = PageGetHeapFreeSpace(page);
 		}
 
-		/*
-		 * If there are no indexes we can vacuum the page right now instead of
-		 * doing a second scan. Also we don't do that but forget dead tuples
-		 * when index cleanup is disabled.
-		 */
-		if (!vacrel->useindex && dead_tuples->num_tuples > 0)
+		if (vacrel->nindexes == 0 && pageprunestate.has_lpdead_items)
 		{
-			if (vacrel->nindexes == 0)
-			{
-				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
-				vacuumed_pages++;
-				has_dead_items = false;
-			}
-			else
-			{
-				/*
-				 * Here, we have indexes but index cleanup is disabled.
-				 * Instead of vacuuming the dead tuples on the heap, we just
-				 * forget them.
-				 */
-				Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
-			}
+			Assert(dead_tuples->num_tuples > 0);
 
 			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
+			 * One pass strategy (no indexes) case.
+			 *
+			 * Mark LP_DEAD item pointers for LP_UNUSED now, since there won't
+			 * be a second pass in lazy_vacuum_heap_rel().
 			 */
+			lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+
+			/* This won't have changed: */
+			Assert(savefreespace && freespace == PageGetHeapFreeSpace(page));
+
+			/*
+			 * Make sure lazy_scan_setvmbit() won't stop setting VM due to
+			 * now-vacuumed LP_DEAD items:
+			 */
+			pageprunestate.has_lpdead_items = false;
+
+			/* Forget the now-vacuumed tuples */
 			dead_tuples->num_tuples = 0;
 
 			/*
@@ -1611,115 +1336,34 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
 			{
-				FreeSpaceMapVacuumRange(vacrel->onerel, next_fsm_block_to_vacuum,
-										blkno);
+				FreeSpaceMapVacuumRange(vacrel->onerel,
+										next_fsm_block_to_vacuum, blkno);
 				next_fsm_block_to_vacuum = blkno;
 			}
 		}
 
-		freespace = PageGetHeapFreeSpace(page);
-
-		/* mark page all-visible, if appropriate */
-		if (all_visible && !all_visible_according_to_vm)
-		{
-			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
-
-			if (all_frozen)
-				flags |= VISIBILITYMAP_ALL_FROZEN;
-
-			/*
-			 * It should never be the case that the visibility map page is set
-			 * while the page-level bit is clear, but the reverse is allowed
-			 * (if checksums are not enabled).  Regardless, set both bits so
-			 * that we get back in sync.
-			 *
-			 * NB: If the heap page is all-visible but the VM bit is not set,
-			 * we don't need to dirty the heap page.  However, if checksums
-			 * are enabled, we do need to make sure that the heap page is
-			 * dirtied before passing it to visibilitymap_set(), because it
-			 * may be logged.  Given that this situation should only happen in
-			 * rare cases after a crash, it is not worth optimizing.
-			 */
-			PageSetAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, visibility_cutoff_xid, flags);
-		}
+		/* One pass strategy had better have no dead tuples by now: */
+		Assert(vacrel->nindexes > 0 || dead_tuples->num_tuples == 0);
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after we checked it and before we took the buffer
-		 * content lock, so we must recheck before jumping to the conclusion
-		 * that something bad has happened.
+		 * Step 8 for block: Handle setting visibility map bit as appropriate
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(vacrel->onerel, blkno, &vmbuffer))
-		{
-			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
-				 vacrel->relname, blkno);
-			visibilitymap_clear(vacrel->onerel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
+		lazy_scan_setvmbit(vacrel, buf, vmbuffer, &pageprunestate,
+						   &pagevmstate);
 
 		/*
-		 * It's possible for the value returned by
-		 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
-		 * wrong for us to see tuples that appear to not be visible to
-		 * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
-		 * xmin value never moves backwards, but
-		 * GetOldestNonRemovableTransactionId() is conservative and sometimes
-		 * returns a value that's unnecessarily small, so if we see that
-		 * contradiction it just means that the tuples that we think are not
-		 * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
-		 * is correct.
-		 *
-		 * There should never be dead tuples on a page with PD_ALL_VISIBLE
-		 * set, however.
+		 * Step 9 for block: drop super-exclusive lock, finalize page by
+		 * recording its free space in the FSM as appropriate
 		 */
-		else if (PageIsAllVisible(page) && has_dead_items)
-		{
-			elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
-				 vacrel->relname, blkno);
-			PageClearAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_clear(vacrel->onerel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
-
-		/*
-		 * If the all-visible page is all-frozen but not marked as such yet,
-		 * mark it as all-frozen.  Note that all_frozen is only valid if
-		 * all_visible is true, so we must check both.
-		 */
-		else if (all_visible_according_to_vm && all_visible && all_frozen &&
-				 !VM_ALL_FROZEN(vacrel->onerel, blkno, &vmbuffer))
-		{
-			/*
-			 * We can pass InvalidTransactionId as the cutoff XID here,
-			 * because setting the all-frozen bit doesn't cause recovery
-			 * conflicts.
-			 */
-			visibilitymap_set(vacrel->onerel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, InvalidTransactionId,
-							  VISIBILITYMAP_ALL_FROZEN);
-		}
 
 		UnlockReleaseBuffer(buf);
-
 		/* Remember the location of the last page with nonremovable tuples */
-		if (hastup)
+		if (pageprunestate.hastup)
 			vacrel->nonempty_pages = blkno + 1;
-
-		/*
-		 * If we remembered any tuples for deletion, then the page will be
-		 * visited again by lazy_vacuum_heap_rel, which will compute and record
-		 * its post-compaction free space.  If not, then we're done with this
-		 * page, so remember its free space as-is.  (This path will always be
-		 * taken if there are no indexes.)
-		 */
-		if (dead_tuples->num_tuples == prev_dead_count)
+		if (savefreespace)
 			RecordPageWithFreeSpace(vacrel->onerel, blkno, freespace);
+
+		/* Finished all steps for block by here (at the latest) */
 	}
 
 	/* report that everything is scanned and vacuumed */
@@ -1728,16 +1372,10 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	/* Clear the block number information */
 	vacrel->blkno = InvalidBlockNumber;
 
-	pfree(frozen);
-
-	/* save stats for use later */
-	vacrel->tuples_deleted = tups_vacuumed;
-	vacrel->new_dead_tuples = nkeep;
-
 	/* now we can compute the new value for pg_class.reltuples */
 	vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->onerel, nblocks,
 													 vacrel->tupcount_pages,
-													 live_tuples);
+													 vacrel->live_tuples);
 
 	/*
 	 * Also compute the total number of surviving heap entries.  In the
@@ -1758,13 +1396,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	/* If any tuples need to be deleted, perform final vacuum cycle */
 	/* XXX put a threshold on min number of tuples here? */
 	if (dead_tuples->num_tuples > 0)
-	{
-		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(vacrel);
-
-		/* Remove tuples from heap */
-		lazy_vacuum_heap_rel(vacrel);
-	}
+		lazy_vacuum(vacrel);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
@@ -1778,29 +1410,37 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Do post-vacuum cleanup */
-	if (vacrel->useindex)
+	if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
 		lazy_cleanup_all_indexes(vacrel);
 
 	/* Free resources managed by lazy_space_alloc() */
 	lazy_space_free(vacrel);
 
 	/* Update index statistics */
-	if (vacrel->useindex)
+	if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
 		update_index_statistics(vacrel);
 
-	/* If no indexes, make log report that lazy_vacuum_heap_rel would've made */
-	if (vacuumed_pages)
+	/*
+	 * If table has no indexes and at least one heap page was vacuumed, make
+	 * log report that lazy_vacuum_heap_rel would've made had there been
+	 * indexes (having indexes implies using the two pass strategy).
+	 */
+	if (vacrel->nindexes == 0 && vacrel->lpdead_item_pages > 0)
 		ereport(elevel,
-				(errmsg("\"%s\": removed %.0f row versions in %u pages",
-						vacrel->relname,
-						tups_vacuumed, vacuumed_pages)));
+				(errmsg("\"%s\": removed %lld dead item identifiers in %u pages",
+						vacrel->relname, (long long) vacrel->lpdead_items,
+						vacrel->lpdead_item_pages)));
 
 	initStringInfo(&buf);
 	appendStringInfo(&buf,
-					 _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"),
-					 nkeep, vacrel->OldestXmin);
-	appendStringInfo(&buf, _("There were %.0f unused item identifiers.\n"),
-					 nunused);
+					 _("%lld dead row versions cannot be removed yet, oldest xmin: %u\n"),
+					 (long long) vacrel->new_dead_tuples, vacrel->OldestXmin);
+	appendStringInfo(&buf, _("There were %lld unused item identifiers.\n"),
+					 (long long) vacrel->nunused);
+	appendStringInfo(&buf, ngettext("%u page removed.\n",
+									"%u pages removed.\n",
+									vacrel->pages_removed),
+					 vacrel->pages_removed);
 	appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
 									"Skipped %u pages due to buffer pins, ",
 									vacrel->pinskipped_pages),
@@ -1809,30 +1449,27 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 									"%u frozen pages.\n",
 									vacrel->frozenskipped_pages),
 					 vacrel->frozenskipped_pages);
-	appendStringInfo(&buf, ngettext("%u page is entirely empty.\n",
-									"%u pages are entirely empty.\n",
-									empty_pages),
-					 empty_pages);
 	appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
 
 	ereport(elevel,
-			(errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
+			(errmsg("\"%s\": found %lld removable, %lld nonremovable row versions in %u out of %u pages",
 					vacrel->relname,
-					tups_vacuumed, num_tuples,
-					vacrel->scanned_pages, nblocks),
+					(long long) vacrel->tuples_deleted,
+					(long long) vacrel->num_tuples, vacrel->scanned_pages,
+					nblocks),
 			 errdetail_internal("%s", buf.data)));
 	pfree(buf.data);
 }
 
 /*
- *	lazy_check_needs_freeze() -- scan page to see if any tuples
- *					 need to be cleaned to avoid wraparound
+ *	lazy_scan_needs_freeze() -- see if any tuples need to be cleaned to avoid
+ *	wraparound
  *
  * Returns true if the page needs to be vacuumed using cleanup lock.
  * Also returns a flag indicating whether page contains any tuples at all.
  */
 static bool
-lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
+lazy_scan_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
 {
 	Page		page = BufferGetPage(buf);
 	OffsetNumber offnum,
@@ -1864,7 +1501,9 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
 		vacrel->offnum = offnum;
 		itemid = PageGetItemId(page, offnum);
 
-		/* this should match hastup test in count_nondeletable_pages() */
+		/*
+		 * This should match hastup test in lazy_truncate_count_nondeletable()
+		 */
 		if (ItemIdIsUsed(itemid))
 			*hastup = true;
 
@@ -1885,6 +1524,648 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
 	return (offnum <= maxoff);
 }
 
+/*
+ * Handle new page during lazy_scan_heap().
+ *
+ * Caller must hold pin and buffer cleanup lock on buf.
+ *
+ * All-zeroes pages can be left over if either a backend extends the relation
+ * by a single page, but crashes before the newly initialized page has been
+ * written out, or when bulk-extending the relation (which creates a number of
+ * empty pages at the tail end of the relation, but enters them into the FSM).
+ *
+ * Note we do not enter the page into the visibilitymap. That has the downside
+ * that we repeatedly visit this page in subsequent vacuums, but otherwise
+ * we'll never discover the space on a promoted standby. The harm of
+ * repeated checking ought to normally not be too bad - the space usually
+ * should be used at some point, otherwise there wouldn't be any regular
+ * vacuums.
+ *
+ * Make sure these pages are in the FSM, to ensure they can be reused. Do that
+ * by testing if there's any space recorded for the page. If not, enter it. We
+ * do so after releasing the lock on the heap page, the FSM is approximate,
+ * after all.
+ */
+static void
+lazy_scan_new_page(LVRelState *vacrel, Buffer buf)
+{
+	Relation	onerel = vacrel->onerel;
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+
+	if (GetRecordedFreeSpace(onerel, blkno) == 0)
+	{
+		Size		freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
+
+		UnlockReleaseBuffer(buf);
+		RecordPageWithFreeSpace(onerel, blkno, freespace);
+		return;
+	}
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Handle empty page during lazy_scan_heap().
+ *
+ * Caller must hold pin and buffer cleanup lock on buf, as well as a pin (but
+ * not a lock) on vmbuffer.
+ */
+static void
+lazy_scan_empty_page(LVRelState *vacrel, Buffer buf, Buffer vmbuffer)
+{
+	Relation	onerel = vacrel->onerel;
+	Page		page = BufferGetPage(buf);
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+	Size		freespace = PageGetHeapFreeSpace(page);
+
+	/*
+	 * Empty pages are always all-visible and all-frozen (note that the same
+	 * is currently not true for new pages, see lazy_scan_new_page()).
+	 */
+	if (!PageIsAllVisible(page))
+	{
+		START_CRIT_SECTION();
+
+		/* mark buffer dirty before writing a WAL record */
+		MarkBufferDirty(buf);
+
+		/*
+		 * It's possible that another backend has extended the heap,
+		 * initialized the page, and then failed to WAL-log the page due to an
+		 * ERROR.  Since heap extension is not WAL-logged, recovery might try
+		 * to replay our record setting the page all-visible and find that the
+		 * page isn't initialized, which will cause a PANIC.  To prevent that,
+		 * check whether the page has been previously WAL-logged, and if not,
+		 * do that now.
+		 */
+		if (RelationNeedsWAL(onerel) &&
+			PageGetLSN(page) == InvalidXLogRecPtr)
+			log_newpage_buffer(buf, true);
+
+		PageSetAllVisible(page);
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, InvalidTransactionId,
+						  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
+		END_CRIT_SECTION();
+	}
+
+	UnlockReleaseBuffer(buf);
+	RecordPageWithFreeSpace(onerel, blkno, freespace);
+}
+
+/*
+ * Handle setting VM bit inside lazy_scan_heap(), after pruning and freezing.
+ */
+static void
+lazy_scan_setvmbit(LVRelState *vacrel, Buffer buf, Buffer vmbuffer,
+				   LVPagePruneState *pageprunestate,
+				   LVPageVisMapState *pagevmstate)
+{
+	Relation	onerel = vacrel->onerel;
+	Page		page = BufferGetPage(buf);
+	BlockNumber blkno = BufferGetBlockNumber(buf);
+
+	/* mark page all-visible, if appropriate */
+	if (pageprunestate->all_visible &&
+		!pagevmstate->all_visible_according_to_vm)
+	{
+		uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
+
+		if (pageprunestate->all_frozen)
+			flags |= VISIBILITYMAP_ALL_FROZEN;
+
+		/*
+		 * It should never be the case that the visibility map page is set
+		 * while the page-level bit is clear, but the reverse is allowed (if
+		 * checksums are not enabled).  Regardless, set both bits so that we
+		 * get back in sync.
+		 *
+		 * NB: If the heap page is all-visible but the VM bit is not set, we
+		 * don't need to dirty the heap page.  However, if checksums are
+		 * enabled, we do need to make sure that the heap page is dirtied
+		 * before passing it to visibilitymap_set(), because it may be logged.
+		 * Given that this situation should only happen in rare cases after a
+		 * crash, it is not worth optimizing.
+		 */
+		PageSetAllVisible(page);
+		MarkBufferDirty(buf);
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer,
+						  pagevmstate->visibility_cutoff_xid, flags);
+	}
+
+	/*
+	 * The visibility map bit should never be set if the page-level bit is
+	 * clear.  However, it's possible that the bit got cleared after we
+	 * checked it and before we took the buffer content lock, so we must
+	 * recheck before jumping to the conclusion that something bad has
+	 * happened.
+	 */
+	else if (pagevmstate->all_visible_according_to_vm &&
+			 !PageIsAllVisible(page) && VM_ALL_VISIBLE(onerel, blkno,
+													   &vmbuffer))
+	{
+		elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+			 RelationGetRelationName(onerel), blkno);
+		visibilitymap_clear(onerel, blkno, vmbuffer,
+							VISIBILITYMAP_VALID_BITS);
+	}
+
+	/*
+	 * It's possible for the value returned by
+	 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
+	 * wrong for us to see tuples that appear to not be visible to everyone
+	 * yet, while PD_ALL_VISIBLE is already set. The real safe xmin value
+	 * never moves backwards, but GetOldestNonRemovableTransactionId() is
+	 * conservative and sometimes returns a value that's unnecessarily small,
+	 * so if we see that contradiction it just means that the tuples that we
+	 * think are not visible to everyone yet actually are, and the
+	 * PD_ALL_VISIBLE flag is correct.
+	 *
+	 * There should never be dead tuples on a page with PD_ALL_VISIBLE set,
+	 * however.
+	 */
+	else if (PageIsAllVisible(page) && pageprunestate->has_lpdead_items)
+	{
+		elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+			 RelationGetRelationName(onerel), blkno);
+		PageClearAllVisible(page);
+		MarkBufferDirty(buf);
+		visibilitymap_clear(onerel, blkno, vmbuffer,
+							VISIBILITYMAP_VALID_BITS);
+	}
+
+	/*
+	 * If the all-visible page is all-frozen but not marked as such yet, mark
+	 * it as all-frozen.  Note that all_frozen is only valid if all_visible is
+	 * true, so we must check both.
+	 */
+	else if (pagevmstate->all_visible_according_to_vm &&
+			 pageprunestate->all_visible && pageprunestate->all_frozen &&
+			 !VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+	{
+		/*
+		 * We can pass InvalidTransactionId as the cutoff XID here, because
+		 * setting the all-frozen bit doesn't cause recovery conflicts.
+		 */
+		visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+						  vmbuffer, InvalidTransactionId,
+						  VISIBILITYMAP_ALL_FROZEN);
+	}
+}
+
+/*
+ *	lazy_scan_prune() -- lazy_scan_heap() pruning and freezing.
+ *
+ * Caller must hold pin and buffer cleanup lock on the buffer.
+ */
+static void
+lazy_scan_prune(LVRelState *vacrel, Buffer buf, GlobalVisState *vistest,
+				LVPagePruneState *pageprunestate,
+				LVPageVisMapState *pagevmstate,
+				VacOptTernaryValue index_cleanup)
+{
+	Relation	onerel = vacrel->onerel;
+	BlockNumber blkno;
+	Page		page;
+	OffsetNumber offnum,
+				maxoff;
+	ItemId		itemid;
+	HeapTupleData tuple;
+	int			tuples_deleted,
+				lpdead_items,
+				new_dead_tuples,
+				num_tuples,
+				live_tuples,
+				nunused;
+	int			nredirect PG_USED_FOR_ASSERTS_ONLY;
+	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+	OffsetNumber tupoffsets[MaxHeapTuplesPerPage];
+
+	blkno = BufferGetBlockNumber(buf);
+	page = BufferGetPage(buf);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Initialize (or reset) page-level counters */
+	tuples_deleted = 0;
+	lpdead_items = 0;
+	new_dead_tuples = 0;
+	num_tuples = 0;
+	live_tuples = 0;
+	nunused = 0;
+	nredirect = 0;
+
+	/*
+	 * Prune all HOT-update chains in this page.
+	 *
+	 * We count tuples removed by the pruning step as tuples_deleted.  Its
+	 * final value can be thought of as the number of tuples that have been
+	 * deleted from the table.  It should not be confused with lpdead_items;
+	 * lpdead_items's final value can be thought of as the number of tuples
+	 * that were deleted from indexes.
+	 */
+	tuples_deleted = heap_page_prune(onerel, buf, vistest,
+									 InvalidTransactionId, 0, false,
+									 &vacrel->latestRemovedXid,
+									 &vacrel->offnum);
+
+	/*
+	 * Now scan the page to collect vacuumable items and check for tuples
+	 * requiring freezing.
+	 */
+	pageprunestate->hastup = false;
+	pageprunestate->has_lpdead_items = false;
+	pageprunestate->all_visible = true;
+	pageprunestate->all_frozen = true;
+
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		bool		tupgone = false;
+
+		/*
+		 * Set the offset number so that we can display it along with any
+		 * error that occurred while processing this tuple.
+		 */
+		vacrel->offnum = offnum;
+		itemid = PageGetItemId(page, offnum);
+
+		/* Unused items only need to be counted for log message */
+		if (!ItemIdIsUsed(itemid))
+		{
+			nunused++;
+			continue;
+		}
+
+		/* Redirect items mustn't be touched */
+		if (ItemIdIsRedirected(itemid))
+		{
+			pageprunestate->hastup = true;	/* page won't be truncatable */
+			nredirect++;
+			continue;
+		}
+
+		/* LP_DEAD items are processed outside of the loop */
+		if (ItemIdIsDead(itemid))
+		{
+			deadoffsets[lpdead_items++] = offnum;
+			pageprunestate->all_visible = false;
+			pageprunestate->has_lpdead_items = true;
+			continue;
+		}
+
+		Assert(ItemIdIsNormal(itemid));
+
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationGetRelid(onerel);
+
+		/*
+		 * The criteria for counting a tuple as live in this block need to
+		 * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
+		 * and ANALYZE may produce wildly different reltuples values, e.g.
+		 * when there are many recently-dead tuples.
+		 *
+		 * The logic here is a bit simpler than acquire_sample_rows(), as
+		 * VACUUM can't run inside a transaction block, which makes some cases
+		 * impossible (e.g. in-progress insert from the same transaction).
+		 */
+		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+		{
+			case HEAPTUPLE_DEAD:
+
+				/*
+				 * Ordinarily, DEAD tuples would have been removed by
+				 * heap_page_prune(), but it's possible that the tuple state
+				 * changed since heap_page_prune() looked.  In particular an
+				 * INSERT_IN_PROGRESS tuple could have changed to DEAD if the
+				 * inserter aborted.  So this cannot be considered an error
+				 * condition.
+				 *
+				 * If the tuple is HOT-updated then it must only be removed by
+				 * a prune operation; so we keep it just as if it were
+				 * RECENTLY_DEAD.  Also, if it's a heap-only tuple, we choose
+				 * to keep it, because it'll be a lot cheaper to get rid of it
+				 * in the next pruning pass than to treat it like an indexed
+				 * tuple. Finally, if index cleanup is disabled, the second
+				 * heap pass will not execute, and the tuple will not get
+				 * removed, so we must treat it like any other dead tuple that
+				 * we choose to keep.
+				 *
+				 * If this were to happen for a tuple that actually needed to
+				 * be deleted, we'd be in trouble, because it'd possibly leave
+				 * a tuple below the relation's xmin horizon alive.
+				 * heap_prepare_freeze_tuple() is prepared to detect that case
+				 * and abort the transaction, preventing corruption.
+				 */
+				if (HeapTupleIsHotUpdated(&tuple) ||
+					HeapTupleIsHeapOnly(&tuple) ||
+					index_cleanup == VACOPT_TERNARY_DISABLED)
+					new_dead_tuples++;
+				else
+					tupgone = true; /* we can delete the tuple */
+				pageprunestate->all_visible = false;
+				break;
+			case HEAPTUPLE_LIVE:
+
+				/*
+				 * Count it as live.  Not only is this natural, but it's also
+				 * what acquire_sample_rows() does.
+				 */
+				live_tuples++;
+
+				/*
+				 * Is the tuple definitely visible to all transactions?
+				 *
+				 * NB: Like with per-tuple hint bits, we can't set the
+				 * PD_ALL_VISIBLE flag if the inserter committed
+				 * asynchronously. See SetHintBits for more info. Check that
+				 * the tuple is hinted xmin-committed because of that.
+				 */
+				if (pageprunestate->all_visible)
+				{
+					TransactionId xmin;
+
+					if (!HeapTupleHeaderXminCommitted(tuple.t_data))
+					{
+						pageprunestate->all_visible = false;
+						break;
+					}
+
+					/*
+					 * The inserter definitely committed. But is it old enough
+					 * that everyone sees it as committed?
+					 */
+					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
+					if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
+					{
+						pageprunestate->all_visible = false;
+						break;
+					}
+
+					/* Track newest xmin on page. */
+					if (TransactionIdFollows(xmin,
+											 pagevmstate->visibility_cutoff_xid))
+						pagevmstate->visibility_cutoff_xid = xmin;
+				}
+				break;
+			case HEAPTUPLE_RECENTLY_DEAD:
+
+				/*
+				 * If tuple is recently deleted then we must not remove it
+				 * from relation.
+				 */
+				new_dead_tuples++;
+				pageprunestate->all_visible = false;
+				break;
+			case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+				/*
+				 * We do not count these rows as live, because we expect the
+				 * inserting transaction to update the counters at commit, and
+				 * we assume that will happen only after we report our
+				 * results.  This assumption is a bit shaky, but it is what
+				 * acquire_sample_rows() does, so be consistent.
+				 */
+				pageprunestate->all_visible = false;
+				break;
+			case HEAPTUPLE_DELETE_IN_PROGRESS:
+				/* This is an expected case during concurrent vacuum */
+				pageprunestate->all_visible = false;
+
+				/*
+				 * Count such rows as live.  As above, we assume the deleting
+				 * transaction will commit and update the counters after we
+				 * report.
+				 */
+				live_tuples++;
+				break;
+			default:
+				elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+				break;
+		}
+
+		if (tupgone)
+		{
+			/* Pretend that this is an LP_DEAD item  */
+			deadoffsets[lpdead_items++] = offnum;
+			pageprunestate->all_visible = false;
+			pageprunestate->has_lpdead_items = true;
+
+			/* But remember it for XLOG_HEAP2_CLEANUP_INFO record */
+			HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
+												   &vacrel->latestRemovedXid);
+		}
+		else
+		{
+			/*
+			 * Each non-removable tuple must be checked to see if it needs
+			 * freezing
+			 */
+			tupoffsets[num_tuples++] = offnum;
+			pageprunestate->hastup = true;
+			/* Consider pageprunestate->all_frozen below, during freezing */
+		}
+	}
+
+	/*
+	 * We have now divided every item on the page into either an LP_DEAD item
+	 * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
+	 * that remains and needs to be considered for freezing now (LP_UNUSED and
+	 * LP_REDIRECT items also remain, but are of no further interest to us).
+	 *
+	 * Add page level counters to caller's counts, and then actually process
+	 * LP_DEAD and LP_NORMAL items.
+	 *
+	 * TODO: Remove tupgone logic entirely in next commit -- we shouldn't have
+	 * to pretend that DEAD items are LP_DEAD items.
+	 */
+	Assert(lpdead_items + num_tuples + nunused + nredirect == maxoff);
+	vacrel->offnum = InvalidOffsetNumber;
+
+	vacrel->tuples_deleted += tuples_deleted;
+	vacrel->lpdead_items += lpdead_items;
+	vacrel->new_dead_tuples += new_dead_tuples;
+	vacrel->num_tuples += num_tuples;
+	vacrel->live_tuples += live_tuples;
+	vacrel->nunused += nunused;
+
+	/*
+	 * Consider the need to freeze any items with tuple storage from the page
+	 * first (arbitrary)
+	 */
+	if (num_tuples > 0)
+	{
+		xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+		int					 nfrozen = 0;
+
+		Assert(pageprunestate->hastup);
+
+		for (int i = 0; i < num_tuples; i++)
+		{
+			OffsetNumber item = tupoffsets[i];
+			bool		tuple_totally_frozen;
+
+			ItemPointerSet(&(tuple.t_self), blkno, item);
+			itemid = PageGetItemId(page, item);
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			Assert(ItemIdIsNormal(itemid) && ItemIdHasStorage(itemid));
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationGetRelid(vacrel->onerel);
+			if (heap_prepare_freeze_tuple(tuple.t_data,
+										  vacrel->relfrozenxid,
+										  vacrel->relminmxid,
+										  vacrel->FreezeLimit,
+										  vacrel->MultiXactCutoff,
+										  &frozen[nfrozen],
+										  &tuple_totally_frozen))
+				frozen[nfrozen++].offset = item;
+			if (!tuple_totally_frozen)
+				pageprunestate->all_frozen = false;
+		}
+
+		if (nfrozen > 0)
+		{
+			/*
+			 * At least one tuple with storage needs to be frozen -- execute
+			 * that now.
+			 *
+			 * If we need to freeze any tuples we'll mark the buffer dirty,
+			 * and write a WAL record recording the changes.  We must log the
+			 * changes to be crash-safe against future truncation of CLOG.
+			 */
+			START_CRIT_SECTION();
+
+			MarkBufferDirty(buf);
+
+			/* execute collected freezes */
+			for (int i = 0; i < nfrozen; i++)
+			{
+				HeapTupleHeader htup;
+
+				itemid = PageGetItemId(page, frozen[i].offset);
+				htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+				heap_execute_freeze_tuple(htup, &frozen[i]);
+			}
+
+			/* Now WAL-log freezing if necessary */
+			if (RelationNeedsWAL(vacrel->onerel))
+			{
+				XLogRecPtr	recptr;
+
+				recptr = log_heap_freeze(vacrel->onerel, buf, vacrel->FreezeLimit,
+										 frozen, nfrozen);
+				PageSetLSN(page, recptr);
+			}
+
+			END_CRIT_SECTION();
+		}
+	}
+
+	/*
+	 * The second pass over the heap can also set visibility map bits, using
+	 * the same approach.  This is important when the table frequently has a
+	 * few old LP_DEAD items on each page by the time we get to it (typically
+	 * because past opportunistic pruning operations freed some non-HOT
+	 * tuples).
+	 *
+	 * VACUUM will call heap_page_is_all_visible() during the second pass over
+	 * the heap to determine all_visible and all_frozen for the page -- this
+	 * is a specialized version of the logic from this function.  Now that
+	 * we've finished pruning and freezing, make sure that we're in total
+	 * agreement with heap_page_is_all_visible() using an assertion.
+	 */
+#ifdef USE_ASSERT_CHECKING
+	/* Note that all_frozen value does not matter when !all_visible */
+	if (pageprunestate->all_visible)
+	{
+		TransactionId cutoff;
+		bool		  all_frozen;
+
+		if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
+			Assert(false);
+
+		Assert(lpdead_items == 0);
+		Assert(pageprunestate->all_frozen == all_frozen);
+
+		/*
+		 * It's possible that we froze tuples and made the page's XID cutoff
+		 * (for recovery conflict purposes) FrozenTransactionId.  This is okay
+		 * because visibility_cutoff_xid will be logged by our caller in a
+		 * moment.
+		 */
+		Assert(cutoff == FrozenTransactionId ||
+			   cutoff == pagevmstate->visibility_cutoff_xid);
+	}
+#endif
+
+	/*
+	 * Now save details of the LP_DEAD items from the page in the dead_tuples
+	 * array.  Also record that page has dead items in per-page prunestate.
+	 */
+	if (lpdead_items > 0)
+	{
+		LVDeadTuples *dead_tuples = vacrel->dead_tuples;
+		ItemPointerData tmp;
+
+		Assert(!pageprunestate->all_visible);
+		Assert(pageprunestate->has_lpdead_items);
+
+		vacrel->lpdead_item_pages++;
+
+		/*
+		 * Don't actually save item when it is known for sure that both index
+		 * vacuuming and heap vacuuming cannot go ahead during the ongoing VACUUM
+		 */
+		if (!vacrel->do_index_vacuuming && vacrel->nindexes > 0)
+			return;
+
+		ItemPointerSetBlockNumber(&tmp, blkno);
+
+		for (int i = 0; i < lpdead_items; i++)
+		{
+			ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
+			dead_tuples->itemptrs[dead_tuples->num_tuples++] = tmp;
+		}
+
+		Assert(dead_tuples->num_tuples <= dead_tuples->max_tuples);
+		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+									 dead_tuples->num_tuples);
+	}
+}
+
+/*
+ * Remove the collected garbage tuples from the table and its indexes.
+ */
+static void
+lazy_vacuum(LVRelState *vacrel)
+{
+	/* Should not end up here with no indexes */
+	Assert(vacrel->nindexes > 0);
+	Assert(!IsParallelWorker());
+	Assert(vacrel->lpdead_item_pages > 0);
+
+	if (!vacrel->do_index_vacuuming)
+	{
+		Assert(!vacrel->do_index_cleanup);
+		vacrel->dead_tuples->num_tuples = 0;
+		return;
+	}
+
+	/* Okay, we're going to do index vacuuming */
+	lazy_vacuum_all_indexes(vacrel);
+
+	/* Remove tuples from heap */
+	lazy_vacuum_heap_rel(vacrel);
+
+	/*
+	 * Forget the now-vacuumed tuples -- just press on
+	 */
+	vacrel->dead_tuples->num_tuples = 0;
+}
+
 /*
  *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
  */
@@ -1892,6 +2173,8 @@ static void
 lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
 	Assert(vacrel->nindexes > 0);
+	Assert(vacrel->do_index_vacuuming);
+	Assert(vacrel->do_index_cleanup);
 	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
 	Assert(MultiXactIdIsValid(vacrel->relminmxid));
 
@@ -2106,6 +2389,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 	Buffer		vmbuffer = InvalidBuffer;
 	LVSavedErrInfo saved_err_info;
 
+	Assert(vacrel->do_index_vacuuming);
+	Assert(vacrel->do_index_cleanup);
+	Assert(vacrel->num_index_scans > 0);
+
 	/* Report that we are now vacuuming the heap */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
@@ -2190,6 +2477,8 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
 
+	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
+
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Update error traceback information */
@@ -2433,7 +2722,7 @@ lazy_truncate_heap(LVRelState *vacrel)
 		 * other backends could have added tuples to these pages whilst we
 		 * were vacuuming.
 		 */
-		new_rel_pages = count_nondeletable_pages(vacrel);
+		new_rel_pages = lazy_truncate_count_nondeletable(vacrel);
 		vacrel->blkno = new_rel_pages;
 
 		if (new_rel_pages >= old_rel_pages)
@@ -2482,7 +2771,7 @@ lazy_truncate_heap(LVRelState *vacrel)
  * Returns number of nondeletable pages (last nonempty page + 1).
  */
 static BlockNumber
-count_nondeletable_pages(LVRelState *vacrel)
+lazy_truncate_count_nondeletable(LVRelState *vacrel)
 {
 	Relation	onerel = vacrel->onerel;
 	BlockNumber blkno;
@@ -2622,14 +2911,14 @@ count_nondeletable_pages(LVRelState *vacrel)
  * Return the maximum number of dead tuples we can record.
  */
 static long
-compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
+compute_max_dead_tuples(BlockNumber relblocks, bool hasindex)
 {
 	long		maxtuples;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
 	autovacuum_work_mem != -1 ?
 	autovacuum_work_mem : maintenance_work_mem;
 
-	if (useindex)
+	if (hasindex)
 	{
 		maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
 		maxtuples = Min(maxtuples, INT_MAX);
@@ -2712,26 +3001,6 @@ lazy_space_free(LVRelState *vacrel)
 	end_parallel_vacuum(vacrel);
 }
 
-/*
- * lazy_record_dead_tuple - remember one deletable tuple
- */
-static void
-lazy_record_dead_tuple(LVDeadTuples *dead_tuples, ItemPointer itemptr)
-{
-	/*
-	 * The array shouldn't overflow under normal behavior, but perhaps it
-	 * could if we are given a really small maintenance_work_mem. In that
-	 * case, just forget the last few tuples (we'll get 'em next time).
-	 */
-	if (dead_tuples->num_tuples < dead_tuples->max_tuples)
-	{
-		dead_tuples->itemptrs[dead_tuples->num_tuples] = *itemptr;
-		dead_tuples->num_tuples++;
-		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
-									 dead_tuples->num_tuples);
-	}
-}
-
 /*
  *	lazy_tid_reaped() -- is a particular tid deletable?
  *
@@ -2822,7 +3091,8 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 
 	/*
 	 * This is a stripped down version of the line pointer scan in
-	 * lazy_scan_heap(). So if you change anything here, also check that code.
+	 * lazy_scan_prune(). So if you change anything here, also check that
+	 * code.
 	 */
 	maxoff = PageGetMaxOffsetNumber(page);
 	for (offnum = FirstOffsetNumber;
@@ -2868,7 +3138,7 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 				{
 					TransactionId xmin;
 
-					/* Check comments in lazy_scan_heap. */
+					/* Check comments in lazy_scan_prune() */
 					if (!HeapTupleHeaderXminCommitted(tuple.t_data))
 					{
 						all_visible = false;
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index dd0c124e62..6bfc48c64a 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -756,10 +756,10 @@ tuple_all_visible(HeapTuple tup, TransactionId OldestXmin, Buffer buffer)
 		return false;			/* all-visible implies live */
 
 	/*
-	 * Neither lazy_scan_heap nor heap_page_is_all_visible will mark a page
-	 * all-visible unless every tuple is hinted committed. However, those hint
-	 * bits could be lost after a crash, so we can't be certain that they'll
-	 * be set here.  So just check the xmin.
+	 * Neither lazy_scan_heap/lazy_scan_new_page nor heap_page_is_all_visible
+	 * will mark a page all-visible unless every tuple is hinted committed.
+	 * However, those hint bits could be lost after a crash, so we can't be
+	 * certain that they'll be set here.  So just check the xmin.
 	 */
 
 	xmin = HeapTupleHeaderGetXmin(tup->t_data);
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 1fe193bb25..adf4a61aac 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -58,8 +58,8 @@ typedef struct output_type
  * and approximate tuple_len on that basis. For the others, we count
  * the exact number of dead tuples etc.
  *
- * This scan is loosely based on vacuumlazy.c:lazy_scan_heap(), but
- * we do not try to avoid skipping single pages.
+ * This scan is loosely based on vacuumlazy.c:lazy_scan_heap and
+ * lazy_scan_new_page, but we do not try to avoid skipping single pages.
  */
 static void
 statapprox_heap(Relation rel, output_type *stat)
@@ -126,8 +126,9 @@ statapprox_heap(Relation rel, output_type *stat)
 
 		/*
 		 * Look at each tuple on the page and decide whether it's live or
-		 * dead, then count it and its size. Unlike lazy_scan_heap, we can
-		 * afford to ignore problems and special cases.
+		 * dead, then count it and its size. Unlike lazy_scan_heap and
+		 * lazy_scan_new_page, we can afford to ignore problems and special
+		 * cases.
 		 */
 		maxoff = PageGetMaxOffsetNumber(page);
 
-- 
2.27.0

#91Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#90)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Mar 31, 2021 at 12:01 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Mar 28, 2021 at 9:16 PM Peter Geoghegan <pg@bowt.ie> wrote:

And now here's v8, which has the following additional cleanup:

And here's v9, which has improved commit messages for the first 2
patches, and many small tweaks within all 4 patches.

The most interesting change is that lazy_scan_heap() now has a fairly
elaborate assertion that verifies that its idea about whether or not
the page is all_visible and all_frozen is shared by
heap_page_is_all_visible() -- this is a stripped down version of the
logic that now lives in lazy_scan_heap(). It exists so that the second
pass over the heap can set visibility map bits.

Thank you for updating the patches.

Both the 0001 and 0002 patches refactor the whole lazy vacuum code. Can we
merge them? I basically agree with the refactoring made by the 0001 patch,
but I'm a bit concerned that such a large refactoring this close to feature
freeze could be risky. We would need more eyes to review it during
stabilization.

Here are some comments on the 0001 patch:

-/*
- * Macro to check if we are in a parallel vacuum. If true, we are in the
- * parallel mode and the DSM segment is initialized.
- */
-#define ParallelVacuumIsActive(lps) PointerIsValid(lps)
-

I think it's clearer to keep using this macro. It could be defined like this:

#define ParallelVacuumIsActive(vacrel) (((LVRelState *) (vacrel))->lps != NULL)

---
 /*
- * LVDeadTuples stores the dead tuple TIDs collected during the heap scan.
- * This is allocated in the DSM segment in parallel mode and in local memory
- * in non-parallel mode.
+ * LVDeadTuples stores TIDs that are gathered during pruning/the initial heap
+ * scan.  These get deleted from indexes during index vacuuming.  They're then
+ * removed from the heap during a second heap pass that performs heap
+ * vacuuming.
  */

The second sentence of the removed lines still seems to be useful
information for readers?

---
- *
- * Note that vacrelstats->dead_tuples could have tuples which
- * became dead after HOT-pruning but are not marked dead yet.
- * We do not process them because it's a very rare condition,
- * and the next vacuum will process them anyway.

Maybe the above comments should not be removed by the 0001 patch.

---
+       /* Free resources managed by lazy_space_alloc() */
+       lazy_space_free(vacrel);

and

+/* Free space for dead tuples */
+static void
+lazy_space_free(LVRelState *vacrel)
+{
+       if (!vacrel->lps)
+               return;
+
+       /*
+        * End parallel mode before updating index statistics as we cannot write
+        * during parallel mode.
+        */
+       end_parallel_vacuum(vacrel);

Looking at the comments, I thought that this function also frees the
palloc'd dead tuple space, but it doesn't. It would be clearer either to
pfree(vacrel->dead_tuples) here as well, or to not create
lazy_space_free() at all.
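
For example, a rough sketch of the first option (just to illustrate the
idea, not meant literally):

static void
lazy_space_free(LVRelState *vacrel)
{
    if (!vacrel->lps)
    {
        /* Non-parallel case: dead tuple space was palloc'd locally */
        pfree(vacrel->dead_tuples);
        return;
    }

    /* Parallel case: the dead tuple space lives in the DSM segment */
    end_parallel_vacuum(vacrel);
}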

Also, the comment for end_parallel_vacuum() doesn't seem relevant to
this function. Maybe we can update it to:

/* Exit parallel mode and free the parallel context */

---
+       if (shared_istat)
+       {
+               /* Get the space for IndexBulkDeleteResult */
+               bulkdelete_res = &(shared_istat->istat);
+
+               /*
+                * Update the pointer to the corresponding bulk-deletion result if
+                * someone has already updated it.
+                */
+               if (shared_istat->updated && istat == NULL)
+                       istat = bulkdelete_res;
+       }

(snip)

+       if (shared_istat && !shared_istat->updated && istat != NULL)
+       {
+               memcpy(bulkdelete_res, istat, sizeof(IndexBulkDeleteResult));
+               shared_istat->updated = true;
+
+               /*
+                * Now that top-level indstats[idx] points to the DSM segment, we
+                * don't need the locally allocated results.
+                */
+               pfree(istat);
+               istat = bulkdelete_res;
+       }
+
+       return istat;

If we have parallel_process_one_index() return the address of
IndexBulkDeleteResult, we can simplify the first part above. Also, it
seems better to use a separate variable from istat to store the
result. How about the following structure?

IndexBulkDeleteResult *istat_res;

/*
 * Update the pointer of the corresponding bulk-deletion result if
 * someone has already updated it.
 */
if (shared_istat && shared_istat->updated && istat == NULL)
    istat = shared_istat->istat;

/* Do vacuum or cleanup of the index */
if (lvshared->for_cleanup)
    istat_res = lazy_cleanup_one_index(indrel, istat, ...);
else
    istat_res = lazy_vacuum_one_index(indrel, istat, ...);

/*
 * (snip)
 */
if (shared_istat && !shared_istat->updated && istat_res != NULL)
{
    memcpy(shared_istat->istat, istat_res, sizeof(IndexBulkDeleteResult));
    shared_istat->updated = true;

    /* free the locally-allocated bulk-deletion result */
    pfree(istat_res);

    /* return the pointer to the result on the DSM segment */
    return shared_istat->istat;
}

return istat_res;

Comment on 0002 patch:

+           /* This won't have changed: */
+           Assert(savefreespace && freespace == PageGetHeapFreeSpace(page));

This assertion can be false because freespace can be 0 if the page's
PD_HAS_FREE_LINES hint is wrong. Since lazy_vacuum_heap_page() fixes
the hint, PageGetHeapFreeSpace(page) in the assertion can then return a
non-zero value.

And here are comments on the 0004 patch:

+               ereport(WARNING,
+                               (errmsg("abandoned index vacuuming of table \"%s.%s.%s\" as a fail safe after %d index scans",
+                                               get_database_name(MyDatabaseId),
+                                               vacrel->relname,
+                                               vacrel->relname,
+                                               vacrel->num_index_scans),

The first vacrel->relname should be vacrel->relnamespace.

I think we can use errmsg_plural() for the "%d index scans" part.
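
For example, something like this (just a sketch, reusing the message
wording from the patch and assuming the relnamespace fix above):

ereport(WARNING,
        (errmsg_plural("abandoned index vacuuming of table \"%s.%s.%s\" as a fail safe after %d index scan",
                       "abandoned index vacuuming of table \"%s.%s.%s\" as a fail safe after %d index scans",
                       vacrel->num_index_scans,
                       get_database_name(MyDatabaseId),
                       vacrel->relnamespace,
                       vacrel->relname,
                       vacrel->num_index_scans)));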

---
+               ereport(elevel,
+                               (errmsg("\"%s\": index scan bypassed: %u pages from table (%.2f%% of total) have %lld dead item identifiers",
+                                               vacrel->relname, vacrel->rel_pages,
+                                               100.0 * vacrel->lpdead_item_pages / vacrel->rel_pages,
+                                               (long long) vacrel->lpdead_items)));

We should use vacrel->lpdead_item_pages instead of vacrel->rel_pages here.

---
+               /* Stop applying cost limits from this point on */
+               VacuumCostActive = false;
+               VacuumCostBalance = 0;
+       }

I agree with the idea of disabling the vacuum delay in emergency cases.
But why do we do that only for tables with indexes? I think this
optimization is helpful even for tables with no indexes. Couldn't we
check for an XID wraparound emergency by calling
vacuum_xid_limit_emergency() at some point and disable the vacuum delay
there too?
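
As a rough sketch of what I mean (assuming vacuum_xid_limit_emergency()
from the 0004 patch can simply be passed vacrel here; its actual signature
may differ):

    /* Could be checked periodically even in the nindexes == 0 case */
    if (VacuumCostActive && vacuum_xid_limit_emergency(vacrel))
    {
        /* Stop applying cost limits from this point on */
        VacuumCostActive = false;
        VacuumCostBalance = 0;
    }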

---
+                                       if (vacrel->do_index_cleanup)
+                                               appendStringInfo(&buf, _("index scan bypassed:"));
+                                       else
+                                               appendStringInfo(&buf, _("index scan bypassed due to emergency:"));
+                                       msgfmt = _(" %u pages from table (%.2f%% of total) have %lld dead item identifiers\n");
+                               }

Both vacrel->do_index_vacuuming and vacrel->do_index_cleanup can also be
false when INDEX_CLEANUP is off. So autovacuum could wrongly report an
emergency if the table's vacuum_index_cleanup reloption is false.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#92Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#89)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Mar 29, 2021 at 12:16 AM Peter Geoghegan <pg@bowt.ie> wrote:

And now here's v8, which has the following additional cleanup:

I can't effectively review 0001 because it both changes the code for
individual functions significantly and reorders them within the file.
I think it needs to be separated into two patches, one of which makes
the changes and the other of which reorders stuff. I would probably
vote for just dropping the second one, since I'm not sure there's
really enough value there to justify the code churn, but if we're
going to do it, I think it should definitely be done separately.

Here are a few comments on the parts I was able to understand:

* "onerel" is a stupid naming convention that I'd rather not propagate
further. It makes sense in the context of a function whose job it is
to iterate over a list of relations and do something for each one. But
once you're down into the code that only knows about one relation in
the first place, calling that relation "onerel" rather than "rel" or
"vacrel" or even "heaprel" is just confusing. Ditto for "onerelid".

* Moving stuff from static variables into LVRelState seems like a
great idea. Renaming it from LVRelStats seems like a good idea, too.

* Setting vacrel->lps = NULL "for now" when we already did palloc0 at
allocation time seems counterproductive.

* The code associated with the comment block that says "Initialize
state for a parallel vacuum" has been moved inside lazy_space_alloc().
That doesn't seem like an especially good choice, because no casual
reader is going to expect a function called lazy_space_alloc() to be
entering parallel mode and so forth as a side effect. Also, the call
to lazy_space_alloc() still has a comment that says "Allocate the
space for dead tuples in case parallel vacuum is not initialized."
even though the ParallelVacuumIsActive() check has been removed and
the function now does a lot more than allocating space.

* lazy_scan_heap() removes the comment which begins "Note that
vacrelstats->dead_tuples could have tuples which became dead after
HOT-pruning but are not marked dead yet." But IIUC that special case
is removed by a later patch, not 0001, in which case it is that patch
that should be touching this comment.

Regarding 0002:

* It took me a while to understand why lazy_scan_new_page() and
lazy_scan_empty_page() are named the way they are. I'm not sure
exactly what would be better, so I am not necessarily saying I think
you have to change anything, but for the record I think this naming
sucks. The reason we have "lazy" in here, AFAIU, is because originally
we only had old-style VACUUM FULL, and that was the good hard-working
VACUUM, and what we now think of as VACUUM was the "lazy" version that
didn't really do the whole job. Then we decided it was the
hard-working version that actually sucked and we always wanted to be
lazy (or else rewrite the table). So now we have all of these
functions named "lazy" which are really just functions to do "vacuum".
But, if we just did s/lazy/vacuum/g we'd be in trouble, because we use
"vacuum" to mean "part of vacuum." That's actually a pretty insane
thing to do, but we like terminological confusion so much that we
decided to use the word vacuum not just to refer to one part of vacuum
but to two different parts of vacuum. During heap vacuuming, which is
the relevant thing here, we call the first part a "scan" and the
second part "vacuum," hence lazy_scan_page() and lazy_vacuum_page().
For indexes, we can decide to vacuum indexes or cleanup indexes,
either of which is part of our overall strategy of trying to do a
VACUUM. We need some words here that are not so overloaded. If, for
example, we could agree that the whole thing is vacuum and the first
time we touch the heap page that's the strawberry phase and then the
second time we touch it that's the rhubarb phase, then we could have
vacuum_strawberry_page(), vacuum_strawberry_new_page(),
vacuum_rhubarb_phase(), etc. and everything would be a lot clearer,
assuming that you replaced the words "strawberry" and "rhubarb" with
something actually meaningful. But that seems hard. I thought about
suggesting that the word for strawberry should be "prune", but it does
more than that. I thought about suggesting that either the word for
strawberry or the word for rhubarb should be "cleanup," but that's
another word that is already confusingly overloaded. So I don't know.

* But all that having been said, it's easy to get confused and think
that lazy_scan_new_page() is scanning a new page for lazy vacuum, but
in fact it's the new-page handler for the scan phase of lazy vacuum,
and it doesn't scan anything at all. If there's a way to avoid that
kind of confusion, +1 from me.

* One possibility is that maybe it's not such a great idea to put this
logic in its own function. I'm rather suspicious on principle of
functions that are called with a locked or pinned buffer and release
the lock or pin before returning. It suggests that the abstraction is
not very clean. A related problem is that, post-refactoring, the
parallels between the page-is-new and page-is-empty cases are harder
to spot. Both at least maybe do RecordPageWithFreeSpace(), both do
UnlockReleaseBuffer(), etc. but you have to look at the subroutines to
figure that out after these changes. I understand the value of keeping
the main function shorter, but it doesn't help much if you have to go
jump into all of the subroutines and read them anyway.

* The new comment added which begins "Even if we skipped heap vacuum,
..." is good, but perhaps it could be more optimistic. It seems to me
that it's not just that it *could* be worthwhile because we *could*
have updated freespace, but that those things are in fact probable.

* I'm not really a huge fan of comments that include step numbers,
because they tend to cause future patches to have to change a bunch of
comments every time somebody adds a new step, or, less commonly,
removes an old one. I would suggest revising the comments you've added
that say things like "Step N for block: X" to just "X". I do like the
comment additions, just not the attributing of specific numbers to
specific steps.

* As in 0001, core logical changes are obscured by moving code and
changing it in the same patch. All this logic gets moved into
lazy_scan_prune() and revised at the same time. Using git diff
--color-moved -w sorta works, but even then there are parts of it that
are pretty hard to read, because there's a bunch of other stuff that
gets rejiggered at the same time.

My concentration is flagging a bit so I'm going to stop reviewing here
for now. I'm not deeply opposed to any of what I've seen so far. My
main criticism is that I think more thought should be given to how
things are named and to separating minimal code-movement patches from
other changes.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

#93Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#92)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Mar 31, 2021 at 9:29 AM Robert Haas <robertmhaas@gmail.com> wrote:

I can't effectively review 0001 because it both changes the code for
individual functions significantly and reorders them within the file.
I think it needs to be separated into two patches, one of which makes
the changes and the other of which reorders stuff. I would probably
vote for just dropping the second one, since I'm not sure there's
really enough value there to justify the code churn, but if we're
going to do it, I think it should definitely be done separately.

Thanks for the review!

I'll split it up that way. I think that I need to see it both ways
before deciding if I should push back on that. I will admit that I was
a bit zealous in rearranging things because it seems long overdue. But
I might well have gone too far with rearranging code.

* "onerel" is a stupid naming convention that I'd rather not propagate
further. It makes sense in the context of a function whose job it is
to iterate over a list of relations and do something for each one. But
once you're down into the code that only knows about one relation in
the first place, calling that relation "onerel" rather than "rel" or
"vacrel" or even "heaprel" is just confusing. Ditto for "onerelid".

I agree, and can change it. Though at the cost of more diff churn.

* Moving stuff from static variables into LVRelState seems like a
great idea. Renaming it from LVRelStats seems like a good idea, too.

The static variables were bad, but nowhere near as bad as the
variables that are local to lazy_scan_heap(). They are currently a
gigantic mess.

Not that LVRelStats was much better. We have the latestRemovedXid
field in LVRelStats, and dead_tuples, but *don't* have a bunch of
things that really are stats (it seems to be quite random). Calling
the struct LVRelStats was always dubious.

* Setting vacrel->lps = NULL "for now" when we already did palloc0 at
allocation time seems counterproductive.

Okay, will fix.

* The code associated with the comment block that says "Initialize
state for a parallel vacuum" has been moved inside lazy_space_alloc().
That doesn't seem like an especially good choice, because no casual
reader is going to expect a function called lazy_space_alloc() to be
entering parallel mode and so forth as a side effect. Also, the call
to lazy_space_alloc() still has a comment that says "Allocate the
space for dead tuples in case parallel vacuum is not initialized."
even though the ParallelVacuumIsActive() check has been removed and
the function now does a lot more than allocating space.

Will fix.

* lazy_scan_heap() removes the comment which begins "Note that
vacrelstats->dead_tuples could have tuples which became dead after
HOT-pruning but are not marked dead yet." But IIUC that special case
is removed by a later patch, not 0001, in which case it is that patch
that should be touching this comment.

Will fix.

Regarding 0002:

* It took me a while to understand why lazy_scan_new_page() and
lazy_scan_empty_page() are named the way they are. I'm not sure
exactly what would be better, so I am not necessarily saying I think
you have to change anything, but for the record I think this naming
sucks.

I agree -- it's dreadful.

The reason we have "lazy" in here, AFAIU, is because originally
we only had old-style VACUUM FULL, and that was the good hard-working
VACUUM, and what we now think of as VACUUM was the "lazy" version that
didn't really do the whole job. Then we decided it was the
hard-working version that actually sucked and we always wanted to be
lazy (or else rewrite the table). So now we have all of these
functions named "lazy" which are really just functions to do "vacuum".

FWIW I always thought that the terminology was lifted from the world
of garbage collection. There is a thing called a lazy sweep algorithm.
Isn't vacuuming very much like sweeping? There are also mark-sweep
garbage collection algorithms that take two passes, one phase
variants, etc.

In general the world of garbage collection has some ideas that might
be worth pilfering for ideas. It's not all that relevant to our world,
and a lot of it is totally irrelevant, but there is enough overlap for
it to interest me. Though GC is such a vast and complicated world that
it's difficult to know where to begin. I own a copy of the book
"Garbage Collection: Algorithms for Automatic Dynamic Memory
Management". Most of it goes over my head, but I have a feeling that
I'll get my $20 worth at some point.

If, for example, we could agree that the whole thing is vacuum and the first
time we touch the heap page that's the strawberry phase and then the
second time we touch it that's the rhubarb phase, then we could have
vacuum_strawberry_page(), vacuum_strawberry_new_page(),
vacuum_rhubarb_phase(), etc. and everything would be a lot clearer,
assuming that you replaced the words "strawberry" and "rhubarb" with
something actually meaningful. But that seems hard. I thought about
suggesting that the word for strawberry should be "prune", but it does
more than that. I thought about suggesting that either the word for
strawberry or the word for rhubarb should be "cleanup," but that's
another word that is already confusingly overloaded. So I don't know.

Maybe we should just choose a novel name that isn't exactly
descriptive but is at least distinct and memorable.

I think that the word for strawberry should be "prune". This isn't
100% accurate because it reduces the first phase to pruning. But it is
a terminology that has verisimilitude, which is no small thing. The
fact is that pruning is pretty much the point of the first phase
(freezing is too, but freezing is quite specifically only considered
for non-pruned items, so it doesn't undermine my point
much). If we called the first/strawberry pass over the heap pruning or
"the prune phase" then we'd have something much more practical and
less confusing than any available alternative that I can think of.
Plus it would still be fruit-based.

I think that our real problem is with Rhubarb. I hate the use of the
terminology "heap vacuum" in the context of the second phase/pass.
Whatever terminology we use, we should frame the second phase as being
mostly about recycling LP_DEAD line pointers by turning them into
LP_UNUSED line pointers. We are recycling the space for "cells" that
get logically freed in the first phase (both in indexes, and finally
in the heap).

I like the idea of framing the first phase as being concerned with the
logical database, while the second phase (which includes index
vacuuming and heap vacuuming) is concerned only with physical data
structures (so it's much dumber than the first). That's only ~99% true
today, but the third/"tupgone gone" patch will make it 100% true.

* But all that having been said, it's easy to get confused and think
that lazy_scan_new_page() is scanning a new page for lazy vacuum, but
in fact it's the new-page handler for the scan phase of lazy vacuum,
and it doesn't scan anything at all. If there's a way to avoid that
kind of confusion, +1 from me.

This is another case where I need to see it the other way.

* One possibility is that maybe it's not such a great idea to put this
logic in its own function. I'm rather suspicious on principle of
functions that are called with a locked or pinned buffer and release
the lock or pin before returning. It suggests that the abstraction is
not very clean.

I am sympathetic here. In fact most of those functions were added at
the suggestion of Andres. I think that they're fine, but it's
reasonable to wonder if we're coming out ahead by having all but one
of them (lazy_scan_prune()). The reality is that they share state
fairly promiscuously, so I'm not really hiding complexity. The whole
idea here should be to remove inessential complexity in how we
represent and consume state.

However, it's totally different in the case of the one truly important
function among this group of new lazy_scan_heap() functions,
lazy_scan_prune(). It seems like a *huge* improvement to me. The
obvious advantage of having that function is that calling that
function can be considered a shorthand for "a blkno loop iteration
that actually does real work". Everything else in the calling loop is
essentially either preamble or postscript to lazy_scan_prune(), since
we don't actually need to set VM bits, or to skip heap blocks, or to
save space in the FSM. I think that that's a big difference.

There is a slightly less obvious advantage, too. It's clear that the
function as written actually does do a good job of reducing
state-related complexity, because it effectively returns a
LVPagePruneState (we pass a pointer but nothing gets initialized
before the call to lazy_scan_prune()). So now it's really obvious what
state is managed by pruning/freezing, and it's obvious what follows
from that when we return control to lazy_scan_heap(). This ties
together with my first point about pruning being the truly important
piece of work. That really does hide complexity rather well,
especially compared to the other new functions from the second patch.
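
Concretely, the caller now boils down to something like this (a sketch
of what the call site looks like in the patch):

		LVPagePruneState prunestate;

		/* prunestate is pure output -- nothing in it is set before the call */
		lazy_scan_prune(vacrel, buf, blkno, page, vistest, &prunestate,
						params->index_cleanup);

		/* Remember the location of the last page with nonremovable tuples */
		if (prunestate.hastup)
			vacrel->nonempty_pages = blkno + 1;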

* The new comment added which begins "Even if we skipped heap vacuum,
..." is good, but perhaps it could be more optimistic. It seems to me
that it's not just that it *could* be worthwhile because we *could*
have updated freespace, but that those things are in fact probable.

Will fix.

* I'm not really a huge fan of comments that include step numbers,
because they tend to cause future patches to have to change a bunch of
comments every time somebody adds a new step, or, less commonly,
removes an old one. I would suggest revising the comments you've added
that say things like "Step N for block: X" to just "X". I do like the
comment additions, just not the attributing of specific numbers to
specific steps.

I see your point.

I added the numbers because I wanted the reader to notice a parallel
construction among these related high-level comments. I wanted each to
act as a bullet point that frames both code and related interspersed
low-level comments (without the risk of it looking like just another
low-level comment).

The numbers are inessential. Maybe I could do "Next step: " at the
start of each comment instead. Leave it with me.

* As in 0001, core logical changes are obscured by moving code and
changing it in the same patch. All this logic gets moved into
lazy_scan_prune() and revised at the same time. Using git diff
--color-moved -w sorta works, but even then there are parts of it that
are pretty hard to read, because there's a bunch of other stuff that
gets rejiggered at the same time.

Theoretically, nothing really changes until the third patch, except
for the way that we do the INDEX_CLEANUP=off stuff.

What you say about 0001 is understandable and not that surprising. But
FWIW 0002 doesn't move code around in the same way as 0001 -- it is
not nearly as mechanical. Like I said, let me see if the lazy_scan_*
functions from 0002 are adding much (aside from lazy_scan_prune(), of
course). To be honest they were a bit of an afterthought --
lazy_scan_prune() was my focus in 0002.

I can imagine adding a lot more stuff to lazy_scan_prune() in the
future too. For example, maybe we can do more freezing earlier based
on whether or not we can avoid dirtying the page by not freezing. The
structure of lazy_scan_prune() can do stuff like that because it gets
to see the page before and after both pruning and freezing -- and with
the retry stuff in 0003, it can back out of an earlier decision to not
freeze as soon as it realizes the page is getting dirtied either way.
Also, I think we might end up collecting more complicated information
that informs our eventual decision about whether or not indexes need
to be vacuumed.

My concentration is flagging a bit so I'm going to stop reviewing here
for now. I'm not deeply opposed to any of what I've seen so far. My
main criticism is that I think more thought should be given to how
things are named and to separating minimal code-movement patches from
other changes.

That seems totally reasonable.

Thanks again!

--
Peter Geoghegan

#94Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#91)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Mar 31, 2021 at 4:45 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Both 0001 and 0002 patch refactors the whole lazy vacuum code. Can we
merge them? I basically agree with the refactoring made by 0001 patch
but I'm concerned a bit that having such a large refactoring at very
close to feature freeze could be risky. We would need more eyes to
review during stabilization.

I think that Robert makes some related points about how we might cut
scope here. So I'll definitely do some of that, maybe all of it.

I think it's more clear to use this macro. The macro can be like this:

ParallelVacuumIsActive(vacrel) (((LVRelState *) vacrel)->lps != NULL)

Yes, that might be better. I'll consider it when I get back to the
patch tomorrow.
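
Something like this, I assume (a sketch only; the cast probably isn't
even needed if callers always pass an LVRelState pointer):

#define ParallelVacuumIsActive(vacrel) (((LVRelState *) (vacrel))->lps != NULL)

	/* call sites then read: */
	if (ParallelVacuumIsActive(vacrel))
	{
		/* ... parallel index vacuuming/cleanup path ... */
	}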

+ * LVDeadTuples stores TIDs that are gathered during pruning/the initial heap
+ * scan.  These get deleted from indexes during index vacuuming.  They're then
+ * removed from the heap during a second heap pass that performs heap
+ * vacuuming.
*/

The second sentence of the removed lines still seems to be useful
information for readers?

I don't think that the stuff about shared memory was useful, really.
If we say something like this then it should be about the LVRelState
pointer, not the struct.

- * We do not process them because it's
a very rare condition,
- * and the next vacuum will process them anyway.

Maybe the above comments should not be removed by 0001 patch.

Right.

Looking at the comments, I thought that this function also frees the
palloc'd dead tuple space, but it doesn't. It seems clearer to either
do pfree(vacrel->dead_tuples) here or to not create lazy_space_free()
at all.

I'll need to think about this some more.

---
+       if (shared_istat)
+       {
+               /* Get the space for IndexBulkDeleteResult */
+               bulkdelete_res = &(shared_istat->istat);
+
+               /*
+                * Update the pointer to the corresponding bulk-deletion result if
+                * someone has already updated it.
+                */
+               if (shared_istat->updated && istat == NULL)
+                       istat = bulkdelete_res;
+       }

(snip)

+       if (shared_istat && !shared_istat->updated && istat != NULL)
+       {
+               memcpy(bulkdelete_res, istat, sizeof(IndexBulkDeleteResult));
+               shared_istat->updated = true;
+
+               /*
+                * Now that top-level indstats[idx] points to the DSM segment, we
+                * don't need the locally allocated results.
+                */
+               pfree(istat);
+               istat = bulkdelete_res;
+       }
+
+       return istat;

If we have parallel_process_one_index() return the address of
IndexBulkDeleteResult, we can simplify the first part above. Also, it
seems better to use a separate variable from istat to store the
result. How about the following structure?

I'll try it that way and see how it goes.
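
For reference, I read the suggestion as something like this (a sketch
only -- 'istat_res' is just an illustrative name for the separate
variable, and the index AM call is elided):

	IndexBulkDeleteResult *istat_res = istat;

	/* Use the DSM copy when another worker already filled it in */
	if (shared_istat && shared_istat->updated && istat_res == NULL)
		istat_res = &shared_istat->istat;

	/* ... do the actual ambulkdelete()/amvacuumcleanup() call here ... */

	/* First result for this index: move it into the DSM segment */
	if (shared_istat && !shared_istat->updated && istat_res != NULL)
	{
		memcpy(&shared_istat->istat, istat_res, sizeof(IndexBulkDeleteResult));
		shared_istat->updated = true;

		/* No need for the locally allocated result anymore */
		pfree(istat_res);
		istat_res = &shared_istat->istat;
	}

	return istat_res;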

+           /* This won't have changed: */
+           Assert(savefreespace && freespace == PageGetHeapFreeSpace(page));

This assertion can be false because freespace can be 0 if the page's
PD_HAS_FREE_LINES hint is wrong. Since lazy_vacuum_heap_page() fixes
it, PageGetHeapFreeSpace(page) in the assertion can return a non-zero
value.

Good catch, I'll fix it.

The first vacrel->relname should be vacrel->relnamespace.

Will fix.

I think we can use errmsg_plural() for the "X index scans" part.

Yeah, I think that that would be more consistent.

We should use vacrel->lpdead_item_pages instead of vacrel->rel_pages

Will fix. I was mostly focussed on the log_autovacuum version, which
is why it looks nice already.

---
+               /* Stop applying cost limits from this point on */
+               VacuumCostActive = false;
+               VacuumCostBalance = 0;
+       }

I agree with the idea of disabling vacuum delay in emergency cases.
But why do we do that only in the case of a table with indexes? I
think this optimization is helpful even for a table with no indexes.
Couldn't we check for the XID wraparound emergency by calling
vacuum_xid_limit_emergency() at some point, and disable vacuum delay
there as well?

Hmm. I see your point, but at the same time I think that the risk is
lower on a table that has no indexes. It may be true that index
vacuuming doesn't necessarily take the majority of all of the work in
lots of cases. But I think that it is true that it does when things
get very bad -- one-pass/no indexes VACUUM does not care about
maintenance_work_mem, etc.

But let me think about it...I suppose we could do it when one-pass
VACUUM considers vacuuming a range of FSM pages every
VACUUM_FSM_EVERY_PAGES. That's kind of similar to index vacuuming, in
a way -- it wouldn't be too bad to check for emergencies in the same
way there.
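
Roughly like this, I mean (a sketch only; the emergency check helper
is the one you suggested upthread, so its name and arguments aren't
final):

			/*
			 * Periodically do incremental FSM vacuuming.  In the one-pass
			 * (no indexes) case this is also a convenient point to check
			 * whether we've entered XID wraparound emergency territory.
			 */
			if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
			{
				if (vacuum_xid_limit_emergency(vacrel))
				{
					/* Stop applying cost limits from this point on */
					VacuumCostActive = false;
					VacuumCostBalance = 0;
				}

				FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
										blkno);
				next_fsm_block_to_vacuum = blkno;
			}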

Both vacrel->do_index_vacuuming and vacrel->do_index_cleanup can also
be false when INDEX_CLEANUP is off. So autovacuum could wrongly report
an emergency if the table's vacuum_index_cleanup reloption is false.

Good point. I will need to account for that so that log_autovacuum's
LOG message does the right thing. Perhaps for other reasons, too.

Thanks for the review!
--
Peter Geoghegan

#95Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#94)
Re: New IndexAM API controlling index vacuum strategies

On Thu, Apr 1, 2021 at 9:58 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Mar 31, 2021 at 4:45 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Both 0001 and 0002 patch refactors the whole lazy vacuum code. Can we
merge them? I basically agree with the refactoring made by 0001 patch
but I'm concerned a bit that having such a large refactoring at very
close to feature freeze could be risky. We would need more eyes to
review during stabilization.

I think that Robert makes some related points about how we might cut
scope here. So I'll definitely do some of that, maybe all of it.

I think it's more clear to use this macro. The macro can be like this:

ParallelVacuumIsActive(vacrel) (((LVRelState *) vacrel)->lps != NULL)

Yes, that might be better. I'll consider it when I get back to the
patch tomorrow.

+ * LVDeadTuples stores TIDs that are gathered during pruning/the initial heap
+ * scan.  These get deleted from indexes during index vacuuming.  They're then
+ * removed from the heap during a second heap pass that performs heap
+ * vacuuming.
*/

The second sentence of the removed lines still seems to be useful
information for readers?

I don't think that the stuff about shared memory was useful, really.
If we say something like this then it should be about the LVRelState
pointer, not the struct.

Understood.

---
+               /* Stop applying cost limits from this point on */
+               VacuumCostActive = false;
+               VacuumCostBalance = 0;
+       }

I agree with the idea of disabling vacuum delay in emergency cases.
But why do we do that only in the case of a table with indexes? I
think this optimization is helpful even for a table with no indexes.
Couldn't we check for the XID wraparound emergency by calling
vacuum_xid_limit_emergency() at some point, and disable vacuum delay
there as well?

Hmm. I see your point, but at the same time I think that the risk is
lower on a table that has no indexes. It may be true that index
vacuuming doesn't necessarily take the majority of all of the work in
lots of cases. But I think that it is true that it does when things
get very bad -- one-pass/no indexes VACUUM does not care about
maintenance_work_mem, etc.

Agreed.

But let me think about it...I suppose we could do it when one-pass
VACUUM considers vacuuming a range of FSM pages every
VACUUM_FSM_EVERY_PAGES. That's kind of similar to index vacuuming, in
a way -- it wouldn't be too bad to check for emergencies in the same
way there.

Yeah, I also thought that would be a good place to check for
emergencies. That sounds reasonable.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#96Robert Haas
robertmhaas@gmail.com
In reply to: Masahiko Sawada (#95)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Mar 31, 2021 at 9:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

But let me think about it...I suppose we could do it when one-pass
VACUUM considers vacuuming a range of FSM pages every
VACUUM_FSM_EVERY_PAGES. That's kind of similar to index vacuuming, in
a way -- it wouldn't be too bad to check for emergencies in the same
way there.

Yeah, I also thought that would be a good place to check for
emergencies. That sounds reasonable.

Without offering an opinion on this particular implementation choice,
+1 for the idea of trying to make the table-with-indexes and the
table-without-indexes cases work in ways that will feel similar to the
user. Tables without indexes are probably rare in practice, but if
some behaviors are implemented for one case and not the other, it will
probably be confusing. One thought here is that it might help to try
to write documentation for whatever behavior you choose. If it's hard
to document without weasel-words, maybe it's not the best approach.

--
Robert Haas
EDB: http://www.enterprisedb.com

#97Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#96)
Re: New IndexAM API controlling index vacuum strategies

On Thu, Apr 1, 2021 at 6:14 AM Robert Haas <robertmhaas@gmail.com> wrote:

Without offering an opinion on this particular implementation choice,
+1 for the idea of trying to make the table-with-indexes and the
table-without-indexes cases work in ways that will feel similar to the
user. Tables without indexes are probably rare in practice, but if
some behaviors are implemented for one case and not the other, it will
probably be confusing. One thought here is that it might help to try
to write documentation for whatever behavior you choose. If it's hard
to document without weasel-words, maybe it's not the best approach.

I have found a way to do this that isn't too painful, a little like
the VACUUM_FSM_EVERY_PAGES thing.

I've also found a way to further simplify the table-without-indexes
case: make it behave like a regular two-pass/has-indexes VACUUM with
regard to visibility map stuff when the page doesn't need a call to
lazy_vacuum_heap() (because there are no LP_DEAD items to set
LP_UNUSED on the page following pruning). But when it does call
lazy_vacuum_heap(), the call takes care of everything for
lazy_scan_heap(), which just continues to the next page due to
considering prunestate to have been "invalidated" by the call to
lazy_vacuum_heap(). So there is absolutely minimal special case code
for the table-without-indexes case now.

BTW I removed all of the lazy_scan_heap() utility functions from the
second patch in my working copy of the patch series. You were right
about that -- they weren't useful. We should just have the pruning
wrapper function I've called lazy_scan_prune(), not any of the others.
We only need one local variable in the lazy_vacuum_heap() that isn't
either the prune state set/returned by lazy_scan_prune(), or generic
stuff like a Buffer variable.

--
Peter Geoghegan

#98Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#97)
5 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Thu, Apr 1, 2021 at 8:25 PM Peter Geoghegan <pg@bowt.ie> wrote:

I've also found a way to further simplify the table-without-indexes
case: make it behave like a regular two-pass/has-indexes VACUUM with
regard to visibility map stuff when the page doesn't need a call to
lazy_vacuum_heap() (because there are no LP_DEAD items to set
LP_UNUSED on the page following pruning). But when it does call
lazy_vacuum_heap(), the call takes care of everything for
lazy_scan_heap(), which just continues to the next page due to
considering prunestate to have been "invalidated" by the call to
lazy_vacuum_heap(). So there is absolutely minimal special case code
for the table-without-indexes case now.

Attached is v10, which simplifies the one-pass/table-without-indexes
VACUUM as described.

Other changes (some of which are directly related to the
one-pass/table-without-indexes refactoring):

* The second patch no longer breaks up lazy_scan_heap() into multiple
functions -- we only retain the lazy_scan_prune() function, which is
the one that I find very compelling.

This addresses Robert's concern about the functions -- I think that
it's much better this way, now that I see it.

* No more diff churn in the first patch. This was another concern held
by Robert, as well as by Masahiko.

In general both the first and second patches are much easier to follow now.

* The emergency mechanism is now able to kick in when we happen to be
doing a one-pass/table-without-indexes VACUUM -- no special
cases/"weasel words" are needed.

* Renamed "onerel" to "rel" in the first patch, per Robert's suggestion.

* Fixed various specific issues raised by Masahiko's review,
particularly in the first patch and last patch in the series.

Finally, there is a new patch added to the series in v10:

* I now include a modified version of Matthias van de Meent's line
pointer truncation patch [1]/messages/by-id/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com.

Matthias' patch seems very much in scope here. The broader patch
series establishes the principle that we can leave LP_DEAD line
pointers in an unreclaimed state indefinitely, without consequence
(beyond the obvious). We had better avoid line pointer bloat that
cannot be reversed when VACUUM does eventually get around to doing a
second pass over the heap. This is another case where it seems prudent
to keep the costs understandable/linear -- page-level line pointer
bloat seems like a cost that increases in a non-linear fashion, which
undermines the whole idea of modelling when it's okay to skip
index/heap vacuuming. (Also, line pointer bloat sucks.)

Line pointer truncation doesn't happen during pruning, as it did in
Matthias' original patch. In this revised version, line pointer
truncation occurs during the second phase of VACUUM. There are several
reasons to prefer this approach. It seems both safer and more useful
that way (compared to the approach of doing line pointer truncation
during pruning). It also makes intuitive sense to do it this way, at
least to me -- the second pass over the heap is supposed to be for
"freeing" LP_DEAD line pointers.

Many workloads rely heavily on opportunistic pruning. With a workload
that benefits a lot from HOT (e.g. pgbench with heap fillfactor
reduced to 90), there are many LP_UNUSED line pointers, even though we
may never have a VACUUM that actually performs a second heap pass
(because LP_DEAD items cannot accumulate in heap pages). Prior to the
HOT commit in 2007, LP_UNUSED line pointers were strictly something
that VACUUM created from dead tuples. It seems to me that we should
only target the latter "category" of LP_UNUSED line pointers when
considering truncating the array -- we ought to leave pruning
(especially opportunistic pruning that takes place outside of VACUUM)
alone.

(That reminds me -- the second patch now makes VACUUM VERBOSE stop
reporting LP_UNUSED items, because it is so utterly
misleading/confusing -- it now reports on LP_DEAD items instead, which
will bring things in line with log_autovacuum output once the last
patch in the series is in. This is arguably an oversight in the HOT
commit made back in 2007 -- that work kind of created a second
distinct category of LP_UNUSED item that really is totally different,
but it didn't account for how that makes stats about LP_UNUSED items
impossible to reason about.)

Doing truncation during VACUUM's second heap pass this way also makes
the line pointer truncation mechanism more effective. The problem with
truncating the LP array during pruning is that we more or less never
prune when the page is 100% (not ~99%) free of non-LP_UNUSED items --
which is actually the most compelling case for line pointer array
truncation! You can see this with a workload that consists of
alternating range deletions and bulk insertions that reuse the same
space -- think of a queue pattern, or TPC-C's new_orders table. Under
this scheme, we catch that extreme (though important) case every time
-- because we consider LP_UNUSED items immediately after they become
LP_UNUSED.

[1]: /messages/by-id/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com
--
Peter Geoghegan

Attachments:

v10-0002-Refactor-lazy_scan_heap.patch (application/octet-stream)
From 60daf3d71ea5d7ed2f970fb5fef5ae826fc9cbe1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 28 Mar 2021 20:55:55 -0700
Subject: [PATCH v10 2/5] Refactor lazy_scan_heap().

Break lazy_scan_heap() up into several new subsidiary functions.  The
largest and most important new subsidiary function handles heap pruning
and tuple freezing.  This is preparation for an upcoming patch to remove
the "tupgone" special case from vacuumlazy.c.

Also cleanly separate the logic used by a VACUUM with INDEX_CLEANUP=off
from the logic used by single-heap-pass VACUUMs.  The former case is now
structured as the omission of index and heap vacuuming by a two pass
VACUUM.  The latter case goes back to being used only when the table
happens to have no indexes.  This is simpler and more natural -- the
whole point of INDEX_CLEANUP=off is to skip the index and heap vacuuming
that would otherwise take place.  The single-heap-pass case doesn't skip
anything, though -- it just does heap vacuuming in the same single pass
over the heap as pruning (which is only safe with a table that happens
to have no indexes).

Also fix a very old bug in single-pass VACUUM VERBOSE output.  We were
reporting the number of tuples deleted via pruning as a direct
substitute for reporting the number of LP_DEAD items removed in a
function that deals with the second pass over the heap.  But that
doesn't work at all -- they're two different things.

To fix, start tracking the total number of LP_DEAD items encountered
during pruning, and use that in the report instead.  A single pass
VACUUM will always vacuum away whatever LP_DEAD items a heap page has
immediately after it is pruned, so the total number of LP_DEAD items
encountered during pruning equals the total number vacuumed-away.
(They are _not_ equal in the INDEX_CLEANUP=off case, but that's okay
because skipping index vacuuming is now a totally orthogonal concept to
one-pass VACUUM.)

Also stop reporting empty_pages in VACUUM VERBOSE output, and start
reporting pages_removed instead.  This makes the output of VACUUM
VERBOSE more consistent with log_autovacuum's output (which does not
show empty_pages, but does show pages_removed).  The empty_pages item
doesn't seem very useful.

Also stop reporting the count of LP_UNUSED items in VACUUM VERBOSE
output, and start reporting the total number of LP_DEAD items
encountered during pruning instead.  Again, this makes the output of
VACUUM VERBOSE more consistent with log_autovacuum's output (which does
not show the count of unused items) -- a later commit will teach
log_autovacuum to display the count of LP_DEAD items in about the same
way.  It was impossible to sensibly interpret the count of LP_UNUSED
items after the introduction of HOT by commit 282d2a03 because pruning
can create new LP_UNUSED items, so the reported figure is an unnatural
mix of LP_UNUSED items left behind by the last VACUUM and LP_UNUSED
items created during the ongoing VACUUM.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
---
 src/backend/access/heap/vacuumlazy.c | 1093 +++++++++++++++-----------
 1 file changed, 650 insertions(+), 443 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 6bd409c095..5dc9ab404b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -296,8 +296,9 @@ typedef struct LVRelState
 	Relation	rel;
 	Relation   *indrels;
 	int			nindexes;
-	/* useindex = true means two-pass strategy; false means one-pass */
-	bool		useindex;
+	/* Do index vacuuming/cleanup? */
+	bool		do_index_vacuuming;
+	bool		do_index_cleanup;
 
 	/* Buffer access strategy and parallel state */
 	BufferAccessStrategy bstrategy;
@@ -335,6 +336,7 @@ typedef struct LVRelState
 	BlockNumber frozenskipped_pages;	/* # of frozen pages we skipped */
 	BlockNumber tupcount_pages; /* pages whose tuples we counted */
 	BlockNumber pages_removed;	/* pages remove by truncation */
+	BlockNumber lpdead_item_pages;	/* # pages with LP_DEAD items */
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	bool		lock_waiter_detected;
 
@@ -347,12 +349,31 @@ typedef struct LVRelState
 	/* Instrumentation counters */
 	int			num_index_scans;
 	int64		tuples_deleted; /* # deleted from table */
+	int64		lpdead_items;	/* # deleted from indexes */
 	int64		new_dead_tuples;	/* new estimated total # of dead items in
 									 * table */
 	int64		num_tuples;		/* total number of nonremovable tuples */
 	int64		live_tuples;	/* live tuples (reltuples estimate) */
 } LVRelState;
 
+/*
+ * State returned by lazy_scan_prune()
+ */
+typedef struct LVPagePruneState
+{
+	bool		hastup;			/* Page is truncatable? */
+	bool		has_lpdead_items;	/* includes existing LP_DEAD items */
+
+	/*
+	 * State describes the proper VM bit states to set for the page following
+	 * pruning and freezing.  all_visible implies !has_lpdead_items, but don't
+	 * trust all_frozen result unless all_visible is also set to true.
+	 */
+	bool		all_visible;	/* Every item visible to all? */
+	bool		all_frozen;		/* provided all_visible is also true */
+	TransactionId visibility_cutoff_xid;	/* For recovery conflicts */
+} LVPagePruneState;
+
 /* Struct for saving and restoring vacuum error information. */
 typedef struct LVSavedErrInfo
 {
@@ -368,6 +389,12 @@ static int	elevel = -1;
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
 						   bool aggressive);
+static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
+							BlockNumber blkno, Page page,
+							GlobalVisState *vistest,
+							LVPagePruneState *prunestate,
+							VacOptTernaryValue index_cleanup);
+static void lazy_vacuum(LVRelState *vacrel);
 static void lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void lazy_vacuum_heap_rel(LVRelState *vacrel);
 static int	lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
@@ -404,8 +431,6 @@ static long compute_max_dead_tuples(BlockNumber relblocks, bool hasindex);
 static void lazy_space_alloc(LVRelState *vacrel, int nworkers,
 							 BlockNumber relblocks);
 static void lazy_space_free(LVRelState *vacrel);
-static void lazy_record_dead_tuple(LVDeadTuples *dead_tuples,
-								   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
 static int	vac_cmp_itemptr(const void *left, const void *right);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
@@ -519,8 +544,13 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->rel = rel;
 	vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
 					 &vacrel->indrels);
-	vacrel->useindex = (vacrel->nindexes > 0 &&
-						params->index_cleanup == VACOPT_TERNARY_ENABLED);
+	vacrel->do_index_vacuuming = true;
+	vacrel->do_index_cleanup = true;
+	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+	{
+		vacrel->do_index_vacuuming = false;
+		vacrel->do_index_cleanup = false;
+	}
 	vacrel->bstrategy = bstrategy;
 	vacrel->old_rel_pages = rel->rd_rel->relpages;
 	vacrel->old_live_tuples = rel->rd_rel->reltuples;
@@ -811,8 +841,8 @@ vacuum_log_cleanup_info(LVRelState *vacrel)
  *		lists of dead tuples and pages with free space, calculates statistics
  *		on the number of live tuples in the heap, and marks pages as
  *		all-visible if appropriate.  When done, or when we run low on space
- *		for dead-tuple TIDs, invoke vacuuming of indexes and reclaim dead line
- *		pointers.
+ *		for dead-tuple TIDs, invoke lazy_vacuum to vacuum indexes and vacuum
+ *		heap relation during its own second pass over the heap.
  *
  *		If the table has at least two indexes, we execute both index vacuum
  *		and index cleanup with parallel workers unless parallel vacuum is
@@ -835,22 +865,12 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 {
 	LVDeadTuples *dead_tuples;
 	BlockNumber nblocks,
-				blkno;
-	HeapTupleData tuple;
-	BlockNumber empty_pages,
-				vacuumed_pages,
+				blkno,
+				next_unskippable_block,
 				next_fsm_block_to_vacuum;
-	double		num_tuples,		/* total number of nonremovable tuples */
-				live_tuples,	/* live tuples (reltuples estimate) */
-				tups_vacuumed,	/* tuples cleaned up by current vacuum */
-				nkeep,			/* dead-but-not-removable tuples */
-				nunused;		/* # existing unused line pointers */
-	int			i;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
-	BlockNumber next_unskippable_block;
 	bool		skipping_blocks;
-	xl_heap_freeze_tuple *frozen;
 	StringInfoData buf;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
@@ -873,23 +893,23 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 						vacrel->relnamespace,
 						vacrel->relname)));
 
-	empty_pages = vacuumed_pages = 0;
-	next_fsm_block_to_vacuum = (BlockNumber) 0;
-	num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
-
 	nblocks = RelationGetNumberOfBlocks(vacrel->rel);
+	next_unskippable_block = 0;
+	next_fsm_block_to_vacuum = 0;
 	vacrel->rel_pages = nblocks;
 	vacrel->scanned_pages = 0;
 	vacrel->pinskipped_pages = 0;
 	vacrel->frozenskipped_pages = 0;
 	vacrel->tupcount_pages = 0;
 	vacrel->pages_removed = 0;
+	vacrel->lpdead_item_pages = 0;
 	vacrel->nonempty_pages = 0;
 	vacrel->lock_waiter_detected = false;
 
 	/* Initialize instrumentation counters */
 	vacrel->num_index_scans = 0;
 	vacrel->tuples_deleted = 0;
+	vacrel->lpdead_items = 0;
 	vacrel->new_dead_tuples = 0;
 	vacrel->num_tuples = 0;
 	vacrel->live_tuples = 0;
@@ -906,7 +926,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	 */
 	lazy_space_alloc(vacrel, params->nworkers, nblocks);
 	dead_tuples = vacrel->dead_tuples;
-	frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
 
 	/* Report that we're scanning the heap, advertising total # of blocks */
 	initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
@@ -958,7 +977,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	 * the last page.  This is worth avoiding mainly because such a lock must
 	 * be replayed on any hot standby, where it can be disruptive.
 	 */
-	next_unskippable_block = 0;
 	if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
 	{
 		while (next_unskippable_block < nblocks)
@@ -992,20 +1010,13 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	{
 		Buffer		buf;
 		Page		page;
-		OffsetNumber offnum,
-					maxoff;
-		bool		tupgone,
-					hastup;
-		int			prev_dead_count;
-		int			nfrozen;
-		Size		freespace;
 		bool		all_visible_according_to_vm = false;
-		bool		all_visible;
-		bool		all_frozen = true;	/* provided all_visible is also true */
-		bool		has_dead_items;		/* includes existing LP_DEAD items */
-		TransactionId visibility_cutoff_xid = InvalidTransactionId;
+		LVPagePruneState prunestate;
 
-		/* see note above about forcing scanning of last page */
+		/*
+		 * Consider need to skip blocks.  See note above about forcing
+		 * scanning of last page.
+		 */
 #define FORCE_CHECK_PAGE() \
 		(blkno == nblocks - 1 && should_attempt_truncation(vacrel, params))
 
@@ -1092,8 +1103,10 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		vacuum_delay_point();
 
 		/*
-		 * If we are close to overrunning the available space for dead-tuple
-		 * TIDs, pause and do a cycle of vacuuming before we tackle this page.
+		 * Consider if we definitely have enough space to process TIDs on page
+		 * already.  If we are close to overrunning the available space for
+		 * dead-tuple TIDs, pause and do a cycle of vacuuming before we tackle
+		 * this page.
 		 */
 		if ((dead_tuples->max_tuples - dead_tuples->num_tuples) < MaxHeapTuplesPerPage &&
 			dead_tuples->num_tuples > 0)
@@ -1110,18 +1123,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				vmbuffer = InvalidBuffer;
 			}
 
-			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(vacrel);
-
-			/* Remove tuples from heap */
-			lazy_vacuum_heap_rel(vacrel);
-
-			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
-			 */
-			dead_tuples->num_tuples = 0;
+			/* Remove the collected garbage tuples from table and indexes */
+			lazy_vacuum(vacrel);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
@@ -1137,6 +1140,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		}
 
 		/*
+		 * Set up visibility map page as needed.
+		 *
 		 * Pin the visibility map page in case we need to mark the page
 		 * all-visible.  In most cases this will be very cheap, because we'll
 		 * already have the correct page pinned anyway.  However, it's
@@ -1149,9 +1154,14 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
 								 RBM_NORMAL, vacrel->bstrategy);
 
-		/* We need buffer cleanup lock so that we can prune HOT chains. */
+		/*
+		 * We need buffer cleanup lock so that we can prune HOT chains and
+		 * defragment the page.
+		 */
 		if (!ConditionalLockBufferForCleanup(buf))
 		{
+			bool		hastup;
+
 			/*
 			 * If we're not performing an aggressive scan to guard against XID
 			 * wraparound, and we don't want to forcibly check the page, then
@@ -1208,6 +1218,16 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			/* drop through to normal processing */
 		}
 
+		/*
+		 * By here we definitely have enough dead_tuples space for whatever
+		 * LP_DEAD tids are on this page, we have the visibility map page set
+		 * up in case we need to set this page's all_visible/all_frozen bit,
+		 * and we have a super-exclusive lock.  Any tuples on this page are
+		 * now considered "counted".
+		 *
+		 * One last piece of preamble needs to take place before we can prune:
+		 * we need to consider new and empty pages.
+		 */
 		vacrel->scanned_pages++;
 		vacrel->tupcount_pages++;
 
@@ -1236,13 +1256,10 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 */
 			UnlockReleaseBuffer(buf);
 
-			empty_pages++;
-
 			if (GetRecordedFreeSpace(vacrel->rel, blkno) == 0)
 			{
-				Size		freespace;
+				Size		freespace = BLCKSZ - SizeOfPageHeaderData;
 
-				freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
 				RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
 			}
 			continue;
@@ -1250,8 +1267,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 
 		if (PageIsEmpty(page))
 		{
-			empty_pages++;
-			freespace = PageGetHeapFreeSpace(page);
+			Size		freespace = PageGetHeapFreeSpace(page);
 
 			/*
 			 * Empty pages are always all-visible and all-frozen (note that
@@ -1291,326 +1307,40 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		}
 
 		/*
-		 * Prune all HOT-update chains in this page.
+		 * Prune and freeze tuples.
 		 *
-		 * We count tuples removed by the pruning step as removed by VACUUM
-		 * (existing LP_DEAD line pointers don't count).
+		 * Accumulates details of remaining LP_DEAD line pointers on page in
+		 * dead tuple list.  This includes LP_DEAD line pointers that we
+		 * pruned ourselves, as well as existing LP_DEAD line pointers that
+		 * were pruned some time earlier.
+		 *
+		 * This also handles tuple freezing, which is closely related to
+		 * pruning.  Considers freezing XIDs in tuple headers from items not
+		 * made LP_DEAD by pruning.
 		 */
-		tups_vacuumed += heap_page_prune(vacrel->rel, buf, vistest,
-										 InvalidTransactionId, 0, false,
-										 &vacrel->latestRemovedXid,
-										 &vacrel->offnum);
+		lazy_scan_prune(vacrel, buf, blkno, page, vistest, &prunestate,
+						params->index_cleanup);
+
+		/* Remember the location of the last page with nonremovable tuples */
+		if (prunestate.hastup)
+			vacrel->nonempty_pages = blkno + 1;
 
 		/*
-		 * Now scan the page to collect vacuumable items and check for tuples
-		 * requiring freezing.
+		 * Consider heap vacuuming for one pass strategy
 		 */
-		all_visible = true;
-		has_dead_items = false;
-		nfrozen = 0;
-		hastup = false;
-		prev_dead_count = dead_tuples->num_tuples;
-		maxoff = PageGetMaxOffsetNumber(page);
-
-		/*
-		 * Note: If you change anything in the loop below, also look at
-		 * heap_page_is_all_visible to see if that needs to be changed.
-		 */
-		for (offnum = FirstOffsetNumber;
-			 offnum <= maxoff;
-			 offnum = OffsetNumberNext(offnum))
+		if (vacrel->nindexes == 0)
 		{
-			ItemId		itemid;
-
-			/*
-			 * Set the offset number so that we can display it along with any
-			 * error that occurred while processing this tuple.
-			 */
-			vacrel->offnum = offnum;
-			itemid = PageGetItemId(page, offnum);
-
-			/* Unused items require no processing, but we count 'em */
-			if (!ItemIdIsUsed(itemid))
+			if (prunestate.has_lpdead_items)
 			{
-				nunused += 1;
-				continue;
-			}
-
-			/* Redirect items mustn't be touched */
-			if (ItemIdIsRedirected(itemid))
-			{
-				hastup = true;	/* this page won't be truncatable */
-				continue;
-			}
-
-			ItemPointerSet(&(tuple.t_self), blkno, offnum);
-
-			/*
-			 * LP_DEAD line pointers are to be vacuumed normally; but we don't
-			 * count them in tups_vacuumed, else we'd be double-counting (at
-			 * least in the common case where heap_page_prune() just freed up
-			 * a non-HOT tuple).  Note also that the final tups_vacuumed value
-			 * might be very low for tables where opportunistic page pruning
-			 * happens to occur very frequently (via heap_page_prune_opt()
-			 * calls that free up non-HOT tuples).
-			 */
-			if (ItemIdIsDead(itemid))
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				all_visible = false;
-				has_dead_items = true;
-				continue;
-			}
-
-			Assert(ItemIdIsNormal(itemid));
-
-			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
-			tuple.t_len = ItemIdGetLength(itemid);
-			tuple.t_tableOid = RelationGetRelid(vacrel->rel);
-
-			tupgone = false;
-
-			/*
-			 * The criteria for counting a tuple as live in this block need to
-			 * match what analyze.c's acquire_sample_rows() does, otherwise
-			 * VACUUM and ANALYZE may produce wildly different reltuples
-			 * values, e.g. when there are many recently-dead tuples.
-			 *
-			 * The logic here is a bit simpler than acquire_sample_rows(), as
-			 * VACUUM can't run inside a transaction block, which makes some
-			 * cases impossible (e.g. in-progress insert from the same
-			 * transaction).
-			 */
-			switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
-			{
-				case HEAPTUPLE_DEAD:
-
-					/*
-					 * Ordinarily, DEAD tuples would have been removed by
-					 * heap_page_prune(), but it's possible that the tuple
-					 * state changed since heap_page_prune() looked.  In
-					 * particular an INSERT_IN_PROGRESS tuple could have
-					 * changed to DEAD if the inserter aborted.  So this
-					 * cannot be considered an error condition.
-					 *
-					 * If the tuple is HOT-updated then it must only be
-					 * removed by a prune operation; so we keep it just as if
-					 * it were RECENTLY_DEAD.  Also, if it's a heap-only
-					 * tuple, we choose to keep it, because it'll be a lot
-					 * cheaper to get rid of it in the next pruning pass than
-					 * to treat it like an indexed tuple. Finally, if index
-					 * cleanup is disabled, the second heap pass will not
-					 * execute, and the tuple will not get removed, so we must
-					 * treat it like any other dead tuple that we choose to
-					 * keep.
-					 *
-					 * If this were to happen for a tuple that actually needed
-					 * to be deleted, we'd be in trouble, because it'd
-					 * possibly leave a tuple below the relation's xmin
-					 * horizon alive.  heap_prepare_freeze_tuple() is prepared
-					 * to detect that case and abort the transaction,
-					 * preventing corruption.
-					 */
-					if (HeapTupleIsHotUpdated(&tuple) ||
-						HeapTupleIsHeapOnly(&tuple) ||
-						params->index_cleanup == VACOPT_TERNARY_DISABLED)
-						nkeep += 1;
-					else
-						tupgone = true; /* we can delete the tuple */
-					all_visible = false;
-					break;
-				case HEAPTUPLE_LIVE:
-
-					/*
-					 * Count it as live.  Not only is this natural, but it's
-					 * also what acquire_sample_rows() does.
-					 */
-					live_tuples += 1;
-
-					/*
-					 * Is the tuple definitely visible to all transactions?
-					 *
-					 * NB: Like with per-tuple hint bits, we can't set the
-					 * PD_ALL_VISIBLE flag if the inserter committed
-					 * asynchronously. See SetHintBits for more info. Check
-					 * that the tuple is hinted xmin-committed because of
-					 * that.
-					 */
-					if (all_visible)
-					{
-						TransactionId xmin;
-
-						if (!HeapTupleHeaderXminCommitted(tuple.t_data))
-						{
-							all_visible = false;
-							break;
-						}
-
-						/*
-						 * The inserter definitely committed. But is it old
-						 * enough that everyone sees it as committed?
-						 */
-						xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-						if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
-						{
-							all_visible = false;
-							break;
-						}
-
-						/* Track newest xmin on page. */
-						if (TransactionIdFollows(xmin, visibility_cutoff_xid))
-							visibility_cutoff_xid = xmin;
-					}
-					break;
-				case HEAPTUPLE_RECENTLY_DEAD:
-
-					/*
-					 * If tuple is recently deleted then we must not remove it
-					 * from relation.
-					 */
-					nkeep += 1;
-					all_visible = false;
-					break;
-				case HEAPTUPLE_INSERT_IN_PROGRESS:
-
-					/*
-					 * This is an expected case during concurrent vacuum.
-					 *
-					 * We do not count these rows as live, because we expect
-					 * the inserting transaction to update the counters at
-					 * commit, and we assume that will happen only after we
-					 * report our results.  This assumption is a bit shaky,
-					 * but it is what acquire_sample_rows() does, so be
-					 * consistent.
-					 */
-					all_visible = false;
-					break;
-				case HEAPTUPLE_DELETE_IN_PROGRESS:
-					/* This is an expected case during concurrent vacuum */
-					all_visible = false;
-
-					/*
-					 * Count such rows as live.  As above, we assume the
-					 * deleting transaction will commit and update the
-					 * counters after we report.
-					 */
-					live_tuples += 1;
-					break;
-				default:
-					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
-					break;
-			}
-
-			if (tupgone)
-			{
-				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
-				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-													   &vacrel->latestRemovedXid);
-				tups_vacuumed += 1;
-				has_dead_items = true;
-			}
-			else
-			{
-				bool		tuple_totally_frozen;
-
-				num_tuples += 1;
-				hastup = true;
-
 				/*
-				 * Each non-removable tuple must be checked to see if it needs
-				 * freezing.  Note we already have exclusive buffer lock.
+				 * Do heap vacuuming (mark LP_DEAD item pointers LP_UNUSED)
+				 * for page now, since there won't be a second heap pass
 				 */
-				if (heap_prepare_freeze_tuple(tuple.t_data,
-											  vacrel->relfrozenxid,
-											  vacrel->relminmxid,
-											  vacrel->FreezeLimit,
-											  vacrel->MultiXactCutoff,
-											  &frozen[nfrozen],
-											  &tuple_totally_frozen))
-					frozen[nfrozen++].offset = offnum;
-
-				if (!tuple_totally_frozen)
-					all_frozen = false;
-			}
-		}						/* scan along page */
-
-		/*
-		 * Clear the offset information once we have processed all the tuples
-		 * on the page.
-		 */
-		vacrel->offnum = InvalidOffsetNumber;
-
-		/*
-		 * If we froze any tuples, mark the buffer dirty, and write a WAL
-		 * record recording the changes.  We must log the changes to be
-		 * crash-safe against future truncation of CLOG.
-		 */
-		if (nfrozen > 0)
-		{
-			START_CRIT_SECTION();
-
-			MarkBufferDirty(buf);
-
-			/* execute collected freezes */
-			for (i = 0; i < nfrozen; i++)
-			{
-				ItemId		itemid;
-				HeapTupleHeader htup;
-
-				itemid = PageGetItemId(page, frozen[i].offset);
-				htup = (HeapTupleHeader) PageGetItem(page, itemid);
-
-				heap_execute_freeze_tuple(htup, &frozen[i]);
-			}
-
-			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(vacrel->rel))
-			{
-				XLogRecPtr	recptr;
-
-				recptr = log_heap_freeze(vacrel->rel, buf,
-										 vacrel->FreezeLimit, frozen, nfrozen);
-				PageSetLSN(page, recptr);
-			}
-
-			END_CRIT_SECTION();
-		}
-
-		/*
-		 * If there are no indexes we can vacuum the page right now instead of
-		 * doing a second scan. Also we don't do that but forget dead tuples
-		 * when index cleanup is disabled.
-		 */
-		if (!vacrel->useindex && dead_tuples->num_tuples > 0)
-		{
-			if (vacrel->nindexes == 0)
-			{
-				/* Remove tuples from heap if the table has no index */
 				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
-				vacuumed_pages++;
-				has_dead_items = false;
-			}
-			else
-			{
-				/*
-				 * Here, we have indexes but index cleanup is disabled.
-				 * Instead of vacuuming the dead tuples on the heap, we just
-				 * forget them.
-				 *
-				 * Note that vacrelstats->dead_tuples could have tuples which
-				 * became dead after HOT-pruning but are not marked dead yet.
-				 * We do not process them because it's a very rare condition,
-				 * and the next vacuum will process them anyway.
-				 */
-				Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
-			}
 
-			/*
-			 * Forget the now-vacuumed tuples, and press on, but be careful
-			 * not to reset latestRemovedXid since we want that value to be
-			 * valid.
-			 */
-			dead_tuples->num_tuples = 0;
+				/* Forget the now-vacuumed tuples */
+				dead_tuples->num_tuples = 0;
+			}
 
 			/*
 			 * Periodically do incremental FSM vacuuming to make newly-freed
@@ -1624,16 +1354,49 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 										blkno);
 				next_fsm_block_to_vacuum = blkno;
 			}
+
+			Assert(dead_tuples->num_tuples == 0);
+			if (prunestate.has_lpdead_items)
+			{
+				/*
+				 * Our call to lazy_vacuum_heap_page() will have set LP_DEAD
+				 * items encountered during pruning to LP_UNUSED, and will
+				 * then have considered if it's possible to set all_visible
+				 * and all_frozen independently of lazy_scan_prune().
+				 *
+				 * We don't want to proceed with setting VM bits based on
+				 * information from prunestate -- it's out of date now.  Just
+				 * record free space in the FSM and move on to next page.
+				 */
+				Size		freespace = PageGetHeapFreeSpace(page);
+
+				UnlockReleaseBuffer(buf);
+				RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+				continue;
+			}
+			else
+			{
+				/*
+				 * There was no call to lazy_vacuum_heap_page() because
+				 * pruning didn't encounter/create any LP_DEAD items that
+				 * needed to be vacuumed (i.e. needed to be set to LP_UNUSED).
+				 *
+				 * Prune state has not been invalidated.  Proceed with vm bit
+				 * setting using prunestate.  (We'll record free space in the
+				 * FSM last of all, after dropping lock.)
+				 */
+			}
 		}
 
-		freespace = PageGetHeapFreeSpace(page);
-
-		/* mark page all-visible, if appropriate */
-		if (all_visible && !all_visible_according_to_vm)
+		/*
+		 * Handle setting visibility map bit based on what the VM said about
+		 * the page before pruning started, and using prunestate
+		 */
+		if (!all_visible_according_to_vm && prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
-			if (all_frozen)
+			if (prunestate.all_frozen)
 				flags |= VISIBILITYMAP_ALL_FROZEN;
 
 			/*
@@ -1652,7 +1415,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			PageSetAllVisible(page);
 			MarkBufferDirty(buf);
 			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, visibility_cutoff_xid, flags);
+							  vmbuffer, prunestate.visibility_cutoff_xid,
+							  flags);
 		}
 
 		/*
@@ -1686,7 +1450,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		 * There should never be dead tuples on a page with PD_ALL_VISIBLE
 		 * set, however.
 		 */
-		else if (PageIsAllVisible(page) && has_dead_items)
+		else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
 		{
 			elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
 				 vacrel->relname, blkno);
@@ -1701,7 +1465,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		 * mark it as all-frozen.  Note that all_frozen is only valid if
 		 * all_visible is true, so we must check both.
 		 */
-		else if (all_visible_according_to_vm && all_visible && all_frozen &&
+		else if (all_visible_according_to_vm && prunestate.all_visible &&
+				 prunestate.all_frozen &&
 				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
 			/*
@@ -1714,39 +1479,41 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 							  VISIBILITYMAP_ALL_FROZEN);
 		}
 
-		UnlockReleaseBuffer(buf);
-
-		/* Remember the location of the last page with nonremovable tuples */
-		if (hastup)
-			vacrel->nonempty_pages = blkno + 1;
-
 		/*
-		 * If we remembered any tuples for deletion, then the page will be
-		 * visited again by lazy_vacuum_heap_rel, which will compute and record
-		 * its post-compaction free space.  If not, then we're done with this
-		 * page, so remember its free space as-is.  (This path will always be
-		 * taken if there are no indexes.)
+		 * Final steps for block: drop super-exclusive lock, record free space
+		 * in the FSM
 		 */
-		if (dead_tuples->num_tuples == prev_dead_count)
+		if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+		{
+			/*
+			 * Wait until lazy_vacuum_heap_rel() to save free space.
+			 *
+			 * Note that the one-pass (no indexes) case is only supposed to
+			 * make it this far when there were no LP_DEAD items during
+			 * pruning.
+			 */
+			Assert(vacrel->nindexes > 0);
+			UnlockReleaseBuffer(buf);
+		}
+		else
+		{
+			Size		freespace = PageGetHeapFreeSpace(page);
+
+			UnlockReleaseBuffer(buf);
 			RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+		}
 	}
 
-	/* report that everything is scanned and vacuumed */
+	/* report that everything is now scanned */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
 	/* Clear the block number information */
 	vacrel->blkno = InvalidBlockNumber;
 
-	pfree(frozen);
-
-	/* save stats for use later */
-	vacrel->tuples_deleted = tups_vacuumed;
-	vacrel->new_dead_tuples = nkeep;
-
 	/* now we can compute the new value for pg_class.reltuples */
 	vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, nblocks,
 													 vacrel->tupcount_pages,
-													 live_tuples);
+													 vacrel->live_tuples);
 
 	/*
 	 * Also compute the total number of surviving heap entries.  In the
@@ -1767,13 +1534,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	/* If any tuples need to be deleted, perform final vacuum cycle */
 	/* XXX put a threshold on min number of tuples here? */
 	if (dead_tuples->num_tuples > 0)
-	{
-		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(vacrel);
-
-		/* Remove tuples from heap */
-		lazy_vacuum_heap_rel(vacrel);
-	}
+		lazy_vacuum(vacrel);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
@@ -1786,7 +1547,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Do post-vacuum cleanup */
-	if (vacrel->useindex)
+	if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
 		lazy_cleanup_all_indexes(vacrel);
 
 	/*
@@ -1797,22 +1558,30 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	lazy_space_free(vacrel);
 
 	/* Update index statistics */
-	if (vacrel->useindex)
+	if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
 		update_index_statistics(vacrel);
 
-	/* If no indexes, make log report that lazy_vacuum_heap_rel would've made */
-	if (vacuumed_pages)
+	/*
+	 * If table has no indexes and at least one heap page was vacuumed, make
+	 * log report that lazy_vacuum_heap_rel would've made had there been
+	 * indexes (having indexes implies using the two pass strategy).
+	 */
+	if (vacrel->nindexes == 0 && vacrel->lpdead_item_pages > 0)
 		ereport(elevel,
-				(errmsg("\"%s\": removed %.0f row versions in %u pages",
-						vacrel->relname,
-						tups_vacuumed, vacuumed_pages)));
+				(errmsg("\"%s\": removed %lld dead item identifiers in %u pages",
+						vacrel->relname, (long long) vacrel->lpdead_items,
+						vacrel->lpdead_item_pages)));
 
 	initStringInfo(&buf);
 	appendStringInfo(&buf,
-					 _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"),
-					 nkeep, vacrel->OldestXmin);
-	appendStringInfo(&buf, _("There were %.0f unused item identifiers.\n"),
-					 nunused);
+					 _("%lld dead row versions cannot be removed yet, oldest xmin: %u\n"),
+					 (long long) vacrel->new_dead_tuples, vacrel->OldestXmin);
+	appendStringInfo(&buf, _("There were %lld dead item identifiers.\n"),
+					 (long long) vacrel->lpdead_items);
+	appendStringInfo(&buf, ngettext("%u page removed.\n",
+									"%u pages removed.\n",
+									vacrel->pages_removed),
+					 vacrel->pages_removed);
 	appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
 									"Skipped %u pages due to buffer pins, ",
 									vacrel->pinskipped_pages),
@@ -1821,21 +1590,471 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 									"%u frozen pages.\n",
 									vacrel->frozenskipped_pages),
 					 vacrel->frozenskipped_pages);
-	appendStringInfo(&buf, ngettext("%u page is entirely empty.\n",
-									"%u pages are entirely empty.\n",
-									empty_pages),
-					 empty_pages);
 	appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
 
 	ereport(elevel,
-			(errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
+			(errmsg("\"%s\": found %lld removable, %lld nonremovable row versions in %u out of %u pages",
 					vacrel->relname,
-					tups_vacuumed, num_tuples,
-					vacrel->scanned_pages, nblocks),
+					(long long) vacrel->tuples_deleted,
+					(long long) vacrel->num_tuples, vacrel->scanned_pages,
+					nblocks),
 			 errdetail_internal("%s", buf.data)));
 	pfree(buf.data);
 }
 
+/*
+ *	lazy_scan_prune() -- lazy_scan_heap() pruning and freezing.
+ *
+ * Caller must hold pin and buffer cleanup lock on the buffer.
+ */
+static void
+lazy_scan_prune(LVRelState *vacrel,
+				Buffer buf,
+				BlockNumber blkno,
+				Page page,
+				GlobalVisState *vistest,
+				LVPagePruneState *prunestate,
+				VacOptTernaryValue index_cleanup)
+{
+	Relation	rel = vacrel->rel;
+	OffsetNumber offnum,
+				maxoff;
+	ItemId		itemid;
+	HeapTupleData tuple;
+	int			tuples_deleted,
+				lpdead_items,
+				new_dead_tuples,
+				num_tuples,
+				live_tuples;
+	int			nfrozen;
+	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+	xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/* Initialize (or reset) page-level counters */
+	tuples_deleted = 0;
+	lpdead_items = 0;
+	new_dead_tuples = 0;
+	num_tuples = 0;
+	live_tuples = 0;
+
+	/*
+	 * Prune all HOT-update chains in this page.
+	 *
+	 * We count tuples removed by the pruning step as tuples_deleted.  Its
+	 * final value can be thought of as the number of tuples that have been
+	 * deleted from the table.  It should not be confused with lpdead_items;
+	 * lpdead_items's final value can be thought of as the number of tuples
+	 * that were deleted from indexes.
+	 */
+	tuples_deleted = heap_page_prune(rel, buf, vistest,
+									 InvalidTransactionId, 0, false,
+									 &vacrel->latestRemovedXid,
+									 &vacrel->offnum);
+
+	/*
+	 * Now scan the page to collect LP_DEAD items and check for tuples
+	 * requiring freezing among remaining tuples with storage
+	 */
+	prunestate->hastup = false;
+	prunestate->has_lpdead_items = false;
+	prunestate->all_visible = true;
+	prunestate->all_frozen = true;
+	prunestate->visibility_cutoff_xid = InvalidTransactionId;
+	nfrozen = 0;
+
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		bool		tuple_totally_frozen;
+		bool		tupgone = false;
+
+		/*
+		 * Set the offset number so that we can display it along with any
+		 * error that occurred while processing this tuple.
+		 */
+		vacrel->offnum = offnum;
+		itemid = PageGetItemId(page, offnum);
+
+		if (!ItemIdIsUsed(itemid))
+			continue;
+
+		/* Redirect items mustn't be touched */
+		if (ItemIdIsRedirected(itemid))
+		{
+			prunestate->hastup = true;	/* page won't be truncatable */
+			continue;
+		}
+
+		/*
+		 * LP_DEAD items are processed outside of the loop.
+		 *
+		 * Note that we deliberately don't set hastup=true in the case of an
+		 * LP_DEAD item here, which is not how lazy_check_needs_freeze() or
+		 * count_nondeletable_pages() do it -- they only consider pages empty
+		 * when they only have LP_UNUSED items, which is important for
+		 * correctness.
+		 *
+		 * Our assumption is that any LP_DEAD items we encounter here will
+		 * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
+		 * call count_nondeletable_pages().  In any case our opinion of
+		 * whether or not a page 'hastup' (which is how our caller sets its
+		 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
+		 * treated as advisory/unreliable, so we might as well be slightly
+		 * optimistic.
+		 */
+		if (ItemIdIsDead(itemid))
+		{
+			deadoffsets[lpdead_items++] = offnum;
+			prunestate->all_visible = false;
+			prunestate->has_lpdead_items = true;
+			continue;
+		}
+
+		Assert(ItemIdIsNormal(itemid));
+
+		ItemPointerSet(&(tuple.t_self), blkno, offnum);
+		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+		tuple.t_len = ItemIdGetLength(itemid);
+		tuple.t_tableOid = RelationGetRelid(rel);
+
+		/*
+		 * The criteria for counting a tuple as live in this block need to
+		 * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
+		 * and ANALYZE may produce wildly different reltuples values, e.g.
+		 * when there are many recently-dead tuples.
+		 *
+		 * The logic here is a bit simpler than acquire_sample_rows(), as
+		 * VACUUM can't run inside a transaction block, which makes some cases
+		 * impossible (e.g. in-progress insert from the same transaction).
+		 */
+		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+		{
+			case HEAPTUPLE_DEAD:
+
+				/*
+				 * Ordinarily, DEAD tuples would have been removed by
+				 * heap_page_prune(), but it's possible that the tuple state
+				 * changed since heap_page_prune() looked.  In particular an
+				 * INSERT_IN_PROGRESS tuple could have changed to DEAD if the
+				 * inserter aborted.  So this cannot be considered an error
+				 * condition.
+				 *
+				 * If the tuple is HOT-updated then it must only be removed by
+				 * a prune operation; so we keep it just as if it were
+				 * RECENTLY_DEAD.  Also, if it's a heap-only tuple, we choose
+				 * to keep it, because it'll be a lot cheaper to get rid of it
+				 * in the next pruning pass than to treat it like an indexed
+				 * tuple. Finally, if index cleanup is disabled, the second
+				 * heap pass will not execute, and the tuple will not get
+				 * removed, so we must treat it like any other dead tuple that
+				 * we choose to keep.
+				 *
+				 * If this were to happen for a tuple that actually needed to
+				 * be deleted, we'd be in trouble, because it'd possibly leave
+				 * a tuple below the relation's xmin horizon alive.
+				 * heap_prepare_freeze_tuple() is prepared to detect that case
+				 * and abort the transaction, preventing corruption.
+				 */
+				if (HeapTupleIsHotUpdated(&tuple) ||
+					HeapTupleIsHeapOnly(&tuple) ||
+					index_cleanup == VACOPT_TERNARY_DISABLED)
+					new_dead_tuples++;
+				else
+					tupgone = true; /* we can delete the tuple */
+				prunestate->all_visible = false;
+				break;
+			case HEAPTUPLE_LIVE:
+
+				/*
+				 * Count it as live.  Not only is this natural, but it's also
+				 * what acquire_sample_rows() does.
+				 */
+				live_tuples++;
+
+				/*
+				 * Is the tuple definitely visible to all transactions?
+				 *
+				 * NB: Like with per-tuple hint bits, we can't set the
+				 * PD_ALL_VISIBLE flag if the inserter committed
+				 * asynchronously. See SetHintBits for more info. Check that
+				 * the tuple is hinted xmin-committed because of that.
+				 */
+				if (prunestate->all_visible)
+				{
+					TransactionId xmin;
+
+					if (!HeapTupleHeaderXminCommitted(tuple.t_data))
+					{
+						prunestate->all_visible = false;
+						break;
+					}
+
+					/*
+					 * The inserter definitely committed. But is it old enough
+					 * that everyone sees it as committed?
+					 */
+					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
+					if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
+					{
+						prunestate->all_visible = false;
+						break;
+					}
+
+					/* Track newest xmin on page. */
+					if (TransactionIdFollows(xmin, prunestate->visibility_cutoff_xid))
+						prunestate->visibility_cutoff_xid = xmin;
+				}
+				break;
+			case HEAPTUPLE_RECENTLY_DEAD:
+
+				/*
+				 * If tuple is recently deleted then we must not remove it
+				 * from relation.  (We only remove items that are LP_DEAD from
+				 * pruning.)
+				 */
+				new_dead_tuples++;
+				prunestate->all_visible = false;
+				break;
+			case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+				/*
+				 * We do not count these rows as live, because we expect the
+				 * inserting transaction to update the counters at commit, and
+				 * we assume that will happen only after we report our
+				 * results.  This assumption is a bit shaky, but it is what
+				 * acquire_sample_rows() does, so be consistent.
+				 */
+				prunestate->all_visible = false;
+				break;
+			case HEAPTUPLE_DELETE_IN_PROGRESS:
+				/* This is an expected case during concurrent vacuum */
+				prunestate->all_visible = false;
+
+				/*
+				 * Count such rows as live.  As above, we assume the deleting
+				 * transaction will commit and update the counters after we
+				 * report.
+				 */
+				live_tuples++;
+				break;
+			default:
+				elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+				break;
+		}
+
+		if (tupgone)
+		{
+			/* Pretend that this is an LP_DEAD item  */
+			deadoffsets[lpdead_items++] = offnum;
+			prunestate->all_visible = false;
+			prunestate->has_lpdead_items = true;
+
+			/* But remember it for XLOG_HEAP2_CLEANUP_INFO record */
+			HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
+												   &vacrel->latestRemovedXid);
+		}
+		else
+		{
+			/*
+			 * Non-removable tuple (i.e. tuple with storage).
+			 *
+			 * Check tuple left behind after pruning to see if it needs to be
+			 * frozen now.
+			 */
+			num_tuples++;
+			prunestate->hastup = true;
+			if (heap_prepare_freeze_tuple(tuple.t_data,
+										  vacrel->relfrozenxid,
+										  vacrel->relminmxid,
+										  vacrel->FreezeLimit,
+										  vacrel->MultiXactCutoff,
+										  &frozen[nfrozen],
+										  &tuple_totally_frozen))
+			{
+				/* Will execute freeze below */
+				frozen[nfrozen++].offset = offnum;
+			}
+
+			/*
+			 * If tuple is not frozen (and not about to become frozen) then caller
+			 * had better not go on to set this page's VM bit
+			 */
+			if (!tuple_totally_frozen)
+				prunestate->all_frozen = false;
+		}
+	}
+
+	/*
+	 * We have now divided every item on the page into either an LP_DEAD item
+	 * that will need to be vacuumed in indexes later, or an LP_NORMAL tuple
+	 * that remains and needs to be considered for freezing now (LP_UNUSED and
+	 * LP_REDIRECT items also remain, but are of no further interest to us).
+	 *
+	 * Add page level counters to caller's counts, and then actually process
+	 * LP_DEAD and LP_NORMAL items.
+	 *
+	 * TODO: Remove tupgone logic entirely in next commit -- we shouldn't have
+	 * to pretend that DEAD items are LP_DEAD items.
+	 */
+	vacrel->offnum = InvalidOffsetNumber;
+
+	/*
+	 * Consider the need to freeze any items with tuple storage from the page
+	 * first (arbitrary)
+	 */
+	if (nfrozen > 0)
+	{
+		Assert(prunestate->hastup);
+
+		/*
+		 * At least one tuple with storage needs to be frozen -- execute that
+		 * now.
+		 *
+		 * If we need to freeze any tuples we'll mark the buffer dirty, and
+		 * write a WAL record recording the changes.  We must log the changes
+		 * to be crash-safe against future truncation of CLOG.
+		 */
+		START_CRIT_SECTION();
+
+		MarkBufferDirty(buf);
+
+		/* execute collected freezes */
+		for (int i = 0; i < nfrozen; i++)
+		{
+			HeapTupleHeader htup;
+
+			itemid = PageGetItemId(page, frozen[i].offset);
+			htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+			heap_execute_freeze_tuple(htup, &frozen[i]);
+		}
+
+		/* Now WAL-log freezing if necessary */
+		if (RelationNeedsWAL(vacrel->rel))
+		{
+			XLogRecPtr	recptr;
+
+			recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
+									 frozen, nfrozen);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+	}
+
+	/*
+	 * The second pass over the heap can also set visibility map bits, using
+	 * the same approach.  This is important when the table frequently has a
+	 * few old LP_DEAD items on each page by the time we get to it (typically
+	 * because past opportunistic pruning operations freed some non-HOT
+	 * tuples).
+	 *
+	 * VACUUM will call heap_page_is_all_visible() during the second pass over
+	 * the heap to determine all_visible and all_frozen for the page -- this
+	 * is a specialized version of the logic from this function.  Now that
+	 * we've finished pruning and freezing, make sure that we're in total
+	 * agreement with heap_page_is_all_visible() using an assertion.
+	 */
+#ifdef USE_ASSERT_CHECKING
+	/* Note that all_frozen value does not matter when !all_visible */
+	if (prunestate->all_visible)
+	{
+		TransactionId cutoff;
+		bool		all_frozen;
+
+		if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
+			Assert(false);
+
+		Assert(lpdead_items == 0);
+		Assert(prunestate->all_frozen == all_frozen);
+
+		/*
+		 * It's possible that we froze tuples and made the page's XID cutoff
+		 * (for recovery conflict purposes) FrozenTransactionId.  This is okay
+		 * because visibility_cutoff_xid will be logged by our caller in a
+		 * moment.
+		 */
+		Assert(cutoff == FrozenTransactionId ||
+			   cutoff == prunestate->visibility_cutoff_xid);
+	}
+#endif
+
+	/* Add page-local counts to whole-VACUUM counts */
+	vacrel->tuples_deleted += tuples_deleted;
+	vacrel->lpdead_items += lpdead_items;
+	vacrel->new_dead_tuples += new_dead_tuples;
+	vacrel->num_tuples += num_tuples;
+	vacrel->live_tuples += live_tuples;
+
+	/*
+	 * Now save details of the LP_DEAD items from the page in the dead_tuples
+	 * array.  Also record that page has dead items in per-page prunestate.
+	 */
+	if (lpdead_items > 0)
+	{
+		LVDeadTuples *dead_tuples = vacrel->dead_tuples;
+		ItemPointerData tmp;
+
+		Assert(!prunestate->all_visible);
+		Assert(prunestate->has_lpdead_items);
+
+		vacrel->lpdead_item_pages++;
+
+		/*
+		 * Don't actually save item when it is known for sure that both index
+		 * vacuuming and heap vacuuming cannot go ahead during the ongoing
+		 * VACUUM
+		 */
+		if (!vacrel->do_index_vacuuming && vacrel->nindexes > 0)
+			return;
+
+		ItemPointerSetBlockNumber(&tmp, blkno);
+
+		for (int i = 0; i < lpdead_items; i++)
+		{
+			ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
+			dead_tuples->itemptrs[dead_tuples->num_tuples++] = tmp;
+		}
+
+		Assert(dead_tuples->num_tuples <= dead_tuples->max_tuples);
+		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+									 dead_tuples->num_tuples);
+	}
+}
+
+/*
+ * Remove the collected garbage tuples from the table and its indexes.
+ */
+static void
+lazy_vacuum(LVRelState *vacrel)
+{
+	/* Should not end up here with no indexes */
+	Assert(vacrel->nindexes > 0);
+	Assert(!IsParallelWorker());
+	Assert(vacrel->lpdead_item_pages > 0);
+
+	if (!vacrel->do_index_vacuuming)
+	{
+		Assert(!vacrel->do_index_cleanup);
+		vacrel->dead_tuples->num_tuples = 0;
+		return;
+	}
+
+	/* Okay, we're going to do index vacuuming */
+	lazy_vacuum_all_indexes(vacrel);
+
+	/* Remove tuples from heap */
+	lazy_vacuum_heap_rel(vacrel);
+
+	/*
+	 * Forget the now-vacuumed tuples -- just press on
+	 */
+	vacrel->dead_tuples->num_tuples = 0;
+}
+
 /*
  *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
  */
@@ -1843,6 +2062,8 @@ static void
 lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
 	Assert(vacrel->nindexes > 0);
+	Assert(vacrel->do_index_vacuuming);
+	Assert(vacrel->do_index_cleanup);
 	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
 	Assert(MultiXactIdIsValid(vacrel->relminmxid));
 
@@ -1897,6 +2118,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 	Buffer		vmbuffer = InvalidBuffer;
 	LVSavedErrInfo saved_err_info;
 
+	Assert(vacrel->do_index_vacuuming);
+	Assert(vacrel->do_index_cleanup);
+	Assert(vacrel->num_index_scans > 0);
+
 	/* Report that we are now vacuuming the heap */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
@@ -1981,6 +2206,8 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
 
+	Assert(vacrel->nindexes == 0 || vacrel->do_index_vacuuming);
+
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Update error traceback information */
@@ -2942,14 +3169,14 @@ count_nondeletable_pages(LVRelState *vacrel)
  * Return the maximum number of dead tuples we can record.
  */
 static long
-compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
+compute_max_dead_tuples(BlockNumber relblocks, bool hasindex)
 {
 	long		maxtuples;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
 	autovacuum_work_mem != -1 ?
 	autovacuum_work_mem : maintenance_work_mem;
 
-	if (useindex)
+	if (hasindex)
 	{
 		maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
 		maxtuples = Min(maxtuples, INT_MAX);
@@ -3034,26 +3261,6 @@ lazy_space_free(LVRelState *vacrel)
 	end_parallel_vacuum(vacrel);
 }
 
-/*
- * lazy_record_dead_tuple - remember one deletable tuple
- */
-static void
-lazy_record_dead_tuple(LVDeadTuples *dead_tuples, ItemPointer itemptr)
-{
-	/*
-	 * The array shouldn't overflow under normal behavior, but perhaps it
-	 * could if we are given a really small maintenance_work_mem. In that
-	 * case, just forget the last few tuples (we'll get 'em next time).
-	 */
-	if (dead_tuples->num_tuples < dead_tuples->max_tuples)
-	{
-		dead_tuples->itemptrs[dead_tuples->num_tuples] = *itemptr;
-		dead_tuples->num_tuples++;
-		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
-									 dead_tuples->num_tuples);
-	}
-}
-
 /*
  *	lazy_tid_reaped() -- is a particular tid deletable?
  *
-- 
2.27.0
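
For illustration, here is a minimal standalone sketch of the bookkeeping that
replaces the removed lazy_record_dead_tuple() above: lazy_scan_prune() now
appends each page's LP_DEAD offsets to the caller's dead_tuples array inline.
SketchTid and SketchDeadTuples are simplified stand-ins for ItemPointerData
and LVDeadTuples, not the real structs (the real array is palloc'd and sized
by compute_max_dead_tuples()):

/*
 * Sketch only: simplified stand-ins, chosen for illustration.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct SketchTid
{
	uint32_t	block;
	uint16_t	offset;
} SketchTid;

typedef struct SketchDeadTuples
{
	int			max_tuples;
	int			num_tuples;
	SketchTid	itemptrs[1024];
} SketchDeadTuples;

/* Append one page's LP_DEAD offsets, as lazy_scan_prune() now does inline */
static void
save_page_dead_items(SketchDeadTuples *dead_tuples, uint32_t blkno,
					 const uint16_t *deadoffsets, int lpdead_items)
{
	for (int i = 0; i < lpdead_items; i++)
	{
		if (dead_tuples->num_tuples >= dead_tuples->max_tuples)
			break;				/* the real code asserts this cannot happen */
		dead_tuples->itemptrs[dead_tuples->num_tuples].block = blkno;
		dead_tuples->itemptrs[dead_tuples->num_tuples].offset = deadoffsets[i];
		dead_tuples->num_tuples++;
	}
}

int
main(void)
{
	SketchDeadTuples dt = {.max_tuples = 1024, .num_tuples = 0};
	uint16_t	deadoffsets[] = {3, 7, 12};

	save_page_dead_items(&dt, 42, deadoffsets, 3);
	printf("collected %d dead item pointers\n", dt.num_tuples);
	return 0;
}

Unlike the removed lazy_record_dead_tuple(), which silently forgot TIDs when
the array filled up, the patched code reserves enough space up front and only
asserts that num_tuples stays within max_tuples.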

v10-0005-Bypass-index-vacuuming-in-some-cases.patchapplication/octet-stream; name=v10-0005-Bypass-index-vacuuming-in-some-cases.patchDownload
From 054ab1e3ebf6f46877b394da6e5e86709f964abb Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 28 Mar 2021 20:55:55 -0700
Subject: [PATCH v10 5/5] Bypass index vacuuming in some cases.

Bypass index vacuuming in two cases: The case where there are so few
dead tuples that index vacuuming seems unnecessary, and the case where
the relfrozenxid of the table being vacuumed is dangerously far in the
past.

This commit adds new GUC parameters vacuum_skip_index_age and
vacuum_multixact_skip_index_age that specify the age at which VACUUM
should skip index cleanup so that it can finish sooner and advance
relfrozenxid/relminmxid.

After vacuuming each index (in the non-parallel vacuum case), we check
whether the table's relfrozenxid/relminmxid has crossed those new GUC
parameters.  If so, we skip further index vacuuming within the vacuum
operation.

This behavior is intended to deal with the risk of XID wraparound, so
the default values are deliberately high: 1.8 billion.

Although users can set those parameters, VACUUM will silently adjust
the effective value to no less than 105% of
autovacuum_freeze_max_age/autovacuum_multixact_freeze_max_age, so that
only anti-wraparound autovacuums and aggressive scans have a chance to
skip index vacuuming.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzmkebqPd4MVGuPTOS9bMFvp9MDs5cRTCOsv1rQJ3jCbXw@mail.gmail.com
---
 src/include/commands/vacuum.h                 |   4 +
 src/backend/access/heap/vacuumlazy.c          | 300 ++++++++++++++++--
 src/backend/commands/vacuum.c                 |  61 ++++
 src/backend/utils/misc/guc.c                  |  25 +-
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 doc/src/sgml/config.sgml                      |  51 +++
 doc/src/sgml/maintenance.sgml                 |  10 +-
 7 files changed, 431 insertions(+), 22 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d029da5ac0..f1815e7892 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -235,6 +235,8 @@ extern int	vacuum_freeze_min_age;
 extern int	vacuum_freeze_table_age;
 extern int	vacuum_multixact_freeze_min_age;
 extern int	vacuum_multixact_freeze_table_age;
+extern int	vacuum_skip_index_age;
+extern int	vacuum_multixact_skip_index_age;
 
 /* Variables for cost-based parallel vacuum */
 extern pg_atomic_uint32 *VacuumSharedCostBalance;
@@ -270,6 +272,8 @@ extern void vacuum_set_xid_limits(Relation rel,
 								  TransactionId *xidFullScanLimit,
 								  MultiXactId *multiXactCutoff,
 								  MultiXactId *mxactFullScanLimit);
+extern bool vacuum_xid_limit_emergency(TransactionId relfrozenxid,
+									   MultiXactId relminmxid);
 extern void vac_update_datfrozenxid(void);
 extern void vacuum_delay_point(void);
 extern bool vacuum_is_relation_owner(Oid relid, Form_pg_class reltuple,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c7bb0b1f23..3eeecb954f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -103,6 +103,17 @@
 #define VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL		50	/* ms */
 #define VACUUM_TRUNCATE_LOCK_TIMEOUT			5000	/* ms */
 
+/*
+ * Threshold that controls whether we bypass index vacuuming and heap
+ * vacuuming.  When we're under the threshold they're deemed unnecessary.
+ * BYPASS_THRESHOLD_PAGES is applied as a multiplier on the table's rel_pages
+ * for those pages known to contain one or more LP_DEAD items.
+ */
+#define BYPASS_THRESHOLD_PAGES	0.02	/* i.e. 2% of rel_pages */
+
+#define BYPASS_EMERGENCY_MIN_PAGES \
+	((BlockNumber) (((uint64) 4 * 1024 * 1024 * 1024) / BLCKSZ))
+
 /*
  * When a table has no indexes, vacuum the FSM after every 8GB, approximately
  * (it won't be exact because we only vacuum FSM after processing a heap page
@@ -299,6 +310,7 @@ typedef struct LVRelState
 	/* Do index vacuuming/cleanup? */
 	bool		do_index_vacuuming;
 	bool		do_index_cleanup;
+	bool		do_failsafe_speedup;
 
 	/* Buffer access strategy and parallel state */
 	BufferAccessStrategy bstrategy;
@@ -392,13 +404,14 @@ static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 							BlockNumber blkno, Page page,
 							GlobalVisState *vistest,
 							LVPagePruneState *prunestate);
-static void lazy_vacuum(LVRelState *vacrel);
-static void lazy_vacuum_all_indexes(LVRelState *vacrel);
+static void lazy_vacuum(LVRelState *vacrel, bool onecall);
+static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void lazy_vacuum_heap_rel(LVRelState *vacrel);
 static int	lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
 								  Buffer buffer, int tupindex, Buffer *vmbuffer);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelState *vacrel);
+static bool should_speedup_failsafe(LVRelState *vacrel);
 static void do_parallel_lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void do_parallel_lazy_cleanup_all_indexes(LVRelState *vacrel);
 static void do_parallel_vacuum_or_cleanup(LVRelState *vacrel, int nworkers);
@@ -544,6 +557,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 					 &vacrel->indrels);
 	vacrel->do_index_vacuuming = true;
 	vacrel->do_index_cleanup = true;
+	vacrel->do_failsafe_speedup = false;
 	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
 	{
 		vacrel->do_index_vacuuming = false;
@@ -743,6 +757,29 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 							 (long long) VacuumPageHit,
 							 (long long) VacuumPageMiss,
 							 (long long) VacuumPageDirty);
+			if (vacrel->rel_pages > 0)
+			{
+				msgfmt = _(" %u pages from table (%.2f%% of total) had %lld dead item identifiers removed\n");
+
+				if (vacrel->nindexes == 0 || (vacrel->do_index_vacuuming &&
+											  vacrel->num_index_scans == 0))
+					appendStringInfo(&buf, _("index scan not needed:"));
+				else if (vacrel->do_index_vacuuming && vacrel->num_index_scans > 0)
+					appendStringInfo(&buf, _("index scan needed:"));
+				else
+				{
+					msgfmt = _(" %u pages from table (%.2f%% of total) have %lld dead item identifiers\n");
+
+					if (!vacrel->do_failsafe_speedup)
+						appendStringInfo(&buf, _("index scan bypassed:"));
+					else
+						appendStringInfo(&buf, _("index scan bypassed due to emergency:"));
+				}
+				appendStringInfo(&buf, msgfmt,
+								 vacrel->lpdead_item_pages,
+								 100.0 * vacrel->lpdead_item_pages / vacrel->rel_pages,
+								 (long long) vacrel->lpdead_items);
+			}
 			for (int i = 0; i < vacrel->nindexes; i++)
 			{
 				IndexBulkDeleteResult *istat = vacrel->indstats[i];
@@ -833,7 +870,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				next_fsm_block_to_vacuum;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		skipping_blocks;
+	bool		skipping_blocks,
+				have_vacuumed_indexes = false;
 	StringInfoData buf;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
@@ -1087,11 +1125,18 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			}
 
 			/* Remove the collected garbage tuples from table and indexes */
-			lazy_vacuum(vacrel);
+			lazy_vacuum(vacrel, false);
+			have_vacuumed_indexes = true;
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
 			 * upper-level FSM pages.  Note we have not yet processed blkno.
+			 *
+			 * Note also that it's possible that the call to lazy_vacuum()
+			 * decided to end index vacuuming due to an emergency (though not
+			 * for any other reason).  When that happens we can miss out on
+			 * some of the free space that we originally expected to be able
+			 * to pick up within lazy_vacuum_heap_rel().
 			 */
 			FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
 									blkno);
@@ -1306,15 +1351,18 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 
 			/*
 			 * Periodically do incremental FSM vacuuming to make newly-freed
-			 * space visible on upper FSM pages.  Note: although we've cleaned
-			 * the current block, we haven't yet updated its FSM entry (that
-			 * happens further down), so passing end == blkno is correct.
+			 * space visible on upper FSM pages.  This is also a convenient
+			 * point to check if we should do failsafe speedup to avoid
+			 * wraparound failures.  Note: although we've cleaned the current
+			 * block, we haven't yet updated its FSM entry (that happens
+			 * further down), so passing end == blkno is correct.
 			 */
 			if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
 			{
 				FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
 										blkno);
 				next_fsm_block_to_vacuum = blkno;
+				should_speedup_failsafe(vacrel);
 			}
 
 			Assert(dead_tuples->num_tuples == 0);
@@ -1454,6 +1502,14 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 * make available in cases where it's possible to truncate the
 			 * page's line pointer array.
 			 *
+			 * Note: It's not in fact 100% certain that we really will call
+			 * lazy_vacuum_heap_rel() -- lazy_vacuum() might yet opt to skip
+			 * index vacuuming (and so must skip heap vacuuming).  This is
+			 * deemed okay because it only happens in emergencies, or when
+			 * there is very little free space anyway.  (Besides, we start
+			 * recording free space in FSM once we know that index vacuuming
+			 * was abandoned.)
+			 *
 			 * Note that the one-pass (no indexes) case is only supposed to
 			 * make it this far when there were no LP_DEAD items during
 			 * pruning.
@@ -1498,13 +1554,12 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	}
 
 	/* If any tuples need to be deleted, perform final vacuum cycle */
-	/* XXX put a threshold on min number of tuples here? */
 	if (dead_tuples->num_tuples > 0)
-		lazy_vacuum(vacrel);
+		lazy_vacuum(vacrel, !have_vacuumed_indexes);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
-	 * not there were indexes.
+	 * not there were indexes, and whether or not we bypassed index vacuuming.
 	 */
 	if (blkno > next_fsm_block_to_vacuum)
 		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
@@ -1531,6 +1586,16 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	 * If table has no indexes and at least one heap page was vacuumed, make
 	 * log report that lazy_vacuum_heap_rel would've made had there been
 	 * indexes (having indexes implies using the two pass strategy).
+	 *
+	 * We deliberately don't do this in the case where there are indexes but
+	 * index vacuuming was bypassed.  We make a similar report at the point
+	 * that index vacuuming is bypassed, but that's actually quite different
+	 * in one important sense: it shows information about work we _haven't_
+	 * done.
+	 *
+	 * log_autovacuum output does things differently; it consistently presents
+	 * information about LP_DEAD items for the VACUUM as a whole.  We always
+	 * report on each round of index and heap vacuuming separately, though.
 	 */
 	if (vacrel->nindexes == 0 && vacrel->lpdead_item_pages > 0)
 		ereport(elevel,
@@ -1974,10 +2039,19 @@ retry:
 
 /*
  * Remove the collected garbage tuples from the table and its indexes.
+ *
+ * We may choose to bypass index vacuuming at this point.
+ *
+ * In rare emergencies, the ongoing VACUUM operation can be made to skip both
+ * index vacuuming and index cleanup at the point we're called.  This avoids
+ * having the whole system refuse to allocate further XIDs/MultiXactIds due to
+ * wraparound.
  */
 static void
-lazy_vacuum(LVRelState *vacrel)
+lazy_vacuum(LVRelState *vacrel, bool onecall)
 {
+	bool		do_bypass_optimization;
+
 	/* Should not end up here with no indexes */
 	Assert(vacrel->nindexes > 0);
 	Assert(!IsParallelWorker());
@@ -1990,11 +2064,100 @@ lazy_vacuum(LVRelState *vacrel)
 		return;
 	}
 
-	/* Okay, we're going to do index vacuuming */
-	lazy_vacuum_all_indexes(vacrel);
+	/*
+	 * Consider bypassing index vacuuming (and heap vacuuming) entirely.
+	 *
+	 * It's far from clear how we might assess the point at which bypassing
+	 * index vacuuming starts to make sense.  But it is at least clear that
+	 * VACUUM should not go ahead with index vacuuming in certain extreme
+	 * (though still fairly common) cases.  These are the cases where we have
+	 * _close to_ zero LP_DEAD items/TIDs to delete from indexes.  It would be
+	 * totally arbitrary to perform a round of full index scans in that case,
+	 * while not also doing the same thing when we happen to have _precisely_
+	 * zero TIDs -- so we do neither.  This avoids sharp discontinuities in
+	 * the duration and overhead of successive VACUUM operations that run
+	 * against the same table with the same workload.
+	 *
+	 * Our approach is to bypass index vacuuming only when there are very few
+	 * heap pages with dead items.  Even then, it must be the first and last
+	 * call here for the VACUUM.  We never apply the optimization when
+	 * multiple index scans will be required -- we cannot accumulate "debt"
+	 * without bound.
+	 *
+	 * The threshold we apply allows us to not give as much weight to items
+	 * that are concentrated in relatively few heap pages.  Concentrated
+	 * build-up of LP_DEAD items tends to occur with workloads that have
+	 * non-HOT updates that affect the same logical rows again and again.  It
+	 * is probably not possible for us to keep the visibility map bits for
+	 * these pages set for a useful amount of time anyway.
+	 *
+	 * We apply one further check: the space currently used to store the TIDs
+	 * (the TIDs that tie back to the index tuples we're thinking about not
+	 * deleting this time around) must not exceed 32MB.  This limits the risk
+	 * that we will bypass index vacuuming again and again until eventually
+	 * there is a VACUUM whose dead_tuples space is not resident in L3 cache.
+	 *
+	 * We can be conservative about avoiding eventually reaching some kind of
+	 * cliff edge while still avoiding almost all truly unnecessary index
+	 * vacuuming.
+	 */
+	do_bypass_optimization = false;
+	if (onecall && vacrel->rel_pages > 0)
+	{
+		BlockNumber threshold;
 
-	/* Remove tuples from heap */
-	lazy_vacuum_heap_rel(vacrel);
+		Assert(vacrel->num_index_scans == 0);
+		Assert(vacrel->lpdead_items == vacrel->dead_tuples->num_tuples);
+		Assert(vacrel->do_index_vacuuming);
+		Assert(vacrel->do_index_cleanup);
+
+		threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
+
+		do_bypass_optimization =
+			(vacrel->lpdead_item_pages < threshold &&
+			 vacrel->lpdead_items < MAXDEADTUPLES(32L * 1024L * 1024L));
+	}
+
+	if (do_bypass_optimization)
+	{
+		/*
+		 * Bypass index vacuuming.
+		 *
+		 * Since VACUUM aims to behave as if there were precisely zero index
+		 * tuples, even when there are actually slightly more than zero, we
+		 * will still do index cleanup.  This is expected to have practically
+		 * no overhead with tables where bypassing index vacuuming helps.
+		 */
+		vacrel->do_index_vacuuming = false;
+		ereport(elevel,
+				(errmsg("\"%s\": index scan bypassed: %u pages from table (%.2f%% of total) have %lld dead item identifiers",
+						vacrel->relname, vacrel->lpdead_item_pages,
+						100.0 * vacrel->lpdead_item_pages / vacrel->rel_pages,
+						(long long) vacrel->lpdead_items)));
+	}
+	else if (lazy_vacuum_all_indexes(vacrel))
+	{
+		/*
+		 * We successfully completed a round of index vacuuming.  Do related
+		 * heap vacuuming now.
+		 */
+		lazy_vacuum_heap_rel(vacrel);
+	}
+	else
+	{
+		/*
+		 * Emergency case.
+		 *
+		 * We attempted index vacuuming, but didn't finish a full round/full
+		 * index scan.  This happens when relfrozenxid or relminmxid is too
+		 * far in the past.
+		 *
+		 * From this point on the VACUUM operation will do no further index
+		 * vacuuming or heap vacuuming.  It will do any remaining pruning that
+		 * may be required, plus other heap-related and relation-level
+		 * maintenance tasks.  But that's it.
+		 */
+	}
 
 	/*
 	 * Forget the now-vacuumed tuples -- just press on
@@ -2004,16 +2167,30 @@ lazy_vacuum(LVRelState *vacrel)
 
 /*
  *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
+ *
+ * Returns true in the common case when all indexes were successfully
+ * vacuumed.  Returns false in rare cases where we determined that the ongoing
+ * VACUUM operation is at risk of taking too long to finish, leading to
+ * wraparound failure.
  */
-static void
+static bool
 lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
+	bool		allindexes = true;
+
 	Assert(vacrel->nindexes > 0);
 	Assert(vacrel->do_index_vacuuming);
 	Assert(vacrel->do_index_cleanup);
 	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
 	Assert(MultiXactIdIsValid(vacrel->relminmxid));
 
+	/* Precheck for XID wraparound emergencies */
+	if (should_speedup_failsafe(vacrel))
+	{
+		/* Wraparound emergency -- don't even start an index scan */
+		return false;
+	}
+
 	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
@@ -2028,26 +2205,42 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 			vacrel->indstats[idx] =
 				lazy_vacuum_one_index(indrel, istat, vacrel->old_live_tuples,
 									  vacrel);
+
+			if (should_speedup_failsafe(vacrel))
+			{
+				/* Wraparound emergency -- end current index scan */
+				allindexes = false;
+				break;
+			}
 		}
 	}
 	else
 	{
+		/* Note: parallel VACUUM only gets the precheck */
+		allindexes = true;
+
 		/* Outsource everything to parallel variant */
 		do_parallel_lazy_vacuum_all_indexes(vacrel);
 	}
 
 	/*
 	 * We delete all LP_DEAD items from the first heap pass in all indexes on
-	 * each call here.  This makes the next call to lazy_vacuum_heap_rel()
-	 * safe.
+	 * each call here (except calls where we don't finish all indexes).  This
+	 * makes the next call to lazy_vacuum_heap_rel() safe.
 	 */
 	Assert(vacrel->num_index_scans > 0 ||
 		   vacrel->dead_tuples->num_tuples == vacrel->lpdead_items);
 
-	/* Increase and report the number of index scans */
+	/*
+	 * Increase and report the number of index scans.  Note that we include
+	 * the case where we started a round index scanning that we weren't able
+	 * the case where we started a round of index scanning that we weren't
+	 * able to finish.
 	vacrel->num_index_scans++;
 	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
 								 vacrel->num_index_scans);
+
+	return allindexes;
 }
 
 /*
@@ -2340,6 +2533,75 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
 	return (offnum <= maxoff);
 }
 
+/*
+ * Determine if there is an unacceptable risk of wraparound failure due to the
+ * fact that the ongoing VACUUM is taking too long -- the table that is being
+ * vacuumed should not have a relfrozenxid or relminmxid that is too far in
+ * the past.
+ *
+ * Note that we deliberately don't vary our behavior based on factors like
+ * whether or not the ongoing VACUUM is aggressive.  If it's not aggressive we
+ * probably won't be able to advance relfrozenxid during this VACUUM.  If we
+ * can't, then an anti-wraparound VACUUM should take place immediately after
+ * we finish up.  We should be able to bypass all index vacuuming for the
+ * later anti-wraparound VACUUM.
+ *
+ * If the user-configurable threshold has been crossed then hurry things up:
+ * Stop applying any VACUUM cost delay going forward, and remember to skip any
+ * further index vacuuming (and heap vacuuming too, in the common case where
+ * table has indexes but not in one-pass VACUUM case).  Return true to inform
+ * caller of the emergency.  Otherwise return false.
+ *
+ * Caller is expected to call here before and after vacuuming each index in
+ * the case of two-pass VACUUM, or every BYPASS_EMERGENCY_MIN_PAGES blocks in
+ * the case of no-indexes/one-pass VACUUM.
+ */
+static bool
+should_speedup_failsafe(LVRelState *vacrel)
+{
+	/* Avoid calling vacuum_xid_limit_emergency() very frequently */
+	if (vacrel->num_index_scans == 0 &&
+		vacrel->rel_pages <= BYPASS_EMERGENCY_MIN_PAGES)
+		return false;
+
+	/* Don't warn more than once per VACUUM */
+	if (vacrel->do_failsafe_speedup)
+		return true;
+
+	if (unlikely(vacuum_xid_limit_emergency(vacrel->relfrozenxid,
+											vacrel->relminmxid)))
+	{
+		/*
+		 * Wraparound emergency -- the table's relfrozenxid or relminmxid is
+		 * too far in the past
+		 */
+		Assert(vacrel->do_index_vacuuming);
+		Assert(vacrel->do_index_cleanup);
+
+		vacrel->do_index_vacuuming = false;
+		vacrel->do_index_cleanup = false;
+		vacrel->do_failsafe_speedup = true;
+
+		ereport(WARNING,
+				(errmsg("abandoned index vacuuming of table \"%s.%s.%s\" as a fail safe after %d index scans",
+						get_database_name(MyDatabaseId),
+						vacrel->relnamespace,
+						vacrel->relname,
+						vacrel->num_index_scans),
+				 errdetail("table's relfrozenxid or relminmxid is too far in the past"),
+				 errhint("Consider increasing configuration parameter \"maintenance_work_mem\" or \"autovacuum_work_mem\".\n"
+						 "You might also need to consider other ways for VACUUM to keep up with the allocation of transaction IDs.")));
+
+		/* Stop applying cost limits from this point on */
+		VacuumCostActive = false;
+		VacuumCostBalance = 0;
+
+		return true;
+	}
+
+	return false;
+}
+
 static void
 do_parallel_lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 25465b05dd..51e0f4db4d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -62,6 +62,8 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_skip_index_age;
+int			vacuum_multixact_skip_index_age;
 
 
 /* A few variables that don't seem worth passing around as parameters */
@@ -1134,6 +1136,65 @@ vacuum_set_xid_limits(Relation rel,
 	}
 }
 
+/*
+ * vacuum_xid_limit_emergency() -- Handle wraparound emergencies
+ *
+ * Input parameters are the target relation's relfrozenxid and relminmxid.
+ */
+bool
+vacuum_xid_limit_emergency(TransactionId relfrozenxid, MultiXactId relminmxid)
+{
+	TransactionId xid_skip_limit;
+	MultiXactId	  multi_skip_limit;
+	int			  skip_index_vacuum;
+
+	Assert(TransactionIdIsNormal(relfrozenxid));
+	Assert(MultiXactIdIsValid(relminmxid));
+
+	/*
+	 * Determine the index skipping age to use.  In any case not less than
+	 * autovacuum_freeze_max_age * 1.05, so that only anti-wraparound
+	 * autovacuums and aggressive scans can end up skipping index vacuuming.
+	 */
+	skip_index_vacuum = Max(vacuum_skip_index_age, autovacuum_freeze_max_age * 1.05);
+
+	xid_skip_limit = ReadNextTransactionId() - skip_index_vacuum;
+	if (!TransactionIdIsNormal(xid_skip_limit))
+		xid_skip_limit = FirstNormalTransactionId;
+
+	if (TransactionIdIsNormal(relfrozenxid) &&
+		TransactionIdPrecedes(relfrozenxid, xid_skip_limit))
+	{
+		/* The table's relfrozenxid is too old */
+		return true;
+	}
+
+	/*
+	 * Similar to above, determine the index skipping age to use for multixact.
+	 * In any case not less than autovacuum_multixact_freeze_max_age * 1.05.
+	 */
+	skip_index_vacuum = Max(vacuum_multixact_skip_index_age,
+							autovacuum_multixact_freeze_max_age * 1.05);
+
+	/*
+	 * Compute the multixact limit below which index vacuuming is skipped,
+	 * based on the multixact index skipping age determined above.  Clamp the
+	 * result so that it cannot wrap around past FirstMultiXactId.
+	 */
+	multi_skip_limit = ReadNextMultiXactId() - skip_index_vacuum;
+	if (multi_skip_limit < FirstMultiXactId)
+		multi_skip_limit = FirstMultiXactId;
+
+	if (MultiXactIdIsValid(relminmxid) &&
+		MultiXactIdPrecedes(relminmxid, multi_skip_limit))
+	{
+		/* The table's relminmxid is too old */
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * vac_estimate_reltuples() -- estimate the new value for pg_class.reltuples
  *
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 60a9c7a2a0..e5c6561c6c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2646,6 +2646,26 @@ static struct config_int ConfigureNamesInt[] =
 		0, 0, 1000000,		/* see ComputeXidHorizons */
 		NULL, NULL, NULL
 	},
+	{
+		{"vacuum_skip_index_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Age at which VACUUM should skip index vacuuming."),
+			NULL
+		},
+		&vacuum_skip_index_age,
+		/* upper limit matches 1.05 * autovacuum_freeze_max_age's upper limit */
+		1800000000, 0, 2100000000,
+		NULL, NULL, NULL
+	},
+	{
+		{"vacuum_multixact_skip_index_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Multixact age at which VACUUM should skip index vacuuming."),
+			NULL
+		},
+		&vacuum_multixact_skip_index_age,
+		/* upper limit matches 1.05 * autovacuum_multixact_freeze_max_age's upper limit */
+		1800000000, 0, 2100000000,
+		NULL, NULL, NULL
+	},
 
 	/*
 	 * See also CheckRequiredParameterValues() if this parameter changes
@@ -3246,7 +3266,10 @@ static struct config_int ConfigureNamesInt[] =
 			NULL
 		},
 		&autovacuum_freeze_max_age,
-		/* see pg_resetwal if you change the upper-limit value */
+		/*
+		 * see pg_resetwal and vacuum_skip_index_age if you change the
+		 * upper-limit value.
+		 */
 		200000000, 100000, 2000000000,
 		NULL, NULL, NULL
 	},
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 39da7cc942..79a6a47219 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -675,6 +675,8 @@
 #vacuum_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_freeze_table_age = 150000000
+#vacuum_skip_index_age = 1800000000
+#vacuum_multixact_skip_index_age = 1800000000
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0c9128a55d..8b2fe112e4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8601,6 +8601,31 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-skip-index-age" xreflabel="vacuum_skip_index_age">
+      <term><varname>vacuum_skip_index_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_skip_index_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        <command>VACUUM</command> skips index cleanup if the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
+        the age specified by this setting.  Skipping index cleanup allows
+        <command>VACUUM</command> to finish sooner, advancing
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        as quickly as possible.  The behavior is equivalent to setting the
+        <literal>INDEX_CLEANUP</literal> option to <literal>OFF</literal>, except
+        that this parameter can cause index cleanup to be skipped even in the
+        middle of a vacuum operation.  The default is 1.8 billion transactions.
+        Although users can set this value anywhere from zero to 2.1 billion,
+        <command>VACUUM</command> will silently adjust the effective value to no less than 105% of
+        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that only anti-wraparound
+        autovacuums and aggressive scans have a chance to skip index cleanup.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-multixact-freeze-table-age" xreflabel="vacuum_multixact_freeze_table_age">
       <term><varname>vacuum_multixact_freeze_table_age</varname> (<type>integer</type>)
       <indexterm>
@@ -8647,6 +8672,32 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-multixact-vacuum-skip-index-age" xreflabel="vacuum_multixact_skip_index_age">
+      <term><varname>vacuum_multixact_skip_index_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_multixact_skip_index_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        <command>VACUUM</command> skips index cleanup if the table's
+        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
+        the age specified by this setting.   A <command>VACUUM</command> with skipping
+        index cleanup hurries finishing <command>VACUUM</command> to advance
+        <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+        as quickly as possible.  This is an equivalent behavior to setting
+        <literal>OFF</literal> to <literal>INDEX_CLEANUP</literal> option except that
+        this parameters skips index cleanup even in the middle of vacuum operation.
+        The default is 1.8 billion multixacts. Although users can set this value
+        anywhere from zero to 2.1 billion, <command>VACUUM</command> will silently
+        adjust the effective value more than 105% of
+        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that only
+        anti-wraparound autovacuums and aggressive scans have a chance to skip
+        index cleanup.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 3bbae6dd91..a81d9ce839 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -609,8 +609,14 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
 
    <para>
     If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
+    system will begin to skip index cleanup so that the vacuum operation
+    finishes sooner.  <xref linkend="guc-vacuum-skip-index-age"/> controls when
+    <command>VACUUM</command> and autovacuum do that.
+   </para>
+
+    <para>
+     The system emits warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
 
 <programlisting>
 WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-- 
2.27.0
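
To make the clamping rule described in the documentation above concrete, here
is a minimal standalone C sketch.  It is not code from the patch; the function
name and the example values are hypothetical, and only the documented rule
(raise the effective value to at least 105% of autovacuum_freeze_max_age) is
assumed:

#include <stdio.h>

/*
 * Hypothetical sketch, not the patch's implementation: raise the effective
 * skip-index age to at least 105% of autovacuum_freeze_max_age, so that only
 * anti-wraparound autovacuums and aggressive scans can skip index cleanup.
 */
static int
effective_skip_index_age(int vacuum_skip_index_age, int autovacuum_freeze_max_age)
{
    int     floor_age = (int) (autovacuum_freeze_max_age * 1.05);

    return (vacuum_skip_index_age < floor_age) ? floor_age : vacuum_skip_index_age;
}

int
main(void)
{
    /* with the default autovacuum_freeze_max_age of 200 million */
    printf("%d\n", effective_skip_index_age(100000000, 200000000));    /* 210000000 */
    printf("%d\n", effective_skip_index_age(1800000000, 200000000));   /* 1800000000 */
    return 0;
}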

v10-0001-Simplify-state-managed-by-VACUUM.patchapplication/octet-stream; name=v10-0001-Simplify-state-managed-by-VACUUM.patchDownload
From cebcf94cf9380aa22499c0f97fddd02d1e3b2242 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 28 Mar 2021 20:55:54 -0700
Subject: [PATCH v10 1/5] Simplify state managed by VACUUM.

Reorganize the state struct used by VACUUM -- group related items
together to make it easier to understand.  Also stop relying on stack
variables inside lazy_scan_heap() -- move those into the state struct
instead.  Doing things this way simplifies large groups of related
functions whose function signatures had a lot of unnecessary redundancy.

Switch over to using int64 for the struct fields used to count things
that are reported to the user via log_autovacuum and VACUUM VERBOSE
output.  We were using double, but that doesn't seem to have any
advantages.  Using int64 makes it possible to add assertions that verify
that the first pass over the heap (pruning) encounters precisely the
same number of LP_DEAD items that get deleted from indexes later on, in
the second pass over the heap.  These assertions will be added in later
commits.

Finally, reorder functions so that functions that contain important and
essential steps for VACUUM appear before less important functions.  Also
try to order related functions based on the order in which they're
called during VACUUM.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkeOSYwC6KNckbhk2b1aNnWum6Yyn0NKP9D-Hq1LGTDPw@mail.gmail.com
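
As an illustration of why int64 counters permit the exact cross-checks
mentioned above, here is a minimal standalone C program.  It is not part of
the patch; it only demonstrates that above 2^53 a double can no longer
represent every integer, so two large tallies that ought to match can be
compared exactly only with an integer type such as int64:

#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    /* 2^53 + 1 is not representable as a double; it rounds to 2^53 */
    double  as_double = 9007199254740993.0;
    int64_t as_int64 = INT64_C(9007199254740993);

    printf("double: %.0f\n", as_double);        /* prints 9007199254740992 */
    printf("int64 : %" PRId64 "\n", as_int64);  /* prints 9007199254740993 */

    /* An exact equality assertion is only meaningful with the integer type */
    assert(as_int64 == INT64_C(9007199254740993));
    return 0;
}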
---
 src/include/access/genam.h           |    4 +-
 src/include/access/heapam.h          |    2 +-
 src/include/access/tableam.h         |    2 +-
 src/backend/access/heap/vacuumlazy.c | 1417 ++++++++++++++------------
 src/backend/access/index/indexam.c   |    8 +-
 src/backend/commands/vacuum.c        |   76 +-
 6 files changed, 799 insertions(+), 710 deletions(-)

diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 4515401869..480a4762f5 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -177,11 +177,11 @@ extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
-												IndexBulkDeleteResult *stats,
+												IndexBulkDeleteResult *istat,
 												IndexBulkDeleteCallback callback,
 												void *callback_state);
 extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
-												   IndexBulkDeleteResult *stats);
+												   IndexBulkDeleteResult *istat);
 extern bool index_can_return(Relation indexRelation, int attno);
 extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
 									uint16 procnum);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index d803f27787..ceb625e13a 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -196,7 +196,7 @@ extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
 
 /* in heap/vacuumlazy.c */
 struct VacuumParams;
-extern void heap_vacuum_rel(Relation onerel,
+extern void heap_vacuum_rel(Relation rel,
 							struct VacuumParams *params, BufferAccessStrategy bstrategy);
 extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 414b6b4d57..9f1e4a1ac9 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -625,7 +625,7 @@ typedef struct TableAmRoutine
 	 * There probably, in the future, needs to be a separate callback to
 	 * integrate with autovacuum's scheduling.
 	 */
-	void		(*relation_vacuum) (Relation onerel,
+	void		(*relation_vacuum) (Relation rel,
 									struct VacuumParams *params,
 									BufferAccessStrategy bstrategy);
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index efe8761702..6bd409c095 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -146,7 +146,7 @@
  * Macro to check if we are in a parallel vacuum.  If true, we are in the
  * parallel mode and the DSM segment is initialized.
  */
-#define ParallelVacuumIsActive(lps) PointerIsValid(lps)
+#define ParallelVacuumIsActive(vacrel) ((vacrel)->lps != NULL)
 
 /* Phases of vacuum during which we report error context. */
 typedef enum
@@ -264,7 +264,7 @@ typedef struct LVShared
 typedef struct LVSharedIndStats
 {
 	bool		updated;		/* are the stats updated? */
-	IndexBulkDeleteResult stats;
+	IndexBulkDeleteResult istat;
 } LVSharedIndStats;
 
 /* Struct for maintaining a parallel vacuum state. */
@@ -290,41 +290,68 @@ typedef struct LVParallelState
 	int			nindexes_parallel_condcleanup;
 } LVParallelState;
 
-typedef struct LVRelStats
+typedef struct LVRelState
 {
-	char	   *relnamespace;
-	char	   *relname;
+	/* Target heap relation and its indexes */
+	Relation	rel;
+	Relation   *indrels;
+	int			nindexes;
 	/* useindex = true means two-pass strategy; false means one-pass */
 	bool		useindex;
-	/* Overall statistics about rel */
+
+	/* Buffer access strategy and parallel state */
+	BufferAccessStrategy bstrategy;
+	LVParallelState *lps;
+
+	/* Statistics from pg_class when we start out */
 	BlockNumber old_rel_pages;	/* previous value of pg_class.relpages */
-	BlockNumber rel_pages;		/* total number of pages */
-	BlockNumber scanned_pages;	/* number of pages we examined */
-	BlockNumber pinskipped_pages;	/* # of pages we skipped due to a pin */
-	BlockNumber frozenskipped_pages;	/* # of frozen pages we skipped */
-	BlockNumber tupcount_pages; /* pages whose tuples we counted */
 	double		old_live_tuples;	/* previous value of pg_class.reltuples */
-	double		new_rel_tuples; /* new estimated total # of tuples */
-	double		new_live_tuples;	/* new estimated total # of live tuples */
-	double		new_dead_tuples;	/* new estimated total # of dead tuples */
-	BlockNumber pages_removed;
-	double		tuples_deleted;
-	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
-	LVDeadTuples *dead_tuples;
-	int			num_index_scans;
+	/* rel's initial relfrozenxid and relminmxid */
+	TransactionId relfrozenxid;
+	MultiXactId relminmxid;
 	TransactionId latestRemovedXid;
-	bool		lock_waiter_detected;
 
-	/* Statistics about indexes */
-	IndexBulkDeleteResult **indstats;
-	int			nindexes;
+	/* VACUUM operation's cutoff for pruning */
+	TransactionId OldestXmin;
+	/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
+	TransactionId FreezeLimit;
+	MultiXactId MultiXactCutoff;
 
-	/* Used for error callback */
+	/* Error reporting state */
+	char	   *relnamespace;
+	char	   *relname;
 	char	   *indname;
 	BlockNumber blkno;			/* used only for heap operations */
 	OffsetNumber offnum;		/* used only for heap operations */
 	VacErrPhase phase;
-} LVRelStats;
+
+	/*
+	 * State managed by lazy_scan_heap() follows
+	 */
+	LVDeadTuples *dead_tuples;	/* items to vacuum from indexes */
+	BlockNumber rel_pages;		/* total number of pages */
+	BlockNumber scanned_pages;	/* number of pages we examined */
+	BlockNumber pinskipped_pages;	/* # of pages skipped due to a pin */
+	BlockNumber frozenskipped_pages;	/* # of frozen pages we skipped */
+	BlockNumber tupcount_pages; /* pages whose tuples we counted */
+	BlockNumber pages_removed;	/* pages removed by truncation */
+	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
+	bool		lock_waiter_detected;
+
+	/* Statistics output by us, for table */
+	double		new_rel_tuples; /* new estimated total # of tuples */
+	double		new_live_tuples;	/* new estimated total # of live tuples */
+	/* Statistics output by index AMs */
+	IndexBulkDeleteResult **indstats;
+
+	/* Instrumentation counters */
+	int			num_index_scans;
+	int64		tuples_deleted; /* # deleted from table */
+	int64		new_dead_tuples;	/* new estimated total # of dead items in
+									 * table */
+	int64		num_tuples;		/* total number of nonremovable tuples */
+	int64		live_tuples;	/* live tuples (reltuples estimate) */
+} LVRelState;
 
 /* Struct for saving and restoring vacuum error information. */
 typedef struct LVSavedErrInfo
@@ -334,77 +361,72 @@ typedef struct LVSavedErrInfo
 	VacErrPhase phase;
 } LVSavedErrInfo;
 
-/* A few variables that don't seem worth passing around as parameters */
+/* elevel controls whole VACUUM's verbosity */
 static int	elevel = -1;
 
-static TransactionId OldestXmin;
-static TransactionId FreezeLimit;
-static MultiXactId MultiXactCutoff;
-
-static BufferAccessStrategy vac_strategy;
-
 
 /* non-export function prototypes */
-static void lazy_scan_heap(Relation onerel, VacuumParams *params,
-						   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
+static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
 						   bool aggressive);
-static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
+static void lazy_vacuum_all_indexes(LVRelState *vacrel);
+static void lazy_vacuum_heap_rel(LVRelState *vacrel);
+static int	lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+								  Buffer buffer, int tupindex, Buffer *vmbuffer);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
-									LVRelStats *vacrelstats);
-static void lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
-									LVRelStats *vacrelstats, LVParallelState *lps,
-									int nindexes);
-static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
-							  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats);
-static void lazy_cleanup_index(Relation indrel,
-							   IndexBulkDeleteResult **stats,
-							   double reltuples, bool estimated_count, LVRelStats *vacrelstats);
-static int	lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-							 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
-static bool should_attempt_truncation(VacuumParams *params,
-									  LVRelStats *vacrelstats);
-static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
-static BlockNumber count_nondeletable_pages(Relation onerel,
-											LVRelStats *vacrelstats);
-static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
+									LVRelState *vacrel);
+static void do_parallel_lazy_vacuum_all_indexes(LVRelState *vacrel);
+static void do_parallel_lazy_cleanup_all_indexes(LVRelState *vacrel);
+static void do_parallel_vacuum_or_cleanup(LVRelState *vacrel, int nworkers);
+static void do_parallel_processing(LVRelState *vacrel,
+								   LVShared *lvshared);
+static void do_serial_processing_for_unsafe_indexes(LVRelState *vacrel,
+													LVShared *lvshared);
+static IndexBulkDeleteResult *parallel_process_one_index(Relation indrel,
+														 IndexBulkDeleteResult *istat,
+														 LVShared *lvshared,
+														 LVSharedIndStats *shared_indstats,
+														 LVRelState *vacrel);
+static void lazy_cleanup_all_indexes(LVRelState *vacrel);
+static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
+													IndexBulkDeleteResult *istat,
+													double reltuples,
+													LVRelState *vacrel);
+static IndexBulkDeleteResult *lazy_cleanup_one_index(Relation indrel,
+													 IndexBulkDeleteResult *istat,
+													 double reltuples,
+													 bool estimated_count,
+													 LVRelState *vacrel);
+static bool should_attempt_truncation(LVRelState *vacrel,
+									  VacuumParams *params);
+static void lazy_truncate_heap(LVRelState *vacrel);
+static BlockNumber count_nondeletable_pages(LVRelState *vacrel);
+static long compute_max_dead_tuples(BlockNumber relblocks, bool hasindex);
+static void lazy_space_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber relblocks);
+static void lazy_space_free(LVRelState *vacrel);
 static void lazy_record_dead_tuple(LVDeadTuples *dead_tuples,
 								   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
 static int	vac_cmp_itemptr(const void *left, const void *right);
-static bool heap_page_is_all_visible(Relation rel, Buffer buf,
-									 LVRelStats *vacrelstats,
+static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
-static void lazy_parallel_vacuum_indexes(Relation *Irel, LVRelStats *vacrelstats,
-										 LVParallelState *lps, int nindexes);
-static void parallel_vacuum_index(Relation *Irel, LVShared *lvshared,
-								  LVDeadTuples *dead_tuples, int nindexes,
-								  LVRelStats *vacrelstats);
-static void vacuum_indexes_leader(Relation *Irel, LVRelStats *vacrelstats,
-								  LVParallelState *lps, int nindexes);
-static void vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
-							 LVShared *lvshared, LVSharedIndStats *shared_indstats,
-							 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats);
-static void lazy_cleanup_all_indexes(Relation *Irel, LVRelStats *vacrelstats,
-									 LVParallelState *lps, int nindexes);
-static long compute_max_dead_tuples(BlockNumber relblocks, bool hasindex);
-static int	compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
+static int	compute_parallel_vacuum_workers(LVRelState *vacrel,
+											int nrequested,
 											bool *can_parallel_vacuum);
-static void prepare_index_statistics(LVShared *lvshared, bool *can_parallel_vacuum,
-									 int nindexes);
-static void update_index_statistics(Relation *Irel, IndexBulkDeleteResult **stats,
-									int nindexes);
-static LVParallelState *begin_parallel_vacuum(Oid relid, Relation *Irel,
-											  LVRelStats *vacrelstats, BlockNumber nblocks,
-											  int nindexes, int nrequested);
-static void end_parallel_vacuum(IndexBulkDeleteResult **stats,
-								LVParallelState *lps, int nindexes);
-static LVSharedIndStats *get_indstats(LVShared *lvshared, int n);
-static bool skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared);
+static void update_index_statistics(LVRelState *vacrel);
+static LVParallelState *begin_parallel_vacuum(LVRelState *vacrel,
+											  BlockNumber nblocks,
+											  int nrequested);
+static void end_parallel_vacuum(LVRelState *vacrel);
+static LVSharedIndStats *parallel_stats_for_idx(LVShared *lvshared, int getidx);
+static bool parallel_processing_is_safe(Relation indrel, LVShared *lvshared);
 static void vacuum_error_callback(void *arg);
-static void update_vacuum_error_info(LVRelStats *errinfo, LVSavedErrInfo *saved_err_info,
+static void update_vacuum_error_info(LVRelState *vacrel,
+									 LVSavedErrInfo *saved_vacrel,
 									 int phase, BlockNumber blkno,
 									 OffsetNumber offnum);
-static void restore_vacuum_error_info(LVRelStats *errinfo, const LVSavedErrInfo *saved_err_info);
+static void restore_vacuum_error_info(LVRelState *vacrel,
+									  const LVSavedErrInfo *saved_vacrel);
 
 
 /*
@@ -417,12 +439,10 @@ static void restore_vacuum_error_info(LVRelStats *errinfo, const LVSavedErrInfo
  *		and locked the relation.
  */
 void
-heap_vacuum_rel(Relation onerel, VacuumParams *params,
+heap_vacuum_rel(Relation rel, VacuumParams *params,
 				BufferAccessStrategy bstrategy)
 {
-	LVRelStats *vacrelstats;
-	Relation   *Irel;
-	int			nindexes;
+	LVRelState *vacrel;
 	PGRUsage	ru0;
 	TimestampTz starttime = 0;
 	WalUsage	walusage_start = pgWalUsage;
@@ -444,15 +464,14 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	ErrorContextCallback errcallback;
 	PgStat_Counter startreadtime = 0;
 	PgStat_Counter startwritetime = 0;
+	TransactionId OldestXmin;
+	TransactionId FreezeLimit;
+	MultiXactId MultiXactCutoff;
 
 	Assert(params != NULL);
 	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
 	Assert(params->truncate != VACOPT_TERNARY_DEFAULT);
 
-	/* not every AM requires these to be valid, but heap does */
-	Assert(TransactionIdIsNormal(onerel->rd_rel->relfrozenxid));
-	Assert(MultiXactIdIsValid(onerel->rd_rel->relminmxid));
-
 	/* measure elapsed time iff autovacuum logging requires it */
 	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
 	{
@@ -471,11 +490,9 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 		elevel = DEBUG2;
 
 	pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM,
-								  RelationGetRelid(onerel));
+								  RelationGetRelid(rel));
 
-	vac_strategy = bstrategy;
-
-	vacuum_set_xid_limits(onerel,
+	vacuum_set_xid_limits(rel,
 						  params->freeze_min_age,
 						  params->freeze_table_age,
 						  params->multixact_freeze_min_age,
@@ -489,42 +506,46 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	 * table's minimum MultiXactId is older than or equal to the requested
 	 * mxid full-table scan limit; or if DISABLE_PAGE_SKIPPING was specified.
 	 */
-	aggressive = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
+	aggressive = TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
 											   xidFullScanLimit);
-	aggressive |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
+	aggressive |= MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
 											  mxactFullScanLimit);
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 		aggressive = true;
 
-	vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
+	vacrel = (LVRelState *) palloc0(sizeof(LVRelState));
 
-	vacrelstats->relnamespace = get_namespace_name(RelationGetNamespace(onerel));
-	vacrelstats->relname = pstrdup(RelationGetRelationName(onerel));
-	vacrelstats->indname = NULL;
-	vacrelstats->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
-	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
-	vacrelstats->old_live_tuples = onerel->rd_rel->reltuples;
-	vacrelstats->num_index_scans = 0;
-	vacrelstats->pages_removed = 0;
-	vacrelstats->lock_waiter_detected = false;
+	/* Set up high level stuff about rel */
+	vacrel->rel = rel;
+	vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
+					 &vacrel->indrels);
+	vacrel->useindex = (vacrel->nindexes > 0 &&
+						params->index_cleanup == VACOPT_TERNARY_ENABLED);
+	vacrel->bstrategy = bstrategy;
+	vacrel->old_rel_pages = rel->rd_rel->relpages;
+	vacrel->old_live_tuples = rel->rd_rel->reltuples;
+	vacrel->relfrozenxid = rel->rd_rel->relfrozenxid;
+	vacrel->relminmxid = rel->rd_rel->relminmxid;
+	vacrel->latestRemovedXid = InvalidTransactionId;
 
-	/* Open all indexes of the relation */
-	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
-	vacrelstats->useindex = (nindexes > 0 &&
-							 params->index_cleanup == VACOPT_TERNARY_ENABLED);
+	/* Set cutoffs for entire VACUUM */
+	vacrel->OldestXmin = OldestXmin;
+	vacrel->FreezeLimit = FreezeLimit;
+	vacrel->MultiXactCutoff = MultiXactCutoff;
 
-	vacrelstats->indstats = (IndexBulkDeleteResult **)
-		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
-	vacrelstats->nindexes = nindexes;
+	vacrel->relnamespace = get_namespace_name(RelationGetNamespace(rel));
+	vacrel->relname = pstrdup(RelationGetRelationName(rel));
+	vacrel->indname = NULL;
+	vacrel->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
 
 	/* Save index names iff autovacuum logging requires it */
-	if (IsAutoVacuumWorkerProcess() &&
-		params->log_min_duration >= 0 &&
-		vacrelstats->nindexes > 0)
+	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0 &&
+		vacrel->nindexes > 0)
 	{
-		indnames = palloc(sizeof(char *) * vacrelstats->nindexes);
-		for (int i = 0; i < vacrelstats->nindexes; i++)
-			indnames[i] = pstrdup(RelationGetRelationName(Irel[i]));
+		indnames = palloc(sizeof(char *) * vacrel->nindexes);
+		for (int i = 0; i < vacrel->nindexes; i++)
+			indnames[i] =
+				pstrdup(RelationGetRelationName(vacrel->indrels[i]));
 	}
 
 	/*
@@ -539,15 +560,15 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	 * information is restored at the end of those phases.
 	 */
 	errcallback.callback = vacuum_error_callback;
-	errcallback.arg = vacrelstats;
+	errcallback.arg = vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
 
 	/* Do the vacuuming */
-	lazy_scan_heap(onerel, params, vacrelstats, Irel, nindexes, aggressive);
+	lazy_scan_heap(vacrel, params, aggressive);
 
 	/* Done with indexes */
-	vac_close_indexes(nindexes, Irel, NoLock);
+	vac_close_indexes(vacrel->nindexes, vacrel->indrels, NoLock);
 
 	/*
 	 * Compute whether we actually scanned the all unfrozen pages. If we did,
@@ -556,8 +577,8 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	 * NB: We need to check this before truncating the relation, because that
 	 * will change ->rel_pages.
 	 */
-	if ((vacrelstats->scanned_pages + vacrelstats->frozenskipped_pages)
-		< vacrelstats->rel_pages)
+	if ((vacrel->scanned_pages + vacrel->frozenskipped_pages)
+		< vacrel->rel_pages)
 	{
 		Assert(!aggressive);
 		scanned_all_unfrozen = false;
@@ -568,17 +589,17 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	/*
 	 * Optionally truncate the relation.
 	 */
-	if (should_attempt_truncation(params, vacrelstats))
+	if (should_attempt_truncation(vacrel, params))
 	{
 		/*
 		 * Update error traceback information.  This is the last phase during
 		 * which we add context information to errors, so we don't need to
 		 * revert to the previous phase.
 		 */
-		update_vacuum_error_info(vacrelstats, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
-								 vacrelstats->nonempty_pages,
+		update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
+								 vacrel->nonempty_pages,
 								 InvalidOffsetNumber);
-		lazy_truncate_heap(onerel, vacrelstats);
+		lazy_truncate_heap(vacrel);
 	}
 
 	/* Pop the error context stack */
@@ -602,30 +623,30 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	 * Also, don't change relfrozenxid/relminmxid if we skipped any pages,
 	 * since then we don't know for certain that all tuples have a newer xmin.
 	 */
-	new_rel_pages = vacrelstats->rel_pages;
-	new_live_tuples = vacrelstats->new_live_tuples;
+	new_rel_pages = vacrel->rel_pages;
+	new_live_tuples = vacrel->new_live_tuples;
 
-	visibilitymap_count(onerel, &new_rel_allvisible, NULL);
+	visibilitymap_count(rel, &new_rel_allvisible, NULL);
 	if (new_rel_allvisible > new_rel_pages)
 		new_rel_allvisible = new_rel_pages;
 
 	new_frozen_xid = scanned_all_unfrozen ? FreezeLimit : InvalidTransactionId;
 	new_min_multi = scanned_all_unfrozen ? MultiXactCutoff : InvalidMultiXactId;
 
-	vac_update_relstats(onerel,
+	vac_update_relstats(rel,
 						new_rel_pages,
 						new_live_tuples,
 						new_rel_allvisible,
-						nindexes > 0,
+						vacrel->nindexes > 0,
 						new_frozen_xid,
 						new_min_multi,
 						false);
 
 	/* report results to the stats collector, too */
-	pgstat_report_vacuum(RelationGetRelid(onerel),
-						 onerel->rd_rel->relisshared,
+	pgstat_report_vacuum(RelationGetRelid(rel),
+						 rel->rd_rel->relisshared,
 						 Max(new_live_tuples, 0),
-						 vacrelstats->new_dead_tuples);
+						 vacrel->new_dead_tuples);
 	pgstat_progress_end_command();
 
 	/* and log the action if appropriate */
@@ -676,39 +697,39 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 			}
 			appendStringInfo(&buf, msgfmt,
 							 get_database_name(MyDatabaseId),
-							 vacrelstats->relnamespace,
-							 vacrelstats->relname,
-							 vacrelstats->num_index_scans);
+							 vacrel->relnamespace,
+							 vacrel->relname,
+							 vacrel->num_index_scans);
 			appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"),
-							 vacrelstats->pages_removed,
-							 vacrelstats->rel_pages,
-							 vacrelstats->pinskipped_pages,
-							 vacrelstats->frozenskipped_pages);
+							 vacrel->pages_removed,
+							 vacrel->rel_pages,
+							 vacrel->pinskipped_pages,
+							 vacrel->frozenskipped_pages);
 			appendStringInfo(&buf,
-							 _("tuples: %.0f removed, %.0f remain, %.0f are dead but not yet removable, oldest xmin: %u\n"),
-							 vacrelstats->tuples_deleted,
-							 vacrelstats->new_rel_tuples,
-							 vacrelstats->new_dead_tuples,
+							 _("tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"),
+							 (long long) vacrel->tuples_deleted,
+							 (long long) vacrel->new_rel_tuples,
+							 (long long) vacrel->new_dead_tuples,
 							 OldestXmin);
 			appendStringInfo(&buf,
 							 _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
 							 (long long) VacuumPageHit,
 							 (long long) VacuumPageMiss,
 							 (long long) VacuumPageDirty);
-			for (int i = 0; i < vacrelstats->nindexes; i++)
+			for (int i = 0; i < vacrel->nindexes; i++)
 			{
-				IndexBulkDeleteResult *stats = vacrelstats->indstats[i];
+				IndexBulkDeleteResult *istat = vacrel->indstats[i];
 
-				if (!stats)
+				if (!istat)
 					continue;
 
 				appendStringInfo(&buf,
 								 _("index \"%s\": pages: %u in total, %u newly deleted, %u currently deleted, %u reusable\n"),
 								 indnames[i],
-								 stats->num_pages,
-								 stats->pages_newly_deleted,
-								 stats->pages_deleted,
-								 stats->pages_free);
+								 istat->num_pages,
+								 istat->pages_newly_deleted,
+								 istat->pages_deleted,
+								 istat->pages_free);
 			}
 			appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
 							 read_rate, write_rate);
@@ -737,10 +758,10 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	}
 
 	/* Cleanup index statistics and index names */
-	for (int i = 0; i < vacrelstats->nindexes; i++)
+	for (int i = 0; i < vacrel->nindexes; i++)
 	{
-		if (vacrelstats->indstats[i])
-			pfree(vacrelstats->indstats[i]);
+		if (vacrel->indstats[i])
+			pfree(vacrel->indstats[i]);
 
 		if (indnames && indnames[i])
 			pfree(indnames[i]);
@@ -764,20 +785,21 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
  * which would be after the rows have become inaccessible.
  */
 static void
-vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
+vacuum_log_cleanup_info(LVRelState *vacrel)
 {
 	/*
 	 * Skip this for relations for which no WAL is to be written, or if we're
 	 * not trying to support archive recovery.
 	 */
-	if (!RelationNeedsWAL(rel) || !XLogIsNeeded())
+	if (!RelationNeedsWAL(vacrel->rel) || !XLogIsNeeded())
 		return;
 
 	/*
 	 * No need to write the record at all unless it contains a valid value
 	 */
-	if (TransactionIdIsValid(vacrelstats->latestRemovedXid))
-		(void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid);
+	if (TransactionIdIsValid(vacrel->latestRemovedXid))
+		(void) log_heap_cleanup_info(vacrel->rel->rd_node,
+									 vacrel->latestRemovedXid);
 }
 
 /*
@@ -788,9 +810,9 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
  *		page, and set commit status bits (see heap_page_prune).  It also builds
  *		lists of dead tuples and pages with free space, calculates statistics
  *		on the number of live tuples in the heap, and marks pages as
- *		all-visible if appropriate.  When done, or when we run low on space for
- *		dead-tuple TIDs, invoke vacuuming of indexes and call lazy_vacuum_heap
- *		to reclaim dead line pointers.
+ *		all-visible if appropriate.  When done, or when we run low on space
+ *		for dead-tuple TIDs, invoke vacuuming of indexes and reclaim dead line
+ *		pointers.
  *
  *		If the table has at least two indexes, we execute both index vacuum
  *		and index cleanup with parallel workers unless parallel vacuum is
@@ -809,16 +831,12 @@ vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats)
  *		reference them have been killed.
  */
 static void
-lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
-			   Relation *Irel, int nindexes, bool aggressive)
+lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 {
-	LVParallelState *lps = NULL;
 	LVDeadTuples *dead_tuples;
 	BlockNumber nblocks,
 				blkno;
 	HeapTupleData tuple;
-	TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
-	TransactionId relminmxid = onerel->rd_rel->relminmxid;
 	BlockNumber empty_pages,
 				vacuumed_pages,
 				next_fsm_block_to_vacuum;
@@ -847,63 +865,47 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	if (aggressive)
 		ereport(elevel,
 				(errmsg("aggressively vacuuming \"%s.%s\"",
-						vacrelstats->relnamespace,
-						vacrelstats->relname)));
+						vacrel->relnamespace,
+						vacrel->relname)));
 	else
 		ereport(elevel,
 				(errmsg("vacuuming \"%s.%s\"",
-						vacrelstats->relnamespace,
-						vacrelstats->relname)));
+						vacrel->relnamespace,
+						vacrel->relname)));
 
 	empty_pages = vacuumed_pages = 0;
 	next_fsm_block_to_vacuum = (BlockNumber) 0;
 	num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
 
-	nblocks = RelationGetNumberOfBlocks(onerel);
-	vacrelstats->rel_pages = nblocks;
-	vacrelstats->scanned_pages = 0;
-	vacrelstats->tupcount_pages = 0;
-	vacrelstats->nonempty_pages = 0;
-	vacrelstats->latestRemovedXid = InvalidTransactionId;
+	nblocks = RelationGetNumberOfBlocks(vacrel->rel);
+	vacrel->rel_pages = nblocks;
+	vacrel->scanned_pages = 0;
+	vacrel->pinskipped_pages = 0;
+	vacrel->frozenskipped_pages = 0;
+	vacrel->tupcount_pages = 0;
+	vacrel->pages_removed = 0;
+	vacrel->nonempty_pages = 0;
+	vacrel->lock_waiter_detected = false;
 
-	vistest = GlobalVisTestFor(onerel);
+	/* Initialize instrumentation counters */
+	vacrel->num_index_scans = 0;
+	vacrel->tuples_deleted = 0;
+	vacrel->new_dead_tuples = 0;
+	vacrel->num_tuples = 0;
+	vacrel->live_tuples = 0;
+
+	vistest = GlobalVisTestFor(vacrel->rel);
+
+	vacrel->indstats = (IndexBulkDeleteResult **)
+		palloc0(vacrel->nindexes * sizeof(IndexBulkDeleteResult *));
 
 	/*
-	 * Initialize state for a parallel vacuum.  As of now, only one worker can
-	 * be used for an index, so we invoke parallelism only if there are at
-	 * least two indexes on a table.
+	 * Allocate the space for dead tuples.  Note that this handles parallel
+	 * VACUUM initialization as part of allocating shared memory space used
+	 * for dead_tuples.
 	 */
-	if (params->nworkers >= 0 && vacrelstats->useindex && nindexes > 1)
-	{
-		/*
-		 * Since parallel workers cannot access data in temporary tables, we
-		 * can't perform parallel vacuum on them.
-		 */
-		if (RelationUsesLocalBuffers(onerel))
-		{
-			/*
-			 * Give warning only if the user explicitly tries to perform a
-			 * parallel vacuum on the temporary table.
-			 */
-			if (params->nworkers > 0)
-				ereport(WARNING,
-						(errmsg("disabling parallel option of vacuum on \"%s\" --- cannot vacuum temporary tables in parallel",
-								vacrelstats->relname)));
-		}
-		else
-			lps = begin_parallel_vacuum(RelationGetRelid(onerel), Irel,
-										vacrelstats, nblocks, nindexes,
-										params->nworkers);
-	}
-
-	/*
-	 * Allocate the space for dead tuples in case parallel vacuum is not
-	 * initialized.
-	 */
-	if (!ParallelVacuumIsActive(lps))
-		lazy_space_alloc(vacrelstats, nblocks);
-
-	dead_tuples = vacrelstats->dead_tuples;
+	lazy_space_alloc(vacrel, params->nworkers, nblocks);
+	dead_tuples = vacrel->dead_tuples;
 	frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
 
 	/* Report that we're scanning the heap, advertising total # of blocks */
@@ -963,7 +965,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		{
 			uint8		vmstatus;
 
-			vmstatus = visibilitymap_get_status(onerel, next_unskippable_block,
+			vmstatus = visibilitymap_get_status(vacrel->rel,
+												next_unskippable_block,
 												&vmbuffer);
 			if (aggressive)
 			{
@@ -1004,11 +1007,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		/* see note above about forcing scanning of last page */
 #define FORCE_CHECK_PAGE() \
-		(blkno == nblocks - 1 && should_attempt_truncation(params, vacrelstats))
+		(blkno == nblocks - 1 && should_attempt_truncation(vacrel, params))
 
 		pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
-		update_vacuum_error_info(vacrelstats, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
+		update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
 								 blkno, InvalidOffsetNumber);
 
 		if (blkno == next_unskippable_block)
@@ -1021,7 +1024,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				{
 					uint8		vmskipflags;
 
-					vmskipflags = visibilitymap_get_status(onerel,
+					vmskipflags = visibilitymap_get_status(vacrel->rel,
 														   next_unskippable_block,
 														   &vmbuffer);
 					if (aggressive)
@@ -1053,7 +1056,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * it's not all-visible.  But in an aggressive vacuum we know only
 			 * that it's not all-frozen, so it might still be all-visible.
 			 */
-			if (aggressive && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
+			if (aggressive && VM_ALL_VISIBLE(vacrel->rel, blkno,
+											 &vmbuffer))
 				all_visible_according_to_vm = true;
 		}
 		else
@@ -1077,8 +1081,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * know whether it was all-frozen, so we have to recheck; but
 				 * in this case an approximate answer is OK.
 				 */
-				if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
-					vacrelstats->frozenskipped_pages++;
+				if (aggressive || VM_ALL_FROZEN(vacrel->rel, blkno,
+												&vmbuffer))
+					vacrel->frozenskipped_pages++;
 				continue;
 			}
 			all_visible_according_to_vm = true;
@@ -1106,10 +1111,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Work on all the indexes, then the heap */
-			lazy_vacuum_all_indexes(onerel, Irel, vacrelstats, lps, nindexes);
+			lazy_vacuum_all_indexes(vacrel);
 
 			/* Remove tuples from heap */
-			lazy_vacuum_heap(onerel, vacrelstats);
+			lazy_vacuum_heap_rel(vacrel);
 
 			/*
 			 * Forget the now-vacuumed tuples, and press on, but be careful
@@ -1122,7 +1127,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * Vacuum the Free Space Map to make newly-freed space visible on
 			 * upper-level FSM pages.  Note we have not yet processed blkno.
 			 */
-			FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
+			FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
+									blkno);
 			next_fsm_block_to_vacuum = blkno;
 
 			/* Report that we are once again scanning the heap */
@@ -1137,12 +1143,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * possible that (a) next_unskippable_block is covered by a different
 		 * VM page than the current block or (b) we released our pin and did a
 		 * cycle of index vacuuming.
-		 *
 		 */
-		visibilitymap_pin(onerel, blkno, &vmbuffer);
+		visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
 
-		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
-								 RBM_NORMAL, vac_strategy);
+		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
+								 RBM_NORMAL, vacrel->bstrategy);
 
 		/* We need buffer cleanup lock so that we can prune HOT chains. */
 		if (!ConditionalLockBufferForCleanup(buf))
@@ -1156,7 +1161,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			if (!aggressive && !FORCE_CHECK_PAGE())
 			{
 				ReleaseBuffer(buf);
-				vacrelstats->pinskipped_pages++;
+				vacrel->pinskipped_pages++;
 				continue;
 			}
 
@@ -1177,13 +1182,13 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * to use lazy_check_needs_freeze() for both situations, though.
 			 */
 			LockBuffer(buf, BUFFER_LOCK_SHARE);
-			if (!lazy_check_needs_freeze(buf, &hastup, vacrelstats))
+			if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
 			{
 				UnlockReleaseBuffer(buf);
-				vacrelstats->scanned_pages++;
-				vacrelstats->pinskipped_pages++;
+				vacrel->scanned_pages++;
+				vacrel->pinskipped_pages++;
 				if (hastup)
-					vacrelstats->nonempty_pages = blkno + 1;
+					vacrel->nonempty_pages = blkno + 1;
 				continue;
 			}
 			if (!aggressive)
@@ -1193,9 +1198,9 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * to claiming that the page contains no freezable tuples.
 				 */
 				UnlockReleaseBuffer(buf);
-				vacrelstats->pinskipped_pages++;
+				vacrel->pinskipped_pages++;
 				if (hastup)
-					vacrelstats->nonempty_pages = blkno + 1;
+					vacrel->nonempty_pages = blkno + 1;
 				continue;
 			}
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
@@ -1203,8 +1208,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			/* drop through to normal processing */
 		}
 
-		vacrelstats->scanned_pages++;
-		vacrelstats->tupcount_pages++;
+		vacrel->scanned_pages++;
+		vacrel->tupcount_pages++;
 
 		page = BufferGetPage(buf);
 
@@ -1233,12 +1238,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 			empty_pages++;
 
-			if (GetRecordedFreeSpace(onerel, blkno) == 0)
+			if (GetRecordedFreeSpace(vacrel->rel, blkno) == 0)
 			{
 				Size		freespace;
 
 				freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
-				RecordPageWithFreeSpace(onerel, blkno, freespace);
+				RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
 			}
 			continue;
 		}
@@ -1269,19 +1274,19 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * page has been previously WAL-logged, and if not, do that
 				 * now.
 				 */
-				if (RelationNeedsWAL(onerel) &&
+				if (RelationNeedsWAL(vacrel->rel) &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
 					log_newpage_buffer(buf, true);
 
 				PageSetAllVisible(page);
-				visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+				visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
 								  vmbuffer, InvalidTransactionId,
 								  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
 				END_CRIT_SECTION();
 			}
 
 			UnlockReleaseBuffer(buf);
-			RecordPageWithFreeSpace(onerel, blkno, freespace);
+			RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
 			continue;
 		}
 
@@ -1291,10 +1296,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * We count tuples removed by the pruning step as removed by VACUUM
 		 * (existing LP_DEAD line pointers don't count).
 		 */
-		tups_vacuumed += heap_page_prune(onerel, buf, vistest,
+		tups_vacuumed += heap_page_prune(vacrel->rel, buf, vistest,
 										 InvalidTransactionId, 0, false,
-										 &vacrelstats->latestRemovedXid,
-										 &vacrelstats->offnum);
+										 &vacrel->latestRemovedXid,
+										 &vacrel->offnum);
 
 		/*
 		 * Now scan the page to collect vacuumable items and check for tuples
@@ -1321,7 +1326,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * Set the offset number so that we can display it along with any
 			 * error that occurred while processing this tuple.
 			 */
-			vacrelstats->offnum = offnum;
+			vacrel->offnum = offnum;
 			itemid = PageGetItemId(page, offnum);
 
 			/* Unused items require no processing, but we count 'em */
@@ -1361,7 +1366,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
 			tuple.t_len = ItemIdGetLength(itemid);
-			tuple.t_tableOid = RelationGetRelid(onerel);
+			tuple.t_tableOid = RelationGetRelid(vacrel->rel);
 
 			tupgone = false;
 
@@ -1376,7 +1381,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * cases impossible (e.g. in-progress insert from the same
 			 * transaction).
 			 */
-			switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+			switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
 			{
 				case HEAPTUPLE_DEAD:
 
@@ -1446,7 +1451,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 						 * enough that everyone sees it as committed?
 						 */
 						xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-						if (!TransactionIdPrecedes(xmin, OldestXmin))
+						if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
 						{
 							all_visible = false;
 							break;
@@ -1500,7 +1505,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			{
 				lazy_record_dead_tuple(dead_tuples, &(tuple.t_self));
 				HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-													   &vacrelstats->latestRemovedXid);
+													   &vacrel->latestRemovedXid);
 				tups_vacuumed += 1;
 				has_dead_items = true;
 			}
@@ -1516,8 +1521,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 				 * freezing.  Note we already have exclusive buffer lock.
 				 */
 				if (heap_prepare_freeze_tuple(tuple.t_data,
-											  relfrozenxid, relminmxid,
-											  FreezeLimit, MultiXactCutoff,
+											  vacrel->relfrozenxid,
+											  vacrel->relminmxid,
+											  vacrel->FreezeLimit,
+											  vacrel->MultiXactCutoff,
 											  &frozen[nfrozen],
 											  &tuple_totally_frozen))
 					frozen[nfrozen++].offset = offnum;
@@ -1531,7 +1538,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * Clear the offset information once we have processed all the tuples
 		 * on the page.
 		 */
-		vacrelstats->offnum = InvalidOffsetNumber;
+		vacrel->offnum = InvalidOffsetNumber;
 
 		/*
 		 * If we froze any tuples, mark the buffer dirty, and write a WAL
@@ -1557,12 +1564,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			}
 
 			/* Now WAL-log freezing if necessary */
-			if (RelationNeedsWAL(onerel))
+			if (RelationNeedsWAL(vacrel->rel))
 			{
 				XLogRecPtr	recptr;
 
-				recptr = log_heap_freeze(onerel, buf, FreezeLimit,
-										 frozen, nfrozen);
+				recptr = log_heap_freeze(vacrel->rel, buf,
+										 vacrel->FreezeLimit, frozen, nfrozen);
 				PageSetLSN(page, recptr);
 			}
 
@@ -1574,12 +1581,12 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * doing a second scan. Also we don't do that but forget dead tuples
 		 * when index cleanup is disabled.
 		 */
-		if (!vacrelstats->useindex && dead_tuples->num_tuples > 0)
+		if (!vacrel->useindex && dead_tuples->num_tuples > 0)
 		{
-			if (nindexes == 0)
+			if (vacrel->nindexes == 0)
 			{
 				/* Remove tuples from heap if the table has no index */
-				lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
+				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
 				vacuumed_pages++;
 				has_dead_items = false;
 			}
@@ -1613,7 +1620,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
 			{
-				FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum,
+				FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
 										blkno);
 				next_fsm_block_to_vacuum = blkno;
 			}
@@ -1644,7 +1651,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 */
 			PageSetAllVisible(page);
 			MarkBufferDirty(buf);
-			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, visibility_cutoff_xid, flags);
 		}
 
@@ -1656,11 +1663,11 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * that something bad has happened.
 		 */
 		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
+				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
-				 vacrelstats->relname, blkno);
-			visibilitymap_clear(onerel, blkno, vmbuffer,
+				 vacrel->relname, blkno);
+			visibilitymap_clear(vacrel->rel, blkno, vmbuffer,
 								VISIBILITYMAP_VALID_BITS);
 		}
 
@@ -1682,10 +1689,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		else if (PageIsAllVisible(page) && has_dead_items)
 		{
 			elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
-				 vacrelstats->relname, blkno);
+				 vacrel->relname, blkno);
 			PageClearAllVisible(page);
 			MarkBufferDirty(buf);
-			visibilitymap_clear(onerel, blkno, vmbuffer,
+			visibilitymap_clear(vacrel->rel, blkno, vmbuffer,
 								VISIBILITYMAP_VALID_BITS);
 		}
 
@@ -1695,14 +1702,14 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * all_visible is true, so we must check both.
 		 */
 		else if (all_visible_according_to_vm && all_visible && all_frozen &&
-				 !VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
+				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
 			/*
 			 * We can pass InvalidTransactionId as the cutoff XID here,
 			 * because setting the all-frozen bit doesn't cause recovery
 			 * conflicts.
 			 */
-			visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
+			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, InvalidTransactionId,
 							  VISIBILITYMAP_ALL_FROZEN);
 		}
@@ -1711,43 +1718,42 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 		/* Remember the location of the last page with nonremovable tuples */
 		if (hastup)
-			vacrelstats->nonempty_pages = blkno + 1;
+			vacrel->nonempty_pages = blkno + 1;
 
 		/*
 		 * If we remembered any tuples for deletion, then the page will be
-		 * visited again by lazy_vacuum_heap, which will compute and record
+		 * visited again by lazy_vacuum_heap_rel, which will compute and record
 		 * its post-compaction free space.  If not, then we're done with this
 		 * page, so remember its free space as-is.  (This path will always be
 		 * taken if there are no indexes.)
 		 */
 		if (dead_tuples->num_tuples == prev_dead_count)
-			RecordPageWithFreeSpace(onerel, blkno, freespace);
+			RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
 	}
 
 	/* report that everything is scanned and vacuumed */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
 	/* Clear the block number information */
-	vacrelstats->blkno = InvalidBlockNumber;
+	vacrel->blkno = InvalidBlockNumber;
 
 	pfree(frozen);
 
 	/* save stats for use later */
-	vacrelstats->tuples_deleted = tups_vacuumed;
-	vacrelstats->new_dead_tuples = nkeep;
+	vacrel->tuples_deleted = tups_vacuumed;
+	vacrel->new_dead_tuples = nkeep;
 
 	/* now we can compute the new value for pg_class.reltuples */
-	vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
-														  nblocks,
-														  vacrelstats->tupcount_pages,
-														  live_tuples);
+	vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, nblocks,
+													 vacrel->tupcount_pages,
+													 live_tuples);
 
 	/*
 	 * Also compute the total number of surviving heap entries.  In the
 	 * (unlikely) scenario that new_live_tuples is -1, take it as zero.
 	 */
-	vacrelstats->new_rel_tuples =
-		Max(vacrelstats->new_live_tuples, 0) + vacrelstats->new_dead_tuples;
+	vacrel->new_rel_tuples =
+		Max(vacrel->new_live_tuples, 0) + vacrel->new_dead_tuples;
 
 	/*
 	 * Release any remaining pin on visibility map page.
@@ -1763,10 +1769,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	if (dead_tuples->num_tuples > 0)
 	{
 		/* Work on all the indexes, and then the heap */
-		lazy_vacuum_all_indexes(onerel, Irel, vacrelstats, lps, nindexes);
+		lazy_vacuum_all_indexes(vacrel);
 
 		/* Remove tuples from heap */
-		lazy_vacuum_heap(onerel, vacrelstats);
+		lazy_vacuum_heap_rel(vacrel);
 	}
 
 	/*
@@ -1774,47 +1780,47 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	 * not there were indexes.
 	 */
 	if (blkno > next_fsm_block_to_vacuum)
-		FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
+		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
 
 	/* report all blocks vacuumed */
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Do post-vacuum cleanup */
-	if (vacrelstats->useindex)
-		lazy_cleanup_all_indexes(Irel, vacrelstats, lps, nindexes);
+	if (vacrel->useindex)
+		lazy_cleanup_all_indexes(vacrel);
 
 	/*
-	 * End parallel mode before updating index statistics as we cannot write
-	 * during parallel mode.
+	 * Free resources managed by lazy_space_alloc().  (We must end parallel
+	 * mode/free shared memory before updating index statistics.  We cannot
+	 * write while in parallel mode.)
 	 */
-	if (ParallelVacuumIsActive(lps))
-		end_parallel_vacuum(vacrelstats->indstats, lps, nindexes);
+	lazy_space_free(vacrel);
 
 	/* Update index statistics */
-	if (vacrelstats->useindex)
-		update_index_statistics(Irel, vacrelstats->indstats, nindexes);
+	if (vacrel->useindex)
+		update_index_statistics(vacrel);
 
-	/* If no indexes, make log report that lazy_vacuum_heap would've made */
+	/* If no indexes, make log report that lazy_vacuum_heap_rel would've made */
 	if (vacuumed_pages)
 		ereport(elevel,
 				(errmsg("\"%s\": removed %.0f row versions in %u pages",
-						vacrelstats->relname,
+						vacrel->relname,
 						tups_vacuumed, vacuumed_pages)));
 
 	initStringInfo(&buf);
 	appendStringInfo(&buf,
 					 _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"),
-					 nkeep, OldestXmin);
+					 nkeep, vacrel->OldestXmin);
 	appendStringInfo(&buf, _("There were %.0f unused item identifiers.\n"),
 					 nunused);
 	appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
 									"Skipped %u pages due to buffer pins, ",
-									vacrelstats->pinskipped_pages),
-					 vacrelstats->pinskipped_pages);
+									vacrel->pinskipped_pages),
+					 vacrel->pinskipped_pages);
 	appendStringInfo(&buf, ngettext("%u frozen page.\n",
 									"%u frozen pages.\n",
-									vacrelstats->frozenskipped_pages),
-					 vacrelstats->frozenskipped_pages);
+									vacrel->frozenskipped_pages),
+					 vacrel->frozenskipped_pages);
 	appendStringInfo(&buf, ngettext("%u page is entirely empty.\n",
 									"%u pages are entirely empty.\n",
 									empty_pages),
@@ -1823,82 +1829,70 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 
 	ereport(elevel,
 			(errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
-					vacrelstats->relname,
+					vacrel->relname,
 					tups_vacuumed, num_tuples,
-					vacrelstats->scanned_pages, nblocks),
+					vacrel->scanned_pages, nblocks),
 			 errdetail_internal("%s", buf.data)));
 	pfree(buf.data);
 }
 
 /*
- *	lazy_vacuum_all_indexes() -- vacuum all indexes of relation.
- *
- * We process the indexes serially unless we are doing parallel vacuum.
+ *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
  */
 static void
-lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
-						LVRelStats *vacrelstats, LVParallelState *lps,
-						int nindexes)
+lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
-	Assert(!IsParallelWorker());
-	Assert(nindexes > 0);
+	Assert(vacrel->nindexes > 0);
+	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
+	Assert(MultiXactIdIsValid(vacrel->relminmxid));
 
 	/* Log cleanup info before we touch indexes */
-	vacuum_log_cleanup_info(onerel, vacrelstats);
+	vacuum_log_cleanup_info(vacrel);
 
 	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
 
-	/* Perform index vacuuming with parallel workers for parallel vacuum. */
-	if (ParallelVacuumIsActive(lps))
+	if (!ParallelVacuumIsActive(vacrel))
 	{
-		/* Tell parallel workers to do index vacuuming */
-		lps->lvshared->for_cleanup = false;
-		lps->lvshared->first_time = false;
+		for (int idx = 0; idx < vacrel->nindexes; idx++)
+		{
+			Relation	indrel = vacrel->indrels[idx];
+			IndexBulkDeleteResult *istat = vacrel->indstats[idx];
 
-		/*
-		 * We can only provide an approximate value of num_heap_tuples in
-		 * vacuum cases.
-		 */
-		lps->lvshared->reltuples = vacrelstats->old_live_tuples;
-		lps->lvshared->estimated_count = true;
-
-		lazy_parallel_vacuum_indexes(Irel, vacrelstats, lps, nindexes);
+			vacrel->indstats[idx] =
+				lazy_vacuum_one_index(indrel, istat, vacrel->old_live_tuples,
+									  vacrel);
+		}
 	}
 	else
 	{
-		int			idx;
-
-		for (idx = 0; idx < nindexes; idx++)
-			lazy_vacuum_index(Irel[idx], &(vacrelstats->indstats[idx]),
-							  vacrelstats->dead_tuples,
-							  vacrelstats->old_live_tuples, vacrelstats);
+		/* Outsource everything to parallel variant */
+		do_parallel_lazy_vacuum_all_indexes(vacrel);
 	}
 
 	/* Increase and report the number of index scans */
-	vacrelstats->num_index_scans++;
+	vacrel->num_index_scans++;
 	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
-								 vacrelstats->num_index_scans);
+								 vacrel->num_index_scans);
 }
 
-
 /*
- *	lazy_vacuum_heap() -- second pass over the heap
+ *	lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
  *
- *		This routine marks dead tuples as unused and compacts out free
- *		space on their pages.  Pages not having dead tuples recorded from
- *		lazy_scan_heap are not visited at all.
+ * This routine marks dead tuples as unused and compacts out free space on
+ * their pages.  Pages not having dead tuples recorded from lazy_scan_heap are
+ * not visited at all.
  *
- * Note: the reason for doing this as a second pass is we cannot remove
- * the tuples until we've removed their index entries, and we want to
- * process index entry removal in batches as large as possible.
+ * Note: the reason for doing this as a second pass is we cannot remove the
+ * tuples until we've removed their index entries, and we want to process
+ * index entry removal in batches as large as possible.
  */
 static void
-lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
+lazy_vacuum_heap_rel(LVRelState *vacrel)
 {
 	int			tupindex;
-	int			npages;
+	int			vacuumed_pages;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
 	LVSavedErrInfo saved_err_info;
@@ -1908,14 +1902,15 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 								 PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
 
 	/* Update error traceback information */
-	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
+	update_vacuum_error_info(vacrel, &saved_err_info,
+							 VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 InvalidBlockNumber, InvalidOffsetNumber);
 
 	pg_rusage_init(&ru0);
-	npages = 0;
+	vacuumed_pages = 0;
 
 	tupindex = 0;
-	while (tupindex < vacrelstats->dead_tuples->num_tuples)
+	while (tupindex < vacrel->dead_tuples->num_tuples)
 	{
 		BlockNumber tblk;
 		Buffer		buf;
@@ -1924,30 +1919,30 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 
 		vacuum_delay_point();
 
-		tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples->itemptrs[tupindex]);
-		vacrelstats->blkno = tblk;
-		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
-								 vac_strategy);
+		tblk = ItemPointerGetBlockNumber(&vacrel->dead_tuples->itemptrs[tupindex]);
+		vacrel->blkno = tblk;
+		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
+								 vacrel->bstrategy);
 		if (!ConditionalLockBufferForCleanup(buf))
 		{
 			ReleaseBuffer(buf);
 			++tupindex;
 			continue;
 		}
-		tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
-									&vmbuffer);
+		tupindex = lazy_vacuum_heap_page(vacrel, tblk, buf, tupindex,
+										 &vmbuffer);
 
 		/* Now that we've compacted the page, record its available space */
 		page = BufferGetPage(buf);
 		freespace = PageGetHeapFreeSpace(page);
 
 		UnlockReleaseBuffer(buf);
-		RecordPageWithFreeSpace(onerel, tblk, freespace);
-		npages++;
+		RecordPageWithFreeSpace(vacrel->rel, tblk, freespace);
+		vacuumed_pages++;
 	}
 
 	/* Clear the block number information */
-	vacrelstats->blkno = InvalidBlockNumber;
+	vacrel->blkno = InvalidBlockNumber;
 
 	if (BufferIsValid(vmbuffer))
 	{
@@ -1956,32 +1951,31 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 	}
 
 	ereport(elevel,
-			(errmsg("\"%s\": removed %d row versions in %d pages",
-					vacrelstats->relname,
-					tupindex, npages),
+			(errmsg("\"%s\": removed %d dead item identifiers in %u pages",
+					vacrel->relname, tupindex, vacuumed_pages),
 			 errdetail_internal("%s", pg_rusage_show(&ru0))));
 
 	/* Revert to the previous phase information for error traceback */
-	restore_vacuum_error_info(vacrelstats, &saved_err_info);
+	restore_vacuum_error_info(vacrel, &saved_err_info);
 }
 
 /*
- *	lazy_vacuum_page() -- free dead tuples on a page
- *					 and repair its fragmentation.
+ *	lazy_vacuum_heap_page() -- free dead tuples on a page
+ *						  and repair its fragmentation.
  *
  * Caller must hold pin and buffer cleanup lock on the buffer.
  *
- * tupindex is the index in vacrelstats->dead_tuples of the first dead
- * tuple for this page.  We assume the rest follow sequentially.
- * The return value is the first tupindex after the tuples of this page.
+ * tupindex is the index in vacrel->dead_tuples of the first dead tuple for
+ * this page.  We assume the rest follow sequentially.  The return value is
+ * the first tupindex after the tuples of this page.
  */
 static int
-lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-				 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer)
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
+					  int tupindex, Buffer *vmbuffer)
 {
-	LVDeadTuples *dead_tuples = vacrelstats->dead_tuples;
+	LVDeadTuples *dead_tuples = vacrel->dead_tuples;
 	Page		page = BufferGetPage(buffer);
-	OffsetNumber unused[MaxOffsetNumber];
+	OffsetNumber unused[MaxHeapTuplesPerPage];
 	int			uncnt = 0;
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
@@ -1990,8 +1984,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
 	/* Update error traceback information */
-	update_vacuum_error_info(vacrelstats, &saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
-							 blkno, InvalidOffsetNumber);
+	update_vacuum_error_info(vacrel, &saved_err_info,
+							 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
+							 InvalidOffsetNumber);
 
 	START_CRIT_SECTION();
 
@@ -2018,14 +2013,14 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (RelationNeedsWAL(vacrel->rel))
 	{
 		XLogRecPtr	recptr;
 
-		recptr = log_heap_clean(onerel, buffer,
+		recptr = log_heap_clean(vacrel->rel, buffer,
 								NULL, 0, NULL, 0,
 								unused, uncnt,
-								vacrelstats->latestRemovedXid);
+								vacrel->latestRemovedXid);
 		PageSetLSN(page, recptr);
 	}
 
@@ -2043,8 +2038,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	 * dirty, exclusively locked, and, if needed, a full page image has been
 	 * emitted in the log_heap_clean() above.
 	 */
-	if (heap_page_is_all_visible(onerel, buffer, vacrelstats,
-								 &visibility_cutoff_xid,
+	if (heap_page_is_all_visible(vacrel, buffer, &visibility_cutoff_xid,
 								 &all_frozen))
 		PageSetAllVisible(page);
 
@@ -2055,8 +2049,9 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	 */
 	if (PageIsAllVisible(page))
 	{
-		uint8		vm_status = visibilitymap_get_status(onerel, blkno, vmbuffer);
 		uint8		flags = 0;
+		uint8		vm_status = visibilitymap_get_status(vacrel->rel,
+														 blkno, vmbuffer);
 
 		/* Set the VM all-frozen bit to flag, if needed */
 		if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
@@ -2066,12 +2061,12 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 
 		Assert(BufferIsValid(*vmbuffer));
 		if (flags != 0)
-			visibilitymap_set(onerel, blkno, buffer, InvalidXLogRecPtr,
+			visibilitymap_set(vacrel->rel, blkno, buffer, InvalidXLogRecPtr,
 							  *vmbuffer, visibility_cutoff_xid, flags);
 	}
 
 	/* Revert to the previous phase information for error traceback */
-	restore_vacuum_error_info(vacrelstats, &saved_err_info);
+	restore_vacuum_error_info(vacrel, &saved_err_info);
 	return tupindex;
 }
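/*
 * Illustrative sketch only (not part of the patch): the tupindex contract
 * above relies on dead_tuples being sorted by (block, offset), so a per-page
 * helper can consume the run of entries for one block and return the index
 * of the first entry belonging to the next block.  Toy types are invented.
 */
#include <stdio.h>

typedef struct ToyTid
{
	unsigned	blk;
	unsigned	off;
} ToyTid;

/* Process one "page": consume all entries whose blk matches tids[tupindex] */
static int
toy_vacuum_page(const ToyTid *tids, int ntids, int tupindex)
{
	unsigned	blk = tids[tupindex].blk;

	while (tupindex < ntids && tids[tupindex].blk == blk)
	{
		printf("  mark (%u,%u) unused\n", tids[tupindex].blk, tids[tupindex].off);
		tupindex++;
	}
	return tupindex;			/* first index after this page's tuples */
}

int
main(void)
{
	ToyTid		tids[] = {{1, 2}, {1, 5}, {3, 1}, {3, 7}, {3, 9}, {8, 4}};
	int			ntids = 6;
	int			tupindex = 0;

	while (tupindex < ntids)
	{
		printf("page %u:\n", tids[tupindex].blk);
		tupindex = toy_vacuum_page(tids, ntids, tupindex);
	}
	return 0;
}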
 
@@ -2083,7 +2078,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
  * Also returns a flag indicating whether page contains any tuples at all.
  */
 static bool
-lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelStats *vacrelstats)
+lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
 {
 	Page		page = BufferGetPage(buf);
 	OffsetNumber offnum,
@@ -2112,7 +2107,7 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelStats *vacrelstats)
 		 * Set the offset number so that we can display it along with any
 		 * error that occurred while processing this tuple.
 		 */
-		vacrelstats->offnum = offnum;
+		vacrel->offnum = offnum;
 		itemid = PageGetItemId(page, offnum);
 
 		/* this should match hastup test in count_nondeletable_pages() */
@@ -2125,17 +2120,68 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelStats *vacrelstats)
 
 		tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
 
-		if (heap_tuple_needs_freeze(tupleheader, FreezeLimit,
-									MultiXactCutoff, buf))
+		if (heap_tuple_needs_freeze(tupleheader, vacrel->FreezeLimit,
+									vacrel->MultiXactCutoff, buf))
 			break;
 	}							/* scan along page */
 
 	/* Clear the offset information once we have processed the given page. */
-	vacrelstats->offnum = InvalidOffsetNumber;
+	vacrel->offnum = InvalidOffsetNumber;
 
 	return (offnum <= maxoff);
 }
 
+static void
+do_parallel_lazy_vacuum_all_indexes(LVRelState *vacrel)
+{
+	/* Tell parallel workers to do index vacuuming */
+	vacrel->lps->lvshared->for_cleanup = false;
+	vacrel->lps->lvshared->first_time = false;
+
+	/*
+	 * We can only provide an approximate value of num_heap_tuples in vacuum
+	 * cases.
+	 */
+	vacrel->lps->lvshared->reltuples = vacrel->old_live_tuples;
+	vacrel->lps->lvshared->estimated_count = true;
+
+	do_parallel_vacuum_or_cleanup(vacrel,
+								  vacrel->lps->nindexes_parallel_bulkdel);
+}
+
+static void
+do_parallel_lazy_cleanup_all_indexes(LVRelState *vacrel)
+{
+	int			nworkers;
+
+	/*
+	 * If parallel vacuum is active we perform index cleanup with parallel
+	 * workers.
+	 *
+	 * Tell parallel workers to do index cleanup.
+	 */
+	vacrel->lps->lvshared->for_cleanup = true;
+	vacrel->lps->lvshared->first_time = (vacrel->num_index_scans == 0);
+
+	/*
+	 * Now we can provide a better estimate of total number of surviving
+	 * tuples (we assume indexes are more interested in that than in the
+	 * number of nominally live tuples).
+	 */
+	vacrel->lps->lvshared->reltuples = vacrel->new_rel_tuples;
+	vacrel->lps->lvshared->estimated_count =
+		(vacrel->tupcount_pages < vacrel->rel_pages);
+
+	/* Determine the number of parallel workers to launch */
+	if (vacrel->lps->lvshared->first_time)
+		nworkers = vacrel->lps->nindexes_parallel_cleanup +
+			vacrel->lps->nindexes_parallel_condcleanup;
+	else
+		nworkers = vacrel->lps->nindexes_parallel_cleanup;
+
+	do_parallel_vacuum_or_cleanup(vacrel, nworkers);
+}
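/*
 * Illustrative sketch only (not part of the patch): a standalone restatement
 * of the worker-count logic above.  Bulk deletion uses the bulkdel-capable
 * index count; cleanup uses the cleanup-capable count, plus the
 * conditional-cleanup count on the first cleanup pass only.  The toy struct
 * below is invented for the example.
 */
#include <stdio.h>
#include <stdbool.h>

typedef struct ToyParallelCounts
{
	int			nindexes_parallel_bulkdel;
	int			nindexes_parallel_cleanup;
	int			nindexes_parallel_condcleanup;
} ToyParallelCounts;

static int
toy_nworkers(const ToyParallelCounts *c, bool for_cleanup, bool first_time)
{
	if (!for_cleanup)
		return c->nindexes_parallel_bulkdel;
	if (first_time)
		return c->nindexes_parallel_cleanup + c->nindexes_parallel_condcleanup;
	return c->nindexes_parallel_cleanup;
}

int
main(void)
{
	ToyParallelCounts c = {4, 2, 3};

	printf("bulk deletion:  %d workers\n", toy_nworkers(&c, false, false));
	printf("first cleanup:  %d workers\n", toy_nworkers(&c, true, true));
	printf("later cleanup:  %d workers\n", toy_nworkers(&c, true, false));
	return 0;
}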
+
 /*
  * Perform index vacuum or index cleanup with parallel workers.  This function
  * must be used by the parallel vacuum leader process.  The caller must set
@@ -2143,26 +2189,13 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelStats *vacrelstats)
  * cleanup.
  */
 static void
-lazy_parallel_vacuum_indexes(Relation *Irel, LVRelStats *vacrelstats,
-							 LVParallelState *lps, int nindexes)
+do_parallel_vacuum_or_cleanup(LVRelState *vacrel, int nworkers)
 {
-	int			nworkers;
+	LVParallelState *lps = vacrel->lps;
 
 	Assert(!IsParallelWorker());
-	Assert(ParallelVacuumIsActive(lps));
-	Assert(nindexes > 0);
-
-	/* Determine the number of parallel workers to launch */
-	if (lps->lvshared->for_cleanup)
-	{
-		if (lps->lvshared->first_time)
-			nworkers = lps->nindexes_parallel_cleanup +
-				lps->nindexes_parallel_condcleanup;
-		else
-			nworkers = lps->nindexes_parallel_cleanup;
-	}
-	else
-		nworkers = lps->nindexes_parallel_bulkdel;
+	Assert(ParallelVacuumIsActive(vacrel));
+	Assert(vacrel->nindexes > 0);
 
 	/* The leader process will participate */
 	nworkers--;
@@ -2177,7 +2210,7 @@ lazy_parallel_vacuum_indexes(Relation *Irel, LVRelStats *vacrelstats,
 	/* Setup the shared cost-based vacuum delay and launch workers */
 	if (nworkers > 0)
 	{
-		if (vacrelstats->num_index_scans > 0)
+		if (vacrel->num_index_scans > 0)
 		{
 			/* Reset the parallel index processing counter */
 			pg_atomic_write_u32(&(lps->lvshared->idx), 0);
@@ -2232,14 +2265,13 @@ lazy_parallel_vacuum_indexes(Relation *Irel, LVRelStats *vacrelstats,
 	}
 
 	/* Process the indexes that can be processed by only leader process */
-	vacuum_indexes_leader(Irel, vacrelstats, lps, nindexes);
+	do_serial_processing_for_unsafe_indexes(vacrel, lps->lvshared);
 
 	/*
 	 * Join as a parallel worker.  The leader process alone processes all the
 	 * indexes in the case where no workers are launched.
 	 */
-	parallel_vacuum_index(Irel, lps->lvshared, vacrelstats->dead_tuples,
-						  nindexes, vacrelstats);
+	do_parallel_processing(vacrel, lps->lvshared);
 
 	/*
 	 * Next, accumulate buffer and WAL usage.  (This must wait for the workers
@@ -2247,12 +2279,10 @@ lazy_parallel_vacuum_indexes(Relation *Irel, LVRelStats *vacrelstats,
 	 */
 	if (nworkers > 0)
 	{
-		int			i;
-
 		/* Wait for all vacuum workers to finish */
 		WaitForParallelWorkersToFinish(lps->pcxt);
 
-		for (i = 0; i < lps->pcxt->nworkers_launched; i++)
+		for (int i = 0; i < lps->pcxt->nworkers_launched; i++)
 			InstrAccumParallelQuery(&lps->buffer_usage[i], &lps->wal_usage[i]);
 	}
 
@@ -2272,9 +2302,7 @@ lazy_parallel_vacuum_indexes(Relation *Irel, LVRelStats *vacrelstats,
  * vacuum worker processes to process the indexes in parallel.
  */
 static void
-parallel_vacuum_index(Relation *Irel, LVShared *lvshared,
-					  LVDeadTuples *dead_tuples, int nindexes,
-					  LVRelStats *vacrelstats)
+do_parallel_processing(LVRelState *vacrel, LVShared *lvshared)
 {
 	/*
 	 * Increment the active worker count if we are able to launch any worker.
@@ -2286,29 +2314,39 @@ parallel_vacuum_index(Relation *Irel, LVShared *lvshared,
 	for (;;)
 	{
 		int			idx;
-		LVSharedIndStats *shared_indstats;
+		LVSharedIndStats *shared_istat;
+		Relation	indrel;
+		IndexBulkDeleteResult *istat;
 
 		/* Get an index number to process */
 		idx = pg_atomic_fetch_add_u32(&(lvshared->idx), 1);
 
 		/* Done for all indexes? */
-		if (idx >= nindexes)
+		if (idx >= vacrel->nindexes)
 			break;
 
 		/* Get the index statistics of this index from DSM */
-		shared_indstats = get_indstats(lvshared, idx);
+		shared_istat = parallel_stats_for_idx(lvshared, idx);
+
+		/* Skip indexes not participating in parallelism */
+		if (shared_istat == NULL)
+			continue;
+
+		indrel = vacrel->indrels[idx];
 
 		/*
-		 * Skip processing indexes that don't participate in parallel
-		 * operation
+		 * Skip processing indexes that are unsafe for workers (these are
+		 * processed in do_serial_processing_for_unsafe_indexes() by leader)
 		 */
-		if (shared_indstats == NULL ||
-			skip_parallel_vacuum_index(Irel[idx], lvshared))
+		if (!parallel_processing_is_safe(indrel, lvshared))
 			continue;
 
 		/* Do vacuum or cleanup of the index */
-		vacuum_one_index(Irel[idx], &(vacrelstats->indstats[idx]), lvshared,
-						 shared_indstats, dead_tuples, vacrelstats);
+		istat = (vacrel->indstats[idx]);
+		vacrel->indstats[idx] = parallel_process_one_index(indrel, istat,
+														   lvshared,
+														   shared_istat,
+														   vacrel);
 	}
 
 	/*
@@ -2324,11 +2362,8 @@ parallel_vacuum_index(Relation *Irel, LVShared *lvshared,
  * because these indexes don't support parallel operation at that phase.
  */
 static void
-vacuum_indexes_leader(Relation *Irel, LVRelStats *vacrelstats,
-					  LVParallelState *lps, int nindexes)
+do_serial_processing_for_unsafe_indexes(LVRelState *vacrel, LVShared *lvshared)
 {
-	int			i;
-
 	Assert(!IsParallelWorker());
 
 	/*
@@ -2337,18 +2372,32 @@ vacuum_indexes_leader(Relation *Irel, LVRelStats *vacrelstats,
 	if (VacuumActiveNWorkers)
 		pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1);
 
-	for (i = 0; i < nindexes; i++)
+	for (int idx = 0; idx < vacrel->nindexes; idx++)
 	{
-		LVSharedIndStats *shared_indstats;
+		LVSharedIndStats *shared_istat;
+		Relation	indrel;
+		IndexBulkDeleteResult *istat;
 
-		shared_indstats = get_indstats(lps->lvshared, i);
+		shared_istat = parallel_stats_for_idx(lvshared, idx);
 
-		/* Process the indexes skipped by parallel workers */
-		if (shared_indstats == NULL ||
-			skip_parallel_vacuum_index(Irel[i], lps->lvshared))
-			vacuum_one_index(Irel[i], &(vacrelstats->indstats[i]), lps->lvshared,
-							 shared_indstats, vacrelstats->dead_tuples,
-							 vacrelstats);
+		/* Skip already-complete indexes */
+		if (shared_istat != NULL)
+			continue;
+
+		indrel = vacrel->indrels[idx];
+
+		/*
+		 * We're only here for the indexes that are unsafe for workers.
+		 */
+		if (parallel_processing_is_safe(indrel, lvshared))
+			continue;
+
+		/* Do vacuum or cleanup of the index */
+		istat = (vacrel->indstats[idx]);
+		vacrel->indstats[idx] = parallel_process_one_index(indrel, istat,
+														   lvshared,
+														   shared_istat,
+														   vacrel);
 	}
 
 	/*
@@ -2365,33 +2414,35 @@ vacuum_indexes_leader(Relation *Irel, LVRelStats *vacrelstats,
  * statistics returned from ambulkdelete and amvacuumcleanup to the DSM
  * segment.
  */
-static void
-vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
-				 LVShared *lvshared, LVSharedIndStats *shared_indstats,
-				 LVDeadTuples *dead_tuples, LVRelStats *vacrelstats)
+static IndexBulkDeleteResult *
+parallel_process_one_index(Relation indrel,
+						   IndexBulkDeleteResult *istat,
+						   LVShared *lvshared,
+						   LVSharedIndStats *shared_istat,
+						   LVRelState *vacrel)
 {
 	IndexBulkDeleteResult *bulkdelete_res = NULL;
 
-	if (shared_indstats)
+	if (shared_istat)
 	{
 		/* Get the space for IndexBulkDeleteResult */
-		bulkdelete_res = &(shared_indstats->stats);
+		bulkdelete_res = &(shared_istat->istat);
 
 		/*
 		 * Update the pointer to the corresponding bulk-deletion result if
 		 * someone has already updated it.
 		 */
-		if (shared_indstats->updated && *stats == NULL)
-			*stats = bulkdelete_res;
+		if (shared_istat->updated && istat == NULL)
+			istat = bulkdelete_res;
 	}
 
 	/* Do vacuum or cleanup of the index */
 	if (lvshared->for_cleanup)
-		lazy_cleanup_index(indrel, stats, lvshared->reltuples,
-						   lvshared->estimated_count, vacrelstats);
+		istat = lazy_cleanup_one_index(indrel, istat, lvshared->reltuples,
+									   lvshared->estimated_count, vacrel);
 	else
-		lazy_vacuum_index(indrel, stats, dead_tuples,
-						  lvshared->reltuples, vacrelstats);
+		istat = lazy_vacuum_one_index(indrel, istat, lvshared->reltuples,
+									  vacrel);
 
 	/*
 	 * Copy the index bulk-deletion result returned from ambulkdelete and
@@ -2405,83 +2456,71 @@ vacuum_one_index(Relation indrel, IndexBulkDeleteResult **stats,
 	 * Since all vacuum workers write the bulk-deletion result at different
 	 * slots we can write them without locking.
 	 */
-	if (shared_indstats && !shared_indstats->updated && *stats != NULL)
+	if (shared_istat && !shared_istat->updated && istat != NULL)
 	{
-		memcpy(bulkdelete_res, *stats, sizeof(IndexBulkDeleteResult));
-		shared_indstats->updated = true;
+		memcpy(bulkdelete_res, istat, sizeof(IndexBulkDeleteResult));
+		shared_istat->updated = true;
 
 		/*
-		 * Now that stats[idx] points to the DSM segment, we don't need the
-		 * locally allocated results.
+		 * Now that top-level indstats[idx] points to the DSM segment, we
+		 * don't need the locally allocated results.
 		 */
-		pfree(*stats);
-		*stats = bulkdelete_res;
+		pfree(istat);
+		istat = bulkdelete_res;
 	}
+
+	return istat;
 }
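/*
 * Illustrative sketch only (not part of the patch): the copy-back step above
 * writes a locally allocated result into a preallocated shared-memory slot
 * once, marks the slot updated, and then uses the slot pointer from that
 * point on so the local copy can be freed.  Toy types are invented.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct ToyStats
{
	long		pages_deleted;
} ToyStats;

typedef struct ToySharedSlot
{
	bool		updated;
	ToyStats	istat;			/* lives in "shared memory" */
} ToySharedSlot;

static ToyStats *
toy_publish(ToySharedSlot *slot, ToyStats *istat)
{
	if (slot != NULL && !slot->updated && istat != NULL)
	{
		memcpy(&slot->istat, istat, sizeof(ToyStats));
		slot->updated = true;
		free(istat);			/* local copy no longer needed */
		istat = &slot->istat;
	}
	return istat;
}

int
main(void)
{
	ToySharedSlot slot = {false, {0}};
	ToyStats   *local = malloc(sizeof(ToyStats));

	local->pages_deleted = 42;
	local = toy_publish(&slot, local);
	printf("updated=%d pages_deleted=%ld\n", slot.updated, local->pages_deleted);
	return 0;
}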
 
 /*
  *	lazy_cleanup_all_indexes() -- cleanup all indexes of relation.
- *
- * Cleanup indexes.  We process the indexes serially unless we are doing
- * parallel vacuum.
  */
 static void
-lazy_cleanup_all_indexes(Relation *Irel, LVRelStats *vacrelstats,
-						 LVParallelState *lps, int nindexes)
+lazy_cleanup_all_indexes(LVRelState *vacrel)
 {
-	int			idx;
-
-	Assert(!IsParallelWorker());
-	Assert(nindexes > 0);
+	Assert(vacrel->nindexes > 0);
 
 	/* Report that we are now cleaning up indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_INDEX_CLEANUP);
 
-	/*
-	 * If parallel vacuum is active we perform index cleanup with parallel
-	 * workers.
-	 */
-	if (ParallelVacuumIsActive(lps))
+	if (!ParallelVacuumIsActive(vacrel))
 	{
-		/* Tell parallel workers to do index cleanup */
-		lps->lvshared->for_cleanup = true;
-		lps->lvshared->first_time =
-			(vacrelstats->num_index_scans == 0);
+		double		reltuples = vacrel->new_rel_tuples;
+		bool		estimated_count =
+		vacrel->tupcount_pages < vacrel->rel_pages;
 
-		/*
-		 * Now we can provide a better estimate of total number of surviving
-		 * tuples (we assume indexes are more interested in that than in the
-		 * number of nominally live tuples).
-		 */
-		lps->lvshared->reltuples = vacrelstats->new_rel_tuples;
-		lps->lvshared->estimated_count =
-			(vacrelstats->tupcount_pages < vacrelstats->rel_pages);
+		for (int idx = 0; idx < vacrel->nindexes; idx++)
+		{
+			Relation	indrel = vacrel->indrels[idx];
+			IndexBulkDeleteResult *istat = vacrel->indstats[idx];
 
-		lazy_parallel_vacuum_indexes(Irel, vacrelstats, lps, nindexes);
+			vacrel->indstats[idx] =
+				lazy_cleanup_one_index(indrel, istat, reltuples,
+									   estimated_count, vacrel);
+		}
 	}
 	else
 	{
-		for (idx = 0; idx < nindexes; idx++)
-			lazy_cleanup_index(Irel[idx], &(vacrelstats->indstats[idx]),
-							   vacrelstats->new_rel_tuples,
-							   vacrelstats->tupcount_pages < vacrelstats->rel_pages,
-							   vacrelstats);
+		/* Outsource everything to parallel variant */
+		do_parallel_lazy_cleanup_all_indexes(vacrel);
 	}
 }
 
 /*
- *	lazy_vacuum_index() -- vacuum one index relation.
+ *	lazy_vacuum_one_index() -- vacuum index relation.
  *
  *		Delete all the index entries pointing to tuples listed in
  *		dead_tuples, and update running statistics.
  *
  *		reltuples is the number of heap tuples to be passed to the
  *		bulkdelete callback.  It's always assumed to be estimated.
+ *
+ * Returns bulk delete stats derived from input stats
  */
-static void
-lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
-				  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats)
+static IndexBulkDeleteResult *
+lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
+					  double reltuples, LVRelState *vacrel)
 {
 	IndexVacuumInfo ivinfo;
 	PGRUsage	ru0;
@@ -2495,7 +2534,7 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	ivinfo.estimated_count = true;
 	ivinfo.message_level = elevel;
 	ivinfo.num_heap_tuples = reltuples;
-	ivinfo.strategy = vac_strategy;
+	ivinfo.strategy = vacrel->bstrategy;
 
 	/*
 	 * Update error traceback information.
@@ -2503,38 +2542,41 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	 * The index name is saved during this phase and restored immediately
 	 * after this phase.  See vacuum_error_callback.
 	 */
-	Assert(vacrelstats->indname == NULL);
-	vacrelstats->indname = pstrdup(RelationGetRelationName(indrel));
-	update_vacuum_error_info(vacrelstats, &saved_err_info,
+	Assert(vacrel->indname == NULL);
+	vacrel->indname = pstrdup(RelationGetRelationName(indrel));
+	update_vacuum_error_info(vacrel, &saved_err_info,
 							 VACUUM_ERRCB_PHASE_VACUUM_INDEX,
 							 InvalidBlockNumber, InvalidOffsetNumber);
 
 	/* Do bulk deletion */
-	*stats = index_bulk_delete(&ivinfo, *stats,
-							   lazy_tid_reaped, (void *) dead_tuples);
+	istat = index_bulk_delete(&ivinfo, istat, lazy_tid_reaped,
+							  (void *) vacrel->dead_tuples);
 
 	ereport(elevel,
 			(errmsg("scanned index \"%s\" to remove %d row versions",
-					vacrelstats->indname,
-					dead_tuples->num_tuples),
+					vacrel->indname, vacrel->dead_tuples->num_tuples),
 			 errdetail_internal("%s", pg_rusage_show(&ru0))));
 
 	/* Revert to the previous phase information for error traceback */
-	restore_vacuum_error_info(vacrelstats, &saved_err_info);
-	pfree(vacrelstats->indname);
-	vacrelstats->indname = NULL;
+	restore_vacuum_error_info(vacrel, &saved_err_info);
+	pfree(vacrel->indname);
+	vacrel->indname = NULL;
+
+	return istat;
 }
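/*
 * Illustrative sketch only (not part of the patch): ambulkdelete receives a
 * callback plus an opaque state pointer (here, the dead_tuples array) and
 * asks it "is this heap TID dead?" for each index tuple.  The real
 * lazy_tid_reaped() is not shown in this hunk; the toy below assumes a
 * sorted TID array and uses bsearch(), with invented types.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct ToyTid
{
	unsigned	blk;
	unsigned	off;
} ToyTid;

typedef struct ToyDeadTuples
{
	int			num_tuples;
	ToyTid		itemptrs[8];
} ToyDeadTuples;

static int
toy_cmp_tid(const void *a, const void *b)
{
	const ToyTid *l = a;
	const ToyTid *r = b;

	if (l->blk != r->blk)
		return (l->blk < r->blk) ? -1 : 1;
	if (l->off != r->off)
		return (l->off < r->off) ? -1 : 1;
	return 0;
}

/* Callback of the same general shape as IndexBulkDeleteCallback */
static bool
toy_tid_reaped(ToyTid *itemptr, void *state)
{
	ToyDeadTuples *dead = state;

	return bsearch(itemptr, dead->itemptrs, dead->num_tuples,
				   sizeof(ToyTid), toy_cmp_tid) != NULL;
}

int
main(void)
{
	ToyDeadTuples dead = {3, {{1, 2}, {3, 7}, {9, 1}}};
	ToyTid		probe = {3, 7};

	printf("dead? %d\n", toy_tid_reaped(&probe, &dead));
	return 0;
}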
 
 /*
- *	lazy_cleanup_index() -- do post-vacuum cleanup for one index relation.
+ *	lazy_cleanup_one_index() -- do post-vacuum cleanup for index relation.
  *
  *		reltuples is the number of heap tuples and estimated_count is true
  *		if reltuples is an estimated value.
+ *
+ * Returns bulk delete stats derived from input stats
  */
-static void
-lazy_cleanup_index(Relation indrel,
-				   IndexBulkDeleteResult **stats,
-				   double reltuples, bool estimated_count, LVRelStats *vacrelstats)
+static IndexBulkDeleteResult *
+lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
+					   double reltuples, bool estimated_count,
+					   LVRelState *vacrel)
 {
 	IndexVacuumInfo ivinfo;
 	PGRUsage	ru0;
@@ -2549,7 +2591,7 @@ lazy_cleanup_index(Relation indrel,
 	ivinfo.message_level = elevel;
 
 	ivinfo.num_heap_tuples = reltuples;
-	ivinfo.strategy = vac_strategy;
+	ivinfo.strategy = vacrel->bstrategy;
 
 	/*
 	 * Update error traceback information.
@@ -2557,35 +2599,37 @@ lazy_cleanup_index(Relation indrel,
 	 * The index name is saved during this phase and restored immediately
 	 * after this phase.  See vacuum_error_callback.
 	 */
-	Assert(vacrelstats->indname == NULL);
-	vacrelstats->indname = pstrdup(RelationGetRelationName(indrel));
-	update_vacuum_error_info(vacrelstats, &saved_err_info,
+	Assert(vacrel->indname == NULL);
+	vacrel->indname = pstrdup(RelationGetRelationName(indrel));
+	update_vacuum_error_info(vacrel, &saved_err_info,
 							 VACUUM_ERRCB_PHASE_INDEX_CLEANUP,
 							 InvalidBlockNumber, InvalidOffsetNumber);
 
-	*stats = index_vacuum_cleanup(&ivinfo, *stats);
+	istat = index_vacuum_cleanup(&ivinfo, istat);
 
-	if (*stats)
+	if (istat)
 	{
 		ereport(elevel,
 				(errmsg("index \"%s\" now contains %.0f row versions in %u pages",
 						RelationGetRelationName(indrel),
-						(*stats)->num_index_tuples,
-						(*stats)->num_pages),
+						(istat)->num_index_tuples,
+						(istat)->num_pages),
 				 errdetail("%.0f index row versions were removed.\n"
 						   "%u index pages were newly deleted.\n"
 						   "%u index pages are currently deleted, of which %u are currently reusable.\n"
 						   "%s.",
-						   (*stats)->tuples_removed,
-						   (*stats)->pages_newly_deleted,
-						   (*stats)->pages_deleted, (*stats)->pages_free,
+						   (istat)->tuples_removed,
+						   (istat)->pages_newly_deleted,
+						   (istat)->pages_deleted, (istat)->pages_free,
 						   pg_rusage_show(&ru0))));
 	}
 
 	/* Revert to the previous phase information for error traceback */
-	restore_vacuum_error_info(vacrelstats, &saved_err_info);
-	pfree(vacrelstats->indname);
-	vacrelstats->indname = NULL;
+	restore_vacuum_error_info(vacrel, &saved_err_info);
+	pfree(vacrel->indname);
+	vacrel->indname = NULL;
+
+	return istat;
 }
 
 /*
@@ -2608,17 +2652,17 @@ lazy_cleanup_index(Relation indrel,
  * careful to depend only on fields that lazy_scan_heap updates on-the-fly.
  */
 static bool
-should_attempt_truncation(VacuumParams *params, LVRelStats *vacrelstats)
+should_attempt_truncation(LVRelState *vacrel, VacuumParams *params)
 {
 	BlockNumber possibly_freeable;
 
 	if (params->truncate == VACOPT_TERNARY_DISABLED)
 		return false;
 
-	possibly_freeable = vacrelstats->rel_pages - vacrelstats->nonempty_pages;
+	possibly_freeable = vacrel->rel_pages - vacrel->nonempty_pages;
 	if (possibly_freeable > 0 &&
 		(possibly_freeable >= REL_TRUNCATE_MINIMUM ||
-		 possibly_freeable >= vacrelstats->rel_pages / REL_TRUNCATE_FRACTION) &&
+		 possibly_freeable >= vacrel->rel_pages / REL_TRUNCATE_FRACTION) &&
 		old_snapshot_threshold < 0)
 		return true;
 	else
@@ -2629,9 +2673,9 @@ should_attempt_truncation(VacuumParams *params, LVRelStats *vacrelstats)
  * lazy_truncate_heap - try to truncate off any empty pages at the end
  */
 static void
-lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
+lazy_truncate_heap(LVRelState *vacrel)
 {
-	BlockNumber old_rel_pages = vacrelstats->rel_pages;
+	BlockNumber old_rel_pages = vacrel->rel_pages;
 	BlockNumber new_rel_pages;
 	int			lock_retry;
 
@@ -2655,11 +2699,11 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 		 * (which is quite possible considering we already hold a lower-grade
 		 * lock).
 		 */
-		vacrelstats->lock_waiter_detected = false;
+		vacrel->lock_waiter_detected = false;
 		lock_retry = 0;
 		while (true)
 		{
-			if (ConditionalLockRelation(onerel, AccessExclusiveLock))
+			if (ConditionalLockRelation(vacrel->rel, AccessExclusiveLock))
 				break;
 
 			/*
@@ -2675,10 +2719,10 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 				 * We failed to establish the lock in the specified number of
 				 * retries. This means we give up truncating.
 				 */
-				vacrelstats->lock_waiter_detected = true;
+				vacrel->lock_waiter_detected = true;
 				ereport(elevel,
 						(errmsg("\"%s\": stopping truncate due to conflicting lock request",
-								vacrelstats->relname)));
+								vacrel->relname)));
 				return;
 			}
 
@@ -2690,17 +2734,17 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 		 * whilst we were vacuuming with non-exclusive lock.  If so, give up;
 		 * the newly added pages presumably contain non-deletable tuples.
 		 */
-		new_rel_pages = RelationGetNumberOfBlocks(onerel);
+		new_rel_pages = RelationGetNumberOfBlocks(vacrel->rel);
 		if (new_rel_pages != old_rel_pages)
 		{
 			/*
-			 * Note: we intentionally don't update vacrelstats->rel_pages with
-			 * the new rel size here.  If we did, it would amount to assuming
-			 * that the new pages are empty, which is unlikely. Leaving the
-			 * numbers alone amounts to assuming that the new pages have the
-			 * same tuple density as existing ones, which is less unlikely.
+			 * Note: we intentionally don't update vacrel->rel_pages with the
+			 * new rel size here.  If we did, it would amount to assuming that
+			 * the new pages are empty, which is unlikely. Leaving the numbers
+			 * alone amounts to assuming that the new pages have the same
+			 * tuple density as existing ones, which is less unlikely.
 			 */
-			UnlockRelation(onerel, AccessExclusiveLock);
+			UnlockRelation(vacrel->rel, AccessExclusiveLock);
 			return;
 		}
 
@@ -2710,20 +2754,20 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 		 * other backends could have added tuples to these pages whilst we
 		 * were vacuuming.
 		 */
-		new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
-		vacrelstats->blkno = new_rel_pages;
+		new_rel_pages = count_nondeletable_pages(vacrel);
+		vacrel->blkno = new_rel_pages;
 
 		if (new_rel_pages >= old_rel_pages)
 		{
 			/* can't do anything after all */
-			UnlockRelation(onerel, AccessExclusiveLock);
+			UnlockRelation(vacrel->rel, AccessExclusiveLock);
 			return;
 		}
 
 		/*
 		 * Okay to truncate.
 		 */
-		RelationTruncate(onerel, new_rel_pages);
+		RelationTruncate(vacrel->rel, new_rel_pages);
 
 		/*
 		 * We can release the exclusive lock as soon as we have truncated.
@@ -2732,25 +2776,25 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 		 * that should happen as part of standard invalidation processing once
 		 * they acquire lock on the relation.
 		 */
-		UnlockRelation(onerel, AccessExclusiveLock);
+		UnlockRelation(vacrel->rel, AccessExclusiveLock);
 
 		/*
 		 * Update statistics.  Here, it *is* correct to adjust rel_pages
 		 * without also touching reltuples, since the tuple count wasn't
 		 * changed by the truncation.
 		 */
-		vacrelstats->pages_removed += old_rel_pages - new_rel_pages;
-		vacrelstats->rel_pages = new_rel_pages;
+		vacrel->pages_removed += old_rel_pages - new_rel_pages;
+		vacrel->rel_pages = new_rel_pages;
 
 		ereport(elevel,
 				(errmsg("\"%s\": truncated %u to %u pages",
-						vacrelstats->relname,
+						vacrel->relname,
 						old_rel_pages, new_rel_pages),
 				 errdetail_internal("%s",
 									pg_rusage_show(&ru0))));
 		old_rel_pages = new_rel_pages;
-	} while (new_rel_pages > vacrelstats->nonempty_pages &&
-			 vacrelstats->lock_waiter_detected);
+	} while (new_rel_pages > vacrel->nonempty_pages &&
+			 vacrel->lock_waiter_detected);
 }
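/*
 * Illustrative sketch only (not part of the patch): the truncation path
 * above tries to take the exclusive lock without blocking and gives up after
 * a bounded number of attempts, recording that a lock waiter was detected.
 * The retry limit and "try lock" function below are invented stand-ins, not
 * PostgreSQL APIs.
 */
#include <stdbool.h>
#include <stdio.h>

#define TOY_LOCK_RETRIES	50

/* Pretend conditional lock: fails until the fifth attempt in this toy */
static bool
toy_conditional_lock(int attempt)
{
	return attempt >= 5;
}

static bool
toy_try_exclusive_lock(bool *lock_waiter_detected)
{
	for (int lock_retry = 0; lock_retry < TOY_LOCK_RETRIES; lock_retry++)
	{
		if (toy_conditional_lock(lock_retry))
			return true;		/* got the lock; caller may truncate */
		/* the real code sleeps here and checks for interrupts */
	}
	*lock_waiter_detected = true;	/* give up truncating */
	return false;
}

int
main(void)
{
	bool		waiter = false;
	bool		locked = toy_try_exclusive_lock(&waiter);

	printf("locked=%d waiter=%d\n", locked, waiter);
	return 0;
}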
 
 /*
@@ -2759,7 +2803,7 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
  * Returns number of nondeletable pages (last nonempty page + 1).
  */
 static BlockNumber
-count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
+count_nondeletable_pages(LVRelState *vacrel)
 {
 	BlockNumber blkno;
 	BlockNumber prefetchedUntil;
@@ -2774,11 +2818,11 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 	 * unsigned.)  To make the scan faster, we prefetch a few blocks at a time
 	 * in forward direction, so that OS-level readahead can kick in.
 	 */
-	blkno = vacrelstats->rel_pages;
+	blkno = vacrel->rel_pages;
 	StaticAssertStmt((PREFETCH_SIZE & (PREFETCH_SIZE - 1)) == 0,
 					 "prefetch size must be power of 2");
 	prefetchedUntil = InvalidBlockNumber;
-	while (blkno > vacrelstats->nonempty_pages)
+	while (blkno > vacrel->nonempty_pages)
 	{
 		Buffer		buf;
 		Page		page;
@@ -2805,13 +2849,13 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 			if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000)
 				>= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
 			{
-				if (LockHasWaitersRelation(onerel, AccessExclusiveLock))
+				if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
 				{
 					ereport(elevel,
 							(errmsg("\"%s\": suspending truncate due to conflicting lock request",
-									vacrelstats->relname)));
+									vacrel->relname)));
 
-					vacrelstats->lock_waiter_detected = true;
+					vacrel->lock_waiter_detected = true;
 					return blkno;
 				}
 				starttime = currenttime;
@@ -2836,14 +2880,14 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 			prefetchStart = blkno & ~(PREFETCH_SIZE - 1);
 			for (pblkno = prefetchStart; pblkno <= blkno; pblkno++)
 			{
-				PrefetchBuffer(onerel, MAIN_FORKNUM, pblkno);
+				PrefetchBuffer(vacrel->rel, MAIN_FORKNUM, pblkno);
 				CHECK_FOR_INTERRUPTS();
 			}
 			prefetchedUntil = prefetchStart;
 		}
 
-		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
-								 RBM_NORMAL, vac_strategy);
+		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								 vacrel->bstrategy);
 
 		/* In this phase we only need shared access to the buffer */
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
@@ -2891,7 +2935,7 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 	 * pages still are; we need not bother to look at the last known-nonempty
 	 * page.
 	 */
-	return vacrelstats->nonempty_pages;
+	return vacrel->nonempty_pages;
 }
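/*
 * Illustrative sketch only (not part of the patch): the backwards scan above
 * prefetches in aligned windows by masking the block number down to a
 * power-of-two boundary (prefetchStart = blkno & ~(PREFETCH_SIZE - 1)).
 * A standalone restatement with an invented constant:
 */
#include <stdio.h>

#define TOY_PREFETCH_SIZE	32	/* must be a power of 2 */

int
main(void)
{
	unsigned	blocks[] = {1000, 999, 970, 969, 31, 0};

	for (int i = 0; i < 6; i++)
	{
		unsigned	blkno = blocks[i];
		unsigned	prefetch_start = blkno & ~(TOY_PREFETCH_SIZE - 1);

		/* prefetch blocks prefetch_start .. blkno, inclusive */
		printf("blkno %4u -> prefetch window [%u, %u]\n",
			   blkno, prefetch_start, blkno);
	}
	return 0;
}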
 
 /*
@@ -2930,18 +2974,64 @@ compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
  * See the comments at the head of this file for rationale.
  */
 static void
-lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
+lazy_space_alloc(LVRelState *vacrel, int nworkers, BlockNumber nblocks)
 {
-	LVDeadTuples *dead_tuples = NULL;
+	LVDeadTuples *dead_tuples;
 	long		maxtuples;
 
-	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->useindex);
+	/*
+	 * Initialize state for a parallel vacuum.  As of now, only one worker can
+	 * be used for an index, so we invoke parallelism only if there are at
+	 * least two indexes on a table.
+	 */
+	if (nworkers >= 0 && vacrel->nindexes > 1)
+	{
+		/*
+		 * Since parallel workers cannot access data in temporary tables, we
+		 * can't perform parallel vacuum on them.
+		 */
+		if (RelationUsesLocalBuffers(vacrel->rel))
+		{
+			/*
+			 * Give warning only if the user explicitly tries to perform a
+			 * parallel vacuum on the temporary table.
+			 */
+			if (nworkers > 0)
+				ereport(WARNING,
+						(errmsg("disabling parallel option of vacuum on \"%s\" --- cannot vacuum temporary tables in parallel",
+								vacrel->relname)));
+		}
+		else
+			vacrel->lps = begin_parallel_vacuum(vacrel, nblocks, nworkers);
+
+		/* If parallel mode started, we're done */
+		if (ParallelVacuumIsActive(vacrel))
+			return;
+	}
+
+	maxtuples = compute_max_dead_tuples(nblocks, vacrel->nindexes > 0);
 
 	dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
 	dead_tuples->num_tuples = 0;
 	dead_tuples->max_tuples = (int) maxtuples;
 
-	vacrelstats->dead_tuples = dead_tuples;
+	vacrel->dead_tuples = dead_tuples;
+}
+
+/*
+ * lazy_space_free - free space allocated in lazy_space_alloc
+ */
+static void
+lazy_space_free(LVRelState *vacrel)
+{
+	if (!ParallelVacuumIsActive(vacrel))
+		return;
+
+	/*
+	 * End parallel mode before updating index statistics as we cannot write
+	 * during parallel mode.
+	 */
+	end_parallel_vacuum(vacrel);
 }
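/*
 * Illustrative sketch only (not part of the patch): dead_tuples is a single
 * allocation whose size is computed from the tuple capacity, in the style of
 * SizeOfDeadTuples(maxtuples) above.  The toy below uses a C99 flexible
 * array member and offsetof(); all names are invented for the example.
 */
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct ToyTid
{
	unsigned	blk;
	unsigned	off;
} ToyTid;

typedef struct ToyDeadTuples
{
	int			max_tuples;
	int			num_tuples;
	ToyTid		itemptrs[];		/* flexible array member */
} ToyDeadTuples;

#define ToySizeOfDeadTuples(cnt) \
	(offsetof(ToyDeadTuples, itemptrs) + (cnt) * sizeof(ToyTid))

int
main(void)
{
	long		maxtuples = 1000;
	ToyDeadTuples *dead_tuples = malloc(ToySizeOfDeadTuples(maxtuples));

	dead_tuples->num_tuples = 0;
	dead_tuples->max_tuples = (int) maxtuples;
	printf("allocated %zu bytes for %d TIDs\n",
		   ToySizeOfDeadTuples(maxtuples), dead_tuples->max_tuples);
	free(dead_tuples);
	return 0;
}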
 
 /*
@@ -3039,8 +3129,7 @@ vac_cmp_itemptr(const void *left, const void *right)
  * on this page is frozen.
  */
 static bool
-heap_page_is_all_visible(Relation rel, Buffer buf,
-						 LVRelStats *vacrelstats,
+heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 						 TransactionId *visibility_cutoff_xid,
 						 bool *all_frozen)
 {
@@ -3069,7 +3158,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 		 * Set the offset number so that we can display it along with any
 		 * error that occurred while processing this tuple.
 		 */
-		vacrelstats->offnum = offnum;
+		vacrel->offnum = offnum;
 		itemid = PageGetItemId(page, offnum);
 
 		/* Unused or redirect line pointers are of no interest */
@@ -3093,9 +3182,9 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 
 		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
 		tuple.t_len = ItemIdGetLength(itemid);
-		tuple.t_tableOid = RelationGetRelid(rel);
+		tuple.t_tableOid = RelationGetRelid(vacrel->rel);
 
-		switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
 		{
 			case HEAPTUPLE_LIVE:
 				{
@@ -3114,7 +3203,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 					 * that everyone sees it as committed?
 					 */
 					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-					if (!TransactionIdPrecedes(xmin, OldestXmin))
+					if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
 					{
 						all_visible = false;
 						*all_frozen = false;
@@ -3148,7 +3237,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
 	}							/* scan along page */
 
 	/* Clear the offset information once we have processed the given page. */
-	vacrelstats->offnum = InvalidOffsetNumber;
+	vacrel->offnum = InvalidOffsetNumber;
 
 	return all_visible;
 }
@@ -3167,14 +3256,13 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
  * vacuum.
  */
 static int
-compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
+compute_parallel_vacuum_workers(LVRelState *vacrel, int nrequested,
 								bool *can_parallel_vacuum)
 {
 	int			nindexes_parallel = 0;
 	int			nindexes_parallel_bulkdel = 0;
 	int			nindexes_parallel_cleanup = 0;
 	int			parallel_workers;
-	int			i;
 
 	/*
 	 * We don't allow performing parallel operation in standalone backend or
@@ -3186,15 +3274,16 @@ compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
 	/*
 	 * Compute the number of indexes that can participate in parallel vacuum.
 	 */
-	for (i = 0; i < nindexes; i++)
+	for (int idx = 0; idx < vacrel->nindexes; idx++)
 	{
-		uint8		vacoptions = Irel[i]->rd_indam->amparallelvacuumoptions;
+		Relation	indrel = vacrel->indrels[idx];
+		uint8		vacoptions = indrel->rd_indam->amparallelvacuumoptions;
 
 		if (vacoptions == VACUUM_OPTION_NO_PARALLEL ||
-			RelationGetNumberOfBlocks(Irel[i]) < min_parallel_index_scan_size)
+			RelationGetNumberOfBlocks(indrel) < min_parallel_index_scan_size)
 			continue;
 
-		can_parallel_vacuum[i] = true;
+		can_parallel_vacuum[idx] = true;
 
 		if ((vacoptions & VACUUM_OPTION_PARALLEL_BULKDEL) != 0)
 			nindexes_parallel_bulkdel++;
@@ -3223,52 +3312,30 @@ compute_parallel_vacuum_workers(Relation *Irel, int nindexes, int nrequested,
 	return parallel_workers;
 }
 
-/*
- * Initialize variables for shared index statistics, set NULL bitmap and the
- * size of stats for each index.
- */
-static void
-prepare_index_statistics(LVShared *lvshared, bool *can_parallel_vacuum,
-						 int nindexes)
-{
-	int			i;
-
-	/* Currently, we don't support parallel vacuum for autovacuum */
-	Assert(!IsAutoVacuumWorkerProcess());
-
-	/* Set NULL for all indexes */
-	memset(lvshared->bitmap, 0x00, BITMAPLEN(nindexes));
-
-	for (i = 0; i < nindexes; i++)
-	{
-		if (!can_parallel_vacuum[i])
-			continue;
-
-		/* Set NOT NULL as this index does support parallelism */
-		lvshared->bitmap[i >> 3] |= 1 << (i & 0x07);
-	}
-}
-
 /*
  * Update index statistics in pg_class if the statistics are accurate.
  */
 static void
-update_index_statistics(Relation *Irel, IndexBulkDeleteResult **stats,
-						int nindexes)
+update_index_statistics(LVRelState *vacrel)
 {
-	int			i;
+	Relation   *indrels = vacrel->indrels;
+	int			nindexes = vacrel->nindexes;
+	IndexBulkDeleteResult **indstats = vacrel->indstats;
 
 	Assert(!IsInParallelMode());
 
-	for (i = 0; i < nindexes; i++)
+	for (int idx = 0; idx < nindexes; idx++)
 	{
-		if (stats[i] == NULL || stats[i]->estimated_count)
+		Relation	indrel = indrels[idx];
+		IndexBulkDeleteResult *istat = indstats[idx];
+
+		if (istat == NULL || istat->estimated_count)
 			continue;
 
 		/* Update index statistics */
-		vac_update_relstats(Irel[i],
-							stats[i]->num_pages,
-							stats[i]->num_index_tuples,
+		vac_update_relstats(indrel,
+							istat->num_pages,
+							istat->num_index_tuples,
 							0,
 							false,
 							InvalidTransactionId,
@@ -3283,10 +3350,12 @@ update_index_statistics(Relation *Irel, IndexBulkDeleteResult **stats,
  * create a parallel context, and then initialize the DSM segment.
  */
 static LVParallelState *
-begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
-					  BlockNumber nblocks, int nindexes, int nrequested)
+begin_parallel_vacuum(LVRelState *vacrel, BlockNumber nblocks,
+					  int nrequested)
 {
 	LVParallelState *lps = NULL;
+	Relation   *indrels = vacrel->indrels;
+	int			nindexes = vacrel->nindexes;
 	ParallelContext *pcxt;
 	LVShared   *shared;
 	LVDeadTuples *dead_tuples;
@@ -3299,7 +3368,6 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	int			nindexes_mwm = 0;
 	int			parallel_workers = 0;
 	int			querylen;
-	int			i;
 
 	/*
 	 * A parallel vacuum must be requested and there must be indexes on the
@@ -3312,7 +3380,7 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	 * Compute the number of parallel vacuum workers to launch
 	 */
 	can_parallel_vacuum = (bool *) palloc0(sizeof(bool) * nindexes);
-	parallel_workers = compute_parallel_vacuum_workers(Irel, nindexes,
+	parallel_workers = compute_parallel_vacuum_workers(vacrel,
 													   nrequested,
 													   can_parallel_vacuum);
 
@@ -3333,9 +3401,10 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 
 	/* Estimate size for shared information -- PARALLEL_VACUUM_KEY_SHARED */
 	est_shared = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN(nindexes)));
-	for (i = 0; i < nindexes; i++)
+	for (int idx = 0; idx < nindexes; idx++)
 	{
-		uint8		vacoptions = Irel[i]->rd_indam->amparallelvacuumoptions;
+		Relation	indrel = indrels[idx];
+		uint8		vacoptions = indrel->rd_indam->amparallelvacuumoptions;
 
 		/*
 		 * Cleanup option should be either disabled, always performing in
@@ -3346,10 +3415,10 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 		Assert(vacoptions <= VACUUM_OPTION_MAX_VALID_VALUE);
 
 		/* Skip indexes that don't participate in parallel vacuum */
-		if (!can_parallel_vacuum[i])
+		if (!can_parallel_vacuum[idx])
 			continue;
 
-		if (Irel[i]->rd_indam->amusemaintenanceworkmem)
+		if (indrel->rd_indam->amusemaintenanceworkmem)
 			nindexes_mwm++;
 
 		est_shared = add_size(est_shared, sizeof(LVSharedIndStats));
@@ -3404,7 +3473,7 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	/* Prepare shared information */
 	shared = (LVShared *) shm_toc_allocate(pcxt->toc, est_shared);
 	MemSet(shared, 0, est_shared);
-	shared->relid = relid;
+	shared->relid = RelationGetRelid(vacrel->rel);
 	shared->elevel = elevel;
 	shared->maintenance_work_mem_worker =
 		(nindexes_mwm > 0) ?
@@ -3415,7 +3484,20 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	pg_atomic_init_u32(&(shared->active_nworkers), 0);
 	pg_atomic_init_u32(&(shared->idx), 0);
 	shared->offset = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN(nindexes)));
-	prepare_index_statistics(shared, can_parallel_vacuum, nindexes);
+
+	/*
+	 * Initialize variables for shared index statistics, set NULL bitmap and
+	 * the size of stats for each index.
+	 */
+	memset(shared->bitmap, 0x00, BITMAPLEN(nindexes));
+	for (int idx = 0; idx < nindexes; idx++)
+	{
+		if (!can_parallel_vacuum[idx])
+			continue;
+
+		/* Set NOT NULL as this index does support parallelism */
+		shared->bitmap[idx >> 3] |= 1 << (idx & 0x07);
+	}
 
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
 	lps->lvshared = shared;
@@ -3426,7 +3508,7 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
 	dead_tuples->num_tuples = 0;
 	MemSet(dead_tuples->itemptrs, 0, sizeof(ItemPointerData) * maxtuples);
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_TUPLES, dead_tuples);
-	vacrelstats->dead_tuples = dead_tuples;
+	vacrel->dead_tuples = dead_tuples;
 
 	/*
 	 * Allocate space for each worker's BufferUsage and WalUsage; no need to
@@ -3467,32 +3549,35 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, LVRelStats *vacrelstats,
  * context, but that won't be safe (see ExitParallelMode).
  */
 static void
-end_parallel_vacuum(IndexBulkDeleteResult **stats, LVParallelState *lps,
-					int nindexes)
+end_parallel_vacuum(LVRelState *vacrel)
 {
-	int			i;
+	IndexBulkDeleteResult **indstats = vacrel->indstats;
+	LVParallelState *lps = vacrel->lps;
+	int			nindexes = vacrel->nindexes;
 
 	Assert(!IsParallelWorker());
 
 	/* Copy the updated statistics */
-	for (i = 0; i < nindexes; i++)
+	for (int idx = 0; idx < nindexes; idx++)
 	{
-		LVSharedIndStats *indstats = get_indstats(lps->lvshared, i);
+		LVSharedIndStats *shared_istat;
+
+		shared_istat = parallel_stats_for_idx(lps->lvshared, idx);
 
 		/*
 		 * Skip unused slot.  The statistics of this index are already stored
 		 * in local memory.
 		 */
-		if (indstats == NULL)
+		if (shared_istat == NULL)
 			continue;
 
-		if (indstats->updated)
+		if (shared_istat->updated)
 		{
-			stats[i] = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-			memcpy(stats[i], &(indstats->stats), sizeof(IndexBulkDeleteResult));
+			indstats[idx] = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+			memcpy(indstats[idx], &(shared_istat->istat), sizeof(IndexBulkDeleteResult));
 		}
 		else
-			stats[i] = NULL;
+			indstats[idx] = NULL;
 	}
 
 	DestroyParallelContext(lps->pcxt);
@@ -3500,23 +3585,24 @@ end_parallel_vacuum(IndexBulkDeleteResult **stats, LVParallelState *lps,
 
 	/* Deactivate parallel vacuum */
 	pfree(lps);
-	lps = NULL;
+	vacrel->lps = NULL;
 }
 
-/* Return the Nth index statistics or NULL */
+/*
+ * Return shared memory statistics for index at offset 'getidx', if any
+ */
 static LVSharedIndStats *
-get_indstats(LVShared *lvshared, int n)
+parallel_stats_for_idx(LVShared *lvshared, int getidx)
 {
-	int			i;
 	char	   *p;
 
-	if (IndStatsIsNull(lvshared, n))
+	if (IndStatsIsNull(lvshared, getidx))
 		return NULL;
 
 	p = (char *) GetSharedIndStats(lvshared);
-	for (i = 0; i < n; i++)
+	for (int idx = 0; idx < getidx; idx++)
 	{
-		if (IndStatsIsNull(lvshared, i))
+		if (IndStatsIsNull(lvshared, idx))
 			continue;
 
 		p += sizeof(LVSharedIndStats);
@@ -3526,11 +3612,11 @@ get_indstats(LVShared *lvshared, int n)
 }
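/*
 * Illustrative sketch only (not part of the patch): the shared area has one
 * stats slot per parallel-capable index, plus a bitmap recording which
 * indexes have a slot.  Finding index N's slot means testing its bit and
 * then skipping the slots of lower-numbered indexes whose bits are set, as
 * parallel_stats_for_idx() does above.  Toy types and macros are invented.
 */
#include <stdio.h>
#include <string.h>

#define TOY_NINDEXES	6
#define TOY_BITMAPLEN(n)	(((n) + 7) / 8)
#define TOY_BIT_IS_SET(bm, i)	(((bm)[(i) >> 3] & (1 << ((i) & 0x07))) != 0)

typedef struct ToySlot
{
	long		tuples_removed;
} ToySlot;

/* Return a pointer to index getidx's slot, or NULL if it has none */
static ToySlot *
toy_stats_for_idx(unsigned char *bitmap, ToySlot *slots, int getidx)
{
	ToySlot    *p = slots;

	if (!TOY_BIT_IS_SET(bitmap, getidx))
		return NULL;
	for (int idx = 0; idx < getidx; idx++)
	{
		if (!TOY_BIT_IS_SET(bitmap, idx))
			continue;
		p++;					/* skip a preceding index's slot */
	}
	return p;
}

int
main(void)
{
	unsigned char bitmap[TOY_BITMAPLEN(TOY_NINDEXES)];
	ToySlot		slots[TOY_NINDEXES];	/* at most one slot per index */
	int			can_parallel[TOY_NINDEXES] = {1, 0, 1, 1, 0, 1};

	memset(bitmap, 0x00, sizeof(bitmap));
	for (int idx = 0; idx < TOY_NINDEXES; idx++)
		if (can_parallel[idx])
			bitmap[idx >> 3] |= 1 << (idx & 0x07);

	for (int idx = 0; idx < TOY_NINDEXES; idx++)
	{
		ToySlot    *s = toy_stats_for_idx(bitmap, slots, idx);

		if (s)
			printf("index %d -> slot %ld\n", idx, (long) (s - slots));
		else
			printf("index %d -> no slot (not parallel-capable)\n", idx);
	}
	return 0;
}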
 
 /*
- * Returns true, if the given index can't participate in parallel index vacuum
- * or parallel index cleanup, false, otherwise.
+ * Returns false if the given index can't participate in parallel index
+ * vacuum or parallel index cleanup.
  */
 static bool
-skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared)
+parallel_processing_is_safe(Relation indrel, LVShared *lvshared)
 {
 	uint8		vacoptions = indrel->rd_indam->amparallelvacuumoptions;
 
@@ -3552,15 +3638,15 @@ skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared)
 		 */
 		if (!lvshared->first_time &&
 			((vacoptions & VACUUM_OPTION_PARALLEL_COND_CLEANUP) != 0))
-			return true;
+			return false;
 	}
 	else if ((vacoptions & VACUUM_OPTION_PARALLEL_BULKDEL) == 0)
 	{
 		/* Skip if the index does not support parallel bulk deletion */
-		return true;
+		return false;
 	}
 
-	return false;
+	return true;
 }
 
 /*
@@ -3572,7 +3658,7 @@ skip_parallel_vacuum_index(Relation indrel, LVShared *lvshared)
 void
 parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 {
-	Relation	onerel;
+	Relation	rel;
 	Relation   *indrels;
 	LVShared   *lvshared;
 	LVDeadTuples *dead_tuples;
@@ -3580,7 +3666,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	WalUsage   *wal_usage;
 	int			nindexes;
 	char	   *sharedquery;
-	LVRelStats	vacrelstats;
+	LVRelState	vacrel;
 	ErrorContextCallback errcallback;
 
 	lvshared = (LVShared *) shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_SHARED,
@@ -3602,13 +3688,13 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	 * okay because the lock mode does not conflict among the parallel
 	 * workers.
 	 */
-	onerel = table_open(lvshared->relid, ShareUpdateExclusiveLock);
+	rel = table_open(lvshared->relid, ShareUpdateExclusiveLock);
 
 	/*
 	 * Open all indexes. indrels are sorted in order by OID, which should be
 	 * matched to the leader's one.
 	 */
-	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &indrels);
+	vac_open_indexes(rel, RowExclusiveLock, &nindexes, &indrels);
 	Assert(nindexes > 0);
 
 	/* Set dead tuple space */
@@ -3626,24 +3712,27 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	VacuumSharedCostBalance = &(lvshared->cost_balance);
 	VacuumActiveNWorkers = &(lvshared->active_nworkers);
 
-	vacrelstats.indstats = (IndexBulkDeleteResult **)
+	vacrel.rel = rel;
+	vacrel.indrels = indrels;
+	vacrel.nindexes = nindexes;
+	vacrel.indstats = (IndexBulkDeleteResult **)
 		palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
 
 	if (lvshared->maintenance_work_mem_worker > 0)
 		maintenance_work_mem = lvshared->maintenance_work_mem_worker;
 
 	/*
-	 * Initialize vacrelstats for use as error callback arg by parallel
-	 * worker.
+	 * Initialize vacrel for use as error callback arg by parallel worker.
 	 */
-	vacrelstats.relnamespace = get_namespace_name(RelationGetNamespace(onerel));
-	vacrelstats.relname = pstrdup(RelationGetRelationName(onerel));
-	vacrelstats.indname = NULL;
-	vacrelstats.phase = VACUUM_ERRCB_PHASE_UNKNOWN; /* Not yet processing */
+	vacrel.relnamespace = get_namespace_name(RelationGetNamespace(rel));
+	vacrel.relname = pstrdup(RelationGetRelationName(rel));
+	vacrel.indname = NULL;
+	vacrel.phase = VACUUM_ERRCB_PHASE_UNKNOWN;	/* Not yet processing */
+	vacrel.dead_tuples = dead_tuples;
 
 	/* Setup error traceback support for ereport() */
 	errcallback.callback = vacuum_error_callback;
-	errcallback.arg = &vacrelstats;
+	errcallback.arg = &vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
 
@@ -3651,8 +3740,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	InstrStartParallelQuery();
 
 	/* Process indexes to perform vacuum/cleanup */
-	parallel_vacuum_index(indrels, lvshared, dead_tuples, nindexes,
-						  &vacrelstats);
+	do_parallel_processing(&vacrel, lvshared);
 
 	/* Report buffer/WAL usage during parallel execution */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
@@ -3664,8 +3752,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	error_context_stack = errcallback.previous;
 
 	vac_close_indexes(nindexes, indrels, RowExclusiveLock);
-	table_close(onerel, ShareUpdateExclusiveLock);
-	pfree(vacrelstats.indstats);
+	table_close(rel, ShareUpdateExclusiveLock);
+	pfree(vacrel.indstats);
 }
 
 /*
@@ -3674,7 +3762,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 static void
 vacuum_error_callback(void *arg)
 {
-	LVRelStats *errinfo = arg;
+	LVRelState *errinfo = arg;
 
 	switch (errinfo->phase)
 	{
@@ -3736,28 +3824,29 @@ vacuum_error_callback(void *arg)
  * the current information which can be later restored via restore_vacuum_error_info.
  */
 static void
-update_vacuum_error_info(LVRelStats *errinfo, LVSavedErrInfo *saved_err_info, int phase,
-						 BlockNumber blkno, OffsetNumber offnum)
+update_vacuum_error_info(LVRelState *vacrel, LVSavedErrInfo *saved_vacrel,
+						 int phase, BlockNumber blkno, OffsetNumber offnum)
 {
-	if (saved_err_info)
+	if (saved_vacrel)
 	{
-		saved_err_info->offnum = errinfo->offnum;
-		saved_err_info->blkno = errinfo->blkno;
-		saved_err_info->phase = errinfo->phase;
+		saved_vacrel->offnum = vacrel->offnum;
+		saved_vacrel->blkno = vacrel->blkno;
+		saved_vacrel->phase = vacrel->phase;
 	}
 
-	errinfo->blkno = blkno;
-	errinfo->offnum = offnum;
-	errinfo->phase = phase;
+	vacrel->blkno = blkno;
+	vacrel->offnum = offnum;
+	vacrel->phase = phase;
 }
 
 /*
  * Restores the vacuum information saved via a prior call to update_vacuum_error_info.
  */
 static void
-restore_vacuum_error_info(LVRelStats *errinfo, const LVSavedErrInfo *saved_err_info)
+restore_vacuum_error_info(LVRelState *vacrel,
+						  const LVSavedErrInfo *saved_vacrel)
 {
-	errinfo->blkno = saved_err_info->blkno;
-	errinfo->offnum = saved_err_info->offnum;
-	errinfo->phase = saved_err_info->phase;
+	vacrel->blkno = saved_vacrel->blkno;
+	vacrel->offnum = saved_vacrel->offnum;
+	vacrel->phase = saved_vacrel->phase;
 }
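/*
 * Illustrative sketch only (not part of the patch): the pair of helpers
 * above implements a small save/set/restore protocol so nested phases can
 * temporarily change the error-callback context and put the previous values
 * back afterwards.  Toy types are invented for the example.
 */
#include <stdio.h>

typedef struct ToyErrInfo
{
	int			phase;
	unsigned	blkno;
	unsigned	offnum;
} ToyErrInfo;

static void
toy_update_err_info(ToyErrInfo *cur, ToyErrInfo *saved,
					int phase, unsigned blkno, unsigned offnum)
{
	if (saved)
		*saved = *cur;			/* remember the caller's context */
	cur->phase = phase;
	cur->blkno = blkno;
	cur->offnum = offnum;
}

static void
toy_restore_err_info(ToyErrInfo *cur, const ToyErrInfo *saved)
{
	*cur = *saved;
}

int
main(void)
{
	ToyErrInfo	cur = {1, 10, 3};
	ToyErrInfo	saved;

	toy_update_err_info(&cur, &saved, 2, 99, 0);
	printf("during: phase=%d blkno=%u\n", cur.phase, cur.blkno);
	toy_restore_err_info(&cur, &saved);
	printf("after:  phase=%d blkno=%u\n", cur.phase, cur.blkno);
	return 0;
}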
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 3d2dbed708..9b5afa12ad 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -689,7 +689,7 @@ index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap)
  */
 IndexBulkDeleteResult *
 index_bulk_delete(IndexVacuumInfo *info,
-				  IndexBulkDeleteResult *stats,
+				  IndexBulkDeleteResult *istat,
 				  IndexBulkDeleteCallback callback,
 				  void *callback_state)
 {
@@ -698,7 +698,7 @@ index_bulk_delete(IndexVacuumInfo *info,
 	RELATION_CHECKS;
 	CHECK_REL_PROCEDURE(ambulkdelete);
 
-	return indexRelation->rd_indam->ambulkdelete(info, stats,
+	return indexRelation->rd_indam->ambulkdelete(info, istat,
 												 callback, callback_state);
 }
 
@@ -710,14 +710,14 @@ index_bulk_delete(IndexVacuumInfo *info,
  */
 IndexBulkDeleteResult *
 index_vacuum_cleanup(IndexVacuumInfo *info,
-					 IndexBulkDeleteResult *stats)
+					 IndexBulkDeleteResult *istat)
 {
 	Relation	indexRelation = info->index;
 
 	RELATION_CHECKS;
 	CHECK_REL_PROCEDURE(amvacuumcleanup);
 
-	return indexRelation->rd_indam->amvacuumcleanup(info, stats);
+	return indexRelation->rd_indam->amvacuumcleanup(info, istat);
 }
 
 /* ----------------
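/*
 * Illustrative sketch only (not part of the patch): index_bulk_delete() and
 * index_vacuum_cleanup() above are thin wrappers that dispatch through
 * per-access-method function pointers and thread the (possibly NULL) stats
 * pointer through.  The toy access-method table below is invented.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct ToyStats
{
	long		tuples_removed;
} ToyStats;

typedef struct ToyIndexAm
{
	ToyStats   *(*ambulkdelete) (ToyStats *istat);
	ToyStats   *(*amvacuumcleanup) (ToyStats *istat);
} ToyIndexAm;

typedef struct ToyIndexRel
{
	const char *name;
	const ToyIndexAm *rd_indam;
} ToyIndexRel;

static ToyStats *
btree_bulkdelete(ToyStats *istat)
{
	if (istat == NULL)			/* first call allocates the result */
		istat = calloc(1, sizeof(ToyStats));
	istat->tuples_removed += 100;
	return istat;
}

static ToyStats *
btree_vacuumcleanup(ToyStats *istat)
{
	return istat;				/* nothing extra to do in this toy */
}

static const ToyIndexAm btree_am = {btree_bulkdelete, btree_vacuumcleanup};

/* Wrapper in the style of index_bulk_delete() */
static ToyStats *
toy_index_bulk_delete(ToyIndexRel *index, ToyStats *istat)
{
	return index->rd_indam->ambulkdelete(istat);
}

int
main(void)
{
	ToyIndexRel idx = {"toy_btree_idx", &btree_am};
	ToyStats   *istat = NULL;

	istat = toy_index_bulk_delete(&idx, istat);
	printf("%s: %ld tuples removed\n", idx.name, istat->tuples_removed);
	return 0;
}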
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 662aff04b4..25465b05dd 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -617,7 +617,7 @@ Relation
 vacuum_open_relation(Oid relid, RangeVar *relation, bits32 options,
 					 bool verbose, LOCKMODE lmode)
 {
-	Relation	onerel;
+	Relation	rel;
 	bool		rel_lock = true;
 	int			elevel;
 
@@ -633,18 +633,18 @@ vacuum_open_relation(Oid relid, RangeVar *relation, bits32 options,
 	 * in non-blocking mode, before calling try_relation_open().
 	 */
 	if (!(options & VACOPT_SKIP_LOCKED))
-		onerel = try_relation_open(relid, lmode);
+		rel = try_relation_open(relid, lmode);
 	else if (ConditionalLockRelationOid(relid, lmode))
-		onerel = try_relation_open(relid, NoLock);
+		rel = try_relation_open(relid, NoLock);
 	else
 	{
-		onerel = NULL;
+		rel = NULL;
 		rel_lock = false;
 	}
 
 	/* if relation is opened, leave */
-	if (onerel)
-		return onerel;
+	if (rel)
+		return rel;
 
 	/*
 	 * Relation could not be opened, hence generate if possible a log
@@ -1726,8 +1726,8 @@ static bool
 vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 {
 	LOCKMODE	lmode;
-	Relation	onerel;
-	LockRelId	onerelid;
+	Relation	rel;
+	LockRelId	lockrelid;
 	Oid			toast_relid;
 	Oid			save_userid;
 	int			save_sec_context;
@@ -1792,11 +1792,11 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		AccessExclusiveLock : ShareUpdateExclusiveLock;
 
 	/* open the relation and get the appropriate lock on it */
-	onerel = vacuum_open_relation(relid, relation, params->options,
-								  params->log_min_duration >= 0, lmode);
+	rel = vacuum_open_relation(relid, relation, params->options,
+							   params->log_min_duration >= 0, lmode);
 
 	/* leave if relation could not be opened or locked */
-	if (!onerel)
+	if (!rel)
 	{
 		PopActiveSnapshot();
 		CommitTransactionCommand();
@@ -1811,11 +1811,11 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	 * changed in-between.  Make sure to only generate logs for VACUUM in this
 	 * case.
 	 */
-	if (!vacuum_is_relation_owner(RelationGetRelid(onerel),
-								  onerel->rd_rel,
+	if (!vacuum_is_relation_owner(RelationGetRelid(rel),
+								  rel->rd_rel,
 								  params->options & VACOPT_VACUUM))
 	{
-		relation_close(onerel, lmode);
+		relation_close(rel, lmode);
 		PopActiveSnapshot();
 		CommitTransactionCommand();
 		return false;
@@ -1824,15 +1824,15 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	/*
 	 * Check that it's of a vacuumable relkind.
 	 */
-	if (onerel->rd_rel->relkind != RELKIND_RELATION &&
-		onerel->rd_rel->relkind != RELKIND_MATVIEW &&
-		onerel->rd_rel->relkind != RELKIND_TOASTVALUE &&
-		onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+	if (rel->rd_rel->relkind != RELKIND_RELATION &&
+		rel->rd_rel->relkind != RELKIND_MATVIEW &&
+		rel->rd_rel->relkind != RELKIND_TOASTVALUE &&
+		rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
 	{
 		ereport(WARNING,
 				(errmsg("skipping \"%s\" --- cannot vacuum non-tables or special system tables",
-						RelationGetRelationName(onerel))));
-		relation_close(onerel, lmode);
+						RelationGetRelationName(rel))));
+		relation_close(rel, lmode);
 		PopActiveSnapshot();
 		CommitTransactionCommand();
 		return false;
@@ -1845,9 +1845,9 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	 * warning here; it would just lead to chatter during a database-wide
 	 * VACUUM.)
 	 */
-	if (RELATION_IS_OTHER_TEMP(onerel))
+	if (RELATION_IS_OTHER_TEMP(rel))
 	{
-		relation_close(onerel, lmode);
+		relation_close(rel, lmode);
 		PopActiveSnapshot();
 		CommitTransactionCommand();
 		return false;
@@ -1858,9 +1858,9 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	 * useful work is on their child partitions, which have been queued up for
 	 * us separately.
 	 */
-	if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
-		relation_close(onerel, lmode);
+		relation_close(rel, lmode);
 		PopActiveSnapshot();
 		CommitTransactionCommand();
 		/* It's OK to proceed with ANALYZE on this table */
@@ -1877,14 +1877,14 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	 * because the lock manager knows that both lock requests are from the
 	 * same process.
 	 */
-	onerelid = onerel->rd_lockInfo.lockRelId;
-	LockRelationIdForSession(&onerelid, lmode);
+	lockrelid = rel->rd_lockInfo.lockRelId;
+	LockRelationIdForSession(&lockrelid, lmode);
 
 	/* Set index cleanup option based on reloptions if not yet */
 	if (params->index_cleanup == VACOPT_TERNARY_DEFAULT)
 	{
-		if (onerel->rd_options == NULL ||
-			((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
+		if (rel->rd_options == NULL ||
+			((StdRdOptions *) rel->rd_options)->vacuum_index_cleanup)
 			params->index_cleanup = VACOPT_TERNARY_ENABLED;
 		else
 			params->index_cleanup = VACOPT_TERNARY_DISABLED;
@@ -1893,8 +1893,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	/* Set truncate option based on reloptions if not yet */
 	if (params->truncate == VACOPT_TERNARY_DEFAULT)
 	{
-		if (onerel->rd_options == NULL ||
-			((StdRdOptions *) onerel->rd_options)->vacuum_truncate)
+		if (rel->rd_options == NULL ||
+			((StdRdOptions *) rel->rd_options)->vacuum_truncate)
 			params->truncate = VACOPT_TERNARY_ENABLED;
 		else
 			params->truncate = VACOPT_TERNARY_DISABLED;
@@ -1907,7 +1907,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	 */
 	if ((params->options & VACOPT_PROCESS_TOAST) != 0 &&
 		(params->options & VACOPT_FULL) == 0)
-		toast_relid = onerel->rd_rel->reltoastrelid;
+		toast_relid = rel->rd_rel->reltoastrelid;
 	else
 		toast_relid = InvalidOid;
 
@@ -1918,7 +1918,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	 * unnecessary, but harmless, for lazy VACUUM.)
 	 */
 	GetUserIdAndSecContext(&save_userid, &save_sec_context);
-	SetUserIdAndSecContext(onerel->rd_rel->relowner,
+	SetUserIdAndSecContext(rel->rd_rel->relowner,
 						   save_sec_context | SECURITY_RESTRICTED_OPERATION);
 	save_nestlevel = NewGUCNestLevel();
 
@@ -1930,8 +1930,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		ClusterParams cluster_params = {0};
 
 		/* close relation before vacuuming, but hold lock until commit */
-		relation_close(onerel, NoLock);
-		onerel = NULL;
+		relation_close(rel, NoLock);
+		rel = NULL;
 
 		if ((params->options & VACOPT_VERBOSE) != 0)
 			cluster_params.options |= CLUOPT_VERBOSE;
@@ -1940,7 +1940,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		cluster_rel(relid, InvalidOid, &cluster_params);
 	}
 	else
-		table_relation_vacuum(onerel, params, vac_strategy);
+		table_relation_vacuum(rel, params, vac_strategy);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -1949,8 +1949,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* all done with this class, but hold lock until commit */
-	if (onerel)
-		relation_close(onerel, NoLock);
+	if (rel)
+		relation_close(rel, NoLock);
 
 	/*
 	 * Complete the transaction and free all temporary memory used.
@@ -1971,7 +1971,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	/*
 	 * Now release the session-level lock on the main table.
 	 */
-	UnlockRelationIdForSession(&onerelid, lmode);
+	UnlockRelationIdForSession(&lockrelid, lmode);
 
 	/* Report that we really did it. */
 	return true;
-- 
2.27.0

Attachment: v10-0003-Remove-tupgone-special-case-from-vacuumlazy.c.patch (application/octet-stream)
From 56f3281bf8c5e03c9a5729cceeca8ea97906266d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 28 Mar 2021 20:55:55 -0700
Subject: [PATCH v10 3/5] Remove tupgone special case from vacuumlazy.c.

Retry the call to heap_page_prune() for the buffer being pruned and
vacuumed in rare cases where there is disagreement between the first
heap_page_prune() call and VACUUM's HeapTupleSatisfiesVacuum() call.
This was possible when a concurrently aborting transaction rendered a
live tuple dead in the tiny window between each check.  As a result,
VACUUM's definition of dead tuples (tuples that are to be deleted from
indexes during VACUUM) is simplified: it is always LP_DEAD stub line
pointers from the first scan of the heap.  Note that in general VACUUM
may not have actually done all the pruning that rendered tuples LP_DEAD.

This has the effect of decoupling index vacuuming (and heap page
vacuuming) from pruning during VACUUM's first heap pass.  The index
vacuum skipping performed by the INDEX_CLEANUP mechanism added by commit
a96c41f introduced one case where index vacuuming could be skipped, but
there are reasons to doubt that its approach was 100% robust.  Whereas
simply retrying pruning (and eliminating the tupgone steps entirely)
makes everything far simpler for heap vacuuming, and so far simpler in
general.

Heap vacuuming can now be thought of as conceptually similar to index
vacuuming and conceptually dissimilar to heap pruning.  Heap pruning now
has sole responsibility for anything involving the logical contents of
the database (e.g., managing transaction status information, recovery
conflicts, considering what to do with chains of tuples caused by
UPDATEs).  Whereas index vacuuming and heap vacuuming are now strictly
concerned with removing garbage tuples from a physical data structure
that backs the logical database.

This work enables INDEX_CLEANUP-style skipping of index vacuuming to be
pushed a lot further -- the decision can now be made dynamically (since
there is no question about leaving behind a dead tuple with storage due
to skipping the second heap pass/heap vacuuming).  An upcoming patch
from Masahiko Sawada will teach VACUUM to skip index vacuuming
dynamically, based on criteria involving the number of dead tuples.  The
only truly essential steps for VACUUM now all take place during the
first heap pass.  These are heap pruning and tuple freezing.  Everything
else is now an optional adjunct, at least in principle.

VACUUM can even change its mind about indexes (it can decide to give up
on deleting tuples from indexes).  There is no fundamental difference
between a VACUUM that decides to skip index vacuuming before it even
began, and a VACUUM that skips index vacuuming having already done a
certain amount of it.

Also remove XLOG_HEAP2_CLEANUP_INFO records.  These are no longer
necessary because we now rely entirely on heap pruning to take care of
recovery conflicts during VACUUM -- there is no longer any need to
generate extra recovery conflicts for the tupgone case, where tuples
that still have storage (i.e. are not LP_DEAD) were nevertheless treated
as dead tuples by VACUUM.  Note that heap vacuuming now uses exactly the
same strategy for recovery conflicts as index vacuuming.  Both
mechanisms now completely rely on heap pruning to generate all the
recovery conflicts that they require.

Also stop acquiring a super-exclusive lock for heap pages when they're
vacuumed during VACUUM's second heap pass.  A regular exclusive lock is
enough.  This is correct because heap page vacuuming is now strictly a
matter of setting the LP_DEAD line pointers to LP_UNUSED.  No other
backend can have a pointer to a tuple located in a pinned buffer that
can be invalidated by a concurrent heap page vacuum operation.  Note
that the page is no longer defragmented during heap page vacuuming,
because that is unsafe without a super-exclusive lock.

Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record
changes.

Credit for the idea of retrying pruning a page to avoid the tupgone case
goes to Andres Freund.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
---
 src/include/access/heapam.h              |   2 +-
 src/include/access/heapam_xlog.h         |  41 ++--
 src/backend/access/gist/gistxlog.c       |   8 +-
 src/backend/access/hash/hash_xlog.c      |   8 +-
 src/backend/access/heap/heapam.c         | 205 ++++++++----------
 src/backend/access/heap/pruneheap.c      |  60 ++++--
 src/backend/access/heap/vacuumlazy.c     | 254 ++++++++++-------------
 src/backend/access/nbtree/nbtree.c       |   8 +-
 src/backend/access/rmgrdesc/heapdesc.c   |  32 +--
 src/backend/replication/logical/decode.c |   4 +-
 src/tools/pgindent/typedefs.list         |   4 +-
 11 files changed, 291 insertions(+), 335 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ceb625e13a..e63b49fc38 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -186,7 +186,7 @@ extern int	heap_page_prune(Relation relation, Buffer buffer,
 							struct GlobalVisState *vistest,
 							TransactionId old_snap_xmin,
 							TimestampTz old_snap_ts_ts,
-							bool report_stats, TransactionId *latestRemovedXid,
+							bool report_stats,
 							OffsetNumber *off_loc);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 178d49710a..27db48184e 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -51,9 +51,9 @@
  * these, too.
  */
 #define XLOG_HEAP2_REWRITE		0x00
-#define XLOG_HEAP2_CLEAN		0x10
-#define XLOG_HEAP2_FREEZE_PAGE	0x20
-#define XLOG_HEAP2_CLEANUP_INFO 0x30
+#define XLOG_HEAP2_PRUNE		0x10
+#define XLOG_HEAP2_VACUUM		0x20
+#define XLOG_HEAP2_FREEZE_PAGE	0x30
 #define XLOG_HEAP2_VISIBLE		0x40
 #define XLOG_HEAP2_MULTI_INSERT 0x50
 #define XLOG_HEAP2_LOCK_UPDATED 0x60
@@ -227,7 +227,8 @@ typedef struct xl_heap_update
 #define SizeOfHeapUpdate	(offsetof(xl_heap_update, new_offnum) + sizeof(OffsetNumber))
 
 /*
- * This is what we need to know about vacuum page cleanup/redirect
+ * This is what we need to know about page pruning (both during VACUUM and
+ * during opportunistic pruning)
  *
  * The array of OffsetNumbers following the fixed part of the record contains:
  *	* for each redirected item: the item offset, then the offset redirected to
@@ -236,29 +237,32 @@ typedef struct xl_heap_update
  * The total number of OffsetNumbers is therefore 2*nredirected+ndead+nunused.
  * Note that nunused is not explicitly stored, but may be found by reference
  * to the total record length.
+ *
+ * Requires a super-exclusive lock.
  */
-typedef struct xl_heap_clean
+typedef struct xl_heap_prune
 {
 	TransactionId latestRemovedXid;
 	uint16		nredirected;
 	uint16		ndead;
 	/* OFFSET NUMBERS are in the block reference 0 */
-} xl_heap_clean;
+} xl_heap_prune;
 
-#define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16))
+#define SizeOfHeapPrune (offsetof(xl_heap_prune, ndead) + sizeof(uint16))
 
 /*
- * Cleanup_info is required in some cases during a lazy VACUUM.
- * Used for reporting the results of HeapTupleHeaderAdvanceLatestRemovedXid()
- * see vacuumlazy.c for full explanation
+ * The vacuum page record is similar to the prune record, but can only mark
+ * already dead items as unused
+ *
+ * Used by heap vacuuming only.  Does not require a super-exclusive lock.
  */
-typedef struct xl_heap_cleanup_info
+typedef struct xl_heap_vacuum
 {
-	RelFileNode node;
-	TransactionId latestRemovedXid;
-} xl_heap_cleanup_info;
+	uint16		nunused;
+	/* OFFSET NUMBERS are in the block reference 0 */
+} xl_heap_vacuum;
 
-#define SizeOfHeapCleanupInfo (sizeof(xl_heap_cleanup_info))
+#define SizeOfHeapVacuum (offsetof(xl_heap_vacuum, nunused) + sizeof(uint16))
 
 /* flags for infobits_set */
 #define XLHL_XMAX_IS_MULTI		0x01
@@ -397,13 +401,6 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
 extern void heap_xlog_logical_rewrite(XLogReaderState *r);
 
-extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
-										TransactionId latestRemovedXid);
-extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
-								 OffsetNumber *redirected, int nredirected,
-								 OffsetNumber *nowdead, int ndead,
-								 OffsetNumber *nowunused, int nunused,
-								 TransactionId latestRemovedXid);
 extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
 								  TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
 								  int ntuples);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1c80eae044..6464cb9281 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -184,10 +184,10 @@ gistRedoDeleteRecord(XLogReaderState *record)
 	 *
 	 * GiST delete records can conflict with standby queries.  You might think
 	 * that vacuum records would conflict as well, but we've handled that
-	 * already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual gist vacuum records on that index.
+	 * already.  XLOG_HEAP2_PRUNE records provide the highest xid cleaned by
+	 * the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual gist vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 02d9e6cdfd..af35a991fc 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -992,10 +992,10 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
 	 * Hash index records that are marked as LP_DEAD and being removed during
 	 * hash index tuple insertion can conflict with standby queries. You might
 	 * think that vacuum records would conflict as well, but we've handled
-	 * that already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
-	 * cleaned by the vacuum of the heap and so we can resolve any conflicts
-	 * just once when that arrives.  After that we know that no conflicts
-	 * exist from individual hash index vacuum records on that index.
+	 * that already.  XLOG_HEAP2_PRUNE records provide the highest xid cleaned
+	 * by the vacuum of the heap and so we can resolve any conflicts just once
+	 * when that arrives.  After that we know that no conflicts exist from
+	 * individual hash index vacuum records on that index.
 	 */
 	if (InHotStandby)
 	{
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 595310ba1b..9cbc161d7a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7538,7 +7538,7 @@ heap_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
 			 * must have considered the original tuple header as part of
 			 * generating its own latestRemovedXid value.
 			 *
-			 * Relying on XLOG_HEAP2_CLEAN records like this is the same
+			 * Relying on XLOG_HEAP2_PRUNE records like this is the same
 			 * strategy that index vacuuming uses in all cases.  Index VACUUM
 			 * WAL records don't even have a latestRemovedXid field of their
 			 * own for this reason.
@@ -7957,88 +7957,6 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
 	return nblocksfavorable;
 }
 
-/*
- * Perform XLogInsert to register a heap cleanup info message. These
- * messages are sent once per VACUUM and are required because
- * of the phasing of removal operations during a lazy VACUUM.
- * see comments for vacuum_log_cleanup_info().
- */
-XLogRecPtr
-log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
-{
-	xl_heap_cleanup_info xlrec;
-	XLogRecPtr	recptr;
-
-	xlrec.node = rnode;
-	xlrec.latestRemovedXid = latestRemovedXid;
-
-	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, SizeOfHeapCleanupInfo);
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEANUP_INFO);
-
-	return recptr;
-}
-
-/*
- * Perform XLogInsert for a heap-clean operation.  Caller must already
- * have modified the buffer and marked it dirty.
- *
- * Note: prior to Postgres 8.3, the entries in the nowunused[] array were
- * zero-based tuple indexes.  Now they are one-based like other uses
- * of OffsetNumber.
- *
- * We also include latestRemovedXid, which is the greatest XID present in
- * the removed tuples. That allows recovery processing to cancel or wait
- * for long standby queries that can still see these tuples.
- */
-XLogRecPtr
-log_heap_clean(Relation reln, Buffer buffer,
-			   OffsetNumber *redirected, int nredirected,
-			   OffsetNumber *nowdead, int ndead,
-			   OffsetNumber *nowunused, int nunused,
-			   TransactionId latestRemovedXid)
-{
-	xl_heap_clean xlrec;
-	XLogRecPtr	recptr;
-
-	/* Caller should not call me on a non-WAL-logged relation */
-	Assert(RelationNeedsWAL(reln));
-
-	xlrec.latestRemovedXid = latestRemovedXid;
-	xlrec.nredirected = nredirected;
-	xlrec.ndead = ndead;
-
-	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, SizeOfHeapClean);
-
-	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-
-	/*
-	 * The OffsetNumber arrays are not actually in the buffer, but we pretend
-	 * that they are.  When XLogInsert stores the whole buffer, the offset
-	 * arrays need not be stored too.  Note that even if all three arrays are
-	 * empty, we want to expose the buffer as a candidate for whole-page
-	 * storage, since this record type implies a defragmentation operation
-	 * even if no line pointers changed state.
-	 */
-	if (nredirected > 0)
-		XLogRegisterBufData(0, (char *) redirected,
-							nredirected * sizeof(OffsetNumber) * 2);
-
-	if (ndead > 0)
-		XLogRegisterBufData(0, (char *) nowdead,
-							ndead * sizeof(OffsetNumber));
-
-	if (nunused > 0)
-		XLogRegisterBufData(0, (char *) nowunused,
-							nunused * sizeof(OffsetNumber));
-
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEAN);
-
-	return recptr;
-}
-
 /*
  * Perform XLogInsert for a heap-freeze operation.  Caller must have already
  * modified the buffer and marked it dirty.
@@ -8510,34 +8428,15 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed,
 }
 
 /*
- * Handles CLEANUP_INFO
+ * Handles XLOG_HEAP2_PRUNE record type.
+ *
+ * Acquires a super-exclusive lock.
  */
 static void
-heap_xlog_cleanup_info(XLogReaderState *record)
-{
-	xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record);
-
-	if (InHotStandby)
-		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
-
-	/*
-	 * Actual operation is a no-op. Record type exists to provide a means for
-	 * conflict processing to occur before we begin index vacuum actions. see
-	 * vacuumlazy.c and also comments in btvacuumpage()
-	 */
-
-	/* Backup blocks are not used in cleanup_info records */
-	Assert(!XLogRecHasAnyBlockRefs(record));
-}
-
-/*
- * Handles XLOG_HEAP2_CLEAN record type
- */
-static void
-heap_xlog_clean(XLogReaderState *record)
+heap_xlog_prune(XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
-	xl_heap_clean *xlrec = (xl_heap_clean *) XLogRecGetData(record);
+	xl_heap_prune *xlrec = (xl_heap_prune *) XLogRecGetData(record);
 	Buffer		buffer;
 	RelFileNode rnode;
 	BlockNumber blkno;
@@ -8548,12 +8447,8 @@ heap_xlog_clean(XLogReaderState *record)
 	/*
 	 * We're about to remove tuples. In Hot Standby mode, ensure that there's
 	 * no queries running for which the removed tuples are still visible.
-	 *
-	 * Not all HEAP2_CLEAN records remove tuples with xids, so we only want to
-	 * conflict on the records that cause MVCC failures for user queries. If
-	 * latestRemovedXid is invalid, skip conflict processing.
 	 */
-	if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
+	if (InHotStandby)
 		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
 
 	/*
@@ -8606,7 +8501,7 @@ heap_xlog_clean(XLogReaderState *record)
 		UnlockReleaseBuffer(buffer);
 
 		/*
-		 * After cleaning records from a page, it's useful to update the FSM
+		 * After pruning records from a page, it's useful to update the FSM
 		 * about it, as it may cause the page become target for insertions
 		 * later even if vacuum decides not to visit it (which is possible if
 		 * gets marked all-visible.)
@@ -8618,6 +8513,80 @@ heap_xlog_clean(XLogReaderState *record)
 	}
 }
 
+/*
+ * Handles XLOG_HEAP2_VACUUM record type.
+ *
+ * Acquires an exclusive lock only.
+ */
+static void
+heap_xlog_vacuum(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_heap_vacuum *xlrec = (xl_heap_vacuum *) XLogRecGetData(record);
+	Buffer		buffer;
+	BlockNumber blkno;
+	XLogRedoAction action;
+
+	/*
+	 * If we have a full-page image, restore it	(without using a cleanup lock)
+	 * and we're done.
+	 */
+	action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, false,
+										   &buffer);
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		page = (Page) BufferGetPage(buffer);
+		OffsetNumber *nowunused;
+		Size		datalen;
+		OffsetNumber *offnum;
+
+		nowunused = (OffsetNumber *) XLogRecGetBlockData(record, 0, &datalen);
+
+		/* Shouldn't be a record unless there's something to do */
+		Assert(xlrec->nunused > 0);
+
+		/* Update all now-unused line pointers */
+		offnum = nowunused;
+		for (int i = 0; i < xlrec->nunused; i++)
+		{
+			OffsetNumber off = *offnum++;
+			ItemId		lp = PageGetItemId(page, off);
+
+			Assert(ItemIdIsDead(lp) && !ItemIdHasStorage(lp));
+			ItemIdSetUnused(lp);
+		}
+
+		/*
+		 * Update the page's hint bit about whether it has free pointers
+		 */
+		PageSetHasFreeLinePointers(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+
+	if (BufferIsValid(buffer))
+	{
+		Size		freespace = PageGetHeapFreeSpace(BufferGetPage(buffer));
+		RelFileNode rnode;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * After vacuuming LP_DEAD items from a page, it's useful to update
+		 * the FSM about it, as it may cause the page become target for
+		 * insertions later even if vacuum decides not to visit it (which is
+		 * possible if gets marked all-visible.)
+		 *
+		 * Do this regardless of a full-page image being applied, since the
+		 * FSM data is not in the page anyway.
+		 */
+		XLogRecordPageWithFreeSpace(rnode, blkno, freespace);
+	}
+}
+
 /*
  * Replay XLOG_HEAP2_VISIBLE record.
  *
@@ -9722,15 +9691,15 @@ heap2_redo(XLogReaderState *record)
 
 	switch (info & XLOG_HEAP_OPMASK)
 	{
-		case XLOG_HEAP2_CLEAN:
-			heap_xlog_clean(record);
+		case XLOG_HEAP2_PRUNE:
+			heap_xlog_prune(record);
+			break;
+		case XLOG_HEAP2_VACUUM:
+			heap_xlog_vacuum(record);
 			break;
 		case XLOG_HEAP2_FREEZE_PAGE:
 			heap_xlog_freeze_page(record);
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			heap_xlog_cleanup_info(record);
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			heap_xlog_visible(record);
 			break;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 8bb38d6406..f75502ca2c 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -182,13 +182,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		 */
 		if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
 		{
-			TransactionId ignore = InvalidTransactionId;	/* return value not
-															 * needed */
-
 			/* OK to prune */
 			(void) heap_page_prune(relation, buffer, vistest,
 								   limited_xmin, limited_ts,
-								   true, &ignore, NULL);
+								   true, NULL);
 		}
 
 		/* And release buffer lock */
@@ -213,8 +210,6 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * send its own new total to pgstats, and we don't want this delta applied
  * on top of that.)
  *
- * Sets latestRemovedXid for caller on return.
- *
  * off_loc is the offset location required by the caller to use in error
  * callback.
  *
@@ -225,7 +220,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 				GlobalVisState *vistest,
 				TransactionId old_snap_xmin,
 				TimestampTz old_snap_ts,
-				bool report_stats, TransactionId *latestRemovedXid,
+				bool report_stats,
 				OffsetNumber *off_loc)
 {
 	int			ndeleted = 0;
@@ -251,7 +246,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	prstate.old_snap_xmin = old_snap_xmin;
 	prstate.old_snap_ts = old_snap_ts;
 	prstate.old_snap_used = false;
-	prstate.latestRemovedXid = *latestRemovedXid;
+	prstate.latestRemovedXid = InvalidTransactionId;
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));
 
@@ -318,17 +313,41 @@ heap_page_prune(Relation relation, Buffer buffer,
 		MarkBufferDirty(buffer);
 
 		/*
-		 * Emit a WAL XLOG_HEAP2_CLEAN record showing what we did
+		 * Emit a WAL XLOG_HEAP2_PRUNE record showing what we did
 		 */
 		if (RelationNeedsWAL(relation))
 		{
+			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
 
-			recptr = log_heap_clean(relation, buffer,
-									prstate.redirected, prstate.nredirected,
-									prstate.nowdead, prstate.ndead,
-									prstate.nowunused, prstate.nunused,
-									prstate.latestRemovedXid);
+			xlrec.latestRemovedXid = prstate.latestRemovedXid;
+			xlrec.nredirected = prstate.nredirected;
+			xlrec.ndead = prstate.ndead;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, SizeOfHeapPrune);
+
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+			/*
+			 * The OffsetNumber arrays are not actually in the buffer, but we
+			 * pretend that they are.  When XLogInsert stores the whole
+			 * buffer, the offset arrays need not be stored too.
+			 */
+			if (prstate.nredirected > 0)
+				XLogRegisterBufData(0, (char *) prstate.redirected,
+									prstate.nredirected *
+									sizeof(OffsetNumber) * 2);
+
+			if (prstate.ndead > 0)
+				XLogRegisterBufData(0, (char *) prstate.nowdead,
+									prstate.ndead * sizeof(OffsetNumber));
+
+			if (prstate.nunused > 0)
+				XLogRegisterBufData(0, (char *) prstate.nowunused,
+									prstate.nunused * sizeof(OffsetNumber));
+
+			recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE);
 
 			PageSetLSN(BufferGetPage(buffer), recptr);
 		}
@@ -363,8 +382,6 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (report_stats && ndeleted > prstate.ndead)
 		pgstat_update_heap_dead_tuples(relation, ndeleted - prstate.ndead);
 
-	*latestRemovedXid = prstate.latestRemovedXid;
-
 	/*
 	 * XXX Should we update the FSM information of this page ?
 	 *
@@ -809,12 +826,8 @@ heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum)
 
 /*
  * Perform the actual page changes needed by heap_page_prune.
- * It is expected that the caller has suitable pin and lock on the
- * buffer, and is inside a critical section.
- *
- * This is split out because it is also used by heap_xlog_clean()
- * to replay the WAL record when needed after a crash.  Note that the
- * arguments are identical to those of log_heap_clean().
+ * It is expected that the caller has a super-exclusive lock on the
+ * buffer.
  */
 void
 heap_page_prune_execute(Buffer buffer,
@@ -826,6 +839,9 @@ heap_page_prune_execute(Buffer buffer,
 	OffsetNumber *offnum;
 	int			i;
 
+	/* Shouldn't be called unless there's something to do */
+	Assert(nredirected > 0 || ndead > 0 || nunused > 0);
+
 	/* Update all redirected line pointers */
 	offnum = redirected;
 	for (i = 0; i < nredirected; i++)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 5dc9ab404b..a0db90b43f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -310,7 +310,6 @@ typedef struct LVRelState
 	/* rel's initial relfrozenxid and relminmxid */
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
-	TransactionId latestRemovedXid;
 
 	/* VACUUM operation's cutoff for pruning */
 	TransactionId OldestXmin;
@@ -392,8 +391,7 @@ static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
 static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 							BlockNumber blkno, Page page,
 							GlobalVisState *vistest,
-							LVPagePruneState *prunestate,
-							VacOptTernaryValue index_cleanup);
+							LVPagePruneState *prunestate);
 static void lazy_vacuum(LVRelState *vacrel);
 static void lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void lazy_vacuum_heap_rel(LVRelState *vacrel);
@@ -556,7 +554,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->old_live_tuples = rel->rd_rel->reltuples;
 	vacrel->relfrozenxid = rel->rd_rel->relfrozenxid;
 	vacrel->relminmxid = rel->rd_rel->relminmxid;
-	vacrel->latestRemovedXid = InvalidTransactionId;
 
 	/* Set cutoffs for entire VACUUM */
 	vacrel->OldestXmin = OldestXmin;
@@ -798,40 +795,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	}
 }
 
-/*
- * For Hot Standby we need to know the highest transaction id that will
- * be removed by any change. VACUUM proceeds in a number of passes so
- * we need to consider how each pass operates. The first phase runs
- * heap_page_prune(), which can issue XLOG_HEAP2_CLEAN records as it
- * progresses - these will have a latestRemovedXid on each record.
- * In some cases this removes all of the tuples to be removed, though
- * often we have dead tuples with index pointers so we must remember them
- * for removal in phase 3. Index records for those rows are removed
- * in phase 2 and index blocks do not have MVCC information attached.
- * So before we can allow removal of any index tuples we need to issue
- * a WAL record containing the latestRemovedXid of rows that will be
- * removed in phase three. This allows recovery queries to block at the
- * correct place, i.e. before phase two, rather than during phase three
- * which would be after the rows have become inaccessible.
- */
-static void
-vacuum_log_cleanup_info(LVRelState *vacrel)
-{
-	/*
-	 * Skip this for relations for which no WAL is to be written, or if we're
-	 * not trying to support archive recovery.
-	 */
-	if (!RelationNeedsWAL(vacrel->rel) || !XLogIsNeeded())
-		return;
-
-	/*
-	 * No need to write the record at all unless it contains a valid value
-	 */
-	if (TransactionIdIsValid(vacrel->latestRemovedXid))
-		(void) log_heap_cleanup_info(vacrel->rel->rd_node,
-									 vacrel->latestRemovedXid);
-}
-
 /*
  *	lazy_scan_heap() -- scan an open heap relation
  *
@@ -1318,8 +1281,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		 * pruning.  Considers freezing XIDs in tuple headers from items not
 		 * made LP_DEAD by pruning.
 		 */
-		lazy_scan_prune(vacrel, buf, blkno, page, vistest, &prunestate,
-						params->index_cleanup);
+		lazy_scan_prune(vacrel, buf, blkno, page, vistest, &prunestate);
 
 		/* Remember the location of the last page with nonremovable tuples */
 		if (prunestate.hastup)
@@ -1606,6 +1568,27 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
  *	lazy_scan_prune() -- lazy_scan_heap() pruning and freezing.
  *
  * Caller must hold pin and buffer cleanup lock on the buffer.
+ *
+ * Prior to PostgreSQL 14 there were very rare cases where heap_page_prune()
+ * was allowed to disagree with our HeapTupleSatisfiesVacuum() call about
+ * whether or not a tuple should be considered DEAD.  This happened when an
+ * inserting transaction concurrently aborted (after our heap_page_prune()
+ * call, before our HeapTupleSatisfiesVacuum() call).  Aborted transactions
+ * have tuples that we can treat as DEAD without caring about where their
+ * tuple header XIDs are with respect to the OldestXid cutoff.
+ *
+ * This created rare, hard to test cases -- exceptions to the general rule
+ * that TIDs that we enter into the dead_tuples array are in fact just LP_DEAD
+ * items without storage.  We had rather a lot of complexity to account for
+ * tuples that were dead, but still had storage, and so still had a tuple
+ * header with XIDs that were not quite unambiguously after the FreezeLimit
+ * limit.
+ *
+ * The approach we take here now is a little crude, but it's also simple and
+ * robust: we restart pruning when the race condition is detected.  This
+ * guarantees that any items that make it into the dead_tuples array are
+ * simple LP_DEAD line pointers, and that every item with tuple storage is
+ * considered as a candidate for freezing.
  */
 static void
 lazy_scan_prune(LVRelState *vacrel,
@@ -1613,14 +1596,14 @@ lazy_scan_prune(LVRelState *vacrel,
 				BlockNumber blkno,
 				Page page,
 				GlobalVisState *vistest,
-				LVPagePruneState *prunestate,
-				VacOptTernaryValue index_cleanup)
+				LVPagePruneState *prunestate)
 {
 	Relation	rel = vacrel->rel;
 	OffsetNumber offnum,
 				maxoff;
 	ItemId		itemid;
 	HeapTupleData tuple;
+	HTSV_Result res;
 	int			tuples_deleted,
 				lpdead_items,
 				new_dead_tuples,
@@ -1632,6 +1615,8 @@ lazy_scan_prune(LVRelState *vacrel,
 
 	maxoff = PageGetMaxOffsetNumber(page);
 
+retry:
+
 	/* Initialize (or reset) page-level counters */
 	tuples_deleted = 0;
 	lpdead_items = 0;
@@ -1650,7 +1635,6 @@ lazy_scan_prune(LVRelState *vacrel,
 	 */
 	tuples_deleted = heap_page_prune(rel, buf, vistest,
 									 InvalidTransactionId, 0, false,
-									 &vacrel->latestRemovedXid,
 									 &vacrel->offnum);
 
 	/*
@@ -1669,7 +1653,6 @@ lazy_scan_prune(LVRelState *vacrel,
 		 offnum = OffsetNumberNext(offnum))
 	{
 		bool		tuple_totally_frozen;
-		bool		tupgone = false;
 
 		/*
 		 * Set the offset number so that we can display it along with any
@@ -1720,6 +1703,17 @@ lazy_scan_prune(LVRelState *vacrel,
 		tuple.t_len = ItemIdGetLength(itemid);
 		tuple.t_tableOid = RelationGetRelid(rel);
 
+		/*
+		 * DEAD tuples are almost always pruned into LP_DEAD line pointers by
+		 * heap_page_prune(), but it's possible that the tuple state changed
+		 * since heap_page_prune() looked.  Handle that here by restarting.
+		 * (See comments at the top of function for a full explanation.)
+		 */
+		res = HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf);
+
+		if (unlikely(res == HEAPTUPLE_DEAD))
+			goto retry;
+
 		/*
 		 * The criteria for counting a tuple as live in this block need to
 		 * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
@@ -1730,42 +1724,8 @@ lazy_scan_prune(LVRelState *vacrel,
 		 * VACUUM can't run inside a transaction block, which makes some cases
 		 * impossible (e.g. in-progress insert from the same transaction).
 		 */
-		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+		switch (res)
 		{
-			case HEAPTUPLE_DEAD:
-
-				/*
-				 * Ordinarily, DEAD tuples would have been removed by
-				 * heap_page_prune(), but it's possible that the tuple state
-				 * changed since heap_page_prune() looked.  In particular an
-				 * INSERT_IN_PROGRESS tuple could have changed to DEAD if the
-				 * inserter aborted.  So this cannot be considered an error
-				 * condition.
-				 *
-				 * If the tuple is HOT-updated then it must only be removed by
-				 * a prune operation; so we keep it just as if it were
-				 * RECENTLY_DEAD.  Also, if it's a heap-only tuple, we choose
-				 * to keep it, because it'll be a lot cheaper to get rid of it
-				 * in the next pruning pass than to treat it like an indexed
-				 * tuple. Finally, if index cleanup is disabled, the second
-				 * heap pass will not execute, and the tuple will not get
-				 * removed, so we must treat it like any other dead tuple that
-				 * we choose to keep.
-				 *
-				 * If this were to happen for a tuple that actually needed to
-				 * be deleted, we'd be in trouble, because it'd possibly leave
-				 * a tuple below the relation's xmin horizon alive.
-				 * heap_prepare_freeze_tuple() is prepared to detect that case
-				 * and abort the transaction, preventing corruption.
-				 */
-				if (HeapTupleIsHotUpdated(&tuple) ||
-					HeapTupleIsHeapOnly(&tuple) ||
-					index_cleanup == VACOPT_TERNARY_DISABLED)
-					new_dead_tuples++;
-				else
-					tupgone = true; /* we can delete the tuple */
-				prunestate->all_visible = false;
-				break;
 			case HEAPTUPLE_LIVE:
 
 				/*
@@ -1845,46 +1805,32 @@ lazy_scan_prune(LVRelState *vacrel,
 				break;
 		}
 
-		if (tupgone)
+		/*
+		 * Non-removable tuple (i.e. tuple with storage).
+		 *
+		 * Check tuple left behind after pruning to see if needs to be frozen
+		 * now.
+		 */
+		num_tuples++;
+		prunestate->hastup = true;
+		if (heap_prepare_freeze_tuple(tuple.t_data,
+									  vacrel->relfrozenxid,
+									  vacrel->relminmxid,
+									  vacrel->FreezeLimit,
+									  vacrel->MultiXactCutoff,
+									  &frozen[nfrozen],
+									  &tuple_totally_frozen))
 		{
-			/* Pretend that this is an LP_DEAD item  */
-			deadoffsets[lpdead_items++] = offnum;
-			prunestate->all_visible = false;
-			prunestate->has_lpdead_items = true;
-
-			/* But remember it for XLOG_HEAP2_CLEANUP_INFO record */
-			HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data,
-												   &vacrel->latestRemovedXid);
+			/* Will execute freeze below */
+			frozen[nfrozen++].offset = offnum;
 		}
-		else
-		{
-			/*
-			 * Non-removable tuple (i.e. tuple with storage).
-			 *
-			 * Check tuple left behind after pruning to see if needs to be frozen
-			 * now.
-			 */
-			num_tuples++;
-			prunestate->hastup = true;
-			if (heap_prepare_freeze_tuple(tuple.t_data,
-										  vacrel->relfrozenxid,
-										  vacrel->relminmxid,
-										  vacrel->FreezeLimit,
-										  vacrel->MultiXactCutoff,
-										  &frozen[nfrozen],
-										  &tuple_totally_frozen))
-			{
-				/* Will execute freeze below */
-				frozen[nfrozen++].offset = offnum;
-			}
 
-			/*
-			 * If tuple is not frozen (and not about to become frozen) then caller
-			 * had better not go on to set this page's VM bit
-			 */
-			if (!tuple_totally_frozen)
-				prunestate->all_frozen = false;
-		}
+		/*
+		 * If tuple is not frozen (and not about to become frozen) then caller
+		 * had better not go on to set this page's VM bit
+		 */
+		if (!tuple_totally_frozen)
+			prunestate->all_frozen = false;
 	}
 
 	/*
@@ -1895,9 +1841,6 @@ lazy_scan_prune(LVRelState *vacrel,
 	 *
 	 * Add page level counters to caller's counts, and then actually process
 	 * LP_DEAD and LP_NORMAL items.
-	 *
-	 * TODO: Remove tupgone logic entirely in next commit -- we shouldn't have
-	 * to pretend that DEAD items are LP_DEAD items.
 	 */
 	vacrel->offnum = InvalidOffsetNumber;
 
@@ -2067,9 +2010,6 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
 	Assert(MultiXactIdIsValid(vacrel->relminmxid));
 
-	/* Log cleanup info before we touch indexes */
-	vacuum_log_cleanup_info(vacrel);
-
 	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
@@ -2092,6 +2032,14 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 		do_parallel_lazy_vacuum_all_indexes(vacrel);
 	}
 
+	/*
+	 * We delete all LP_DEAD items from the first heap pass in all indexes on
+	 * each call here.  This makes the next call to lazy_vacuum_heap_rel()
+	 * safe.
+	 */
+	Assert(vacrel->num_index_scans > 0 ||
+		   vacrel->dead_tuples->num_tuples == vacrel->lpdead_items);
+
 	/* Increase and report the number of index scans */
 	vacrel->num_index_scans++;
 	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
@@ -2101,9 +2049,9 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 /*
  *	lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
  *
- * This routine marks dead tuples as unused and compacts out free space on
- * their pages.  Pages not having dead tuples recorded from lazy_scan_heap are
- * not visited at all.
+ * This routine marks LP_DEAD items in vacrel->dead_tuples array as LP_UNUSED.
+ * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
+ * at all.
  *
  * Note: the reason for doing this as a second pass is we cannot remove the
  * tuples until we've removed their index entries, and we want to process
@@ -2148,16 +2096,11 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 		vacrel->blkno = tblk;
 		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
 								 vacrel->bstrategy);
-		if (!ConditionalLockBufferForCleanup(buf))
-		{
-			ReleaseBuffer(buf);
-			++tupindex;
-			continue;
-		}
+		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 		tupindex = lazy_vacuum_heap_page(vacrel, tblk, buf, tupindex,
 										 &vmbuffer);
 
-		/* Now that we've compacted the page, record its available space */
+		/* Now that we've vacuumed the page, record its available space */
 		page = BufferGetPage(buf);
 		freespace = PageGetHeapFreeSpace(page);
 
@@ -2175,6 +2118,14 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 		vmbuffer = InvalidBuffer;
 	}
 
+	/*
+	 * We set all LP_DEAD items from the first heap pass to LP_UNUSED during
+	 * the second heap pass.  No more, no less.
+	 */
+	Assert(vacrel->num_index_scans > 1 ||
+		   (tupindex == vacrel->lpdead_items &&
+			vacuumed_pages == vacrel->lpdead_item_pages));
+
 	ereport(elevel,
 			(errmsg("\"%s\": removed %d dead item identifiers in %u pages",
 					vacrel->relname, tupindex, vacuumed_pages),
@@ -2185,14 +2136,22 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 }
 
 /*
- *	lazy_vacuum_heap_page() -- free dead tuples on a page
- *						  and repair its fragmentation.
+ *	lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
+ *						  vacrel->dead_tuples array.
  *
- * Caller must hold pin and buffer cleanup lock on the buffer.
+ * Caller must have an exclusive buffer lock on the buffer (though a
+ * super-exclusive lock is also acceptable).
  *
  * tupindex is the index in vacrel->dead_tuples of the first dead tuple for
  * this page.  We assume the rest follow sequentially.  The return value is
  * the first tupindex after the tuples of this page.
+ *
+ * Prior to PostgreSQL 14 there were rare cases where this routine had to set
+ * tuples with storage to unused.  These days it is strictly responsible for
+ * marking LP_DEAD stub line pointers as unused.  This only happens for those
+ * LP_DEAD items on the page that were determined to be LP_DEAD items back
+ * when the same page was visited by lazy_scan_prune() (i.e. those whose TID
+ * was recorded in the dead_tuples array).
  */
 static int
 lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
@@ -2228,11 +2187,15 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 			break;				/* past end of tuples for this block */
 		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
 		itemid = PageGetItemId(page, toff);
+
+		Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
 		ItemIdSetUnused(itemid);
 		unused[uncnt++] = toff;
 	}
 
-	PageRepairFragmentation(page);
+	Assert(uncnt > 0);
+
+	PageSetHasFreeLinePointers(page);
 
 	/*
 	 * Mark buffer dirty before we write WAL.
@@ -2242,12 +2205,19 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	/* XLOG stuff */
 	if (RelationNeedsWAL(vacrel->rel))
 	{
+		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
 
-		recptr = log_heap_clean(vacrel->rel, buffer,
-								NULL, 0, NULL, 0,
-								unused, uncnt,
-								vacrel->latestRemovedXid);
+		xlrec.nunused = uncnt;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHeapVacuum);
+
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) unused, uncnt * sizeof(OffsetNumber));
+
+		recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VACUUM);
+
 		PageSetLSN(page, recptr);
 	}
 
@@ -2260,10 +2230,10 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	END_CRIT_SECTION();
 
 	/*
-	 * Now that we have removed the dead tuples from the page, once again
+	 * Now that we have removed the LP_DEAD items from the page, once again
 	 * check if the page has become all-visible.  The page is already marked
 	 * dirty, exclusively locked, and, if needed, a full page image has been
-	 * emitted in the log_heap_clean() above.
+	 * emitted.
 	 */
 	if (heap_page_is_all_visible(vacrel, buffer, &visibility_cutoff_xid,
 								 &all_frozen))
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 9282c9ea22..1360ab80c1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1213,10 +1213,10 @@ backtrack:
 				 * as long as the callback function only considers whether the
 				 * index tuple refers to pre-cutoff heap tuples that were
 				 * certainly already pruned away during VACUUM's initial heap
-				 * scan by the time we get here. (heapam's XLOG_HEAP2_CLEAN
-				 * and XLOG_HEAP2_CLEANUP_INFO records produce conflicts using
-				 * a latestRemovedXid value for the pointed-to heap tuples, so
-				 * there is no need to produce our own conflict now.)
+				 * scan by the time we get here. (heapam's XLOG_HEAP2_PRUNE
+				 * records produce conflicts using a latestRemovedXid value
+				 * for the pointed-to heap tuples, so there is no need to
+				 * produce our own conflict now.)
 				 *
 				 * Backends with snapshots acquired after a VACUUM starts but
 				 * before it finishes could have visibility cutoff with a
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index e60e32b935..f8b4fb901b 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -121,11 +121,21 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
 	info &= XLOG_HEAP_OPMASK;
-	if (info == XLOG_HEAP2_CLEAN)
+	if (info == XLOG_HEAP2_PRUNE)
 	{
-		xl_heap_clean *xlrec = (xl_heap_clean *) rec;
+		xl_heap_prune *xlrec = (xl_heap_prune *) rec;
 
-		appendStringInfo(buf, "latestRemovedXid %u", xlrec->latestRemovedXid);
+		/* XXX Should display implicit 'nunused' field, too */
+		appendStringInfo(buf, "latestRemovedXid %u nredirected %u ndead %u",
+						 xlrec->latestRemovedXid,
+						 xlrec->nredirected,
+						 xlrec->ndead);
+	}
+	else if (info == XLOG_HEAP2_VACUUM)
+	{
+		xl_heap_vacuum *xlrec = (xl_heap_vacuum *) rec;
+
+		appendStringInfo(buf, "nunused %u", xlrec->nunused);
 	}
 	else if (info == XLOG_HEAP2_FREEZE_PAGE)
 	{
@@ -134,12 +144,6 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "cutoff xid %u ntuples %u",
 						 xlrec->cutoff_xid, xlrec->ntuples);
 	}
-	else if (info == XLOG_HEAP2_CLEANUP_INFO)
-	{
-		xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
-
-		appendStringInfo(buf, "latestRemovedXid %u", xlrec->latestRemovedXid);
-	}
 	else if (info == XLOG_HEAP2_VISIBLE)
 	{
 		xl_heap_visible *xlrec = (xl_heap_visible *) rec;
@@ -229,15 +233,15 @@ heap2_identify(uint8 info)
 
 	switch (info & ~XLR_INFO_MASK)
 	{
-		case XLOG_HEAP2_CLEAN:
-			id = "CLEAN";
+		case XLOG_HEAP2_PRUNE:
+			id = "PRUNE";
+			break;
+		case XLOG_HEAP2_VACUUM:
+			id = "VACUUM";
 			break;
 		case XLOG_HEAP2_FREEZE_PAGE:
 			id = "FREEZE_PAGE";
 			break;
-		case XLOG_HEAP2_CLEANUP_INFO:
-			id = "CLEANUP_INFO";
-			break;
 		case XLOG_HEAP2_VISIBLE:
 			id = "VISIBLE";
 			break;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 97be4b0f23..9aab713684 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -484,8 +484,8 @@ DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 * interested in.
 			 */
 		case XLOG_HEAP2_FREEZE_PAGE:
-		case XLOG_HEAP2_CLEAN:
-		case XLOG_HEAP2_CLEANUP_INFO:
+		case XLOG_HEAP2_PRUNE:
+		case XLOG_HEAP2_VACUUM:
 		case XLOG_HEAP2_VISIBLE:
 		case XLOG_HEAP2_LOCK_UPDATED:
 			break;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e6777e9d0..0a75dccb93 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3554,8 +3554,6 @@ xl_hash_split_complete
 xl_hash_squeeze_page
 xl_hash_update_meta_page
 xl_hash_vacuum_one_page
-xl_heap_clean
-xl_heap_cleanup_info
 xl_heap_confirm
 xl_heap_delete
 xl_heap_freeze_page
@@ -3567,9 +3565,11 @@ xl_heap_lock
 xl_heap_lock_updated
 xl_heap_multi_insert
 xl_heap_new_cid
+xl_heap_prune
 xl_heap_rewrite_mapping
 xl_heap_truncate
 xl_heap_update
+xl_heap_vacuum
 xl_heap_visible
 xl_invalid_page
 xl_invalid_page_key
-- 
2.27.0

Attachment: v10-0004-Truncate-line-pointer-array-during-VACUUM.patch (application/octet-stream)
From 4d9d3ef3e6870228304a478a06f75ecd629f2db1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 30 Mar 2021 19:43:06 -0700
Subject: [PATCH v10 4/5] Truncate line pointer array during VACUUM.

Truncate each heap page's line pointer array when a contiguous group of
LP_UNUSED item pointers appears at the end of the array.  This happens
during VACUUM's second pass over the heap.  In practice most affected
LP_UNUSED line pointers are truncated away at the same point that VACUUM
marks them LP_UNUSED (from LP_DEAD).

This is particularly helpful with queue-like workloads that have
successive related range DELETEs and multi-row INSERT queries.  VACUUM
can reclaim all of the space on each page, making it more likely that
space utilization will be stable over time (a lower heap fill factor is
still essential, though).

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com
---
 src/include/storage/bufpage.h        |  1 +
 src/backend/access/heap/heapam.c     | 15 +++--
 src/backend/access/heap/pruneheap.c  |  1 +
 src/backend/access/heap/vacuumlazy.c | 16 +++++-
 src/backend/storage/page/bufpage.c   | 82 +++++++++++++++++++++++++++-
 5 files changed, 107 insertions(+), 8 deletions(-)

diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 359b749f7f..c86ccdaf60 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -441,6 +441,7 @@ extern Page PageGetTempPageCopy(Page page);
 extern Page PageGetTempPageCopySpecial(Page page);
 extern void PageRestoreTempPage(Page tempPage, Page oldPage);
 extern void PageRepairFragmentation(Page page);
+extern void PageTruncateLinePointerArray(Page page);
 extern Size PageGetFreeSpace(Page page);
 extern Size PageGetFreeSpaceForMultipleTuples(Page page, int ntups);
 extern Size PageGetExactFreeSpace(Page page);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9cbc161d7a..db84c6f882 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -635,8 +635,15 @@ heapgettup(HeapScanDesc scan,
 		}
 		else
 		{
+			/* 
+			 * The previous returned tuple may have been vacuumed since the
+			 * previous scan when we use a non-MVCC snapshot, so we must
+			 * re-establish the lineoff <= PageGetMaxOffsetNumber(dp)
+			 * invariant
+			 */
 			lineoff =			/* previous offnum */
-				OffsetNumberPrev(ItemPointerGetOffsetNumber(&(tuple->t_self)));
+				Min(lines,
+					OffsetNumberPrev(ItemPointerGetOffsetNumber(&(tuple->t_self))));
 		}
 		/* page and lineoff now reference the physically previous tid */
 
@@ -8556,10 +8563,8 @@ heap_xlog_vacuum(XLogReaderState *record)
 			ItemIdSetUnused(lp);
 		}
 
-		/*
-		 * Update the page's hint bit about whether it has free pointers
-		 */
-		PageSetHasFreeLinePointers(page);
+		/* Attempt to truncate line pointer array now */
+		PageTruncateLinePointerArray(page);
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index f75502ca2c..3c8dc0af18 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -962,6 +962,7 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
 		 */
 		for (;;)
 		{
+			Assert(OffsetNumberIsValid(nextoffnum) && nextoffnum <= maxoff);
 			lp = PageGetItemId(page, nextoffnum);
 
 			/* Check for broken chains */
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a0db90b43f..c7bb0b1f23 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1448,7 +1448,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
 		{
 			/*
-			 * Wait until lazy_vacuum_heap_rel() to save free space.
+			 * Wait until lazy_vacuum_heap_rel() to save free space.  This
+			 * doesn't just save us some cycles; it also allows us to record
+			 * any additional free space that lazy_vacuum_heap_page() will
+			 * make available in cases where it's possible to truncate the
+			 * page's line pointer array.
 			 *
 			 * Note that the one-pass (no indexes) case is only supposed to
 			 * make it this far when there were no LP_DEAD items during
@@ -2053,6 +2057,13 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
  * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
  * at all.
  *
+ * We may also be able to truncate the line pointer array of the heap pages we
+ * visit.  If there is a contiguous group of LP_UNUSED items at the end of the
+ * array, it can be reclaimed as free space.  These LP_UNUSED items usually
+ * start out as LP_DEAD items recorded by lazy_scan_prune (we set items from
+ * each page to LP_UNUSED, and then consider if it's possible to truncate the
+ * page's line pointer array).
+ *
  * Note: the reason for doing this as a second pass is we cannot remove the
  * tuples until we've removed their index entries, and we want to process
  * index entry removal in batches as large as possible.
@@ -2195,7 +2206,8 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 
 	Assert(uncnt > 0);
 
-	PageSetHasFreeLinePointers(page);
+	/* Attempt to truncate line pointer array now */
+	PageTruncateLinePointerArray(page);
 
 	/*
 	 * Mark buffer dirty before we write WAL.
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 5d5989c2f5..6e820cd675 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -676,9 +676,10 @@ compactify_tuples(itemIdCompact itemidbase, int nitems, Page page, bool presorte
  * PageRepairFragmentation
  *
  * Frees fragmented space on a page.
- * It doesn't remove unused line pointers! Please don't change this.
  *
  * This routine is usable for heap pages only, but see PageIndexMultiDelete.
+ * It never removes unused line pointers, though PageTruncateLinePointerArray
+ * will do so at the point that VACUUM sets LP_DEAD items to LP_UNUSED.
  *
  * Caller had better have a super-exclusive lock on page's buffer.  As a side
  * effect the page's PD_HAS_FREE_LINES hint bit will be set or unset as
@@ -784,6 +785,85 @@ PageRepairFragmentation(Page page)
 		PageClearHasFreeLinePointers(page);
 }
 
+/*
+ * PageTruncateLinePointerArray
+ *
+ * Removes unused line pointers at the end of the line pointer array.
+ *
+ * This routine is usable for heap pages only.  It is called by VACUUM during
+ * its second pass over the heap.  We expect that there will be at least one
+ * LP_UNUSED line pointer on the page (if VACUUM didn't have an LP_DEAD item
+ * to set LP_UNUSED on the page then it wouldn't have visited the page).
+ *
+ * Caller can have either an exclusive lock or a super-exclusive lock on
+ * page's buffer.  As a side effect the page's PD_HAS_FREE_LINES hint bit will
+ * be set as needed.
+ */
+void
+PageTruncateLinePointerArray(Page page)
+{
+	PageHeader	phdr = (PageHeader) page;
+	ItemId		lp;
+	bool		truncating = true;
+	bool		sethint = false;
+	int			nline,
+				nunusedend;
+
+	/*
+	 * Scan original line pointer array backwards to determine how far to
+	 * truncate it.  Note that we avoid truncating the line pointer array to
+	 * 0 items in all cases.
+	 */
+	nline = PageGetMaxOffsetNumber(page);
+	nunusedend = 0;
+
+	for (int i = nline; i >= FirstOffsetNumber; i--)
+	{
+		lp = PageGetItemId(page, i);
+
+		if (truncating && i > FirstOffsetNumber)
+		{
+			/*
+			 * Still counting which line pointers from the end of the array
+			 * can be truncated away.
+			 *
+			 * If this is another LP_UNUSED line pointer (or the first), count
+			 * it among those we'll truncate.  Otherwise stop considering
+			 * further LP_UNUSED line pointers for truncation, but continue
+			 * with scan of array in any case.
+			 */
+			if (!ItemIdIsUsed(lp))
+				nunusedend++;
+			else
+				truncating = false;
+		}
+		else
+		{
+			if (!ItemIdIsUsed(lp))
+			{
+				/*
+				 * This is an unused line pointer that we won't be truncating
+				 * away -- so there is at least one.  Set hint on page.
+				 */
+				sethint = true;
+				break;
+			}
+		}
+	}
+
+	Assert(nline > nunusedend);
+	if (nunusedend > 0)
+		phdr->pd_lower -= sizeof(ItemIdData) * nunusedend;
+	else
+		Assert(sethint);
+
+	/* Set hint bit for PageAddItemExtended */
+	if (sethint)
+		PageSetHasFreeLinePointers(page);
+	else
+		PageClearHasFreeLinePointers(page);
+}
+
 /*
  * PageGetFreeSpace
  *		Returns the size of the free (allocatable) space on a page,
-- 
2.27.0

#99Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#98)
Re: New IndexAM API controlling index vacuum strategies

On Sun, Apr 4, 2021 at 11:00 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Apr 1, 2021 at 8:25 PM Peter Geoghegan <pg@bowt.ie> wrote:

I've also found a way to further simplify the table-without-indexes
case: make it behave like a regular two-pass/has-indexes VACUUM with
regard to visibility map stuff when the page doesn't need a call to
lazy_vacuum_heap() (because there are no LP_DEAD items to set
LP_UNUSED on the page following pruning). But when it does call
lazy_vacuum_heap(), the call takes care of everything for
lazy_scan_heap(), which just continues to the next page due to
considering prunestate to have been "invalidated" by the call to
lazy_vacuum_heap(). So there is absolutely minimal special case code
for the table-without-indexes case now.

Attached is v10, which simplifies the one-pass/table-without-indexes
VACUUM as described.

Thank you for updating the patch.

* I now include a modified version of Matthias van de Meent's line
pointer truncation patch [1].

Matthias' patch seems very much in scope here. The broader patch
series establishes the principle that we can leave LP_DEAD line
pointers in an unreclaimed state indefinitely, without consequence
(beyond the obvious). We had better avoid line pointer bloat that
cannot be reversed when VACUUM does eventually get around to doing a
second pass over the heap. This is another case where it seems prudent
to keep the costs understandable/linear -- page-level line pointer
bloat seems like a cost that increases in a non-linear fashion, which
undermines the whole idea of modelling when it's okay to skip
index/heap vacuuming. (Also, line pointer bloat sucks.)

Line pointer truncation doesn't happen during pruning, as it did in
Matthias' original patch. In this revised version, line pointer
truncation occurs during the second phase of VACUUM. There are several
reasons to prefer this approach. It seems both safer and more useful
that way (compared to the approach of doing line pointer truncation
during pruning). It also makes intuitive sense to do it this way, at
least to me -- the second pass over the heap is supposed to be for
"freeing" LP_DEAD line pointers.

+1

0002, 0003, and 0004 patches look good to me. 0001 and 0005 also look
good to me but I have some trivial review comments on them.

0001 patch:

                /*
-                * Now that stats[idx] points to the DSM segment, we don't need the
-                * locally allocated results.
+                * Now that top-level indstats[idx] points to the DSM segment, we
+                * don't need the locally allocated results.
                 */
-               pfree(*stats);
-               *stats = bulkdelete_res;
+               pfree(istat);
+               istat = bulkdelete_res;

Did you try the change around parallel_process_one_index() that I
suggested in the previous reply[1]? If we don't change the logic, we
need to update the above comment. Previously, we updated stats[idx] in
vacuum_one_index() (renamed to parallel_process_one_index()), but with
your patch it is updated in its caller.

---
+lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
-       Assert(!IsParallelWorker());
-       Assert(nindexes > 0);
+       Assert(vacrel->nindexes > 0);
+       Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
+       Assert(MultiXactIdIsValid(vacrel->relminmxid));

and

-       Assert(!IsParallelWorker());
-       Assert(nindexes > 0);
+       Assert(vacrel->nindexes > 0);

We removed two Assert(!IsParallelWorker()) at two places. It seems to
me that those assertions are still valid. Do we really need to remove
them?

---
0004 patch:

src/backend/access/heap/heapam.c:638: trailing whitespace.
+ /*

I found a whitespace issue.

---
0005 patch:

+ * Caller is expected to call here before and after vacuuming each index in
+ * the case of two-pass VACUUM, or every BYPASS_EMERGENCY_MIN_PAGES blocks in
+ * the case of no-indexes/one-pass VACUUM.

I think it should be "every VACUUM_FSM_EVERY_PAGES blocks" instead of
"every BYPASS_EMERGENCY_MIN_PAGES blocks".

---
+/*
+ * Threshold that controls whether we bypass index vacuuming and heap
+ * vacuuming.  When we're under the threshold they're deemed unnecessary.
+ * BYPASS_THRESHOLD_PAGES is applied as a multiplier on the table's rel_pages
+ * for those pages known to contain one or more LP_DEAD items.
+ */
+#define BYPASS_THRESHOLD_PAGES 0.02    /* i.e. 2% of rel_pages */
+
+#define BYPASS_EMERGENCY_MIN_PAGES \
+   ((BlockNumber) (((uint64) 4 * 1024 * 1024 * 1024) / BLCKSZ))
+

I think we need a description for BYPASS_EMERGENCY_MIN_PAGES.
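
(For reference, BYPASS_THRESHOLD_PAGES at 0.02 means that, roughly
speaking, a 1,000,000-page table is only considered for the bypass
while fewer than 20,000 of its pages are known to contain LP_DEAD
items -- modulo whatever other conditions the patch applies.)  For
BYPASS_EMERGENCY_MIN_PAGES, the description could be as small as
spelling out the intent and the arithmetic; a sketch only, with the
block count assuming the default 8kB BLCKSZ:

/*
 * Tables smaller than this number of heap blocks (4GB worth, i.e.
 * 524288 blocks with the default 8kB BLCKSZ) skip the repeated
 * emergency failsafe checks during the heap scan, to save cycles.
 */
#define BYPASS_EMERGENCY_MIN_PAGES \
    ((BlockNumber) (((uint64) 4 * 1024 * 1024 * 1024) / BLCKSZ))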

---
for (int idx = 0; idx < vacrel->nindexes; idx++)
{
Relation indrel = vacrel->indrels[idx];
IndexBulkDeleteResult *istat = vacrel->indstats[idx];

            vacrel->indstats[idx] =
                lazy_vacuum_one_index(indrel, istat, vacrel->old_live_tuples,
                                      vacrel);
+
+           if (should_speedup_failsafe(vacrel))
+           {
+               /* Wraparound emergency -- end current index scan */
+               allindexes = false;
+               break;
+           }

allindexes can be false even if we process all indexes, which is fine
with me because setting allindexes = false disables the subsequent
heap vacuuming. I think it's appropriate behavior in emergency cases.
In that sense, can we do should_speedup_failsafe() check also after
parallel index vacuuming? And we can also check it at the beginning of
lazy vacuum.
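
(For the "at the beginning" part, that could be as simple as one extra
call before the main heap scan starts -- a sketch only, reusing the
patch's names:

    /*
     * Check whether the failsafe already needs to kick in before we
     * even begin the heap scan, e.g. when relfrozenxid is already
     * dangerously old at the time this VACUUM starts.
     */
    should_speedup_failsafe(vacrel);

The "after parallel index vacuuming" check would presumably sit right
after do_parallel_lazy_vacuum_all_indexes() returns in the leader.)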

Regards,

[1]: /messages/by-id/CAD21AoDOWo4H6vmtLZoJ2SznMp_zOej2Kww+JBkVRPXs+j48uw@mail.gmail.com

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#100Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Peter Geoghegan (#98)
Re: New IndexAM API controlling index vacuum strategies

On Sun, 4 Apr 2021 at 04:00, Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Apr 1, 2021 at 8:25 PM Peter Geoghegan <pg@bowt.ie> wrote:

I've also found a way to further simplify the table-without-indexes
case: make it behave like a regular two-pass/has-indexes VACUUM with
regard to visibility map stuff when the page doesn't need a call to
lazy_vacuum_heap() (because there are no LP_DEAD items to set
LP_UNUSED on the page following pruning). But when it does call
lazy_vacuum_heap(), the call takes care of everything for
lazy_scan_heap(), which just continues to the next page due to
considering prunestate to have been "invalidated" by the call to
lazy_vacuum_heap(). So there is absolutely minimal special case code
for the table-without-indexes case now.

Attached is v10, which simplifies the one-pass/table-without-indexes
VACUUM as described.

Great!

Other changes (some of which are directly related to the
one-pass/table-without-indexes refactoring):

* The second patch no longer breaks up lazy_scan_heap() into multiple
functions -- we only retain the lazy_scan_prune() function, which is
the one that I find very compelling.

This addresses Robert's concern about the functions -- I think that
it's much better this way, now that I see it.

* No more diff churn in the first patch. This was another concern held
by Robert, as well as by Masahiko.

In general both the first and second patches are much easier to follow now.

* The emergency mechanism is now able to kick in when we happen to be
doing a one-pass/table-without-indexes VACUUM -- no special
cases/"weasel words" are needed.

* Renamed "onerel" to "rel" in the first patch, per Robert's suggestion.

* Fixed various specific issues raised by Masahiko's review,
particularly in the first patch and last patch in the series.

Finally, there is a new patch added to the series in v10:

* I now include a modified version of Matthias van de Meent's line
pointer truncation patch [1].

Thanks for notifying. I've noticed that you've based this on v3 of
that patch, and consequently has at least one significant bug that I
fixed in v5 of that patchset:

0004:

@@ -962,6 +962,7 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
*/
for (;;)
{
+            Assert(OffsetNumberIsValid(nextoffnum) && nextoffnum <= maxoff);
lp = PageGetItemId(page, nextoffnum);

/* Check for broken chains */

This assertion is false, and should be a guarding if-statement. HOT
redirect pointers are not updated if the tuple they're pointing to is
vacuumed (i.e. when it was never committed) so this nextoffnum might
in a correctly working system point past maxoff.
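
Something along these lines in its place (just a sketch):

            /*
             * Stop following the chain if nextoffnum no longer points
             * into the page's line pointer array; this can happen when
             * the (never-committed) tuple it pointed to has since been
             * vacuumed away.
             */
            if (nextoffnum < FirstOffsetNumber || nextoffnum > maxoff)
                break;

            lp = PageGetItemId(page, nextoffnum);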

Line pointer truncation doesn't happen during pruning, as it did in
Matthias' original patch. In this revised version, line pointer
truncation occurs during the second phase of VACUUM. There are several
reasons to prefer this approach. It seems both safer and more useful
that way (compared to the approach of doing line pointer truncation
during pruning). It also makes intuitive sense to do it this way, at
least to me -- the second pass over the heap is supposed to be for
"freeing" LP_DEAD line pointers.

Good catch for running a line pointer truncating pass at the second
pass over the heap in VACUUM, but I believe that it is also very
useful for pruning. Line pointer bloat due to excessive HOT chains
cannot be undone until the 2nd run of VACUUM happens with this patch,
which is a missed chance for all non-vacuum pruning.

Many workloads rely heavily on opportunistic pruning. With a workload
that benefits a lot from HOT (e.g. pgbench with heap fillfactor
reduced to 90), there are many LP_UNUSED line pointers, even though we
may never have a VACUUM that actually performs a second heap pass
(because LP_DEAD items cannot accumulate in heap pages). Prior to the
HOT commit in 2007, LP_UNUSED line pointers were strictly something
that VACUUM created from dead tuples. It seems to me that we should
only target the latter "category" of LP_UNUSED line pointers when
considering truncating the array -- we ought to leave pruning
(especially opportunistic pruning that takes place outside of VACUUM)
alone.

What difference is there between opportunistically pruned HOT line
pointers, and VACUUMed line pointers? Truncating during pruning has
the benefit of keeping the LP array short where possible, and seeing
that truncating the LP array allows for more applied
PD_HAS_FREE_LINES-optimization, I fail to see why you wouldn't want to
truncate the LP array whenever clearing up space.

Other than those questions, some comments on the other patches:

0002:

+    appendStringInfo(&buf, _("There were %lld dead item identifiers.\n"),
+                     (long long) vacrel->lpdead_item_pages);

I presume this should use vacrel->lpdead_items?.
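
i.e. presumably the line was meant to be:

+    appendStringInfo(&buf, _("There were %lld dead item identifiers.\n"),
+                     (long long) vacrel->lpdead_items);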

0003:

+ * ...  Aborted transactions
+ * have tuples that we can treat as DEAD without caring about where there
+ * tuple header XIDs ...

This should be '... where their tuple header XIDs...'

+retry:
+
...
+        res = HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf);
+
+        if (unlikely(res == HEAPTUPLE_DEAD))
+            goto retry;

In this unlikely case, you reset the tuples_deleted value that was
received earlier from heap_page_prune. This results in inaccurate
statistics, as repeated calls to heap_page_prune on the same page will
not count tuples that were deleted in a previous call.

0004:

+     * truncate to.  Note that we avoid truncating the line pointer to 0 items
+     * in all cases.
+     */

Is there a specific reason that I'm not getting as to why this is necessary?

0005:

+        The default is 1.8 billion transactions. Although users can set this value
+        anywhere from zero to 2.1 billion, <command>VACUUM</command> will silently
+        adjust the effective value more than 105% of
+        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that only
+        anti-wraparound autovacuums and aggressive scans have a chance to skip
+        index cleanup.

This documentation doesn't quite make it clear what its relationship
is with autovacuum_freeze_max_age. How about the following: "...
VACUUM will use the higher of this value and 105% of
guc-autovacuum-freeze-max-age, so that only ...". It's only slightly
easier to read, but at least it conveys that values lower than 105% of
autovacuum_freeze_max_age are not considered. The same can be said for
the multixact guc documentation.
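
(Concretely: with the stock autovacuum_freeze_max_age of 200 million,
any setting below 210 million would effectively be raised to 210
million, assuming the max()-style behaviour suggested above.)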

With regards,

Matthias van de Meent

#101Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#99)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Apr 5, 2021 at 4:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Did you try the change around parallel_process_one_index() that I
suggested in the previous reply[1]? If we don't change the logic, we
need to update the above comment. Previously, we updated stats[idx] in
vacuum_one_index() (renamed to parallel_process_one_index()), but with
your patch it is updated in its caller.

I don't know how I missed it the first time. I agree that it is a lot
better that way.

I did it that way in the version of the patch that I pushed just now. Thanks!

Do you think that it's okay that we rely on the propagation of global
state to parallel workers on Postgres 13? Don't we need something like
my fixup commit 49f49def on Postgres 13 as well? At least for the
EXEC_BACKEND case, I think.

We removed two Assert(!IsParallelWorker()) at two places. It seems to
me that those assertions are still valid. Do we really need to remove
them?

I have restored the assertions in what became the final version.

0004 patch:

src/backend/access/heap/heapam.c:638: trailing whitespace.

Will fix.

---
0005 patch:

+ * Caller is expected to call here before and after vacuuming each index in
+ * the case of two-pass VACUUM, or every BYPASS_EMERGENCY_MIN_PAGES blocks in
+ * the case of no-indexes/one-pass VACUUM.

I think it should be "every VACUUM_FSM_EVERY_PAGES blocks" instead of
"every BYPASS_EMERGENCY_MIN_PAGES blocks".

Will fix.

+#define BYPASS_EMERGENCY_MIN_PAGES \
+   ((BlockNumber) (((uint64) 4 * 1024 * 1024 * 1024) / BLCKSZ))
+

I think we need a description for BYPASS_EMERGENCY_MIN_PAGES.

I agree - will fix.

allindexes can be false even if we process all indexes, which is fine
with me because setting allindexes = false disables the subsequent
heap vacuuming. I think it's appropriate behavior in emergency cases.
In that sense, can we do should_speedup_failsafe() check also after
parallel index vacuuming? And we can also check it at the beginning of
lazy vacuum.

Those both seem like good ideas. Especially the one about checking
right at the start. Now that the patch makes the emergency mechanism
not apply a delay (not just skip index vacuuming), having a precheck
at the very start makes a lot of sense. This also makes VACUUM hurry
in the case where there was a dangerously slow VACUUM that happened to
not be aggressive. Such a VACUUM will use the emergency mechanism but
won't advance relfrozenxid, because we have to rely on the autovacuum
launcher launching an anti-wraparound/aggressive autovacuum
immediately afterwards. We want that second anti-wraparound VACUUM to
hurry from the very start of lazy_scan_heap().

--
Peter Geoghegan

#102Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#101)
Re: New IndexAM API controlling index vacuum strategies

Peter Geoghegan <pg@bowt.ie> writes:

Do you think that it's okay that we rely on the propagation of global
state to parallel workers on Postgres 13? Don't we need something like
my fixup commit 49f49def on Postgres 13 as well? At least for the
EXEC_BACKEND case, I think.

Uh ... *what* propagation of global state to parallel workers? Workers
fork off from the postmaster, not from their leader process.

(I note that morepork is still failing. The other ones didn't report
in yet.)

regards, tom lane

#103Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tom Lane (#102)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Apr 6, 2021 at 8:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <pg@bowt.ie> writes:

Do you think that it's okay that we rely on the propagation of global
state to parallel workers on Postgres 13? Don't we need something like
my fixup commit 49f49def on Postgres 13 as well? At least for the
EXEC_BACKEND case, I think.

Uh ... *what* propagation of global state to parallel workers? Workers
fork off from the postmaster, not from their leader process.

Right. I think we should apply that fix on PG13 as well.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#104Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#102)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Apr 5, 2021 at 4:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <pg@bowt.ie> writes:

Do you think that it's okay that we rely on the propagation of global
state to parallel workers on Postgres 13? Don't we need something like
my fixup commit 49f49def on Postgres 13 as well? At least for the
EXEC_BACKEND case, I think.

Uh ... *what* propagation of global state to parallel workers? Workers
fork off from the postmaster, not from their leader process.

(I note that morepork is still failing. The other ones didn't report
in yet.)

Evidently my fixup commit 49f49def was written in way too much of a
panic. I'm going to push a new fix shortly. This will make workers do
their own GetAccessStrategy(BAS_VACUUM), just to get the buildfarm
green.
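
In other words, roughly this in the worker path (a sketch of the idea,
not the exact hunk being pushed):

	/*
	 * Each parallel VACUUM worker allocates its own ring buffer
	 * strategy, rather than assuming the leader's static vac_strategy
	 * somehow made it into the worker process.
	 */
	vac_strategy = GetAccessStrategy(BAS_VACUUM);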

REL_13_STABLE will need to be considered separately. I still haven't
figured out how this ever appeared to work for this long. The
vac_strategy/bstrategy state simply wasn't propagated at all.

--
Peter Geoghegan

#105Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#104)
Re: New IndexAM API controlling index vacuum strategies

Hi,

On 2021-04-05 16:53:58 -0700, Peter Geoghegan wrote:

REL_13_STABLE will need to be considered separately. I still haven't
figured out how this ever appeared to work for this long. The
vac_strategy/bstrategy state simply wasn't propagated at all.

What do you mean with "appear to work"? Isn't, in 13, the only
consequence of vac_strategy not being "propagated" that we'll not use a
strategy in parallel workers? Presumably that was hard to notice
because most people don't run manual VACUUM with cost limits turned
on. And autovacuum doesn't use parallelism.

Greetings,

Andres Freund

#106Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#105)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Apr 5, 2021 at 5:00 PM Andres Freund <andres@anarazel.de> wrote:

What do you mean with "appear to work"? Isn't, in 13, the only
consequence of vac_strategy not being "propagated" that we'll not use a
strategy in parallel workers? Presumably that was hard to notice
because most people don't run manual VACUUM with cost limits turned
on. And autovacuum doesn't use parallelism.

Oh yeah. "static BufferAccessStrategy vac_strategy" is guaranteed to
be initialized to 0, simply because it's static and global. That
explains it.
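
For anyone following along, the failure mode reduces to something like
this (a simplified sketch, not the actual vacuumlazy.c code; the helper
name is made up for illustration):

	static BufferAccessStrategy vac_strategy;	/* static => starts out NULL */

	static Buffer
	vacuum_read_block(Relation rel, BlockNumber blkno)
	{
		/*
		 * In a worker process where nothing ever assigned vac_strategy,
		 * this passes NULL, which ReadBufferExtended() treats as "no
		 * special strategy" -- the worker just skips the BAS_VACUUM
		 * ring buffer instead of failing outright.
		 */
		return ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
								  vac_strategy);
	}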

--
Peter Geoghegan

#107Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#106)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Apr 5, 2021 at 5:09 PM Peter Geoghegan <pg@bowt.ie> wrote:

Oh yeah. "static BufferAccessStrategy vac_strategy" is guaranteed to
be initialized to 0, simply because it's static and global. That
explains it.

So do we need to allocate a strategy in workers now, or leave things
as they are/were?

I'm going to go ahead with pushing my commit to do that now, just to
get the buildfarm green. It's still a bug in Postgres 13, albeit a
less serious one than I first suspected.

--
Peter Geoghegan

#108Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#107)
Re: New IndexAM API controlling index vacuum strategies

Hi,

On 2021-04-05 17:18:37 -0700, Peter Geoghegan wrote:

On Mon, Apr 5, 2021 at 5:09 PM Peter Geoghegan <pg@bowt.ie> wrote:

Oh yeah. "static BufferAccessStrategy vac_strategy" is guaranteed to
be initialized to 0, simply because it's static and global. That
explains it.

So do we need to allocate a strategy in workers now, or leave things
as they are/were?

I'm going to go ahead with pushing my commit to do that now, just to
get the buildfarm green. It's still a bug in Postgres 13, albeit a
less serious one than I first suspected.

Feels like a v13 bug to me, one that should be fixed.

Greetings,

Andres Freund

#109Peter Geoghegan
pg@bowt.ie
In reply to: Matthias van de Meent (#100)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Apr 5, 2021 at 2:44 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

This assertion is false, and should be a guarding if-statement. HOT
redirect pointers are not updated if the tuple they're pointing to is
vacuumed (i.e. when it was never committed) so this nextoffnum might
in a correctly working system point past maxoff.

I will need to go through this in detail soon.

Line pointer truncation doesn't happen during pruning, as it did in
Matthias' original patch. In this revised version, line pointer
truncation occurs during the second phase of VACUUM. There are several
reasons to prefer this approach. It seems both safer and more useful
that way (compared to the approach of doing line pointer truncation
during pruning). It also makes intuitive sense to do it this way, at
least to me -- the second pass over the heap is supposed to be for
"freeing" LP_DEAD line pointers.

Good catch for running a line pointer truncating pass at the second
pass over the heap in VACUUM, but I believe that it is also very
useful for pruning. Line pointer bloat due to excessive HOT chains
cannot be undone until the 2nd run of VACUUM happens with this patch,
which is a missed chance for all non-vacuum pruning.

Maybe - I have my doubts about it having much value outside of the
more extreme cases. But let's assume that I'm wrong about that, for
the sake of argument.

The current plan is to no longer require a super-exclusive lock inside
lazy_vacuum_heap_page(), which means that we can no longer safely call
PageRepairFragmentation() at that point. This will mean that
PageRepairFragmentation() is 100% owned by pruning. And so the
question of whether or not line pointer truncation should also happen
in PageRepairFragmentation() to cover pruning is (or will be) a
totally separate question to the question of how
lazy_vacuum_heap_page() does it. Nothing stops you from independently
pursuing that as a project for Postgres 15.

What difference is there between opportunistically pruned HOT line
pointers, and VACUUMed line pointers?

The fact that they are physically identical to each other isn't
everything. The "life cycle" of an affected page is crucially
important.

I find that there is a lot of value in thinking about how things look
at the page level moment to moment, and even over hours and days.
Usually with a sample workload and table in mind. I already mentioned
the new_order table from TPC-C, which is characterized by continual
churn from more-or-less even amounts of range deletes and bulk inserts
over time. That seems to be the kind of workload where you see big
problems with line pointer bloat. Because there is constant churn of
unrelated logical rows (it's not a bunch of UPDATEs).

It's possible for very small effects to aggregate into large and
significant effects -- I know this from my experience with indexing.
Plus the FSM is probably not very smart about fragmentation, which
makes it even more complicated. And so it's easy to be wrong if you
predict that some seemingly insignificant extra intervention couldn't
possibly help. For that reason, I don't want to predict that you're
wrong now. It's just a question of time, and of priorities.

Truncating during pruning has
the benefit of keeping the LP array short where possible, and seeing
that truncating the LP array allows for more applied
PD_HAS_FREE_LINES-optimization, I fail to see why you wouldn't want to
truncate the LP array whenever clearing up space.

Truncating the line pointer array is not an intrinsic good. I hesitate
to do it during pruning in the absence of clear evidence that it's
independently useful. Pruning is a very performance sensitive
operation. Much more so than VACUUM's second heap pass.

Other than those questions, some comments on the other patches:

0002:

+    appendStringInfo(&buf, _("There were %lld dead item identifiers.\n"),
+                     (long long) vacrel->lpdead_item_pages);

I presume this should use vacrel->lpdead_items?.

It should have been, but as it happens I have decided to not do this
at all in 0002-*. Better to not report on LP_UNUSED *or* LP_DEAD items
at this point of VACUUM VERBOSE output.

0003:

+ * ...  Aborted transactions
+ * have tuples that we can treat as DEAD without caring about where there
+ * tuple header XIDs ...

This should be '... where their tuple header XIDs...'

Will fix.

+retry:
+
...
+        res = HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf);
+
+        if (unlikely(res == HEAPTUPLE_DEAD))
+            goto retry;

In this unlikely case, you reset the tuples_deleted value that was
received earlier from heap_page_prune. This results in inaccurate
statistics, as repeated calls to heap_page_prune on the same page will
not count tuples that were deleted in a previous call.

I don't think that it matters. The "tupgone=true" case has no test
coverage (see coverage.postgresql.org), and it would be hard to ensure
that the "res == HEAPTUPLE_DEAD" that replaces it gets coverage, for
the same reasons. Keeping the rules as simple as possible seems like a
good goal. What's more, it's absurdly unlikely that this will happen
even once. The race is very tight. Postgres will do opportunistic
pruning at almost any point, often from a SELECT, so the chances of
anybody noticing an inaccuracy from this issue in particular are
remote in the extreme.

Actually, a big problem with the tuples_deleted value surfaced by both
log_autovacuum and by VACUUM VERBOSE is that it can be wildly
different to the number of LP_DEAD items. This is commonly the case
with tables that get lots of non-HOT updates, with opportunistic
pruning kicking in a lot, with LP_DEAD items constantly accumulating.
By the time VACUUM comes around, it reports an absurdly low
tuples_deleted because it's using this what-I-pruned-just-now
definition. The opposite extreme is also possible, since there might
be far fewer LP_DEAD items when VACUUM does a lot of pruning of HOT
chains specifically.

0004:

+     * truncate to.  Note that we avoid truncating the line pointer to 0 items
+     * in all cases.
+     */

Is there a specific reason that I'm not getting as to why this is necessary?

I didn't say it was strictly necessary. There is special-case handling
of PageIsEmpty() at various points, though, including within VACUUM.
It seemed worth avoiding hitting that. Perhaps I should change it to
not work that way.

0005:

+        The default is 1.8 billion transactions. Although users can set this value
+        anywhere from zero to 2.1 billion, <command>VACUUM</command> will silently
+        adjust the effective value more than 105% of
+        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that only
+        anti-wraparound autovacuums and aggressive scans have a chance to skip
+        index cleanup.

This documentation doesn't quite make it clear what its relationship
is with autovacuum_freeze_max_age. How about the following: "...
VACUUM will use the higher of this value and 105% of
guc-autovacuum-freeze-max-age, so that only ...". It's only slightly
easier to read, but at least it conveys that values lower than 105% of
autovacuum_freeze_max_age are not considered. The same can be said for
the multixact guc documentation.

This does need work too.

I'm going to push 0002- and 0003- tomorrow morning pacific time. I'll
publish a new set of patches tomorrow, once I've finished that up. The
last 2 patches will require a lot of focus to get over the line for
Postgres 14.

--
Peter Geoghegan

#110Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Peter Geoghegan (#109)
Re: New IndexAM API controlling index vacuum strategies

On Tue, 6 Apr 2021 at 05:13, Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Apr 5, 2021 at 2:44 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

This assertion is false, and should be a guarding if-statement. HOT
redirect pointers are not updated if the tuple they're pointing to is
vacuumed (i.e. when it was never committed) so this nextoffnum might
in a correctly working system point past maxoff.

I will need to go through this in detail soon.

Line pointer truncation doesn't happen during pruning, as it did in
Matthias' original patch. In this revised version, line pointer
truncation occurs during the second phase of VACUUM. There are several
reasons to prefer this approach. It seems both safer and more useful
that way (compared to the approach of doing line pointer truncation
during pruning). It also makes intuitive sense to do it this way, at
least to me -- the second pass over the heap is supposed to be for
"freeing" LP_DEAD line pointers.

Good catch for running a line pointer truncating pass at the second
pass over the heap in VACUUM, but I believe that it is also very
useful for pruning. Line pointer bloat due to excessive HOT chains
cannot be undone until the 2nd run of VACUUM happens with this patch,
which is a missed chance for all non-vacuum pruning.

Maybe - I have my doubts about it having much value outside of the
more extreme cases. But let's assume that I'm wrong about that, for
the sake of argument.

The current plan is to no longer require a super-exclusive lock inside
lazy_vacuum_heap_page(), which means that we can no longer safely call
PageRepairFragmentation() at that point. This will mean that
PageRepairFragmentation() is 100% owned by pruning. And so the
question of whether or not line pointer truncation should also happen
in PageRepairFragmentation() to cover pruning is (or will be) a
totally separate question to the question of how
lazy_vacuum_heap_page() does it. Nothing stops you from independently
pursuing that as a project for Postgres 15.

Ah, then I misunderstood your intentions when you mentioned including
a modified version of my patch. In which case, I agree that improving
HOT pruning is indeed out of scope.

What difference is there between opportunistically pruned HOT line
pointers, and VACUUMed line pointers?

The fact that they are physically identical to each other isn't
everything. The "life cycle" of an affected page is crucially
important.

I find that there is a lot of value in thinking about how things look
at the page level moment to moment, and even over hours and days.
Usually with a sample workload and table in mind. I already mentioned
the new_order table from TPC-C, which is characterized by continual
churn from more-or-less even amounts of range deletes and bulk inserts
over time. That seems to be the kind of workload where you see big
problems with line pointer bloat. Because there is constant churn of
unrelated logical rows (it's not a bunch of UPDATEs).

It's possible for very small effects to aggregate into large and
significant effects -- I know this from my experience with indexing.
Plus the FSM is probably not very smart about fragmentation, which
makes it even more complicated. And so it's easy to be wrong if you
predict that some seemingly insignificant extra intervention couldn't
possibly help. For that reason, I don't want to predict that you're
wrong now. It's just a question of time, and of priorities.

Truncating during pruning has
the benefit of keeping the LP array short where possible, and seeing
that truncating the LP array allows for more applied
PD_HAS_FREE_LINES-optimization, I fail to see why you wouldn't want to
truncate the LP array whenever clearing up space.

Truncating the line pointer array is not an intrinsic good. I hesitate
to do it during pruning in the absence of clear evidence that it's
independently useful. Pruning is a very performance sensitive
operation. Much more so than VACUUM's second heap pass.

Other than those questions, some comments on the other patches:

0002:

+    appendStringInfo(&buf, _("There were %lld dead item identifiers.\n"),
+                     (long long) vacrel->lpdead_item_pages);

I presume this should use vacrel->lpdead_items?.

It should have been, but as it happens I have decided to not do this
at all in 0002-*. Better to not report on LP_UNUSED *or* LP_DEAD items
at this point of VACUUM VERBOSE output.

0003:

+ * ...  Aborted transactions
+ * have tuples that we can treat as DEAD without caring about where there
+ * tuple header XIDs ...

This should be '... where their tuple header XIDs...'

Will fix.

+retry:
+
...
+        res = HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf);
+
+        if (unlikely(res == HEAPTUPLE_DEAD))
+            goto retry;

In this unlikely case, you reset the tuples_deleted value that was
received earlier from heap_page_prune. This results in inaccurate
statistics, as repeated calls to heap_page_prune on the same page will
not count tuples that were deleted in a previous call.

I don't think that it matters. The "tupgone=true" case has no test
coverage (see coverage.postgresql.org), and it would be hard to ensure
that the "res == HEAPTUPLE_DEAD" that replaces it gets coverage, for
the same reasons. Keeping the rules as simple as possible seems like a
good goal. What's more, it's absurdly unlikely that this will happen
even once. The race is very tight. Postgres will do opportunistic
pruning at almost any point, often from a SELECT, so the chances of
anybody noticing an inaccuracy from this issue in particular are
remote in the extreme.

Actually, a big problem with the tuples_deleted value surfaced by both
log_autovacuum and by VACUUM VERBOSE is that it can be wildly
different to the number of LP_DEAD items. This is commonly the case
with tables that get lots of non-HOT updates, with opportunistic
pruning kicking in a lot, with LP_DEAD items constantly accumulating.
By the time VACUUM comes around, it reports an absurdly low
tuples_deleted because it's using this what-I-pruned-just-now
definition. The opposite extreme is also possible, since there might
be far fewer LP_DEAD items when VACUUM does a lot of pruning of HOT
chains specifically.

That seems reasonable as well.

0004:

+     * truncate to.  Note that we avoid truncating the line pointer to 0 items
+     * in all cases.
+     */

Is there a specific reason that I'm not getting as to why this is necessary?

I didn't say it was strictly necessary. There is special-case handling
of PageIsEmpty() at various points, though, including within VACUUM.
It seemed worth avoiding hitting that.

That seems reasonable.

Perhaps I should change it to not work that way.

All cases of PageIsEmpty on heap pages seem to be optimized short-path
handling of empty pages in vacuum, so I'd say that it is better to
fully truncate the array, but I'd be fully OK with postponing that
specific change for further analysis.

0005:

+        The default is 1.8 billion transactions. Although users can set this value
+        anywhere from zero to 2.1 billion, <command>VACUUM</command> will silently
+        adjust the effective value more than 105% of
+        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that only
+        anti-wraparound autovacuums and aggressive scans have a chance to skip
+        index cleanup.

This documentation doesn't quite make it clear what its relationship
is with autovacuum_freeze_max_age. How about the following: "...
VACUUM will use the higher of this value and 105% of
guc-autovacuum-freeze-max-age, so that only ...". It's only slightly
easier to read, but at least it conveys that values lower than 105% of
autovacuum_freeze_max_age are not considered. The same can be said for
the multixact guc documentation.

This does need work too.

I'm going to push 0002- and 0003- tomorrow morning pacific time. I'll
publish a new set of patches tomorrow, once I've finished that up. The
last 2 patches will require a lot of focus to get over the line for
Postgres 14.

If you have updated patches, I'll try to check them this evening (CEST).

With regards,

Matthias van de Meent

#111Peter Geoghegan
pg@bowt.ie
In reply to: Matthias van de Meent (#110)
2 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Apr 6, 2021 at 7:05 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

If you have updated patches, I'll try to check them this evening (CEST).

Here is v11, which is not too different from v10 as far as the
truncation stuff goes.

Masahiko should take a look at the last patch again. I renamed the
GUCs to reflect the fact that we do everything possible to advance
relfrozenxid in the case where the fail safe mechanism kicks in -- not
just skipping index vacuuming. It also incorporates your most recent
round of feedback.

Thanks
--
Peter Geoghegan

Attachments:

v11-0001-Truncate-line-pointer-array-during-VACUUM.patch (application/octet-stream)
From 975131a29a2a5290147d4c813da1479a6268ac18 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 30 Mar 2021 19:43:06 -0700
Subject: [PATCH v11 1/2] Truncate line pointer array during VACUUM.

Truncate each heap page's line pointer array when a contiguous group of
LP_UNUSED item pointers appears at the end of the array.  This happens
during VACUUM's second pass over the heap.  In practice most affected
LP_UNUSED line pointers are truncated away at the same point that VACUUM
marks them LP_UNUSED (from LP_DEAD).

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com
---
 src/include/storage/bufpage.h        |   1 +
 src/backend/access/heap/heapam.c     |  22 ++++--
 src/backend/access/heap/pruneheap.c  |   4 +
 src/backend/access/heap/vacuumlazy.c |  16 +++-
 src/backend/storage/page/bufpage.c   | 112 ++++++++++++++++++++++++++-
 5 files changed, 144 insertions(+), 11 deletions(-)

diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 359b749f7f..c86ccdaf60 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -441,6 +441,7 @@ extern Page PageGetTempPageCopy(Page page);
 extern Page PageGetTempPageCopySpecial(Page page);
 extern void PageRestoreTempPage(Page tempPage, Page oldPage);
 extern void PageRepairFragmentation(Page page);
+extern void PageTruncateLinePointerArray(Page page);
 extern Size PageGetFreeSpace(Page page);
 extern Size PageGetFreeSpaceForMultipleTuples(Page page, int ntups);
 extern Size PageGetExactFreeSpace(Page page);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9cbc161d7a..4d5247fb0b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -635,8 +635,15 @@ heapgettup(HeapScanDesc scan,
 		}
 		else
 		{
+			/*
+			 * The previous returned tuple may have been vacuumed since the
+			 * previous scan when we use a non-MVCC snapshot, so we must
+			 * re-establish the lineoff <= PageGetMaxOffsetNumber(dp)
+			 * invariant
+			 */
 			lineoff =			/* previous offnum */
-				OffsetNumberPrev(ItemPointerGetOffsetNumber(&(tuple->t_self)));
+				Min(lines,
+					OffsetNumberPrev(ItemPointerGetOffsetNumber(&(tuple->t_self))));
 		}
 		/* page and lineoff now reference the physically previous tid */
 
@@ -678,6 +685,13 @@ heapgettup(HeapScanDesc scan,
 	lpp = PageGetItemId(dp, lineoff);
 	for (;;)
 	{
+		/*
+		 * Only continue scanning the page while we have lines left.
+		 *
+		 * Note that this protects us from accessing line pointers past
+		 * PageGetMaxOffsetNumber(); both for forward scans when we resume
+		 * the table scan, and for when we start scanning a new page.
+		 */
 		while (linesleft > 0)
 		{
 			if (ItemIdIsNormal(lpp))
@@ -8556,10 +8570,8 @@ heap_xlog_vacuum(XLogReaderState *record)
 			ItemIdSetUnused(lp);
 		}
 
-		/*
-		 * Update the page's hint bit about whether it has free pointers
-		 */
-		PageSetHasFreeLinePointers(page);
+		/* Attempt to truncate line pointer array now */
+		PageTruncateLinePointerArray(page);
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index f75502ca2c..0c8e49d3e6 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -962,6 +962,10 @@ heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
 		 */
 		for (;;)
 		{
+			/* Sanity check */
+			if (nextoffnum < FirstOffsetNumber || nextoffnum > maxoff)
+				break;
+
 			lp = PageGetItemId(page, nextoffnum);
 
 			/* Check for broken chains */
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 446e3bc452..1d55d0ecf9 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1444,7 +1444,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 		if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
 		{
 			/*
-			 * Wait until lazy_vacuum_heap_rel() to save free space.
+			 * Wait until lazy_vacuum_heap_rel() to save free space.  This
+			 * doesn't just save us some cycles; it also allows us to record
+			 * any additional free space that lazy_vacuum_heap_page() will
+			 * make available in cases where it's possible to truncate the
+			 * page's line pointer array.
 			 *
 			 * Note: The one-pass (no indexes) case is only supposed to make
 			 * it this far when there were no LP_DEAD items during pruning.
@@ -2033,6 +2037,13 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
  * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
  * at all.
  *
+ * We may also be able to truncate the line pointer array of the heap pages we
+ * visit.  If there is a contiguous group of LP_UNUSED items at the end of the
+ * array, it can be reclaimed as free space.  These LP_UNUSED items usually
+ * start out as LP_DEAD items recorded by lazy_scan_prune (we set items from
+ * each page to LP_UNUSED, and then consider if it's possible to truncate the
+ * page's line pointer array).
+ *
  * Note: the reason for doing this as a second pass is we cannot remove the
  * tuples until we've removed their index entries, and we want to process
  * index entry removal in batches as large as possible.
@@ -2175,7 +2186,8 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 
 	Assert(uncnt > 0);
 
-	PageSetHasFreeLinePointers(page);
+	/* Attempt to truncate line pointer array now */
+	PageTruncateLinePointerArray(page);
 
 	/*
 	 * Mark buffer dirty before we write WAL.
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 5d5989c2f5..a4eed5cdcd 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -250,8 +250,17 @@ PageAddItemExtended(Page page,
 		/* if no free slot, we'll put it at limit (1st open slot) */
 		if (PageHasFreeLinePointers(phdr))
 		{
-			/* Look for "recyclable" (unused) ItemId */
-			for (offsetNumber = 1; offsetNumber < limit; offsetNumber++)
+			/*
+			 * Scan line pointer array to locate a "recyclable" (unused)
+			 * ItemId.
+			 *
+			 * Always use earlier items first.  PageTruncateLinePointerArray
+			 * can only truncate unused items when they appear as a contiguous
+			 * group at the end of the line pointer array.
+			 */
+			for (offsetNumber = FirstOffsetNumber;
+				 offsetNumber < limit;		/* limit is maxoff+1 */
+				 offsetNumber++)
 			{
 				itemId = PageGetItemId(phdr, offsetNumber);
 
@@ -675,11 +684,23 @@ compactify_tuples(itemIdCompact itemidbase, int nitems, Page page, bool presorte
 /*
  * PageRepairFragmentation
  *
- * Frees fragmented space on a page.
- * It doesn't remove unused line pointers! Please don't change this.
+ * Frees fragmented space on a heap page following pruning.
  *
  * This routine is usable for heap pages only, but see PageIndexMultiDelete.
  *
+ * Never removes unused line pointers.  PageTruncateLinePointerArray can
+ * safely remove some unused line pointers.  It ought to be safe for this
+ * routine to free unused line pointers in roughly the same way, but it's not
+ * clear that that would be beneficial.
+ *
+ * PageTruncateLinePointerArray is only called during VACUUM's second pass
+ * over the heap.  Any unused line pointers that it sees are likely to have
+ * been set to LP_UNUSED (from LP_DEAD) immediately before the time it is
+ * called.  On the other hand, many tables have the vast majority of all
+ * required pruning performed opportunistically (not during VACUUM).  And so
+ * there is, in general, a good chance that all of the unused line pointers
+ * we'll see on the page are ceaselessly recycled, again and again.
+ *
  * Caller had better have a super-exclusive lock on page's buffer.  As a side
  * effect the page's PD_HAS_FREE_LINES hint bit will be set or unset as
  * needed.
@@ -784,6 +805,89 @@ PageRepairFragmentation(Page page)
 		PageClearHasFreeLinePointers(page);
 }
 
+/*
+ * PageTruncateLinePointerArray
+ *
+ * Removes unused line pointers at the end of the line pointer array.
+ *
+ * This routine is usable for heap pages only.  It is called by VACUUM during
+ * its second pass over the heap.  We expect at least one LP_UNUSED line
+ * pointer on the page (if VACUUM didn't have an LP_DEAD item on the page that
+ * it just set to LP_UNUSED then it should not call here).
+ *
+ * We avoid truncating the line pointer array to 0 items, if necessary by
+ * leaving behind a single remaining LP_UNUSED item.  This is a little
+ * arbitrary, but it seems like a good idea to avoid leaving a PageIsEmpty()
+ * page behind.  That is treated as a special case by VACUUM.
+ *
+ * Caller can have either an exclusive lock or a super-exclusive lock on
+ * page's buffer.  The page's PD_HAS_FREE_LINES hint bit will be set or unset
+ * based on whether or not we leave behind any remaining LP_UNUSED items.
+ */
+void
+PageTruncateLinePointerArray(Page page)
+{
+	PageHeader	phdr = (PageHeader) page;
+	bool		countdone = false,
+				sethint = false;
+	int			nunusedend = 0;
+
+	/* Scan line pointer array back-to-front */
+	for (int i = PageGetMaxOffsetNumber(page); i >= FirstOffsetNumber; i--)
+	{
+		ItemId		lp = PageGetItemId(page, i);
+
+		if (!countdone && i > FirstOffsetNumber)
+		{
+			/*
+			 * Still determining which line pointers from the end of the array
+			 * will be truncated away.  Either count another line pointer as
+			 * safe to truncate, or notice that it's not safe to truncate
+			 * additional line pointers (stop counting line pointers).
+			 */
+			if (!ItemIdIsUsed(lp))
+				nunusedend++;
+			else
+				countdone = true;
+		}
+		else
+		{
+			/*
+			 * Once we've stopped counting we still need to figure out if
+			 * there are any remaining LP_UNUSED line pointers somewhere more
+			 * towards the front of the array.
+			 */
+			if (!ItemIdIsUsed(lp))
+			{
+				/*
+				 * This is an unused line pointer that we won't be truncating
+				 * away -- so there is at least one.  Set hint on page.
+				 */
+				sethint = true;
+				break;
+			}
+		}
+	}
+
+	if (nunusedend > 0)
+	{
+		phdr->pd_lower -= sizeof(ItemIdData) * nunusedend;
+
+#ifdef CLOBBER_FREED_MEMORY
+		memset((char *) page + phdr->pd_lower, 0x7F,
+			   sizeof(ItemIdData) * nunusedend);
+#endif
+	}
+	else
+		Assert(sethint);
+
+	/* Set hint bit for PageAddItemExtended */
+	if (sethint)
+		PageSetHasFreeLinePointers(page);
+	else
+		PageClearHasFreeLinePointers(page);
+}
+
 /*
  * PageGetFreeSpace
  *		Returns the size of the free (allocatable) space on a page,
-- 
2.27.0

v11-0002-Bypass-index-vacuuming-in-some-cases.patch (application/octet-stream)
From 9e189f97e9b447e20ad60dbd9e8e183c6f4f19ff Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 28 Mar 2021 20:55:55 -0700
Subject: [PATCH v11 2/2] Bypass index vacuuming in some cases.

Bypass index vacuuming in two cases: The case where there are almost no
dead tuples in indexes, as an optimization, and the case where a table's
relfrozenxid is dangerously far in the past, as a failsafe to avoid
wraparound failure.

The failsafe is controlled by two new GUCs: vacuum_failsafe_age, and
vacuum_multixact_failsafe_age.  These specify the age at which VACUUM
should take extraordinary measures in order to advance relfrozenxid
and/or relminmxid before a system-wide wraparound failure takes place.

Note also that the failsafe has VACUUM stop applying any cost-based
delay that may be in effect.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzmkebqPd4MVGuPTOS9bMFvp9MDs5cRTCOsv1rQJ3jCbXw@mail.gmail.com
---
 src/include/commands/vacuum.h                 |   4 +
 src/backend/access/heap/vacuumlazy.c          | 306 +++++++++++++++++-
 src/backend/commands/vacuum.c                 |  64 ++++
 src/backend/utils/misc/guc.c                  |  25 +-
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 doc/src/sgml/config.sgml                      |  66 ++++
 6 files changed, 449 insertions(+), 18 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d029da5ac0..9179ad223f 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -235,6 +235,8 @@ extern int	vacuum_freeze_min_age;
 extern int	vacuum_freeze_table_age;
 extern int	vacuum_multixact_freeze_min_age;
 extern int	vacuum_multixact_freeze_table_age;
+extern int	vacuum_failsafe_age;
+extern int	vacuum_multixact_failsafe_age;
 
 /* Variables for cost-based parallel vacuum */
 extern pg_atomic_uint32 *VacuumSharedCostBalance;
@@ -270,6 +272,8 @@ extern void vacuum_set_xid_limits(Relation rel,
 								  TransactionId *xidFullScanLimit,
 								  MultiXactId *multiXactCutoff,
 								  MultiXactId *mxactFullScanLimit);
+extern bool vacuum_xid_limit_emergency(TransactionId relfrozenxid,
+									   MultiXactId relminmxid);
 extern void vac_update_datfrozenxid(void);
 extern void vacuum_delay_point(void);
 extern bool vacuum_is_relation_owner(Oid relid, Form_pg_class reltuple,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1d55d0ecf9..a27cdf1eb0 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -103,6 +103,19 @@
 #define VACUUM_TRUNCATE_LOCK_WAIT_INTERVAL		50	/* ms */
 #define VACUUM_TRUNCATE_LOCK_TIMEOUT			5000	/* ms */
 
+/*
+ * Threshold that controls whether we bypass index vacuuming and heap
+ * vacuuming as an optimization
+ */
+#define BYPASS_THRESHOLD_PAGES	0.02	/* i.e. 2% of rel_pages */
+
+/*
+ * When a table is small (i.e. smaller than this), save cycles by avoiding
+ * repeated emergency fail safe checks
+ */
+#define BYPASS_EMERGENCY_MIN_PAGES \
+	((BlockNumber) (((uint64) 4 * 1024 * 1024 * 1024) / BLCKSZ))
+
 /*
  * When a table has no indexes, vacuum the FSM after every 8GB, approximately
  * (it won't be exact because we only vacuum FSM after processing a heap page
@@ -299,6 +312,7 @@ typedef struct LVRelState
 	/* Do index vacuuming/cleanup? */
 	bool		do_index_vacuuming;
 	bool		do_index_cleanup;
+	bool		do_failsafe_speedup;
 
 	/* Buffer access strategy and parallel state */
 	BufferAccessStrategy bstrategy;
@@ -392,13 +406,14 @@ static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 							BlockNumber blkno, Page page,
 							GlobalVisState *vistest,
 							LVPagePruneState *prunestate);
-static void lazy_vacuum(LVRelState *vacrel);
-static void lazy_vacuum_all_indexes(LVRelState *vacrel);
+static void lazy_vacuum(LVRelState *vacrel, bool onecall);
+static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void lazy_vacuum_heap_rel(LVRelState *vacrel);
 static int	lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
 								  Buffer buffer, int tupindex, Buffer *vmbuffer);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
 									LVRelState *vacrel);
+static bool should_speedup_failsafe(LVRelState *vacrel);
 static void do_parallel_lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void do_parallel_lazy_cleanup_all_indexes(LVRelState *vacrel);
 static void do_parallel_vacuum_or_cleanup(LVRelState *vacrel, int nworkers);
@@ -544,6 +559,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 					 &vacrel->indrels);
 	vacrel->do_index_vacuuming = true;
 	vacrel->do_index_cleanup = true;
+	vacrel->do_failsafe_speedup = false;
 	if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
 	{
 		vacrel->do_index_vacuuming = false;
@@ -749,6 +765,29 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 							 (long long) VacuumPageHit,
 							 (long long) VacuumPageMiss,
 							 (long long) VacuumPageDirty);
+			if (vacrel->rel_pages > 0)
+			{
+				msgfmt = _(" %u pages from table (%.2f%% of total) had %lld dead item identifiers removed\n");
+
+				if (vacrel->nindexes == 0 || (vacrel->do_index_vacuuming &&
+											  vacrel->num_index_scans == 0))
+					appendStringInfo(&buf, _("index scan not needed:"));
+				else if (vacrel->do_index_vacuuming && vacrel->num_index_scans > 0)
+					appendStringInfo(&buf, _("index scan needed:"));
+				else
+				{
+					msgfmt = _(" %u pages from table (%.2f%% of total) have %lld dead item identifiers\n");
+
+					if (!vacrel->do_failsafe_speedup)
+						appendStringInfo(&buf, _("index scan bypassed:"));
+					else
+						appendStringInfo(&buf, _("index scan bypassed due to emergency:"));
+				}
+				appendStringInfo(&buf, msgfmt,
+								 vacrel->lpdead_item_pages,
+								 100.0 * vacrel->lpdead_item_pages / vacrel->rel_pages,
+								 (long long) vacrel->lpdead_items);
+			}
 			for (int i = 0; i < vacrel->nindexes; i++)
 			{
 				IndexBulkDeleteResult *istat = vacrel->indstats[i];
@@ -839,7 +878,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				next_fsm_block_to_vacuum;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		skipping_blocks;
+	bool		skipping_blocks,
+				have_vacuumed_indexes = false;
 	StringInfoData buf;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
@@ -975,6 +1015,12 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	else
 		skipping_blocks = false;
 
+	/*
+	 * Before beginning heap scan, check if it's already necessary to apply
+	 * fail safe speedup
+	 */
+	should_speedup_failsafe(vacrel);
+
 	for (blkno = 0; blkno < nblocks; blkno++)
 	{
 		Buffer		buf;
@@ -1091,7 +1137,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			}
 
 			/* Remove the collected garbage tuples from table and indexes */
-			lazy_vacuum(vacrel);
+			lazy_vacuum(vacrel, false);
+			have_vacuumed_indexes = true;
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
@@ -1311,12 +1358,17 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 				 * Periodically perform FSM vacuuming to make newly-freed
 				 * space visible on upper FSM pages.  Note we have not yet
 				 * performed FSM processing for blkno.
+				 *
+				 * This is also a good time to call should_speedup_failsafe(),
+				 * since we also don't want to do that too frequently or too
+				 * infrequently.
 				 */
 				if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
 				{
 					FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
 											blkno);
 					next_fsm_block_to_vacuum = blkno;
+					should_speedup_failsafe(vacrel);
 				}
 
 				/*
@@ -1450,6 +1502,14 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 			 * make available in cases where it's possible to truncate the
 			 * page's line pointer array.
 			 *
+			 * Note: It's not in fact 100% certain that we really will call
+			 * lazy_vacuum_heap_rel() -- lazy_vacuum() might yet opt to skip
+			 * index vacuuming (and so must skip heap vacuuming).  This is
+			 * deemed okay because it only happens in emergencies, or when
+			 * there is very little free space anyway. (Besides, we start
+			 * recording free space in the FSM once index vacuuming has been
+			 * abandoned.)
+			 *
 			 * Note: The one-pass (no indexes) case is only supposed to make
 			 * it this far when there were no LP_DEAD items during pruning.
 			 */
@@ -1493,13 +1553,12 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	}
 
 	/* If any tuples need to be deleted, perform final vacuum cycle */
-	/* XXX put a threshold on min number of tuples here? */
 	if (dead_tuples->num_tuples > 0)
-		lazy_vacuum(vacrel);
+		lazy_vacuum(vacrel, !have_vacuumed_indexes);
 
 	/*
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
-	 * not there were indexes.
+	 * not there were indexes, and whether or not we bypassed index vacuuming.
 	 */
 	if (blkno > next_fsm_block_to_vacuum)
 		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
@@ -1526,6 +1585,16 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	 * If table has no indexes and at least one heap pages was vacuumed, make
 	 * log report that lazy_vacuum_heap_rel would've made had there been
 	 * indexes (having indexes implies using the two pass strategy).
+	 *
+	 * We deliberately don't do this in the case where there are indexes but
+	 * index vacuuming was bypassed.  We make a similar report at the point
+	 * that index vacuuming is bypassed, but that's actually quite different
+	 * in one important sense: it shows information about work we _haven't_
+	 * done.
+	 *
+	 * log_autovacuum output does things differently; it consistently presents
+	 * information about LP_DEAD items for the VACUUM as a whole.  We always
+	 * report on each round of index and heap vacuuming separately, though.
 	 */
 	if (vacrel->nindexes == 0 && vacrel->lpdead_item_pages > 0)
 		ereport(elevel,
@@ -1953,10 +2022,19 @@ retry:
 
 /*
  * Remove the collected garbage tuples from the table and its indexes.
+ *
+ * We may choose to bypass index vacuuming at this point.
+ *
+ * In rare emergencies, the ongoing VACUUM operation can be made to skip both
+ * index vacuuming and index cleanup at the point we're called.  This avoids
+ * having the whole system refuse to allocate further XIDs/MultiXactIds due to
+ * wraparound.
  */
 static void
-lazy_vacuum(LVRelState *vacrel)
+lazy_vacuum(LVRelState *vacrel, bool onecall)
 {
+	bool		do_bypass_optimization;
+
 	/* Should not end up here with no indexes */
 	Assert(vacrel->nindexes > 0);
 	Assert(!IsParallelWorker());
@@ -1964,16 +2042,102 @@ lazy_vacuum(LVRelState *vacrel)
 
 	if (!vacrel->do_index_vacuuming)
 	{
-		Assert(!vacrel->do_index_cleanup);
 		vacrel->dead_tuples->num_tuples = 0;
 		return;
 	}
 
-	/* Okay, we're going to do index vacuuming */
-	lazy_vacuum_all_indexes(vacrel);
+	/*
+	 * Consider bypassing index vacuuming (and heap vacuuming) entirely.
+	 *
+	 * We currently only do this in cases where the number of LP_DEAD items
+	 * for the entire VACUUM operation is close to zero.  This avoids sharp
+	 * discontinuities in the duration and overhead of successive VACUUM
+	 * operations that run against the same table with a fixed workload.
+	 * Ideally, successive VACUUM operations will behave as if there are
+	 * exactly zero LP_DEAD items in cases where there are close to zero.
+	 *
+	 * This is likely to be helpful with a table that is continually affected
+	 * by UPDATEs that can mostly apply the HOT optimization, but occasionally
+	 * have small aberrations that lead to just a few heap pages retaining
+	 * only one or two LP_DEAD items.  This is pretty common; even when the
+	 * DBA goes out of their way to make UPDATEs use HOT, it is practically
+	 * impossible to predict whether HOT will be applied in 100% of cases.
+	 * It's far easier to ensure that 99%+ of all UPDATEs against a table use
+	 * HOT through careful tuning.
+	 */
+	do_bypass_optimization = false;
+	if (onecall && vacrel->rel_pages > 0)
+	{
+		BlockNumber threshold;
 
-	/* Remove tuples from heap */
-	lazy_vacuum_heap_rel(vacrel);
+		Assert(vacrel->num_index_scans == 0);
+		Assert(vacrel->lpdead_items == vacrel->dead_tuples->num_tuples);
+		Assert(vacrel->do_index_vacuuming);
+		Assert(vacrel->do_index_cleanup);
+
+		/*
+		 * This crossover point at which we'll start to do index vacuuming is
+		 * expressed as a percentage of the total number of heap pages in the
+		 * table that are known to have at least one LP_DEAD item.  This is
+		 * much more important than the total number of LP_DEAD items, since
+		 * it's a proxy for the number of heap pages whose visibility map bits
+		 * cannot be set on account of bypassing index and heap vacuuming.
+		 *
+		 * We apply one further precautionary test: the space currently used
+		 * to store the TIDs (TIDs that now all point to LP_DEAD items) must
+		 * not exceed 32MB.  This limits the risk that we will bypass index
+		 * vacuuming again and again until eventually there is a VACUUM whose
+		 * dead_tuples space is not CPU cache resident.
+		 */
+		threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
+		do_bypass_optimization =
+			(vacrel->lpdead_item_pages < threshold &&
+			 vacrel->lpdead_items < MAXDEADTUPLES(32L * 1024L * 1024L));
+	}
+
+	if (do_bypass_optimization)
+	{
+		/*
+		 * There are almost zero TIDs.  Behave as if there were precisely
+		 * zero: bypass index vacuuming, but do index cleanup.
+		 *
+		 * We expect that the ongoing VACUUM operation will finish very
+		 * quickly, so there is no point in considering speeding up as a
+		 * failsafe against wraparound failure. (Index cleanup is expected to
+		 * finish very quickly in cases where there were no ambulkdelete()
+		 * calls.)
+		 */
+		vacrel->do_index_vacuuming = false;
+		ereport(elevel,
+				(errmsg("\"%s\": index scan bypassed: %u pages from table (%.2f%% of total) have %lld dead item identifiers",
+						vacrel->relname, vacrel->rel_pages,
+						100.0 * vacrel->lpdead_item_pages / vacrel->rel_pages,
+						(long long) vacrel->lpdead_items)));
+	}
+	else if (lazy_vacuum_all_indexes(vacrel))
+	{
+		/*
+		 * We successfully completed a round of index vacuuming.  Do related
+		 * heap vacuuming now.
+		 */
+		lazy_vacuum_heap_rel(vacrel);
+	}
+	else
+	{
+		/*
+		 * Emergency case.
+		 *
+		 * we attempted index vacuuming, but didn't finish a full round/full
+		 * index scan.  This happens when relfrozenxid or relminmxid is too
+		 * far in the past.
+		 *
+		 * From this point on the VACUUM operation will do no further index
+		 * vacuuming or heap vacuuming.  It will do any remaining pruning that
+		 * may be required, plus other heap-related and relation-level
+		 * maintenance tasks.  But that's it.
+		 */
+		Assert(vacrel->do_failsafe_speedup);
+	}
 
 	/*
 	 * Forget the now-vacuumed tuples -- just press on
@@ -1983,10 +2147,17 @@ lazy_vacuum(LVRelState *vacrel)
 
 /*
  *	lazy_vacuum_all_indexes() -- Main entry for index vacuuming
+ *
+ * Returns true in the common case when all indexes were successfully
+ * vacuumed.  Returns false in rare cases where we determined that the ongoing
+ * VACUUM operation is at risk of taking too long to finish, leading to
+ * wraparound failure.
  */
-static void
+static bool
 lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
+	bool		allindexes = true;
+
 	Assert(!IsParallelWorker());
 	Assert(vacrel->nindexes > 0);
 	Assert(vacrel->do_index_vacuuming);
@@ -1994,6 +2165,13 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
 	Assert(MultiXactIdIsValid(vacrel->relminmxid));
 
+	/* Precheck for XID wraparound emergencies */
+	if (should_speedup_failsafe(vacrel))
+	{
+		/* Wraparound emergency -- don't even start an index scan */
+		return false;
+	}
+
 	/* Report that we are now vacuuming indexes */
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 								 PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
@@ -2008,26 +2186,50 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 			vacrel->indstats[idx] =
 				lazy_vacuum_one_index(indrel, istat, vacrel->old_live_tuples,
 									  vacrel);
+
+			if (should_speedup_failsafe(vacrel))
+			{
+				/* Wraparound emergency -- end current index scan */
+				allindexes = false;
+				break;
+			}
 		}
 	}
 	else
 	{
 		/* Outsource everything to parallel variant */
 		do_parallel_lazy_vacuum_all_indexes(vacrel);
+
+		/*
+		 * Do a postcheck to consider applying wraparound failsafe now.  Note
+		 * that parallel VACUUM only gets the precheck and this postcheck.
+		 */
+		if (should_speedup_failsafe(vacrel))
+			allindexes = false;
 	}
 
 	/*
 	 * We delete all LP_DEAD items from the first heap pass in all indexes on
-	 * each call here.  This makes the next call to lazy_vacuum_heap_rel()
-	 * safe.
+	 * each call here (except calls where we choose to do the fail safe).
+	 * This makes the next call to lazy_vacuum_heap_rel() safe (except in the
+	 * event of the fail safe triggering, which prevents the next call from
+	 * taking place).
 	 */
 	Assert(vacrel->num_index_scans > 0 ||
 		   vacrel->dead_tuples->num_tuples == vacrel->lpdead_items);
+	Assert(allindexes || vacrel->do_failsafe_speedup);
 
-	/* Increase and report the number of index scans */
+	/*
+	 * Increase and report the number of index scans.
+	 *
+	 * We deliberately include the case where we started a round of bulk
+	 * deletes that we weren't able to finish due to the fail safe triggering.
+	 */
 	vacrel->num_index_scans++;
 	pgstat_progress_update_param(PROGRESS_VACUUM_NUM_INDEX_VACUUMS,
 								 vacrel->num_index_scans);
+
+	return allindexes;
 }
 
 /*
@@ -2320,6 +2522,76 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
 	return (offnum <= maxoff);
 }
 
+/*
+ * Determine if there is an unacceptable risk of wraparound failure due to the
+ * fact that the ongoing VACUUM is taking too long -- the table that is being
+ * vacuumed should not have a relfrozenxid or relminmxid that is too far in
+ * the past.
+ *
+ * Note that we deliberately don't vary our behavior based on factors like
+ * whether or not the ongoing VACUUM is aggressive.  If it's not aggressive we
+ * probably won't be able to advance relfrozenxid during this VACUUM.  If we
+ * can't, then an anti-wraparound VACUUM should take place immediately after
+ * we finish up.  We should be able to bypass all index vacuuming for the
+ * later anti-wraparound VACUUM.
+ *
+ * If the user-configurable threshold has been crossed then hurry things up:
+ * Stop applying any VACUUM cost delay going forward, and remember to skip any
+ * further index vacuuming (and heap vacuuming, at least in the common case
+ * where table has indexes).
+ *
+ * Return true to inform caller of the emergency.  Otherwise return false.
+ *
+ * Caller is expected to call here before and after vacuuming each index in
+ * the case of two-pass VACUUM, or every VACUUM_FSM_EVERY_PAGES blocks in the
+ * case of no-indexes/one-pass VACUUM.
+ */
+static bool
+should_speedup_failsafe(LVRelState *vacrel)
+{
+	/* Avoid calling vacuum_xid_limit_emergency() very frequently */
+	if (vacrel->num_index_scans == 0 &&
+		vacrel->rel_pages <= BYPASS_EMERGENCY_MIN_PAGES)
+		return false;
+
+	/* Don't warn more than once per VACUUM */
+	if (vacrel->do_failsafe_speedup)
+		return true;
+
+	if (unlikely(vacuum_xid_limit_emergency(vacrel->relfrozenxid,
+											vacrel->relminmxid)))
+	{
+		/*
+		 * Wraparound emergency -- the table's relfrozenxid or relminmxid is
+		 * too far in the past
+		 */
+		Assert(vacrel->do_index_vacuuming);
+		Assert(vacrel->do_index_cleanup);
+
+		vacrel->do_index_vacuuming = false;
+		vacrel->do_index_cleanup = false;
+		vacrel->do_failsafe_speedup = true;
+
+		ereport(WARNING,
+				(errmsg("abandoned index vacuuming of table \"%s.%s.%s\" as a fail safe after %d index scans",
+						get_database_name(MyDatabaseId),
+						vacrel->relnamespace,
+						vacrel->relname,
+						vacrel->num_index_scans),
+				 errdetail("table's relfrozenxid or relminmxid is too far in the past"),
+				 errhint("Consider increasing configuration parameter \"maintenance_work_mem\" or \"autovacuum_work_mem\".\n"
+						 "You might also need to consider other ways for VACUUM to keep up with the allocation of transaction IDs.")));
+
+		/* Stop applying cost limits from this point on */
+		VacuumCostActive = false;
+		VacuumCostBalance = 0;
+
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Perform lazy_vacuum_all_indexes() steps in parallel
  */
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 25465b05dd..43eb84f538 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -62,6 +62,8 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_failsafe_age;
+int			vacuum_multixact_failsafe_age;
 
 
 /* A few variables that don't seem worth passing around as parameters */
@@ -1134,6 +1136,68 @@ vacuum_set_xid_limits(Relation rel,
 	}
 }
 
+/*
+ * vacuum_xid_limit_emergency() -- Used by VACUUM's fail safe emergency
+ * wraparound mechanism to determine if its table's relfrozenxid and
+ * relminmxid now are dangerously far in the past.
+ *
+ * When we return true, VACUUM caller will take extraordinary measures to
+ * avoid wraparound failure.
+ *
+ * Input parameters are the target relation's relfrozenxid and relminmxid.
+ */
+bool
+vacuum_xid_limit_emergency(TransactionId relfrozenxid, MultiXactId relminmxid)
+{
+	TransactionId xid_skip_limit;
+	MultiXactId	  multi_skip_limit;
+	int			  skip_index_vacuum;
+
+	Assert(TransactionIdIsNormal(relfrozenxid));
+	Assert(MultiXactIdIsValid(relminmxid));
+
+	/*
+	 * Determine the index skipping age to use. In any case not less than
+	 * autovacuum_freeze_max_age * 1.05, so that VACUUM always does an
+	 * aggressive scan.
+	 */
+	skip_index_vacuum = Max(vacuum_failsafe_age, autovacuum_freeze_max_age * 1.05);
+
+	xid_skip_limit = ReadNextTransactionId() - skip_index_vacuum;
+	if (!TransactionIdIsNormal(xid_skip_limit))
+		xid_skip_limit = FirstNormalTransactionId;
+
+	if (TransactionIdPrecedes(relfrozenxid, xid_skip_limit))
+	{
+		/* The table's relfrozenxid is too old */
+		return true;
+	}
+
+	/*
+	 * Similar to above, determine the index skipping age to use for multixact.
+	 * In any case not less than autovacuum_multixact_freeze_max_age * 1.05.
+	 */
+	skip_index_vacuum = Max(vacuum_multixact_failsafe_age,
+							autovacuum_multixact_freeze_max_age * 1.05);
+
+	/*
+	 * Compute the multixact age beyond which the failsafe triggers.  As
+	 * above, the effective limit is never less than
+	 * autovacuum_multixact_freeze_max_age * 1.05.
+	 */
+	multi_skip_limit = ReadNextMultiXactId() - skip_index_vacuum;
+	if (multi_skip_limit < FirstMultiXactId)
+		multi_skip_limit = FirstMultiXactId;
+
+	if (MultiXactIdPrecedes(relminmxid, multi_skip_limit))
+	{
+		/* The table's relminmxid is too old */
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * vac_estimate_reltuples() -- estimate the new value for pg_class.reltuples
  *
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c9c9da85f3..46a48ecbe1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2647,6 +2647,26 @@ static struct config_int ConfigureNamesInt[] =
 		0, 0, 1000000,		/* see ComputeXidHorizons */
 		NULL, NULL, NULL
 	},
+	{
+		{"vacuum_failsafe_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Age at which VACUUM should trigger failsafe to avoid a wraparound outage."),
+			NULL
+		},
+		&vacuum_failsafe_age,
+		/* This upper-limit can be 1.05 of autovacuum_freeze_max_age */
+		1800000000, 0, 2100000000,
+		NULL, NULL, NULL
+	},
+	{
+		{"vacuum_multixact_failsafe_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Multixact age at which VACUUM should trigger failsafe to avoid a wraparound outage."),
+			NULL
+		},
+		&vacuum_multixact_failsafe_age,
+		/* This upper-limit can be 1.05 of autovacuum_multixact_freeze_max_age */
+		1800000000, 0, 2100000000,
+		NULL, NULL, NULL
+	},
 
 	/*
 	 * See also CheckRequiredParameterValues() if this parameter changes
@@ -3247,7 +3267,10 @@ static struct config_int ConfigureNamesInt[] =
 			NULL
 		},
 		&autovacuum_freeze_max_age,
-		/* see pg_resetwal if you change the upper-limit value */
+		/*
+		 * see pg_resetwal and vacuum_failsafe_age if you change the
+		 * upper-limit value.
+		 */
 		200000000, 100000, 2000000000,
 		NULL, NULL, NULL
 	},
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 39da7cc942..445f696826 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -675,6 +675,8 @@
 #vacuum_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_freeze_table_age = 150000000
+#vacuum_failsafe_age = 1800000000
+#vacuum_multixact_failsafe_age = 1800000000
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index effc60c07b..a772a5cda9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8605,6 +8605,39 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-failsafe-age" xreflabel="vacuum_failsafe_age">
+      <term><varname>vacuum_failsafe_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_failsafe_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum age (in transactions) that a table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        field can attain before <command>VACUUM</command> takes
+        extraordinary measures to avoid system-wide transaction ID
+        wraparound failure.  This is <command>VACUUM</command>'s
+        strategy of last resort.  The fail safe typically triggers
+        when an autovacuum to prevent transaction ID wraparound has
+        already been running for some time, though it's possible for
+        the fail safe to trigger during any <command>VACUUM</command>.
+       </para>
+       <para>
+        When the fail safe is triggered, any cost-based delay that is
+        in effect will no longer be applied, and further non-essential
+        maintenance tasks (such as index vacuuming) are bypassed.
+       </para>
+       <para>
+        The default is 1.8 billion transactions.  Although users can
+        set this value anywhere from zero to 2.1 billion,
+        <command>VACUUM</command> will silently adjust the effective
+        value to no less than 105% of <xref
+         linkend="guc-autovacuum-freeze-max-age"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-multixact-freeze-table-age" xreflabel="vacuum_multixact_freeze_table_age">
       <term><varname>vacuum_multixact_freeze_table_age</varname> (<type>integer</type>)
       <indexterm>
@@ -8651,6 +8684,39 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-multixact-failsafe-age" xreflabel="vacuum_multixact_failsafe_age">
+      <term><varname>vacuum_multixact_failsafe_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_multixact_failsafe_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum age (in multixacts) that a table's
+        <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+        field can attain before <command>VACUUM</command> takes
+        extraordinary measures to avoid system-wide multixact ID
+        wraparound failure.  This is <command>VACUUM</command>'s
+        strategy of last resort.  The fail safe typically triggers
+        when an autovacuum to prevent transaction ID wraparound has
+        already been running for some time, though it's possible for
+        the fail safe to trigger during any <command>VACUUM</command>.
+       </para>
+       <para>
+        When the fail safe is triggered, any cost-based delay that is
+        in effect will no longer be applied, and further non-essential
+        maintenance tasks (such as index vacuuming) are bypassed.
+       </para>
+       <para>
+        The default is 1.8 billion multixacts.  Although users can set
+        this value anywhere from zero to 2.1 billion,
+        <command>VACUUM</command> will silently adjust the effective
+        value to no less than 105% of <xref
+         linkend="guc-autovacuum-multixact-freeze-max-age"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-bytea-output" xreflabel="bytea_output">
       <term><varname>bytea_output</varname> (<type>enum</type>)
       <indexterm>
-- 
2.27.0

#112Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#111)
1 attachment(s)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Apr 7, 2021 at 12:16 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Apr 6, 2021 at 7:05 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

If you have updated patches, I'll try to check them this evening (CEST).

Here is v11, which is not too different from v10 as far as the
truncation stuff goes.

Masahiko should take a look at the last patch again. I renamed the
GUCs to reflect the fact that we do everything possible to advance
relfrozenxid in the case where the fail safe mechanism kicks in -- not
just skipping index vacuuming. It also incorporates your most recent
round of feedback.

Thank you for updating the patches!

I've done the final round of review:

+       /*
+        * Before beginning heap scan, check if it's already necessary to apply
+        * fail safe speedup
+        */
+       should_speedup_failsafe(vacrel);

Maybe we can call it at an earlier point, for example before
lazy_space_alloc()? That way, we will not need to enable parallelism
if we know it's already an emergency situation.

---
+               msgfmt = _(" %u pages from table (%.2f%% of total) had %lld dead item identifiers removed\n");
+
+               if (vacrel->nindexes == 0 || (vacrel->do_index_vacuuming &&
+                                             vacrel->num_index_scans == 0))
+                   appendStringInfo(&buf, _("index scan not needed:"));
+               else if (vacrel->do_index_vacuuming && vacrel->num_index_scans > 0)
+                   appendStringInfo(&buf, _("index scan needed:"));
+               else
+               {
+                   msgfmt = _(" %u pages from table (%.2f%% of total) have %lld dead item identifiers\n");
+
+                   if (!vacrel->do_failsafe_speedup)
+                       appendStringInfo(&buf, _("index scan bypassed:"));
+                   else
+                       appendStringInfo(&buf, _("index scan bypassed due to emergency:"));
+               }
+               appendStringInfo(&buf, msgfmt,
+                                vacrel->lpdead_item_pages,
+                                100.0 * vacrel->lpdead_item_pages / vacrel->rel_pages,
+                                (long long) vacrel->lpdead_items);

I think we can make it cleaner if we check vacrel->do_index_vacuuming first.

I've attached a patch with the proposed changes for the above points;
it can be applied on top of the 0002 patch. Please feel free to adopt
or reject it.

For the 0001 patch, we call PageTruncateLinePointerArray() only in the
second pass over the heap. I think we should note that the second pass is
performed only when we found/made LP_DEAD items on the page. That is, if all
dead tuples have been marked LP_UNUSED by HOT pruning, the page is not
processed by the second pass, so the LP_UNUSED items at the end of its line
pointer array are never removed. So I think we can call it in this case
too, i.e., when lpdead_items is 0 and tuples_deleted > 0 in
lazy_scan_prune().
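
A rough sketch of the idea (illustrative only, not part of the attached
patch; variable names follow the surrounding lazy_scan_prune() code, and a
real change would also have to mark the buffer dirty and WAL-log the
truncation, which is glossed over here):

    /* Hypothetical addition near the end of lazy_scan_prune() */
    if (lpdead_items == 0 && tuples_deleted > 0)
    {
        /*
         * HOT pruning turned every dead tuple into LP_UNUSED, so the page
         * will never be visited by the second heap pass.  Shorten the line
         * pointer array here instead, so that trailing LP_UNUSED items do
         * not linger forever.
         */
        PageTruncateLinePointerArray(page);
    }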

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachments:

fix_proposal.patchapplication/octet-stream; name=fix_proposal.patchDownload
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a27cdf1eb0..4be2f167bf 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -767,13 +767,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 							 (long long) VacuumPageDirty);
 			if (vacrel->rel_pages > 0)
 			{
-				msgfmt = _(" %u pages from table (%.2f%% of total) had %lld dead item identifiers removed\n");
+				if (vacrel->do_index_vacuuming)
+				{
+					msgfmt = _(" %u pages from table (%.2f%% of total) had %lld dead item identifiers removed\n");
 
-				if (vacrel->nindexes == 0 || (vacrel->do_index_vacuuming &&
-											  vacrel->num_index_scans == 0))
-					appendStringInfo(&buf, _("index scan not needed:"));
-				else if (vacrel->do_index_vacuuming && vacrel->num_index_scans > 0)
-					appendStringInfo(&buf, _("index scan needed:"));
+					if (vacrel->nindexes == 0 || vacrel->num_index_scans == 0)
+						appendStringInfo(&buf, _("index scan not needed:"));
+					else
+						appendStringInfo(&buf, _("index scan needed:"));
+				}
 				else
 				{
 					msgfmt = _(" %u pages from table (%.2f%% of total) have %lld dead item identifiers\n");
@@ -928,6 +930,12 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	vacrel->indstats = (IndexBulkDeleteResult **)
 		palloc0(vacrel->nindexes * sizeof(IndexBulkDeleteResult *));
 
+	/*
+	 * Before beginning scan, check if it's already necessary to apply fail
+	 * safe speedup
+	 */
+	should_speedup_failsafe(vacrel);
+
 	/*
 	 * Allocate the space for dead tuples.  Note that this handles parallel
 	 * VACUUM initialization as part of allocating shared memory space used
@@ -1015,12 +1023,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	else
 		skipping_blocks = false;
 
-	/*
-	 * Before beginning heap scan, check if it's already necessary to apply
-	 * fail safe speedup
-	 */
-	should_speedup_failsafe(vacrel);
-
 	for (blkno = 0; blkno < nblocks; blkno++)
 	{
 		Buffer		buf;
@@ -3445,7 +3447,7 @@ lazy_space_alloc(LVRelState *vacrel, int nworkers, BlockNumber nblocks)
 	 * be used for an index, so we invoke parallelism only if there are at
 	 * least two indexes on a table.
 	 */
-	if (nworkers >= 0 && vacrel->nindexes > 1)
+	if (nworkers >= 0 && vacrel->nindexes > 1 && vacrel->do_index_vacuuming)
 	{
 		/*
 		 * Since parallel workers cannot access data in temporary tables, we
#113Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Geoghegan (#107)
Re: New IndexAM API controlling index vacuum strategies

On Tue, Apr 6, 2021 at 5:49 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Apr 5, 2021 at 5:09 PM Peter Geoghegan <pg@bowt.ie> wrote:

Oh yeah. "static BufferAccessStrategy vac_strategy" is guaranteed to
be initialized to 0, simply because it's static and global. That
explains it.

So do we need to allocate a strategy in workers now, or leave things
as they are/were?

I'm going to go ahead with pushing my commit to do that now, just to
get the buildfarm green. It's still a bug in Postgres 13, albeit a
less serious one than I first suspected.

I have started a separate thread [1]/messages/by-id/CAA4eK1KbmJgRV2W3BbzRnKUSrukN7SbqBBriC4RDB5KBhopkGQ@mail.gmail.com to fix this in PG-13.

[1]: /messages/by-id/CAA4eK1KbmJgRV2W3BbzRnKUSrukN7SbqBBriC4RDB5KBhopkGQ@mail.gmail.com

--
With Regards,
Amit Kapila.

#114Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#112)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Apr 7, 2021 at 12:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Thank you for updating the patches!

I've done the final round of review:

All of the changes from your fixup patch are clear improvements, and
so I'll include them in the final commit. Thanks!

For the 0001 patch, we call PageTruncateLinePointerArray() only in the
second pass over the heap. I think we should note that the second pass is
performed only when we found/made LP_DEAD items on the page. That is, if all
dead tuples have been marked LP_UNUSED by HOT pruning, the page is not
processed by the second pass, so the LP_UNUSED items at the end of its line
pointer array are never removed. So I think we can call it in this case
too, i.e., when lpdead_items is 0 and tuples_deleted > 0 in
lazy_scan_prune().

Maybe it would be beneficial to do that, but I haven't done it in the
version of the patch that I just pushed. We have run out of time to
consider calling PageTruncateLinePointerArray() in more places. I
think that the most important thing is that we have *some* protection
against line pointer bloat.

--
Peter Geoghegan

#115Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#114)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Apr 7, 2021 at 8:52 AM Peter Geoghegan <pg@bowt.ie> wrote:

All of the changes from your fixup patch are clear improvements, and
so I'll include them in the final commit. Thanks!

I did change the defaults of the GUCs to 1.6 billion, though.

All patches in the patch series have been pushed. Hopefully I will not
be the next person to break the buildfarm today.

Thanks Masahiko, and everybody else involved!
--
Peter Geoghegan

#116Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#115)
Re: New IndexAM API controlling index vacuum strategies

On Thu, Apr 8, 2021 at 8:41 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Apr 7, 2021 at 8:52 AM Peter Geoghegan <pg@bowt.ie> wrote:

All of the changes from your fixup patch are clear improvements, and
so I'll include them in the final commit. Thanks!

I did change the defaults of the GUCs to 1.6 billion, though.

Okay.

All patches in the patch series have been pushed. Hopefully I will not
be the next person to break the buildfarm today.

Thanks Masahiko, and everybody else involved!

Thank you, too!

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#117Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#116)
Re: New IndexAM API controlling index vacuum strategies

On Thu, Apr 8, 2021 at 11:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Apr 8, 2021 at 8:41 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Apr 7, 2021 at 8:52 AM Peter Geoghegan <pg@bowt.ie> wrote:

All of the changes from your fixup patch are clear improvements, and
so I'll include them in the final commit. Thanks!

I realized that when the failsafe is triggered, we don't bypass heap
truncation that is performed before updating relfrozenxid. I think
it's better to bypass it too. What do you think?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#118Peter Geoghegan
pg@bowt.ie
In reply to: Masahiko Sawada (#117)
Re: New IndexAM API controlling index vacuum strategies

On Mon, Apr 12, 2021 at 11:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I realized that when the failsafe is triggered, we don't bypass heap
truncation that is performed before updating relfrozenxid. I think
it's better to bypass it too. What do you think?

I agree. Heap truncation is exactly the kind of thing that risks
adding significant, unpredictable delay at a time when we need to
advance relfrozenxid as quickly as possible.

I pushed a trivial commit that makes the failsafe bypass heap
truncation as well just now.

Thanks
--
Peter Geoghegan

#119Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#118)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Apr 14, 2021 at 4:59 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Apr 12, 2021 at 11:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I realized that when the failsafe is triggered, we don't bypass heap
truncation that is performed before updating relfrozenxid. I think
it's better to bypass it too. What do you think?

I agree. Heap truncation is exactly the kind of thing that risks
adding significant, unpredictable delay at a time when we need to
advance relfrozenxid as quickly as possible.

I pushed a trivial commit that makes the failsafe bypass heap
truncation as well just now.

Great, thanks!

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#120Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#118)
Re: New IndexAM API controlling index vacuum strategies

Hi,

On 2021-04-13 12:59:03 -0700, Peter Geoghegan wrote:

I agree. Heap truncation is exactly the kind of thing that risks
adding significant, unpredictable delay at a time when we need to
advance relfrozenxid as quickly as possible.

I pushed a trivial commit that makes the failsafe bypass heap
truncation as well just now.

I'm getting a bit bothered by the speed at which you're pushing fairly
substantial behavioural changes for vacuum. In this case without even a warning
that you're about to do so.

I don't think it's that blindingly obvious that skipping truncation is
the right thing to do that it doesn't need review. Consider e.g. the
case that you're close to wraparound because you ran out of space for
the amount of WAL VACUUM produces, previously leading to autovacuums
being aborted / the server restarted. The user might then stop regular
activity and try to VACUUM. Skipping the truncation might now make it
harder to actually vacuum all the tables without running out of space.

FWIW, I also don't like that substantial behaviour changes to how vacuum
works were discussed only in a thread titled "New IndexAM API
controlling index vacuum strategies".

Greetings,

Andres Freund

#121Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#120)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Apr 14, 2021 at 12:33 PM Andres Freund <andres@anarazel.de> wrote:

I'm getting a bit bothered by the speed at which you're pushing fairly
substantial behavioural changes for vacuum. In this case without even a warning
that you're about to do so.

To a large degree the failsafe is something that is written in the
hope that it will never be needed. This is unlike most other things,
and has its own unique risks.

I think that the proper thing to do is to accept a certain amount of
risk in this area. The previous status quo was *appalling*, and so it
seems very unlikely that the failsafe hasn't mostly eliminated a lot
of risk for users. That factor is not everything, but it should count
for a lot. The only way that we're going to have total confidence in
anything like this is through the experience of it mostly working over
several releases.

I don't think it's that blindingly obvious that skipping truncation is
the right thing to do that it doesn't need review. Consider e.g. the
case that you're close to wraparound because you ran out of space for
the amount of WAL VACUUM produces, previously leading to autovacuums
being aborted / the server restarted. The user might then stop regular
activity and try to VACUUM. Skipping the truncation might now make it
harder to actually vacuum all the tables without running out of space.

Note that the criterion for whether or not "hastup=false" applies to a
page is slightly different in lazy_scan_prune() -- I added a comment that
points this out directly (the fact that it works that way is not new,
and might have originally been a happy mistake). Unlike
count_nondeletable_pages(), which is used by heap truncation,
lazy_scan_prune() is concerned about whether or not it's *likely to be
possible* to truncate away the page by the time lazy_truncate_heap()
is reached (if it is reached at all). And so it's optimistic about
LP_DEAD items that it observes being removed by
lazy_vacuum_heap_page() before we get to lazy_truncate_heap(). It's
inherently race-prone anyway, so it might as well assume that LP_DEAD
items will eventually become LP_UNUSED items later on.
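
To make that contrast concrete, here is a rough paraphrase of the two
checks (not the literal source, with the surrounding loops elided):

    /* count_nondeletable_pages(): any used line pointer blocks truncation */
    if (ItemIdIsUsed(itemid))           /* LP_NORMAL, LP_REDIRECT, or LP_DEAD */
        hastup = true;

    /* lazy_scan_prune(): assume LP_DEAD will become LP_UNUSED later on */
    if (ItemIdIsDead(itemid))
        continue;                       /* deliberately does not set hastup */
    if (ItemIdIsNormal(itemid) || ItemIdIsRedirected(itemid))
        prunestate->hastup = true;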

It follows that the chances of lazy_truncate_heap() failing to
truncate anything when the failsafe has already triggered are
exceptionally high -- all the LP_DEAD items are still there, and
cannot be safely removed during truncation (for the usual reasons). I
just went one step further than that in this recent commit. I didn't
point these details out before now because (to me) this is beside the
point. Which is that the failsafe is just that -- a failsafe. Anything
that adds unnecessary unpredictable delay in reaching the point of
advancing relfrozenxid should be avoided. (Besides, the design of
should_attempt_truncation() and lazy_truncate_heap() is very far from
guaranteeing that truncation will take place at the best of times.)

FWIW, my intention is to try to get as much feedback about the
failsafe as I possibly can -- it's hard to reason about exceptional
events. I'm also happy to further discuss the specifics with you now.

--
Peter Geoghegan

#122Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#121)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Apr 14, 2021 at 5:55 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Apr 14, 2021 at 12:33 PM Andres Freund <andres@anarazel.de> wrote:

I'm getting a bit bothered by the speed at which you're pushing fairly
substantial behavioural changes for vacuum. In this case without even a warning
that you're about to do so.

To a large degree the failsafe is something that is written in the
hope that it will never be needed. This is unlike most other things,
and has its own unique risks.

I think that the proper thing to do is to accept a certain amount of
risk in this area. The previous status quo was *appalling*, and so it
seems very unlikely that the failsafe hasn't mostly eliminated a lot
of risk for users. That factor is not everything, but it should count
for a lot. The only way that we're going to have total confidence in
anything like this is through the experience of it mostly working over
several releases.

I think this is largely missing the point Andres was making, which is
that you made a significant behavior change after feature freeze
without any real opportunity for discussion. More generally, you've
changed a bunch of other stuff relatively quickly based on input from
a relatively limited number of people. Now, it's fair to say that it's
often hard to get input on things, and sometimes you have to just take
your best shot and hope you're right. But in this particular case, you
didn't even try to get broader participation or buy-in. That's not
good.

--
Robert Haas
EDB: http://www.enterprisedb.com

#123Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#122)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Apr 14, 2021 at 5:08 PM Robert Haas <robertmhaas@gmail.com> wrote:

I think this is largely missing the point Andres was making, which is
that you made a significant behavior change after feature freeze
without any real opportunity for discussion.

I don't believe that it was a significant behavior change, for the
reason I gave: the fact of the matter is that it's practically
impossible for us to truncate the heap anyway, provided we have
already decided to not vacuum (as opposed to prune) heap pages that
almost certainly have some LP_DEAD items in them. Note that later heap
pages are the most likely to still have some LP_DEAD items once the
failsafe triggers, which are precisely the ones that will affect
whether or not we can truncate the whole heap.

I accept that I could have done better with the messaging. I'll try to
avoid repeating that mistake in the future.

More generally, you've
changed a bunch of other stuff relatively quickly based on input from
a relatively limited number of people. Now, it's fair to say that it's
often hard to get input on things, and sometimes you have to just take
your best shot and hope you're right.

I agree in general, and I agree that that's what I've done in this
instance. It goes without saying, but I'll say it anyway: I accept
full responsibility.

But in this particular case, you
didn't even try to get broader participation or buy-in. That's not
good.

I will admit to being somewhat burned out by this project. That might
have been a factor.

--
Peter Geoghegan

#124Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#122)
Re: New IndexAM API controlling index vacuum strategies

Hi,

On 2021-04-14 20:08:10 -0400, Robert Haas wrote:

On Wed, Apr 14, 2021 at 5:55 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Apr 14, 2021 at 12:33 PM Andres Freund <andres@anarazel.de> wrote:

I'm getting a bit bothered by the speed at which you're pushing fairly
substantial behavioural changes for vacuum. In this case without even a warning
that you're about to do so.

To a large degree the failsafe is something that is written in the
hope that it will never be needed. This is unlike most other things,
and has its own unique risks.

I think that the proper thing to do is to accept a certain amount of
risk in this area. The previous status quo was *appalling*, and so it
seems very unlikely that the failsafe hasn't mostly eliminated a lot
of risk for users. That factor is not everything, but it should count
for a lot. The only way that we're going to have total confidence in
anything like this is through the experience of it mostly working over
several releases.

I think this is largely missing the point Andres was making, which is
that you made a significant behavior change after feature freeze
without any real opportunity for discussion. More generally, you've
changed a bunch of other stuff relatively quickly based on input from
a relatively limited number of people. Now, it's fair to say that it's
often hard to get input on things, and sometimes you have to just take
your best shot and hope you're right. But in this particular case, you
didn't even try to get broader participation or buy-in. That's not
good.

Yep, that was what I was trying to get at.

- Andres

#125Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#121)
Re: New IndexAM API controlling index vacuum strategies

Hi,

On 2021-04-14 14:55:36 -0700, Peter Geoghegan wrote:

On Wed, Apr 14, 2021 at 12:33 PM Andres Freund <andres@anarazel.de> wrote:

I'm getting a bit bothered by the speed at which you're pushing fairly
substantial behavioural changes for vacuum. In this case without even a warning
that you're about to do so.

To a large degree the failsafe is something that is written in the
hope that it will never be needed. This is unlike most other things,
and has its own unique risks.

Among them that the code is not covered by tests and is unlikely to be
meaningfully exercised within the beta timeframe due to the timeframes
for hitting it (hard to actually hit below a 1/2 day running extreme
workloads, weeks for more realistic ones). Which means that this code
has to be extra vigorously reviewed, not the opposite. Or at least
tests for it should be added (pg_resetwal + autovacuum_naptime=1s or
such should make that doable, or even just running a small test with
lower thresholds).

I just went one step further than that in this recent commit. I didn't
point these details out before now because (to me) this is beside the
point. Which is that the failsafe is just that -- a failsafe. Anything
that adds unnecessary unpredictable delay in reaching the point of
advancing relfrozenxid should be avoided. (Besides, the design of
should_attempt_truncation() and lazy_truncate_heap() is very far from
guaranteeing that truncation will take place at the best of times.)

This line of argumentation scares me. Unexplained arguments about
running in conditions that we otherwise don't run in, during
exceptional circumstances. This code has a history of super subtle
interactions, with quite a few data-loss-causing bugs due to us not
foreseeing some combination of circumstances.

I think there are good arguments for having logic for an "emergency
vacuum" mode (and also some good ones against). I'm not convinced that
the current set of things that are [not] skipped in failsafe mode is the
"obviously right set of things"™ but am convinced that there wasn't
enough consensus building on what that set of things is. This all also
would be different if it were the start of the development window,
rather than the end.

In my experience the big problem with vacuums in a wraparound situation
isn't actually things like truncation or even the index scans (although
they certainly can cause bad problems), but that VACUUM modifies
(prune/vacuum and WAL log or just setting hint bits) a crapton of pages
that don't actually need to be modified just to be able to get out of
the wraparound situation. And that the overhead of writing out all those
dirty pages + WAL logging causes the VACUUM to take unacceptably
long. E.g. because your storage is cloud storage with a few ms of
latency, and the ringbuffer + wal_buffer sizes cause so many synchronous
writes that you end up with < 10MB/s of data being processed.

I think there's also a clear danger in having "cliffs" where the
behaviour changes abruptly once a certain threshold is reached. It's
not unlikely for systems to fall over entirely when

a) autovacuum cost limiting is disabled. E.g. reaching your disk
iops/throughput quota and barely being able to log into postgres
anymore to kill the stuck connection causing the wraparound issue.

b) No index cleanup happens anymore. E.g. a workload with a lot of
bitmap index scans (which do not support killtuples) could end up a
lot worse off because index pointers to dead tuples aren't being
cleaned up. In cases where an old transaction or leftover replication
slot is causing the problem (together a significant percentage of
wraparound situations) this situation will persist across repeated
(explicit or automatic) vacuums for a table, because relfrozenxid
won't actually be advanced. And this in turn might actually end up
slowing resolution of the wraparound issue more than doing all the
index scans.

Because this is a hard cliff rather than something that phases in, it's
not really possible for a user to see this slowly getting worse and
address the issue. Especially for a) this could be addressed by not
turning off cost limiting all at once, but instead having it decrease
the closer you get to some limit.
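
To illustrate what I mean by phasing it in (purely hypothetical, not
from any patch -- all names below are made up), the delay could be
scaled down gradually instead of being zeroed in a single step:

    #include <stdint.h>

    /*
     * Hypothetical helper: ramp the cost delay down linearly once the
     * table's XID age passes half of the failsafe age, reaching zero at
     * the failsafe threshold itself.
     */
    static double
    scaled_cost_delay(double base_delay_ms, uint64_t table_age,
                      uint64_t failsafe_age)
    {
        if (table_age >= failsafe_age)
            return 0.0;             /* emergency: no throttling at all */
        if (table_age <= failsafe_age / 2)
            return base_delay_ms;   /* far from trouble: normal throttling */

        /* linear ramp between failsafe_age / 2 and failsafe_age */
        return base_delay_ms * (double) (failsafe_age - table_age) /
            (double) (failsafe_age - failsafe_age / 2);
    }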

Greetings,

Andres Freund

#126Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#125)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Apr 14, 2021 at 6:53 PM Andres Freund <andres@anarazel.de> wrote:

To a large degree the failsafe is something that is written in the
hope that it will never be needed. This is unlike most other things,
and has its own unique risks.

Among them that the code is not covered by tests and is unlikely to be
meaningfully exercised within the beta timeframe due to the timeframes
for hitting it (hard to actually hit below a 1/2 day running extreme
workloads, weeks for more realistic ones). Which means that this code
has to be extra vigorously reviewed, not the opposite.

There is test coverage for the optimization to bypass index vacuuming
with very few dead tuples. Plus we can expect it to kick in very
frequently. That's not as good as testing this other mechanism
directly, which I agree ought to happen too. But the difference
between those two cases is pretty much entirely around when and how
they kick in. I have normalized the idea that index vacuuming is
optional in principle, so in an important sense it is tested all the
time.

Or at least
tests for it should be added (pg_resetwal + autovacuum_naptime=1s or
such should make that doable, or even just running a small test with
lower thresholds).

You know what else doesn't have test coverage? Any kind of aggressive
VACUUM. There is a problem with our culture around testing. I would
like to address that in the scope of this project, but you know how it
is. Can I take it that I'll have your support with adding those tests?

This line of argumentation scares me. Unexplained arguments about
running in conditions that we otherwise don't run in, during
exceptional circumstances. This code has a history of super subtle
interactions, with quite a few data-loss-causing bugs due to us not
foreseeing some combination of circumstances.

I'll say it again: I was wrong to not make that clearer prior to
committing the fixup. I regret that error, which probably had a lot to
do with being fatigued.

I think there are good arguments for having logic for an "emergency
vacuum" mode (and also some good ones against). I'm not convinced that
the current set of things that are [not] skipped in failsafe mode is the
"obviously right set of things"™ but am convinced that there wasn't
enough consensus building on what that set of things is. This all also
would be different if it were the start of the development window,
rather than the end.

I all but begged you to review the patches. Same with Robert. While
the earlier patches (where almost all of the complexity is) did get
review from both you and Robert (which I was grateful to receive), for
whatever reason neither of you looked at the later patches in detail.
(Robert said that the failsafe ought to cover single-pass/no-indexes
VACUUM at one point, which did influence the design of the failsafe,
but for the most part his input on the later stuff was minimal and
expressed in general terms.)

Of course, nothing stops us from improving the mechanism in the
future. Though I maintain that the fundamental approach of finishing
as quickly as possible is basically sound (short of fixing the problem
directly, for example by obviating the need for freezing).

In my experience the big problem with vacuums in a wraparound situation
isn't actually things like truncation or even the index scans (although
they certainly can cause bad problems), but that VACUUM modifies
(prune/vacuum and WAL log or just setting hint bits) a crapton of pages
that don't actually need to be modified just to be able to get out of
the wraparound situation.

Things like UUID indexes are very popular, and are likely to have an
outsized impact on dirtying pages (which I agree is the real problem).
Plus some people just have a ridiculous amount of indexes (e.g., the
Discourse table that they pointed out as a particularly good target
for deduplication had a total of 13 indexes). There is an excellent
chance that stuff like that is involved in installations that actually
have huge problems. The visibility map works pretty well these days --
but not for indexes.

And that the overhead of writing out all those
dirty pages + WAL logging causes the VACUUM to take unacceptably
long. E.g. because your storage is cloud storage with a few ms of
latency, and the ringbuffer + wal_buffer sizes cause so many synchronous
writes that you end up with < 10MB/s of data being processed.

This is a false dichotomy. There probably is an argument for making
the failsafe not do pruning that isn't strictly necessary (or
something like that) in a future release. I don't see what particular
significance that has for the failsafe mechanism now. The sooner we
can advance relfrozenxid when it's dangerously far in the past, the
better. It's true that the mechanism doesn't exploit every possible
opportunity to do so. But it mostly manages to do that.

I think there's also a clear danger in having "cliffs" where the
behaviour changes abruptly once a certain threshold is reached. It's
not unlikely for systems to fall over entirely when

a) autovacuum cost limiting is disabled. E.g. reaching your disk
iops/throughput quota and barely being able to log into postgres
anymore to kill the stuck connection causing the wraparound issue.

Let me get this straight: You're concerned that hurrying up vacuuming
when we have 500 million XIDs left to burn will overwhelm the system,
which would presumably have finished in time otherwise? Even though it
would have to do way more work in absolute terms in the absence of the
failsafe? And even though the 1.6 billion XID age that we got to
before the failsafe kicked in was clearly not enough? You'd want to
"play it safe", and stick with the original plan at that point?

b) No index cleanup happens anymore. E.g. a workload with a lot of
bitmap index scans (which do not support killtuples) could end up a
lot worse off because index pointers to dead tuples aren't being
cleaned up. In cases where an old transaction or leftover replication
slot is causing the problem (together a significant percentage of
wraparound situations) this situation will persist across repeated
(explicit or automatic) vacuums for a table, because relfrozenxid
won't actually be advanced. And this in turn might actually end up
slowing resolution of the wraparound issue more than doing all the
index scans.

If it's intrinsically impossible to advance relfrozenxid, then surely
all bets are off. But even in this scenario it's very unlikely that we
wouldn't at least do index vacuuming for those index tuples that are
dead and safe to delete according to the OldestXmin cutoff. You still
have 1.6 billion XIDs before the failsafe first kicks in, regardless
of the issue of the OldestXmin/FreezeLimit being excessively far in
the past.

You're also not acknowledging the benefit of avoiding uselessly
scanning the indexes again and again, which is mostly what would be
happening in this scenario. Maybe VACUUM shouldn't spin like this at
all, but that's not a new problem.

Because this is a hard cliff rather than something that phases in, it's
not really possible for a user to see this slowly getting worse and
address the issue. Especially for a) this could be addressed by not
turning off cost limiting all at once, but instead having it decrease
the closer you get to some limit.

There is a lot to be said for keeping the behavior as simple as
possible. You said so yourself. In any case I think that the perfect
should not be the enemy of the good (or the better, at least).

--
Peter Geoghegan

#127Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#126)
Re: New IndexAM API controlling index vacuum strategies

Hi,

On 2021-04-14 19:53:29 -0700, Peter Geoghegan wrote:

Or at least
tests for it should be added (pg_resetwal + autovacuum_naptime=1s or
such should make that doable, or even just running a small test with
lower thresholds).

You know what else doesn't have test coverage? Any kind of aggressive
VACUUM. There is a problem with our culture around testing. I would
like to address that in the scope of this project, but you know how it
is. Can I take it that I'll have your support with adding those tests?

Sure!

I think there are good arguments for having logic for an "emergency
vacuum" mode (and also some good ones against). I'm not convinced that
the current set of things that are [not] skipped in failsafe mode is the
"obviously right set of things"™ but am convinced that there wasn't
enough consensus building on what that set of things is. This all also
would be different if it were the start of the development window,
rather than the end.

I all but begged you to review the patches. Same with Robert. While
the earlier patches (where almost all of the complexity is) did get
review from both you and Robert (which I was grateful to receive), for
whatever reason neither of you looked at the later patches in detail.

Based on a quick scan of the thread, the first version of a patch that
kind of resembles what got committed around the topic at hand seems to
be /messages/by-id/CAH2-Wzm7Y=_g3FjVHv7a85AfUbuSYdggDnEqN1hodVeOctL+Ow@mail.gmail.com
posted 2021-03-15. That's well into the last CF.

The reason I didn't do further reviews for things in this thread was
that I was trying really hard to get the shared memory stats patch into
a committable shape - there were just not enough hours in the day. I
think it's to be expected that, during the final CF, there aren't a lot
of resources for reviewing patches that are substantially new. Why
should these new patches have gotten priority over a much older patch
set that also addresses significant operational issues?

I think there's also a clear danger in having "cliffs" where the
behaviour changes abruptly once a certain threshold is reached. It's
not unlikely for systems to fall over entirely when

a) autovacuum cost limiting is disabled. E.g. reaching your disk
iops/throughput quota and barely being able to log into postgres
anymore to kill the stuck connection causing the wraparound issue.

Let me get this straight: You're concerned that hurrying up vacuuming
when we have 500 million XIDs left to burn will overwhelm the system,
which would presumably have finished in time otherwise?
Even though it would have to do way more work in absolute terms in the
absence of the failsafe? And even though the 1.6 billion XID age that
we got to before the failsafe kicked in was clearly not enough? You'd
want to "play it safe", and stick with the original plan at that
point?

It's very common for larger / busier databases to *substantially*
increase autovacuum_freeze_max_age, so there won't be 1.6 billion XIDs
of headroom, but a few hundred million. The cost of doing unnecessary
anti-wraparound vacuums is just too great. And databases on the busier &
larger side of things are precisely the ones that are more likely to hit
wraparound issues (otherwise you're just not that likely to burn through
that many xids).

And my concern isn't really that vacuum would have finished without a
problem if cost limiting hadn't been disabled, but that having multiple
autovacuum workers going all out will cause problems. Like the system
slowing down so much that it's hard to fix the actual root cause of the
wraparound - I've seen systems with a bunch of unthrottled autovacuum
workers overwhelm the IO subsystem so much that simply opening a connection to
fix the issue took 10+ minutes. Especially on systems with provisioned
IO (i.e. just about all cloud storage) that's not too hard to hit.
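
As an aside -- and purely as an illustration, not something from the patch
-- spotting that situation from a monitoring connection can be as simple as
the query below (once the failsafe triggers on 14, such workers also stop
honouring the cost-based delay), assuming you can still get a connection at
all:

-- List running autovacuum workers and what they are working on;
-- anti-wraparound autovacuums show "(to prevent wraparound)" in the
-- reported query text.
SELECT pid, xact_start, query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';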

b) No index cleanup happens anymore. E.g. a workload with a lot of
bitmap index scans (which do not support killtuples) could end up a
lot worse because index pointers to dead tuples aren't being
cleaned up. In cases where an old transaction or leftover replication
slot is causing the problem (together, a significant percentage of
wraparound situations) this situation will persist across repeated
(explicit or automatic) vacuums for a table, because relfrozenxid
won't actually be advanced. And this in turn might actually end up
slowing resolution of the wraparound issue more than doing all the
index scans.

If it's intrinsically impossible to advance relfrozenxid, then surely
all bets are off. But even in this scenario it's very unlikely that we
wouldn't at least do index vacuuming for those index tuples that are
dead and safe to delete according to the OldestXmin cutoff. You still
have 1.6 billion XIDs before the failsafe first kicks in, regardless
of the issue of the OldestXmin/FreezeLimit being excessively far in
the past.

As I said above, I don't think the "1.6 billion XIDs" argument has
merit, because it's so reasonable (and common) to set
autovacuum_freeze_max_age to something much larger.

You're also not acknowledging the benefit of avoiding uselessly
scanning the indexes again and again, which is mostly what would be
happening in this scenario. Maybe VACUUM shouldn't spin like this at
all, but that's not a new problem.

I explicitly said that there are benefits to skipping index scans?

Greetings,

Andres Freund

#128Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#127)
Re: New IndexAM API controlling index vacuum strategies

On Wed, Apr 14, 2021 at 8:38 PM Andres Freund <andres@anarazel.de> wrote:

The reason I didn't do further reviews for things in this thread was
that I was trying really hard to get the shared memory stats patch into
a committable shape - there were just not enough hours in the day. I
think it's to be expected that, during the final CF, there aren't a lot
of resources for reviewing patches that are substantially new. Why
should these new patches have gotten priority over a much older patch
set that also addresses significant operational issues?

We're all doing our best.

It's very common for larger / busier databases to *substantially*
increase autovacuum_freeze_max_age, so there won't be 1.6 billion XIDs
of headroom, but a few hundred million. The cost of doing unnecessary
anti-wraparound vacuums is just too great. And databases on the busier &
larger side of things are precisely the ones that are more likely to hit
wraparound issues (otherwise you're just not that likely to burn through
that many xids).

I think that this was once true, but is now much less common, mostly
due to the freeze map stuff in 9.6. And due to a general recognition that
the *risk* of increasing them is just too great (a risk that we can
hope was diminished by the failsafe, incidentally). As an example of
this, Christophe Pettus had a Damascene conversion when it came to
increasing autovacuum_freeze_max_age aggressively, which he explains
here:

https://thebuild.com/blog/2019/02/08/do-not-change-autovacuum-age-settings/

In short, he went from regularly advising clients to increase
autovacuum_freeze_max_age to specifically advising them to never
touch it.

Even if we assume that I'm 100% wrong about autovacuum_freeze_max_age,
it's still true that the vacuum_failsafe_age GUC is interpreted with
reference to autovacuum_freeze_max_age -- it will always be
interpreted as at least 105% of whatever the current value of
autovacuum_freeze_max_age happens to be (so it's symmetric with the
freeze_table_age GUC and its 95% behavior). So it can never be
completely unreasonable in the sense of directly clashing with an
existing autovacuum_freeze_max_age setting from before the upgrade.
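
To make that concrete, here is one way to see the effective threshold on a
running server -- a minimal illustrative query (not part of the patch),
assuming the "at least 105% of autovacuum_freeze_max_age" clamp described
above and PostgreSQL 14+, where both GUCs exist:

-- Effective relfrozenxid age at which the failsafe kicks in, given the
-- current settings.
SELECT greatest(current_setting('vacuum_failsafe_age')::bigint,
                (current_setting('autovacuum_freeze_max_age')::numeric * 1.05)::bigint)
       AS effective_failsafe_age;

With the defaults (vacuum_failsafe_age = 1.6 billion,
autovacuum_freeze_max_age = 200 million) this returns 1.6 billion, which is
the headroom figure discussed above.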

Of course this doesn't mean that there couldn't possibly be any
problems with the new mechanism clashing with
autovacuum_freeze_max_age in some unforeseen way. But, the worst that
can happen is that a user that is sophisticated enough to very
aggressively increase autovacuum_freeze_max_age upgrades to Postgres
14, and then finds that index vacuuming is sometimes skipped. Which
they'll see lots of annoying and scary messages about if they ever
look in the logs. I think that that's an acceptable price to pay to
protect the majority of less sophisticated users.

And my concern isn't really that vacuum would have finished without a
problem if cost limiting hadn't been disabled, but that having multiple
autovacuum workers going all out will cause problems. Like the system
slowing down so much that it's hard to fix the actual root cause of the
wraparound - I've seen systems with a bunch of unthrottled autovacuum
workers overwhelm the IO subsystem so much that simply opening a connection to
fix the issue took 10+ minutes. Especially on systems with provisioned
IO (i.e. just about all cloud storage) that's not too hard to hit.

I don't think that it's reasonable to expect an intervention like this
to perfectly eliminate all risk, while at the same time never
introducing any new theoretical risks. (Especially while also being
simple and obviously correct.)

If it's intrinsically impossible to advance relfrozenxid, then surely
all bets are off. But even in this scenario it's very unlikely that we
wouldn't at least do index vacuuming for those index tuples that are
dead and safe to delete according to the OldestXmin cutoff. You still
have 1.6 billion XIDs before the failsafe first kicks in, regardless
of the issue of the OldestXmin/FreezeLimit being excessively far in
the past.

As I said above, I don't think the "1.6 billion XIDs" argument has
merit, because it's so reasonable (and common) to set
autovacuum_freeze_max_age to something much larger.

No merit? Really? Not even a teeny, tiny, microscopic little bit of
merit? You're sure?

As I said, we handle the case where autovacuum_freeze_max_age is set
to something larger than vacuum_failsafe_age in a straightforward and
pretty sensible way. I am curious, though: what
autovacuum_freeze_max_age setting is "much higher" than 1.6 billion,
but somehow also not extremely ill-advised and dangerous? What number
is that, precisely? Apparently this is common, but I must confess that
it's the first I've heard about it.

--
Peter Geoghegan

#129Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#128)
Re: New IndexAM API controlling index vacuum strategies

Hi,

On 2021-04-14 21:30:29 -0700, Peter Geoghegan wrote:

I think that this was once true, but is now much less common, mostly
due to the freeze map stuff in 9.6. And due to a general recognition that
the *risk* of increasing them is just too great (a risk that we can
hope was diminished by the failsafe, incidentally). As an example of
this, Christophe Pettus had a Damascene conversion when it came to
increasing autovacuum_freeze_max_age aggressively, which he explains
here:

https://thebuild.com/blog/2019/02/08/do-not-change-autovacuum-age-settings/

Not at all convinced. The issue of needing to modify a lot of
all-visible pages again to freeze them is big enough that it remains a
problem even after the freeze map. Yes, there are workloads where it's
much less of a problem, but not all the time.

As I said, we handle the case where autovacuum_freeze_max_age is set
to something larger than vacuum_failsafe_age in a straightforward and
pretty sensible way. I am curious, though: what
autovacuum_freeze_max_age setting is "much higher" than 1.6 billion,
but somehow also not extremely ill-advised and dangerous? What number
is that, precisely? Apparently this is common, but I must confess that
it's the first I've heard about it.

I didn't intend to say that the autovacuum_freeze_max_age would be set
much higher than 1.6 billion, just that the headroom would be much
less. I've set it, and seen it set, to 1.5-1.8 billion without problems,
while reducing overhead substantially.

Greetings,

Andres Freund

#130Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#129)
Re: New IndexAM API controlling index vacuum strategies

On Thu, Apr 15, 2021 at 5:12 PM Andres Freund <andres@anarazel.de> wrote:

https://thebuild.com/blog/2019/02/08/do-not-change-autovacuum-age-settings/

Not at all convinced. The issue of needing to modify a lot of
all-visible pages again to freeze them is big enough that it remains a
problem even after the freeze map. Yes, there are workloads where it's
much less of a problem, but not all the time.

Not convinced of what? I only claimed that it was much less common.
Many users live in fear of the extreme worst case of the database no
longer being able to accept writes. That is a very powerful fear.

As I said, we handle the case where autovacuum_freeze_max_age is set
to something larger than vacuum_failsafe_age in a straightforward and
pretty sensible way. I am curious, though: what
autovacuum_freeze_max_age setting is "much higher" than 1.6 billion,
but somehow also not extremely ill-advised and dangerous? What number
is that, precisely? Apparently this is common, but I must confess that
it's the first I've heard about it.

I didn't intend to say that the autovacuum_freeze_max_age would be set
much higher than 1.6 billion, just that the headroom would be much
less. I've set it, and seen it set, to 1.5-1.8 billion without problems,
while reducing overhead substantially.

Okay, that makes way more sense. (Though I still think that an
autovacuum_freeze_max_age beyond 1 billion is highly dubious.)

Let's say you set autovacuum_freeze_max_age to 1.8 billion (and you
really know what you're doing). This puts you in a pretty select group
of Postgres users -- the kind of select group that might be expected
to pay very close attention to the compatibility section of the
release notes. In any case it makes the failsafe kick in when
relfrozenxid age is 1.89 billion. Is that so bad? In fact, isn't this
feature actually pretty great for this select cohort of Postgres users
that live dangerously? Now it's far safer to live on the edge (perhaps
with some additional tuning that ought to be easy for this elite group
of users).
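
For illustration only (not something from the patch), the arithmetic here
is just 1.05 * 1,800,000,000 = 1,890,000,000, and a query along the lines
of the one below shows how close each table is to that cutoff:

-- Assumes the scenario above: autovacuum_freeze_max_age = 1.8 billion, so
-- the failsafe triggers once age(relfrozenxid) exceeds ~1.89 billion.
SELECT relname,
       age(relfrozenxid) AS xid_age,
       round(100.0 * age(relfrozenxid) / 1890000000, 1) AS pct_of_failsafe_cutoff
FROM pg_class
WHERE relkind IN ('r', 'm', 't')
ORDER BY age(relfrozenxid) DESC
LIMIT 10;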

--
Peter Geoghegan